Socio-technically based risk assessment and management
19 Sep 2014
Dealing with personal risks is often straightforward, even instinctive. If a fire breaks out you move away from it; if a car swerves towards your car, you take avoiding action or brake or both. But managing risk in large organisations can be far more complex; risks are not always obvious and can manifest themselves in many ways, some of them unexpected.
Know your risks
It is the job of the safety director or safety manager to oversee the assessment of risks in complex organisations. The audience for that assessment is in the first instance the CEO and the executive board of the organisation, who need to understand the risks being managed, and broadly how they are being managed. They can then take the decisions necessary to set up and operate an effective safety management system which copes with the uncertainties and complexities.
Risks vary according to the organisation’s function and processes. In a steel mill risk management may focus on avoidance of burns and a lot worse, eg falling into an open furnace. For a transportation system, risk management will often focus on collisions of any type. For a nuclear power plant, there are many risks but the primary one is the loss of containment resulting in release of radioactivity into the atmosphere. But all of these organisations also have many other types of risk, which all need managing, or they may catch you out with a fatality or serious injury. As Amalberti and Vincent say in their accompanying essay, the model of safety may differ between these industries. In what follows we address mainly their ’ultra-safe systems model’, but incorporate aspects of their ’high reliability organizations’ model. Risk assessment is also important in their ’ultra-resilient model’, but there it is likely to be done less explicitly than in the other two models and to be concentrated on novices absorbing knowledge about the risks of the occupation at the feet of experienced operators.
If an organisation is to manage its risks competently there must be a common understanding in the organisation of what risks it faces, and what actions and activities it can deploy to keep those risks under control. This may not be easy because different professions value and handle complex quantitative and qualitative information differently and therefore also interpret risk assessments differently. An organisation should have a comprehensive risk picture that encompasses all its safety risks, including how they can interact to make a bad situation worse. Risk assessments have to consider safety risks together with threats to the survival of the company. In this way the organisation can be prepared to face its daily challenges.
Hazards and risks
The general definition of a hazard is a situation that poses a threat to life, health, property or the environment. Most hazards in this general sense are considered ’dormant’, unless they become ’active’, in which case their threat is realised. Generic examples are kinetic energy from moving or flying objects, potential energy from working at heights where a fall of a person or object converts the potential energy into kinetic energy before impact, etc. The formal definition of hazard when carrying out risk assessment and management is more precise. A hazard is any biological, chemical, mechanical, environmental or physical agent that is reasonably likely to cause harm to humans or damage to the system or the organisation, in the absence of control. This means that the organisation must determine its hazards and establish controls for them.
The risk associated with the hazard is then the probability of loss of control of the hazard, combined with the resultant consequences. This definition of risk encompasses a wide range of possible consequences, ranging across injury, disease, theft, physical damage, production loss, poor product quality and safety, environmental pollution, business interruption, bankruptcy, cybercrime, etc.
A company needs to decide which of these risks it needs to manage and how. If the organisation is regulated for safety, then some or all of these risks must be managed by law in order to maintain an operating license. Ultimately, at the level of the Board and the CEO all risks have to be managed and trade-offs decided where necessary between them and their controls. However at the level of line and staff departments and advisers, the responsibility for ensuring that the organisation controls its different risks may be allocated to different parts of the organisation. In some companies there is, for example, a separate quality manager, health and safety manager and environmental manager, whilst in others all of these areas may fall under one manager or director. The CEO and Board must decide how this allocation of responsibility will be made for their organisation. This can be based on the similarities and differences in the causes or controls for different types of risks (process and workplace), or according to organisational levels. Ultimately these different ’risk managers’ can only be advisers to and monitors of the line management and Board, where the ultimate responsibility lies.
How safe do you need to be?
Risk is first and foremost concerned with an unwanted event, usually considered as an ’accident’, eg an air crash, a nuclear meltdown, a factory fire, a ship sinking, a train derailment, a man falling to his death from a ladder or a gantry, or being struck by a vehicle. Many industries have formal risk targets imposed and monitored by a regulatory authority. For example, the target for a nuclear power plant core meltdown is a probability of less than one in a million per year of reactor operation. This sounds comforting, until a quick analysis based on, say, three hundred reactors worldwide and fifty operating years each (some 15,000 reactor-years) shows that at this rate we should have seen no meltdowns, whereas in fact there have been three (Three Mile Island, Chernobyl and Fukushima). For commercial aviation (carrying passengers and freight), air crashes generally occur at a rate of one in ten million flights. This also sounds good, but because we fly a lot the number of fatal crashes annually worldwide is typically in double figures.
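The quick analysis of the nuclear target can be written out as a short calculation. This is only a sketch: the figures are the round numbers quoted in the text, and treating the target as a constant rate per reactor-year with a Poisson-distributed number of events is an illustrative assumption, not a regulatory method.

```python
import math

# Round figures quoted in the text. Treating the target as a constant
# rate per reactor-year is an assumption made for illustration.
TARGET_RATE_PER_REACTOR_YEAR = 1e-6
REACTORS = 300
OPERATING_YEARS = 50


def expected_events(rate_per_unit, units_of_exposure):
    """Expected number of events = rate x exposure."""
    return rate_per_unit * units_of_exposure


def prob_at_least(k, mean):
    """P(N >= k) for a Poisson-distributed count N with the given mean."""
    p_fewer = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return 1.0 - p_fewer


reactor_years = REACTORS * OPERATING_YEARS
expected_meltdowns = expected_events(TARGET_RATE_PER_REACTOR_YEAR, reactor_years)
p_three_or_more = prob_at_least(3, expected_meltdowns)

print(f"Exposure: {reactor_years} reactor-years")
print(f"Expected meltdowns at the target rate: {expected_meltdowns:.3f}")
print(f"Chance of three or more at that rate: {p_three_or_more:.1e}")
```

Under these assumptions the expected number of meltdowns is far below one, so three observed events suggest that the real-world rate has been well above the target.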
The consequences part of the risk concept used to concern just the number of fatalities, eg one person killed, several people killed, up to hundreds killed. If an organisation kills many people it is unlikely to survive. However, today this has become more complex. Damage to the reputation of a company may be a consequence for an organisation, eg where there is loss of confidence in the company resulting in loss of revenue and eventual bankruptcy. This element of risk is of serious concern for many organisations, especially given the rising influence of social media networks such as Facebook and Twitter.
How do you know your ’level’ of risk?
There are formal methods for calculating safety risk. They provide answers to the following questions that the CEO of an organisation should be asking:
• What are my hazards?
• What can go wrong with the controls in place?
• How likely is it to go wrong?
• Can I see these risks all together, qualitatively or quantitatively?
• What is my overall risk, and what are my top risks?
• How sure am I of the answers?
• Are we meeting the safety target?
• What can we do to reduce risks?
Below is an example of one approach, called a fault tree (Figure 1). This particular extract comes from the field of aviation, looking at the risk of a mid-air collision; such events are very rare, thanks to multiple independent safety systems in the air and on the ground, in terms of both automation and pilots and controllers. The probability (Q) figures are derived from experience or databases, feeding up from the base events or failures (the circular icons), through either ’and-gates’ (icons with straight undersides) or ’or-gates’ (icons with concave undersides) to the top event with its thus-calculated probability.
Figure 1: Example of a fault tree
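The way probabilities feed up through the gates of a fault tree can be sketched in a few lines. The tree, the event names and the Q values below are hypothetical and are not taken from Figure 1. The calculation also assumes that the base events are independent, which a real analysis would have to justify.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    q: float  # probability of the base event or failure


@dataclass
class Gate:
    name: str
    kind: str  # "and" or "or"
    inputs: List[Union["Gate", BasicEvent]]


def probability(node):
    """Feed probabilities up the tree, assuming independent inputs."""
    if isinstance(node, BasicEvent):
        return node.q
    qs = [probability(child) for child in node.inputs]
    if node.kind == "and":
        # All inputs must occur.
        return prod(qs)
    if node.kind == "or":
        # At least one input occurs.
        return 1.0 - prod(1.0 - q for q in qs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree and Q values, for illustration only.
ground_fails = Gate("ground-based separation fails", "or", [
    BasicEvent("controller misses conflict", 1e-2),
    BasicEvent("conflict alert fails", 1e-3),
])
airborne_fails = Gate("airborne avoidance fails", "or", [
    BasicEvent("on-board warning system fails", 1e-3),
    BasicEvent("pilots do not respond", 1e-2),
])
top_event = Gate("mid-air collision", "and", [
    BasicEvent("aircraft on conflicting courses", 1e-3),
    ground_fails,
    airborne_fails,
])

top_q = probability(top_event)
print(f"Top event probability: {top_q:.3e}")
```

The and-gate at the top is what makes the top event so rare in this sketch: every layer has to fail at the same time.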
Tools such as fault trees help with the difficult part of ’determining what controls are or should be in place and how they might fail’ and of ’putting the risks together’ so that an overall risk picture can be gained, and total risk calculated. Typically the events in the fault trees lead up to the eventual loss of control of the hazard, and another ’tree’, called an ’Event Tree’, determines the range of consequences that are likely to occur. A generic illustration of this, called the ’bow tie diagram’, is given below (Figure 2), with ’business upset’ as the central event, which in safety terms usually equates to loss of control of a hazard.
Figure 2: Example of a bow tie diagram
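The right-hand side of the bow tie, the event tree, can be sketched in the same spirit. The barrier names, failure probabilities and initiating frequency below are hypothetical. The sketch enumerates every combination of mitigating barriers working or failing, which is a simplification: real event trees are usually ordered in time and pruned where a branch makes later barriers irrelevant.

```python
from itertools import product


def event_tree(initiating_frequency, barriers):
    """Return the frequency of every outcome after loss of control.

    barriers is a list of (name, probability_of_failure) pairs. Each
    outcome is keyed by a tuple of booleans saying whether each barrier
    worked. Barriers are assumed to be independent.
    """
    outcomes = {}
    for states in product((True, False), repeat=len(barriers)):
        frequency = initiating_frequency
        for works, (_name, p_fail) in zip(states, barriers):
            frequency *= (1.0 - p_fail) if works else p_fail
        outcomes[states] = frequency
    return outcomes


# Hypothetical figures: loss of control once per hundred years,
# followed by two mitigating barriers.
barriers = [("alarm raised", 0.1), ("sprinklers operate", 0.05)]
outcomes = event_tree(1e-2, barriers)

worst_case = outcomes[(False, False)]  # both mitigating barriers fail
for states, frequency in outcomes.items():
    label = ", ".join(
        f"{name}: {'works' if works else 'fails'}"
        for works, (name, _p) in zip(states, barriers)
    )
    print(f"{label} -> {frequency:.2e} per year")
```

The outcome frequencies always add up to the frequency of the central event, which is a useful consistency check on any event tree.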
Manage your risk controls
The output of risk assessments consists of a description of risk controls which will have to be implemented to prevent and mitigate the unwanted consequences of the risk scenarios. Risk control mostly happens via barriers that keep the hazard under control or mitigate its consequences once control is lost. Barriers may be physical (machinery guards, edge protection on roofs, chemical bunds around storage tanks, ear defenders, safety goggles, pressure relief valves, sprinkler systems, fire extinguishers, etc.) or behavioural (skilled fire fighters, skill with a boning knife in an abattoir, keeping away from moving machinery, evacuation before an encroaching fire, etc.), or a combination of both (competent drivers of vehicles, activating a fire alarm, diagnosing an equipment failure and taking remedial action, etc.). Preventive barriers stop the scenario before the loss of control; mitigating barriers intervene afterwards to lessen the seriousness of the consequences. Risk management entails keeping the barriers on either side of the loss of control event effective, and looking for ways they could fail. In many activities there needs to be ’defence in depth’, with many barriers controlling a given serious risk, so that, if one fails there are others still working.
The concept of barriers is crucial also for one of the most influential safety models of the past three decades, the so-called ’Swiss Cheese’ model promoted by James Reason, as illustrated in Figure 3 below. The barriers are likened to blocks of Swiss cheese which have holes in them, meaning there are gaps which have appeared in the system’s defences; the barriers are not working effectively. If the holes ’line up’, meaning that a number of defects or deviations come together in an unpredicted way, an accident occurs. The successive layers may be technical barriers, eg in aviation there is a system aboard most aircraft to detect another aircraft on an intercept course; or they can be organisational, eg the training and selection processes that deliver a safe and competent train driver. The model has been particularly useful in demonstrating the systemic nature of accidents.
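A rough numerical sketch shows why defence in depth works, and why the holes lining up is more likely than a naive calculation suggests. All numbers are hypothetical. The ’common condition’ stands for any systemic factor, such as staff cutbacks, that weakens several barriers at once; modelling it as a single on/off condition is an assumption made to keep the example small.

```python
from math import prod


def p_all_fail(hole_probs):
    """Chance that every barrier fails, if failures are independent."""
    return prod(hole_probs)


def p_all_fail_common_cause(normal, degraded, p_common):
    """Chance that every barrier fails when a shared condition, present
    with probability p_common, raises each barrier's failure probability
    from its normal value to its degraded value."""
    return (1.0 - p_common) * prod(normal) + p_common * prod(degraded)


# Hypothetical figures for three barriers.
normal = [0.01, 0.01, 0.01]
degraded = [0.2, 0.2, 0.2]

independent = p_all_fail(normal)
with_common = p_all_fail_common_cause(normal, degraded, p_common=0.05)

print(f"Independent barriers:       {independent:.2e}")
print(f"With a common weakness:     {with_common:.2e}")
print(f"Ratio: {with_common / independent:.0f}x")
```

With these figures the shared weakness makes an accident several hundred times more likely than the independent calculation suggests, which is one way of reading the systemic nature of accidents.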
Successful management of risk controls, once they have been decided on, consists of managing the life cycle of those technical and behavioural elements making up the risk control; making sure that they are developed, fit for purpose, installed, used and maintained. In the descriptions below we start from the point where the required risk control and its hardware, software and behavioural elements have already been specified to the best of the ability of the organisation. However even the best risk analysis can never anticipate and control all future risks. We are not prescient or inventive enough to guarantee that. Hence management of the risk controls must include monitoring and improvement to respond to learning opportunities when new risks or new ways for the risk controls to fail are discovered.
- Purchase/construct. Is a suitable high quality hardware/software risk control available on the market, or can it better be fabricated in-company? In the former case the procurement function needs to specify the requirements and find suitable suppliers meeting the design specification and dependability requirements. In the latter case the construction function does that work and may need to call in adequate expertise on dependability.
- Install/commission. This work may be done by a company department or sub-contracted to a supplier. It requires competence, coordination and monitoring.
- Use. This is the link to the management of the behavioural elements of the risk control.
- Inspect. The functioning of the hardware requires monitoring either continuously, or at planned intervals.
- Maintain. If inspection shows deterioration of its functionality the hardware needs maintaining.
- Monitor/modify/improve. If the hardware fails to live up to its planned performance, or fails to control all risk scenarios experienced it may need modifying or replacing.
- Specify procedures. The behaviour in using the hardware/software controls, and the additional required behaviour forming part of the risk controls themselves need to be captured in procedures which can be communicated to the users. If they are procedures for using technology, the technology and the procedures need to be designed together to maximise their usability.
- Select/train: manage competence. Suitably qualified people need to be recruited and trained in the procedures until they test as competent. For unexpected risk scenarios, which are not able to be captured in procedures, a more general competence to improvise may need to be trained.
- Provide manpower & communication. Being competent is not enough to ensure that the right behaviour is shown. The organisation must plan its manpower to be available when that behaviour is needed and must make sure that different individuals communicate and collaborate when the risk control depends on more than one person working in unison, such as at shift handovers or where risk depends on control room staff and field operators (maintenance engineers, pilots, train drivers, etc.) collaborating effectively.
- Motivate. The organisation must also go beyond just ensuring competence. It must motivate people to choose the correct risk control behaviour over conflicting behaviour aimed at production, quality, individual effort/comfort, etc., which may seem more attractive at the time.
- Maintain. As with hardware, behaviour may degrade or deviate over time, requiring inspection and feedback (behavioural safety), refresher training and behavioural campaigns.
- Modify/improve. If the behaviour fails to live up to its planned performance, or fails to control all risk scenarios experienced it may need modifying or replacing.
The detailed management of the performance of the risk controls which emerge from risk assessment has been dealt with above. The overall safety management system of an organisation needs to provide and coordinate those management processes. At a higher level of abstraction safety management consists of organising two interlocking management cycles (Figure 3):
• The first is the operational cycle (red arrows) that carries out the risk assessment, decides on risk controls, implements and monitors them and feeds back to modify the risk assessment and control.
• The second is the policy cycle (blue arrows) which sets the strategy, provides resources and allocates responsibilities to run that risk assessment and control cycle, monitors how it works and proposes and manages changes to achieve continuous improvement.
Figure 3: Two cycles for organizing safety
The second cycle needs to be driven in detail by the CEO and the Board of Directors because CEO buy-in into safety is key to risk-related decision-making in the organisation. Moreover, risk and safety are always about trade-offs which need to be endorsed by the CEO and for which the CEO has to be held accountable. The first cycle is more the realm of the safety manager, and it feeds and supports board decisions. The setting up and monitoring of that cycle can be delegated to the safety manager, who must be competent to collaborate with the line and staff in its implementation. The Board and CEO then monitor that first cycle and ensure it keeps on taking place. The CEO and Board also need to assess and review the competence of the safety manager to fulfil this role.
Board level management of safety
The main role of the CEO with respect to safety is in asking questions, for example:
• What are our top safety risk concerns?
• Are they increasing/decreasing/stable?
• What do the latest quarterly safety trend analyses show?
• Are we as safe, or safer, than our competitors?
These questions send the message that safety is important to the other board members. This does not need to be ’heavy-handed’, but should simply reflect a genuine concern and understanding that safety is key to business health. However, this does need to be authentic; otherwise it is known as ’lip service’, which people see as being insincere.
The CEO should also appoint a director who is the safety ’champion’ (the safety manager will report directly to this person). This Safety Director may also be the Director of Quality, and/or Environment, and/or Security. There remains debate about whether such a director should have only safety as his or her responsibility: sole responsibility allows clear focus and fewer conflicts, whereas joint responsibilities may enable better integration of safety into business models and decisions. The CEO should be able to challenge the Safety Director (eg over facts, figures and assumptions) to avoid other directors feeling that the Safety Director can ’play the safety card’ all the time. Similarly, if the Safety Director is going to raise something significant at board level, the CEO should be able to expect that the Safety Director has ’done his/her homework’, both in ensuring that there is enough evidence and/or concern to warrant raising the issue, and in talking with other relevant Directors, so that they are not surprised at the meeting. The second aspect means that the CEO should pick someone with a degree of ’political acumen’. As some CEOs have put it, they do not want a ’safety clerk’ at board level. When safety issues are raised at board level and determined to be important, the CEO should galvanise the board to explore them and put in place actions (not only on the Safety Director, but engaging other directors as appropriate), and demand progress on those actions at each successive Board meeting until the potential safety ’threat’ has been reduced or eliminated.
The CEO should also speak to line and staff managers about safety, and what is being done about safety. This can be done directly, or indirectly, whether by email, or more commonly by ’Blog’, or video, etc. These CEO-staff communications, aside from representing good safety culture, also affect the other board members. Other directors will be more likely to ’follow suit’ and talk more openly with their subordinate managers and staff about safety. In fact this will be to some degree expected, as otherwise it looks odd to staff that the organisation’s leader appears to care about safety but the directors do not. Put simply, this requires leadership by example.
One crucial prerequisite for board action on safety issues is adequate information. Beside the informal communication channels just mentioned, which are very important as the people ’at the sharp end’ often have an acute insight into safety, there are two main sources of safety information. The first will be events that have happened – incidents or accidents. For these there will be formal procedures and probably associated reporting requirements to a regulatory body, as well as analysis mechanisms looking for causes and contributory factors. Such systems offer relatively reliable indicators (called ’lagging indicators’, since they relate to the past and are ’forensic’ in nature) on how safety is doing. Secondly, there will be sources of safety information from safety cases, safety audits and surveys (including safety culture surveys), evaluations of Safety Management Systems, and steps to achieve safety certification levels associated with systems or projects. Such information can be seen as indicating present and future performance, and so indicators derived from such examinations are often called ’leading indicators’.
Safety Dashboards are an integration of key safety metrics that can be used by the CEO and Board to gain a quick overview of the safety health of their organisation and its operations. They contain both lagging and leading indications (ie data from past events, and information on current and future performance) of safety from metrics relevant to the organisation’s operational and regulatory context. Sometimes they also contain indications of business volume or operational ’level’ (whether it is up or down) since such degrees of ’output’ are often (but not always) correlated with safety risk (eg in air traffic, if the amount of traffic rises, generally so do the safety risks). The dashboard should contain what needs to be there, and not just what is easy to measure and has ready-made statistics. Similarly it is important not to get into a pure ’target-chasing’ approach by focusing solely on the numbers in an effort to increase or decrease certain statistics – this is because such an approach may lead to suppression of true statistics or reducing risks in one area by ’exporting’ it to another. The Dashboard is a tool to be used in understanding and improving safety, and only represents the ’top tier’ of safety information, much of which will be qualitative rather than quantitative.
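One way to picture a single dashboard entry is as a record that holds a lagging indicator, a leading indicator and the business volume side by side. The sketch below is only illustrative: the field names, the choice of indicators and the figures are all hypothetical, and a real dashboard would carry many more metrics, much of it qualitative.

```python
from dataclasses import dataclass


@dataclass
class QuarterSnapshot:
    quarter: str
    incidents: int            # lagging indicator: events that have happened
    open_audit_findings: int  # leading indicator: findings not yet closed
    movements: int            # business volume, eg flights handled

    def incident_rate(self, per=100_000):
        """Incidents per `per` movements, so that quarters with
        different traffic levels can be compared."""
        return self.incidents * per / self.movements


# Hypothetical figures for two successive quarters.
history = [
    QuarterSnapshot("Q1", incidents=12, open_audit_findings=8, movements=400_000),
    QuarterSnapshot("Q2", incidents=15, open_audit_findings=9, movements=500_000),
]

for snap in history:
    print(
        f"{snap.quarter}: {snap.incidents} incidents, "
        f"{snap.incident_rate():.1f} per 100,000 movements, "
        f"{snap.open_audit_findings} open audit findings"
    )
```

In this made-up example the raw incident count rises from one quarter to the next while the rate per movement stays flat, which is why the text recommends showing business volume next to the safety figures.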
One great worry: ’Drift into danger’
Drift into danger is like the proverbial frog who fails to jump out of a pan full of water that is brought to the boil slowly. A clear example of drift into danger is when, under economic pressure, resources for safety are slowly eroded, including operational safety personnel (through staff cutbacks or non-replaced staff losses) and equipment, as well as resources to carry out safety work. The problem is that, initially, nothing goes wrong, and all seems well. People adjust so that the new less-resourced system becomes the norm. Staff begin to adapt procedures and do things differently – they get the job done, but there are perhaps fewer safeguards. And then one day an accident happens, and it probably also escalates because there is less equipment and fewer trained staff to deal with it.
It is not easy to detect that drift into danger is happening. It requires that the Board and Safety Manager look differently at safety indicators, with a longer term view, alert to trends, an attitude of ’creative mistrust’, questioning in particular when all the indicators appear to be indicating that all is well. If there are good safety indicators these may detect it, although this may not happen if the system has ’normalised’ around its new parameters of operation. Older operational staff are more likely to perceive the drift than younger and newer staff, and independent and external observers may also see the risk more clearly as they are ’outsiders’. Safety culture surveys can often pick up signals from comments or during workshops, and observational surveys or audits may similarly raise questions about whether safety margins have eroded or not.
Andrews, J.D. & Moss, T.R. (2002). Reliability and risk assessment (2nd ed.). London: MechE Professional Engineering Publishing.
Haddon, W. (1973). Energy damage and 10 countermeasure strategies. Human Factors, 15, 355-366.
Hale, A.R., Ale, B.J.M., Goossens, L.H.J., Heijer, T., Bellamy, L.J., Mud, M.L., Roelen, A., Baksteen, H., Post, J., Papazoglou, I.A., Bloemhoff, A. & Oh, J.I.H. (2007). Modeling accidents for prioritizing prevention. Reliability Engineering & System Safety, 92, 1701-1715.
Hale, A.R., Goossens, L.H.J., Ale, B.J.M., Bellamy, L.A., Post, J., Oh, J.I.H. & Papazoglou, I.A. (2004). Managing safety barriers and controls at the workplace. In C. Spitzer, U. Schmocker & V.N. Dang (eds.), Probabilistic Safety Assessment & Management (pp. 608-613). Berlin: Springer.
Health and Safety Executive (2011). Five steps to risk assessment. http://www.hse.gov.uk/pubns/indg163.pdf
Maguire, R. (2006). Safety cases and safety reports. Aldershot, UK: Ashgate.
Reason, J.T. (1997). Managing the risks of organisational accidents. Aldershot, UK: Ashgate.
Safety Intelligence for CEOs – A White Paper. http://www.eurocontrol.int/sites/default/files/content/documents/nm/safety/safety_intelligence_white_paper_2013.pdf
Safe Work Australia (2012). Guide for major hazard facilities - safety assessment. http://www.safeworkaustralia.gov.au/sites/SWA/about/Publications/Documents/669/Safety%20Assessment.pdf