The Hysteresis of Hysteria — The Gradients of Terror in a System Crisis
TL;DR
Project crises follow non-linear pathways. Anecdotally, system crises, such as an outage of unknown cause or origin, follow a non-linear hysteresis curve. During these crises, teams are riding the “Gradients of Terror” and experiencing the “Hysteresis of Hysteria”.
Experiencing a Project Crisis
If you’ve been a project manager for any reasonable amount of time, you will have experienced a major crisis or emergency on your watch. I’d say the definition of a “seasoned” PM is someone who has gone through at least one “Extinction Level Event” project crisis.
Sometimes it happens during project execution, say during system testing.
For example, your test configuration is working fine, and then “bam!” something changes, usually after a new version is installed or when you attempt to “rebuild” the test platform. The system flips over to being wholly unworkable or unusable. Backing out recent changes doesn’t work, and no one seems to know the root cause of the problem.
Or it happens in a production environment, perhaps after a new release. Or perhaps there is some gradually emerging problem, such as a minor memory leak or degrading component. There may be some subtle early-warning signs in system metrics or alarms, but no one has the time or experience to investigate or recognise the developing problem.
But then “boom!”
The system flips over into instability and alarm storms rage. Customers start complaining, executives start calling, and soon enough, it makes its way into the media.
Brand damage becomes real.
Is there a pattern in the noise?
Although no two crises ever feel like they are running the same course (and I’ve been through a few), it seems to me they all follow a generic pattern.
A slow start (visible or otherwise), then a rapid flip from stability to instability, followed by a mad scramble to get on top of it. After a while, things start to turn around, followed by a rapid flip back to stability (after you’ve found and fixed the root cause). And then a slow and steady progression towards normality as you clean up and recover.
This rapid flip between states is typical of hysteresis functions, like the reversal of a magnet’s polarity under a changing field, and of plenty of other phenomena, even mood swings in someone with a bipolar condition.
The rapid change in a hysteresis curve is highly non-linear. In project environments, we know non-linear changes are hard to handle, so they qualify as a “Gradient of Terror”.
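For the more technically minded, here is a minimal Python sketch of the idea. The “stress” variable and the threshold values are invented purely for illustration; nothing here comes from a real system.

```python
# A minimal, purely illustrative sketch of hysteresis: the system flips to a
# new state at one threshold on the way up, but only flips back at a much
# lower threshold on the way down. The "stress" values and thresholds below
# are invented for the example; they don't come from any real system.

def next_state(stress: float, currently_stable: bool) -> bool:
    """Return True if the system is stable at this stress level."""
    TIP_OVER = 0.8   # a stable system tips over once stress exceeds this
    RECOVER = 0.3    # an unstable system only recovers well below that point
    if currently_stable:
        return stress < TIP_OVER
    return stress < RECOVER

# Ramp stress up and back down; note the asymmetry in where the flips happen.
stable = True
for stress in [0.1, 0.5, 0.7, 0.85, 0.9, 0.7, 0.5, 0.4, 0.25, 0.1]:
    stable = next_state(stress, stable)
    print(f"stress={stress:.2f} -> {'stable' if stable else 'UNSTABLE'}")
```

The same ramp of stress produces different states on the way up and on the way down, which is the asymmetry behind the rapid flip into trouble and the slow return to stability described above.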
What does the team experience during that project crisis cycle? Well, generally there is a lot of angst, stress, roller-coaster emotions and sometimes fear: “What if we can’t fix this?” — in other words, “hysteria”.
So, the team experiences the “Hysteresis of Hysteria”.
A Generic Plot for System Crises
I looked at a few of my own crises against the apparent hysteresis behaviour — it seemed to fit, so I started mapping key events and system states to the generic curve above. It dropped out nicely onto a two-dimensional chart. But what dimensions?
Let’s look at those first and then walk through the crisis cycle.
(Note: You can read a more detailed description of each step at the end of this yarn, after the section called “The Bottom Line”.)
The Dimensions
Below is the basic plot space for the analysis.
Management Intensity
The horizontal scale is “Management Intensity”, i.e. the strength of management focus and the weight of resources engaged in addressing the system’s state.
“Low Intensity” on the right is the steady state / business-as-usual (BAU). The number and seniority of people working on the system are relatively low (the lowest they can be).
“High Intensity” on the left represents a state in which a much larger and more senior group of people is engaged with the system. At its worst, multiple levels of management are involved, along with the most senior technical specialists and additional resources (such as vendors) pulled in to find and fix the problem.
System Coherence
The vertical dimension, System Coherence, measures how well the system conforms to its intended design and operation.
“High Coherence” at the top means that the solution is operating normally, as per design. There are few, if any, system problems, and it responds as expected to commands and configuration changes.
“Low Coherence” at the bottom indicates that the system is either completely down or operating in a degraded state.
A Generic Crisis Pathway
By my reckoning, the cycle from degradation through to restoration traces out a hysteresis curve as follows:
The high-level description of each step is below, and a small illustrative sketch of the loop follows the list. You can read more detail after “The Bottom Line”.
1. Steady State: the maximum level of coherence and the minimum level of management.
2. Emergence: The system has an underlying problem. There may be few or no symptoms. Local teams (e.g. admins and operators) are handling the issue.
3. Tipping Point: The local team has applied “standard” diagnoses and remediations. The system may respond, but unseen problems are accelerating. Senior team members have been called in or at least alerted.
4. The Drop: The instability grows faster than the team can observe and report. Problem reports are streaming in: support calls/emails/chat sessions are spiking. Social media reports reach executives. Management control escalates in multiple directions until the director or VP is hands-on.
5. Early Traction: If a war room was ever on the cards, it is operating by now. Teams are in full crisis mode and have abandoned all non-critical tasks. People are working long hours and looking ragged.
6. Low Point: Stale pizza and tired faces are everywhere, but the team has some fixes. The teams may be forecasting non-instant recovery tasks (e.g. developing software or standing up new infrastructure).
7. Encouraging Response: The collective team has now planned out the recovery process and started work. The system is responding to the recovery work. The management process is scaling down gracefully.
8. Problem / Solution Locked In: The intense focus begins to pay off. The level of management force and focus starts to seem a bit like overkill. Extreme measures are wound back.
9. Rapid Recovery: The trend to recovery is clear — the team sees significant progress in developing fixes and deploying them successfully.
10. Light at the end: Rapid continued improvement results in an almost complete de-focus on the problem. The local team (admins or ops) takes back ownership to finish the job.
11. All the way back (Almost?): The system should be back to a steady-state of maximum coherence.
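For those who prefer to see the shape rather than read about it, here is a small Python sketch that maps the eleven steps onto the two dimensions described above and prints a crude rendering of the loop. The coordinates are entirely invented for illustration; only the ordering of the steps and the general shape of the pathway come from the description above.

```python
# A rough sketch of the crisis pathway on the two dimensions described above.
# The coordinates are invented purely to illustrate the loop shape: both
# management intensity and system coherence are on a 0-1 scale, where
# intensity 0 = BAU staffing and 1 = full executive-level crisis response.

steps = [
    # (step name,                      management intensity, system coherence)
    ("1. Steady State",                0.05, 0.95),
    ("2. Emergence",                   0.10, 0.90),
    ("3. Tipping Point",               0.20, 0.80),
    ("4. The Drop",                    0.50, 0.35),
    ("5. Early Traction",              0.85, 0.20),
    ("6. Low Point",                   1.00, 0.15),
    ("7. Encouraging Response",        0.90, 0.35),
    ("8. Problem/Solution Locked In",  0.70, 0.60),
    ("9. Rapid Recovery",              0.45, 0.85),
    ("10. Light at the End",           0.20, 0.92),
    ("11. All the Way Back",           0.05, 0.95),
]

# Print a crude text rendering: each row is a step, with coherence shown as a
# bar so the drop-and-recover shape of the loop is visible at a glance.
for name, intensity, coherence in steps:
    bar = "#" * int(coherence * 30)
    print(f"{name:32s} intensity={intensity:.2f}  coherence |{bar}")
```

Run it and the coherence bars trace the fall into the crisis and the climb back out, while the intensity figures peak around the low point and then wind back towards BAU.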
The Bottom Line
All big system crises that I’ve observed have followed a similar pattern. They seem to start with a “sleepwalking” perspective on systems that are operating well. If a system is working well (or perceived to be working well — not the same), then an organisation will show some or all of these symptoms:
- Management tends to ignore things that are under control and spend its time on other problems.
- Given that the system is operating within narrow bounds, expertise appears high — there are few situations that require people to consult manuals or work instructions.
- Local “myths” develop about how the system operates. Like any myth, they have a core of truth but misinform.
- Expectations that the system “works well” and will continue in this way lead to a reduction in resource budgets.
- People’s skill levels in operating the solution decline as cheaper resources are brought in, without apparent impact. Efficiency dividends are achieved and claimed.
- Constant re-organisation and personnel loss hollow out system knowledge amongst more senior and/or long-term employees.
Once this stasis situation creeps in, any unexpected or discontinuous event can rapidly exceed the experience and knowledge of the BAU team.
What is your experience? Is it any different?
If you’re interested in more detail, check out the “blow-by-blow” below.
Steps in the Crisis: the blow-by-blow
For anyone interested, the full blurbs for each step are below.
1. Steady State
This point is at the maximum level of coherence and the minimum level of management. Exception-based monitoring is the norm: any built-in alarms are quiet, and pro-active checks are infrequent.
2. Emergence
At this point, the system becomes affected by an underlying problem, but there are few symptoms. An expert with years of experience might recognise the early warning signs, but they have been promoted or are working on “more important” projects.
Early warning signals are easily misdiagnosed or ignored by day-to-day staff, or automated systems (which may be outdated) do not detect the problems. Without management focus, the problem spreads its effects very slowly at first.
3. Tipping Point
At this point, “standard” management responses have been applied by those closest to the system, e.g. admins and operators. Superficially, the system may respond, but the problems are growing and the trend is about to accelerate invisibly.
The project manager probably won’t have the complete picture due to delays in reporting clear symptoms/root causes as the situation emerges.
Alarms and escalations at this point are becoming more visible outside the core teams and immediate management.
More experienced project managers may recognise the symptoms of an emerging crisis and suggest more aggressive responses. But there is often resistance to formalising the response and a reluctance to disturb the current task load in order to service crisis-response actions.
But things are about to get worse.
4. The Drop
At this point, the problem’s impacts and the system instability are growing much faster than the team’s ability to observe and report the status. Impacts are experienced across a broader community of users, and problem reports are streaming in via non-technical channels: calls, emails and chat sessions begin to swamp the capacity of the support channels.
Social media channels start to light up with complaints, often not about the system issues themselves but about the lack of responsiveness to problem reports, or the difficulty of getting through at all.
The problem is now visible to multiple management layers, and executives often hear about it from outside the company, not through internal escalations. The team handling the problem begins to get swamped by status queries from multiple points within the company.
The teams can’t set up status meetings quickly enough: by the time a meeting convenes, the problem has spread or evolved.
And so:
- the information in the discussion is out of date; and
- not all the necessary people are at the meeting.
Crisis meetings grow and become more frequent, and the problem’s symptoms spread. Daily status sessions become twice daily, then three times a day and more.
Talk of “war rooms” has begun, but whereas previously such measures were seen as a disruption, everyone is now just too frantic to object.
5. Early Traction
Teams are now in full crisis mode, and all pretence of maintaining normal activities has been dropped. People are working long hours and looking ragged.
Eventually, the system responds. You find root causes and apply fixes and bring the system under control. You can restore it to normal operating parameters.
And then the team has to restore itself to normal: time off, thank-yous and make-ups.
If a war room was ever going to be set up, it has been set up by now.
Whereas in previous stages, coordination of problem response was left to technical leads, by now, project managers have been brought in to do the heavy lifting of organising the flow of information, tracking issues and responses, and running crisis meetings.
The PM is also working with more senior managers to coordinate communications with outside parties. If the problem results in customer impacts and is large enough, coordination will include the executive level, corporate comms and potentially the legal office.
But, within the noise, patterns are emerging. Some teams start to see responses to their fixes and remediation work. The situation is not resolved, but there are green shoots.
Nobody has really started to think about recovery yet, just finding the root cause and arresting the slide.
6. Low Point
The problems have been identified, and fixes have been identified and are being applied or, in some cases, still developed. If the teams forecast long recovery periods (e.g. to put in new physical infrastructure), they get tied up in justification sessions trading off short-term (but ugly) fixes against longer-term solutions.
There is more work to do to recover, but the slide has been stopped. At this point, we find the maximum number of people engaged in the crisis management process. Crisis management has probably developed a bit of a rhythm.
Meetings are running relatively smoothly, and people are responding to actions quickly and reliably.
Just when the whole apparatus is operating smoothly, the need for it starts to fall away.
The decay has stopped, but we need to recover.
7. Encouraging Response
At this point, the recovery process is planned out and has started. The system is responding to the recovery work.
The management process is scaling down gracefully. Fewer people are required at meetings, and those meetings become less frequent.
All those involved are operating well due to the repeated cycles and the emerging success. People are getting sleep and becoming less fractious.
Management feels vindicated that the extra efforts have been effective and brought the situation back under control.
8. Problem / Solution Locked In
At this point, the focused and intense activity begins to restore balance.
There is a plan in place for the remediation activities, and that plan is under close management.
The level of management force and focus begins to seem a bit like overkill. Extreme measures are wound back. Senior management/executive focus has moved on to other problems, and any senior management reporting on the problem is rolled back into regular reporting cycles.
9. Rapid Recovery
At this point, the trend to recovery is clear — significant progress has been made — and the system is responding to the interventions.
If there were procurement or development activities needed to restore the system fully, all or nearly all have been completed, and the rest are seen as routine repetitions of solution steps already completed.
10. Light at the End
At this point, rapid continuous improvement results in an almost complete de-focus on the problem.
Pretty much everything is turned over to the local admin or operations team to finish the job, unless there are continuing technical changes needed to complete the restoration.
11. All the Way Back (Almost?)
At this point, the system should be back to a steady state of maximum coherence.
Did we get “all the way back”, or “almost all the way back”? Are there any lasting effects on the system, or is it actually in better shape than before?
For example, has the management system added new alarm infrastructure, updated procedures, or even modified the system to make it more resistant to whatever happened last time?
Or did it decide that was the last time and it was time for a new system to replace the old?
Originally published at https://adamonprojects.com. Edited and extended for this version.