The Hysteresis of Hysteria — The Gradients of Terror in a System Crisis

TL;DR

Experiencing a Project Crisis

Sometimes it happens during project execution, say during system testing.

For example, your test configuration is working fine, and then “bam!” something changes, usually after a new version is installed or when you attempt to “rebuild” the test platform. The system flips over to being wholly unworkable or unusable. Backing out recent changes doesn’t work, and no one seems to know the root cause of the problem.

Or it happens in a production environment, perhaps after a new release. Or perhaps there is some gradually emerging problem, such as a minor memory leak or degrading component. There may be some subtle early-warning signs in system metrics or alarms, but no one has the time or experience to investigate or recognise the developing problem.

But then “boom!”

The system flips over into instability and alarm storms rage. Customers start complaining, executives start calling, and soon enough, it makes its way into the media.

Brand damage becomes real.

Is there a pattern in the noise?

A slow start (visible or otherwise), then a rapid flip from stability to instability, followed by a mad scramble to get on top of it. After a while, things start to turn around, followed by a rapid flip back to stability (after you’ve found and fixed the root cause). And then a slow and steady progression towards normality as you clean up and recover.

This rapid flip between states, where the path into trouble is not the path back out, is typical of hysteresis: think of the change in polarity of a magnet, and plenty of other phenomena, even the mood swings of someone living with a bipolar condition.

A generic hysteresis curve

The rapid change in a hysteresis curve is highly non-linear. In project environments, we know non-linear behaviour is hard to handle, so it qualifies as a “Gradient of Terror”.
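
To make the idea concrete, here is a minimal sketch in Python of the simplest form of hysteresis: a state that flips to “unstable” only when a metric crosses a high threshold on the way up, and flips back to “stable” only when it falls below a lower threshold on the way down. The thresholds, the metric and the sample readings are assumptions made purely for illustration, not taken from any real system.

```python
# A minimal hysteresis sketch: the state flips at different thresholds
# depending on direction, so the path down differs from the path back up.
# Thresholds and the sample metric are illustrative only.

RAISE_AT = 90.0   # flip to "unstable" when load rises above this
CLEAR_AT = 60.0   # flip back to "stable" only when load falls below this

def next_state(current_state: str, load: float) -> str:
    """Return the new state given the previous state and the latest reading."""
    if current_state == "stable" and load > RAISE_AT:
        return "unstable"            # rapid flip: stability is lost abruptly
    if current_state == "unstable" and load < CLEAR_AT:
        return "stable"              # recovery must go well past the point of failure
    return current_state             # otherwise the system "remembers" its history

if __name__ == "__main__":
    state = "stable"
    # Load climbs slowly, spikes, then falls back through the same values.
    for load in [50, 70, 85, 95, 80, 70, 65, 55]:
        state = next_state(state, load)
        print(f"load={load:>3} -> {state}")
    # Note: at load 70 on the way up the system is still stable,
    # but at load 70 on the way down it is still unstable - that lag is hysteresis.
```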

What does the team experience during that project crisis cycle? Well, generally there is a lot of angst, stress, roller-coaster emotions and sometimes fear: “What if we can’t fix this?” — in other words, “hysteria”.

So, the team experiences the “Hysteresis of Hysteria”.

A Generic Plot for System Crises

Let’s look at the plot’s two dimensions first and then walk through the crisis cycle.

(Note: You can read a more detailed description of each step at the end of this yarn, after the section called “The Bottom Line”.)

The Dimensions

Two-dimensional context for the Hysteresis of Hysteria Crisis Plot

Management Intensity

“Low Intensity” on the right is the steady-state / BAU. The number and seniority of people working on the system are relatively low (the lowest they can be).

“High Intensity” on the left represents a state in which a much larger and much more senior group of people is engaged with the system. At its worst, multiple levels of management are involved, along with the most senior technical skills and additional resources (such as vendors) pulled in to find and fix the problem.

System Coherence

“High Coherence” at the top means that the solution is operating normally, as per design. There are few, if any, system problems, and it responds to commands and configuration changes.

“Low Coherence” at the bottom indicates that the system is either completely down or operating in a degraded state.

A Generic Crisis Pathway

The high-level description of each step is below, followed by a rough sketch of the loop the steps trace out. You can read more detail after “The Bottom Line”.

1. Steady State: the maximum level of coherence and the minimum level of management intensity.
2. Emergence: The system has an underlying problem. Maybe there are no symptoms yet. Local teams (e.g. admins and operators) are handling the issue.
3. Tipping Point: The local team has applied “standard” diagnoses and remediations. The system may respond, but unseen problems are accelerating. Senior team members have been called in or at least alerted.
4. The Drop: The instability grows faster than the team can observe and report. Problem reports are streaming in: support calls/emails/chat sessions are spiking. Social media reports reach executives. Management control escalates in multiple directions until the director or VP is hands-on.
5. Early Traction: If a war room was ever on the cards, it is operating by now. Teams are in full crisis mode and have abandoned all non-critical tasks. People are working long hours and looking ragged.
6. Low Point: Stale pizza and tired faces are everywhere, but the team has some fixes. The teams may also be forecasting recovery tasks that won’t be instant (e.g. developing software or standing up new infrastructure).
7. Encouraging Response: The collective team has now planned out the recovery process and started work. The system is responding to the recovery work. The management process is descaling gracefully.
8. Problem / Solution Locked In: The intense focus begins to pay off. The level of management force and focus starts to seem a bit like overkill. Extreme measures are wound back.
9. Rapid Recovery: The trend to recovery is clear — the team sees significant progress in developing fixes and deploying them successfully.
10. Light at the End: Rapid continued improvement results in an almost complete de-focus on the problem. The local team (admins or ops) takes the work back to finish the job.
11. All the Way Back (Almost?): The system should be back to a steady state of maximum coherence.
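
Plotted against the two dimensions above, these eleven steps trace a loop rather than a line: the path from stability into crisis differs from the path back out. Here is a rough plotting sketch in Python; the coordinates are entirely hypothetical, chosen only to suggest the shape of the loop, and the abbreviated step labels are mine.

```python
# A rough sketch of the crisis pathway as a loop in the two dimensions
# described above. The coordinates are hypothetical, chosen only to
# suggest the shape; they are not measurements of any real crisis.
import matplotlib.pyplot as plt

steps = [
    ("1. Steady State",         1.0, 1.00),
    ("2. Emergence",            1.5, 0.90),
    ("3. Tipping Point",        2.5, 0.70),
    ("4. The Drop",             6.0, 0.20),
    ("5. Early Traction",       8.5, 0.15),
    ("6. Low Point",            9.0, 0.10),
    ("7. Encouraging Response", 7.5, 0.30),
    ("8. Locked In",            5.5, 0.60),
    ("9. Rapid Recovery",       3.5, 0.80),
    ("10. Light at the End",    2.0, 0.90),
    ("11. All the Way Back",    1.0, 1.00),
]

intensity = [s[1] for s in steps]   # management intensity (low -> high)
coherence = [s[2] for s in steps]   # system coherence (low -> high)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(intensity, coherence, marker="o")
for label, x, y in steps:
    ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5), fontsize=8)

ax.invert_xaxis()   # put "Low Intensity" on the right, as in the article's plot
ax.set_xlabel("Management Intensity")
ax.set_ylabel("System Coherence")
ax.set_title("The Hysteresis of Hysteria: a generic crisis pathway")
plt.tight_layout()
plt.show()
```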

The Bottom Line

  • Management tends to ignore things that are under control and spends its time on other problems.
  • Given that the system is operating within narrow bounds, expertise appears high — there are few situations needing people to look at manuals or work instructions.
  • Local “myths” develop about how the system operates. Like any myth, they have a core of truth but misinform.
  • Expectations that the system “works well” and will continue to do so lead to a reduction in resource budgets.
  • People’s skill levels in operating the solution decline as cheaper resources are brought in, without apparent impact. Efficiency dividends are achieved and claimed.
  • Constant re-organisation and personnel loss hollow out system knowledge amongst more senior and/or long-term employees.

Once this stasis creeps in, any unexpected or discontinuous event can rapidly exceed the experience and knowledge of the BAU team.

What is your experience? Is it any different?

If you’re interested in more detail, check out the “blow-by-blow” below.

Steps in the Crisis: the blow-by-blow

1. Steady State

2. Emergence

Early warning signals are easily misdiagnosed or ignored by day-to-day staff, or automated systems (which may be outdated) do not detect the problems. Without management focus, the problem spreads its effects very slowly at first.
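
As one concrete illustration of the kind of early warning that slips through, here is a minimal sketch in Python of trend-based detection for a slowly degrading metric, such as the memory leak mentioned earlier. It fits a straight line to recent readings and flags the trend long before any absolute threshold is breached. The helper name, window, capacity and sample data are all assumptions for illustration only.

```python
# A minimal sketch of trend-based early warning for a slowly degrading
# metric (e.g. memory usage creeping up from a small leak). It fits a
# straight line to recent samples and projects when capacity runs out.
# Window size, capacity and the sample data are invented for illustration.
import numpy as np

def days_until_exhaustion(samples, capacity, min_slope=1e-6):
    """Fit a linear trend to daily samples; return projected days until
    the metric reaches capacity, or None if there is no upward trend."""
    days = np.arange(len(samples))
    slope, intercept = np.polyfit(days, samples, 1)   # least-squares line
    if slope <= min_slope:
        return None                                    # flat or improving
    current = slope * days[-1] + intercept
    return (capacity - current) / slope

if __name__ == "__main__":
    # Fourteen days of memory usage (GB) on a 64 GB host: a slow creep
    # that a simple "alert above 90%" rule would not notice for weeks.
    memory_gb = [41.0, 41.3, 41.2, 41.8, 42.1, 42.0, 42.6,
                 42.9, 43.1, 43.5, 43.4, 44.0, 44.3, 44.6]
    remaining = days_until_exhaustion(memory_gb, capacity=64.0)
    if remaining is not None and remaining < 90:
        print(f"Early warning: projected exhaustion in ~{remaining:.0f} days")
    else:
        print("No concerning trend detected")
```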

3. Tipping Point

The project manager probably won’t have the complete picture due to delays in reporting clear symptoms/root causes as the situation emerges.

Alarms and escalations at this point are becoming more visible outside the core teams and immediate management.

More experienced project managers may recognise the symptoms of an emerging crisis and suggest more aggressive responses. But there is often resistance to formalising the response and a reluctance to disturb the current task load in order to service crisis response actions.

But things are about to get worse.

4. The Drop

Social media channels start to light up with customer complaints, often not about the system issues themselves but about the lack of responsiveness to problem reports, or the difficulty of getting through at all.

The problem is now visible to multiple management layers, and executives often hear about it from outside the company, not through internal escalations. The team handling the problem begins to get swamped by status queries from multiple points within the company.

The teams can’t set up status meetings fast enough: by the time a meeting convenes, the problem has spread or evolved.

And so:

  1. the information in the discussion is out of date; and
  2. not all the necessary people are at the meeting.

Crisis meetings grow and become more frequent, and the problem’s symptoms keep spreading. Daily status sessions become twice daily, then three times a day and more.

Talk of “war rooms” has begun; whereas previously such measures were seen as a disruption, everyone is now just too frantic to object.

5. Early Traction

Eventually, the system responds. You find root causes, apply fixes, and bring the system under control, and you can start restoring it to normal operating parameters.

If a war room was ever going to be set up, it has been set up by now.

Whereas in previous stages coordination of the problem response was left to technical leads, by now project managers have been brought in to do the heavy lifting: organising the flow of information, tracking issues and responses, and running crisis meetings.

The PM is also working with more senior managers to coordinate communications with outside parties. If the problem has customer impacts and is large enough, coordination will include the executive level, corporate comms and potentially the legal office.

But, within the noise, patterns are emerging. Some teams start to see responses to their fixes and remediation work. The situation is not resolved, but there are green shoots.

Nobody has really started to think about recovery yet, just finding the root cause and arresting the slide.

6. Low Point

There is more work to do to recover, but the slide has been stopped. At this point, we find the maximum number of people engaged in the crisis management process. Crisis management has probably developed a bit of a rhythm.

Meetings are running relatively smoothly, and people are responding to actions quickly and reliably.

Just when the crisis machinery is operating smoothly, the need for it starts to shrink.

The decay has stopped, but we need to recover.

7. Encouraging Response

The management process is descaling gracefully. Fewer people are required at meetings, and those meetings become less frequent.

All those involved are operating well due to the repeated cycles and the emerging success. People are getting sleep and becoming less fractious.

Management feels vindicated that the extra efforts have been effective and have brought the situation back under control.

8. Problem / Solution Locked In

There is a plan in place for the remediation activities, and that plan is under close management.

The level of management force and focus begins to seem a bit like overkill. Extreme measures are wound back. Senior management/executive focus has moved on to other problems, and any senior management reporting on the problem is rolled back into regular reporting cycles.

9. Rapid Recovery

If there were procurement or development activities needed to restore the system fully, all or nearly all have been completed, and the rest are seen as routine repetitions of solution steps already completed.

10. Light at the End

Pretty much everything is turned over to the local admin or operations team to complete the job, unless there are continuing technical changes needed to complete the restoration.

11. All the Way Back (Almost?)

Did we get “all the way back”, or “almost all the way back”? Are there any lasting effects on the system, or is it actually in better shape than before?

For example, has management added new alarm infrastructure, updated procedures, or even modified the system to make it more resistant to whatever happened last time?

Or did it decide that was the last time, and that it was time for a new system to replace the old?

Originally published at https://adamonprojects.com. Edited and extended for this version.
