Monday, August 14, 2023

Today’s the anniversary!

August 14 is the 20th anniversary of the 2003 Northeast blackout. This event profoundly shaped the world we live in today in many ways; one of them was mandatory regulation of the power industry. Here is my summary of what happened:

· The Wikipedia article states, “The blackout's proximate cause was a software bug in the alarm system at the control room of FirstEnergy.” Frankly, this is a stupid statement. It’s kind of like saying that the cause of World War II was the German invasion of Poland. There’s a big difference between a triggering event and a cause. This bug was a triggering event.

· The causes were multiple. An important one was the fact that compliance with the NERC reliability standards at the time was voluntary. FirstEnergy, as well as probably other utilities, had violated the NERC requirements for trimming trees under high-voltage transmission lines. High levels of load cause lines to sag, and load levels were high on that hot day, so the lines sagged into treetops, which caused them to be tripped by their protective relays. When one line shorted out, its load was automatically distributed to other lines, which themselves began to sag, encountered trees, and tripped. And so on.

· Finally, when the last major line into the area tripped, northern Ohio became a virtual black hole for electric power, trying to suck in as much power as possible from all its neighbors. In turn, they tried to get all the power they could from their neighbors – and voilà, the cascading outage began. Within six minutes after the last line failed, the event was over. Much of the Northeastern US (excluding New England), Detroit and eastern Michigan, and most of Ontario up to Hudson Bay had blacked out. 508 generating units at 265 power plants shut down during those six minutes.

· But lack of tree trimming wasn’t the only cause. What made the blackout much worse than it had to be was that the protective relays protecting many generating units from grid instability were (in retrospect) set to trigger at a much lower level of instability than necessary. NERC’s monumental six-volume study of the blackout (which I skimmed through once, not that I could have understood most of it anyway) pointed out that (again in retrospect) many of those units would not have had to shut down if the settings had been more forgiving.

· The resulting total blackout required activation of “blackstart plans” in many areas. Since most generating units require some amount of external power to start up, the blackstart plans lay out a complicated path for utilizing the small amount of power still available (e.g., power from hydroelectric plants or from generating units with diesel-powered emergency generators) to energize, one by one, lines, then more generating units, then more lines, and so on – until the generation in the area covered by the blackstart plan is fully operational.

· However, all of this could have been avoided if there had been better visibility into what was going on in Ohio. The software bug in the FirstEnergy control center caused the alarms – which would otherwise have been flashing deep red – to be suppressed. The operators in the control center saw nothing but green screens and were happy as clams…until they weren’t. But even that might not have mattered, had a technician at what was called at the time the Midwest Independent Transmission System Operator (now the Midcontinent ISO) not left for lunch after fixing a problem with the “state estimator” system.

· That system would have detected the growing divergence between electricity load and supply in northern Ohio (exacerbated by a couple of major plant outages in Cleveland) had it been active. However, the technician forgot to turn the system back on when he left. When it finally was turned on, it immediately became clear that northern Ohio was going to hell in a handbasket very quickly.

· On the other hand, it’s not clear that the human beings monitoring these systems would have reacted correctly if they had known how bad things were. To end the imbalance between load and supply, they probably would have had to black out the entire city of Cleveland and the surrounding area. Would they have been authorized to do this, and if not, would they have been able to get that authorization quickly enough to do any good, given how rapidly the imbalance between load and supply was increasing? Of course, Cleveland and many other cities ended up being blacked out anyway, so it would clearly have been better if the system operators had taken this step.
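The load-redistribution cascade described above can be sketched with a toy model. To be clear, this is purely illustrative and not a power-flow calculation – real flows follow Kirchhoff’s laws rather than splitting evenly among lines – and every capacity and load figure below is invented:

```python
# Toy illustration of a cascading line outage. NOT a power-flow model:
# real grids obey Kirchhoff's laws, not an even split. All line
# capacities and loads below are invented for illustration only.

def cascade(capacities, total_load, first_trip):
    """Return the indices of lines that trip, in order.

    capacities: per-line thermal limit (MW) for each parallel line
    total_load: total flow (MW) shared equally by surviving lines
    first_trip: the line that fails first (e.g. by sagging into a tree)
    """
    alive = set(range(len(capacities)))
    alive.discard(first_trip)
    tripped = [first_trip]
    while alive:
        share = total_load / len(alive)            # survivors split the load
        overloaded = [i for i in sorted(alive) if share > capacities[i]]
        if not overloaded:
            break                                  # the system restabilizes
        tripped.extend(overloaded)                 # overloads trip...
        alive -= set(overloaded)                   # ...and the cascade repeats
    return tripped

# With headroom, losing one line is survivable: 1,000 MW shared by the
# four surviving 300 MW lines is only 250 MW each.
print(cascade([300] * 5, 1000, 0))                  # -> [0]

# With weaker lines, the first trip overloads the weakest survivors,
# and their trips overload the rest: the whole corridor goes down.
print(cascade([300, 280, 260, 240, 220], 1000, 0))  # -> [0, 3, 4, 1, 2]
```

In this toy model the “capacity” numbers play roughly the role that relay settings played in the real event: whether the cascade stops or keeps spreading depends on how much stress each element tolerates before it disconnects itself.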

Of course, all these causes (and others besides) were correctable and have since been addressed with policies, standards, etc. But that required something else: mandatory standards for utilities to follow. That’s what Section 215 of the Federal Power Act, added by the Energy Policy Act of 2005 (EPAct 2005), provided. It required the Federal Energy Regulatory Commission (FERC) to enforce mandatory reliability standards for the electric power industry.

To facilitate this, Section 215 ordered FERC to engage an “electric reliability organization” (ERO) to draft mandatory reliability standards as well as audit compliance with them, all under FERC’s oversight. Of course, there wasn’t much question that NERC was the only organization that could possibly be the ERO, and FERC duly chose it.

Section 215 also ordered that mandatory cybersecurity standards be developed for the power sector. At that time, the main NERC cybersecurity effort was a set of voluntary requirements called Urgent Action 1200, which applied just to large control centers. NERC was working on an update called Urgent Action 1300, but in 2006 NERC rolled that effort into development of version 1 of the CIP standards.

So the CIP standards, which were (I believe) the first cybersecurity standards that applied to Operational Technology (OT), outside of perhaps military standards, were very much a consequence of the Northeast blackout. While I have certainly expressed my disagreement with many aspects of CIP over the years (you might go back and read the 400-plus posts I’ve written on CIP, starting with my first post on this blog in early 2013), and while I continue to believe that compliance with the CIP standards is much more expensive than it should be, there’s no question in my mind that the standards have done a great job of securing the North American power grid against cyberattacks.

Moreover, I think all the NERC reliability standards (which cover all aspects of operating the power grid, including – thanks to the blackout – relay settings) have proven very important, not just the CIP standards. While local outages happen all the time (with squirrels being one of the main causes), there has not been a cascading outage since 2003, or even just a large outage that wasn’t caused by a natural disaster like a hurricane. The dedicated people of NERC (and the six NERC Regional Entities that explain and audit the standards) deserve all our thanks.

Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.
