Sunday, October 4, 2020

When will a ransomware attack impact the Bulk Electric System? 2018

You probably saw the news story last week about the massive ransomware attack on Universal Health Services, a large chain of 400 hospitals. About 250 of those hospitals lost partial or complete use of their computer and phone systems. While the official announcement said that no patient data or services were disrupted, Bleeping Computer reported, based on an examination of an online employee bulletin board, that at least four patients died because lab results arrived too late for staff to take the actions needed to save them (the results had to be delivered by hand, not electronically).

Moreover, I was in an online meeting with a number of healthcare cybersecurity people last Thursday when they started discussing this. They all agreed that four deaths is probably an underestimate of the total due to this attack, given how many hospitals were involved and the many ways in which loss of computer or phone systems could lead to a death, even if it isn’t the immediate cause. For example, the patient who was turned away two weeks ago from a hospital in Germany that had been crippled by a ransomware attack didn’t die directly from the attack; she died of whatever her illness was. However, her death could have been avoided had the hospital been able to receive her, since she died on the way to the next-nearest hospital. In fact, these people said that, statistically speaking, there must already have been a number of patient deaths due to cyberattacks on hospitals, such as the worldwide WannaCry attack of 2017, which had a devastating impact on the UK’s National Health Service yet was officially reported not to have led to any deaths. It was only because the hospital in Germany directly attributed the death there to the attack that the unfortunate lady became the first person officially reported to have died from a cyberattack anywhere in the world.

So it won’t surprise you if I say that ransomware is almost without a doubt the number one cyber threat worldwide, including in the US. But since this post is about the Bulk Electric System, let me ask you: Have you ever heard of a ransomware attack affecting the BES? Like most of us in the industry (and as I myself would have said until a year ago), you would probably point out that utilities might have been compromised on the IT side of the house, but it’s virtually impossible that a ransomware attack could affect the BES, at least at a Medium or High impact BES asset. After all, there’s no email within Electronic Security Perimeters, and it’s almost impossible for ransomware to spread into an ESP through an Electronic Access Point (which has lots of protections); moreover, since all interactive remote access needs to go through an Intermediate System, that wouldn’t be a likely attack vector either.[i]

But what would you say if I told you that a few years ago, there was a huge ransomware attack on a major electric utility that did in fact impact the BES, although it didn’t lead to an outage? And what would you think of me if I told you that the ransomware had a huge BES effect on two High impact Control Centers, yet the utility was correct when they asserted that the ransomware didn’t actually penetrate those Control Centers? Would you want to have me locked up (you might want to see that for other reasons, but please confine yourself to the example at hand)? Would you think I was describing something like quantum tunneling in physics, where the laws of quantum mechanics allow a particle (or wave, same thing) to pass through a barrier and, in effect, to be on both sides of it at the same time?

No, I’m not (excessively) crazy when I say this. Just listen to my story:

In 2018, a major utility reported publicly that they had been the victim of a large malware attack (they didn’t use the term ransomware, but it wasn’t as fashionable then to be a ransomware victim as it is nowadays) that had affected a large number of systems on their IT network. However, they swore up and down that there had been no operational impact. They issued a statement saying “There is no impact on grid reliability or employee or public safety. The systems used to operate…(our)…transmission and distribution systems are on separate networks, and are not impacted by this issue.”

I read that and thought “Well, it’s certainly honorable that they went the extra mile and reported this incident, since they really didn’t have to. I’m glad the BES wasn’t impacted.” I imagine most of you thought the same thing.

However, just about a year ago a friend of mine told me an interesting story. At the time of this incident, he worked for a smaller utility that was in the control area of the utility that was attacked. Of course, his utility was always in contact with the large one, since they were constantly exchanging data, both electronically and verbally.

He said that on the day this attack happened, they were called by people in the main Control Center of that utility and told that all communications with them would need to be by cell phone for the time being, since all of the systems in the Control Center, as well as in the backup Control Center, were down. The VoIP system was down as well, hence the cell phones. And indeed, it wasn’t until the next day that all systems seemed to finally be restored.

Fortunately, an event where a Control Center is totally down, or loses all connectivity, is something that utilities rehearse for all the time. There don’t seem to have been any serious operational issues caused by the fact that the Control Center operated by cell phone for 24 hours. So why do I call this a BES incident?

Because, as anybody familiar with NERC CIP compliance knows, the Guidance and Technical Basis section of CIP-002-5.1a (the version currently in effect) lists nine “BES Reliability Operating Services”, affectionately known as BROS. “Impact on the BES”, in the CIP-002 sense, means a loss or compromise of one of the BROS. If this impact is caused by a cyber incident, it needs to be reported to the E-ISAC (and probably to DoE on Form OE-417) as a BES cyber incident.

One of the nine BROS is “monitoring and control”. Of course, this is what Control Centers do, and these CCs lost the ability to fulfill this BROS during the outage; ergo this was a BES cybersecurity incident. You might argue that the utility still had control of the BES, since they continued to be able to – and did – call their partner utilities to issue instructions. But they had definitely lost the capability to monitor the grid in real time in their control area. 
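The test described in the last two paragraphs can be sketched in a few lines of Python. This is only an illustration of the argument in this post, not an official NERC procedure: the `Event` class, its field names and the `is_bes_cyber_incident` function are made up for the example, and the BROS names are abbreviated from the Guidance and Technical Basis section, so check the standard itself for the exact wording.

```python
from dataclasses import dataclass
from enum import Enum


class Bros(Enum):
    """The nine BES Reliability Operating Services (names abbreviated)."""
    DYNAMIC_RESPONSE = "Dynamic response to BES conditions"
    BALANCING_LOAD_AND_GENERATION = "Balancing load and generation"
    CONTROLLING_FREQUENCY = "Controlling frequency (real power)"
    CONTROLLING_VOLTAGE = "Controlling voltage (reactive power)"
    MANAGING_CONSTRAINTS = "Managing constraints"
    MONITORING_AND_CONTROL = "Monitoring and control"
    RESTORATION = "Restoration of BES"
    SITUATIONAL_AWARENESS = "Situational awareness"
    INTER_ENTITY_COORDINATION = "Inter-entity real-time coordination and communication"


@dataclass(frozen=True)
class Event:
    """Hypothetical record of an operational event (not a NERC data model)."""
    description: str
    bros_lost: frozenset   # the BROS that were lost or compromised
    cyber_cause: bool      # True if a cyber incident caused the loss


def is_bes_cyber_incident(event: Event) -> bool:
    """A BES cyber incident, in this post's sense: a cyber incident
    caused the loss or compromise of at least one BROS."""
    return event.cyber_cause and len(event.bros_lost) > 0


# The 2018 case: both Control Centers were down, so real-time monitoring was lost.
control_centers_down = Event(
    "Control Centers offline after ransomware on the IT network",
    frozenset({Bros.MONITORING_AND_CONTROL}),
    cyber_cause=True,
)
# Ransomware that stays on the IT side and costs no BROS.
it_only = Event("Ransomware confined to IT", frozenset(), cyber_cause=True)
# Loss of monitoring with a non-cyber cause (hypothetical example).
storm = Event(
    "Storm cuts telemetry links",
    frozenset({Bros.MONITORING_AND_CONTROL}),
    cyber_cause=False,
)

for e in (control_centers_down, it_only, storm):
    print(e.description, "->", is_bes_cyber_incident(e))
```

Note that the test says nothing about whether there was an outage. Losing the service is enough, which is the whole point of the 2018 story.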

A year later, a renewables operator in the West reported to DoE on Form OE-417 that they had lost connection with their wind and solar farms for brief periods during one day, due to what appeared to be a random cyberattack on their Cisco™ routers. In other words, they briefly lost the ability to “monitor and control” their remote assets. Unlike the 2018 incident, this was reported to DoE (and also the E-ISAC), so it was made public. Note that in both cases there was a loss of the ability to “monitor and control”, although in only one of those cases was this reported as the BES incident that it was.

At the time, most of us thought the 2019 incident was the first true cyberattack on the grid, yet it turns out that the first attack was really a year earlier. What was lost in 2018 was real-time monitoring of the grid within a multi-state area, not just monitoring of some wind or solar farms as in 2019 (also, wind and solar farms are usually quite happy to operate completely on their own, and being limited to phone communications with the control center wouldn't usually cause a problem).

You might wonder why, given that there was no grid event in the case of either attack, either of them should have been reported. The problem is that, had the right event come along in 2018 (e.g. some disturbance that cut off two important transmission lines), it might have overwhelmed the control center staff’s ability to control it – or even understand it – simply through talking on cell phones. Fortunately, that didn’t happen.

Why do I say the 2018 BES incident was caused by ransomware, when the ransomware probably never touched the OT (ESP) network? Here’s what my friend told me about the incident:

1. A user on the utility’s IT network clicked on a phishing email, and ransomware quickly spread throughout the IT network. The IT department decided there was no alternative but to wipe over 10,000 systems on that network, then re-image them and restore key systems from backups. They also had to require all of the thousands of employees in the entire company to log off their corporate (IT) computer accounts for 24 hours during the restore process. Furthermore, they had to run malware scans on thousands of end user computers belonging to employees and contractors during the 24 hours of downtime. Of course, this was a huge, expensive operation.

2. The primary and backup Control Centers didn’t appear to have been affected by the ransomware, but here’s the problem: IT realized that, if the ransomware had spread to just one system in the Control Center, that system alone might end up reinfecting the whole organization (both the IT and the OT (ESP) networks) once the IT network came up again.

3. So IT decided they had to wipe all of the Control Center systems and restore them as well, even though there was no indication that any of them had been compromised; not doing so was too big a risk to take.

4. They also decided they had to do this to the systems in the backup Control Center as well, since it’s likely that any ransomware in the primary CC would have been quickly replicated to the backup CC. Again, it would simply be taking too big a risk if they didn’t do this.
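The logic of these four steps can be sketched as a small scoping calculation. This is a hypothetical illustration, not the utility’s actual procedure: the `Site` class, the `ruled_clean` flag, the `sites_to_rebuild` function and all of the site names (including the `field_office`, which isn’t part of the story) are assumptions made for the example. The rule it encodes is the one IT applied: rebuild anything that is infected or can’t be proven clean, plus anything that such a site replicates to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Site:
    """Hypothetical model of a network or facility in the rebuild decision."""
    name: str
    confirmed_infected: bool = False
    ruled_clean: bool = False          # True only if infection can be ruled out
    replicates_to: List[str] = field(default_factory=list)


def sites_to_rebuild(sites: List[Site]) -> List[str]:
    """Return the names of sites that must be wiped and restored."""
    by_name = {s.name: s for s in sites}
    # Start with every site that is infected or cannot be proven clean.
    queue = [s.name for s in sites if s.confirmed_infected or not s.ruled_clean]
    rebuild = set()
    while queue:
        name = queue.pop()
        if name in rebuild:
            continue
        rebuild.add(name)
        # Anything this site replicates to inherits the suspicion.
        queue.extend(by_name[name].replicates_to)
    return sorted(rebuild)


sites = [
    Site("it_network", confirmed_infected=True),
    # No sign of infection, but nobody can prove it is clean.
    Site("primary_cc", replicates_to=["backup_cc"]),
    # Assumed clean on its own evidence, yet fed by the primary.
    Site("backup_cc", ruled_clean=True),
    # Hypothetical isolated site that can be ruled clean.
    Site("field_office", ruled_clean=True),
]

print(sites_to_rebuild(sites))
```

Even with the backup Control Center assumed clean on its own evidence, replication from the primary pulls it into scope. That is how two Control Centers that were never infected both ended up offline.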

And that, Dear Reader, is why the Control Center staff had to run the grid by cell phone for many hours that day. Their employer was technically correct in the first sentence of their statement: “There is no impact on grid reliability or employee or public safety.” However, the second sentence (“The systems used to operate…(our)…transmission and distribution systems are on separate networks, and are not impacted by this issue.”) is definitely not true.

Unless, of course, you think that being shut down for 12-24 hours is the same as not being “impacted”. The fact is, this utility is damn lucky there wasn’t a big outage due to this incident. And being lucky isn’t one of the BES Reliability Operating Services, at least the last time I checked.

Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.


[i] There is still the possibility that ransomware could spread into the ESP by means of a computer that was conducting machine-to-machine remote access into the ESP, since until three days ago that was specifically exempted from Requirement Parts CIP-005-5 R2.1 – R2.3. However, on October 1 CIP-005-6 R2.4 and R2.5 went into effect, which offer at least some protection against compromise through machine-to-machine remote access.
