You probably saw the news
story
last week about the massive ransomware attack on Universal Health Services, a
large chain of 400 hospitals. About 250 of those hospitals lost partial or
complete use of their computer and phone systems. While the official announcement said that no patient data or services were disrupted, Bleeping Computer reported, based on its examination of an online employee bulletin board, that there were at least four deaths due to lab results arriving too late for staff to take the actions required to save the patients, since the results had to be delivered by hand rather than electronically.
Moreover, I was in an online meeting with a number of healthcare cybersecurity people last Thursday when they started discussing this. They all agreed that four deaths is probably an underestimate of the total due to this attack, given how many hospitals were involved and the many ways in which loss of computer or phone systems can lead to a death, even when it isn’t the immediate cause. For example, the patient who was turned away two weeks ago from a hospital in Germany, because the hospital had been crippled by a ransomware attack, didn’t die of the attack itself; she died of her illness. However, her death could have been avoided had that hospital been able to receive her, since she died on the way to the next-nearest hospital.
In fact, these people said that, statistically speaking, there must already have been a number of patient deaths due to cyberattacks on hospitals, such as the worldwide WannaCry attack of 2017, which had a devastating impact on the UK’s National Health Service yet was officially reported not to have led to any deaths. It was only because the German hospital directly attributed the death there to the attack that this unfortunate woman became the first person anywhere in the world whose death was officially ascribed to a cyberattack.
So it won’t surprise you if I say that ransomware is almost
without a doubt the number one cyber threat worldwide, including in the US. But
since this post is about the Bulk Electric System, let me ask you: Have you
ever heard of a ransomware attack affecting the BES? Like most of us in the
industry – and as I myself would have said until a year ago – you would probably point out that utilities might have been compromised on the IT side of the house, but that it’s virtually impossible for a ransomware attack to affect the BES, at least at a Medium or High impact BES asset. After all, there’s no email within Electronic Security Perimeters, and it’s almost impossible for ransomware to spread into an ESP through an Electronic Access Point, which permits only a narrowly restricted set of connections;
moreover, since all interactive remote access needs to go through an
Intermediate System, that wouldn’t be a likely attack vector either.[i]
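Here’s a minimal sketch, in Python, of the deny-by-default idea behind that architecture. The rule set, port numbers and function below are my own hypothetical illustration, not anything drawn from a real utility’s EAP configuration:

# Explicitly permitted inbound flows into the ESP; everything else is dropped.
ESP_INBOUND_RULES = {
    ("tcp", 102): "ICCP data exchange with control area partners",
    ("tcp", 20000): "DNP3 traffic from field RTUs",
}

def eap_permits(protocol: str, port: int) -> bool:
    """Deny by default: only explicitly listed flows may enter the ESP."""
    return (protocol, port) in ESP_INBOUND_RULES

# The usual ransomware vectors never make the list, so they die at the EAP:
assert not eap_permits("tcp", 25)    # SMTP: no email inside the ESP
assert not eap_permits("tcp", 445)   # SMB: a favorite ransomware channel

In other words, the EAP doesn’t try to recognize ransomware; it simply refuses to carry the kinds of traffic that ransomware typically rides in on.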
But what would you say if I told you that a few years ago, there
was a huge ransomware attack on a major electric utility that did in fact impact
the BES, although it didn’t lead to an outage? And what would you think of me
if I told you that the ransomware had a huge BES effect on two High impact Control
Centers, yet the utility was correct when they asserted that the ransomware
didn’t actually penetrate those Control Centers? Would you want to have me locked up (you might want that for other reasons, but please confine yourself to the example at hand)? Would you think I was describing something like quantum tunneling in physics, where the laws of quantum mechanics allow a particle (or wave, same thing) to penetrate a barrier – in fact, to be on both sides of the barrier at the same time?
No, I’m not (excessively) crazy when I say this. Just listen to
my story:
In 2018, a major utility reported publicly that they had been the
victim of a large malware attack (they didn’t use the term ransomware, but it
wasn’t as fashionable then to be a ransomware victim as it is nowadays) that
had affected a large number of systems on their IT network. However, they swore
up and down that there had been no operational impact. They issued a statement saying: “There is no impact on grid reliability or employee or public safety. The systems used to operate…(our)…transmission and distribution systems are on separate networks, and are not impacted by this issue.”
I read that and thought “Well, it’s certainly honorable that
they went the extra mile and reported this incident, since they really didn’t have
to. I’m glad the BES wasn’t impacted.” I imagine most of you thought the same
thing.
However, just about a year ago a friend of mine told me an
interesting story. At the time of this incident, he worked for a smaller
utility that was in the control area of the utility that was attacked. Of
course, his utility was always in contact with the large one, since they were
constantly exchanging data, both electronically and verbally.
He said that on the day this attack happened, they were
called by people in the main Control Center of that utility and told that all
communications with them would need to be by cell phone for the time being, since
all of the systems in the Control Center – as well as the backup Control Center
– were down; moreover, the VoIP system was down as well – hence the cell phones. And indeed, it wasn’t until the next day that all systems finally seemed to be restored.
Fortunately, an event where a Control Center is totally
down, or loses all connectivity, is something that utilities rehearse for all
the time. There don’t seem to have been any serious operational issues caused by the fact that the Control Center operated by cell phone for 24 hours. So why do I call this a BES incident?
Because, as anybody familiar with NERC CIP compliance knows,
the Guidance and Technical Basis section of CIP-002-5.1a (the version currently
in effect) lists nine “BES Reliability Operating Services”, affectionately known
as BROS. “Impact on the BES” – in the CIP-002 sense – means a loss or compromise of one of the BROS. If this impact is caused by a cyber incident, it needs to be reported to the E-ISAC (and probably to DoE on Form OE-417) as a BES cyber incident.
One of the nine BROS is “monitoring and control”. Of course,
this is what Control Centers do, and these CCs lost the ability to fulfill this
BROS during the outage; ergo this was a BES cybersecurity incident. You might argue that the utility still had control of the BES, since they continued
to be able to – and did – call their partner utilities to issue instructions. But
they had definitely lost the capability to monitor the grid in real time in their control
area.
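In fact, the logic of the last two paragraphs can be reduced to a simple test. Here’s a toy sketch: the list of BROS comes straight from the CIP-002-5.1a Guidance and Technical Basis section, but the function itself is just my own illustration, not anything NERC publishes:

# The nine BES Reliability Operating Services (BROS) listed in the Guidance
# and Technical Basis section of CIP-002-5.1a.
BROS = {
    "dynamic response to BES conditions",
    "balancing load and generation",
    "controlling frequency (real power)",
    "controlling voltage (reactive power)",
    "managing constraints",
    "monitoring and control",
    "restoration of BES",
    "situational awareness",
    "inter-entity real-time coordination and communication",
}

def is_bes_cyber_incident(services_lost, cyber_caused):
    """Toy test: a BES cyber incident is a cyber-caused loss or
    compromise of at least one BROS."""
    return cyber_caused and bool(set(services_lost) & BROS)

# The 2018 outage: "monitoring and control" was lost, and the cause was a
# cyber incident, so it should have been reported as a BES cyber incident.
assert is_bes_cyber_incident({"monitoring and control"}, cyber_caused=True)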
A year later, a renewables operator in the West reported
to DoE on Form OE-417 that they had lost connection with their wind or solar
farms for brief periods during one day, due to what appeared to be a random
cyberattack on their Cisco routers – in other words, they briefly lost the ability to “monitor and control” their remote assets. Unlike the 2018 incident, this one was reported to DoE (and also to the E-ISAC), so it was made public. Note that in both cases there was a loss of the ability to “monitor and control”, although in only one of the two cases was this reported as the BES incident that it was.
At the time, most of us thought this was the first true cyberattack on the grid, yet it turns out that the first attack really came a year earlier. What was lost in 2018 was real-time monitoring of the grid within a multi-state area, not just monitoring of some wind
or solar farms as in 2019 (also, wind and solar farms are usually quite happy to operate completely on their own, and being limited to phone communications with the control center wouldn't usually cause a problem).
You might wonder why, given that there was no grid event in the case of either attack, either of them should have been reported. The problem is that, had the wrong event come along in 2018 (e.g., some disturbance that cut off two important transmission lines), it might have overwhelmed the control center staff’s ability to control it – or even understand it – simply by talking
on cell phones. Fortunately, that didn’t happen.
Why do I say the 2018 BES incident was caused by ransomware, when the ransomware probably never touched the OT network? Here’s
what my friend told me about the incident:
1. A user on the utility’s IT network clicked on a phishing email, and ransomware quickly spread throughout the IT network. The IT department decided there was no alternative but to wipe over 10,000 systems on that network, re-image them, and restore key systems from backups (a rough sketch of what that entails follows this list). They also had to require the thousands of employees in the entire company to log off their corporate (IT) computer accounts for 24 hours during the restore process, and to run malware scans on thousands of employee and contractor end-user computers during those 24 hours of downtime. Of course, this was a huge, expensive operation.
2. The primary and backup Control Centers didn’t
appear to have been affected by the ransomware, but here’s the problem: IT
realized that, if the ransomware had spread to just one system in the Control Center,
that system alone might end up reinfecting the whole organization – both the IT and the OT (ESP) networks – once the IT network came up again.
3. So IT decided they had to wipe all of the Control Center systems and restore them as well, even though there was no indication that any of them had been compromised; not doing so was too big a risk to take.
4. They also decided they had to do the same to the systems in the backup Control Center, since it’s likely that any ransomware in the primary CC would have been quickly replicated to the backup CC. Again, it would simply have been too big a risk not to do this.
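To give a sense of the scale of the remediation described in step 1, here’s a rough sketch of that wipe-and-restore loop. The host names and the 'reimage-tool' and 'av-scan' commands are entirely hypothetical stand-ins; the utility would have driven this with its own imaging and endpoint-protection tooling, heavily parallelized:

import subprocess

# Hypothetical inventory: the real incident involved over 10,000 IT systems,
# plus every system in the primary and backup Control Centers.
HOSTS = [f"it-host-{n:05d}" for n in range(10_000)]

def remediate(host: str) -> None:
    """Wipe one endpoint, re-image it from a known-good image, then scan it.
    'reimage-tool' and 'av-scan' are stand-ins for whatever imaging and
    anti-malware tooling the utility actually used."""
    subprocess.run(["reimage-tool", "--wipe", "--image", "gold-master", host],
                   check=True)
    subprocess.run(["av-scan", "--full", host], check=True)

for host in HOSTS:
    remediate(host)  # in reality, staged and parallelized over roughly 24 hours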
And that, Dear Reader, is why the Control Center staff had
to run the grid by cell phone for many hours that day. Their employer was
technically correct in the first sentence of their statement: “There is no impact on grid reliability or employee or public safety.” However, the second sentence – “The systems used to operate…(our)…transmission and distribution systems are on separate networks, and are not impacted by this issue.” – is definitely not
true.
Unless, of course, you think that being shut down for 12-24 hours is the same as not being “impacted”. The fact is, this utility is damn
lucky there wasn’t a big outage due to this incident. And being lucky isn’t one
of the BES Reliability Operating Services, at least the last time I checked.
Any opinions expressed in this
blog post are strictly mine and are not necessarily shared by any of the
clients of Tom Alrich LLC. If you would like to comment on what you have read here, I would
love to hear from you. Please email me at tom@tomalrich.com.
[i] There is still the possibility that ransomware could spread into the ESP by
means of a computer that was conducting machine-to-machine remote access into
the ESP, since until three days ago that was specifically exempted from Requirement
Parts CIP-005-5 R2.1 – R2.3. However, on October 1 CIP-005-6 R2.4 and R2.5 went
into effect, which offer at least some protection against compromise through machine-to-machine
remote access.