I was surprised to read this story from ZDNet this morning, describing yet another devastating cyberattack on a critical infrastructure organization, this time an electric utility. Of course, even though the utility, Delta-Montrose Electric Association (DMEA) in Colorado, never used the word “ransomware” in their announcement of the attack, everyone interviewed for the article seemed to think it was a ransomware attack.
But even if it wasn't, what I'm most interested in is the fact that, by the utility's own reckoning, 90% of its internal systems (which I interpret as "IT network") were down. Yet the utility says their electric operations weren't affected at all. This simply shows that the utility followed a cardinal principle for critical infrastructure: complete separation of the IT and OT networks, so there is no direct logical path by which an infected IT system might infect the OT network.
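To make "no direct logical path" a little more concrete: in practice it usually comes down to a firewall (or several) between the two networks whose rule base permits no traffic from IT address space into OT address space. Below is a minimal sketch, in Python, of the kind of automated check an organization might run against an exported copy of its firewall policy to confirm that this remains true. The subnet ranges, the CSV export and its column names are hypothetical placeholders, not any particular vendor's format.

```python
# Minimal sketch: flag any firewall rule that would permit traffic from an
# IT subnet into an OT subnet. The subnets and the exported rule format are
# hypothetical; adapt them to however your firewall exports its policy.
import csv
import ipaddress

IT_SUBNETS = [ipaddress.ip_network("10.1.0.0/16")]    # assumed IT address space
OT_SUBNETS = [ipaddress.ip_network("10.200.0.0/16")]  # assumed OT address space

def overlaps(cidr: str, nets) -> bool:
    """True if the rule's source/destination CIDR overlaps any of the subnets."""
    rule_net = ipaddress.ip_network(cidr, strict=False)
    return any(rule_net.overlaps(n) for n in nets)

# Hypothetical export with columns: action, src, dst (src/dst as CIDR strings)
with open("firewall_rules.csv", newline="") as f:
    for rule in csv.DictReader(f):
        if rule["action"].lower() != "permit":
            continue
        if overlaps(rule["src"], IT_SUBNETS) and overlaps(rule["dst"], OT_SUBNETS):
            print(f"IT-to-OT path allowed by rule: {rule}")
```

A check like this only covers the paths you know about, of course; it says nothing about a dual-homed laptop or a forgotten remote access connection. That gap is exactly why, as the rest of this post shows, organizations can still hesitate to trust their separation when an attack actually happens.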
I know of two other devastating ransomware attacks on critical infrastructure in the US. In both of them, the IT network was pretty much destroyed by the ransomware, while the OT network wasn't touched by it. Yet in both cases, the OT network had to be completely shut down in the wake of the attack, along with the IT network. How could that happen, assuming the networks really were separated?
The more recent of these two attacks was, of course, Colonial Pipeline. In that attack, after most or all of the IT network was brought down, the OT network (and therefore the pipelines themselves) was also brought down. Colonial said at the time that they did this out of the usual "abundance of caution". However, a WaPo editorial pointed out that, with Colonial's billing system down (it lived on the IT network, a normal practice even in critical infrastructure), Colonial couldn't invoice for gas deliveries.
Even more importantly (and this was a fact I learned from my longtime friend Unknown, who commented on one of my posts on Colonial), since Colonial is a common carrier and doesn't own the gas it delivers, they would literally have been on the hook for the entire cost of all the gas they delivered but didn't invoice, if they'd continued to run their pipeline network. So the OT network had to come down as well, even though it wasn't directly impacted by the ransomware.
The earlier attack came in 2018, when a very large US electric utility was hit with devastating ransomware. As with Colonial and DMEA, the IT network was completely down, but the OT network hadn't been directly affected by the ransomware. The IT department decided they had to wipe over 10,000 systems on the IT network and rebuild them from backups. According to two independent sources, the original plan was to leave the two grid control centers (part of the OT network) running during the approximately 24 hours it would take to do this.
However, the utility then decided that if they left the OT network running, they would run the risk that even a single system in the control centers might already have been infected and could re-infect the IT network as soon as the latter was brought back up, meaning they'd have to repeat the entire process of wiping and rebuilding. So they concluded they had no choice but to wipe and rebuild the control center systems (about 2,000), including their VoIP phone system.
The result was that for 24 hours, the power grid in a multi-state area was run by operators using cell phones. It was pure luck that a serious incident didn't occur during this time, because power system events usually happen too quickly for humans to react properly; the event would probably have been over long before the control center staff could have diagnosed the problem and worked out a solution over the phone.[i]
In both of these previous attacks, the OT network was logically separated from the IT network, but from a larger point of view it wasn't. In Colonial's case, the problem was that a system on the IT network, the billing system, had to be up in order for operations to continue. How could the OT shutdown have been prevented? Clearly, the billing systems should either have been on the OT network, or (since that might itself have caused problems), they should have been on a network segment of their own. The ransomware would never have reached them, and they would presumably have continued to operate after the attack. Thus, operations wouldn't have had to be shut down.
And what could the utility in the second attack have done differently, to avoid having to shut down their control centers? It seems to me that the root cause of that shutdown was that the utility didn't trust its own controls, the very controls put in place to prevent the kind of cross-network traffic they were worried might get through.
In case you're wondering, these were high impact Control Centers under NERC CIP-002 R1. The CIP requirements that govern separation of networks, CIP-005 R1 and R2, should have been sufficient to prevent the spread of malware between the networks. However, it's also very easy to violate those requirements, and my guess is that somebody in IT didn't want to take the chance that someone had slipped up in maintaining the required controls, no matter how unlikely that was.
I guess the moral of this second story is that, if you're ever in doubt about whether your IT and OT networks are really separated, you should take further steps to remove those doubts. That way, if an incident like this happens to your organization (utility, pipeline, oil refinery, etc.), you'll be able to leave the OT network running without suffering a nervous breakdown.
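What might those further steps look like? One simple, admittedly crude option is to test periodically, from a host on the IT network, whether anything in the OT address space will even accept a connection. Here is a minimal sketch of that idea in Python; the addresses are hypothetical placeholders, the ports are common ICS protocol ports (Modbus, DNP3, EtherNet/IP), and in a real environment you would only run something like this with the knowledge and consent of the OT operations staff.

```python
# Minimal sketch: from a host on the IT network, try to open TCP connections
# to a sample of OT addresses and ports. Any successful connection is evidence
# that the two networks are not actually separated. The addresses below are
# hypothetical placeholders; the ports are well-known ICS protocol ports.
import socket

OT_HOSTS = ["10.200.1.10", "10.200.1.11"]  # assumed sample of OT addresses
PORTS = [502, 20000, 44818]                # Modbus, DNP3, EtherNet/IP

for host in OT_HOSTS:
    for port in PORTS:
        try:
            with socket.create_connection((host, port), timeout=2):
                print(f"REACHABLE from IT: {host}:{port} - separation is in doubt")
        except OSError:
            pass  # no path (or nothing listening) - this is what you want to see
```

A clean result doesn't prove separation, but a single successful connection proves the lack of it, which is exactly the kind of doubt you want to surface before an attacker does.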
Thus, the fact that DMEA was able to continue delivering electric power to their customers (although with delayed billing) shows they not only had the required separation between their IT and OT networks, but they also:

a) didn't have any system dependencies linking their IT and OT networks, as Colonial did, and

b) had enough controls in place that they didn't doubt the two networks were really logically separated.
Good for them!
Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC. Nor are they shared by CISA’s Software Component Transparency Initiative, for which I volunteer as co-leader of the Energy SBOM Proof of Concept. If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com.
[i] This incident was reported in the press at the time, but the utility's announcement said that operations weren't "affected". They weren't affected in the narrow sense that there was no outage. However, the loss of visibility in their control centers was itself an "effect" on the grid, and the utility should have reported it to DOE as such.
Kevin Perry, retired Chief CIP Auditor of the NERC SPP Regional Entity, had this comment:
Something utilities should do is verify they can readily and easily disconnect their OT networks from their corporate (IT) networks and continue to operate. If they cannot, they need to reevaluate their BES Cyber Systems (BCS) determinations and make some network connectivity changes.
Here's my slight expansion of what Kevin said: Any system whose loss could affect the operation of the Bulk Electric System is a BCS. This means - at least at high and medium impact BES assets - there should never be a case in which the loss of an IT system would require bringing down the OT network. That IT system should be declared a BCS and moved into the ESP (i.e. the OT network).
Kevin wants to add this comment:
To clarify, Tom's expansion is not exactly what I intended. The intention of my comment was to determine exactly what must be in place and operating in order to maintain operations over a sustained period of time and not focus solely on the 15-minute or less BCS threshold. If you disconnect your BES systems from the rest of the company networks as a result of a cyber incident, will you remove access to a Cyber Asset/system that is needed to maintain operations over a potentially lengthy outage? If that is the case, you need to reconsider (1) should it be a BCS or at least inside the OT network and (2) what changes do you need to make to your BES operations network to ensure the availability of the subject system.
Another reader, Dale, commented:
You wrote, "Clearly, the billing systems should either have been on the OT network, or (since that might itself have caused problems), they should have been on a network segment of their own."
Not necessarily. They should have an RTO approved at the appropriate level and be confident they can meet the RTO whether it is on IT, its own segment on IT, or in OT.
The knee-jerk reaction is to ask what we can do so this never happens again. It's not bad to look at ways to reduce likelihood, but a more pressing question is: if this happens again, is the consequence acceptable? If not, what do we need to do to make it acceptable?
An alternate option for billing/metering-related systems is for the owner to ensure they have an auditable local system that can differentiate and track the required customer accounts, at least for the largest customers and other critical customers. When the main billing system is brought back online, the local meters are queried for their stand-alone operation data.
Thanks, Dale. I don't think your RTO idea is a good one when it comes to ransomware - since meeting the RTO would probably require paying the ransom (and hoping the attackers are honest thieves). And on the OT side of the power grid, where literally no outage is acceptable, I don't think RTO has any real meaning at all (it does for IT, of course).
No OT outage for critical infrastructure is acceptable (and if it is, I'd argue it's not CI). Network separation is of course the baseline that all CI should meet, and it was met (probably) by both CI organizations in the post. However, in both cases the OT network came down anyway. My point is that CI industries (not just power) need to think about how they can avoid shutting their OT down for one of these reasons (and I'm sure there are others as well), if their IT network gets shut down by ransomware.
Then I just had an unacceptable situation Sunday and Monday in Maui, where we had no power for over 36 hours because a storm took out part of the transmission system. Basing cyber risk management on achieving a likelihood of zero is a flawed strategy, imo. #agreetodisagree
Dale, I think any power outage - or outage of critical infrastructure in general - is unacceptable, period. I realize that a 5-minute outage in Maui might seem like something that could be accepted, but if someone is using a ventilator at home and has no backup power, those 5 minutes might be fatal.
But I don't understand what you mean by the likelihood being zero - I'm not saying that. Ransomware attacks have a far-above-zero likelihood. What matters is impact. In the case of your Maui outage, perhaps, say, a 4-hour outage would have been acceptable, but that would definitely not have been the case if you'd experienced a 4-hour outage in Texas last February 15.
The impact of a CI attack can never be known until after the attack, so it always needs to be considered unacceptably high for risk management purposes. The RTO has to be zero.
Tom, what I meant by "cyber risk management based on achieving a likelihood of zero is a flawed strategy" is that the belief that any set of controls will eliminate all future outages is a flawed strategy. I've found that much more efficient and effective risk reduction is achieved by reducing consequence, across most sectors, after a set of basic security controls to reduce likelihood is in place.
Although I'd also say that reducing the consequence of a power outage to nil is equally flawed, I appreciate your desire for perfection: no outages. We should recognize this is not achieved, or there would never be outages.
Taking it to an absurd level: if you say that no outage is acceptable and reducing RTO is not a worthy risk reduction effort, why do we bother with black start capability?
Thanks, Dale. Saying that no outage is acceptable isn't the same thing as saying there will never be outages. We certainly have to be prepared for them. But that doesn't mean they're acceptable. Just that if we have one, we need to try to end it as quickly as possible.
Most distribution cooperatives rely on their G&T (Generation and Transmission Cooperative) for SCADA operations. I believe a hero here may be the co-op's G&T, for having proper segmentation and allowing only their dispatch center to have limited access into the G&T OT environment.