Monday, June 25, 2018

An Auditor gives advice on Event Logging



A CIP compliance analyst with a large electric utility wrote in recently with the following question:

'I am curious as to your experience with solutions for CIP7R4.2.2 “Detected failure of Part 4.1 event logging”.  We have heard from program managers that other companies in the industry use either an ICS vendor solution or rig up a “heartbeat” or “polling” proprietary solution. In this particular situation, the platforms are Intel and Linux. I’m curious as to what is accepted as a solution to this interesting requirement.'

I passed this question on to an auditor who usually has something interesting to say on anything having to do with NERC CIP. He didn’t disappoint this time – in fact, he obviously devoted about an hour on a gorgeous weekend day (at least it was here in Chicago – I don’t know about the city where the auditor lives) to putting together the following answer:

“The answer is long and complicated.  It all depends on the capabilities of both the monitored and monitoring systems.  First of all, the entity needs to fully understand the expectation of the requirement.  The requirement is not to determine that the Cyber Asset generating the logs is up and running, or even solely that it is generating logs locally.  The expectation is to detect a failure of the logging process from start to finish.  There are numerous potential points of failure.  Something could happen on the Cyber Asset generating the logs that causes it to stop logging (perhaps the log file is full).  If the device cannot natively send its logs to the log server/SIEM, it will need an agent to perform this function; something could happen to cause the agent to fail.  Perhaps the IP address of the log server is incorrectly configured and the logs are being sent to the bit bucket.  Perhaps there is a networking issue and the log server is not reachable from the Cyber Asset generating the logs.  And then there are the issues that crop up on the log server/SIEM to contend with, especially when the log service and SIEM are different applications on the same or different servers.

“Here is what I have seen that does not work:

“-       Some entities have simply monitored the Cyber Asset generating the logs using a simplistic method such as pinging the system.  That approach fails because it can only detect when the system is either completely down or unable to be reached over the network.  The problem with the ping approach is that it cannot detect when the device is up but the logging service has failed.  As a side note, a Cyber Asset that is down is not generating logs.  That is not a failure of event logging as envisioned by CIP-007-6 R4 Part 4.2.2.  When the system is down, there is nothing to log.  That does not mean that monitoring system availability is not important; it just does not accomplish what is expected in this instance.

“-       A variation of the above is to monitor the logging agent on the Cyber Asset that cannot natively send its logs to a log server/SIEM.  Quite often this is accomplished by checking that the service is “running.”  This approach fails for several reasons.  The service could be hung; while it is “running,” it is not doing anything.  The destination IP address of the log server/SIEM could be incorrect.  There could be a networking issue making the log server/SIEM unreachable.  And, if the only monitoring is of the source and not also of the log server/SIEM itself, the log server/SIEM could be down.  The problem with monitoring the logging service on the source Cyber Asset is that this cannot detect a failure in the path between the source log and the destination log server/SIEM.

“OK, so what can work?  Here is what I have seen:

“-       Some systems are normally “chatty,” meaning that they generate a lot of log traffic in the normal course of operation.  If the SIEM is capable, an event trigger could be configured that would generate an alert if the source system has not been heard from in a reasonable period of time.  For example, a Windows or Unix/Linux system normally generates many logs per minute.  The entity could determine how long it typically takes for the source system to reboot, add a buffer, and set the event trigger to alert if nothing has been received from the source system within the timeout window.  For example, let’s say the source Windows system normally generates an average of ten event log messages per minute when idle and takes five-to-ten minutes to reboot after applying patches.  If the entity defined a trigger event that would alert if no log messages have been received from the Windows system in fifteen minutes, that would accomplish the Part 4.2.2 requirement while minimizing false alerts.  If the system generates only one message an hour and takes five-to-ten minutes to reboot, a two- or three-hour timeout might be appropriate.

“-       Some entities cause their Windows and Unix/Linux Cyber Assets to issue a specifically crafted “heartbeat” event log message on a defined periodicity rather than simply monitoring for any log traffic.  In this case, the SIEM is configured to generate an alert if the heartbeat message is not received as expected.  Again, allowing for normal outages, such as the reboot timing, the failure to receive the heartbeat message indicates a failure somewhere along the path that needs to be investigated.  This is relatively easy to implement, using a cron job in Unix/Linux or an AT scheduled task in Windows.  The periodically scheduled task uses the appropriate operating system features to generate an event log message that is then picked up and sent to the log server/SIEM.  In Windows, this can be done from a .bat file that uses the command line interface to execute the “eventcreate” command.  Again, the timeout is based on how often the heartbeat message is generated.

“-       Some Cyber Assets are very quiet, especially network switches.  These devices usually have no native capability to generate an event log message on demand.  There are several options here.  If the switch is a managed switch with external IP accessibility, the entity might be able to use a remote management system to periodically connect to and log into the switch.  This could be as simple as relying on a third-party solution that is already being used to periodically back up the configuration (e.g., CiscoWorks or Industrial Defender).  The switch is expected to log the access event per CIP-007-6 R4 Part 4.1.1 and 4.1.2 anyhow.  The login attempt message can be used in lieu of a specially crafted heartbeat.  If the switch is not externally reachable for management purposes, the entity might be able to trigger the login event from another Cyber Asset within the ESP and accomplish the same thing.

“-       As a last resort, the entity staff need to manually check on the device, perhaps as part of the daily system checks, to see if there are recent log messages in its buffer that were not sent out.

“If the entity is using multiple log servers and/or redundant SIEMs, the monitoring should include all of them.  That way, the entity does not find itself unexpectedly in a single point of failure situation.

“I am sure there are other options, but these are the typical ones I have seen and none of them require extensive programming effort or expensive vendor support.”
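
To make the timeout and heartbeat ideas a little more concrete, here is a minimal Python sketch. It is not the auditor's code or any SIEM vendor's API, just an illustration under stated assumptions. The function names, the host names and the rule "longest expected gap plus a buffer" are all hypothetical choices made for the example. The check is meant to run on the receiving end (the log server/SIEM side), using the time each source was last heard from, so that a failure anywhere along the path shows up as silence. The heartbeat function relies on Python's standard `syslog` module, which exists only on Unix-like systems, and would be run from a cron job.

```python
from datetime import datetime, timedelta


def silence_timeout(expected_gap, max_reboot, buffer):
    """Longest silence tolerated before alerting (an assumed rule of thumb):
    the larger of the normal gap between messages and the reboot time, plus a buffer."""
    return max(expected_gap, max_reboot) + buffer


def overdue_sources(last_seen, timeouts, now):
    """Return monitored sources whose most recent message is older than their timeout.

    A monitored source that has never been heard from is reported as overdue too.
    """
    overdue = []
    for source, timeout in timeouts.items():
        seen = last_seen.get(source)
        if seen is None or now - seen > timeout:
            overdue.append(source)
    return sorted(overdue)


def emit_heartbeat(tag="cip-heartbeat"):
    """Write one heartbeat line to the local syslog; intended to be run from cron."""
    import syslog  # imported here because the module exists only on Unix-like systems
    syslog.openlog(ident=tag)
    syslog.syslog(syslog.LOG_INFO, "event logging heartbeat")


if __name__ == "__main__":
    now = datetime(2018, 6, 25, 12, 0)
    timeouts = {
        # Chatty host: a message every few seconds, reboot up to 10 minutes -> 15 minutes
        "win-hmi-01": silence_timeout(
            timedelta(seconds=6), timedelta(minutes=10), timedelta(minutes=5)),
        # Quiet host: one message an hour -> 2 hours
        "quiet-host-02": silence_timeout(
            timedelta(hours=1), timedelta(minutes=10), timedelta(hours=1)),
    }
    last_seen = {
        "win-hmi-01": now - timedelta(minutes=20),     # silent longer than its 15 minutes
        "quiet-host-02": now - timedelta(minutes=90),  # still inside its 2 hours
    }
    print(overdue_sources(last_seen, timeouts, now))   # prints ['win-hmi-01']
```

With the numbers from the auditor's examples, the chatty Windows system gets a fifteen-minute window and the once-an-hour system gets two hours. In practice the "last seen" times would come from whatever the log server/SIEM records on receipt, not from the source itself, and every log server or SIEM in a redundant setup would need the same check.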


Any opinions expressed in this blog post are strictly mine and are not necessarily shared by any of the clients of Tom Alrich LLC.

If you would like to comment on what you have read here, I would love to hear from you. Please email me at tom@tomalrich.com. Please keep in mind that if you’re a NERC entity, Tom Alrich LLC can help you with NERC CIP issues or challenges like what is discussed in this post – especially on compliance with CIP-013. And if you’re a security vendor to the power industry, TALLC can help you by developing marketing materials, delivering webinars, etc. To discuss any of this, you can email me at the same address.         
               

