September 28, 2024
Report Blames Antivirus Software for EMS Deadlock

Software designed to provide computer security had the ironic consequence of rendering an electric utility’s essential systems unusable for over a half-hour, and then for more than an hour the following weekend, according to a new Lessons Learned report published by NERC on Wednesday.

The incident outlined in the Loss of Energy Management System Functionality due to Server Resource Deadlock document began with an antivirus software suite installed on production servers for the utility’s energy management system (EMS). When exposed to certain malware signatures found in the EMS production environment, the software deadlocked, meaning that a process or thread entered a waiting state that it could not exit because it needed resources in use by another waiting process.
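The circular wait described above can be illustrated with a short Python sketch. It is not taken from the NERC report, which does not disclose the software's internals; the worker names "scanner" and "backup", the two locks and the half-second timeout are all hypothetical. The timeout exists only so the demonstration ends; in a true deadlock both workers would wait indefinitely.

```python
import threading


def attempt(name, first, second, rendezvous, outcome, timeout):
    """Take `first`, then try to take `second`; record whether that succeeded."""
    with first:
        if rendezvous is not None:
            rendezvous.wait()  # both workers now hold their first lock
        acquired = second.acquire(timeout=timeout)
        outcome[name] = acquired
        if rendezvous is not None:
            rendezvous.wait()  # keep holding `first` until both attempts are done
        if acquired:
            second.release()


def simulate(opposite_order, timeout=0.5):
    """Run two hypothetical workers that each need the same two locks."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    # The barrier forces the circular wait to occur every time in the
    # opposite-order case; the same-order case needs no coordination.
    rendezvous = threading.Barrier(2) if opposite_order else None
    outcome = {}
    plan = {
        "scanner": (lock_a, lock_b),
        "backup": (lock_b, lock_a) if opposite_order else (lock_a, lock_b),
    }
    threads = [
        threading.Thread(
            target=attempt,
            args=(name, first, second, rendezvous, outcome, timeout),
        )
        for name, (first, second) in plan.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    print("opposite lock order:", simulate(opposite_order=True))
    print("same lock order:    ", simulate(opposite_order=False))
```

When the two workers take the locks in opposite orders, each ends up holding the resource the other needs and neither second acquisition succeeds. When both take the locks in the same order, one simply waits its turn and both finish.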

As with all of NERC’s Lessons Learned reports, specific details about the event — including the location, date, utilities and regional entities involved — were not provided in order to keep the focus off individual companies. The goal of the documents is not to identify wrongdoing or deficiencies but to educate entities on potential issues that may not be covered by existing reliability standards. At the same time, NERC emphasizes that “implementation of these lessons learned is not a substitute for compliance with” NERC’s standards.

The “flaw was latent in the engine” of the antivirus software, but the registered entity’s staff did not detect it in testing because the production servers had “extremely high file I/O [input/output]” compared to the test environment, resulting in “more opportunity for the antivirus engine to deadlock.”

After the servers locked, they were “effectively … unavailable to operators” and left the entity unable to control bulk electric system (BES) elements at affected substations. As a result, operators could not calculate reporting area control error (ACE) or control performance standards, or implement automatic generation control. In addition, the entity’s state estimator and real-time contingency analysis (RTCA) were not solving, and the EMS could not perform its real-time monitoring and alarming functions.

The first deadlock lasted 31 minutes. NERC said in the report that the entity was able to implement alternative processes for most, if not all, of the affected functions. For example, personnel at regional dispatch centers were able to monitor BES substations and notify the entity of any unusual conditions, while the reliability coordinator took over solving the RTCA and calculated reporting ACE until the utility could restore the system.

The utility’s incident response team (IRT) resolved the first deadlock by disabling non-critical services on the active EMS servers. By this point the provider of the antivirus software had identified a flawed signature as the likely root cause. The IRT applied a new signature and re-enabled non-critical services.

The second deadlock occurred five days later, and the entity implemented its response measures again. After failing to restart the EMS processes through a soft reboot, the IRT decided “to quarantine the impacted server for forensic analysis and to perform a hard reboot of the servers.” Meanwhile, the team again disabled non-critical services on the active EMS servers, which were not re-enabled until the IRT had tested and verified their safety.

This time the deadlock lasted 81 minutes before EMS functions were fully restored. The IRT’s forensic analysis of the affected server determined that the deadlock occurred when the EMS processes, backup services processes and anti-malware processes ran simultaneously. In addition, the vendor discovered the flaw in the antivirus engine that ultimately caused the problem.

NERC identified several lessons from the utility’s experience. First, after the initial deadlock the entity accepted at face value the antivirus vendor’s diagnosis of a signature flaw as the root cause, which proved to be wrong, rather than performing its own testing. Because the actual underlying issue was not addressed, the system remained vulnerable to the same flaw. NERC suggested that entities implement “a more rigorous testing process … during incident response to verify the root cause.”

Next, the report suggested that having staff present (both in the central control center and in the “war room” where the IRT coordinated its response) was a major help, with a “shared physical space [that] allowed the team to decompose a complex problem [and] maintain momentum and energy during the long response.” At the same time, the ability to remotely and securely connect to the EMS systems, implemented in 2018, “shaved significant time off the response” compared to having to drive onsite to gain physical access.

NERC also noted that with “modern EMS technology environments [making] incidents more complex to respond to,” multiple teams had to coordinate their actions to restore stability to the system. The organization said responders should be equipped with a central response toolkit and given scenario-based training so that they are ready to use their tools properly. The ability to quarantine the impacted server was also noted approvingly because it allowed the IRT to perform in-depth analysis without affecting active production equipment.

