(C) 2010 Hank Wallace
Wanna build a bomb? A BIG bomb? Well, you are going to need some weapons grade uranium. What? Can’t find any on the shelf at WalMart? I suppose you will just have to make your own!
That, of course, involves enriching mined uranium through a process involving centrifuges, and said centrifuges are not easy to come by. You’ll have to build those, perhaps with the help of some friends in North Korea or Pakistan.
Once you have the parts and have assembled a few spinning units, you’ll need a control system to be sure the centrifuges rotate at just the right angular velocity. The control system is important because it determines the purity of the final result, and ultimately the yield of your nuclear bomb.
We’re finding out that the Iranian uranium enrichment program undertook just these steps in their quest to become a nuclear power. We’re also finding out that an unknown group of (likely governmental) computer geeks went to great lengths to sabotage the Iranian program, using a computer virus that is being called Stuxnet. The story illustrates some important lessons in the design of computer systems. Let’s explore them briefly.
We need to know only a few tidbits of information, culled from this article on the Iranian efforts and the virus.
The article outlines how the Iranians started out with a noble goal, a centrifuge operation running on a network totally disconnected from the outside world, in order to avoid computer attacks and viruses. The virus creators knew, however, that some poor slob underling would most certainly shuttle a USB flash drive between that network and his home computer, to move some files. The Stuxnet virus took advantage of that to move into the target network space. Their “air gap security condom” failed because the government employees behind it were short on common sense.
But that’s not the worst part. Read the entire article and see if you have the convulsive reaction I did. You should get no further than the phrase, “…the Windows 7 operating system that controlled the overall operation of the plant.”
That’s right! The Iranians used PCs running Windows 7 to control their entire uranium enrichment plant!
The article goes on to describe how the creators of the Stuxnet virus found four previously unknown vulnerabilities in Windows and exploited them to propagate the virus. That took all of ten minutes, I expect.
Picture yourself sitting around the conference room table in Tehran with ten or twelve Iranian nuclear scientists, a few mullahs, and El Presidente himself (pardon the mixed ethnic contexts). This is the kickoff meeting, and the boys down in graphics design have whipped up some cool 3D models of the plant, with thousands of faithful centrifuges pictured doing the work of the kingdom. The presentation is almost over, and El Presidente himself interjects a question: “Are you going to base the entire computer control system on a poorly designed, buggy, unreliable, designed-by-committee piece of junk operating system?”
All eyes slide to the chief scientist, who responds through a sweaty upper lip, “Why, yes, El Presidente, we have purchased several legal copies of Windows 7 to run our whole plant, to avoid copyright infringement issues.”
“Marvelous!” responds El Presidente, and we are off to the races. The chief scientist says a silent prayer of thanks.
That’s like basing the design of a skyscraper on tinker toy technology, or attempting cancer chemotherapy with aspirin, or cutting timber with a pocket knife, etc.
The article referenced outlines how the Stuxnet virus weaseled its way into the system through Windows and varied the rotational rate of the centrifuges to wear out the bearings and reduce the quality of the enriched uranium. Those geeks were clever, if not original.
(Does this mean that Bill Gates is indirectly responsible for keeping America safe?)
What say you and I do a rough design of a centrifuge controller that will be immune to the damage attempted by the Stuxnet virus, just in case the Iranians (or any other evil empire) desire to rebuild their capabilities? Take a look at the following diagram:
The first thing we must do is isolate the centrifuge controller from any and every Microsoft product. I suggest doing this through a simple wired interface, say RS-232 or RS-485. We have here a controller that is built for security and to protect the equipment, and it talks through one and only one pipe to the outside world. The vertical dashed line separates the reliable controller from the sloppy world around it.
The outside software world is populated by everything from high reliability RTOS implementations to GNU and public domain slimeware. You would be appalled at the trash that runs many products today. I have had the rare privilege of looking into various systems over the years, and it is surprising how willing programmers are to insert whatever code they can find into their products, with little testing. They are like crows, picking up shiny objects to line their nests.
Windows is perhaps the worst offender, but following close behind are Linux and the whole of the GNU whirling ball of gas software mess. TCP/IP stacks are the source of many software vulnerabilities as programmers ignore their responsibility to check buffer lengths before copying packet data. Other control protocols and wireless systems have layer upon layer of poorly tested or poorly designed code. I guarantee you that I could find an unsecured or poorly secured WiFi network in that Iranian enrichment plant.
We have to insulate our precious centrifuges from this mass of software silliness.
The main way to do this is through the use of 100% source code visible embedded programming. No third party packages should be used, and that will require simplifying the controller to do only the most essential tasks. It is not required that the controller send emails to scientists. It is not required that the controller send graphs of performance data to mobile phones. It is not required that the controllers be firmware upgradable over a network. None of these glitzy features are needed. And the code should be running out of one time programmable (OTP) or mask programmed devices, to eliminate the possibility of program errors or malicious code rewriting the program. Under no circumstances should any code run from RAM. (This also means no FPGAs loading their configurations from RAM.)
Why RS-232 or RS-485? Why not USB or some other more advanced protocol at the physical and link layers? The answer is simplicity and reliability, and confidence in the hardware. It’s hard to argue with the reliability of an RS-232 driver, and it’s easy to comprehend how it works. There is no internal micro running code, no invisible protocol, and little room for designer error.
Contrast that with USB devices. Having used some integrated microcontroller/USB parts, I can tell you with certainty that the designers of those parts do not understand completely how their own parts work. The documentation is poor and I’ve had to find workarounds to multiple issues not addressed in the data sheets or errata. What happens when a descriptor is corrupted? Have corrupted descriptors been tested on the hardware or microcode? Not a chance! Have these USB controllers been tested against aggressive, malicious attacks? Is it possible to put a set of bytes in at one end and receive different data at the other? Would I want to use these devices in my ultra-reliable system?
Of course, use of 40 year old technology offends our inner geek. But the raw simplicity of RS-232 and RS-485 is compelling.
If one is worried about on-site, in-person compromise, the hardware interface must also be protected against hardware based attacks, for example, subjecting the drivers to excessive voltages. We’ll worry here only about software oriented attacks, for the Iranian military knows how to enforce personal discipline!
A software driver uses the hardware to communicate to the outside world. This driver catches bytes arriving through the hardware, and also sends complete messages to the outside world. This may be as simple as a circular buffer based routine running off an interrupt.
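Here is a minimal sketch of the receive half of such a driver, assuming a single interrupt that hands us one byte at a time. Every name in it (rx_ring_t, rx_ring_put, rx_ring_get, RX_BUF_SIZE) is made up for illustration, and hooking rx_ring_put to a real UART interrupt depends on the part you choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical size: a power of two so the index wraps with a mask. */
#define RX_BUF_SIZE 64u
static_assert((RX_BUF_SIZE & (RX_BUF_SIZE - 1u)) == 0u,
              "RX_BUF_SIZE must be a power of two");

typedef struct {
    volatile uint8_t  data[RX_BUF_SIZE];
    volatile uint16_t head;     /* written only by the interrupt routine */
    volatile uint16_t tail;     /* written only by the main loop */
    volatile uint16_t dropped;  /* bytes discarded because the buffer was full */
} rx_ring_t;

/* Call from the receive interrupt with the byte just read from the hardware. */
static void rx_ring_put(rx_ring_t *r, uint8_t byte)
{
    uint16_t next = (uint16_t)((r->head + 1u) & (RX_BUF_SIZE - 1u));
    if (next == r->tail) {              /* full: drop the byte, never overwrite */
        if (r->dropped < UINT16_MAX) {
            r->dropped++;
        }
        return;
    }
    r->data[r->head] = byte;
    r->head = next;
}

/* Call from the main loop. Returns false when no byte is waiting. */
static bool rx_ring_get(rx_ring_t *r, uint8_t *out)
{
    if (r->tail == r->head) {
        return false;
    }
    *out = r->data[r->tail];
    r->tail = (uint16_t)((r->tail + 1u) & (RX_BUF_SIZE - 1u));
    return true;
}
```

One slot always stays empty so that "full" and "empty" are distinguishable, which leaves room for 63 bytes. A flood of input costs us dropped bytes and a counter we can report, not a corrupted buffer.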
The driver must be tested for vulnerabilities. It should be tested for data rate capacity, intercharacter timing sensitivity, illegally formatted data response (by the byte), incomplete message response, timeout response, response to random character streams, etc. The driver in this application may be simple, but it still needs to be tested against unusual input. The CPU loading of the driver should also be measured, if it is interrupt driven.
The errata sheet(s) for the serial device or microcontroller should be consulted to determine what gotchas await, and each known erratum should be tested to determine whether it applies to our system. Be careful, because some manufacturers have numerous errors in each revision of each device, so you have to be sure that the revision you tested against is the one to be used in production. This is important because these bugs can cause all manner of errors, from no data transfer to infinite interrupt loops because data ready flags do not reset.
The packet parser takes data from the driver and determines the packet boundaries, verifying the packet. This block handles error checking (say, using a CRC), error correction (if used), encryption and authentication, and packet formatting errors. The parser should limit messages to less than the size of internal buffers, so no overruns can occur, prohibiting unexpected execution of code from the buffer, or stack overruns. (If the programmer omits strong error detection methods, he’ll receive a nighttime visit from the Iranian Secret Service. That means no checksums!)
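To make the length and CRC checks concrete, here is a sketch that validates one complete frame. The frame layout (one length byte, the payload, then a two-byte CRC, high byte first) and the 16-byte payload limit are my own assumptions, as are the names packet_check and pkt_status_t. The CRC is the common CRC-16/CCITT-FALSE variant (polynomial 0x1021, initial value 0xFFFF). Encryption and authentication are left out of the sketch entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAYLOAD 16u   /* hypothetical limit, smaller than any internal buffer */
static_assert(MAX_PAYLOAD <= 255u, "payload length must fit in one byte");

typedef enum { PKT_OK, PKT_TOO_SHORT, PKT_BAD_LENGTH, PKT_BAD_CRC } pkt_status_t;

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection. */
static uint16_t crc16_ccitt(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFFu;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)*p++ << 8);
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Assumed layout: [len][payload: len bytes][crc high][crc low].
 * The CRC covers the length byte and the payload.
 * On success, *payload points into frame and *payload_len holds len. */
static pkt_status_t packet_check(const uint8_t *frame, size_t n,
                                 const uint8_t **payload, uint8_t *payload_len)
{
    if (n < 3u) {
        return PKT_TOO_SHORT;
    }
    uint8_t len = frame[0];
    /* Reject before touching any payload byte: the length field is untrusted. */
    if (len > MAX_PAYLOAD || n != (size_t)len + 3u) {
        return PKT_BAD_LENGTH;
    }
    uint16_t want = (uint16_t)(((uint16_t)frame[1u + len] << 8) | frame[2u + len]);
    if (crc16_ccitt(frame, (size_t)len + 1u) != want) {
        return PKT_BAD_CRC;
    }
    *payload = &frame[1];
    *payload_len = len;
    return PKT_OK;
}
```

The order of the tests matters. The declared length is compared against both the hard limit and the number of bytes actually received before anything is indexed with it, so a lying length byte cannot walk the parser off the end of the buffer.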
The command qualifier examines the contents of a validated packet to make sure it is consistent with the command set for the centrifuge controller. This includes additional tests on the command type(s) and data fields, specifically determining whether the passed data is within valid ranges. I have seen numerous TCP/IP communications implementations where there is NO checking of message lengths by message type, and this is an open invitation for security compromise.
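A table-driven qualifier is one way to keep those checks honest. The sketch below assumes a payload whose first byte is the command and whose remaining bytes are the argument. The three commands, their codes, and the 100 to 1000 speed range are placeholders I invented, not anything from a real controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command set. */
enum { CMD_SET_SPEED = 0x01, CMD_GET_STATUS = 0x02, CMD_STOP = 0x03 };

typedef struct {
    uint8_t  id;
    uint8_t  arg_len;   /* exact number of argument bytes this command takes */
    uint16_t arg_min;   /* inclusive range, in arbitrary speed units */
    uint16_t arg_max;
} cmd_rule_t;

static const cmd_rule_t RULES[] = {
    { CMD_SET_SPEED,  2u, 100u, 1000u },
    { CMD_GET_STATUS, 0u,   0u,    0u },
    { CMD_STOP,       0u,   0u,    0u },
};
#define RULE_COUNT (sizeof RULES / sizeof RULES[0])
static_assert(sizeof RULES / sizeof RULES[0] == 3u, "one rule per command");

/* Returns true only when the command is known, has exactly the length its
 * type demands, and carries an argument inside the allowed range. */
static bool command_is_valid(const uint8_t *payload, size_t len, uint16_t *arg_out)
{
    if (len < 1u) {
        return false;
    }
    for (size_t i = 0; i < RULE_COUNT; i++) {
        const cmd_rule_t *r = &RULES[i];
        if (payload[0] != r->id) {
            continue;
        }
        if (len != (size_t)r->arg_len + 1u) {
            return false;               /* wrong length for this message type */
        }
        uint16_t arg = 0u;
        if (r->arg_len == 2u) {
            arg = (uint16_t)(((uint16_t)payload[1] << 8) | payload[2]);
            if (arg < r->arg_min || arg > r->arg_max) {
                return false;           /* value outside the valid range */
            }
        }
        *arg_out = arg;
        return true;
    }
    return false;                       /* unknown command */
}
```

Anything not in the table is refused, so adding a command means adding a row with its length and limits, and there is no default path for an attacker to fall into.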
Whatever Stuxnet was doing to change the centrifuge speed, I presume it was a significant change, not just noise. This apparently had the effect of wearing out bearings and causing excessive down time. The command qualifier could easily limit speed changes using a simple lowpass filter, or a timer to only allow so many changes per hour or day. This is trivial and the Iranian “scientists” collectively earn a Three Stooges dope slap for missing that one.
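A rough sketch of both ideas follows, with one substitution stated plainly: I use a slew-rate limit (a bounded step per tick) in place of a true lowpass filter, since it is the simpler of the two to reason about. The step size, the budget of four changes, the 3600-tick window, and the names limiter_t, limiter_request and limiter_tick are all assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits, in arbitrary speed units and timer ticks. */
#define MAX_STEP_PER_TICK   5u
#define MAX_CHANGES_PER_WIN 4u
#define WINDOW_TICKS        3600u
static_assert(MAX_STEP_PER_TICK > 0u, "the setpoint must be able to move");

typedef struct {
    uint16_t setpoint;      /* value handed to the control loop */
    uint16_t target;        /* last accepted request */
    uint32_t window_start;  /* tick at which the current window opened */
    uint8_t  changes;       /* requests accepted in the current window */
} limiter_t;

/* Returns false when the request is refused because the budget is spent. */
static bool limiter_request(limiter_t *l, uint16_t target, uint32_t now)
{
    if ((uint32_t)(now - l->window_start) >= WINDOW_TICKS) {
        l->window_start = now;          /* open a fresh window */
        l->changes = 0u;
    }
    if (target == l->target) {
        return true;                    /* repeating the target is not a change */
    }
    if (l->changes >= MAX_CHANGES_PER_WIN) {
        return false;
    }
    l->changes++;
    l->target = target;
    return true;
}

/* Call once per tick: moves the setpoint toward the target by a bounded step. */
static uint16_t limiter_tick(limiter_t *l)
{
    if (l->setpoint < l->target) {
        uint16_t gap = (uint16_t)(l->target - l->setpoint);
        l->setpoint = (uint16_t)(l->setpoint +
                      (gap > MAX_STEP_PER_TICK ? MAX_STEP_PER_TICK : gap));
    } else if (l->setpoint > l->target) {
        uint16_t gap = (uint16_t)(l->setpoint - l->target);
        l->setpoint = (uint16_t)(l->setpoint -
                      (gap > MAX_STEP_PER_TICK ? MAX_STEP_PER_TICK : gap));
    }
    return l->setpoint;
}
```

With this in place, a hostile host can ask for wild speed swings all day long. The mechanism sees at most four gentle ramps per window, and every refused request is a fact the controller could report in its status message.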
Stuxnet could easily use the external Windows systems to disconnect the centrifuge controllers, or shut them down, with the proper command. However, the controller can be designed to avoid all damage to the mechanisms, permitting the centrifuges to be connected to clean computers and restarted quickly.
The control system for the centrifuge is perhaps a typical PID loop, presumably with tuning constants common to each centrifuge unit, so they can be hard coded in the program. The article noted that Stuxnet targeted a certain Siemens controller. Critical error, Mr. President. You have no idea what’s in that box, and it is apparently the same loose software that is shipped in so many products today. We’re conquering the world here, not playing in a lab at the university. Note to Iran: Build your own motor controller. Stuxnet could have easily tweaked one of the PID constants to cause the controller to hunt for the preset speed continuously, leaving not even any data traffic to alert the hapless Iranian scientists that something was wrong.
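For the sake of argument, here is what a PID step with hard coded constants might look like. I am only presuming a PID loop, as above, and the gains, time step and output range below are placeholder numbers with no connection to any real machine. The point of the sketch is that the constants are compile-time values, so no message arriving over the wire can change them.

```c
#include <assert.h>

/* Placeholder tuning constants, fixed at build time. There is deliberately
 * no function anywhere that writes to them. */
#define PID_KP 2.0f
#define PID_KI 0.5f
#define PID_KD 0.001f
#define PID_DT 0.01f     /* seconds between calls, assumed constant */

/* Output limits in arbitrary drive units. */
#define OUT_MIN 0
#define OUT_MAX 1000
static_assert(OUT_MIN < OUT_MAX, "output range must be non-empty");

typedef struct {
    float integral;    /* accumulated error, multiplied by PID_KI in the output */
    float prev_error;  /* error from the previous call, for the derivative */
} pid_state_t;

/* One control step. Returns the drive command, clamped to the output range. */
static int pid_step(pid_state_t *s, float setpoint, float measured)
{
    const float error      = setpoint - measured;
    const float integral   = s->integral + error * PID_DT;
    const float derivative = (error - s->prev_error) / PID_DT;
    const float out = PID_KP * error + PID_KI * integral + PID_KD * derivative;

    s->prev_error = error;

    /* Simple anti-windup: when the output saturates, the integral is left
     * frozen at its previous value. */
    if (out > (float)OUT_MAX) {
        return OUT_MAX;
    }
    if (out < (float)OUT_MIN) {
        return OUT_MIN;
    }
    s->integral = integral;
    return (int)out;
}
```

A zero-initialized pid_state_t is a valid starting state. The only things that vary at run time are the setpoint, which has already been through the qualifier, and the measured speed.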
The centrifuge is most certainly instrumented, returning the speed at least, with perhaps some other health data. That information is captured in the status message formulation block. Messages must be formatted sanely and predictably, even in the presence of nonsense sensor values. Though this is at first glance not as important as testing the input data stream, if trash is returned for centrifuge status, then the external high level control system will likely respond with trash, resulting in no nuclear bombs for supper.
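As a sketch of "sane and predictable," the routine below builds a fixed-length status message and refuses to pass along a reading it cannot believe. The four-byte layout, the valid ranges, the flag bits and the name status_format are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message layout: [flags][speed high][speed low][temperature]. */
#define STATUS_LEN      4u
#define SPEED_MAX_VALID 2000    /* arbitrary speed units */
#define TEMP_MAX_VALID  150     /* arbitrary temperature units */
#define FLAG_SPEED_BAD  0x01u
#define FLAG_TEMP_BAD   0x02u
static_assert(SPEED_MAX_VALID <= 65535, "speed must fit in 16 bits");
static_assert(TEMP_MAX_VALID <= 255, "temperature must fit in 8 bits");

/* Always writes exactly STATUS_LEN bytes. A reading outside its valid range
 * is reported as zero with the matching fault flag set, so the receiver
 * never has to guess whether a number is real. */
static size_t status_format(int32_t raw_speed, int32_t raw_temp,
                            uint8_t out[STATUS_LEN])
{
    uint8_t  flags = 0u;
    uint16_t speed = 0u;
    uint8_t  temp  = 0u;

    if (raw_speed < 0 || raw_speed > SPEED_MAX_VALID) {
        flags |= FLAG_SPEED_BAD;
    } else {
        speed = (uint16_t)raw_speed;
    }
    if (raw_temp < 0 || raw_temp > TEMP_MAX_VALID) {
        flags |= FLAG_TEMP_BAD;
    } else {
        temp = (uint8_t)raw_temp;
    }

    out[0] = flags;
    out[1] = (uint8_t)(speed >> 8);
    out[2] = (uint8_t)(speed & 0xFFu);
    out[3] = temp;
    return STATUS_LEN;
}
```

Because the message is the same length and shape no matter what the sensors say, the outgoing side can wrap it in the same length-and-CRC framing used for incoming commands without any special cases.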
Stuxnet reportedly also spoofed the instrumentation, returning data indicating a working system. Ouch. “I’m looking at the screen and all reads normal, so it must be normal.” Didn’t that happen at Three Mile Island?
There! Wasn’t that easy? Just getting rid of Windows increased the reliability of our system by several orders of magnitude. I expect the entire code base for the controller would be a few thousand lines at most, and likely fewer.
Keeping things simple, and avoiding overdone abominations such as Windows, ensures your tyrannical terrorist regime a supply of fun nuclear weapons for generations to come. Proper design and isolation of the critical from the noncritical ensures every scientist on the team eternal paradise, and limits trips to the chopping block. Remember these lessons the next time you are heading a clandestine nuclear weapons program!