Monday, March 10, 2014

Review: Mars Code



Despite a few spectacular failures (such as the Mars Climate Orbiter and Mars Polar Lander), NASA is one of the top organizations worldwide in systems and software engineering.  The Mars rover Opportunity recently celebrated ten years of operation on Mars, and the newer Curiosity has been operating since August 2012, after a landing maneuver of unprecedented complexity.  This is quite a feat, considering that the nearest repair technician is over 30 million miles away and the communications delay is about 15 minutes on average.  How do they do it?

The paper Mars Code by Gerard J. Holzmann (Communications of the ACM,  Feb. 2014) gives a rare glimpse into the process used in programming the Mars rovers.  It focuses on three methods used to reduce the risk of failure.  An important feature of all three methods is the use of automation.  The first consists of a set of risk-related coding rules, based on actual risks observed in previous missions.  For example, one rule requires the code to pass compilation and static analyzer checks without any warnings.  Another requires the code to have a minimum assertion density of 2%.  The assertions are enabled not only for testing, but also during the mission, when violations place the rover into a safe mode, during which failures can be diagnosed and fixed.  All the rules are checked automatically when the code is compiled. 
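To make the assertion rule concrete, here is a minimal sketch in C of what an always-on assertion might look like.  The macro name and the safe-mode hook are my own invention, not code from the paper; the point is only that the check is never compiled out, and that a violation requests a safe mode rather than aborting.

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical fault-protection hook: the paper does not show the real
     * flight interface, so this sketch just logs the violation.  On the
     * rover, the equivalent call would place the vehicle into a safe mode. */
    static void request_safe_mode(const char *file, int line, const char *expr)
    {
        fprintf(stderr, "assertion failed at %s:%d: %s -- requesting safe mode\n",
                file, line, expr);
    }

    /* Flight-style assertion: never compiled out, even in the flight build.
     * On violation it does not abort the program; it reports the failure so
     * the system can enter a state in which the problem can be diagnosed. */
    #define FLIGHT_ASSERT(expr)                                       \
        do {                                                          \
            if (!(expr))                                              \
                request_safe_mode(__FILE__, __LINE__, #expr);         \
        } while (false)

    int main(void)
    {
        int elapsed_seconds = 0;            /* bad input, e.g. a clock glitch */
        FLIGHT_ASSERT(elapsed_seconds > 0); /* fires, but does not abort */
        return 0;
    }

At the required 2% density, a fifty-line function would carry, on average, one such check.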

The second method consists of tool-based code review.  Four different static-analysis tools are used, with an automated system that manages the interaction between developers and reviewers in response to tool warnings.  In over 80% of the cases, developers agreed with the warnings and fixed the code without reviewer intervention.  An interesting observation Holzmann makes is that there was surprisingly little overlap between the warnings issued by the four tools.  Apparently, each tool has its own areas of strength, and they are different enough that it is worthwhile to use them all.

The third and strongest method is the use of a model-checking tool to find race conditions, one of the most difficult types of bugs to discover.  Model checking cannot be applied to the whole code base, but it was used extensively in the Curiosity mission to verify critical multithreaded algorithms.  In some cases, the tool works directly from the C source code.  For larger subsystems, input models had to be prepared manually in the language of the model checker; this can reduce the input size by more than an order of magnitude.
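To see why ordinary testing is a weak defense against this class of bug, consider a minimal C sketch (my own example, not taken from the paper): two threads increment a shared counter without any locking.

    #include <pthread.h>
    #include <stdio.h>

    /* Two threads update a shared counter without synchronization.  The
     * increment is a non-atomic read-modify-write, so updates can be lost,
     * but whether that happens in any given run depends on the scheduler. */
    static long counter = 0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++)
            counter++;   /* data race: load, add, store */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 200000)\n", counter);
        return 0;
    }

Compiled with -pthread, the program may or may not print 200000 on any given run; a model checker, instead of relying on scheduling luck, systematically explores the possible interleavings and reports the bad one.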

The challenges faced by NASA are obviously unique, and very different from those involved in, say, cloud-based mobile applications.  There are many papers and books describing methodologies for developing such systems.  However, safety-critical systems in cars, aircraft, medical devices, and power grids, as well as financial systems that must be resilient to faults and deliberate attacks, require reliability levels similar to NASA's.  For developers of such systems, papers such as Holzmann’s should be required reading.

A related article you may want to read is They Write the Right Stuff, by Charles Fishman.  This article, from 1996, describes the culture of the group that wrote the software for the space shuttle, and how their development process led to the very high quality of their product.  For example, whenever a mistake was found in the code, it was traced to the corresponding problem in the process that allowed the mistake to be written and to pass reviews and testing.  The team then investigated what other bugs the same process failure could have caused.  In this way, a bug in the control of the shuttle's robotic manipulator arm was traced back to a particular process problem, and when the team looked for other effects of that same problem, a much more serious bug was found in the control of the high-gain antenna.  The bug was fixed before it had a chance to manifest itself.

It takes a lot of discipline and investment to work this way, but the result is a much better, more reliable product.  Sometimes it isn't all that important; other times, it's the difference between life and death.