Despite a few spectacular failures (such as the Mars Climate Orbiter and Mars Polar Lander), NASA is one of the top
organizations worldwide in Systems and Software Engineering. The Mars rover Opportunity has recently
celebrated ten years of operation on Mars, and the newer Curiosity has been
operating since August 2012, after a landing maneuver of unprecedented
complexity. This is quite a feat,
considering the fact that the nearest repair technician is over 30 million
miles away, and communications delay is about 15 minutes on average. How do they do it?
The paper Mars Code by Gerard J. Holzmann (Communications of the ACM, Feb. 2014) gives a rare glimpse into the process used in programming the Mars
rovers. It focuses on three methods used
to reduce the risk of failure. An
important feature of all three methods is the use of automation. The first consists of a set of risk-related
coding rules, based on actual risks observed in previous missions. For example, one rule requires the code to
pass compilation and static analyzer checks without any warnings. Another requires the code to have a minimum
assertion density of 2%. The assertions
are enabled not only for testing, but also during the mission, when violations
place the rover into a safe mode, during which failures can be diagnosed and
fixed. All the rules are checked
automatically when the code is compiled.
The second method consists of tool-based code review. Four different static-analysis tools are
used, with an automated system to manage the interaction between developers and
reviewers in response to tool warnings.
In over 80% of the cases, developers agreed with the warnings and fixed
the code without reviewer intervention.
An interesting observation Holzmann makes is that there was surprisingly
little overlap between the between the warnings issued by the four tools. Apparently, each tool has its own areas of strength, and they are different enough that it is useful to use them all.
The third and strongest method is the use of a model
checking tool for race conditions, one of the most difficult types of bugs to
discover. This cannot be done for the
whole code base, but was used extensively in the Curiosity mission to verify
critical multithreaded algorithms. In
some cases, the tool works directly from the C source code. For larger subsystems, input models had to be
prepared manually in the language of the model checker; this can reduce the
input size by over one order of magnitude.
The challenges faced by NASA are obviously unique, and are
very different from those involved in cloud-based mobile applications, to take
a popular example. There are many papers
and books describing methodologies for developing such systems. However, developing safety-critical systems
for cars and aircraft, medical devices, and power grids, or financial systems
that must be resilient to faults and deliberate attacks, requires reliability
levels similar to those of NASA. For
developers of such systems, papers such as Holzmann’s should be required
reading.
A related article you may want to read is They Write the Right Stuff, by Charles Fishman. This article, from 1996, describes the culture of the group that wrote the software for the space shuttle, and how their development process led to the very high quality of their product. For example, whenever a mistake was found in the code, it was traced to the corresponding problem in the process that allowed the mistake to be written and pass reviews and testing. Then, the team investigated what other bugs could have been caused by the same process failure. In this way, a bug in the control of the shuttle's robotic manipulator arm was traced back to a certain type of problem. When investigating other effects of the same problem, a much more serious bug was found in the control of the high-gain antenna. The bug was fixed before it had a chance to manifest itself.
It takes a lot of discipline and investment to work this way, but the result is a much better, more reliable product. Sometimes it isn't all that important; other times, it's the difference between life and death.