REALLY REMOTE Debugging: A Conversation with Glenn Reeves
In this classic 1999 Dr. Dobb's Journal interview, contributing editor Jack Woehr talks with NASA's Glenn Reeves about the programming behind the Mars Surveyor program. At the end of the post, read Reeves' first-hand account of the still-amazing fix of the team's priority inversion problem -- talk about REALLY REMOTE debugging for real-time systems!
A Conversation with Glenn Reeves
By Jack J. Woehr
The Mars Pathfinder Mission was the first mission of NASA's Mars Surveyor program -- a decade-long program of robotic exploration that focused on the search for evidence of past life on Mars, understanding the Martian climate and its lessons for the past and future of Earth's climate, and understanding the geology and resources that could be used to support future human missions to Mars. As such, the mission (managed by the Jet Propulsion Laboratory; http://www.jpl.nasa.gov/) was one of the most ambitious and closely watched space missions in history.
Pathfinder essentially consisted of a stationary lander (the "Lander") and surface rover (called "Sojourner"), which together had the primary objective of demonstrating the feasibility of low-cost landings on and exploration of the Martian surface. The Pathfinder system itself was built around a variety of off-the-shelf hardware and software components. At the heart of the system was a single CPU -- the RS6000 -- running the Wind River Systems VxWorks real-time operating system.
Pathfinder was launched on December 4, 1996, and had a seven-month cruise to Mars. It landed on Mars on July 4, 1997, where Sojourner almost immediately began conducting experiments and collecting data. In the first month of surface operations the mission returned to NASA about 1.2 gigabits of data, including 9669 Lander and 384 Rover images and about 4 million temperature, pressure, and wind measurements.
Suddenly, on September 27, 1997, communication with Sojourner was lost as the system began experiencing total resets, along with the resulting loss of data. The mainstream press promptly -- and generally inaccurately -- pounced on the problem, referring to "software glitches" that were due to the computer "trying to do too many things at once."
The problem, as JPL engineers like Glenn Reeves (the "Flight Software Cognizant Engineer for the Attitude and Information Management Subsystem, Mars Pathfinder Mission") quickly discovered, involved priority inversion -- a phenomenon familiar to real-time operating-systems engineers for more than 20 years. Contrary to most published reports, the problem was hardly extraterrestrial: A low-priority task (such as one for meteorological data gathering) occasionally grabbed a semaphore needed by a top-priority task (a bus management task, for example), then got preempted by medium-priority tasks (for instance, a communications task). As a result, the high-priority task was not able to complete its work by the specified time. When the blocked top-priority task didn't do its work in time, a watchdog thread reset the system. This reset reinitialized all of the hardware and software. It also terminated the execution of the current ground-commanded activities. Data that had already been collected wasn't lost. However, the remainder of the activities for that day could not be accomplished until the next day.
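The scenario described above can be sketched as a small, deterministic simulation. This is only an illustration, not the Pathfinder flight code: the task names, tick counts, and the convention that a larger number means a more urgent task are all assumptions made for the example. The `inherit` switch models priority inheritance, the remedy Reeves describes in his account at the end of this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int            # larger number = more urgent (a convention of this sketch)
    release: int             # tick at which the task becomes ready
    ops: list                # each op ("lock", "work", "unlock") costs one tick
    pc: int = 0              # index of the next op to execute
    finished_at: Optional[int] = None


def simulate(inherit, max_ticks=100):
    """Run three tasks on one CPU; return {task name: finishing tick}."""
    tasks = [
        Task("met_data", 1, 0, ["lock", "work", "work", "unlock"]),  # low
        Task("comms", 2, 2, ["work"] * 6),                           # medium
        Task("bus_mgmt", 3, 1, ["lock", "work", "unlock"]),          # high
    ]
    owner = None  # task currently holding the single shared mutex

    for tick in range(max_ticks):
        ready = [t for t in tasks if t.release <= tick and t.finished_at is None]
        # A task is blocked if it wants the mutex while another task holds it.
        blocked = [t for t in ready
                   if t.ops[t.pc] == "lock" and owner is not None and owner is not t]
        runnable = [t for t in ready if t not in blocked]
        if not runnable:
            if all(t.finished_at is not None for t in tasks):
                break
            continue  # idle tick

        def effective(t):
            # With inheritance, the mutex owner borrows the priority of
            # the most urgent task it is blocking.
            if inherit and t is owner and blocked:
                return max(t.priority, max(b.priority for b in blocked))
            return t.priority

        current = max(runnable, key=effective)  # fixed-priority preemption
        op = current.ops[current.pc]
        if op == "lock":
            owner = current
        elif op == "unlock":
            owner = None
        current.pc += 1
        if current.pc == len(current.ops):
            current.finished_at = tick + 1
    return {t.name: t.finished_at for t in tasks}


def watchdog_would_reset(finish_tick, deadline):
    """True if the high-priority task missed its (hypothetical) deadline."""
    return finish_tick is None or finish_tick > deadline


if __name__ == "__main__":
    for inherit in (False, True):
        result = simulate(inherit)
        print(inherit, result, watchdog_would_reset(result["bus_mgmt"], 8))
```

With these sample timings the high-priority task finishes at tick 13 without inheritance and at tick 7 with it, so against an assumed 8-tick deadline only the first run would trip the watchdog. Note that the medium-priority task never touches the mutex; it causes the delay simply by outranking the low-priority task that holds it.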
Not only were JPL engineers aware of priority inversion from the outset, but they had carefully planned for instances of it. Consequently, they were able to quickly reproduce the problem on a duplicate system in the lab, then transfer a fix -- changing the default options used when the select() facility creates its semaphore, so as to enable priority inheritance -- to the spacecraft about 100 million miles away. (Talk about remote debugging!) This was doable because they had originally created two writable images of the control software, and either one could be deleted and reloaded from Earth. The team chose to never delete one of the two original copies while patching from the ground, for fear of losing something clean to boot with.
In the end, all went well. The spacecraft completed its mission. Reeves, who is now the Chief Engineer, Mission Data System Project, recently took time to chat with DDJ contributing editor Jack Woehr about Pathfinder and other issues involved in writing software for extraterrestrial exploration.
DDJ: You're programming computers that operate on the surface of another planet. That's weird!
GR: It's weird for me, let me tell you. You're in this microcosm of people where your whole life is concentrated on making sure the next day's work of activities are going to go just right. You're walking home at 3:00am still wired from staying up all night looking at information coming back. Your wife and the kids are asleep, there's no way you can sleep and you've got to be back at 8:00am anyway. So you turn on CNN and there's your smiling face in the control room and you're looking at the same images you looked at 12 hours ago. It's a strange feeling.
It was a small project by a small team, yet it had a phenomenal impact, overwhelming to many of us. On the technical side, it seemed straightforward at the time.
DDJ: I guess when you're doing the code, it's like any other real-time project.
GR: When you have a ballistic entry, there's not much you can do about it. There's a little unknown, where you're relying on information gathered 22 years ago, but in general, we tried to simplify that job as much as we could.
We spun the spacecraft, we had inertia on our side...the control part was fairly straightforward. The hard part was thinking what things could go wrong, what do we do if something breaks...There are very few things that are redundant. How are we to go about testing it? How do we make sure that we can build an environment that matches what we want the spacecraft to fly through? Things like that turn out to be much harder than the spacecraft development itself.
DDJ: That's always a problem when one designs a large system, how one tests it in anything resembling the deployment environment.
GR: Space is pretty forgiving, few forces, everything action and reaction. Coming down through the atmosphere, that's a little different ballgame, but we can still do decent simulations. We gave up on trying to model bouncing and orientation of the vehicle and the simulation of the accelerometer values during the movements. Those were too hard to do. We tested the thing as hard as we could in a brute force sort of way.
DDJ: You dropped it from a tall building?
GR: Some drop testing. We did a lot of testing of the airbags. We proved to ourselves they could withstand the forces of the bounces, made sure they wouldn't bottom out, would retain the right pressure, that the airbag material itself wouldn't rip prematurely or catastrophically. We did that for two or three years, it was either the most expensive portion of our test program or pretty close.
The actual orientation, opening the petals and turning the vehicle over, was very much brute force. Put the thing on the ground, read the accelerometers, see that it opened the right petal in the right order.
We did testing of the airbags as they are retracted. They get pulled in by cables within the bag. We did a lot of testing with rock placement, "Given the fact that bag's going to wrap around this rock, are the motors going to stall and leave the bag extended and flapping in the breeze?"
In the end, the darn thing bounced 16 times, landed right on its base, and all the airbags retracted right in, just like it was supposed to. The perfect scenario happened in the actual mission.
DDJ: It seems almost petty to ask about the computer programming problem in comparison with the flow of the entire enterprise.
GR: In this mission, most of the navigation was done from the ground. The attitude of the spacecraft was controlled onboard.
JPL used to build their own computers; in fact, at a point in time, we built our own CPUs. Not any longer, that's an art form that's better left to industry. Single-board computers are an off-the-shelf item.
So for the second time in history, JPL went out to buy a flight computer on contract. We put in a provision that specified an operating system on top of that. I wrote a lot of the flight computer specifications.
I was very familiar with [Mentor Graphics] VRTX and [Wind River Systems] VxWorks, and have done some [Integrated Systems] pSOS work before. So I specified a fairly generic real-time operating system. What was proposed by IBM, the winner of the flight computer contract, was OSOpen, done by IBM's Raleigh, North Carolina division. It was in early beta.
We looked at it, and they had some more work to do on it. Then suddenly around 1993-1994, IBM Federal Systems got sold to Loral. What had been one company became two divisions of two large companies that were mutually antagonistic. It became obvious that we weren't going to get a real-time operating system from these guys in Raleigh within the time frame we needed it.
That's when we decided to go to Wind River and see if they would come in and port for us to this RAD6000 computer.
DDJ: The original IBM RISC mask, rad-hard?
GR: Right. In a nutshell, we sold Wind River on the neat publicity they would get porting VxWorks onto this processor. They bit, and did it fairly inexpensively and they did it quickly. We had a working version of VxWorks within four months.
Not only were they receptive to the idea, but they put people on it right away. They were very generous with time, there was no nickel-and-diming on this contract. When there were problems...I had a team of seven people doing the flight software for this (the rover was entirely separate)...the technical interchange was engineer-to-engineer, there was no contract management in the middle. It was a nice interface. They did a good job.
DDJ: And when you had to hot-patch in flight?
GR: That's standard procedure. You always build in the ability to change it.
DDJ: Just in case.
GR: Just in case, but JPL and a bunch of decent-sized companies have had the problem where you can't get all the software done in time for launch. You always make sure you build the capability to change things.
DDJ: When you send something 300,000,000 miles there's always the chance something you hadn't anticipated will happen.
GR: Well, as we said, space is forgiving, but on the surface of Mars and other places, you're right. But what we'd really like to do is build spacecraft that are much smarter about how to take care of things they don't precisely expect yet still achieve the things they are intended to do without interaction from the ground.
DDJ: Real robots! Were you an Asimov fan when you were a kid?
GR: Still am! Both a kid and an Asimov fan.
DDJ: What did your title on the Pathfinder mission -- "Flight Software Cognizant Engineer" -- signify?
GR: I was ultimately responsible for development and operation of flight software on the spacecraft itself. My butt on the line.
DDJ: Software project management, the final frontier! How do you guys actually get the product out?
GR: Here's my opinion on why Pathfinder worked as well as it did: We were focused. We had a specific goal, to get this spacecraft launched, get it to Mars, get it through the atmosphere, deliver the rover onto the surface. We managed to focus everyone on the team on that objective.
Number two, JPL was in this Total Quality Management phase, so they talked a lot about empowerment, authority, and responsibility. Some of us, despite being more than a little skeptical about TQM itself, took that and ran with it. If we were empowered, that meant we really had the ability to make the decisions. The whole project worked that way, and the management of the project structured it. They really did trust the people working on it. You really did take responsibility for the people you put on the project.
We were the small project at JPL. Still ongoing was the back end of all the Cassini development, a 3000-person development. Pathfinder was 300 persons or fewer. So, the team was very tightly knit.
All these things contributed to the success from the management point of view. I'd love to say that JPL took that lesson and applied it to all the subsequent projects, however, I would say the exact opposite was true. A lot of people look at Pathfinder as successful, yet, "We don't want to do it that way again."
Recognize, however, that one of the reasons that's true is that, in the past, we at JPL had these huge projects and were able to build up a lot of infrastructure institution-wide within a project. But since we have a bunch of much tinier projects now, we have to build an infrastructure across the board, so there's a lot more interdependency between projects than there has been in the past.
So the Pathfinder approach of putting yourselves out, a sort of skunkworks type of environment, just doesn't work anymore at JPL.
DDJ: As space travel becomes routine, you don't need geniuses to do it.
GR: When that occurs!
DDJ: But that's your job, year after year, trying to make it routine.
GR: That's correct. And once we make it routine, we'd like to make sure that industry is doing the routine part and we go off to the stuff that's still on the frontier.
DDJ: Glenn, what are you doing now?
GR: I'm working on one of those infrastructure things. JPL is developing some avionics hardware that's radiation hardened to the point it can survive at Europa, which has a very severe radiation environment. We are pushing the technology in this arena, looking at the part level at one megarad type of problems.
This is a lot of contractual work going outside of JPL. We're trying to fly as much commercial heritage stuff as we possibly can. The flight computer we've currently selected is PowerPC 750 based at 200 MHz.
DDJ: You're putting a Macintosh into space!
GR: Isn't that convenient? The processor speed of the thing we flew on Pathfinder was about 22 MHz. We've gone up an order of magnitude. That's a phenomenal increase. JPL has, historically, flown processor technologies that are 10 to 12 years behind...
DDJ:...As you wait for the rad-hard version of the part.
GR: Yup. What I'm specifically working on is a project that is sort of mislabelled as "Mission Data System." It's really an institution-wide effort to do a couple things. JPL has had several autonomy-oriented things occurring. We're trying to come up with an architecture that builds a foundation where we can have much more autonomous spacecraft than we've had in the past.
There are a lot of things involved with that, not just software, but also process, normal, up-to-date software engineering practices. I think JPL is moving from a hardware-centric organization to a more software-centric organization. The computer systems that we're building now look a lot like little local area networks. Computer systems both fast and less fast connected by things like FireWire and I2C.
DDJ: I2C is still going strong!
GR: It's not fast, but it's low power. That's often a real determinant. For high volume, high speed, we have 1394 FireWire -- 100 MBit/sec.
DDJ: I guess the NASA environment is pretty "toy-rich" for the programmer?
GR: We're trying to make it toy-rich in the sense that we're pushing hard for the commercial, off-the-shelf standardized things.
On Pathfinder we said, "We're going to fly the VME bus, because look at all the cards we can buy instead of doing in-house design and building every piece of test equipment."
We're still following that path.
DDJ: We started discussing the challenge of your current assignment in the context of trying to preserve the best of the Pathfinder skunkworks methodology in the current horizontal arrangement of smaller teams and standardized parts. How are you addressing that challenge?
GR: JPL is in the throes of trying to address team organization and interdependency between teams effectively. It's a big problem for an organization to go from a very command-and-control-oriented structure to a very interdependency-oriented one.
DDJ: Is it left to JPL internally, or are you one eddy in the NASA river?
GR: At this point it's pretty much JPL internal. Our interdependency issues are certainly known to NASA.
DDJ: You personally are being slurped into management. Do you get to write code anymore these days?
GR: Not these days...I have a position called Mission Data System Chief Engineer. It sounds like a technical decision making position, but it's primarily a programmatic activity. I'm resisting as best I can!
DDJ: All the techies reading DDJ are cheering, "Don't give in to the Dark Side!"
GR: Someone has to put the plan together. That deficiency has dragged me to the dark side.
DDJ: Yes, really, the management is the final frontier. Development environments are shrink-wrap now. Figuring out how to divide work suitably between people, how to interleave so the product arrives on time...
GR: Those are second-order effects. The first-order effects are methodology. UML or not UML? Languages, classic things, the ability to do things on the type of computers we're now buying...Some of the things we had to worry about in the past we don't have to worry about any more. A 200-MHz PowerPC is about 200 times faster than the processors that run in Cassini. A lot of efficiency issues are starting to disappear.
DDJ: So you just write "good old code" like anybody writes?
GR: Well, wouldn't that be nice? Wouldn't it be nice if we could get close to the maturing edge of some software engineering? JPL, because we've been relegated to targets about 10 to 15 years back, we ended up having to use whatever's available, but we're no longer in that ballgame.
DDJ: What about Linux?
GR: We haven't evaluated it from a flight perspective. It has a fairly decent presence here on desktop computers. Every engineer has his opinion on the One True Way. Languages turn out to be a religious issue, too.
DDJ: What religions are sweeping JPL?
GR: Most of the work JPL has done has been primarily in procedural languages. We did a fair amount of Ada on Cassini, but from a true object-oriented perspective, we haven't been there yet, not on the flight side. Pathfinder and others are all C-based.
But moving towards an object-oriented direction takes you to an object-oriented language, C++ or Java...We have a whole bunch of artificial intelligence folks for whom LISP is a language of choice. Factor in a multitasking, preemptive real-time environment, even a multilanguage environment, and you have a whole bunch of religious conflicts.
DDJ: You're going to need memory to run that stuff.
GR: Look at the memory densities we're starting to see. On Pathfinder we had 128 megabytes of RAM. We moved into the realm where my desktop machine has more than the spacecraft. We're continuing in that direction. Let's put it this way, the software guys will be able to use all of that space. "If it's there, we can fill it up."
DDJ: When are we going to Europa?
GR: The hardware and the software come together in the two-year time frame for the two missions that will be supported. The first one is a mission called "Space Technology 4," or ST4. It's going off to a comet [to] take some samples [and] possibly eventually return. I'm not sure about the return part now, they've done some descoping lately.
The second mission is to Europa. They're launching possibly in 2003. There are other ongoing missions that will also probably use this technology, but those are the only two signed up now.
DDJ: What keeps people going to Congress for funding to shoot tin cans at Europa? Why does this happen?
GR: My personal opinion is that you have to recognize this is a very small planet. At some point in the future, man is going to move off into space, there's no doubt about it. People recognize there has to be some level of exploration that's ongoing to support that goal.
Perhaps the question is, "How can we spend billions in space while there are starving children in the world?" I think it's a balance, where you're balancing the short-term goals with the long-term goals.
I can't speak for the manned exploration side of NASA, but we on the robotic exploration side have witnessed great changes, going from billion-dollar spacecraft to $200 million spacecraft. NASA applies fairly small resources to greater and greater discovery and exploration missions.
DDJ: I grew up with Robert Heinlein's books about mining the asteroids. Is that in the foreseeable future?
GR: That's interesting, because this week or next week there's a guy coming from a company called SpaceDev Inc., to talk to us about commercial deep-space exploration. How there will be a shift from government-sponsored space exploration, whether Russian, Japanese, German, French, or American. I'm not sure he's going to address mining, but he's going to address that there are institutions both research and commercial that are willing to pay for images and spectroscopy in the hopes of discovering minerals, resources...
DDJ: The classic SciFi goals of space exploration.
GR: I can tell you that it's coming. Any time soon? Maybe in our lifetime, late...My kids will see it.
DDJ: Any advice for software engineers interested in space?
GR: My brother-in-law says, "Ya write the software but ya never actually know about the thing it's doing." I think that's wrong. I think you can still be a computer scientist and do phenomenally interesting applications of that art. Pathfinder is an example of how exciting that can really get.
Priority Inversion: How We Found It, How We Fixed It
By Glenn E. Reeves
The Mars Pathfinder spacecraft had a single RS6000-based single-board computer residing on a VME bus to control the spacecraft. The single VME chassis also contained interface cards for the radio, camera, and an interface to a 1553 bus. The 1553 bus, in turn, connected to the "cruise stage" and the "lander" part of the spacecraft. The hardware on the cruise part of the spacecraft controlled the thrusters, valves, sun sensor, and star scanner. The hardware on the Lander provided an interface to accelerometers, radar altimeter, and an instrument for meteorological science (ASI/MET). The hardware that was used to interface to the 1553 bus (at both ends) was inherited from the Cassini spacecraft.
To support the Mars Pathfinder Mission, Wind River Systems ported its standard VxWorks for the 680x0 to the RS6000. The RS6000 is the same single-chip CPU that can be found in some (now older) IBM AIX workstations. The Mars Pathfinder flight software also had several debug features that were used in the lab, but not on the actual flight spacecraft because they produced more information than we could send back to Earth. These features were not enabled, but remained in the software by design.
One of these tools was a trace/log facility that was originally developed to find a bug in an early version of the Wind River Systems VxWorks port. David Cummings, one of the JPL software engineers, built the trace/log facility. Lisa Stanley (at Wind River Systems) took this facility and instrumented the pipe services, msgQ services, interrupt handling, select services, and the tExec task. The facility initialized at startup and continued to collect data (in ring buffers) until told to stop. The facility produced a voluminous dump of information when asked.
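As a rough illustration of the idea, and not the JPL or Wind River implementation, a trace facility of this kind can be modeled as a fixed-size ring buffer that keeps only the most recent events, records until told to stop, and dumps on request. The class name, method names, and event layout below are hypothetical.

```python
from collections import deque


class TraceLog:
    """Minimal ring-buffer trace: keeps the newest `capacity` events."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently discards the oldest entry when full,
        # which is the ring-buffer behavior we want.
        self._events = deque(maxlen=capacity)
        self._enabled = True   # collecting from startup
        self._seq = 0          # running event counter

    def record(self, source, message):
        """Log one event from an instrumented service (e.g. "pipe")."""
        if not self._enabled:
            return
        self._seq += 1
        self._events.append((self._seq, source, message))

    def stop(self):
        """Freeze the buffer, e.g. when a failing task detects its error."""
        self._enabled = False

    def dump(self):
        """Return the surviving events, oldest first."""
        return list(self._events)


if __name__ == "__main__":
    log = TraceLog(capacity=3)
    for i in range(5):
        log.record("pipe", f"write {i}")
    log.stop()
    log.record("select", "arrives after stop, so it is not recorded")
    for event in log.dump():
        print(event)
```

Stopping the collection at the moment of failure is what makes the approach useful: the buffer then holds the events leading up to the error rather than whatever happened afterward.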
When the "repeated reset problem" (for details, see RISKS Digest, Volume 19 Issue 54, January 10, 1998; http://catless.ncl.ac.uk/Risks/19.54.html) occurred on Mars, we ran the same set of spacecraft activities over and over again in the lab. Since the flight software had the trace/log facility enabled and the failing task was already coded so as to stop the trace/log collection and dump the data (even though we knew we could not get the dump in flight) for this error, we went into the lab to test whether we would have to change the software.
In less than 18 hours, we were able to repeat the problem, isolate it to an interaction of the pipe() and select() mechanisms, diagnose it as a priority inversion problem, and identify the most likely fix. In fact, the fix seemed straightforward: We had to change the creation flags for the semaphore used within the select() facility so as to enable priority inheritance. This change was possible because Wind River Systems supplied global variables for parameters, such as the "options" parameter for the semMCreate used by the select service (although this was not documented and those who do not have VxWorks source code or have not studied the source code might be unaware of this feature). Still, the fix was not that straightforward for several reasons:
1. The code for this was in the selectLib() and was common for all device creations. When this global variable is changed, all of the select semaphores created after that point will be created with the new options. There was no easy way in our initialization logic to only modify the semaphore associated with the problem.
2. If we made this change and applied it on a global basis, we didn't know how it would affect the behavior of the rest of the system.
3. Because Wind River Systems deliberately left the priority inheritance option out of the default selectLib() service for optimum performance, we didn't know if performance would degrade if we turned priority inheritance on.
4. Finally, we didn't know if there was some intrinsic behavior of the select mechanism itself that would change if priority inheritance was enabled.
In the end, we modified the global variable to include the priority inheritance option. This corrected the problem. We asked the Wind River Systems engineers to analyze the potential impacts for (3) and (4) above. They concluded that the performance impact would be minimal and that the behavior of select() would not change so long as there was always only one task waiting for any particular file descriptor (this was true in our system). I believe that the debate at Wind River Systems still continues over whether the priority inheritance option should be on as the default. As for the aforementioned (1) and (2), the change did alter the characteristics of all of the select semaphores. We concluded, both by analysis and test, that there was no adverse behavior. We tested the system extensively before we changed the software on the spacecraft.
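Concern (1) above, that a library-wide default applies to every semaphore created after it is changed, can be shown with a toy model. This is a sketch under stated assumptions, not VxWorks code: the option names are borrowed from the VxWorks semaphore library, but the numeric values are only illustrative, and `SelectLibModel`, `sem_options`, and the device names are hypothetical stand-ins for the library and its undocumented global.

```python
SEM_Q_PRIORITY = 0x01      # queue waiting tasks by priority (illustrative value)
SEM_INVERSION_SAFE = 0x08  # enable priority inheritance (illustrative value)


class Mutex:
    def __init__(self, name, options):
        self.name = name
        self.options = options  # fixed at creation time, like a create-call argument

    @property
    def inversion_safe(self):
        return bool(self.options & SEM_INVERSION_SAFE)


class SelectLibModel:
    """Toy stand-in for a library that creates one mutex per device."""

    def __init__(self):
        # Hypothetical attribute playing the role of the library-wide global.
        self.sem_options = SEM_Q_PRIORITY

    def create_device(self, name):
        # Every device gets whatever the default happens to be right now.
        return Mutex(name, self.sem_options)


if __name__ == "__main__":
    lib = SelectLibModel()
    early = lib.create_device("early_device")

    # The fix: flip the default during initialization.
    lib.sem_options |= SEM_INVERSION_SAFE

    problem = lib.create_device("problem_device")
    bystander = lib.create_device("unrelated_device")
    for m in (early, problem, bystander):
        print(m.name, m.inversion_safe)
```

In this model the device created before the change keeps its original options, while both devices created afterward get priority inheritance, including the one that never had a problem. That all-or-nothing effect is why the change had to be analyzed and tested system-wide.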
We weren't able to catch the problem before launch because the problem would only manifest itself when ASI/MET data was being collected and intermediate tasks were heavily loaded. Our before-launch testing was limited to the "best case" high data rates and science activities. The fact that data rates from the surface were higher than anticipated and the amount of science activities proportionally greater served to aggravate the problem. We did not expect nor test the "better than we could have ever imagined" case.
We did see the problem before landing, but could not get it to repeat when we tried to track it down. It was not forgotten nor was it deemed unimportant. Yes, we were concentrating heavily on the entry and landing software. Yes, we considered this problem lower priority. Yes, we would have liked to have everything perfect before landing. However, I didn't see any problem, other than that we ran out of time to get the lower priority issues resolved.
We did have one other thing on our side -- we knew how robust our system was because that is the way we designed it. We knew that if this problem occurred, we would reset. We built in mechanisms to recover the current activity so that there would be no interruptions in the science data (although this wasn't used until later in the landed mission). We built in the ability (and tested it) to go through multiple resets while we were going through the Martian atmosphere. We designed the software to recover from radiation induced errors in the memory or the processor. The spacecraft would have even done a 60-day mission on its own, including deploying the rover, if the radio receiver had broken when we landed. There were a large number of safeguards in the system to ensure robust, continued operation in the event of a failure of this type. These safeguards allowed us to designate problems of this nature as lower priority. We had our priorities right.
Did we (the JPL team) make an error in assuming how the select/pipe mechanism would work? Probably. But there was no conscious decision to not have priority inheritance enabled. We just missed it. There were several other places in the flight software where similar protection was required for critical data structures and the semaphores did have priority inversion protection. A good lesson when you fly commercial off-the-shelf stuff -- make sure you know how it works.