Working at the Rocket Factory for 30 Years

When I started at Rocketdyne I was told that the only way to “get ahead,” i.e. get promotions or salary increases, was to move around among different companies for the first few years. I never did that. I stayed with the company for just over 30 years, though over that time I worked in several different roles, and the facility had four different corporate owners. (The Rocketdyne division of Rockwell had at least five sites, three of them in California: the main one in Canoga Park, at the corner of Victory and Canoga, where rocket engines were manufactured, and where senior management and some departments had office space; a smaller site a few miles north in Chatsworth, at the corner of De Soto and Nordhoff, with office space and, later, considerable manufacturing space; and third, a test facility where rocket engines were actually fired, in the hills at the west end of the San Fernando Valley, called the Santa Susana Field Laboratory. Two other sites were in the US South: another test facility at Stennis Space Center in Mississippi, near Slidell, Louisiana; and finally, a software testing facility on the Redstone Arsenal outside Huntsville, Alabama.)

When I started, Rocketdyne was part of Rockwell International, itself derived from an earlier company called North American Rockwell. In the mid-1990s Rockwell sold the division (all five sites) to Boeing, which someone thought made sense at the time; Boeing had little prior involvement in building rocket engines, though by the ‘90s Rocketdyne did considerably more than rocket work. For one, the site designed and built the electrical power system for the International Space Station (ISS). Boeing kept Rocketdyne for about a decade, then sold it to Pratt & Whitney, known for making jet engines for aircraft, itself a division of United Technologies. In the year after I was laid off, at the end of 2012, UTC sold Rocketdyne to Aerojet, a much smaller company then (still?) based in Sacramento.

When I started at Rocketdyne in June 1982 it was the very end of the keypunch age but not quite the age of the desktop computer terminal. My first desk (pic) had no keyboard of any type. We would fill in documents or coding sheets by hand, and hand them over to a typist or keypunch operator for computer input. The first computer terminal system we had was, IIRC, one designed strictly for word processing, called Wang (https://en.wikipedia.org/wiki/Wang_Laboratories#Word_processors) – I don’t remember which version, but using it entailed walking over to a dedicated workstation and logging in. Obviously you didn’t use it very often, because others needed their turns.

Over the decades, Wang gave way to a shared VAX minicomputer with terminals on everyone’s desks, and eventually to Windows PCs on everyone’s desks, connected to a network for file sharing, e-mail exchange, and eventually the internet. I’m likely forgetting several intermediate steps. By the mid-1990s, if not before, the Microsoft Office suite was the standard toolset on everyone’s PC, including Word, Excel, PowerPoint, Access, and Outlook.

For about the first third of my career, I supported specific projects, first for the Space Shuttle, then for ISS. For the latter two thirds, I moved into process management and process improvement. Both activities were fascinating, in different ways, and each is worth summarizing in terms of its basic principles.

Software Engineering

Software engineering is, in a sense, the bureaucratic overhead of computer programming, without the negative connotation of bureaucracy. It is the engineering discipline that includes the management structure, the coordination of individuals and teams, the development phases, and the controls necessary to get the customer’s desired computer program onto the target platform and to be sure it works correctly, as the customer intended.

At the core are several development phases. First of these is system requirements. These are statements by the customer (in these cases NASA) about what they want the software to do. These statements are general, in terms of the entire “system” (e.g., the SSMEs, the Space Shuttle Main Engines), and not in software terms. An example might be: the software will monitor the temperature sensors and invoke engine shutdown should three of the four sensors fail.

The next phase is software requirements. This is where software engineers translate the system requirements into a set of very specific requirements about what the software should do. These statements are numbered and typically use the word “shall” to indicate a testable requirement. Examples might be: The software shall, in each major cycle, compare each temperature sensor reading to a set of qualification limits. If a sensor reading exceeds these limits for three major cycles, the sensor shall be disqualified. If three sensors become disqualified, the software shall invoke Emergency Shutdown.
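To make the flavor of such “shall” statements concrete, here is a minimal sketch in Python (not the flight software’s language) of the kind of logic they describe. The names, the limits, and the assumption that out-of-limit readings must be consecutive are all invented for illustration.

MAX_QUAL_LIMIT = 2500.0        # hypothetical upper qualification limit
DISQUAL_CYCLES = 3             # out-of-limit major cycles before a sensor is disqualified
SENSORS_FOR_SHUTDOWN = 3       # disqualified sensors that trigger emergency shutdown

def process_temp_sensors(readings, out_of_limit_counts, disqualified):
    """Run once per major cycle; returns True if emergency shutdown should be invoked."""
    for i, reading in enumerate(readings):
        if i in disqualified:
            continue
        if reading > MAX_QUAL_LIMIT:
            out_of_limit_counts[i] += 1
        else:
            out_of_limit_counts[i] = 0     # assumes violations must be consecutive
        if out_of_limit_counts[i] >= DISQUAL_CYCLES:
            disqualified.add(i)
    return len(disqualified) >= SENSORS_FOR_SHUTDOWN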

These requirements entail identifying the platform where the software will run; the size of the memory; the specific inputs (sensor data, external commands) and outputs (commands to the hardware, warnings to the astronauts); and so on.

The next phase is design. Design is essentially everything that has to happen, given all the inputs, to produce the required outputs. The traditional method for documenting design was flowcharts (https://en.wikipedia.org/wiki/Flowchart), with various shapes of boxes to indicate steps, decisions, inputs, outputs, and so on.

Next was code. When I began we were still writing in assembly language! That was the language of the particular computer we were writing for, and it consisted of various three-letter abbreviations for each instruction, some of which moved the flow of execution to some position above or below the current one. Within a couple of years after I started, the SSME software transitioned to “Block II,” where the software was rewritten in a higher-level language, C, which was much easier to maintain.

The final phase was test. The code was run in a lab where the target platform was simulated inside a hardware framework that faked commands and sensor inputs. Each set of fake inputs was a test case, and each test case was designed to test and verify a particular item back in the software requirements.

The key to all this was traceability. The software requirements were numbered; the design and then the code documented, at each step, which software requirement(s) they implemented. The test phase was conducted without knowledge of the design and code; the testers looked only at the requirements, and created test cases to verify every single one.
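To illustrate how that traceability worked on the test side, here is a sketch of a test case written from a numbered requirement alone and tagged with it. The requirement ID is hypothetical, and the sketch reuses the process_temp_sensors function from the earlier example.

def test_srs_047_three_disqualified_sensors_invoke_shutdown():
    """Traces to SRS-047 (hypothetical ID): if three sensors become
    disqualified, the software shall invoke Emergency Shutdown."""
    readings = [9999.0, 9999.0, 9999.0, 500.0]   # three sensors stuck above the limit
    counts = [0, 0, 0, 0]
    disqualified = set()
    shutdown = False
    for _ in range(3):                           # three major cycles
        shutdown = process_temp_sensors(readings, counts, disqualified)
    assert disqualified == {0, 1, 2}
    assert shutdown is True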

This was the core sequence of developing software. There were two other attendant aspects.

One was quality assurance, QA; the other, configuration management, CM. QA people are charged with monitoring all the development phases and assuring that the steps for each are followed and completed; they’re monitoring the process, essentially, without needing to know much about the product being developed. CM folks keep track of all the versions of the outputs of each development phase, to make sure that consistency and correctness are maintained. You might not think this is a significant task, but it is. As development continues, there are new versions of requirements, of design and code, of test procedures, all the time, and all these versions need to be tracked and coordinated, especially when being released to the customer!

An attendant task of CM is that, after a release of the software to the customer, there are inevitably changes to be made. Change requests can come in from anyone—the customer especially, for requirements changes, but also any software developer who spots an error or simply has an improvement to suggest (a clarification in the requirements; a simpler implementation in code). And so there is an infrastructure of databases and CM folk to keep track of change requests, compile them for periodic reviews, record decisions on whether or not to implement each, and track them to conclusion and verification.
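As a sketch of the kind of record such a change-request database keeps, here is a minimal Python version; the field names and the status workflow are invented for illustration, not the actual system we used.

from dataclasses import dataclass, field
from enum import Enum

class CRStatus(Enum):
    SUBMITTED = "submitted"
    APPROVED = "approved"        # review board decided to implement it
    REJECTED = "rejected"
    IMPLEMENTED = "implemented"
    VERIFIED = "verified"        # fix confirmed in test; CR closed

@dataclass
class ChangeRequest:
    cr_id: str                   # e.g. "CR-1234"
    originator: str              # customer, developer, tester...
    description: str
    affected_items: list = field(default_factory=list)   # requirement/design/code versions touched
    status: CRStatus = CRStatus.SUBMITTED
    board_notes: str = ""        # decision recorded at the periodic review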

A supporting process to all these phases of software development was peer review, which became something of my specialty (I maintained the training course materials for the subject, and taught it many times, both onsite and at other sites). While these days “peer review” is tossed around as an issue of the credibility of scientific papers, the process had a very specific definition and implementation for software development. The context is that there’s a team of software engineers all working parallel changes to the same master product. When I started, working Block I, within a team of 10 or 12, a particular team member would work all phases of a particular change: changes to requirements, to design, to code, to test plans. Later, Block II was large enough to allow specific engineers to specialize in one of those phases. In either case, a change would be drafted as markups to the existing requirements documents and so on, and these markups were distributed to several other team members for review.

After a couple of days, a formal meeting would be held, at which each reviewer was expected to bring their comments, including errors found or suggestions for improvement. The meeting was conducted by someone other than the author of the changes. A member of the quality team attended, but management was specifically not invited—the intent was to get honest feedback without fear of reprisal. The meeting was not a presentation of the material; the reviewers were expected to have become familiar with it in advance. And so the meeting consisted of paging through the changed documents. Anyone have comments on page 1? No? Page 4? (If no changes were made on pages 2 and 3.) OK, what is it, let’s discuss. And so on. The participants would arrive at a consensus about whether each issue needed to be addressed by the change author. The number of such issues was recorded, the change author was sent off to address them, and the meeting coordinator, along with QA, followed up to assure all the issues were addressed.

There was a crucial distinction between what were called errors and defects. An “external defect” was the worst news imaginable – a flaw found by the customer in a delivered product (even an intermediate product). Such problems were tracked at the highest levels by NASA review boards. The whole point of peer reviews was to identify flaws as early as possible in the development process. Within the context of a peer review, a problem introduced by the change author that could be fixed before the change products were forwarded to the next phase of development was an “error.” And a problem found in, say, a code review that was actually due to a flaw in the design or requirements was also a defect, an internal one.

Counts of errors and defects, per peer review and per product, were ruthlessly documented and analyzed, at least in later years when process management and improvement took hold (more about that below). It was all about finding problems as early as possible in the development cycle, to avoid later rework and expense.
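As a taste of that analysis, here is a minimal sketch of computing how many problems were caught in the phase that introduced them. The categories follow the error/defect distinction above, but the numbers and product names are invented.

reviews = [
    {"product": "sensor_module_design", "errors": 7,  "internal_defects": 2, "external_defects": 0},
    {"product": "sensor_module_code",   "errors": 12, "internal_defects": 1, "external_defects": 0},
    {"product": "fid_requirements",     "errors": 4,  "internal_defects": 3, "external_defects": 1},
]

total = sum(r["errors"] + r["internal_defects"] + r["external_defects"] for r in reviews)
caught_in_phase = sum(r["errors"] for r in reviews)

# The higher this fraction, the less expensive downstream rework there is.
print(f"Caught in originating phase: {caught_in_phase} of {total} "
      f"({100 * caught_in_phase / total:.0f}%)")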

This all may seem incredibly complex and perhaps overly bureaucratic – but modern computer systems are complex, from the basic software in the Saturn V and the Space Shuttle to the iPhones of decades later, whose functionality is likely a million times the Shuttle’s, and all of them depend on similar or analogous practices for developing software.

Aside: Coding

Every phase of the software development process can be done haphazardly: poorly written requirements, design flowcharts with arrows going every which way and crossing over one another, spaghetti code with equivalent jumps from one statement to another, up or down the sequence of statements. Or it can be done elegantly and precisely, with clean, exact wording for requirements (much as CMMI has continually refined; below), structured flowcharts, and structured, well-documented code. (With code always commented – i.e., lines of textual explanation inserted in between the code statements, delimited by special symbols so the compiler would not try to execute them, explaining what each group of code statements is intended to do. This helps the next person who comes along to revise the code, perhaps years later; even if that person is you.)

But code, more than the other phases, has an utter certainty to its execution; it is deterministic. It’s digital, unlike the analog processes of virtually every other aspect of life, where problems can be attributed to the messiness of perception and analog sensory readings. So if there’s a problem, if running the code doesn’t produce the correct results, or if running it hangs in mid-execution, you can *always* trace the execution of one statement after the next all the way through the program, find the problem, and fix it. Always. (Except when you can’t; see below.)

I keep this in mind especially since, outside industry work, I’ve done programming on my own, for my website and database projects, since the mid-1990s: at first writing Microsoft Word “macros” (to generate an Awards Index in page-perfect format for book publication… which never happened) and then moving on to writing Microsoft Access “macros,” to take sets of data from tables or queries and build web pages, for my online Awards Indexes (which did happen). (Also, to compile the annual Locus poll and survey, and similar side tasks.)

With highly refined code used over and over for years (as in my databases), when running a step hangs in mid-execution, it is always a problem with the data. The code expects a certain set of possible values; some field of data wasn’t set correctly, didn’t match the set of expected values; you find it and fix the data. But again, you always find the problem and fix it.
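A minimal sketch of that failure mode, with invented field names: the code assumes a field can take only certain values, and one record violates the assumption. Checking the data up front turns a mysterious downstream failure into a message that points straight at the record to fix.

EXPECTED_CATEGORIES = {"novel", "novella", "novelette", "short_story"}

records = [
    {"title": "Example Title A", "category": "novel"},
    {"title": "Example Title B", "category": "novelete"},   # typo: not in the expected set
]

def build_entry(record):
    # Formatting logic that would misbehave (or hang a larger run) on bad data.
    return f"{record['title']} ({record['category']})"

for rec in records:
    if rec["category"] not in EXPECTED_CATEGORIES:
        raise ValueError(f"Unexpected category {rec['category']!r} in {rec['title']!r}")
    print(build_entry(rec))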

There’s a proviso, and an exception, to this thesis.

The proviso is that it can be very difficult to trace a problem when running a piece of code hangs. Sophisticated compilers give error warnings, and will bring up and highlight the line of code where the program stopped. But these error warnings are rarely helpful, and are often misleading, even in the best software. The problem usually turns out to be one of data, or of a step upstream that ran without complaint but produced incorrect results. And so you have to trace the path of execution and follow every piece of data used along the way. This can be difficult, and yet – it always gets figured out.

Interruptions

The exception that I know of to this perfect determinism, likely one of a class of exceptions, is when the software is running in a live environment and is subject to interruptions based on new inputs (sensor readings, commands), which can interrupt the regular periodic running of the software at any instant. My experience of this is from the Space Shuttle engine controller software (which cycled at something like 60 times per second), and it was a key issue in the FMEA analysis following the first shuttle disaster. The software was built to respond to various kinds of “interrupts” – again, command inputs, but also sensor warnings from the engine – that would transfer the regular execution of the software to a special response module. Whatever internal software state existed up to that point might be erased, or left in place but no longer valid. This was an unpredictable situation that I don’t think ever was, or perhaps ever could be, resolved. There is always an element of indeterminacy in real-time software.
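Here is a toy illustration, using Python threads (nothing like the real controller), of why asynchronous interrupts break the “just trace the execution” guarantee: the handler can fire between any two statements of the main loop, so the loop can observe half-updated state, and how often that happens varies from run to run.

import threading
import time

sensor_a = 100
sensor_b = 100          # invariant the main loop assumes: a == b

def interrupt_handler():
    """Stands in for an asynchronous interrupt that updates shared state."""
    global sensor_a, sensor_b
    while True:
        sensor_a += 1
        # If the main loop reads between these two updates, it sees a != b.
        sensor_b += 1
        time.sleep(0.0001)

threading.Thread(target=interrupt_handler, daemon=True).start()

mismatches = 0
for _ in range(200_000):          # the "periodic" main loop
    a, b = sensor_a, sensor_b     # two reads, interruptible in between
    if a != b:
        mismatches += 1
print(f"Inconsistent snapshots observed: {mismatches}")   # differs on every run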

Data vs. Algorithm

I have one other comment to make about coding in particular (not so much about requirements or design). This was especially important back in the SSME Block I days when memory space was so limited, but it also still informs my current database development. Which is: the code implementation is a tradeoff, and an interplay, between data and logic. When there is fixed data to draw upon, the way the data is structured (in arrays or tables, say) greatly affects the code that processes it. You can save lots of code steps if you structure your sets of data appropriately at the start. Similarly, when I rebuilt the sensor processing module, writing a large section of the code from scratch to replace earlier versions that had been “patched” (explained in the next section), the savings in memory came partly from avoiding the overhead of patched software, and partly from rebuilding the data tables (of, for example, minimum and maximum qualification limits for sensors) in ways that made the code more efficient.
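A sketch of that tradeoff, with invented sensor names and limits: the same checks written as one branch per sensor versus driven by a table, where adding a sensor means adding a row of data rather than more code.

# Logic-heavy version: every sensor gets its own code.
def check_sensors_branching(readings):
    failures = []
    if readings["fuel_temp"] > 520.0:
        failures.append("fuel_temp")
    if readings["ox_temp"] > 560.0:
        failures.append("ox_temp")
    if readings["chamber_press"] > 3100.0:
        failures.append("chamber_press")
    return failures

# Data-driven version: one loop over a table of limits.
QUAL_LIMITS = {
    "fuel_temp": 520.0,
    "ox_temp": 560.0,
    "chamber_press": 3100.0,
}

def check_sensors_table(readings):
    return [name for name, limit in QUAL_LIMITS.items() if readings[name] > limit]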

Patching

This will seem extremely primitive by modern standards, but that’s how it was done in the ‘80s. I’ll invent a simple example. Suppose you’re asked to modify the code for a simple comparison of a sensor reading with its qualification limits. The original code ran like this (not real code, a mock-code example):

If current_sensor_reading > max_qual_limit then
    Increment disqual_count by 1
    If disqual_count > 2 then
        Set sensor_disqualification tag
    Endif
Endif

Now suppose a new requirement came along to, in addition to incrementing the disqual count by 1, also set an astronaut_warning_flag. The point here is that, in the earliest, Block I software, these instructions were coded in assembly language, with every code step loaded into a specific location of memory. The code was not “compiled” in the later sense every time it was run or modified, because the formal qualification of the code applied only to the original instructions in those particular locations of memory. Thus, to make this change, you would disturb as few stable pieces of code as possible: overwrite one existing instruction, add the new steps in some previously unused section of memory, and use “jumps” to implement the new sequence of steps, like this:

If current_sensor_reading > max_qual_limit then
LABEL1: Jump to INSERT1            (overwrites the original “Increment disqual_count by 1”)
    If disqual_count > 2 then
        Set sensor_disqualification tag
    Endif
Endif

(down in previously empty memory:)

INSERT1: Increment disqual_count by 1
         Set astronaut_warning_flag to yes
         Jump to LABEL1 + 1

So to add, in effect, one line of code, you had to spend two lines of code to jump out of and back into the existing execution flow. You can see how repeated patching of different areas of the software made the aggregate less and less efficient, in terms of memory locations used.

Object Oriented

One more principle that we gradually adopted for SSME, and which I later employed in my database designs, was the idea of object-oriented design. This was a generalization of the idea of subroutines, or functions. Super-simple example:

Input next name from input list
Perform steps to capitalize every letter
Input next name from input list
Perform steps to capitalize every letter
(and so on)

Actually this can be optimized thusly:

Do until end-of-list:
    Input next name from input list
    Perform steps to capitalize every letter
    Move to next position in input list
Loop

But suppose you need to do the capitalization from many different places in a large program? Instead of repeating the several steps to capitalize every letter, you isolate those steps in a separate subroutine, or function, that can be invoked from anywhere, not just the one Do-loop:

 

Do until end-of-list:
    Input next name from input list
    Call Subroutine Cap_all()
    Move to next position in input list
Loop

 

Once Cap_all() is written, it can be used from anywhere else in the entire program.
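The same idea in Python: the capitalization steps isolated in one function that any part of the program can call.

def cap_all(name):
    """The steps to capitalize every letter, written once."""
    return name.upper()

names = ["asimov", "le guin", "clarke"]
capitalized = [cap_all(n) for n in names]   # the Do-loop, calling the subroutine
print(capitalized)                          # ['ASIMOV', 'LE GUIN', 'CLARKE']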

And the extension of this, object-oriented programming, is to divide the entire program into separate, self-sufficient modules that call each other as needed, and make every one of them independent, with its own inputs and outputs that don’t depend on the sequence of execution of any other modules. In my database development for my online awards sites (there was an earlier one on the locusmag.com site before I created sfadb.com), I took the database I’d developed for the earlier site, which had many repeated sections of almost-identical code (e.g. to format a title or a byline, from the base set of data, for different output pages), and rewrote it from scratch for sfadb.com, using object-oriented techniques, so that titles and bylines were formatted in one module called “assemble” before a later module was executed to “build” the various webpages.
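A much-simplified sketch of that separation, in Python rather than Access, with invented field names: all formatting lives in one place (“assemble”) and page generation in another (“build”), so neither depends on how the other works internally.

class Assembler:
    """Formats titles and bylines from the base data, one way, for every page."""
    def title(self, record):
        return f"<i>{record['title']}</i> ({record['year']})"
    def byline(self, record):
        return f"by {record['author']}"

class PageBuilder:
    """Builds output pages; relies on an Assembler for all formatting."""
    def __init__(self, assembler):
        self.assembler = assembler
    def build_page(self, records):
        lines = [f"{self.assembler.title(r)} {self.assembler.byline(r)}" for r in records]
        return "\n".join(lines)

records = [{"title": "Example Novel", "author": "A. Author", "year": 1990}]
print(PageBuilder(Assembler()).build_page(records))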

These software examples are extremely basic, and even then I am probably oversimplifying them. But perhaps they provide a taste of the kind of conceptual thinking that goes into software engineering. Rigorous, logical, remorseless. And once it all works – this is the kind of engineering that has built our modern world.

My Experience

When I started at Rocketdyne, the first three Space Shuttle missions had already flown. So the software for the SSME “controllers” (pic) had already been written (initially by Honeywell, in Florida) and installed. The software having been turned over to Rocketdyne for maintenance, it was my group’s job to process changes and updates. I did well, and became an advocate of cleaning up code, and documentation, that had suffered too many haphazard updates and had thus become inefficient.

It’s critical to remember that in this era, we were writing code for a *very small*, by modern standards, computer—it had something like 16K words of memory [check!!!]. So it was extremely important to code efficiently. But the accumulated result of individual changes and updates to that code had used up much of the available margin. So the biggest project I did, in my earliest years at Rocketdyne, was to redesign and recode the entire module for sensor processing – some 25% of the total code – making the result more efficient and saving 10 or 20% of memory space.

[aside with photos about details of that project]

Yet I learned some lessons in those early years—mainly, that even intelligent engineers can become accustomed to tradition and resistant to change. One case was a proposal, by our counterparts in Huntsville, to transition to “structured” flowcharts, rather than flowcharts that merely captured the “spaghetti” code being written. (The advantage of structured flowcharts, aside from being more understandable, is that they correspond to the kinds of logical proofs that a program accomplishes what it was designed to do.) There was resistance among the older staff, to my consternation; still, the reform was implemented.

A second case was when I tried to reformat a chart in the requirements document, a chart of FID (Failure Identification) codes and responses. It had been amended and revised over the years, and had become messy. I drafted a revised format and sent it out, and got pushback from the senior system engineer, simply because he was used to the current chart and didn’t want to deal with a change, even if it was an improvement…

Potted History

The lead-up history to the early 1980s, when I began working for Rocketdyne supporting the Space Shuttle, might be prehistory to those of you, if anyone, reading this account. As concisely as possible: rockets were conceived centuries ago (initially by the Chinese, I believe) but not developed into serious long-range weapons until the 1940s, when Germany used V-2 rockets to bomb London; these rockets traveled in arcs of a few hundred miles. After World War II, the Soviet Union and the US competed to build rockets that could achieve orbit. Throughout this period, futurists (like Willy Ley) and science fiction authors (like Arthur C. Clarke) imagined the use of rockets to place satellites in orbit, or to send men to the moon or other planets. (It was a commonplace assumption in science fiction, from the 1940s and beyond, that human exploration of the planets and even the galaxy was inevitable—a sort of projection into the far future of the Manifest Destiny that informed American history.) The Soviets won the first round, launching Sputnik in 1957, the first man-made satellite to orbit the earth. The following decade was a competition between the two countries to send men into space. The US launched Mercury flights (one man per capsule), Gemini flights (two men), and finally Apollo flights, with three men each, designed ultimately to reach the moon. After several preliminary flights, Apollo 11 landed on the moon in July 1969 (my family and I watched the live feed from the spacecraft on grainy black and white TV). Several more Apollo missions landed at other spots on the moon.

So the US won the competition with the USSR – the Soviets seem to have given up around the mid-1960s, though of course they didn’t admit it. What next? Well, there was the first US space station, Skylab (https://en.wikipedia.org/wiki/Skylab), occupied for about a year beginning in 1973. Then, greatly collapsing the following decades, came two big US projects: the Space Shuttle, intended as a re-usable method of getting into orbit, which first launched in 1981, and the International Space Station, whose first modules launched in 1998, and which is still going.

All the components of the Mercury, Gemini, and Apollo missions were used once and then lost (burned up in the atmosphere, sunk into the sea, or sent to museums). The Space Shuttle was an odd hybrid, the result of numerous compromises, but it entailed re-usable components: a central plane-like orbiter with three reusable rocket engines at its base, and two solid-fuel boosters that lifted the ensemble for the first couple of minutes of ascent, then fell away and landed in the sea.

My job at Rocketdyne was maintaining the software that monitored and controlled the SSMEs, the Space Shuttle Main Engines. The engines were reused as often as possible; I’m thinking there were two or three dozen engines, each used multiple times, installed across the 135 missions on the five orbiters (Columbia, Challenger, Discovery, Atlantis, Endeavour).

The last shuttle flight occurred in 2011, long after I’d left the program.

SSME Highlights

  • Business trips: In that first decade at Rocketdyne, supporting Space Shuttle, I had occasion to go on business trips only a handful of times, and not to anyplace glamorous. For purposes of general orientation, newer employees got trips to Huntsville, to see the testing labs where the SSME software went through verification and validation, and to Stennis Space Center in Mississippi, to see the huge test stands where actual SSMEs were mounted and fired. The latter trip did involve flying in and out of New Orleans, so I guess I did get to someplace glamorous, if only for a couple hours on the last afternoon before heading to the airport. (On that trip I was with two new female employees, who were attracted to every tourist shop in the French Quarter.)
  • Shuttle landings: The shuttles almost always landed in the Mojave Desert, on a dry lake bed at Edwards Air Force Base, beginning with the very first one, the Enterprise, a prototype that was lifted into the air on a 747 but never launched into space. Two or three times some friends and I would make the trip the afternoon before a landing, and camp out on the dry lake bed along with thousands of others. The actual landings occurred fairly early in the morning, and happened pretty quickly – from first sighting of the shuttle as a tiny dot way up in the sky, to the touchdown a couple of miles across the lake bed from where observers were kept, took about five minutes – and were utterly silent. You saw the plume of dust when the wheels hit, and the roll-out of the orbiter as it coasted for a minute or so and then came to a stop. And then it took hours for everyone to get in their cars and creep out along the two-lane road back to the interstate.
  • Shuttle launches: I never saw a shuttle launch; the opportunity never arose. The launches were at Kennedy Space Center in Florida, across the country from where I lived and worked. Rocketdyne did have a program to send a couple of people to each launch, based on some kind of lottery or for meritorious service, but I never applied or was chosen. The practical difficulty of attending launches was that scheduled launches were often delayed due to weather, sometimes for days, so you couldn’t plan a single trip of a couple of nights; you’d have to extend your stay, or give up and come home. …However, I did snag a trip to KSC, on my own time. In 1992, the annual World Science Fiction Convention was in Orlando. A coworker at Rocketdyne in Canoga Park had moved back east and gotten a job at KSC, so I contacted him to see if he wanted to meet. He got me a pass and took me on a private tour. We stepped briefly into the famous launch control firing room, and then went up onto the actual launch pad, where an actual space shuttle sat ready to launch. (This was 2 Sep 1992, so it was Endeavour, STS-47, on the pad.) We took an elevator up to the level of the base of the shuttle, with the three main engines to one side and the tail of the shuttle directly above us. (You can see where we would have stood in the opening shot of this video, https://www.youtube.com/watch?v=GREwspcOspM) We could have reached up and touched the tail. I was told not to. I didn’t. And then we took the elevator up further, to the level of the beanie cap at the very top, then back down to the astronaut level where the escape baskets awaited. And then a walkthrough of the enormous VAB, the Vehicle Assembly Building.
  • Vandenberg and Slick Six. I had a chance to visit a new launch site sometime in the mid-1980s at Vandenberg Air Force Base, on the coast north of Santa Barbara. This was Space Launch Complex 6, SLC-6 (https://en.wikipedia.org/wiki/Vandenberg_Space_Launch_Complex_6), pronounced “slick six.” It was intended to be a second site for launching space shuttles, in addition to Kennedy, but for various reasons was never used for that. I recall this visit especially because I had my camera with me and took a bunch of photos. I was with a big group of Rocketdyne employees who made a long day trip from Canoga Park, traveling in a charter bus a couple of hours to the site, getting a tour and walking around, then taking the bus back to work.
  • Challenger. The first space shuttle disaster (https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster) happened in January 1986; I’d been working at Rocketdyne for less than four years. It happened about 8:30 in the morning west coast time, and fortunately or unfortunately, I was at home sick with a cold, watching it on TV. No doubt everyone at work was watching it too, and I can only imagine the collective reaction of everyone there seeing the shuttle explode on live TV. What followed was months of analysis and investigation to understand the cause (which turned out to be rubber O-rings sealing joints in the SRBs, the solid rocket boosters, that had gone brittle in the chilly morning air, letting the burning fuel in the booster escape out through the side). Rocketdyne was relieved to find itself innocent of any role in the disaster—but if NASA was by nature risk-averse, it became even more so, and every contractor for every component of the shuttle assembly spent months doing what was called FMEA, Failure Mode and Effects Analysis (https://en.wikipedia.org/wiki/Failure_mode_and_effects_analysis), an intensive examination of every component looking for any possible failure scenario. In particular there was an emphasis on single-point failures: cases where a catastrophe would result if a single component, say a sensor, failed. (This kind of single-point sensor failure is what brought down two 737 MAX passenger jets in 2018 and 2019…) The SSMEs were full of redundancies: two command busses, two of most sensors in the engine except where we had four (fuel flow), and much of the function in our software was to constantly – the program cycled at 60 times a second – evaluate these sensor readings against qualification limits and then against each other, with various command reactions should a sensor seem to have gone wrong. This involved overtime work, sometimes late in the evening after dinner, over many weeks.
  • Columbia. The second space shuttle disaster (https://en.wikipedia.org/wiki/Space_Shuttle_Columbia_disaster) occurred in 2003, on the mission’s re-entry rather than take-off, and again I saw it from home on TV. It was a Saturday morning, and the catastrophe happened around 9am eastern, so had already happened by the time I turned on the TV news. I followed the investigation and resolution of the incident over the next months, but was no longer working on the shuttle program at that time.

Space Station Support

About a decade into my career, i.e. in the early 1990s, I drifted from Space Shuttle support to one other program, and then into process management.

The program was the International Space Station, then under construction, for which Rocketdyne had contracted to design and build the power distribution system. My job was to convert a set of Excel spreadsheets, containing records of various components and appropriate command responses, into a Microsoft Access database. This is how I learned Access, which I later parlayed into the building of Locus Online and my science fiction awards database, sfadb.com. I also did some testing work in the test lab, executing test procedures written by others, in the months when the schedule was tight and they needed extra help working overtime in the evenings.

Process Management and Improvement: CMMI

  • –add somewhere here: How with cmmi I specialized in metrics, and peer review… process performance baseline, etc.

In the early 1990s NASA and the DoD adopted a newly developed standard for assessing potential software contractors. This standard was called the Capability Maturity Model, CMM, and it was developed by the Software Engineering Institute (SEI) at Carnegie Mellon University in Pittsburgh. The CMM was an attempt to capture, in abstract terms, the best practices of successful organizations in the past.

The context is that software projects had a history of coming in late and over budget. (Perhaps more so than other kinds of engineering projects, like building bridges.) If there were root causes for that history, they may have lain in the tendency for the occasional software genius to do everything by himself, or at least take charge and tell everyone else what to do. The problem then would be what the team would do when this “hero” left, or retired. All that expertise existed only in his head, and went with him. Or there was a tendency to apply the methods of the previous project to a new project, no matter how different.

In any case, the CMM established a series of best practices for software development, arranged in five “maturity levels,” to be used both as a guide for companies to manage their projects, and also as a standard whereby external assessors would assess a company for consideration when applying for government contracts.

The five levels, I now realize, are analogous to the various hierarchies I’ve identified as themes for thinking about knowledge and awareness of the world, from the simplest and most intuitive to the more sophisticated and disciplined.

  1. Level 1, Initial, is the default, where projects are managed from experience and by intuition.
  2. Level 2, Managed, requires that each project’s processes be documented and followed.
  3. Level 3, Defined, requires that the organization have a single set of standard processes that are in turn adapted for each project’s use (rather than each project creating new processes from scratch).
  4. Level 4, Quantitatively Managed, requires that each project, and the organization collectively, collect data on process performance and use it to manage the projects. (Trivial example: keep track of how many widgets are finished each month and thereby estimate when they will all be done; a minimal sketch follows this list.)
  5. Level 5, Optimizing, requires that the process performance data be analyzed and used to steadily implement process improvements.
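A minimal sketch of the Level 4 “widgets” example mentioned above, with invented numbers: use measured completion rates to project when the work will be done.

completed_per_month = [12, 15, 11, 14]     # measured process performance, widgets/month
remaining = 58

avg_rate = sum(completed_per_month) / len(completed_per_month)
months_left = remaining / avg_rate
print(f"Average rate: {avg_rate:.1f}/month; roughly {months_left:.1f} months to finish")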

Boiled even further down: processes are documented and reliably followed; data is collected on how the processes are executed, and then used to improve them, steadily, forever.

Examples of “improvements” might be the addition of a checklist for peer reviews, to reduce the number of errors and defects, or the acquisition of a new software tool to automate what had been a manual procedure. They are almost always incremental, not revolutionary.

The directions of those improvements can change, depending on changing business goals. For example, for products like the Space Shuttle, aerospace companies like Rocketdyne placed the highest premium on quality—there must be no defects that might cause a launch to fail, because astronauts’ lives are at stake. But software for an expendable booster might relax this priority in favor of, say, project completion time.

And software companies with different kinds of products, like Apple and Microsoft, place higher premiums on time-to-market and customer appeal, which is why initial releases of their products are often buggy, and don’t get fixed until a version or three later. But both domains could, in principle, use the same framework for process management and improvement.

Again, projects are run by processes, and in principle all the people executing those processes are interchangeable and replaceable. That’s not to say especially brilliant engineers won’t have a chance to perform, but it has to be done in a context in which their work can be taken over by others if necessary.

So… in the early 1990s, while Rocketdyne was still part of Rockwell International, Rocketdyne and the several other divisions of Rockwell in southern California formed a consortium of sorts, which we called the “Software Center of Excellence” (SCOE, pronounced “skoe-ee”), for the group effort of writing a set of standard processes that would satisfy the CMM, at least through Level 3. If I recall correctly, NASA had given its contractors a deadline, a few years out, for demonstrating compliance to Level 3.

So I left the SSME Controller Software group and joined two others, Jere B and Alan P, as Rocketdyne’s process improvement group. The work of writing 15 or 20 standard processes was divvied up among the divisions, and we put out a “Software Process Manual” in 1994.

The task of writing “standard processes” was pretty vague at first. What is a process? What do you base it on? At its most basic, a “process” identifies a set of inputs (e.g. sensor readings, commands from the astronauts), performs a series of steps on them, and results in some number of outputs (e.g. commands to the engine to start, to throttle up, to throttle down, to shut down). But how do you write up a standard process for your organization about, say, configuration management? What elements of CM (e.g. version management, audits, etc.) were required to be included? The task was to combine the guidance from the CMM with the reality of how the different divisions of Rockwell actually did such work, and try to integrate them into some general whole.

One perk of this era, in the early/mid 1990s, was the meetings held among representatives from these various sites. The other sites included Downey, Seal Beach, El Segundo, and one or two others I’m not remembering. At the time, Rockwell owned company helicopters! They were used to fly senior management back and forth among these sites, but if they were otherwise not reserved, lowly software engineers like Alan and me could book them, and get a half-hour flight from Canoga Park to Downey, some 40 miles, avoiding an hour-and-a-half drive on the freeways. It was cool: the helicopter would land in a corner of the parking lot at the Canoga facility, we would walk toward it, ducking our heads under the spinning helicopter blades, and get a fantastic ride. What I remember especially is how the populated hills between the San Fernando Valley and west LA, crossing over the Encino hills and Bel Air, were immense – nearly as wide as the entire San Fernando Valley. All those properties, so many with pools.

We didn’t always use the copters; I remember one drive to the Seal Beach facility (a 55-mile trip) in particular, because as I got on the 405 freeway to drive home, the freeway was so empty – because of some accident behind where I’d entered – that my speed crept up, I was pulled over, and I got my first-ever traffic ticket.

But another copter trip was memorable. Coming back from Downey, I suppose, the weather was bad and the copter was forced to land at LAX. To approach LAX, a major airport with big planes landing and taking off, always from the east and west respectively, the copter would fly at a rather high altitude toward the airport from the south, and then spiral down to its target, a rooftop on a building in El Segundo on the south side of the airport. On that occasion we had to take a taxi back to the San Fernando Valley, as the rain came in.

The software CMM was successful from both the government’s and industry’s points of view, in the sense that its basic structure made sense in so many other domains. And so CMMs were written for other contexts: systems engineering, acquisition (managing contractors and tool purchases), and others. After some years the wise folks at Carnegie Mellon abstracted even further and consolidated all these models into an integrated CMM: CMMI (https://en.wikipedia.org/wiki/Capability_Maturity_Model_Integration). And so my company’s goal became satisfying this model.

Conforming to the CMMI, for our customer NASA, entailed periodic “assessments,” in which independent auditors would visit our site for some three to five days to judge the extent to which our organization met the standards of the CMMI. The assessment included both a close examination of our documented standard processes, and interviews with the various software managers and software engineers to see if they could “speak” the processes they used day to day. Assessments were required every three years.

Rocketdyne’s acquisition by Boeing, in 1996, did not change the assessment requirements from our customer, NASA. Boeing supported the CMM (and later CMMI) model; in fact it established a goal of “Level 5 by 2005.” The advance from Level 3 to Level 5 was problematic for many engineering areas: the collecting and analyzing of data for Levels 4 and 5 was seen as an expensive overhead that might not actually pay off. Rocketdyne, under Boeing, managed to do it anyway, using a few carefully selected cases of projects that had used data to improve a couple of specific processes. And so we achieved Level 5 ahead of schedule, in 2004. (In fact, I blogged about it at the time: http://www.markrkelly.com/Views/?p=130.)

Time went on, and the SEI kept refining and improving the CMMI, both the model and the assessment criteria; Rocketdyne’s later CMMI assessments would not get by on the bare-bones examples for Level 5 that we used in 2004. I’ve been impressed by the revisions of the CMMI over the years: a version 1.1, then 1.2, then 1.3, each time refining terminology and examples and sometimes revising complete process areas, merging some and eliminating others. They did this, of course, by inviting feedback from the entire affected industry, and holding colloquia to discuss potential changes. The resulting models were written in straightforward language, as precise as any legal document but without the obfuscation. This process of steadily refining and revising the model is analogous to science at its best: all conclusions are provisional and subject to refinement based on evidence. (A long-awaited version 2.0 of CMMI has apparently been released in the past year or so; I haven’t seen it.)

CMMI Highlights

  • Business trips: There were lots of reasons for business trips in these years, and the trips were more appealing because they were to more interesting places than Huntsville or Stennis. A key element of CMMI is training: all managers and team members must be trained in the processes they are using. At a meta-level, this included people doing process management taking courses in the CMMI itself, and in subjects like process definition (the various ways to capture and document a process). The CMMI training was often held in Pittsburgh, at the SEI facility, but in later years I also recall trips to Arlington and Alexandria, Virginia, just outside Washington DC – interesting trips, though because they were during the work week there was no time for sight-seeing.
  • Conferences. Other trips were to attend professional conferences. Since dozens or hundreds of corporations across the country were using CMMI to improve their processes or use the model to assess their performance, these conferences were occasions for these companies to exchange information and experience (sometimes guardedly). Much like a science fiction convention, there were speakers talking to large audiences, and groups of panelists speaking and taking questions from the audience; a few dozen presenters and hundreds or thousands of attendees. Furthermore these conferences were not tied to any particular city, and so (like science fiction conventions) moved around: I attended conferences in Salt Lake City (about three times), Denver, Pittsburgh, and San Jose, and I’m probably forgetting some others.
  • Assessments. Then there were occasional trips to other Rockwell or Boeing sites, for us from Rocketdyne to consult with the process people there, or even to perform informal assessments of their sites (since Rocketdyne was relatively ahead of the curve). I did two such trips by myself, one to Cleveland, one to some small town (name forgotten) northeast of Atlanta.
  • Maui. But the best assessment trip was one Alan P and I did in 1999, on Maui. The reason was that Rocketdyne (or was it through some other Boeing division?) had a contract to maintain the software for some of the super-secret spy telescopes on top of Haleakala (https://en.wikipedia.org/wiki/Haleakala_Observatory). There’s a cluster of small ‘scopes there, including top secret ones; we didn’t have to know anything specific about them in order to assess the processes of the support staff, who worked in an ordinary office building down near the coast in Kihei. Our connection was that a manager, Mike B, who’d worked at Rocketdyne, had moved to Maui to head the facility there, and thought of us when needing an informal assessment. So Alan and his wife and I flew in early on a Saturday to have most of a weekend to ourselves, before meeting the local staff in their offices for the rest of the week. During the week, we did get a tour of the observatory, if only a partial one, one evening after dinner, a long drive up the mountain and back in the dark. (The one hint I got about the secret scopes was that one of them was capable of tracking foreign satellites overhead, as they crossed the sky in 10 or 15 minutes, during daylight.)
  • HTML. In the mid-1990s the world wide web was becoming a thing, and one application of web technology was for companies to build internal websites, for display of information, email, and access to online documents. (Past a certain point, everything was online and no one printed out documents, especially big ones like our process manuals.) With more foresight, I think, than I’d had when learning Access for ISS support, I volunteered to learn HTML and set up webpages for our process organization, the SEPG (Software Engineering Process Group). I did so over the course of a few months, and shortly thereafter I parlayed those skills into my side career, working for Locus magazine—I volunteered to set up its webpage. Charles Brown had thought ahead at least to secure the locusmag.com domain name (presumably locus.com was already taken), but hadn’t found anyone to set up a site. So he took me up on my offer. The rest is history, as I recounted in 2017 here: http://locusmag.com/20Years/.

Reflections

Looking back at these engineering activities, it now occurs to me there’s a strong correlation between them and both science and critical thinking. When beginning a new engineering project, you use the best possible practices available, the result of years of refinement and practice. You don’t rely on the guy who led the last project simply because you trust him. The processes are independent of the individuals using them; there is no dependence on “heroes” or “authorities.” There is no deference to ancient wisdom, there is no avoiding conclusions because someone’s feelings might be hurt or their vanity offended. Things never go perfectly, but you evaluate your progress and adjust your methods and conclusions as you go. That’s engineering, and that’s also science.

Things never go perfectly… because you can’t predict the future, and because engineers are still human. Even with the best management estimates and tracking of progress, it’s rare for any large project to finish on time and on budget. But you do the best you can, and you try to do it better than your competitors. This is a core reason why most conspiracy theories are bunk: for them to have been executed, everything would have had to be planned and executed perfectly, without any of the many people involved leaking the scheme. Such perfection never happens in the real world.

UTC, P&W, ACE

For whatever reason, after a decade Boeing decided Rocketdyne was not a good fit for its long-term business plans, and sold the division to Pratt & Whitney, an east coast manufacturer of passenger jet engines. (An early Twilight Zone episode from 1961, “The Odyssey of Flight 33,” https://en.wikipedia.org/wiki/The_Odyssey_of_Flight_33, mentioned Pratt & Whitney engines, so I was familiar with the name.) Pratt & Whitney was in turn owned by United Technologies Corporation, UTC, whose other companies included Otis Elevator. Whereas Boeing, a laid-back west coast company, was hands-off with Rocketdyne, letting it establish its own standards and procedures, UTC, an east coast company, was relatively uptight and authoritarian. This was nowhere more visible than in its “operating system,” a company-wide set of tools and standards called ACE, for “Achieving Competitive Excellence.” ACE was homegrown by UTC and stood independent of industry or government standards. Furthermore, it was optimized for high-volume manufacturing, and was designed for implementation on factory floors. That didn’t stop UTC from imposing the totality of ACE on our very low-volume manufacturing site (one or two rocket engines a year) where most employees sat in cubicles and worked on PCs.

It’s notable too that while all sorts of information can be found on CMMI through Google searching, almost no details of ACE can be found that way; it’s UTC proprietary. I did finally find a PDF presentation (https://pdf4pro.com/view/acts-system-management-ace-caa-gov-tw-2c4364.html) that lists (on slide 7) the 12 ACE “tools,” from which I will describe just a couple of examples. Most notorious was what P&W called “6S,” its version of UTC’s “5S,” which was all about workplace cleanliness and organization. The five Ss were Sort, Straighten, Shine, Standardize, and Sustain; the sixth one was, inconsistently, called Safety. While it may make sense to keep a manufacturing environment spic and span, when applied to cubicle workplaces it became an obsession with tying up and hiding any visible computer cables, keeping the literal desktop as empty as possible, and so on. Many engineers resented it.

Another example: in the problem-solving “DIVE” process, each ACE “cell” (each business area at a site) was obliged to collect “turnbacks,” which were any examples of inefficiency or rework. It didn’t matter to the ACE folk that in software we had a highly mature process for identifying “errors” and “defects”; we were required to double-book these for ACE as “turnbacks.” Furthermore, each cell was required to find a certain number of turnbacks each month, and show progress in addressing them. You can see how this would encourage a certain amount of make-work.

To avoid duplicate work, at least, I and others who maintained the software processes spent some time trying to resolve double-booking issues, even introducing ACE terminology into the processes we maintained to satisfy CMMI. (P&W didn’t care about CMMI, but our customers did.)

So the last few years at Rocketdyne were my least pleasant. They ended on a further sour note as I was pulled away from process management and put onto a P&W project based back east that needed more workers, even remote ones. This was NGPF, for “next generation product family,” which became the PW1000G (https://en.wikipedia.org/wiki/Pratt_%26_Whitney_PW1000G), a geared turbofan jet engine for medium-sized passenger jets. A couple dozen of us at Rocketdyne were assigned to NGPF, but I was pulled in later than the initial group and got virtually no training in the computer-based design tools they used or background in the concept of the product. So my assignments were relatively menial, and frustrating because I had to figure things out as I went along, without proper peer reviews or the other processes we used for CMMI-compliant software projects.

NGPF was winding down, and I had gone back to working a final pass on a new set of process documents, when a bunch of us were laid off in November 2012. Fortunately, since I’d worked for the same company for 30½ years, and was old enough to have been grandfathered into pension eligibility from Boeing, I did get a pension, as well as a severance. And I had two different 401(k) accounts that had accumulated over the years.