As engineers, we have to think carefully about how our designs may be used in ways we did not foresee. You may have heard of the happy path, which describes the sequence of events someone takes to use a product — whether it’s software or hardware. The line between software and hardware has gotten pretty blurry in recent years due to the prevalence of embedded systems. (Good news for embedded engineers!) Things have gotten more complex, and it’s easier to run into trouble by accident. As a designer, you may be making different assumptions from those of the people who are using your designs. There is more than one way this can happen:
- You have made a perfectly reasonable assumption of “normal” behavior, but
  - you have forgotten a few edge cases where someone makes reasonable assumptions in odd circumstances, or
  - someone makes unreasonable assumptions and gets into trouble that is disproportionately large.
- You think you have made a reasonable assumption of behavior, but
  - someone uses your design in reasonable ways and misses some of your assumptions, or
  - someone makes a minor mistake and gets into trouble that is disproportionately large.
Let me give you a few examples.
I have this microwave oven from Panasonic — it proclaims the Genius Sensor* 1200W on the front panel — which, like many microwaves, has a built-in clock and timer. There have been a few occasions where I go to reheat something, and I find that someone else is already using the timer. I can never seem to figure out how to use the oven without stopping the timer, and it just causes unnecessary drama in the kitchen. You would think that there would be an easy way to make the cooking and timer functions independent so you could use both at once… but I guess that’s just too much to ask.
(*Genius Sensor — marketers take note: I have had this oven for over two years, and it is only now, out of curiosity, that I looked up what the “Genius Sensor” actually does. Apparently it’s a steam sensor; the Panasonic website says “The Auto or Sensor Cook Features is feature that simplifies programming. It uses built-in sensor technology to sense steam to cook automatically — with no need to set the power level, food weight or cooking time.” I wish that it just said “steam sensor” instead; mystery doesn’t engender trust.)
Recently I was hiking in a remote area in northern Arizona, with a small tour group, and this fellow came up to us as we were about to get back in our vehicle, and asked us for help. He had rented a four-wheel drive SUV, and could not get the transmission to shift into either forward or reverse gears. Our tour guide looked at it; apparently it kept displaying a message about “Auto Park”, and after a few minutes, we realized that it would not allow forward or reverse gears if the door was open. Apparently the logic on Ford vehicles nowadays is that Auto Park (or Return to Park) engages if the driver’s door is open and the driver’s seatbelt is not buckled, to prevent vehicle rollaway accidents. As Ford stated in a press release:
Fusion engineers worked hard to integrate the feature into normal, convenient operation for drivers. It can detect when a driver has turned the car off while moving, and will first shift into neutral until it slows below 5 mph before shifting into park automatically. The technology is designed not to operate if it detects a belted driver opens the door with the car moving to free a stuck coattail, for example, or if a driver happens to be inching into a parking space and wishes to see the lane markers. It is designed not to operate when the car is in Stay in Neutral or Neutral Tow mode.
Vehicle rollaways typically occur when drivers exit the vehicle with the engine running and the transmission not in park. A rollaway incident could result in serious injuries to a driver exiting the vehicle, or to pedestrians in the path of a rolling vehicle.
That makes sense, but the message reported to the driver was confusing, and there are times when the driver needs to open the door while the vehicle is moving slowly, to be able to see something. Do you ever need to do this without a buckled seatbelt? I don’t know. When should designers allow manual overrides of an automatic safety feature? That’s a really hard problem.
Don’t Turn Off Your Computer!
One thing that really gets me confused is when it’s time to reboot my work PC for a Windows OS update, and I see this kind of message:
Don’t turn off your computer!
OK, but what happens if there’s a power glitch, or the battery is almost empty and I can’t plug into my power supply in time, and the computer does turn off? If they’ve designed the operating system well, they’re using a transaction-based approach, and if you do turn off the computer, it’s just going to roll back to the last completed transaction. But I have my doubts, and I don’t want to test them.
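A common way to implement that transactional approach — and I’m sketching the general technique here, not what Windows actually does — is to write the new state to a temporary file and then atomically rename it over the old one. A crash at any point leaves you with either the complete old file or the complete new file, never a half-written mix:

```python
import os

def save_state_atomically(path: str, data: bytes) -> None:
    """Write data so that a crash or power loss leaves either the old
    file or the new one intact -- never a partially written mix."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # force the data to stable storage first
    os.replace(tmp, path)      # atomic rename on POSIX and modern Windows
```

The key property is that `os.replace` is a single atomic step: readers see the old contents right up until the instant they see the new ones.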
All of these get-into-trouble situations involve mutable state. Either the software involved maintains state in memory — like the timer on the microwave — or there is physical state in the real world, like door latches and seatbelt buckles and power switches and battery connections. As a designer, you can manage software state however you like. The physical state is a bit harder, since you have no direct control over the situation. There are mechanical approaches to constrain state intentionally, like a seatbelt mechanism in a car: you can’t unbuckle the seatbelt unless you push the button on the buckling mechanism, and you can’t loosen the seatbelt strap unless you do it slowly, otherwise the locking mechanism engages to keep the seatbelt tight and hold the passenger in the seat. But these external changes in state can create some tricky situations.
What happens if you turn off an embedded system, and turn it back on again, and in the meantime, physical state has changed? For example, an automated coffee maker that was in the middle of making coffee, and something happened and it has restarted. The system needs to resynchronize its internal software state with the physical state of the machinery, and handle transitions in physical state properly. This is easier to do when there are sensors, of course. If embedded systems are cost-constrained enough to prohibit sensors, then there needs to be a way for the user to inform the system that the physical state is ready for it to take some action. If you are designing firmware, you need to keep these sorts of things in mind so that you avoid inconsistent behavior.
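Here’s a minimal sketch of that resynchronization idea for the hypothetical coffee maker: on startup, instead of assuming we begin from idle, we derive a safe software state from whatever the sensors say the physical state is. The sensor names and thresholds are invented for illustration:

```python
from enum import Enum, auto

class BrewState(Enum):
    IDLE = auto()
    HEATING = auto()
    FAULT = auto()

def resync_on_startup(water_present: bool, boiler_temp_c: float,
                      carafe_in_place: bool) -> BrewState:
    """Derive a safe initial software state from the physical state,
    rather than assuming the machine restarted from IDLE."""
    if not carafe_in_place:
        return BrewState.FAULT      # don't dispense onto the hot plate
    if boiler_temp_c > 90.0:
        # We were probably mid-brew when power was lost; a conservative
        # choice is to resume heating under closed-loop control.
        return BrewState.HEATING
    return BrewState.IDLE           # cold boiler: wait for user input
```

The point isn’t the specific rules — it’s that the startup path inspects the physical world before committing to a software state.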
In addition to the design itself, you also need to keep in mind the instructions and usability cues you provide to the people using your design. (Usability cues would be things like prompts and the use of color/shapes/sounds, positioning/ordering, or familiar metaphors, to influence the way someone uses a product or service.) Writing instructions is hard! We have to create a mental model for how someone should use what is unfamiliar to them, and try to communicate that mental model; the people using our designs will develop their own mental model, and it is unlikely to be exactly the same as what we had in mind.
Simon Says!
The children’s game Simon Says is about following directions. The leader tells the group Simon says put your hand on your head! Simon says stomp your feet! Growl like a lion! and everyone is supposed to put their hand on their head and stomp their feet — but not growl like a lion, because it wasn’t preceded by Simon says.
I call usability failures “Simon Says” problems when something is overly complicated to use, and you fall into a trap just because you made one little mistake. This happens a lot with software, probably because it’s so much easier to increase complexity in software compared to hardware.
For example, suppose I’m applying for a loan online, and I’m going through lots of effort to collect information to submit online, and I reach a point where it wants some information I don’t have on hand, like the original cost of my home, or my employer’s tax identification number. So I go to look this information up, and it takes a while to find it. Or maybe my Uncle Billy calls, and of course I haven’t talked to him for a long time, and yes, it’s nice to hear how he’s doing, thank you, goodbye! Where was I? Oh, yes, the loan application. But the online application has logged me out due to more than 5 minutes of inactivity, and when I log in again, nothing has been saved for later, so I have to start all over. Argh!
Or another example: Microsoft Outlook causes Simon Says problems with meeting rooms. I finally find a time where all my colleagues are free to discuss something… and now I have to find a meeting room. What a pain. Actually, they’re not that big a deal when everything works perfectly. But recently I wanted to swap a pair of meetings, and you can’t just do that. Let’s just say that I had a meeting on the Salmon Ridge project scheduled from 8am-9am in room 304 and a meeting on the Yellowtail project from 9am-10am in room 307. The easy way to do this, since Outlook doesn’t let you schedule things simultaneously, would be to pick one meeting, like the Salmon Ridge meeting, and move it to 9am-10am in room 307, then move the other meeting. But if I do that, Outlook will tell me that Room 307 has rejected my meeting: I can’t move Salmon Ridge to room 307 at that time, because the room is occupied. (The Yellowtail project is still scheduled in room 307. Simon Says!) It’s like a break-before-make switch; you can’t simultaneously have two meetings scheduled for the same room at the same time, even if it’s just a temporary condition during rescheduling. So I have to free up one of the rooms: release room 304 from the first meeting, then move the other meeting from room 307 to room 304, then switch the first meeting to room 307, and hope that during the minute or two it took me to do this, that nobody has swooped in and grabbed a room while it was released.
And that’s just a swap with two meetings when I’m organizing both. If there are two different organizers, we have to coordinate and switch things together, which just takes more time and effort.
This would be a lot easier if the concept of a meeting room reservation were decoupled from a scheduled meeting, and if it were transferable. If I could just remove the meeting room reservation from one meeting and attach it to the other, without canceling the reservation, then there wouldn’t be any window of time in which the meeting room was released and available to someone else.
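To make the idea concrete, here’s a toy model of a scheduler where reservations are decoupled from meetings, so two meetings can trade rooms and times in one atomic step. None of these names have anything to do with Outlook’s actual API — it’s just a sketch of the make-before-break behavior I wish existed:

```python
class Scheduler:
    """Toy model: a reservation is a (room, slot) key that can be
    reassigned between meetings atomically, so no third party can
    grab a room mid-swap."""
    def __init__(self):
        self.by_room_slot = {}   # (room, slot) -> meeting name

    def book(self, meeting: str, room: str, slot: str) -> None:
        key = (room, slot)
        if key in self.by_room_slot:
            raise ValueError(f"{room} is occupied at {slot}")
        self.by_room_slot[key] = meeting

    def swap(self, key_a, key_b) -> None:
        # Both reassignments happen inside one operation; at no point
        # is either room released and visible as "available".
        a = self.by_room_slot.pop(key_a)
        b = self.by_room_slot.pop(key_b)
        self.by_room_slot[key_b] = a
        self.by_room_slot[key_a] = b
```

With this model, swapping the Salmon Ridge and Yellowtail meetings is a single `swap` call instead of a three-step dance with a vulnerable window in the middle.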
You might say, “Well, that’s just an edge case.” But rescheduling meetings happens all the time; I might have to do it three or four times a week. Maybe it’s not worth it to Microsoft to add some kind of separate meeting-room-reservation feature, but if you add up all the extra work and frustration of working around the existing system, the lost productivity probably adds up to a good chunk of what my employer pays Microsoft each year for its license.
The Simon Says problem shows up less often with embedded systems, but I’m sure it exists. (Especially when you get to circumstances like WiFi setups or firmware updates.)
The Price of Failure
Keep in mind the consequences of not following the Happy Path. I expect that when I drive a car, things won’t go horribly wrong just because I don’t drive in perfectly tame situations. When I pull off the shoulder of a road, and there’s a slight drop-off, the car shouldn’t flip or catch fire or eject its passengers. I know if I drive off of a cliff or into a tree, something really bad might happen — but bad things shouldn’t happen just because things were a little out of the ordinary. (Ironically, while looking up the term “drop-off”, I found that slight drop-offs on road shoulders can in fact cause dangerous accidents, and tapered pavement called the “safety edge” has been developed to reduce such incidents.)
When you are reviewing shortcomings in your design, keep in mind the consequences of “bad behavior”, even if these are things that a reasonable person would not do. The really bad consequences fall under the topic of safety protection, which is why there are all sorts of standards on product safety. But there are other consequences that are “less bad” but still very undesirable. If you are designing a consumer product with a memory card for saving data (audio / video / etc.), you should probably think about the situation where someone removes the memory card while the device is still powered on. Even though this is not a wise thing to do, the consequence of a corrupted memory card is bad enough that it might be worth the effort to mitigate. For example, the card could be covered by a door with a sensor, and opening the door should cause the device to complete its write operations and prepare for a safe eject before the card is removed.
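Here’s a minimal sketch of that door-sensor mitigation, with everything (class names, the write queue, the commit step) invented for illustration: the door-open event drains the pending writes and unmounts the card before the user’s fingers can physically reach it.

```python
import queue

class CardWriter:
    """Sketch of a safe-eject policy: when the card door opens, finish
    all queued writes and mark the card unmounted before removal."""
    def __init__(self):
        self.pending = queue.Queue()
        self.mounted = True

    def write(self, block: bytes) -> None:
        if not self.mounted:
            raise RuntimeError("card ejected")
        self.pending.put(block)

    def on_door_open(self) -> None:
        # Drain every queued write, then unmount; by the time the card
        # can be pulled out, the filesystem is consistent.
        while not self.pending.empty():
            self._commit(self.pending.get())
        self.mounted = False

    def _commit(self, block: bytes) -> None:
        pass  # real firmware would flush this block to the card here
```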
At any rate, I’d spend more time looking at situations that can cause severe consequences than at ones with milder consequences. (For example, if pressing all the buttons on a camera simultaneously does something weird, but it doesn’t cause any persistent effects, then it might not be worth trying to protect against this case.)
Document Your Design!
Whenever you’re talking about assumptions and design decisions, it’s really important to document them! It’s painful to come back two years later to find an obscure note in firmware code that “we have to wait for the button press”, and not be able to find the rationale for choosing a particular behavior. You may even have been the person who implemented the behavior at the time; don’t expect your memory to hold up perfectly. Write down what you are designing and why!
(And it’s worth mentioning again: state machines should be designed intentionally, not as side effects of the way pieces of the design interact together.)
Exhaustive Testing
Sometimes a system can run into problems just because the designers overlooked something.
My group had one case in firmware where an overcurrent fault would occur when you unplugged a motor drive circuit board from a power supply and quickly plugged it back in. No problem if it had been unpowered for a long time, and power was applied; no problem if you reset or reprogrammed the microcontroller. But the momentary removal of control power allowed some capacitors to discharge before others, and the resulting inrush current caused a fault. I don’t remember the exact details, but we had to add a time delay of a millisecond or two to the motor drive to allow a soft-start.
We weren’t catching this issue because we had a habit of just keeping the circuit board powered all the time, and it wasn’t until someone ran a couple of tests in quick succession that they found the problem.
Exhaustive testing would cover all possible combinations of state transitions — including physical state — to make sure the design behaved as intended. But it’s not likely to be easy in practice, and in many systems it may be infeasible because of the large number of combinations.
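For small state spaces, you really can enumerate everything. Here’s a toy example (the inputs and the interlock rule are hypothetical): three binary physical inputs means only eight combinations, so we can check a safety invariant for every one — but note that the count doubles with every state variable you add, which is exactly why this stops scaling.

```python
import itertools

def motor_may_run(power_ok: bool, cover_closed: bool,
                  estop_released: bool) -> bool:
    """Hypothetical interlock: the motor runs only if power is good,
    the cover is closed, and the emergency stop is released."""
    return power_ok and cover_closed and estop_released

def check_all_combinations() -> int:
    """Exhaustively test every combination of the three inputs against
    the invariant 'the motor never runs with the cover open'."""
    count = 0
    for combo in itertools.product([False, True], repeat=3):
        running = motor_may_run(*combo)
        assert not (running and not combo[1]), combo
        count += 1
    return count   # 2**3 = 8 combinations checked
```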
The Drunken Happy Path
My team has a concept we call the “Drunken Happy Path”, when it comes to software or firmware testing. It’s basically a modified Happy Path test where we do some task, but we are intentionally sloppy about following directions. Maybe we forgot to do things in exactly the same order mentioned in the instructions. (Press the button before you connect the motor to the circuit board, rather than after. Or disconnect the motor while the motor shaft is still spinning.) The hope is that we can cover a wider range of cases than just the Happy Path, even though full exhaustive testing of the different possible sequences of events may be impractical.
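One cheap way to mechanize this — a sketch, not how my team actually automated it — is to take the Happy Path step sequence and occasionally swap adjacent steps, the way a distracted user might do things slightly out of order:

```python
import random

def drunken_happy_path(steps, rng=None, swap_prob=0.3):
    """Return a 'sloppy' variant of a test sequence: mostly the Happy
    Path order, but with occasional adjacent steps swapped."""
    rng = rng or random.Random()
    steps = list(steps)
    i = 0
    while i < len(steps) - 1:
        if rng.random() < swap_prob:
            steps[i], steps[i + 1] = steps[i + 1], steps[i]
            i += 2   # skip ahead so we don't immediately un-swap the pair
        else:
            i += 1
    return steps
```

Feeding a test harness many such perturbed sequences explores the neighborhood of the Happy Path without trying to enumerate every possible ordering.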
Five or six years ago, we worked closely with another group doing the PC-side software that communicated with our firmware, and we’d find all sorts of bugs with the Drunken Happy Path approach (though we didn’t have a name for it back then) that they didn’t find. Their manager at the time was concerned about this, asking me, “Why can’t you just write up a test procedure?” so that their team could follow it and find these bugs earlier, before it got to our team. Somehow I couldn’t get across the idea that we just did testing with a bit of randomness, and that writing down all the things that we might be likely to test was just not practical.
The wider software community calls this sort of test-with-randomness approach fuzzing or fuzz testing, but I think Drunken Happy Path is more specific, and the intent is just to cover what a reasonable person would do if they’re not being as careful as someone who follows instructions to the letter.
Wrapup
Real-world testing of any product can uncover all sorts of situations that the designers hadn’t anticipated. Your job as a designer should be to try to anticipate those unanticipated situations. Ask yourself: What if someone takes the cover off while the power supply is still on? Or brings it outside from a warm house to cold air? Or turns it upside-down? Avoid “Simon Says” problems; don’t just expect the people using your system to follow the directions exactly. Make sure you run your assumptions by other people to double-check whether they are reasonable. Try to ensure that when someone strays from the Happy Path of perfect use, your product still behaves in a reasonable manner. And document the reasons for your design decisions!
And speaking of “happy”, have a Happy New Year!
© 2024 Jason M. Sachs, all rights reserved.