EmbeddedRelated.com
Blogs
The 2026 Embedded Online Conference

Finite State Machines (FSM) in Embedded Systems (Part 5) - From One FSM to Many

Massimiliano PaganiApril 21, 20261 comment

You can learn to play chess in a few minutes, but then games built on those simple rules are so complex, diverse, and countless that the mind is blown.

I find that Finite State Machines have the same property - the concept is simple, even a child can grasp the basics, but then you can use FSMs to describe very complex and convoluted systems. I published four articles on this topic -

Recently, I realized a fifth article about systems made by state machines was missing.

This article is available in PDF format for easy printing

A single state machine can handle up to a certain level of complexity. You can apply some divide-and-conquer strategies by making your state machine hierarchical, but still, at a certain point, the single FSM becomes too complex, and adding features and maintaining such a beefy structure becomes too daunting.

In traditional applications, complexity is managed by running several tasks (or threads if you prefer) that share part of their state and use concurrency primitives such as semaphores, mutexes, or events to synchronize access. Quite often, those tasks are implemented by loops that wait for some condition to start processing and interact with the rest of the system by acquiring a mutex, reading or updating some data, and then releasing the mutex.

This model originated in ancient times (~1960), when systems were simple, composed of a few tasks. In such a context, mutexes and semaphores were an effective means to manage shared memory. Just to give you a picture, consider that the Apollo Guidance Computer - developed during the 60s - had at most 8 concurrent tasks. Priority inversion became a serious problem only at a later time.

Fast forward half a century, applications run with a thousand threads on several different cores, and memories are one million times the size they were. Indeed, concurrent access to shared memory is increasingly seen as a source of serious problems.

In 2021, Mozilla published a blog post about detecting data races (i.e., synchronization errors) in its codebase. To their surprise, they discovered that even code considered battle-tested and rock solid had data races. And that doesn't mean, as someone might rush to think, that their code sucks, but that concurrent/parallel code is hard to write and easy to get wrong.

There is a cognitive bias in skilled programmers [1] that makes them reckon that some classes of bugs are made by others, not them. Security and concurrency bugs belong to those classes. But this is a bias, and even for skilled programmers, it is easy to fail to write thread-safe code.

Surely Rust, with its strong typing and peculiar approach to data sharing, helps a lot, but another approach is to build the system out of communicating state machines. By following proper and simple principles, it is possible to create thread-safe code and get some other advantages as well.

There are many variations of the system, but I want to stay simple and provide the core concepts, leaving further exploration and a more formal approach to the reader.

The core component that wraps the state machine is the Actor[2]. Actors receive messages from their environment and process them. Upon the receipt of a message, the actor performs one (or more) of the following actions -

  • take decisions
  • send one or more messages to other actors
  • change how it will respond to future messages
  • spawn new actor(s)
  • terminate itself

Note that the actor can change how it will respond to future messages, meaning that it can change its own internal state (e.g., by switching to a new FSM state, or by modifying implicit state), but it cannot change external state. This is also true the other way around - no external actor can directly change the state of this actor. In a properly designed Actor System, there is no shared state. Each actor is responsible for its own state. If actor A wants to change the state managed by actor B, then A sends a message to B to request such a change. By this approach, there is no longer a need for mutexes.

Reasoning about this system is easier since all state-changing operations are local. You don't have to inspect other threads to see how they access the global state and build synchronization architectures to prevent critical races.

Up to now, actors can be implemented in any way - e.g., you can have a dedicated thread running a loop that waits for incoming messages. But the actor nature is reactive - i.e., the actor evolves only when hit by a message, and, after some CPU-intensive message processing, returns to wait for another message, freeing up the CPU. Although this can also be modeled with coroutines, it naturally lends itself to being implemented via finite state machines.

With state machines, it becomes easy to decouple the actor from the thread and move the thread management into an external component that schedules the activities of actors[3]. This scheduler (not to be confused with the OS scheduler) decides which actor needs to run to process its input message(s). When the message is processed, the actor gives the control back to the scheduler, so that another actor can be executed. The scheduler can be thread-based, usually relying on a thread pool, but can also be implemented on the bare metal (i.e., without OS / RTOS)[3]. In this case, the scheduler may also control the power mode of the system. If no actor is executing, then low power mode can be entered. This is a single central point that has this information for free.

If you want to reduce the overhead of context switch, you can design your system with a single thread (or a thread for each CPU core). When the actor receives a message, it executes it without interruption until the process must wait for further incoming messages.

The state machine model perfectly matches this approach (also called run-to-completion) because the algorithm can be executed in stages, each one independent from the others. Also consider that by avoiding a thread for each actor, you will avoid freezing resources. It is true that in a pre-emptive system, the CPU is "recycled", but thread-specific resources (such as the memory used by its stack) stay frozen and reserved when the CPU is executing another thread.

Also, there are additional benefits in testing and deployment flexibility.

Since actors communicate with the external world via a well-defined, simple, and concise interface, it is easy to write unit tests. You don't even need a mockup for external parts, since you just need to be able to send and receive messages.

The other advantage of the message-based interface is that it makes it simple to relocate one or more actors to a different machine. This may not apply today to most of the embedded systems, but in the future, it might become a cheap way to optimize software applications[4]. Taking this further, you may imagine a load management capable of dynamically balancing the execution load by moving actors from one computing node to another.

So far, so good, but is there any drawback?

Well, if you need to process a large amount of data, sending it through messages from one actor to another may not yield impressive results. You can figure out a smart way to encode large messages to avoid copying, but for sure, copying data takes more time than sharing it. On the other hand, if most of the time you have to wait for the mutex to become free, maybe the copy time is negligible.

I think that there are three problems in implementing this approach.

First is actor location - your code knows that it needs to send a message to the disk actor for storing data, but where is the disk actor? You may pass the actor memory address at system startup. But this solution doesn't scale well: if this actor needs to talk to other 20 actors, your constructor is going to have 20+ arguments. If actors are created dynamically, you may not even know their address at startup.

Possibly, if you have a small set of components statically defined, this approach is going to work. But if the system has some tens or hundreds of actors or they are instantiated dynamically when needed, you need an actor directory that, given the name, provides you with a handle to send the message to[5]. Moreover, if actors have dynamic lives, you need to query this directory each time you need to send a message or, even better, have a dispatcher that automatically looks up the actor directory.

The second problem is about priority. Modern OS, specifically real-time OS, are based on the concept of priority. Not all threads are equal. Some need to be served first, jumping the queue.

And a queue is what stays in front of each actor. If you do things naively, you may still stumble into a priority inversion. Suppose your actor input queue is almost full, and now a new request coming from a real-time activity arrives in the queue.

There won't be any lockup, as it may happen for mutexes and semaphores, because all requests are served, but the urgency of the last request is lost, and its execution will come at the end.

This problem can be fixed by employing queues with priorities and properly designing message priorities. This is not going to be as simple and enforceable as defining the priority for a task. With queues, it is the sender that tells the receiver the priority of the message. This priority is manually enforced each time a message is sent. This priority needs to be propagated to all the downstream operations triggered by the source message. So if A sends a high-priority message to B and B needs to ask C to complete the requested action, B has to use the same priority A gave to its message to communicate with C.

It is hard, but not impossible, to design correctly, and a mistake in the design would just make the system's reaction times slower.

The third and possibly worst problem is about message types. By the book, an FSM ignores events that do not make the FSM evolve. In other words, given the state where the state machine is, only events marked on the outgoing arcs are considered. Other kinds of events are ignored.

This can be ok from a theoretical point of view, but as a programmer, I'd like to know whether I can send a message to a state machine, much like I'd like to know whether I am calling a function with the correct types. Indeed, in my previous installments, I showed how to write a strongly typed C++ FSM. By using the std::variant template as the base for input events, you couldn't possibly send a message of the wrong type.

Types may be an annoyance that gets in the way when you are crafting a demo or a little tool, but they are a lifesaver when dealing with complex code. That's why, if you have an application made of concurrent state machines, you want to be sure that you send the right message type to the intended target state machine.

This change would require the actors to be typed. And this brings quite some trouble. E.g., suppose you want to reference a generic actor via a reference to the actor base class; you couldn't send it a message because the input function would need to include the types of the accepted messages in its signature.

In C++, you could use static polymorphism and turn the whole actor system into a set of template instantiations and statically solve this kind of typing problem. While possible, this approach creates quite a rigid structure, which is difficult to update.

There are two more concerns - although actor systems help in dealing with concurrent access to resources, the system is not completely free from problems. Consider the scenario where actor A is waiting for a message from actor B and, similarly, actor B is waiting for a message from A. This will result in a deadlock.

Another point of attention is about message delivery. Coming from traditional programming, where you tell someone something by calling someone's function, it is easy to overlook the problem with message delivery. A message is something that is sent, travels, and is eventually received. Something can go wrong in each stage of the delivery, resulting in no message delivered, a message properly delivered once, or the same message delivered multiple times.

This may happen even for systems hosted on a single machine. E.g., the receiving queue of an actor is full, and a message is dropped. An actor restarts and sends the last message again. So a proper design must also take into account this aspect of the communication.

Actor systems are an elegant solution to the concurrency problem, especially in large applications. This design scales well from microcontrollers (even if some simplifications are in order) to large distributed systems. The increased complexity of coding applications using the Finite State Machine pattern is offset by simplification in testing, dependency management, system design, and power management.

Although not free from problems, the actor model is a powerful tool to have in every software engineer's toolbox.

Notes

[1] I am also affected by this bias, but when I notice it, I try to compensate. ︎↩

[2] Actors can be implemented using the Active Object design pattern, where a dedicated thread executes commands received from a queue, decoupling execution from the requesting entity. Note that Active Object is one concrete implementation of the broader Actor concept — coroutines or a scheduler-based thread pool, as described above, are valid alternatives. ︎↩

[3] See also the excellent series on this topic by Nathan Jones - "You don't Need an RTOS".

[4] The Reactive Manifesto describes properties that are made possible by Actor Systems. I would not recommend implementing something 100% like this on today's embedded system hardware, but the manifesto is worth reading for design inspiration.

[5] Indeed, you may complicate the architecture by adding a message broker that takes care of connecting sender(s) to receiver(s) while keeping them decoupled. This is usually achieved via a publisher-subscriber architecture.


The 2026 Embedded Online Conference
[ - ]
Comment by waydanMay 4, 2026

This is a very nice summary of some of the benefits and concerns. One additional benefit might be breaking the system into separate OS processes using IPC-safe queues.

I've always wanted to try constructing a system using this pattern, but it is difficult to get organizational buy-in as we have so much momentum behind traditional task-based concurrency. I'll have to employ it in my next personal project.

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Please login (on the right) if you already have an account on this platform.

Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers: