Cooperative scheduling

Started by Kocsonya 6 years ago9 replieslatest reply 6 years ago1689 views

Recently there was a discussion about RTOS vs. bare-metal on this forum.

I happen to work with embedded systems which are somewhere in the middle. They are not tiny, but they are nowhere near to the lowest Linux-able iron. To give you some actual figures, we're talking about ARM Cortex-M cores say, 20-200MHz clock, with 32-512K FLASH and 8-128K RAM, usually some sort of comms (USB, Ethernet, RF and/or some industrial comms bus) with a bunch of analogue and digital hardware to monitor and control. On some systems even a 30MHz chip is only busy at around 10% of the time doing fairly simple and boring things, on others a 100+MHz chip is working hard close to 90% of the time doing some serious math to drive the things around it.

On those gizmos we usually use a cooperative scheduler. That possibility did not pop up in that discussion and I'm just wondering why.

Cooperative scheduling, as you know, is multi-threading where context switch occurs only when the current thread voluntarily gives up the CPU, usually by calling a kernel function that waits for something. That, of course, carries the risk that a rogue thread hogs the CPU indefinitely. Which was, actually, an issue with the early Macintosh computers. They used cooperative scheduling and a badly written application could (and did) freeze the entire machine.

But we're talking about embedded systems here. The threads are not general purpose user applications from an unknown source but control functions written by us. They, more often than not, are state machines responding to external events, with a provably bounded run-time. And even if the time is not predictable, we can always force a thread to yield the processor to other threads on a regular basis.

Conceptually, it is like the superloop where you just spin waiting for something to happen and when it does you call the relevant handler which then returns to the loop. However, at least for me, the fact that I have separate threads of execution (at the cost of a separate stack for each thread) and standardised tools for communication between the threads has the allure of a much cleaner and more modular design as well as higher code re-usability.

The cooperative kernel gives me timers, events, semaphores, locks and messages. Apart from the per-thread stack penalty, the footprint of the kernel is tiny, in both FLASH and RAM. It disables interrupts only for very short periods, so interrupt latency is not affected significantly. Its cooperative nature has the advantage over a pre-emptive scheduler that there are no race conditions and that you don't need to constantly lock and unlock every shared object to guarantee integrity and state consistency. 

Of course, the lack of pre-emption also means that really time-critical things must be pushed down into the ISRs (and the Cortex-M chips have a pre-emptable, prioritised interrupt system) but that's doable.

From our point of view, for the devices we work on, the advantages outweigh the disadvantages. The cooperative scheduler gives us the tools to design firmware with well separated subsystems with well-defined, uniform communication mechanisms between them but without all the race issues and most resource penalties associated with pre-emptive scheduling.

I wonder, what's your opinion about it?

[ - ]
Reply by QLJune 4, 2018

As already mentioned in the OP, cooperative scheduling is especially applicable to run event-driven systems of cooperating state machines. In that case, you don't even need explicit yield() or a context switch. Each state machine naturally processes events in a run-to-completion (RTC) fashion, and then it *returns* to the event loop. Each activation of a state machine is a simple function call from an event loop and returns to the event loop after the RTC step. The loop might then make scheduling decision as to which state machine to run next. This is all perfectly portable without a need for the context switch magic (pushing registers and changing the stack pointer).

A cooperative scheduler of that kind requires at least one event queue, but can also work with multiple event queues. The queues can have assigned priority (priority queues), which means that the scheduler always chooses the highest-priority queue that is not empty. It takes the event out of that queue and dispatches it to the associated state machine. Such a combination of state machine, event queue, and priority is called "active object". When all event queues are empty, the scheduler might call the "idle processing", which provides a centralized place to put the CPU/peripherals into a low-power mode.


The worst-case task-level response of such cooperative scheduler is the longest RTC step of all active objects. This is much better than a "superloop", where the task-level response is the sum of the longest steps through the loop.

Most RTC steps in state machines are naturally short (microsecond-level), so the task-level response is quite low. However, sometimes you have a few much longer RTC steps. One pattern to apply here is to chop the long RTC steps into shorter pieces and thus give the scheduler a chance to run more frequently. To link the pieces of original RTC together, the active object can post events to self to trigger continuation of processing (this is called the "Reminder" pattern). The state machine can then conveniently handle the continuations of processing.

Anyway, I've described this scheduler in Section 6.3.7 of my book "Practical UML Statecharts in C/C++, 2nd Ed.". A convenient place to start learning more about it (and other kernels for state machines) is the following link:


One final comment is that people often confuse event-driven state machines with polling state machines. The cooperative scheduler described here pertains to the event-driven state machines. Polling state machines that you frequently see in the "superloops" need to run "as fast as possible" and consume all CPU cycles you give them. Such polling state machines need all the cycles, because they combine discovering of events (by polling) with processing of events. They are obviously not suitable for efficient processing. In contrast, event-driven state machine requires separation of event discovery (which happens externally to the state machine, typically in the interrupts) with processing of the events (which happens in the state machine). Event-driven state machines require queuing of events.


[ - ]
Reply by trogers531June 4, 2018

Yup, that’ll work pretty much as described. There’s always that dang problem of dealing with variability of the outside world, even simple stuff like the size and nature of a data chunk can stretch an RTC task beyond what’s expected, but as the originator said, we’re supposed to be in control of these constraints, or at least reasonably well informed about the nature of the job at hand.

I have that book, BTW; well done. Highly recommended.

[ - ]
Reply by SolderdotJune 4, 2018

If you're using an RTOS with "real" threads (not just state machines as described earlier) you can go for a kind of mixed approach. Threads will cooperatively suspend but they may also be forcefully suspended at interrupts. In that case the scheduler would be invoked in the return path of the ISR.

So you do not have to do full-blown pre-emptive scheduling with time slices but you may be able to react on interrupts with fairly complex actions, taking place in a thread context, w/o having to wait until the previous job is accomplished. Since the activity takes place in a thread context it is, in turn, interruptible and therefore does not affect IRQ latency - in contrast to a solution where the whole job is done in an ISR.

[ - ]
Reply by rtomkinsJune 4, 2018

I am so happy that you brought up the first Apple OS that ran on Motorola 68000 series processors. At times, they were miserable.

That being said, had Apple had complete control over every application, there would have been much tighter control over API use and things would have worked out a lot better.

Those days of computing were like the wild west in the consumer marketplace with everyone trying to squeeze speed and performance out of slow clock speed CPUs and minimalist memory sizes, it's no wonder that everyone was taking shortcuts everywhere and breaking things.

The ROM inside those Apple systems was half the OS. To call the routines in ROM, the OS or the APP would perform an illegal instruction and the vector for service would point to a lookup table in ROM where the routines address in ROM would get loaded from, great idea. As the OS was patented, copying a Macintosh meant flagrant patent infringement.

I agree, cooperative multitasking means a low overhead in resource use. Stack management and memory allocation are critical factors in making this work effectively. Various features in the ARM family come into play in this implementation, things that did not exist way back when.

[ - ]
Reply by KocsonyaJune 4, 2018

Well, the Commodore Amiga had a pre-emptive kernel, but bad applications still ended in Guru Meditation, as there was no memory protection. You could not hog the CPU, but you could overwrite any part of the RAM.

That reminds me, I remember that I saw a magazine back then in which both Apple and Commodore placed ads and if you read their slogans in the right order, it was quite funny (at least I thought so): 

Amiga: the computer for the creative mind

Macintosh: the computer for the rest of us


[ - ]
Reply by cprovidentiJune 4, 2018

A very good twist on this topic. That said, I would see such schedulers as, in effect, bare metal, but with an API (and/or a framework) to support the context switch from one task to another. In other words, instead of having to craft every "context switch" oneself, the co-operative framework/API standardizes the context switch mechanism. I have never used one, but would guess that it makes it much easier to program bare metal, and is equivalent to a multi-threaded system where all threads (= tasks) run at same priority without time slicing, meaning they have to "yield" the CPU in order to permit another thread to run.

Apart from Apple OS, can you think of any examples? I think uTasker is one.

And given my use of "the CPU" (singular) above,...what about multi-core embedded systems? I guess there are no multi-core MCUs that cannot run Linux (or something equivalent)...or are there?

[ - ]
Reply by KocsonyaJune 4, 2018

I believe (but not sure, never been a Windows person) early Windows versions were co-operative as well. I mean before W95, like 3.x and earlier. When I saw Win 1.0 running on a 30MHz 286, with a monochrome Hercules video card, now that was UNco-operative multitasking at its best. Screen re-draws took ages. I played Reversi on it. Working out its next move took a lot less time than re-drawing the table after said move...

[ - ]
Reply by rtomkinsJune 4, 2018

Although the name is Windows, Windows NT is a newly architected system. Some of the foundations within Windows NT are from RSX-11M and VMS. Dave Cutler left DEC for Microsoft to head up the Windows NT design and his past experiences and knowledge made it into Windows NT.

Windows 1.0 was a GUI interface sitting on top of an OS, MS-DOS. I cannot remember enough about MS-DOS other than it was not preemptive at first.

[ - ]
Reply by trogers531June 4, 2018

Hmm. Yup, sounds just right. Very like several systems I’ve done over the years. Something very much like what you’ve described was successfully built into a FORTH based system running in one of Rockwell’s 6502 varients, so the clock and memory restrictions aren’t really relevant, as much as it would seem so at first glance; the real significance is in the organization of your code.

This is true of all real-time systems, no matter what the perceived payoff. Real-time is a structural tool, and the target is us, the folks that are in desperate need of some hope of keeping our heads above water, logically speaking, and producing something we can actually trust to work as advertised.

So: you’re hitting all the right spots, IMO. We put that original 6502 based code to work in a dozen designs, some of which are still going strong thirty years later doing what sounds like a trivial job until you sit down and try to do it with, say, an IBM PCjr (one of the horror stories we came in behind, a long time ago).

We switched to 68000 based hardware as soon as practical, our own stuff, also using FIG FORTH as a base, floating point from Motorola (that we had to ‘fix’ before it was correct), 68020 and -30 hardware with IEEE math coprocessor, and like that. And similar to what you’re describing, we slid over a bit and added a commercial kernel to the underpinnings of the system, in this case pSOS, a sweet little bit of code that added the luxury of more control without a lot more effort on our part, including preemption, easy to implement resource locks and so on.

All of this was driven by a conscious awareness of the role of real-time constructs and tools as organizational devices, a natural expressional medium for those of us who fall somewhere near, or aspire to, Knuth’s ‘Priest Class’ (sorry, Don, I know you wish we’d forget you said it; it’s still true, after all).

I wouldn’t eschew the ‘Linux-able’ hardware choices out there, either; we’re doing a project right now with the Omega2+ devices, and it’s a blast, a really efficient approach all around given the limited nature of this particular problem. What we’ll have should be solid in the way that gives one that warm and fuzzy feeling when falling off to sleep, which might be the only valid test for those members of the community crazy enough to post this kind of stuff.

I don’t know how far I would chase the need for critical scheduling from within a simple Linux device; it would depend on how solid the available kernel primitives were, how solid the chipset is (in, say, a deterministic sense; I’m assuming nobody would bother trying to field a piece of flakey hardware), and what the payoff would be for our effort to get things right, and so on.

You know, the usual stuff, and as I said back at the top it sounds like you’ve got a bead on it. Being able to feel certain that the critical elements of the code are correct, timing-wise or whatever else comes into the picture, and having a solid understanding of why your opinion is valid, that’s what all this real-time stuff is about.

Despite whatever others will tell you ‘bout it, and there’s lots of that. I’ve watched the market develop since the days of the first experimental RT systems, stood up and confused a lot of good people at conferences and users group galas, and the fact remains: there are way too many examples of how correct structural principles pulled someone’s tootsies out of the fire, saved someone’s hide, really, in the oh-my-god-we-never-thought-that-500lb-thing-could-fall sense, and like that, to ever convince me to work any other way, to any other purpose.

It’s not even about code reuse, BTW. That’s just gravy; I’ve never worried about it for a minute. It just happens naturally.

Also BTW: that Rockwell-chip based system I mentioned earlier is a good example of the whole approach. Although Rockwell did have an engineering demonstration part that booted and ran a FORTH environment, it’s not what we used. We took the hardware and carefully rolled our own, complete with very useful real-time underpinnings that made it a snap to put complex systems together in small quantities economically for a large research facility.

There were plenty of other engineers out there that tried to use the chip as offered by Rockwell, we’re frustrated and quit, or produced ‘less than perfect’ results. There were a spate of similar devices out there, BASIC chips and such, all doomed for serious work. This sort of phenomena occurs regularly, at the seams of the underlying technology; every time there’s a serious shrink, that makes a step forward in size, complexity, etc., there follows a raft of hastily conceived engineering demonstration hardware offerings that should, in most cases, be approached carefully, at least.

Stick with it; you’ve got a pretty good sense of what’s needed to succeed from my view. Good luck.