
My friend, the compiler

Ido Gendel, September 11, 2024

This was supposed to be a story of heroism; of using brain power, lookup tables and various other means to overcome strict hardware limitations. Instead, it turned out to be just a little reminder to look closely at all of the aspects of the toolchain, because someone else might have already solved the problem for you.

As I have mentioned in this blog before, I enjoy tackling programming challenges which are caused by the resource constraints of "weak" microcontrollers. That's why, when a project required me to convert a lot of raw bytes to their human-readable hexadecimal representation and vice versa, I immediately wondered what would be the fastest way to achieve that on my platform – the 8-bit ATmega4809 AVR MCU.

Hex-dec conversion is such a basic task that I'll skip most of the details and just mention one major self-imposed constraint: the code has to be C, no Assembly. Why? Because I'm not that well-versed in Assembly, and because I want to maintain at least a semblance of generality in whatever solution I come up with, knowing full well that it will still depend to some extent on the architecture and the Instruction Set of the MCU.


So, cutting to the chase: A hexadecimal representation of a raw byte is made of two characters (e.g., "F4"), each corresponding to one nybble (4-bit "half") of that byte. An inevitable part of the conversion job is, therefore, to somehow isolate these nybbles (implicitly or explicitly) and access or copy each of them separately, as independent numbers. In a PC program, we might see something like this:

// Mask away the top four bits to keep the right (low) nybble
uint8_t rightNybbleValue = sourceByte & 0x0F;

// Right-shift by four bits to get the left (high) nybble
uint8_t leftNybbleValue = sourceByte >> 4;


But the 8-bit AVR Instruction Set only has one-bit shifts. That second assignment in C will turn, at best, into four individual assembly instructions – and in fact, I've seen compilers render such commands as loops, thereby doubling (or more) the number of instructions. Is there anything we can do to circumvent this limitation?

While browsing through the Instruction Set manual, I noticed something that made me regret my decision to stick to C: an instruction called SWAP, which swaps the nybbles of a byte. So simple, and it only takes one clock cycle! Alas, there's no C operator akin to SWAP; there's nothing I can write in C which can direct the compiler to use SWAP the way I want it. Back to lookup tables, then?

To keep things scientific, I took an eval board, fired up the scope and the IDE, and - after doing a sanity check and calibration using an empty code shell - wrote a strictly unoptimized byte-to-hex conversion routine, just to see how long it would take to run. That code of mine isolated the left nybble in a way that's potentially even worse than the way shown above, because the compiler might choose to perform an algebraic division operation (unsupported by the hardware, of course):

leftNybbleValue = sourceByte / 16; 
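
To give a sense of it, a complete routine in this naive style might look something like the following sketch (illustrative names and structure, not the exact code I benchmarked):

#include <stdint.h>

// Representative sketch of a naive byte-to-hex routine (illustrative,
// not the exact benchmark code). Writes two ASCII characters into 'out'.
void byteToHex(uint8_t sourceByte, char out[2])
{
    uint8_t leftNybbleValue  = sourceByte / 16;   // the "expensive" part
    uint8_t rightNybbleValue = sourceByte & 0x0F;

    // 0-9 map to '0'-'9', 10-15 map to 'A'-'F'
    out[0] = (leftNybbleValue  < 10) ? ('0' + leftNybbleValue)  : ('A' + leftNybbleValue  - 10);
    out[1] = (rightNybbleValue < 10) ? ('0' + rightNybbleValue) : ('A' + rightNybbleValue - 10);
}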

After compiling, I opened the disassembly listing to see what horrors the compiler had generated. And there, at the heart of it, were these two lines:

SWAP R24
ANDI R24, 0x0F

This came from the free version of the vendor's compiler, with the default modest optimization level; and even it knew that a quick way to divide a byte by 16 is to SWAP its nybbles and then AND the result with 0x0F to get rid of the unwanted nybble. Just two instructions, one clock cycle each. How can any algorithm, clever as it may be, top that?

Takeaways 

If you're still interested in dec-hex conversions, read the next section too; here I'll jump straight to the general conclusions. We've all heard the sage advice concerning premature optimization, but in this case, the thing was optimized before I ever touched it. It's not even a matter of someone figuring out a better algorithm than mine (God knows it happened before!) – it's just that the compiler was given enough capabilities to cleverly overcome a limitation, which I thought would be the greatest obstacle for the task. I simply wasted my time.

But here's the million-clock-cycles question: can I trust the compiler to always employ this trick? If a future change in the C code somehow persuades the compiler, without me knowing it, to fall back to a "classic" solution, that will spell some serious problems. I could be extra prudent and use the best C algorithm that doesn't have to rely on the SWAP trick, but who knows what other tricks the compiler may or may not use for that? Should I turn off optimizations altogether? Scan the disassembly output with every new version of the code?

For pessimists, a rule like "Don't use an MCU if it can't handle the task without any kind of software optimization" might work, but it's very extreme, probably unviable too. I suppose that the only realistic way to approach this is to identify any speed-critical parts in the code, where even a small inefficiency can seriously mess things up, and test them (disassembly and actual measurements) with every code revision. Maybe add custom compiler warnings as reminders to do this.
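
One way to leave such a reminder, assuming a GCC-based toolchain such as avr-gcc where the #warning directive is supported, is a note that prints on every build:

// Build-time reminder placed next to the speed-critical conversion code
// (assuming a GCC-based toolchain such as avr-gcc that supports #warning).
#warning "Speed-critical: re-check the disassembly of the hex conversion after any change here"

It doesn't verify anything by itself, of course; it just keeps the issue visible on every compile.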

Isn't that just "unit testing"? As I understand the term, not exactly; here, we're not testing the correctness of the code, but how efficient it is. It's not traditional cycle-counting either, because that's done when the developer has full and direct control over the machine instructions. Bottom line, it's just one more thing to worry about.

Conversion algorithms

First, an important reminder: Any byte value can be converted into a 2-character hex representation, but not every pair of characters represents a valid number (e.g., what's "5V"?), so when converting from hex back to a byte we must have some kind of input validation.

Here's a bird's-eye view of the options for byte-to-hex conversion:

While option 1 above seems the most cycle-efficient (minimum calculations), it's wasteful in terms of memory, and in the context of 8-bit cores we must also remember that accessing the memory space (when it's bigger than 256 bytes) requires implicit 16-bit operations, so the overall benefit might be negligible to non-existent. Option 3's advantage is uniform runtime (no conditional statements), but other than that, I can't identify a clear winner.
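
To make one of these flavors concrete, here's a sketch of a branchless, table-per-nybble conversion (the table and names are my illustration; the left-nybble lookup is exactly where I'd hope to see SWAP show up again):

#include <stdint.h>

// Sketch of a branchless byte-to-hex conversion: one 16-entry table,
// one lookup per nybble (illustrative names).
static const char HEX_CHARS[16] = "0123456789ABCDEF";

void byteToHexTable(uint8_t sourceByte, char out[2])
{
    out[0] = HEX_CHARS[sourceByte >> 4];    // left nybble
    out[1] = HEX_CHARS[sourceByte & 0x0F];  // right nybble
}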

For converting from hex to byte, here are our general options:

This, too, is basically just an exercise in "how many ways can you think of to do X"; the actual differences in resource usage are negligible (if we disregard the crazy option 1), especially considering that options 2 and 3 include multiplication by 16, which the compiler might achieve with SWAP as well. If we choose to have lookup tables for the hex characters, we also need to make a decision regarding their offset – whether they'll start at index 0, wasting some space (given that the lowest valid character, "0", is 48 in ASCII), or whether we'll pay a tiny price and subtract 48 from each character before accessing a more compact table.
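
As an illustration of that compact-table flavor, with the offset subtraction and the input validation mentioned earlier, something like this sketch would do (table contents and names are mine; lowercase a-f is deliberately left out):

#include <stdbool.h>
#include <stdint.h>

// Sketch of hex-to-byte conversion using a compact lookup table that starts
// at '0' (ASCII 48); 0xFF marks characters that are not valid hex digits.
// Handles '0'..'9' and 'A'..'F' only (illustrative).
static const uint8_t HEX_VALUES['F' - '0' + 1] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9,              // '0'..'9'
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,  // ':'..'@' (invalid)
    10, 11, 12, 13, 14, 15                     // 'A'..'F'
};

// Converts two hex characters to a byte; returns false on invalid input.
bool hexToByte(char high, char low, uint8_t *result)
{
    if (high < '0' || high > 'F' || low < '0' || low > 'F')
        return false;                          // outside the table entirely

    uint8_t hi = HEX_VALUES[high - '0'];
    uint8_t lo = HEX_VALUES[low - '0'];
    if (hi == 0xFF || lo == 0xFF)
        return false;                          // a gap character like ';'

    *result = (uint8_t)((hi << 4) | lo);       // hi * 16, possibly via SWAP again
    return true;
}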

Did I miss any clever algorithms? Is hex even still a thing in your work? The comment section is open 😊


