Data Types for Control & DSP

Tim WescottApril 26, 20166 comments

There's a lot of information out there on what data types to use for digital signal processing, but there's also a lot of confusion, so the topic bears repeating.

I recently posted an entry on PID control. In that article I glossed over the data types used by showing "double" in all of my example code.  Numerically, this should work for most control problems, but it can be an extravagant use of processor resources.  There ought to be a better way to determine what precision you need out of your arithmetic, and what sorts of arithmetic you can use on your processor.

This blog post seeks to answer two questions: "what data types must I use to get the job done?" and "what data types are fast enough on my processor to get the job done?"  If you're lucky, there is some overlap in the answers to those two questions and you can go out and design a controller that works.  If you're not that lucky, at least you know that you either need to seek out a new, faster processor, or perhaps a new, more forgiving job.

This article is available in PDF format for easy printing

All of these issues are discussed in more depth in my book, Applied Control Theory for Embedded Systems.

For the purposes of this discussion the world of data representation is divided into three: floating point types, integer types, and fixed-point types.  You should know what floating point and integer representation are.  Fixed-point representation is just a generalization of integer representation, with the weights of the bits (possibly) scaled such that the least-significant bit is a fraction.  The notion of non-integer fixed-point arithmetic is discussed in this Wikipedia article: the short story is that if I talk about a number in Q0.31 notation I mean a fixed-point number in the range $-1 < x < 1$.

Each of these three types has advantages and disadvantages.  Floating point arithmetic is conceptually easy, but on all but desktop-class processors it is slower than integer or fixed-point arithmetic, and it has some subtle "gotchas" that can bite you if you're not careful. Integer arithmetic uses familiar data types, but in general the scaling for signal processing (and, hence, control systems) is all wrong.  Non-integer fixed-point types are not directly supported in common languages, and can be hard for a beginner to wrap their heads around, but are close to optimal for a wide range of problems.

For all of these data types, you have to worry about quantization, for floating point numbers you have to worry about varying quantization effects, and for integer and other fixed-point data types you need to worry about overflow.

For the purposes of this post, I'll make an example PID controller.  I'll use $u$ to mean the measurement of the controlled variable, $ut$ to be the target value of the controlled variable, and $y$ to mean the controller output.  For all variables, $x_n$ means the value of $x$ at sample time $n$.  The variables $k_i, k_p, k_d, k_{dp}$ are the integrator gain, the proportional gain, the derivative gain, and the derivative band-limiting factor, respectively.  The math for this controller is

$$xi_n = xi_{n-1} + k_i \left ( ut_n - u_n \right )$$

$$xd_n = xd_n + k_{dp} \left ( \left ( ut_n - u_n \right ) - xd_{n-1} \right )$$

$$y_n = xi_n + k_p u_n + k_d k_{dp} \left ( \left ( ut_n - u_n \right ) - xd_{n-1} \right )$$

The first problem that crops up with this algorithm is the integrator gain. For most control systems, the integrator gain is much less than 1. This means that the factor $k_i \left ( ut_n - u_n \right )$ is, in general, small.  Moreover, as you increase the sampling rate, you need to adjust the integrator gain downward.  It is up to you to insure that for the smallest possible value of $u_n$, the factor $k_i \left ( ut_n - u_n \right )$ fits in the data type that you have chosen for the integrator state, $xi$.

As a concrete example, consider a system that uses a 16-bit ADC to measure the plant's output variable. Further, let's assume that we scale this output variable to a range $0 \le u_n < 1$.  If $n_{ADC}$ is the ADC reading that ranges from 0 to 65535, then we calculate $u_n = \frac{n_{ADC}}{65536}$. Now, further assume that the integrator gain is a not-unreasonable $k_i = 0.0002$, and that the integrator state can fall in the range $-1 < xi < +1$.

With this example, the smallest increment of the ADC can be $\frac{1}{65536}$. This, in turn, means that the smallest increment of the factor $k_i \left ( ut_n - u_n \right )$ can be $\frac{0.0002}{65536}$, or about $3 \cdot 10^{-9}$.

If you store $xi$ in a 32-bit IEEE floating-point variable, then the mantissa has an effective length of 25 bits. When $xi$ has an absolute value greater than $\frac{1}{2}$, the smallest increment that can be added into $xi$ is $2^{-26}$, or about $15 \cdot 10^{-9}$.  That's about five times larger than the smallest increment that may occur.

What does all this razz-matazz with numbers mean?  It means that in this circumstance, the integrator term of your PID controller is missing out on potentially important information in the feedback.  This, in turn, could result in your system getting "stuck" at the wrong value until the error grows to objectionable amounts.  In a real-world system, this would mean that you might see a small amount of random drift around your desired target point, or a small oscillation around your desired target point.  

To make sure this doesn't happen, you should make sure that the smallest increment that will register in your integrator state is as small or smaller than the smallest increment that can be presented to it. Better yet, make sure that the smallest increment that will register on your integrator state is smaller than about $\frac{1}{8}$ of the smallest increment that will be presented to it.

  • Determine the smallest increment that your integrator state will absorb. 
    • For an integer, this increment is 1.
    • For a signed fractional number that ranges from -1 to 1, with $n$ bits this increment is $2^{-(n-1)}$, or $2^{-31}$ for a 32-bit number.
    • For a 32-bit IEEE floating point number ("float" in most C compilers) that ranges from -1 to 1, this increment can be as high as $2^{-25}$.  The increment isn't a constant -- this is one of the lovely things about using floating point.
    • For a 64-bit IEEE floating point number ("double" in most C compilers) that ranges from -1 to 1, this increment can be as high as $2^{-54}$.  Again, the increment isn't a constant.
  • Determine the smallest increment that you will present to your integrator. This will be the smallest increment of your sensor (usually an ADC, but you know your system), multiplied by any pre-scaling factors you may apply, then multiplied by the integrator gain.
  • Check which number is bigger, and by how much -- if the smallest increment you'll ever present to the integrator state is eight times bigger than the smallest increment it can register, then you're probably OK.

Astute readers will notice that there's a problem with the controller equation that I show if you're using integers -- when the smallest increment that an integrator can register is $\pm 1$, then you need to swap things around.  In this case, you should refrain from scaling the ADC output: let $u_n = n_{ADC}$.  Then, move the integrator gain:

$$xi_n = xi_{n-1} + \left ( ut_n - u_n \right )$$

$$xd_n = xd_n + k_{dp} \left ( \left ( ut_n - u_n \right ) - xd_{n-1} \right )$$

$$y_n = k_i xi_n + k_p u_n + k_d k_{dp} \left ( \left ( ut_n - u_n \right ) - xd_{n-1} \right )$$

Now, your integrator state will always register the smallest change in the ADC.  You will have to scale all of your variables for the convenience of the mathematics rather than your own convenience, but it'll work.

With fixed-point numbers, quantization is fixed -- either the smallest increment you're presenting to the integrator state is too small to be registered, or it's not. Life isn't so easy with floating point.  With floating points, if the value of a state (such as $xi$) is small then the smallest increment that you can add in to it is also small.  But as the value of the state grows the smallest increment you can add in also grows -- so if you're dealing with floating point numbers you need to do your calculations based off of the maximum value that the state can take (or the maximum value that you allow the state to take).

Floating point numbers only have problems with getting too big when they grow past the point where small changes in the system inputs can affect them properly.  Fixed point numbers, however, can have much more dramatic problems.  The problem is called overflow.

Consider the C code snippet:

int a = 30000;

printf("The number is %d\n", a + 2768);

Can you say what the output will be?  You can't, really.

If you try this on a normal PC, the output will be "32768".  However, if you can find a system that uses 16-bit integers and 2's compliment notation (and that has a working printf), the output will most likely be "-32768". The reason for this is that 32768 does not fit into a 2's compliment signed 16-bit integer, and because C tends to be pretty bone-headed about handling this situation.  The phenomenon that we've just seen is called overflow.

If you are designing a digital control system (or any digital signal processing system) you need to either design your data paths so that overflow simply cannot happen, or you need to make sure that overflow is handled gracefully.

Designing data paths so that overflow cannot happen is beyond the scope of this paper.  If you understand the relevant signal processing theory, you can start from the control system as designed and the maximum possible ranges of all the inputs, and you can compute the largest value for each of the states and intermediate values in the system.  If those largest values are all smaller than anything that will overflow, then you don't have to worry.

Designing data paths that handle overflow gracefully is conceptually more simple: at each step in the computation that might result in an increased value (whether it's a state variable or an intermediate value), you test for overflow, and you deal with it gracefully. I have found that in C and C++, the way to do this is to test for overflow and if it happens, let the result take on the greatest value allowed by the data type.

This overflow handling is detailed in my book, but I can give an example using integer math.  In this case I'm defining that "overflow" is anything that results in a value greater than INT_MAX/2.

int add(int a, int b)
  // Assume that a and b are both smaller than INT_MAX/2
  int x = a + b;
  if (x > INT_MAX / 2)
    x = INT_MAX / 2;
  else if (x < -(INT_MAX / 2))
    x = -(INT_MAX / 2);

  return x;

There are more sophisticated ways of handling this, especially if you're willing to do some assembly-language programming, but the above code shows the idea.

So far we've dealt with the "what does my data type need to do?" side of the question. The other side of the question is "what can my processor do?"  I would like to be able to give you some firm guidelines on how to find this out before the fact -- but I can't.  I can give you some rules of thumb to narrow your choices down, but in the end analysis you'll need to write a sample controller, with your intended data types, and then benchmark its performance to figure out how much of the available processor resources it consumes.

The rules of thumb that I can give you are:

  • Doing the work in fixed-point math is almost always faster than floating-point math done in software.
  • If you have floating-point hardware, it may or may not be faster than fixed-point math (and if it's slower, it'll be a lot closer).
  • Be careful when a processor claims to have floating-point hardware.  32-bit floating point hardware is much more common than 64-bit, and you often have to do some strenuous digging to figure out that you're only going to get 32-bit.
  • 16-bit fixed point will work for a few systems. 32-bit floating point gives more precision than 16-bit fixed point, but less than 32-bit fixed point. 64-bit floating point will probably give you more precision than you'll need -- if this isn't the case please hire me, I want to work on your project!
  • It always takes more clock cycles than you think -- benchmark.

It's a good idea to do your benchmarking early in a project -- you'll often find yourself either needing to adopt fixed-point math, or needing to buy an entirely different processor.  If you are a manager, drive your team to choose a processor candidate early, then buy an evaluation board and drive them to benchmark their candidate software.  If you are a line engineer, then do everything you can to get your hands on an appropriate evaluation board and give it a whirl.  In all cases, try to build time into the project to absorb a false start on the processor selection.

In a perfect world -- at least for the lazy programmer -- you'll always be able to use a processor that can keep up with the system sampling rate while using 64-bit floating point math. While this can happen (sometimes even when you're using a slow 8-bit processor!), you'll often be stuck with using fixed point math of some type, to make your speed numbers.

[ - ]
Comment by Myles TweteMay 1, 2016
Great blog post, Tim! :)

Say, I'm not understanding your PID equations exactly, so please enlighten.
The way I see it, the Proportional constant "Kp" in the last equation should be multiplied by the error ( "Utn - Un" ) and not simply the measured process variable "Un". Also, the derivative state equation seem odd to me, especially in its use of "Xdn" on both sides of the equation. Shouldn't the one on the right side of the derivative state equation be "Xd(n-1)" instead of Xdn ?

Anyway, great read---thanks!

[ - ]
Comment by Tim WescottApril 25, 2016
Don't be shy about commenting if this doesn't all make sense -- it makes perfect sense to me, but it's the end of a long day. So it may be a bit raw. Ask, and I'll either clarify in the comments or I'll edit the document.

[ - ]
Comment by Tim WescottApril 26, 2016

I don't think you really get this until you write some code and something horrible happens. I saw a picture once of a log smashed into the operator's console of a saw mill. The tree was larger than 65.535 inches -- you can probably fill in the blanks, or at least guess at the bug.
[ - ]
Comment by jyaApril 26, 2016
Thanks, Tim. I won't try to explain this anymore, I'll just refer people to your post and maybe help to work out an example. Nobody gets this just by reading about it. As with so many subjects, one must actually do it in order to understand.

[ - ]
Comment by yatesSeptember 23, 2016
Hey Tim, great blog! One thing I might add is that the technique of "handling overflow gracefully" is commonly called "saturation". Another is that those conditionals in your add() routine can really slow down the basic add.
[ - ]
Comment by yatesSeptember 23, 2016
Great blog, Tim. One thing I might add is that the function you're performing "handling overflow gracefully" is commonly called "saturation," and that all those conditionals that implement the saturation in your add() function can really slow down the computations. Caveat programmer!

To post reply to a comment, click on the 'reply' button attached to each comment. To post a new comment (not a reply to a comment) check out the 'Write a Comment' tab at the top of the comments.

Please login (on the right) if you already have an account on this platform.

Otherwise, please use this form to register (free) an join one of the largest online community for Electrical/Embedded/DSP/FPGA/ML engineers: