EmbeddedRelated.com
Forums

Is char signed/unsigned and is >> on signed defined?

Started by MaxMaxfield 4 years ago, 34 replies, latest reply 4 years ago, 126 views

Hi chaps and chapesses -- sorry to bug you again -- this is a follow-on to my previous question: What size variables are best for 8-bit, 16-bit, and 32-bit MCUs? ( https://www.embeddedrelated.com/thread/11297/what-... ).

I'm a hardware design engineer by trade, so I'm fighting to learn the software side of things. I have a couple of questions.

-----------------

Question #1: This is in regard to the char type. I know it's supposed to be used to represent a character, but someone told me that the C standard doesn't define if it should be treated as signed or unsigned if you use it in math operations, so each compiler can treat it differently.

Is this indeed the case?

-----------------------------

Question #2: My first job was as a member of a design team for CPUs on mainframe computers, and my first task was to implement a barrel shifter/rotator. At that time I was taught that there were logical and arithmetic shift left and shift right operations.

Both logical and arithmetic shift left insert 0 into the least-significant bit (LSB), as does a logical shift right. It's only the arithmetic shift right that replicates the sign bit in the most-significant bit (MSB). Of course, we were working at the level of assembly and machine code -- I didn't know how the higher level language compilers implemented things.

Based on my earlier experience, and on what I've seen, using the >> operator in C/C++ on an unsigned integer would shift 0s into the MSB(s), while using it on a signed integer would shift copies of the original sign bit into the MSB(s).

However, someone told me that the effect of using >> on signed integers isn't actually formally defined in the C/C++ standard, so -- although most compilers would implement things as I suggest -- some may not.

Is this indeed the case?

----------------------

I really appreciate your thoughts and feedback on this stuff -- thanks in advance -- Max

[ - ]
Reply by indigoredster, June 8, 2020

I would also suggest using 

#include <stdint.h>

This header clarifies the width of the variables you are using. I usually avoid using shifts on signed integers.

[ - ]
Reply by MaxMaxfield, June 8, 2020

Hi indigoredster -- thanks for the feedback -- it's a shame (a) these holes were left in the original spec and (b) they haven't been patched over the years -- Ah well, such is life :-)

[ - ]
Reply by Bob11, June 8, 2020

Actually, all those questions ARE answered formally in the C/C++ ISO standard. Here's an excerpt from a 2006 draft:

"The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1×2^E2, reduced modulo one more than the maximum value representable in the result type. If E1 has a signed type and nonnegative value, and E1×2^E2 is representable in the result type, then that is the resulting value; otherwise, the behavior is undefined.

The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1/2^E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined."

Those are two phrases you will find a lot in the corners of C/C++: "the behavior is undefined" and "the result is implementation-defined." In essence, one of the more important aspects of learning C/C++ is learning what falls into one of those two categories, and avoiding them. The signedness of char is another example of 'implementation-defined', and these days char is used mostly just to define simple ASCII strings where the top bit is unused. Best practice now is to include stdint.h and use uint8_t and int8_t to spell out exactly what you want if you need to do math on 8-bit quantities. (A LONG time ago I programmed on a CDC Cyber based on 6 and 12 bit characters. ISO/POSIX C at least guarantees a char is 8 bits today.)
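As a rough illustration of the stdint.h advice, here is a minimal sketch, assuming a C99 compiler that provides the exact-width 8-bit types. The function names plain_char_is_signed and sum_samples are made up for this example; CHAR_MIN comes from <limits.h> and is 0 when plain char is unsigned.

```c
#include <limits.h>
#include <stdint.h>

/* Returns 1 if plain char is signed on this compiler, 0 if it is unsigned.
   CHAR_MIN is 0 for an unsigned plain char and negative otherwise. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Sums 8-bit samples using explicit types (hypothetical helper), so the
   result does not depend on the signedness of plain char. */
int32_t sum_samples(const int8_t *samples, uint8_t count)
{
    int32_t total = 0;
    for (uint8_t i = 0; i < count; i++) {
        total += samples[i];
    }
    return total;
}
```

Had the samples been declared as plain char, the same sum could come out differently on a compiler where char is unsigned, which is the reason for spelling out int8_t.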

The nice thing about C/C++ is if you NEED to do an ASL, you can just drop some assembly code inline in just about every compiler out there :-)

[ - ]
Reply by MaxMaxfield, June 8, 2020

Now I know why I became a hardware engineer rather than a software developer LOL

[ - ]
Reply by Kocsonya, June 8, 2020

> ISO/POSIX C at least guarantees a char is 8 bits today

Well, I wouldn't be that sure of that:

An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative.

and that's about it. The header <limits.h> must contain a definition of CHAR_BIT which tells you how many bits are in a char. The word 'byte' is used in the standard as the unit exactly CHAR_BIT wide. Then Annex E tells you that CHAR_BIT must be at least 8, but it can be more. Similarly, a short or an int is a minimum of 16 bits, a long is a minimum of 32 and a long long a minimum of 64, but you can always go higher, as long as the char is not wider than a short, which is not wider than an int, which is not wider than a long, which is not wider than a long long. As far as the standard is concerned, you are free to implement every integral type as a 73-bit quantity, if your latest WeirdStuff family CPU core can only handle 73 bit words.
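To make those guarantees concrete, here is a small sketch that asks the compiler to confirm them. It assumes a C11 compiler (for _Static_assert); bits_in is a made-up helper name, while CHAR_BIT and the *_MAX macros are the standard ones from <limits.h>.

```c
#include <limits.h>
#include <stddef.h>

/* Width in bits of an object of the given size on this implementation
   (hypothetical helper; counts any padding bits as well). */
size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}

/* The build fails if any of these guarantees does not hold. */
_Static_assert(CHAR_BIT >= 8, "a char must have at least 8 bits");
_Static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
_Static_assert(SHRT_MAX <= INT_MAX, "short must not exceed the range of int");
_Static_assert(INT_MAX <= LONG_MAX, "int must not exceed the range of long");
_Static_assert(LONG_MAX <= LLONG_MAX, "long must not exceed the range of long long");
```

Calling bits_in(sizeof(int)) reports the actual width on the target, for example 16 on many 8-bit and 16-bit MCUs and 32 on most 32-bit ones.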


[ - ]
Reply by Bob11, June 8, 2020

You're right. I should have said 'at least 8 bits'. The C standard is squishy enough even for the next-generation WeirdStuff embedded quantum computer :-)

[ - ]
Reply by MaxMaxfield, June 8, 2020

There's a "long long"? Where will this madness end?

(Does every C compiler support the long long data type?)

[ - ]
Reply by jms_nh, June 8, 2020

Yeah, it's kind of silly, but it works. On a 16-bit embedded processor, for example, you usually see int = 16-bit, long = 32-bit, long long = 64-bit.

Avoid these and use the types in stdint.h instead.

[ - ]
Reply by mr_bandit, June 8, 2020

I saw a posting about:

long long ago;  // in a galaxy far far away

Note: every modern C compiler supports long long. 

Question 1: The C standard specifies that the signed-ness of char is implementation-defined. However, the suggestion of <stdint.h> is spot-on. Also, many C compilers have a command-line option to specify the signed-ness (GCC, for example, has -fsigned-char and -funsigned-char).

Question 2: if the MS bit of a signed variable (could be int/char/long/..) is set, the variable contains a negative value. Shifting right a negative value needs to have a negative result. So the MS bit will be set.

Get to know the "undefined" and "unspecified" lists. You will live or die by them. Don't do bad things!

Remember KISS !!  Put the cleverness in the algorithm. Keep the code simple. It should be at the "run-dick-run see-dick-run" level. If you need to do something tricky, document it !! Why and how it works.

Also - use asserts. For example: assert(sizeof(int) == 2); for a 16-bit.

typedef struct { uint16_t type; /* always TYPE_FOO */  /* more fields */ } FOO;

#define TYPE_FOO 0xBEEF

FOO foo = { TYPE_FOO /* , rest of field inits */ };

void bar( FOO *foo ) { assert(foo); assert(foo->type == TYPE_FOO); /* etc */  }
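The type-tag pattern above can be fleshed out into something that compiles. This is only a sketch under assumptions: the value field and the foo_init and foo_value functions are invented for illustration and are not from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TYPE_FOO 0xBEEFu

typedef struct {
    uint16_t type;   /* must always hold TYPE_FOO */
    int      value;  /* hypothetical payload field */
} FOO;

/* Sets the tag in exactly one place (hypothetical initialiser). */
void foo_init(FOO *foo, int value)
{
    assert(foo != NULL);
    foo->type  = TYPE_FOO;
    foo->value = value;
}

/* Every function taking a FOO checks the pointer and the tag first, so a
   stray or uninitialised pointer is caught at the point of use. */
int foo_value(const FOO *foo)
{
    assert(foo != NULL);
    assert(foo->type == TYPE_FOO);
    return foo->value;
}
```

Passing a pointer to some other structure, or to a FOO that was never initialised, will most likely trip the second assert instead of silently reading garbage.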

See "Writing solid code"

BTW: the C standard is pricey. However, the last draft before the actual standard is free, with no real changes before the actual standard.

[ - ]
Reply by MaxMaxfield, June 8, 2020
"See 'Writing solid code'"  Is that a book (or a stage play)?
[ - ]
Reply by mr_bandit, June 8, 2020

I love your British humour! It would be great as a stage play - a nerd's "Hamilton" !! (I am hoping for a redneck version of Shakespeare: "Romeo, Romeo, Where in the hell are you?" "Lord, that dude's an idiot". )

Sadly, "just" a book.

He goes into the methods of using the assert() macro. 

The main difference I have with him is he has the sequence

Run the test suite
remove assert() by changing the macro
Run the test suite

I always leave the asserts in - and I do mission-critical embedded systems.

On several projects, I made the assert save the cause to a log and reboot the system.
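A "log and reboot" assert could look roughly like the sketch below. Everything here is an assumption for illustration: SYS_ASSERT, assert_failed, the record layout and the system_reboot stand-in are invented names, and a real system would store the record somewhere that survives a reset and call the actual platform reset routine.

```c
#include <stddef.h>

/* Hypothetical record of the most recent failed assertion. */
typedef struct {
    const char *file;
    int         line;
    const char *expr;
} assert_record_t;

static assert_record_t last_failure;  /* real code: no-init RAM or flash */
static int reboot_requests;           /* lets this sketch be observed    */

/* Stand-in for the platform reset; a real one would not return. */
static void system_reboot(void)
{
    reboot_requests++;
}

/* Saves the cause, then restarts the system. */
void assert_failed(const char *file, int line, const char *expr)
{
    last_failure.file = file;
    last_failure.line = line;
    last_failure.expr = expr;
    system_reboot();
}

/* Stays active in release builds, unlike assert() under NDEBUG. */
#define SYS_ASSERT(cond) \
    ((cond) ? (void)0 : assert_failed(__FILE__, __LINE__, #cond))

const assert_record_t *assert_last_failure(void) { return &last_failure; }
int assert_reboot_requests(void) { return reboot_requests; }
```

After a restart, the startup code could read the saved record and report which expression failed and where.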

I had one gig where, a couple of years after I left (my part was done), I had a chat with an engineer on the project. He said they had made a change and one of my asserts kicked. They were easily able to determine the reason and make the fix.

There are two main causes of bugs (like 70..80%): not setting up data structures correctly, or not walking them correctly. Asserts catch these cases.

Writing Solid Code (20th anniversary 2nd edition, paperback), January 1, 2013, Steve Maguire
[ - ]
Reply by MaxMaxfield, June 8, 2020

Oooh, a "Nerds Hamilton" -- I can imagine the queues of nerds stretching down the street and round the corner. I just took a look on Amazon -- this book does look interesting (probably over my head), but I don't have $34 to throw around at the moment -- I'll keep my eyes open for a second-hand version. Thanks for sharing :-)

[ - ]
Reply by jms_nh, June 8, 2020

Great information but please highlight the importance of avoiding undefined behavior, as compilers can do weird things because they are trying to optimize code and assume undefined behavior can never occur.

Implementation-defined behavior, on the other hand, just requires the compiler to choose and document a particular behavior.

[ - ]
Reply by Kocsonya, June 8, 2020

Yes, you are right. UB cannot be warned against enough times.

There is implementation-defined behaviour, which is something that the standard leaves to the compiler writers but obliges them to document their choice and what the compiler will do. It's quite understandable.

There is unspecified behaviour, where the standard explicitly states that the compiler can do either this or that, at its own discretion (the compiler can switch between the choices at will), so you must not rely on which way it is done (e.g. in what order arguments to a function are evaluated). No surprise with that one either.

Undefined behaviour is different. Basically, it means that the compiler can do whatever it wants (yes, anything, including arbitrarily changing your code!) and it doesn't have to issue a warning. For a contrived example, let's say your CPU has the initial stack pointer at 0x00000000, followed by the reset vector and the various other exception vectors (which is the case for an ARM or an m68xxx core). If you write this:

#include <stdio.h>

void PrintVector( int *p )
{
  if ( p == 0 )
    printf( "Initial SP=%08x\n", (unsigned) *p );
  else
    printf( "Vector at %p = %08x\n", (void *) p, (unsigned) *p );
}

the compiler can get rid of the condition and the first printf(), reducing the function to the second printf(). If you pass a non-NULL pointer to the function, it will work. If you pass a NULL pointer, then the first printf() would dereference it, and dereferencing a NULL pointer is undefined behaviour. Therefore, the compiler can do whatever it wants, including simply removing the test and the printf(). 

And in fact gcc does exactly that, or, in some cases (depending on how your function exactly looks) it retains the if() but replaces the call to the first printf() with an invalid instruction code (which will cause the chip to spit the dummy at run-time). And it does not give you a warning!

Now, if you look at the C standard it has undefined behaviour sprinkled all over it. At the end they give you a list of cases that were specified as undefined behaviour. The list contains over a hundred items. Most of them are obvious and no sane person would ever do that. But maybe 10-20 cases are simple things, like the NULL dereference, left-shifting negative numbers, arithmetic overflow, stuff like that, which you may easily do because you know what the compiler could (and you think should) compile it to. Then you've just built a bomb with a hair trigger into your software.
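For the arithmetic overflow case, one common way to stay out of undefined behaviour is to test the operands before doing the operation. The sketch below assumes C99; checked_add is a made-up name, and INT_MAX and INT_MIN come from <limits.h>.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b only if the sum fits in an int. The range test runs before
   the addition, so an overflowing addition is never evaluated. */
bool checked_add(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) ||
        (b < 0 && a < INT_MIN - b)) {
        return false;   /* would overflow: *sum is left untouched */
    }
    *sum = a + b;
    return true;
}
```

Checking after the fact (for example, testing whether a + b came out smaller than a) is not a safe alternative, because by then the overflow has already happened and the compiler is allowed to assume it never does.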

C and unix go hand in hand and the old unix mantra "RTFM" doubly applies to C, except that it's the standard, not a man page.
[ - ]
Reply by cprovidenti, June 8, 2020

The reference for these kinds of questions is Andrew Koenig: C Traps and Pitfalls.

As I recall, the answers are exactly as you were told, sadly enough.

And don't get me started on how division truncates when the result is negative! If that matters to you, look up "man (3) div" (i.e., search the web for that).

[ - ]
Reply by MaxMaxfield, June 8, 2020

Do you know any examples of compilers that DO NOT shift copies of the sign bit in when performing a >> operation on signed integers?

[ - ]
Reply by cprovidenti, June 8, 2020

Good question. (I, for one, do not know.)

[ - ]
Reply by MaxMaxfield, June 8, 2020

I'm good at asking questions -- it's when answering them that I fall down LOL

[ - ]
Reply by MaxMaxfield, June 8, 2020
I see "The quotient is rounded toward zero."  I'm assuming this is because the rounding is basically a truncation operation, but this will make positive numbers round down and negative numbers round up -- hmmm


Thanks for the suggestion re the book -- I just ordered it

[ - ]
Reply by cprovidenti, June 8, 2020

Without the use of the div function, negative quotients may be rounded towards zero or away from zero, which makes implementing a portable "round quotient to nearest whole integer" function (or macro) difficult. (Probably not an issue unless one's MCU lacks an FPU and suffers from a shortage of program storage...like mine, ouch.)

E.g., -9/2 might result in -4 in some cases, -5 in others. By contrast, div will always produce a quotient of -4.

[ - ]
Reply by Kocsonya, June 8, 2020

That should not be the case. 

The C standard says that the result of an integer division must be "the algebraic quotient with any fractional part discarded" and then qualifies it in a footnote that this is "truncation towards zero". 

If your compiler generates code that gives -9/2 -> -5, then that compiler is broken.
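Given truncation toward zero, other rounding modes can be built on top of / and %. The following is a sketch assuming C99 or later semantics; floor_div and round_div are invented names, and both assume d is not zero, that d is not INT_MIN, and that the division itself does not overflow (so not INT_MIN / -1).

```c
#include <stdlib.h>

/* Floor division: rounds the quotient toward negative infinity. */
int floor_div(int n, int d)
{
    int q = n / d;   /* truncated toward zero */
    int r = n % d;   /* zero, or the same sign as n */
    if (r != 0 && ((r < 0) != (d < 0))) {
        q -= 1;      /* inexact and signs differ: step down by one */
    }
    return q;
}

/* Round-to-nearest division, with halves rounded away from zero. */
int round_div(int n, int d)
{
    div_t qr = div(n, d);
    int abs_rem = qr.rem < 0 ? -qr.rem : qr.rem;
    int abs_d   = d < 0 ? -d : d;
    /* 2*|rem| >= |d|, rearranged so the comparison cannot overflow. */
    if (abs_rem >= abs_d - abs_rem) {
        qr.quot += ((n < 0) != (d < 0)) ? -1 : 1;
    }
    return qr.quot;
}
```

With these, -9/2 stays -4, floor_div(-9, 2) gives -5, and round_div(-9, 2) gives -5 because -4.5 is rounded away from zero.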


[ - ]
Reply by Kocsonya, June 8, 2020

The answer is the C standard...

Char can be signed or unsigned, depending on the compiler. You can force it either way by declaring the variable signed char or unsigned char, or, as suggested by the standard, include <stdint.h> and use uint8_t or int8_t for an unsigned/signed 8-bit quantity instead. Note that the width of 'char' is not fixed by the standard; it is only required to be at least 8 bits.

Shifting right an unsigned value, or a signed value where the top bit is 0, must result in 0s being shifted in; shifting right a negative signed value is implementation-defined, i.e. it can shift in either 0 or 1.

Shifting left always shifts in 0s, *but* if you shift left a signed quantity which is either negative to start with or which the shift would make negative, that's undefined behaviour (i.e. the compiler can do whatever it wants, including rewriting your code, without even issuing a warning).
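If the code really needs an arithmetic shift right, or a left shift that may push bits out of the top, both can be written so that only well-defined operations are used. This is a sketch with made-up names (asr32, lsl32), assuming int32_t and uint32_t exist and that the shift count is between 0 and 31.

```c
#include <stdint.h>

/* Arithmetic (sign-extending) shift right that never applies >> to a
   negative operand. The result is floor(x / 2^n). */
int32_t asr32(int32_t x, unsigned n)
{
    if (x >= 0) {
        return x >> n;             /* well defined for nonnegative x */
    }
    /* -(x + 1) equals ~x in two's complement; it is nonnegative and
       cannot overflow, so shifting it is well defined. */
    int32_t flipped = -(x + 1);
    return -(flipped >> n) - 1;
}

/* Left shift done on an unsigned type, where the result is defined as
   the value reduced modulo 2^32. */
uint32_t lsl32(uint32_t x, unsigned n)
{
    return (uint32_t)(x << n);
}
```

On compilers that already implement >> on negative values as an arithmetic shift, asr32 simply returns the same answer, and an optimizer may well reduce it to a single shift instruction.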

The C standard tries to cater for every CPU that ever existed or might exist and as a result it is full of booby-traps. Undefined behaviour, which can arise from such simple things as shifting a negative number or integer arithmetic overflow, is absolutely evil. What is even more sinister is that compiler writers use those cases for aggressive optimisation and the compiler occasionally completely rewrites your code due to a seemingly innocent construct being declared U.B. by the standard.

Reading the standard is an eye-opener and positively recommended. It costs money, but the last draft before the official publication can be found on the 'Net for free and apart from minor stylistic changes it's the same as the real thing (I have both and compared them).

[ - ]
Reply by MaxMaxfield, June 8, 2020

Awesome info -- thanks Kocsonya

[ - ]
Reply by CustomSarge, June 8, 2020

Howdy, Yes and Yes - It's why I write assembler... 40+ yrs and counting.

HMI is a Lot easier in a higher level language and I can do that. But functional embedded is faster and cleaner in .asm (YMMV). Without any compiler assumptions or rules, I do stuff that a compiler would Never allow, and some coders frown on as well :(

Good Hunting   <<<)))

[ - ]
Reply by Dilberto, June 8, 2020

Hi, @CustomSarge!

Not absolutely related to the question asked by @MaxMaxfield, but I thought it was worthwhile to comment on your post.

I agree with you on many things but, as everybody knows, there is no such thing as a free lunch.

In my early days as an engineer, I began programming the famous Z80 and the no less famous 8031 in assembly language, but soon jumped to PLM ( did you know PLM? ) and C a few years after.

In my case, for instance, I make projects with very tight budgets and I have to choose the µC literally by the price.

For example, in the recent past, I've made projects with the Silabs' EFM8 ( 8 bits, 8051 architecture ), Microchip's ATtiny ( 8 bits, RISC ), and ST's STM32 ( 32 bits, ARM Cortex ).

Hence, I've developed a framework including the most common middleware used in these projects, like character LCD displays, Monochrome graphic LCD displays, 7 segments LED displays, NTC temperature sensors, matrix keyboards, and so on.

Guess what programming language I used to make the framework portable and easily adaptable to this plethora of µCs!

I admit that, in some cases, where you must get the most out of your chip, in terms of data throughput or memory utilization, maybe you have no alternative to assembly language.

Happily for me, I've found processors that are cheap enough to fit in the budget and powerful enough to be used with a high-level programming language.

I don't know how long this luck will last.     :-)

Cheers!


[ - ]
Reply by MaxMaxfield, June 8, 2020
I pity the person who is tasked with maintaining your code when you retire LOL
[ - ]
Reply by CustomSarge, June 8, 2020

Having been there when my colleague died, trying to port his code was silly - no and I mean NO documentation or commentary in the code.

Writers of assembler document Hell out of Everything - they MUST. In a year or 2 even the author can't follow Jack without it. I was arrogant enough to not think so in the late '70s, paid the price and considered it tuition for an important lesson.  L8R  <<<)))

[ - ]
Reply by mr_bandit, June 8, 2020

My father (at one point one of the top OS designers in the world) taught me programming - in assembly. My first real code was 8080 asm.

The technique he taught me is:

// all incoming registers and what they contain (eg results of above code)
// R0 = ...
// R1 = ...  (etc)
// what the next section would do (how, why)
// what registers would be used
// R2 = scratch register
// R3 = input param to function foo()
// R4 = ...

the asm code - should be 10..12 lines max - 

// all incoming registers and what they contain (eg results of above code)
// etc

That way, if you make a change in a code block, you have a sanity check on how it affects the next block. It explicitly specifies the assumptions.

Also - you can design this way before writing any actual ASM code.

He also taught me a very important lesson on commenting code:

Make your comments with the assumption the next person is a maniacal serial killer with a hair trigger - and your name is in the header. Note that might be you in 6 months.

side note: I had to maintain asm code written by an EE. It was not pretty. I did what I normally do with poorly documented code: I documented it. A good way to learn the code.

Max: With apologies, but hardware people should not be allowed to write code. You do great HW. I have noticed EEs generally do not have a clue on designing interfaces to their hardware - I have written over 100 device drivers. I have used "vocabulary" on a number of HW interface designs. I have explicitly done the interface design and given it to the EE. (granted, this is easier to do on an FPGA design.)

[ - ]
Reply by CustomSarge, June 8, 2020

Sage, nay prescient, stuff here:

"Make your comments with the assumption the next person is a maniacal serial killer with a hair trigger - and your name is in the header. Note that might be you in 6 months." (I LOVE THIS !!! - BT)

I will disagree somewhat on the Hdwe not writing Swre: Classic "give 'em enough rope" scenario. If they can do it, let 'em, else somebody has to take the reins. But, if the swre spec is sufficiently defined, the hdwe designer just may be the best coder for it.

I've been doing both >40 yrs and have enough face-plants on Both aspects to be humble on ability. Always try - let your betters laugh and learn from them, they don't mean it personally - just musing to when They were in your place.

NOBODY is born knowing this stuff - stay both resolute and sanguine in the learning thereof.  Good Hunting   <<<)))


[ - ]
Reply by mr_bandit, June 8, 2020

@CustomSarge: we have about the same amount of experience - I've been programming for close on 45 years, and embedded for at least 35 years. So - long enough to become k-krusty with a lot of bitter experience. And my share of faceplants, too.

My response is: HW and FW have some overlap, mainly in the design process, but HW skills and SW skills are different in some critical aspects.

A simple example: status bits. Some should be "sticky" for status of a transitory event. Some need to be the status at the time of reading. But I have fought these fights because the HWE did not take them into consideration and could not see why I needed them.

Two of my top five managers are skilled in both HW and SW, as well as being good managers. But most engineers are not skilled in both. I have been mentored by good EEs, but I don't claim to be an EE - I only play EE on TV. I know the EE skillset has a lot I don't know about. I can do simple HW design, but I would not claim to be competent in HW. But I can mumble with EEs on what I need.

> Classic "give 'em enough rope" scenario

The problem is it creates unneeded problems down the road. YMMV

[ - ]
Reply by CustomSarge, June 8, 2020

Howdy, Something I need to 'mem/admit is I'm a 1 peep shop. Version control is "ok moron" where's the last functional version... As soon as you get to more than 2 maybe 3, it's a Totally different game.

I got fired from a database swre shop because I blew up the git train on a critical customer - I DESERVED IT. (hated the job, but that's anyway)

I forget the freedom/consequence of being a sole developer (sorry). <<<)))


[ - ]
Reply by mr_bandit, June 8, 2020

I have been a "sole-prop" guy too. No worries.

Version control has saved my ass on many occasions.

Sorry about your git fire. I tend to create scripts so I don't do something stupid - I just do other stupid things. I just try to keep them to a minimum and on "harmless" things.

[ - ]
Reply by SpiderKenny, June 8, 2020

When you can't get the answer to a question, you should test the results.

If you are coding for general release, where the end-user may re-compile the source code, then your ./configure or Makefile should include some way of testing the compiler behaviour and producing correct (i.e. expected) results.
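One lightweight way to test the compiler's behaviour is to let the build itself check the assumptions. The sketch below assumes a C11 compiler (for _Static_assert); the particular assumptions listed and the function name signed_shift_is_arithmetic are only examples, to be replaced with whatever the code actually relies on.

```c
#include <limits.h>

/* Fail the build, rather than misbehave at run time, if the compiler
   does not behave the way this code base assumes. */
_Static_assert((-1 >> 1) == -1,
               "code assumes >> on negative values is an arithmetic shift");
_Static_assert((-9 / 2) == -4,
               "code assumes integer division truncates toward zero");
_Static_assert(CHAR_BIT == 8,
               "code assumes 8-bit bytes");

/* Run-time version of the shift check, usable from a configure-style
   probe program. The volatile keeps the shift from being folded at
   compile time. */
int signed_shift_is_arithmetic(void)
{
    volatile int minus_one = -1;
    return (minus_one >> 1) == -1;
}
```

A configure script or Makefile rule could compile a file like this for the target and stop with a clear message if any assertion fails.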

[ - ]
Reply by MaxMaxfield, June 8, 2020

Great point -- thanks for sharing -- Max