[Solved] Garbage characters in C


There’s some confusion here regarding the term garbage characters. What it refers to is any byte that resides in a variable that wasn’t assigned in some well-defined way. The character A can be a garbage character if it happens to appear in (for example) a block of memory returned by malloc or an uninitialized char variable.

This is distinct from unprintable characters, which are characters that have no well-defined representation when printed. For example, ASCII codes 0 – 31 and 127 (0 – 1F and 7F hex) are control characters and therefore unprintable. There are also multibyte characters that a particular terminal may not know how to render.
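As a rough illustration of the ASCII ranges just mentioned, here is a minimal sketch that flags control characters in a buffer. The helper names is_ascii_control and count_ascii_controls are made up for this example, and the check only covers single-byte ASCII, not multibyte encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `byte` is an ASCII control character
   (0x00-0x1F or 0x7F), 0 otherwise. */
int is_ascii_control(unsigned char byte)
{
    return byte < 0x20 || byte == 0x7F;
}

/* Hypothetical helper: counts the ASCII control characters in the
   first `len` bytes of `buf`. */
size_t count_ascii_controls(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        count += (size_t)is_ascii_control(buf[i]);
    return count;
}
```

Note that the letter A is never flagged here, even if it came from uninitialized memory: whether a byte is garbage depends on how it got there, not on its value.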

To get into your specific questions:

Why can’t the character (image) be copied?

As an unprintable character, its screen representation is not well defined. So attempting to copy and paste it from a terminal will yield unexpected results.

Do garbage characters have some pattern? Meaning that can you
predict for an empty string what character can come, for an empty
integer what will come, and so on.

The nature of garbage characters is that their contents are undefined. Trying to predict what uninitialized data will contain is a futile effort. The same piece of code compiled with two different compilers (or the same compiler with different optimization settings) can have completely different contents for any uninitialized data.

The standard doesn’t say what values should go there, so implementations are free to handle it however they want. They could choose to leave whatever values happen to be at those memory addresses, they could choose to write 0 to all addresses, or they could choose to write the values 0, 1, 2, 3, etc. in sequence. In other words, the contents are undefined.

When a variable is declared, why does it have a garbage character
instead of being blank? Is there a specific reason of storing it with
a garbage character?

Global variables and static local variables that have no explicit initializer are zero-initialized, which is what the standard dictates. That is easy to arrange at compile or load time. Non-static local variables, on the other hand, typically reside on the stack, so their values are whatever happens to be on the stack at the time the function is called.
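To make the contrast concrete, here is a small sketch. The function names are invented for illustration. It checks a static array, which the standard guarantees is zeroed, and a local array that is given an explicit initializer so that it is "blank" as well. A local array without an initializer is deliberately not inspected, since its contents cannot be relied on.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte in buf[0..len) is zero. */
static int all_zero(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}

/* Static storage duration, no initializer: zero-initialized by the language. */
int static_buffer_is_zeroed(void)
{
    static char buf[16];
    return all_zero(buf, sizeof buf);
}

/* Automatic storage with an explicit initializer: the first element is set
   to 0 and the remaining elements are zero-initialized too. */
int initialized_local_is_zeroed(void)
{
    char buf[16] = {0};
    return all_zero(buf, sizeof buf);
}
```

If you want a local variable to start out "blank", you have to ask for it, for example with an initializer like the one above.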

Here’s an interesting example:

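#include <string.h>

void f1(void)
{
    char str[10];           /* uninitialized: contents are indeterminate */
    strcpy(str, "hello");   /* writes 6 bytes: "hello" plus the null terminator */
}

int main(void)
{
    f1();
    f1();
    return 0;
}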

Here is what a particular implementation might do:

The first time f1 is called, the local variable str is uninitialized. Then strcpy is called, which copies in the string “hello”. This takes up the first 6 bytes of the variable (5 for the string and 1 for the null terminator). The remaining 4 bytes are still garbage. When this function returns, the memory that the variable str resided at is free to be used for some other purpose.

Now f1 gets called again immediately after the first call. Since no other function was called, the stack for this invocation of f1 happens to sit at the exact same place as the last invocation. So if you were to examine str at this time, you would find it contains h, e, l, l, o, and a null byte (i.e. the string “hello”) for the first 6 bytes. But, this string is garbage. It wasn’t specifically stored there. If some other function was called before calling f1 a second time, most likely those values would not be there.

Again, garbage means the contents are undefined. The compiler doesn’t explicitly put “garbage” (or unprintable characters) in variables.

For a string which is not null-terminated, will the same garbage
character be printed on every OS? If yes, which one?

Here’s one of those places you’re confusing garbage and unprintable. In your specific case, the garbage character happens to be unprintable, but it doesn’t have to be. Here’s another example:

#include <stdio.h>
#include <string.h>

void f3(void)
{
    char str1[5], str2[5];

    strcpy(str1, "hello");   /* deliberate bug: needs 6 bytes, str1 has only 5 */
    strcpy(str2, "test");
    printf("str1=%s\n", str1);
}

Let’s suppose the compiler decides to place str2 immediately after str1 in memory (although it doesn’t have to). The first call to strcpy will write the string “hello” into str1, but this variable doesn’t have enough room for the null terminating byte. So it gets written to the next byte in memory, which happens to be the first byte of str2. Then when the next call to strcpy runs, it puts the string “test” in str2, but in doing so it overwrites the null terminating byte put there when str1 was written to.

Then when printf gets called, you’ll get this as output:

str1=hellotest

When printing str1, printf looks for the null terminator, but there isn’t one inside of str1. So it keeps reading until it finds one. In this case there happens to be another string right after it, so it prints that as well until it reaches the null terminator that was properly stored in that string.

But again, this behavior is undefined. A seemingly minor change in this function could result in str2 appearing in memory first. The compiler is free to do as it wishes in this regard, so there’s no way to predict what will happen.
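For comparison, here is one way the overflow in f3 could be avoided. This is only a sketch: copy_string is a made-up helper, and using snprintf to bound the copy is one option among several. It never writes past the destination, always stores a null terminator, and reports whether the string had to be cut short.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copies `src` into `dst`, which has room for
   `dst_size` bytes including the null terminator. The result is always
   null-terminated. Returns 1 if the whole string fit, 0 if it was
   truncated or if dst_size is 0. */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    int needed;

    if (dst_size == 0)
        return 0;
    /* snprintf writes at most dst_size - 1 characters plus the terminator
       and returns the length the full string would have had. */
    needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With char str1[5], calling copy_string(str1, sizeof str1, "hello") stores "hell" and returns 0, so printing str1 afterwards stops inside the array instead of running on into whatever follows it.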

Are there the same garbage characters on every OS? Or are they
different?

I believe you’re actually referring to unprintable characters in this case. This really depends on the character set of the OS and/or terminal in question. For example, Chinese characters are represented with multiple bytes. If your terminal can’t print Chinese characters, you’ll see some type of code similar to what you saw for each of the bytes. But if it can, it will display it in a well-defined manner.

Is there a way to print these characters on the stdout buffer in C /
C++?

Not as characters. You can however print out their numerical representations. For example:

#include <stdio.h>

void f4(void)
{
    char c;   /* deliberately left uninitialized */
    printf("c=%02hhX\n", (unsigned char)c);
}

The contents of c are undefined, but the above will print whatever value happens to be there in hexadecimal format.
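The same idea extends to a whole buffer. The sketch below uses a hypothetical helper named escape_bytes that copies printable ASCII bytes as they are and renders every other byte as \xNN, so the result can be printed safely on any terminal. The \xNN output format is just a choice made for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: renders `len` bytes of `src` into `dst` as text.
   Printable ASCII bytes (0x20-0x7E) are copied unchanged; every other
   byte becomes the four characters \xNN. `dst` is always null-terminated
   when dst_size > 0. Returns 0 on success, -1 if `dst` is too small. */
int escape_bytes(const unsigned char *src, size_t len,
                 char *dst, size_t dst_size)
{
    size_t used = 0;

    if (dst_size == 0)
        return -1;
    dst[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        int printable = src[i] >= 0x20 && src[i] < 0x7F;
        size_t need = printable ? 1 : 4;

        /* Keep one byte in reserve for the null terminator. */
        if (used + need + 1 > dst_size)
            return -1;

        if (printable) {
            dst[used] = (char)src[i];
            dst[used + 1] = '\0';
        } else {
            snprintf(dst + used, dst_size - used, "\\x%02X",
                     (unsigned)src[i]);
        }
        used += need;
    }
    return 0;
}
```

For instance, the four bytes 'h', 'i', 0x7F, 0x01 come out as the text hi\x7F\x01.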

If you see carefully in the character (image),
there are some characters and numbers in it. Do they represent
something?

Some terminals will display unprintable characters by printing a box containing the Unicode codepoint of the character so the reader can know what it is.

Unicode is a standard for text where each character is assigned a numerical code point. Besides the typical set of characters in the ASCII range, Unicode also defines other characters, such as accented letters, other alphabets like Greek, Hebrew, Cyrillic, Chinese, and Japanese, as well as various symbols. Because there are thousands of characters defined by Unicode, multiple bytes are needed to represent them. The most common encoding for Unicode is UTF-8, which allows regular ASCII characters to be encoded with one byte, and other characters to be encoded with two or more bytes as needed.
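To show how the one-byte versus multi-byte split works in UTF-8, here is a compact encoder sketch. The function name utf8_encode is invented for this example. The bit layout follows the standard UTF-8 scheme, and surrogate code points and values above 10FFFF are rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes code point `cp` as UTF-8 into `out`,
   which must have room for 4 bytes. Returns the number of bytes written
   (1 to 4), or 0 if `cp` is not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* ASCII: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)  /* surrogates are not characters */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+007F encodes to the single byte 7F, U+00E9 (é) to the two bytes C3 A9, and U+4E2D (中) to the three bytes E4 B8 AD. A terminal that does not understand UTF-8 would show those bytes individually rather than as one character.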

In this case, the codepoint in question is 007F. This is the DELETE control character, which many terminals send when the Backspace or Delete key is pressed. Since this is a control character, your terminal is displaying it as a box with the Unicode code point for the character instead of attempting to “print” it.

Is there a list of garbage characters which can be printed in C /
C++?

Again, assuming you really mean unprintable characters here, that has more to do with the terminal that’s displaying the characters than with the language. Generally, control characters are unprintable, while certain multibyte characters may or may not display properly depending on the font / character set of the terminal.
