[Solved] UTF-8 Unicode encoding and country specific characters

Accepted Answer

In the following, I use the term “character” to denote something that can be displayed on a screen and printed on paper by a computer. The closest official Unicode term is “code point”, which strictly speaking is the number assigned to a character. The letter ‘a’ has code point 97 (0x61), and ‘ྦྷ’ has code point 4007 (0xFA7).
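As a small illustration of the character/number relationship, here is a Python sketch using the built-in `ord` and `chr`. The helper name `describe` is made up for this example.

```python
def describe(ch: str) -> str:
    """Return the code point of a single character in the usual U+XXXX form.

    `describe` is a hypothetical helper, not part of any library.
    """
    cp = ord(ch)  # character -> code point number
    return f"U+{cp:04X} ({cp})"


print(describe("a"))       # U+0061 (97)
print(describe("\u0fa7"))  # U+0FA7 (4007)
print(chr(0x61))           # code point number -> character: a
```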

Unicode as such encodes just about every known character in every written language on this planet. The coding starts with the traditional English/American characters and control characters in the first 128 code points (0..127). The next 128 (128..255) hold a further block of control codes followed by a bunch of European letters such as accented and umlauted characters (é, Ä, ö) and some special characters (£, ¥, etc). The euro sign € is not in this range; it sits higher up, at 0x20AC. Higher numbers then cover “less European” scripts such as Russian, Japanese, Chinese, Thai, Urdu, Arabic, Hebrew, etc. [I’m not sure exactly in which order these are].

The numbers go up to 0x10FFFF (1,114,111), so there is room for just over a million code points, most of them still unassigned.

You can look at the different characters for example here.

UTF-8 uses 8 bits per “token” (one byte). The first 128 characters are encoded directly as single bytes 0..127. Everything else starts with a lead byte of the form 11xxxxxx in binary. The lead byte tells you how many further bytes follow by using more and more 1’s at the beginning, and each subsequent byte is encoded as 10xxxxxx. The original design allowed up to 5 further bytes, but current UTF-8 (RFC 3629) stops at 3, so a character takes at most 4 bytes. There is ALWAYS a 0 between the last “this is special” 1 bit and the “actual data”. So for example, a 2-byte combination looks like 110xxxxx 10yyyyyy, where xxxxxyyyyyy is the binary code of the character.
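The bit layout can be sketched as a hand-written encoder in Python. This is a minimal sketch, not how any library implements it: `utf8_encode` is a made-up name, and it is limited to the 1- to 4-byte forms that current UTF-8 (RFC 3629) permits.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:
        # 0xxxxxxx: plain 7-bit value
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10yyyyyy
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled result with Python's built-in codec.
for ch in ["a", "\u00e9", "\u20ac", "\u0fa7", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

print(utf8_encode(0xE9))  # é -> b'\xc3\xa9'
```

The loop checks the sketch against `str.encode("utf-8")` for one character from each length class, which is a cheap way to confirm that the shifts and masks line up.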

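UTF-16 works on a similar principle, except each “token” is 16 bits. Code points up to 0xFFFF are stored directly as one 16-bit unit. The range 0xD800-0xDFFF is reserved to encode “longer than 16 bits” code points: each one is written as a pair of units, a high surrogate in 0xD800-0xDBFF followed by a low surrogate in 0xDC00-0xDFFF. You can read more in the Wikipedia article here (I’ve not worked much with UTF-16).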
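For code points above 0xFFFF, the UTF-16 pair is computed by subtracting 0x10000 and splitting the remaining 20 bits into two halves of 10 bits each. A rough Python sketch, with `utf16_units` as a made-up helper name:

```python
def utf16_units(cp: int) -> list:
    """Return the 16-bit units that encode one code point (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x10000:
        return [cp]                      # fits in a single 16-bit unit
    v = cp - 0x10000                     # 20 bits remain
    high = 0xD800 | (v >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> low surrogate
    return [high, low]


units = utf16_units(0x1F600)
print([hex(u) for u in units])  # ['0xd83d', '0xde00']

# Cross-check against Python's big-endian UTF-16 codec.
as_bytes = b"".join(u.to_bytes(2, "big") for u in units)
assert as_bytes == "\U0001F600".encode("utf-16-be")
```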