Software Design Lecture Notes
Prof. Stewart Weiss

Unicode and UTF-8

1 About Text
The Problem

Most computer science students are familiar with the ASCII character encoding scheme, but no others. This
was the most prevalent encoding for more than forty years. The ASCII encoding maps characters to 7-bit
integers, using the range from 0 to 127 to represent 94 printing characters, 33 control characters, and the
space. Since a byte is usually used to store a character, the eighth bit of the byte is filled with a 0.

The problem with the ASCII code is that it does not provide a way to encode characters from other scripts,
such as Cyrillic or Greek. It does not even have encodings of Roman characters with diacritical marks, such
as ę, ą, ś, or ó. Over time, as computer usage extended world-wide, other encodings for different alphabets
and scripts were developed, usually with overlapping codes. These encoding systems conflicted with one
another. That is, two encodings could use the same number for two different characters, or use different
numbers for the same character. A program transferring text from one computer to another would run the
risk that the text would be corrupted in the transition.

Unifying Solutions

In 1989, to overcome this problem, the International Organization for Standardization (ISO) started work on a
universal, all-encompassing character code standard, and in 1990 it published a draft standard (ISO
10646) called the Universal Character Set (UCS). UCS was designed as a superset of all other character set
standards, providing round-trip compatibility to other character sets. Round-trip compatibility means that
no information is lost if a text string is converted to UCS and then back to its original encoding.

Simultaneously, the Unicode Project, which was a consortium of private industrial partners, was working on
its own, independent universal character encoding. In 1991, the Unicode Project and ISO decided to work
cooperatively to avoid creating two different character encodings. The result was that the code table created
by the Unicode Consortium (as it is now called) satisfied the original ISO 10646 standard. Over time,
the two groups have continued to modify their respective standards, but the standards always remain compatible.
Unicode adds new characters over time, but it always contains the character set defined by ISO 10646-x. At the
time these notes were written, the most current Unicode standard was Unicode 6.0.

Unicode

Unicode contains the alphabets of almost all known languages, as diverse as Japanese, Chinese, Greek,
Cyrillic, Canadian Aboriginal, and Arabic. It was originally a 16-bit character set, but in 1996, with Unicode
2.0, its code space was extended beyond 16 bits. The Unicode Standard encodes characters in the range
U+0000..U+10FFFF, which is roughly a 21-bit code space. Code points in this range that have not yet been
assigned to characters are reserved for future use.

In Unicode, a character is defined as the smallest component of a written language that has semantic value.
The number assigned to a character is called a code point. A code point is denoted by U+ followed by a
hexadecimal number from 4 to 6 digits long. Most of the code points in use are 4 digits long. For example,
U+03C6 is the code point for the Greek character φ.
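
The U+ notation can be reproduced with a few lines of Python. This is only an illustrative sketch; the helper name code_point_label is made up for this example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+ label of a single character, padded to at least 4 hex digits."""
    return f"U+{ord(ch):04X}"


print(code_point_label("\u03c6"))         # U+03C6 (Greek small letter phi)
print(code_point_label("A"))              # U+0041
print(code_point_label(chr(0x10FFFF)))    # U+10FFFF, the largest code point
```

Here ord() returns the code point as an integer and chr() goes the other way, so neither function says anything about how the character is stored as bytes.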

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.

Figure 1: Unicode layout

UTF-8

Unicode code points are just numeric values assigned to characters. They are not representations of characters
as sequences of bytes. For example, the code point U+0C36 is not a sequence of the bytes 0x0C and 0x36.
In other words, Unicode by itself is not a character encoding scheme. If we were to use the code point
directly as a byte sequence, there would be no way to distinguish the sequence of two ASCII characters
'\f' '6' (form feed followed by the digit 6) from the single character U+0C36, a letter in the Telugu block.

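
The ambiguity can be seen directly in a short Python sketch. The variable names are illustrative only; 0x0C is the ASCII form feed and 0x36 is the ASCII code of the digit 6.

```python
code_point = 0x0C36

# Naively writing the code point as two bytes, most significant byte first.
naive = code_point.to_bytes(2, "big")

# The ASCII text "form feed, then the digit 6" is the same two bytes.
ascii_text = "\f6".encode("ascii")
print(naive == ascii_text)        # True

# UTF-8 represents the same code point with three non-ASCII bytes instead.
utf8 = chr(code_point).encode("utf-8")
print(utf8.hex(" "))              # e0 b0 b6
```
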
There are several encoding schemes that can represent Unicode, including UCS-2, UCS-4, UTF-8, UTF-16,
and UTF-32. UCS-2 and UCS-4 encode every character as a fixed-width sequence of 2 or 4 bytes respectively,
but these cannot work in a UNIX system because strings with these encodings can contain bytes that match
ASCII characters, in particular \0 or /, which have a special meaning in filenames and other C library
function parameters. UNIX file systems and tools expect ASCII characters and would fail if they were given
2-byte encodings.
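
A small Python sketch makes the problem concrete. It assumes that, for characters in the 16-bit range (excluding surrogates), Python's "utf-16-be" codec produces the same bytes as big-endian UCS-2; the helper name ucs2_be is hypothetical.

```python
def ucs2_be(text: str) -> bytes:
    """Big-endian 2-byte encoding of text (valid stand-in for UCS-2 on 16-bit code points)."""
    return text.encode("utf-16-be")


print(ucs2_be("A"))        # b'\x00A'  -> contains a NUL byte
print(ucs2_be("\u012f"))   # b'\x01/'  -> contains the byte for '/'

# The UTF-8 forms of the same strings contain neither byte.
print(b"\x00" in "A".encode("utf-8"))     # False
print(b"/" in "\u012f".encode("utf-8"))   # False
```

A C function that treats \0 as a string terminator would stop after the first byte of the 2-byte form of "A", which is the kind of failure described above.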

The most prevalent encoding of Unicode as sequences of bytes is UTF-8, invented by Ken Thompson in
1992. In UTF-8 as originally designed, characters are encoded with anywhere from 1 to 6 bytes; the current
definition (RFC 3629) stops at U+10FFFF and therefore never needs more than 4 bytes. In other words, the
number of bytes varies with the character. In UTF-8, all ASCII characters are encoded within the 7 least
significant bits of a byte whose most significant bit is 0.

UTF-8 uses the following scheme for encoding Unicode code points:

1. Characters U+0000 to U+007F ( i.e., the ASCII characters) are encoded simply as bytes 0x00 to 0x7F.
This implies that files and strings that contain only 7-bit ASCII characters have the same encoding
under both ASCII and UTF-8.

2. All UCS characters larger than U+007F are encoded as a sequence of two or more bytes, each of which
has the most significant bit set. This means that no ASCII byte can appear as part of any other
character, because ASCII characters are the only characters whose leading bit is 0.

3. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range
0xC0 to 0xFD and it indicates how many bytes follow for this character. Specifically it is one of
110xxxxx, 1110xxxx, 11110xxx, 111110xx, and 1111110x, where the x's may be 0's or 1's. The number
of 1-bits following the first 1-bit up until the next 0-bit is the number of bytes in the rest of the sequence.
All further bytes in a multibyte sequence start with the two bits 10 and are in the range 0x80 to 0xBF.
This implies that UTF-8 sequences must be of the following forms in binary, where the x's represent
the bits from the code point, with the leftmost x-bit being its most significant bit:


0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

4. The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
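
The rules above are enough to tell, from the first byte alone, how long a sequence is. The following Python sketch follows the 1-to-6-byte scheme described in these notes; the function name sequence_length is an assumption made for illustration, and a decoder written for current UTF-8 would additionally reject lengths greater than 4.

```python
def sequence_length(first_byte: int) -> int:
    """Total number of bytes in the sequence that starts with first_byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if first_byte < 0x80:
        return 1                      # 0xxxxxxx: an ASCII character
    if first_byte < 0xC0:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    if first_byte >= 0xFE:
        raise ValueError("0xFE and 0xFF are never used")
    # Count the leading 1-bits: that count is the length of the sequence.
    length = 0
    mask = 0x80
    while first_byte & mask:
        length += 1
        mask >>= 1
    return length


print(sequence_length(0x41))   # 1
print(sequence_length(0xD7))   # 2
print(sequence_length(0xE0))   # 3
print(sequence_length(0xFD))   # 6
```

Because every continuation byte starts with 10, a program that lands in the middle of a sequence can skip forward to the next byte that does not start with 10 and resume decoding there.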

A few things can be concluded from the above rules. First, the number of x's in a sequence is the maximum
number of bits that a code point can have and still be representable in that many bytes. For example,
there are 11 x-bits in a two-byte UTF-8 sequence, so all code points whose 16-bit binary value is at least
0000000010000000 but at most 0000011111111111 can be encoded using two bytes. In hex, these lie between
0080 and 07FF. The table below shows the ranges of Unicode code points that map to the different UTF-8
sequence lengths.

Number of    Number of Bits     Range (hex)
Bytes        in Code Point
1            7                  00000000 - 0000007F
2            11                 00000080 - 000007FF
3            16                 00000800 - 0000FFFF
4            21                 00010000 - 001FFFFF
5            26                 00200000 - 03FFFFFF
6            31                 04000000 - 7FFFFFFF
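
The table can be turned into a lookup, as in the Python sketch below. The upper limit of each row is computed from the bit count in the second column as 2^bits - 1; the names BIT_CAPACITY and bytes_needed are hypothetical.

```python
# Number of code point bits that fit in a sequence of 1, 2, ..., 6 bytes.
BIT_CAPACITY = (7, 11, 16, 21, 26, 31)


def bytes_needed(code_point: int) -> int:
    """Number of bytes used for code_point under the 1-to-6-byte scheme."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    for n, bits in enumerate(BIT_CAPACITY, start=1):
        if code_point <= (1 << bits) - 1:
            return n
    raise ValueError("code point needs more than 31 bits")


print(bytes_needed(0x0041))     # 1
print(bytes_needed(0x05E7))     # 2
print(bytes_needed(0x0ABC))     # 3
print(bytes_needed(0x10FFFF))   # 4
```

The last call shows that the largest code point in the Unicode range, U+10FFFF, already fits in four bytes.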

You can see that, although UTF-8 encoded characters may be up to six bytes long in theory, code points
through U+FFFF, having at most 16 bits, can be encoded in sequences of no more than 3 bytes.

Converting a Unicode code point to UTF-8 by hand is straightforward using the above table.

1. From the range, determine how many bytes are needed.

2. Starting with the least significant bit, copy bits from the code point from right to left into the x
positions of the least significant byte.

3. When the x positions of the current byte are full (six bits for each byte that begins with 10), continue
filling the next most significant byte with successively more significant bits from the code point.

4. Repeat until all bits have been copied into the byte sequence, filling with leading zeros as required.

Example 1. To convert U+05E7 to UTF-8, first determine that it is in the interval 0080 to 07FF, requiring
two bytes. Write it in binary as

0000 0101 1110 0111

The rightmost 6 bits go into the right byte after 10:

10 100111

and the remaining 5 bits go into the left byte after 110:

110 10111


So the sequence is 11010111 10100111 = 0xD7 0xA7, which in decimal is 215 in byte 1 and 167 in byte 2.

Example 2. To convert U+0ABC to UTF-8, note that it is greater than U+07FF and at most U+FFFF, so it
needs a three-byte code. In binary, it is

0000 1010 1011 1100

which is distributed into the three bytes as

1110 0000
10 101010
10 111100

This is the sequence 11100000 10101010 10111100 = 0xE0 0xAA 0xBC, which in decimal is 224 170 188. The
character encoded is the Gujarati sign Nukta.
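
Both hand conversions can be checked against Python's built-in UTF-8 codec. This is only a verification sketch; the helper name utf8_decimal is made up for this example.

```python
def utf8_decimal(code_point: int) -> list[int]:
    """Decimal values of the UTF-8 bytes of code_point, using Python's own codec."""
    return list(chr(code_point).encode("utf-8"))


for cp in (0x05E7, 0x0ABC):
    encoded = bytes(utf8_decimal(cp))
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()} -> {list(encoded)}")
# U+05E7 -> D7 A7 -> [215, 167]
# U+0ABC -> E0 AA BC -> [224, 170, 188]
```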

Exercise. Write an algorithm to do the conversion in general.
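
One possible approach to the exercise is sketched below in Python. It follows the 1-to-6-byte scheme of these notes and uses the fact that an n-byte sequence (n >= 2) holds 5n + 1 code point bits. The function name utf8_encode is an assumption, and unlike a standards-conforming encoder it does not reject code points above U+10FFFF or the surrogate range.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the byte patterns described in these notes."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("code point must fit in 31 bits")
    if code_point <= 0x7F:
        return bytes([code_point])          # ASCII: a single byte 0xxxxxxx

    # Step 1: find the smallest n whose 5n + 1 bits can hold the code point.
    n = 2
    while code_point >> (5 * n + 1):
        n += 1

    # Steps 2-4: fill continuation bytes from the right, six bits at a time.
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (code_point & 0x3F))   # 10xxxxxx
        code_point >>= 6

    # First byte: n leading 1-bits, a 0-bit, then the remaining bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    out.append(prefix | code_point)
    return bytes(reversed(out))


print(utf8_encode(0x05E7).hex(" "))   # d7 a7
print(utf8_encode(0x0ABC).hex(" "))   # e0 aa bc
```

For code points that Python itself accepts, the result can be compared with chr(code_point).encode("utf-8") as a sanity check.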

