Unicode
Unicode is a standard, promoted by the Unicode Consortium, for encoding the text of most of the world's writing systems. It is usually stored using a variable-width encoding such as UTF-8. Its goal is to replace current and previous character encoding standards with one worldwide standard for all languages. It has already done that to a large degree; for example, it is dominant on the web in the form of the UTF-8 encoding. UTF-16 is also common, for example on Windows, although Microsoft recommends UTF-8. The standard also supports emoji and other symbols that older standards did not support.
Older standards, designed mainly for English text, could not represent all languages of the world, for example Chinese or Japanese. They also could not handle languages such as Arabic or Hebrew, which are written from right to left, at least not when mixed with languages that are written from left to right. Unicode supports such mixing. It also defines rules for sorting (collating) text, which is not easy when languages are mixed.
Unicode provides many printable characters, such as letters, digits, diacritics (marks that attach to letters), and punctuation marks. It also provides characters that do not print anything themselves, but instead control how text is processed. For example, a newline and a character that makes text run from right to left are both non-printing characters.
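The difference between printable and non-printing characters can be inspected with the general category that Unicode assigns to every character. The following is a minimal sketch using Python's standard `unicodedata` module; the sample characters, their informal labels, and the helper name `describe` are chosen here for illustration only.

```python
import unicodedata

# Illustrative samples; the labels are informal descriptions, not official names.
SAMPLES = {
    "A": "letter",
    "7": "digit",
    "\u0301": "combining acute accent (a diacritic)",
    "!": "punctuation mark",
    "\n": "newline (control character)",
    "\u200f": "right-to-left mark (format character)",
}


def describe(ch):
    """Return the code point label and general category of one character."""
    return f"U+{ord(ch):04X}", unicodedata.category(ch)


for ch, label in SAMPLES.items():
    code, category = describe(ch)
    print(code, category, label)
```

Categories that begin with L are letters, N numbers, M combining marks, and P punctuation. Categories that begin with C (here `Cc` for control and `Cf` for format) are the characters that do not print.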
Details
Unicode, formally the Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. As of Unicode 14.0, the standard defines 144,697 characters covering 159 modern and historic scripts, as well as symbols, emoji, and non-visual control and formatting codes. Unicode is promoted by the Unicode Consortium and is kept in step with the international standard ISO/IEC 10646. New versions are issued regularly and add more characters.
Unicode was first published in 1991 and incorporated earlier character codes used on computer systems.
The Unicode Standard includes more than just the base code. Alongside the character encodings, the Consortium's official publication includes a wide variety of details about the scripts and how to display them, such as normalization rules, decomposition, collation, rendering, and the display order of bidirectional text in multilingual documents.
Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many recent technologies, including modern operating systems, XML, and most modern programming languages.
Older standards for (English) text could represent only 32, 64, 128 (ASCII), or most often 256 different letters or symbols. With 32 codes there is only enough room for the upper-case English letters and no digits. ASCII is enough for English but not for most other languages, and 256 codes are enough for many European languages but not for many Asian languages such as Chinese. Unicode already defines over 144,000 characters and has room for over a million, which is enough for all languages of the world. This includes languages such as Arabic and Hebrew, which are written from right to left, and Unicode allows them to be mixed with languages such as English or Russian that are written in the other direction.
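The sizes mentioned above follow from simple arithmetic. The sketch below compares the number of codes available in a 7-bit and an 8-bit character set with the size of the Unicode code space, whose highest code point is U+10FFFF. The constant names are chosen here for illustration.

```python
import sys

ASCII_CODES = 2 ** 7            # 7-bit codes: 128
SINGLE_BYTE_CODES = 2 ** 8      # 8-bit codes: 256
MAX_CODE_POINT = 0x10FFFF       # highest Unicode code point
UNICODE_CODE_POINTS = MAX_CODE_POINT + 1  # code points run from 0 to 0x10FFFF

print(ASCII_CODES)              # 128
print(SINGLE_BYTE_CODES)        # 256
print(UNICODE_CODE_POINTS)      # 1114112

# Python strings use the same upper limit.
print(sys.maxunicode == MAX_CODE_POINT)  # True
```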
Unicode can be implemented by different character encodings. The Unicode Standard defines the Unicode Transformation Formats (UTF) UTF-8, UTF-16, and UTF-32, and several other encodings also exist. The most commonly used encodings are UTF-8 and UTF-16. GB18030, while not an official Unicode standard, is standardized in China and implements Unicode fully.
Unicode represents a graphical character (for instance é) as a single code point or as a sequence of code points. For é, this is either the single code point U+00E9 or the letter e followed by the combining acute accent U+0301. Each code point is a number, which is encoded as one or more code units. Code units are 8, 16, or 32 bits wide, depending on the encoding. This is how Unicode text is represented in binary.
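The two ways of writing é, and the number of code units each encoding needs, can be checked directly. This is a small sketch using Python's standard library; the helper name `code_units` is made up for this example, and the little-endian codec names are used only so that no byte order mark is added to the output.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def code_units(text, encoding, unit_bytes):
    """Count the code units of text in an encoding with the given unit size."""
    return len(text.encode(encoding)) // unit_bytes


# Both strings display as é, but they contain different code points.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Code units needed for the single code point U+00E9.
print(code_units(precomposed, "utf-8", 1))      # 2
print(code_units(precomposed, "utf-16-le", 2))  # 1
print(code_units(precomposed, "utf-32-le", 4))  # 1
```

Because the two forms compare as unequal, programs that compare or search text usually normalize it first.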
Some letters, such as the Devanagari kshi, and national flag emojis are represented with more than one code point.
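As an illustration, the sketch below builds a flag from two regional indicator symbols and lists the code points that make up kshi. The helper name `flag` and the choice of Japan ("JP") are assumptions made for this example, and the code does not check whether a two-letter code belongs to a real country.

```python
import unicodedata

REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code):
    """Build a flag sequence from a two-letter code such as "JP"."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


# Devanagari kshi: KA + VIRAMA + SSA + VOWEL SIGN I
kshi = "\u0915\u094d\u0937\u093f"

for text in (flag("JP"), kshi):
    print(len(text), "code points:")
    for ch in text:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each of these is usually drawn as one symbol on screen, yet the flag is stored as two code points and kshi as four.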
Encodings
There are different ways to encode Unicode. The most common ones are:
- UTF-8, which uses one to four bytes for each code point and maximizes compatibility with ASCII
- UTF-16, which uses one or two 16-bit code units per code point and cannot encode the surrogate code points (U+D800 to U+DFFF) on their own, because it reserves them for two-unit pairs
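The widths listed above can be measured for a few sample characters. This is a minimal sketch; the helper names `widths` and `is_encodable` and the sample characters are chosen here for illustration, and it relies on Python's default strict error handling, which refuses to encode a lone surrogate.

```python
def widths(ch):
    """Return (UTF-8 bytes, UTF-16 code units) needed for one code point."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" is used so that no byte order mark is added.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


def is_encodable(text, encoding):
    """Report whether text can be encoded with strict error handling."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", widths(ch))
# U+0041 (1, 1)
# U+00E9 (2, 1)
# U+4E2D (3, 1)
# U+1F600 (4, 2)

# A lone surrogate code point is rejected.
print(is_encodable("\ud800", "utf-16-le"))  # False
```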
UTF-8 is the most common of these for exchanging data. It is used on the internet and in electronic mail, and Java also uses a variant of it (Modified UTF-8).
UTF-8 is the dominant encoding for the World Wide Web (and internet technologies), accounting for 98% of all web pages, and up to 100.0% for some languages, as of 2022. It uses one byte (8 bits) for each of the first 128 code points, and up to 4 bytes for other characters. The first 128 Unicode code points represent the ASCII characters, which means that any ASCII text is also valid UTF-8 text.
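The compatibility with ASCII can be checked by encoding the same text both ways. The sample strings and the helper name `is_ascii_only` in this sketch are chosen for illustration.

```python
def is_ascii_only(data):
    """Report whether every byte is in the ASCII range 0 to 127."""
    return all(byte < 0x80 for byte in data)


text = "Hello, Unicode!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# ASCII text produces exactly the same bytes in UTF-8, one byte per character.
print(ascii_bytes == utf8_bytes)      # True
print(len(utf8_bytes) == len(text))   # True
print(is_ascii_only(utf8_bytes))      # True

# Characters outside ASCII use bytes with the high bit set.
print(is_ascii_only("caf\u00e9".encode("utf-8")))  # False
```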
Both UTF-8 and UTF-16 are variable-width encodings. It is a common mistake to think that UTF-16 is fixed-width like ASCII; only its obsolete predecessor UCS-2 is. Fixed width is therefore not a valid reason to prefer UTF-16.
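The reason UTF-16 is variable-width is that code points above U+FFFF do not fit in one 16-bit code unit and are split into a pair of surrogates. The sketch below works through that arithmetic by hand and compares the result with Python's built-in encoder. The function name `to_surrogate_pair` and the sample code point U+1F600 are chosen here for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000     # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")        # D83D DE00

# The built-in encoder produces the same two code units.
encoded = "\U0001F600".encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True

# One code unit (2 bytes) for "A", two code units (4 bytes) for U+1F600.
print(len("A".encode("utf-16-be")), len(encoded))  # 2 4
```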