tags : Systems
The agreement between the two computers about the correspondence between letters and numbers is what Unicode standardizes.
In terms of Unicode,
h
is:
- An
abstract character
namedLATIN SMALL LETTER H
.- This
character
has the correspondingnumber 0x68
- Which is a
code point
in notationU+0068
.
- The first Unicode version 1.0 was published in October 1991 and had 7,161 characters âassociatedâ. The latest version provides codes for 149,186 characters âassociatedâ.
- Total characters: 216 + 220 - 211 = 1,112,064(~1mn) assignable code points.
- 216 : Plane 0 (BMP) (
U+0000
-U+FFFF
) - 220 : Plane 1-16 (astral planes/supplementary planes) (
U+10000
-U+10FFFF
) - 211 : Code points from
U+D800
-U+DFFF
which are used to encode surrogate pairs inUTF-16
- 216 : Plane 0 (BMP) (
Codespace and blocks
- The Unicode
codespace
is divided into 17planes
, numbered 0 to 16. - The Unicode Standard defines a
codespace
(a set of numerical values ranging from 0 through 10FFFF) calledcode points
and denoted asU+0000
throughU+10FFFF
(See blocks.txt ) - Plane 0 is the
Basic Multilingual Plane (BMP)
- Single code unit in
UTF-16
encoding - Encoded in 1, 2 or 3 bytes in
UTF-8
.
- Single code unit in
- Planes 1 through 16 are supplementary planes
- Accessed as surrogate pairs in
UTF-16
- Encoded in four bytes in
UTF-8
- Accessed as surrogate pairs in
- Within each
plane
, characters are allocated within namedblocks
of related characters.- Characters required for a given script may be spread out over several different blocks.
Code units
How unicode is interpreted by the computer: code units
Code unit is a bit sequence used to encode each character within a given encoding form.
- The
character encoding
is what transforms abstract code points into physical bits: code units. - Popular encodings are
UTF-8
,UTF-16
andUTF-32
(UTF = Unicode Transformation Format)
Encodings
UTF-16
- Windows tends to use UTF-16
- It uses 16 bit code units (16 bits = 2 bytes).
- variable-length encoding
- Surrogate pairs are a UTF-16 thing.
- The main hazard of UTF-16 is that it leads to people believing they are handling unicode correctly, when often they donât properly decode surrogate pairs, etc.
- Surrogate pair is a representation for a single abstract character that consists of a sequence of code units of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.
- Code points
- Code points from BMP are encoded using a single code unit of 16-bit
- Code points from astral planes are encoded using two code units of 16-bit each(U+D800 - U+DFFF). (surrogate pairs) This allows encoding of over one million additional characters.
- High surrogate: Unicode code point from range U+D800 to U+DBFF gets combined with another unicode code point
- Low surrogate: Unicode code point from range U+DC00 to U+DFFF to generate a whole new character
UTF-32
- It uses 32 bit code units (32 bits = 4 bytes).
- An encoding where âall characters [are] matched with just one code pointâ, and itâs UTF-32.
- Pros
- character is the same length. (fixed length encoding). So you know right away where you can split it without cutting any letters in half.
- Cons
- Takes too much space. i.e downloading that text takes four times as long compared to UTF-8.
UTF-8
- Mac OS X and Linux use UTF-8.
- variable-length encoding
- It uses 8 bit code units (8 bits = 1 byte).
- It uses sequences of
up to four code units
to encode onecode point
. - Pros
- UTF-8 has the advantage that it uses the least amount of space if your characters are mostly in the basic Latin alphabet and punctuation. Webpages use UTF-8 because of this.
- Cons
- If you want to split the text into pieces, you have to be careful not to break up the text in the middle of a character.
- Itâs also not possible to find the 100th character without going through the first 99 (since they could be different lengths).
Language usage
Javascript
Strings
are represented fundamentally as sequences ofUTF-16
code units.- every code unit is exact 16 bits long. (maximum of 216, or
65,536
possible characters) - This character set is called the basic multilingual plane (BMP)
- To avoid ambiguity, the two parts of the pair must be between
0xD800
and0xDFFF
, and these code units are not used to encode single-code-unit characters.split
by UTF-16 code units and will separate surrogate pairs.
- Surrogate pairs, combining marks and grapheme are tough to handle in JavaScript.
- Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit. i.e The length of a String is the number of elements. always think of string in JavaScript as a sequence of code units.
- Most of the JavaScript string methods are not Unicode-aware.
- What every JavaScript developer should know about Unicode
- Every element of a string is interpreted by the engine as a code unit.
- The way a string is rendered does not provide a deterministic way to decide what code units (that represent code points) it contains.
console.log('cafe\u0301'); // => 'café'
console.log('café'); // => 'café'
// 'cafe\u0301' and 'café' literals have slightly different code units,
// but both are rendered the same sequence of symbols café.
Escape sequences
Escape sequences in strings are used to express code units based on code point numbers.
- Hexadecimal escape sequence
- Unicode escape sequence
- Code point escape sequence(new)