Encoding
1. Number Encoding
1.1. Integer
- Base two representation for the positive integers.
- 0001 0101 = 21₁₀
- For negative numbers, various methods are used.
- Sign-Magnitude
- Put 1 in the most significant digit if negative.
- One's Complement
- Find complement with respect to 1111 1111.
- Two's Complement
- Find complement with respect to (1)0000 0000
- 1111 1111 = -1₁₀
- 1110 1011 = -21₁₀
- Offset-Binary
- Offset the zero to \( K \), then called excess-\( K \)
- e.g. excess-256
- Base -2
- Each digit at position \( n \) represents \( (-2)^{n} \).
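As a rough illustration of the representations listed above, the following Python sketch prints the 8-bit pattern of -21 under each scheme. The names `encode_signed` and `from_negabinary` are made up for this note, the offset is assumed to be half the range (excess-128 for 8 bits), and no range checking is done.

```python
def encode_signed(value: int, bits: int = 8) -> dict:
    """Bit patterns of `value` under several signed encodings (no range checks)."""
    mask = (1 << bits) - 1        # 1111 1111 for 8 bits
    sign_bit = 1 << (bits - 1)    # 1000 0000 for 8 bits
    magnitude = abs(value)
    patterns = {
        # Set the most significant bit if negative.
        "sign-magnitude": (sign_bit | magnitude) if value < 0 else magnitude,
        # Complement with respect to 1111 1111.
        "ones-complement": (mask - magnitude) if value < 0 else magnitude,
        # Complement with respect to (1)0000 0000.
        "twos-complement": value & mask,
        # Assumed offset K = 2**(bits - 1), i.e. excess-128 for 8 bits.
        "offset-binary": value + sign_bit,
    }
    return {name: format(p, f"0{bits}b") for name, p in patterns.items()}


def from_negabinary(digits: str) -> int:
    """Decode a base -2 digit string; the digit at position n weighs (-2)**n."""
    return sum(int(d) * (-2) ** n for n, d in enumerate(reversed(digits)))


for name, pattern in encode_signed(-21).items():
    print(f"{name:16} {pattern}")
print(from_negabinary("1101"))  # 1*(-8) + 1*4 + 0*(-2) + 1*1
```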
1.2. BCD
- Binary-Coded Decimal
- It represents a number by encoding each digit of the decimal number in sequence.
- Conversion from a binary integer to binary-coded decimal happens when a number is shown on a seven-segment display.
- A single digit is encoded with 4 bits.
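A minimal sketch of the digit-by-digit encoding described above, assuming non-negative integers and the plain 4-bit (8421) code per digit. The helper names `to_bcd` and `from_bcd` are hypothetical.

```python
def to_bcd(n: int) -> str:
    """Encode each decimal digit of a non-negative integer as a 4-bit group."""
    return " ".join(format(int(digit), "04b") for digit in str(n))


def from_bcd(bits: str) -> int:
    """Decode space-separated 4-bit groups back into an integer."""
    return int("".join(str(int(nibble, 2)) for nibble in bits.split()))


print(to_bcd(21))             # two nibbles, one per decimal digit
print(from_bcd("0010 0001"))
```

Note that 21 in BCD differs from 21 in plain binary (0001 0101): the BCD form keeps the decimal digits separate, which is what a digit-oriented display needs.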
Gray Code
- The display may flicker if too many bits flip at once; therefore it is desirable that every pair of consecutive numbers differs in only a single bit.
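The single-bit-difference property can be checked with the reflected binary Gray code, which is one common construction (the note does not say which Gray code is meant, so this is an assumption). The function names are illustrative.

```python
def to_gray(n: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Invert to_gray by folding in successively shifted copies of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


for i in range(8):
    print(i, format(to_gray(i), "03b"))
```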
2. Character Encoding
2.1. BCDIC and EBCDIC
- BCD, Alphanumeric BCD
- (Extended) Binary-Coded Decimal Interchange Code.
- BCDIC was introduced by IBM in 1928 for use on IBM punched cards, and EBCDIC in 1963.
- BCDIC is 6-bit and EBCDIC is 8-bit.
- BCDIC does not distinguish the case of a letter.
- There is no standard for BCDIC and only a loose one for EBCDIC.
2.1.1. ⌑
- Square Lozenge
- 0x3C in BCDIC
- Calculator symbol for 'subtotal'
2.2. ASCII
- American Standard Code for Information Interchange
- 7-bit encoding, published in 1963.
- There are many versions of it, including an international one and recommendations.
- 0x00 - 0x1F control characters
- 00 ^@, 03 ^C, 04 ^D, 07 ^G, 09 ^I, 0A ^J, 11 ^Q, 1A ^Z, 1B ^[
- 0x20 - 0x40 Symbols
| Code | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 2A | 2B | 2C | 2D | 2E | 2F | 30-39 | 3A | 3B | 3C | 3D | 3E | 3F |
| Char | ␣ | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / | 0-9 | : | ; | < | = | > | ? |
| Key (US layout, unshifted) | | 1 | ' | 3 | 4 | 5 | 7 | ' | 9 | 0 | 8 | = | , | - | . | / | | ; | ; | , | = | . | / |
- 40 @
- 0x41 - 0x5A Uppercase Latin
- 0x5B - 0x60 Symbols
- 5B [, 5C \, 5D ], 5E ^, 5F _
- 60 `
- 0x61 - 0x7A Lowercase Latin
- 0x7B - 0x7E Symbols
- 7B {, 7C |, 7D }, 7E ~
- 0x7F DEL
- ^?
3. Binary Encoding
3.1. Base64
- It takes three bytes and converts them into four printable characters from the alphabet [A-Za-z0-9+/], with = used as padding.
- Useful for ASCII-only transmission such as SMTP.
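A small sketch of the three-bytes-to-four-characters step, compared against Python's standard `base64` module. The helper `encode_group` is hypothetical and handles only a full 3-byte group; padding of shorter input is left to the library call.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_group(three: bytes) -> str:
    """Split 24 bits (3 bytes) into four 6-bit indices into the alphabet."""
    if len(three) != 3:
        raise ValueError("this sketch only handles a full 3-byte group")
    n = int.from_bytes(three, "big")
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))


print(encode_group(b"Man"))
print(base64.b64encode(b"Man").decode("ascii"))  # library result for comparison
print(base64.b64encode(b"Ma").decode("ascii"))   # shorter input is padded with '='
```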
4. Control Characters
- Non-Printable Character, NPC
4.1. ASCII Control Character
- C0 Control Code
- They were carried over to Unicode at the same code points, in the general category Cc.
4.1.1. Representations
- Code point, e.g. 0x1B (ESC)
- Abbreviation, e.g. BEL (0x07)
- Symbols, e.g. ␊ (U+240A, for LF), located around U+2400
- Graphics ⌁, around U+2370
- Caret Notation, e.g. ^[
- ASCII-based keyboards used to generate a control character by pressing the Ctrl key together with another character. The result is the code of the uppercase letter or symbol minus 0x40, except for ?, which is plus 0x40.
- Escape Sequence
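The caret-notation rule above (minus 0x40 for letters and symbols, plus 0x40 for ?) amounts to flipping the 0x40 bit, which the following sketch uses. The function names `caret` and `from_caret` are illustrative only.

```python
def caret(code: int) -> str:
    """Caret notation of a C0 control code or DEL."""
    if not (0x00 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not a control character")
    return "^" + chr(code ^ 0x40)   # flipping bit 0x40 covers both cases


def from_caret(notation: str) -> int:
    """Control code for a two-character caret notation such as '^['."""
    return ord(notation[1]) ^ 0x40


for code in (0x00, 0x09, 0x1B, 0x7F):
    print(hex(code), caret(code))
```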
4.1.2. Input
- Usually shift is avoided: <C-/> for ^? (DEL), <C-2> for ^@ (NUL).
- <C-SPC> for ^@ is also common, though not in Emacs.
- In Emacs, <C-q> allows the user to write the control characters. <C-v> also works.
- <C-q> followed by the corresponding key combination, <C-[>, or the key itself, <ESC>, or the octal code, 033, can be used.
- Note that <S> is unnecessary for uppercase letters, but it is required for symbols.
- Terminal emulators take the keyboard input and send the corresponding control characters to the program.
- They are mostly sensible.
- <C-BACKSPACE> to ^H (BS)
- <C-/> to ^_ (US)
- The shell interprets the escape sequences \e, \t, \n, … when the string is written as $'…' (ANSI-C quoting in Bash).
- When using echo, \t and \n are interpreted by echo itself.
4.1.3. Characters
- 0x00 NUL ^@ \0
- 0x03 ETX ^C (End of Text)
- 0x04 EOT ^D (End of Transmission)
- 0x09 HT ^I \t
- 0x0A LF ^J \n
- 0x0D CR ^M
- 0x11 DC1 ^Q (Device Control)
- 0x1A SUB ^Z
- 0x1B ESC ^[ \e
4.2. ANSI Escape Sequence
- ANSI Escape Code, C1 Control Code
- ^[ followed by a byte in 0x40-0x5F, or a single byte in 0x80-0x9F
- The one-byte code is not widely used.
- They are interpreted by the program.
- The terminal emulator takes the keyboard input, changes it into an ANSI escape code, and sends it to the program.
4.2.1. Control Sequence Introducer
- CSI ^[[, \e[, \033[
- The ESC and the character [
- It is followed by
- Parameter bytes 0x30-0x3F (0-9:;<=>?)
- Intermediate bytes 0x20-0x2F (space and !"#$%&'()*+,-./)
- Final byte 0x40-0x7E (@A-Z[\]^_`a-z{|}~)
- Cursor control (A, B, C, D, …), Erase (J, K), Scroll (S, T)
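The byte ranges of a CSI sequence can be expressed as a regular expression. The following is only a sketch for splitting one complete sequence into its parts; `CSI_PATTERN` and `parse_csi` are names invented here, and it does not handle the single-byte 0x9B form or sequences embedded in longer text.

```python
import re

# ESC [ , then parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and exactly one final byte (0x40-0x7E).
CSI_PATTERN = re.compile(r"\x1b\[([0-?]*)([ -/]*)([@-~])")


def parse_csi(sequence: str):
    """Return (parameters, intermediates, final) or None if it is not a CSI sequence."""
    match = CSI_PATTERN.fullmatch(sequence)
    if match is None:
        return None
    return match.groups()


print(parse_csi("\x1b[2J"))     # erase display
print(parse_csi("\x1b[1;31m"))  # two parameters, final byte m
print(parse_csi("hello"))
```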
4.2.1.1. Select Graphic Rendition
- m (the final byte)
- PARAMETERS
- 0 normal, 1 bold, 3 italic, …
- 30-37 set foreground color, 40-47 set background color (3-bit and 4-bit)
- 38 and 48 are used to access more colors; they are followed by 5;n (8-bit) or 2;r;g;b (24-bit, "true color").
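To make the parameter syntax concrete, here is a small sketch that assembles SGR sequences as CSI + parameters joined by ; + the final byte m. The helper `sgr` is hypothetical, and whether a given terminal honors 8-bit or 24-bit colors is not checked.

```python
CSI = "\x1b["


def sgr(*params: int) -> str:
    """Build a Select Graphic Rendition sequence from numeric parameters."""
    return CSI + ";".join(str(p) for p in params) + "m"


RESET = sgr(0)
print(sgr(1, 31) + "bold red" + RESET)                    # 3-bit foreground color
print(sgr(38, 5, 208) + "8-bit orange" + RESET)           # 38;5;n
print(sgr(48, 2, 0, 64, 128) + "24-bit background" + RESET)  # 48;2;r;g;b
```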
5. Unicode
5.1. Code Point
- Unicode assigns code points to characters. A grapheme may correspond to multiple code points; for example, é can be written as e (U+0065) followed by the combining acute accent (U+0301).
- The first 128 code points follow ASCII.
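The one-grapheme, several-code-points case can be observed with Python's standard `unicodedata` module. This sketch uses the combining acute accent U+0301 and NFC normalization to obtain the precomposed form; the variable names are arbitrary.

```python
import unicodedata

decomposed = "e\u0301"                               # e + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # precomposed form

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in composed])
print(unicodedata.name("\u0301"))
print(ord("A"))  # same value as in ASCII
```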
5.2. Encoding
- Unlike other encodings such as EUC-KR, CP-949, EUC-JP, etc., which map graphemes directly to bytes, Unicode encoding schemes map code points to bytes, so there is another layer of abstraction.
5.2.1. UTF-8
- Most common encoding method.
- Characters, Symbols and the Unicode Miracle - Computerphile - YouTube
5.2.1.1. Encoding
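- 0 + {7} (ASCII)
- 110 + {5}, 10 + {6}
- 1110 + {4}, 10 + {6}, 10 + {6}
- …
- 1111110 + {1}, 10 + {6}, …, 10 + {6}
- Here {n} stands for n payload bits. Patterns longer than four bytes belong to the original design; RFC 3629 restricts UTF-8 to at most four bytes.
- Moving back and forth by a character is easy this way, since every continuation byte starts with 10.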
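The prefix-plus-payload layout can be written out by hand and compared with Python's built-in encoder. This is a sketch limited to the one- to four-byte forms allowed today; `utf8_encode` is a made-up name, and surrogate code points and out-of-range values are not rejected.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the 1- to 4-byte UTF-8 patterns."""
    if cp < 0x80:                       # 0 + 7 bits
        return bytes([cp])
    if cp < 0x800:                      # 110 + 5 bits, 10 + 6 bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110 + 4 bits, then two continuation bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110 + 3 bits, then three continuation bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for ch in "A\u00e9\ud55c\U0001F600":
    print(f"U+{ord(ch):04X}", utf8_encode(ord(ch)).hex(" "))
```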
5.2.2. UTF-16
- Variable-length encoding whose length varies in units of two bytes (two or four bytes per code point).
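A sketch of how a code point becomes one or two 16-bit units: code points below U+10000 fit in a single unit, and the rest are split into a surrogate pair. The function name `utf16_units` is illustrative, and byte order is left out (units are shown as integers).

```python
def utf16_units(cp: int) -> list:
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0x10000:
        return [cp]                     # one 2-byte unit
    offset = cp - 0x10000               # 20 bits remain
    high = 0xD800 | (offset >> 10)      # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)     # lower 10 bits
    return [high, low]                  # two 2-byte units


print([hex(u) for u in utf16_units(0x41)])
print([hex(u) for u in utf16_units(0x1F600)])
```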
5.2.3. UTF-32
- Fixed length encoding of four bytes.
5.2.4. Byte Order Mark
- BOM
- U+FEFF ZERO WIDTH NO-BREAK SPACE is used as a magic number at the start of a text stream.
- It allows the editor to figure out the encoding and the byte order (endianness).
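A minimal sketch of BOM sniffing using the constants from Python's standard `codecs` module. The function name `sniff_bom` is hypothetical; the UTF-32 marks are tested before the UTF-16 ones because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM.

```python
import codecs

# Order matters: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom("\ufeffhi".encode("utf-16-le")))
print(sniff_bom("\ufeffhi".encode("utf-32-be")))
print(sniff_bom(b"hi"))
```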
5.3. Catalog
- 0300-036F: Latin Combining
- 0483-0489: Cyrillic Combining
- U+0489 COMBINING CYRILLIC MILLIONS SIGN: ҉
- U+200B ZERO WIDTH SPACE, U+00A0 NO-BREAK SPACE, U+2009 THIN SPACE, U+200A HAIR SPACE, U+2007 FIGURE SPACE, U+2008 PUNCTUATION SPACE
- U+200D ZERO WIDTH JOINER, U+2060 WORD JOINER
- The word joiner is placed right beside a space to indicate that the space is part of a word.
5.4. History
5.4.1. |
- These Keys Shouldn't Exist | Nostalgia Nerd - YouTube
- In the early days of punched cards there was no standard. Character encodings of different bit sizes also emerged, and some of them were chosen as encodings for information exchange.
- 12 May 1966: the ASCII standard was established.
- But the IBM user group (PL/1 programmers) asked for | and ¬, so 0x21 (!) and 0x5E (^) were allowed to be stylized as | and ¬. Moreover, the original pipe symbol at 0x7C was broken (¦) to distinguish it from them.
- 5 July 1967: the broken bar officially became part of ASCII.
- ISO 646:1973 and ASCII-1977 removed the stylized alternatives and replaced the broken bar with a solid bar. But some vendors, especially IBM, kept the broken bar, as in code page 437.
- ISO-8859-1 and ECMA-94 introduced Latin-1, which included the broken bar again in the high-bit range as 0xA6. DOS has it in code page 850. This led to disagreement over where to put the broken bar and the solid bar. There were even two different solid bars or two different broken bars.
6. Compression
6.1. AV1
- Compression for videos
- About 40% smaller in size, but slower to encode and decode.
6.2. Lempel-Ziv Compression
It builds a temporary codebook as it compresses a file. Starting from the current position, the token is extended until it forms a sequence that is not yet in the codebook. The token is then encoded using the codes already stored in the codebook and written to the archive file, and the new sequence is recorded in the temporary codebook.
Decompression works similarly. It reads each token and expands it using a temporary codebook that it rebuilds on its own.
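The codebook idea can be sketched in the LZ78 style, which is an assumption here since the note does not name a specific Lempel-Ziv variant. Each output pair is (index of the longest known sequence, next character), and the decompressor rebuilds the same codebook from the pairs alone. The function names are made up for this note.

```python
def lz78_compress(text: str) -> list:
    """Return (codebook index, next character) pairs, LZ78 style."""
    codebook = {"": 0}          # temporary codebook, grows while compressing
    output = []
    token = ""
    for ch in text:
        if token + ch in codebook:
            token += ch         # keep extending while the sequence is known
        else:
            output.append((codebook[token], ch))
            codebook[token + ch] = len(codebook)   # record the new sequence
            token = ""
    if token:                   # flush a known sequence left at the end
        output.append((codebook[token], ""))
    return output


def lz78_decompress(pairs: list) -> str:
    """Rebuild the text, regenerating the codebook along the way."""
    codebook = [""]
    parts = []
    for index, ch in pairs:
        entry = codebook[index] + ch
        codebook.append(entry)
        parts.append(entry)
    return "".join(parts)


pairs = lz78_compress("abababab")
print(pairs)
print(lz78_decompress(pairs))
```

The codebook is never stored in the output; both sides derive it from the data in the same order, which is why it is only temporary.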
7. Reference
- Signed number representations - Wikipedia
- Binary-coded decimal - Wikipedia
- Control character - Wikipedia
- Caret notation - Wikipedia
- C0 and C1 control codes - Wikipedia
- ASCII - Wikipedia
- ANSI escape code - Wikipedia
- ISO/IEC 2022 - Wikipedia
- Unicode 15.1 Character Code Charts
- Unicode - Compart
- Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more - YouTube
- How are Redstone Computers even possible? - YouTube
- The Beauty of Lempel-Ziv Compression - YouTube