Encoding

1. Number Encoding
- 1.1. Integer
- 1.2. BCD
2. Character Encoding
- 2.1. BCDIC and EBCDIC
  - 2.1.1. ⌑
- 2.2. ASCII
3. Binary Encoding
- 3.1. Base64
4. Control Characters
- 4.1. ASCII Control Character
- 4.2. ANSI Escape Sequence
  - 4.2.1. Control Sequence Introducer
    - 4.2.1.1. Select Graphic Rendition
5. Unicode
6. Compression
- 6.1. AV1
- 6.2. Lempel-Ziv Compression
7. Reference

1. Number Encoding

1.1. Integer

Base two representation for the positive integers.
- 0001 0101 = 21₁₀
For negative numbers various methods are used.
- Sign-Magnitude
  - Set the most significant bit to 1 if negative.
- One's Complement
  - Find complement with respect to 1111 1111.
- Two's Complement
  - Find complement with respect to (1)0000 0000
  - 1111 1111 = -1₁₀
  - 1110 1011 = -21₁₀
- Offset-Binary
  - Offset the zero to $ K $, then called excess-$ K $
  - e.g. excess-256
- Base -2
  - The digit at position $ n $ (counting from 0 at the right) has weight $ (-2)^{n} $.
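
The representations above can be compared side by side with a small sketch. The helper names, the 8-bit width, and the offset K = 128 are assumptions made for illustration; they are not part of the notes.

```python
def to_bits(value, width=8):
    """Format a non-negative integer as a zero-padded bit string."""
    return format(value, f"0{width}b")

def sign_magnitude(n, width=8):
    # Top bit is the sign, the remaining bits hold |n|.
    sign = 1 if n < 0 else 0
    return to_bits((sign << (width - 1)) | abs(n), width)

def ones_complement(n, width=8):
    # Negative values are the complement of |n| with respect to 1111 1111.
    mask = (1 << width) - 1
    return to_bits(n if n >= 0 else mask - abs(n), width)

def twos_complement(n, width=8):
    # Negative values are the complement of |n| with respect to 1 0000 0000.
    return to_bits(n & ((1 << width) - 1), width)

def offset_binary(n, k=128, width=8):
    # Excess-K: store n + K (K = 128 is an assumed example for 8 bits).
    return to_bits(n + k, width)

def base_neg2(n):
    # Repeated division by -2, forcing each remainder into {0, 1}.
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        n, r = divmod(n, -2)
        if r < 0:
            n, r = n + 1, r + 2
        digits.append(str(r))
    return "".join(reversed(digits))

for name, fn in [("sign-magnitude", sign_magnitude),
                 ("one's complement", ones_complement),
                 ("two's complement", twos_complement),
                 ("excess-128", offset_binary),
                 ("base -2", base_neg2)]:
    print(f"{name:17} {fn(-21)}")
```

In base -2 no sign bit is needed: -21 comes out as 111111, which is 1 - 2 + 4 - 8 + 16 - 32.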

1.2. BCD

Binary-Coded Decimal
It represents a number by encoding each digit of its decimal form in sequence.
A binary integer is converted to binary-coded decimal when, for example, a number is shown on a seven-segment display.
A single digit is encoded with 4 bits.
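
A minimal sketch of the digit-by-digit encoding, assuming a non-negative integer; `bcd_digits` is a hypothetical helper name.

```python
def bcd_digits(n):
    """Encode each decimal digit of a non-negative integer as a 4-bit group."""
    return [format(int(d), "04b") for d in str(n)]

print(bcd_digits(21))   # two 4-bit groups, one per decimal digit
```

Compare with the plain binary form of 21, 0001 0101: in BCD the two nibbles are the digits 2 and 1 instead.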
Gray Code
- The display may flicker if too many bits flip at once, so it is desirable that every pair of consecutive numbers differs in only a single bit.
- See Gray code - Wikipedia
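
The single-bit property can be checked with the usual reflected binary Gray code. This is a sketch; the function names are illustrative.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def from_gray(g):
    """Inverse of to_gray: XOR together all right shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

for i in range(8):
    print(i, format(to_gray(i), "03b"))
```

Each printed row differs from the previous one in exactly one bit position.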

2. Character Encoding

2.1. BCDIC and EBCDIC

BCD, Alphanumeric BCD
(Extended) Binary-Coded Decimal Interchange Code.
BCDIC was introduced by IBM in 1928 for use on IBM punched cards, and EBCDIC followed in 1963.
BCD is 6-bit and EBCDIC is 8-bit.
BCD does not distinguish letter case.
There is no standard for BCD and only a loose one for EBCDIC.

2.1.1. ⌑

Square Lozenge
0x3C in BCDIC
Calculator symbol for 'subtotal'

2.2. ASCII

American Standard Code for Information Interchange
7 bit encoding, published in 1963.
- There are many versions of it, including international variants and recommendations.
0x00 - 0x1F control characters
- 00^@ 03^C 04^D 07^G 09^I 0A^J 11^Q 1A^Z 1B^[
0x20 - 0x40 Symbols

| Code           | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 2A | 2B | 2C | 2D | 2E | 2F |
| Char           | ␣  | !  | "  | #  | $  | %  | &  | '  | (  | )  | *  | +  | ,  | -  | .  | /  |
| Key (US layout) |    | 1  | '  | 3  | 4  | 5  | 7  | '  | 9  | 0  | 8  | =  | ,  | -  | .  | /  |

| Code           | 30-39 | 3A | 3B | 3C | 3D | 3E | 3F |
| Char           | 0-9   | :  | ;  | <  | =  | >  | ?  |
| Key (US layout) |       | ;  | ;  | ,  | =  | .  | /  |

0x41 - 0x5A Uppercase Latin
0x5B - 0x60 Symbols
- 5B[ 5C\ 5D] 5E^ 5F_ 60`
0x61 - 0x7A Lowercase Latin
0x7B - 0x7E Symbols
- 7B{ 7C| 7D} 7E~
0x7F DEL
- ^?

3. Binary Encoding

3.1. Base64

Take three bytes and convert them into four characters from a 64-character alphabet ([A-Za-z0-9+/]); = is used as padding when the input length is not a multiple of three.
Useful for ASCII-only transports such as SMTP.
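
The three-bytes-to-four-characters step can be sketched by hand and compared against Python's standard `base64` module. `encode_group` is a hypothetical helper that handles exactly one full 3-byte group and no padding.

```python
import base64

ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def encode_group(three):
    """Encode exactly three bytes as four Base64 characters."""
    if len(three) != 3:
        raise ValueError("expected exactly three bytes")
    n = int.from_bytes(three, "big")          # 24 bits
    # Split the 24 bits into four 6-bit indices into the alphabet.
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_group(b"Man"))
print(base64.b64encode(b"Man").decode("ascii"))
print(base64.b64encode(b"Ma").decode("ascii"))   # padded with =
```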

4. Control Characters

Non-Printable Character, NPC

4.1. ASCII Control Character

C0 Control Code
They have been carried over to Unicode at the same code points, in the general category Cc.

4.1.1. Representations

Codepoint 0x1B
Abbreviation ESC
Symbols ␛
- around U+2400
Graphics ⌁, around U+2370
Caret Notation ^[
- ASCII-based keyboards generate a control character when the Ctrl key is pressed together with another character. The result is the code of the uppercase letter or symbol minus 0x40, except for ?, which is plus 0x40 (in both cases bit 0x40 is flipped).
Escape Sequence
- \e
  - Bash, GCC/Clang C (as a non-standard extension) and numerous others support it; Python does not.
- \033 (Octal code)
  - An octal escape can be used to represent any byte.
- \x1B (Hexadecimal code)
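
As a small sketch of these representations, caret notation can be derived by flipping bit 0x40, and the octal and hexadecimal escapes denote the same byte. The `caret` helper is an illustrative name.

```python
def caret(code):
    """Caret notation for a C0 control code (0x00-0x1F) or DEL (0x7F)."""
    if not (0 <= code <= 0x1F or code == 0x7F):
        raise ValueError("not a C0 control code or DEL")
    return "^" + chr(code ^ 0x40)

for code in (0x00, 0x0A, 0x1B, 0x7F):
    print(f"0x{code:02X} {caret(code)}")

# The octal and hexadecimal escapes are two spellings of the same character.
print("\033" == "\x1b")
```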

4.1.2. Input

Usually shift is avoided: <C-/> for ^? (DEL), <C-2> for ^@ (NUL). <C-SPC> for ^@ is also common.
- Not in Emacs.
In Emacs, <C-q> allows the user to insert control characters. <C-v> also works.
- After <C-q>, type the corresponding key combination (<C-[>), the key itself (<ESC>), or the octal code (033).
- Note
  - <S> is unnecessary for uppercase, but it is required for symbols.
  - <C-SPC> for ^@
Terminal emulators take keyboard input and send the corresponding control characters to the program.
- The mappings are mostly sensible.
- <C-BACKSPACE> to ^H (BS)
- <C-/> to ^_ (US)
The shell interprets the escape sequences \e, \t, \n, … when the string is prefixed with $, as in $'...'.
- With echo, \t and \n are interpreted by echo itself (Bash requires echo -e).

4.1.3. Characters

0x00 NUL ^@ \0
0x03 ETX ^C
- End of Text
0x04 EOT ^D
- End of Transmission
0x09 HT ^I \t
0x0A LF ^J \n
0x0D CR ^M
0x11 DC1 ^Q
- Device Control
0x1A SUB ^Z
- Substitute, ␚, ␦
- Used in place of unrecognizable characters.
- The replacement character � U+FFFD is recommended by Unicode.
- ^Z sends SIGTSTP (suspend) from a Linux terminal.
0x1B ESC ^[ \e

4.2. ANSI Escape Sequence

ANSI Escape Code, C1 Control Code
^[ followed by a byte in 0x40-0x5F, or the equivalent single byte in 0x80 - 0x9F
The one-byte form is not widely used.
They are interpreted by the program.
- The terminal emulator takes the keyboard input, converts it into ANSI escape codes, and sends them to the program.

4.2.1. Control Sequence Introducer

CSI
^[[, \e[, \033[
- ESC followed by the character [
- It is followed by
  - Parameter bytes 0x30-0x3F (0-9:;<=>?)
  - Intermediate bytes 0x20-0x2F (space and !"#$%&'()*+,-./)
  - Final byte 0x40-0x7E (@A–Z[\]^_`a–z{|}~)
Cursor control (A, B, C, D, …), Erase (J, K), Scroll (S, T)

4.2.1.1. Select Graphic Rendition

m
PARAMETERS
- 0 normal, 1 bold, 3 italic, …
- 30 - 37 set foreground color, 40 - 47 set background color (3-bit and 4-bit)
- 38 and 48 are used to access more colors
  - They are followed by 5;n (8-bit) or 2;r;g;b (24-bit, "true color").
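
A minimal sketch of building SGR sequences as strings. The helper names are illustrative, and whether the colors actually render depends on the terminal emulator.

```python
ESC = "\x1b"

def sgr(*params):
    """Build a CSI ... m sequence from numeric SGR parameters."""
    return f"{ESC}[{';'.join(str(p) for p in params)}m"

def fg_256(n):
    """8-bit foreground color: 38;5;n."""
    return sgr(38, 5, n)

def fg_rgb(r, g, b):
    """24-bit foreground color: 38;2;r;g;b."""
    return sgr(38, 2, r, g, b)

RESET = sgr(0)

print(sgr(1, 31) + "bold red" + RESET)
print(fg_256(208) + "8-bit orange" + RESET)
print(fg_rgb(255, 128, 0) + "24-bit orange" + RESET)
```

Several parameters can share one sequence, separated by semicolons, which is why bold and red fit into a single `sgr(1, 31)` call.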

5. Unicode

5.1. Code Point

Unicode defines a code point for each character. A single grapheme may correspond to multiple code points: é can be written as e (U+0065) followed by the combining acute accent (U+0301).
The first 128 code points coincide with ASCII.
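
The difference between a grapheme and its code points can be observed with Python's standard `unicodedata` module. This is a sketch; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but are different code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# The ASCII range keeps its values as code points.
print(hex(ord("A")))
```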

5.2. Encoding

Unlike encodings such as EUC-KR, CP-949, and EUC-JP, which map graphemes directly to bytes, Unicode encoding schemes map code points to bytes, so there is another layer of abstraction.

5.2.1. UTF-8

Most common encoding method.
Characters, Symbols and the Unicode Miracle - Computerphile - YouTube

5.2.1.1. Encoding

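- 0 + {7} (ASCII)
- 110 + {5}, 10 + {6}
- 1110 + {4}, 10 + {6}, 10 + {6}
- …
- 1111110 + {1}, 10 + {6}, …, 10 + {6}
Here {n} stands for n payload bits. The original design went up to six bytes as shown; since RFC 3629, UTF-8 is limited to four bytes.
Moving back and forth by a character is easy this way, because every continuation byte starts with 10 and no leading byte does.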
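
The bit layout can be sketched as a hand-written encoder for the one- to four-byte forms, checked against Python's built-in codec. `utf8_encode`, `is_continuation` and `char_start` are illustrative names, and the sketch does not reject surrogates or out-of-range values.

```python
def utf8_encode(cp):
    """Encode one code point (0 to 0x10FFFF) into UTF-8 bytes."""
    if cp < 0x80:                      # 0 + 7 bits
        return bytes([cp])
    if cp < 0x800:                     # 110 + 5 bits, 10 + 6 bits
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110 + 4 bits, then two 10 + 6 bits
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110 + 3 bits, then three 10 + 6
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])

def is_continuation(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0b11000000) == 0b10000000

def char_start(data, i):
    """Step back from index i to the first byte of the character containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i

for ch in ("A", "\u00e9", "\ud55c", "\U0001F600"):
    print(ch, utf8_encode(ord(ch)).hex(" "))
```

`char_start` shows why stepping backwards is cheap: at most three continuation bytes have to be skipped.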

5.2.2. UTF-16

Variable-length encoding in which the length varies in units of two bytes: a code point takes either two or four bytes.
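
A sketch of how a code point is split into one or two 16-bit units (a surrogate pair for code points above U+FFFF). `utf16_units` is an illustrative name.

```python
def utf16_units(cp):
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0x10000:
        return [cp]                      # one 16-bit unit
    cp -= 0x10000                        # 20 bits remain
    high = 0xD800 | (cp >> 10)           # top 10 bits -> high surrogate
    low = 0xDC00 | (cp & 0x3FF)          # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in utf16_units(ord("A"))])
print([hex(u) for u in utf16_units(0x1F600)])
```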

5.2.3. UTF-32

Fixed length encoding of four bytes.

5.2.4. Byte Order Mark

BOM

U+FEFF zero width no-break space is used as a magic number at the start of a text stream.

It allows the editor to figure out the encoding and the byte order (endianness).
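
A minimal sketch of BOM sniffing using the constants in Python's standard `codecs` module. `sniff_bom` is a hypothetical helper; real editors combine this with other heuristics, since a BOM is optional.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom("\ufeffhello".encode("utf-8")))
print(sniff_bom(codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")))
print(sniff_bom(b"hello"))
```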

5.3. Catalog

0300-036F: Latin Combining
0483-0489: Cyrillic Combining
- U+0489 COMBINING CYRILLIC MILLIONS SIGN: ҉
'' ZERO WIDTH SPACE, ' ' NO BREAK SPACE, ' ' THIN SPACE, ' ' HAIR SPACE, ' ' FIGURE SPACE, ' ' PUNCTUATION SPACE
'‍' ZERO WIDTH JOINER, '⁠' WORD JOINER
- It is used right beside space to represent that the space is part of a word.

5.4. History

5.4.1. `|`

These Keys Shouldn't Exist | Nostalgia Nerd - YouTube
In the early days of punch cards there was no standard, and character encodings of different bit sizes emerged. Some of them were chosen as encodings for information interchange.
12 May 1966: the ASCII standard was established.
The IBM user group (PL/I programmers) then proposed that 0x21 (!) and 0x5E (^) could be | and ¬, so it became permissible to stylize them as such. Moreover, the original pipe symbol at 0x7C was broken to distinguish it from them.
5 July 1967: the broken bar officially became part of ASCII.
ISO 646:1973 and ASCII-1977 removed the stylized 0x21 and 0x5E and replaced the broken bar with a solid bar. But some vendors, especially IBM, kept the broken bar in code page 437.
ISO-8859-1 and ECMA-94 introduced Latin-1, which included the broken bar again in the high-bit range as 0xA6. DOS has both in code page 850. This led to disagreement on where to put the broken bar and the solid bar; there were even two different solid bars or two different broken bars.

6. Compression

6.1. AV1

Compression for videos
About 40% smaller in size, but slower to encode and decode.

6.2. Lempel-Ziv Compression

It builds a temporary codebook as it compresses a file. Starting from the beginning of the file, the current token is extended until it forms a sequence that is not yet in the codebook. The token is then encoded using the codes already in the codebook and written to the archive, and the new sequence is recorded in the codebook.

Decompression works similarly. It reads each token and expands it using a temporary codebook that it rebuilds on its own while decoding.
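
The description matches the LZ78 family most closely, so the sketch below assumes LZ78 with (codebook index, next character) pairs as output. The function names and the pair format are illustrative choices, not something the notes specify.

```python
def lz78_compress(text):
    """Compress a string into (codebook index, next character) pairs."""
    codebook = {"": 0}          # phrase -> index; index 0 is the empty phrase
    out = []
    phrase = ""
    for ch in text:
        if phrase + ch in codebook:
            phrase += ch        # keep extending while the phrase is known
        else:
            out.append((codebook[phrase], ch))
            codebook[phrase + ch] = len(codebook)   # record the new phrase
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it as prefix + last character.
        out.append((codebook[phrase[:-1]], phrase[-1]))
    return out

def lz78_decompress(pairs):
    """Rebuild the text, regenerating the same codebook on the fly."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz78_compress("abababababa")
print(pairs)
print(lz78_decompress(pairs))
```

The decompressor never receives the codebook; it reconstructs it from the pairs in the same order the compressor built it.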