Encoding

1. Number Encoding

1.1. Integer

  • Base two representation for the positive integers.
    • 0001 0101 = 21₁₀
  • For negative numbers various methods are used.
    • Sign-Magnitude
      • Put 1 in the most significant bit if negative.
    • One's Complement
      • Find complement with respect to 1111 1111.
    • Two's Complement
      • Find complement with respect to (1)0000 0000
      • 1111 1111 = -1₁₀
      • 1110 1011 = -21₁₀
    • Offset-Binary
      • Offset the zero to \( K \), then called excess-\( K \)
      • e.g. excess-256
    • Base -2
      • Each digit represents a power \( (-2)^{n} \).
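The schemes above can be compared side by side with a minimal sketch. The function names and the default 8-bit width are assumptions made for illustration; they do not come from any library.

```python
def sign_magnitude(value: int, bits: int = 8) -> str:
    # The most significant bit carries the sign, the rest the magnitude.
    sign = 1 << (bits - 1) if value < 0 else 0
    return format(sign | abs(value), f"0{bits}b")


def ones_complement(value: int, bits: int = 8) -> str:
    # Negative numbers are the bitwise complement with respect to 1111 1111.
    mask = (1 << bits) - 1
    return format(value if value >= 0 else (-value) ^ mask, f"0{bits}b")


def twos_complement(value: int, bits: int = 8) -> str:
    # Negative numbers are the complement with respect to (1)0000 0000.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def offset_binary(value: int, k: int = 128, bits: int = 8) -> str:
    # Excess-K: zero is stored as K.
    return format(value + k, f"0{bits}b")


def to_negabinary(value: int) -> str:
    # Base -2: repeated division by -2 with a non-negative remainder.
    if value == 0:
        return "0"
    digits = []
    while value != 0:
        value, rem = divmod(value, -2)
        if rem < 0:
            value, rem = value + 1, rem + 2
        digits.append(str(rem))
    return "".join(reversed(digits))


def from_negabinary(text: str) -> int:
    # Digit n (counted from the right, starting at 0) weighs (-2)^n.
    return sum(int(d) * (-2) ** n for n, d in enumerate(reversed(text)))
```

Base -2 needs no sign convention at all: both 21 and -21 are plain digit strings.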

1.2. BCD

  • Binary-Coded Decimal
  • It represents a number by encoding each digit of the decimal number in sequence.
  • Conversion from a binary integer to binary-coded decimal happens when a number is shown on a seven-segment display.
  • A single digit is encoded with 4 bits.
  • Gray Code
    • The display may flicker if too many bits flip at once, so it is desirable that every pair of consecutive numbers differs in only a single bit.
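A short sketch of both ideas, assuming the common "reflected binary" Gray code (the notes do not say which Gray code is meant). The names to_bcd and to_gray are illustrative.

```python
def to_bcd(number: int) -> str:
    # Encode each decimal digit separately with 4 bits.
    return " ".join(format(int(digit), "04b") for digit in str(number))


def to_gray(n: int) -> int:
    # Reflected binary Gray code: XOR the value with itself shifted right by one.
    return n ^ (n >> 1)


def bits_changed(a: int, b: int) -> int:
    # Number of bit positions in which a and b differ.
    return bin(a ^ b).count("1")
```

For example, to_bcd(21) gives "0010 0001", while plain binary for 21 is 0001 0101.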

2. Character Encoding

2.1. BCDIC and EBCDIC

  • BCD, Alphanumeric BCD
  • (Extended) Binary-Coded Decimal Interchange Code.
  • BCDIC was introduced by IBM in 1928 for use with IBM punched cards, and EBCDIC in 1963.
  • BCD is 6 bit and EBCDIC is 8 bit.
  • BCD is agnostic about the case of a letter.
  • There is no standard for BCD and only a loose one for EBCDIC.

2.1.1. ⌑

  • Square Lozenge (diamond, rhombus)
  • 0x3C in BCDIC
  • Calculator symbol for 'subtotal'

2.2. ASCII

  • American Standard Code for Information Interchange
  • 7 bit encoding, published in 1963.

    • There are many versions of it, including international variants and recommendations.

  • 0x00 - 0x1F control characters
    • 00 ^@, 03 ^C, 04 ^D, 07 ^G, 09 ^I, 0A ^J, 11 ^Q, 1A ^Z, 1B ^[
  • 0x20 - 0x40 Symbols (0x30 - 0x39 are the digits)

| 20 ␣ | 21 ! | 22 " | 23 # | 24 $ | 25 % | 26 & | 27 ' | 28 ( | 29 ) | 2A * | 2B + | 2C , | 2D - | 2E . | 2F / |
| 30 0 |      |      |      |      |      |      |      |      |      | 3A : | 3B ; | 3C < | 3D = | 3E > | 3F ? |
|      | 1    | '    | 3    | 4    | 5    | 7    | '    | 9    | 0    | 8    | =    | ,    | -    | .    | /    |
|      |      |      |      |      |      |      |      |      |      | ;    | ;    | ,    | =    | .    | /    |

  • 40 @
  • 0x41 - 0x5A Uppercase Latin
  • 0x5B - 0x60 Symbols
    • 5B [, 5C \, 5D ], 5E ^, 5F _
    • 60 `
  • 0x61 - 0x7A Lowercase Latin
  • 0x7B - 0x7E Symbols
    • 7B {, 7C |, 7D }, 7E ~
  • 0x7F DEL
    • ^?

3. Binary Encoding

3.1. Base64

  • Take three bytes and convert them into four characters from a 64-character alphabet ([A-Za-z0-9+/]); = is used only as padding when the input length is not a multiple of three.
  • Useful for ASCII only transmission such as SMTP.
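The three-bytes-to-four-characters step can be sketched by hand and checked against Python's standard base64 module. encode_group is a hypothetical helper that handles exactly one full group and no padding.

```python
import base64
import string

# The standard Base64 alphabet: index 0-63 maps to one printable character.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(three: bytes) -> str:
    # Treat three bytes as one 24-bit number and cut it into four 6-bit indices.
    if len(three) != 3:
        raise ValueError("exactly three bytes expected")
    n = int.from_bytes(three, "big")
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))


# The library version also handles padding for inputs that are not a multiple of 3.
padded = base64.b64encode(b"Ma").decode("ascii")
```

The hand-written group encoder agrees with base64.b64encode on any three-byte input; the padded value shows the = sign filling the last group.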

4. Control Characters

  • Non-Printable Character, NPC

4.1. ASCII Control Character

  • C0 Control Code
  • They have been carried over into Unicode at the same code points, where they belong to the general category Cc.

4.1.1. Representations

  • Code point 0x1B
  • Abbreviation ESC
  • Symbols ␛
    • around U+2400
  • Graphics ⌁, around U+2370
  • Caret Notation ^[
    • ASCII-based keyboards generated a control character when the Ctrl key was pressed together with another character. The result is the code of the uppercase letter or symbol minus 0x40, except for ?, which is plus 0x40 (^? is DEL, 0x7F).
  • Escape Sequence
    • \e
    • \033 (Octal code)
      • It can be used to represent any byte.
    • \x1B (hexadecimal code)
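The caret rule can be written as a single XOR with 0x40, which covers both the "minus 0x40" case and the "plus 0x40" case for DEL. The helper names below are illustrative. Note that Python string literals accept \033 and \x1b but not \e.

```python
def to_caret(code: int) -> str:
    # Control codes 0x00-0x1F and DEL (0x7F) map to ^ plus a printable character.
    if not (0 <= code < 0x20 or code == 0x7F):
        raise ValueError("not an ASCII control character")
    return "^" + chr(code ^ 0x40)


def from_caret(notation: str) -> int:
    # Inverse: ^[ -> 0x1B, ^? -> 0x7F. Lowercase letters are treated as uppercase.
    if len(notation) != 2 or notation[0] != "^":
        raise ValueError("expected a two-character caret notation")
    return ord(notation[1].upper()) ^ 0x40


# Octal and hexadecimal escapes denote the same byte.
ESC = "\033"
```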

4.1.2. Input

  • Usually Shift is avoided: <C-/> for ^? (DEL), <C-2> for ^@ (NUL). <C-SPC> for ^@ is also common.
    • Not in Emacs.
  • In Emacs, <C-q> allows the user to write the control characters. <C-v> also works.
    • After <C-q>, the corresponding key combination, <C-[>, or the key itself, <ESC>, or the octal code, 033, can be used.
    • Note
      • <S> is unnecessary for uppercase, but it is required for symbols.
      • <C-SPC> for ^@
  • Terminal emulators take the keyboard input and send the corresponding control characters to the program.
    • The mappings are mostly sensible.
    • <C-BACKSPACE> to ^H (BS)
    • <C-/> to ^_ (US)
  • The shell interprets the escape sequences \e, \t, \n, … when the string is written as $'string' (ANSI-C quoting).
    • With echo, \t and \n are interpreted by echo itself (in Bash, when called with -e).

4.1.3. Characters

  • 0x00 NUL ^@ \0
  • 0x03 ETX ^C
    • End of Text
  • 0x04 EOT ^D
    • End of Transmission
  • 0x09 HT ^I \t
  • 0x0A LF ^J \n
  • 0x0D CR ^M
  • 0x11 DC1 ^Q
    • Device Control
  • 0x1A SUB ^Z
    • Substitute, ␚, ␦
    • Used in place of unrecognizable characters.
    • The replacement character � U+FFFD is recommended by Unicode.
    • Typed in a Linux terminal, it sends SIGTSTP (suspend) to the foreground job.
  • 0x1B ESC ^[ \e

4.2. ANSI Escape Sequence

  • ANSI Escape Code, C1 Control Code
  • ^[ followed by a byte in 0x40 - 0x5F, or the equivalent single byte in 0x80 - 0x9F
  • The one-byte form is not widely used.
  • They are interpreted by the program.
    • The terminal emulator takes the keyboard input, converts it into ANSI escape codes, and sends them to the program.

4.2.1. Control Sequence Introducer

  • CSI
  • ^[[, \e[, \033[
    • The ESC followed by the character [
    • It is followed by
      • Parameter bytes 0x30-0x3F (0-9:;<=>?)
      • Intermediate bytes 0x20-0x2F (space and !"#$%&'()*+,-./)
      • Final byte 0x40-0x7E (@A–Z[\]^_`a–z{|}~)
  • Cursor control (A, B, C, D, …), Erase (J, K), Scroll (S, T)
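The byte ranges above translate directly into a regular expression. This is a minimal sketch for picking a CSI sequence apart; CSI_PATTERN and split_csi are names invented for this example.

```python
import re

# ESC [ , then parameter bytes (0x30-0x3F), intermediate bytes (0x20-0x2F),
# and exactly one final byte (0x40-0x7E).
CSI_PATTERN = re.compile(r"\x1b\[([0-?]*)([ -/]*)([@-~])")


def split_csi(sequence: str):
    # Return (parameters, intermediates, final) or None if it is not one CSI sequence.
    match = CSI_PATTERN.fullmatch(sequence)
    if match is None:
        return None
    params, intermediates, final = match.groups()
    return params.split(";") if params else [], intermediates, final
```

For instance, the sequence ESC [ 2 J splits into the parameter "2" and the final byte J (erase display).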
4.2.1.1. Select Graphic Rendition
  • m
  • PARAMETERS
    • 0 normal, 1 bold, 3 italic, …
    • 30 - 37 set foreground color, 40 - 47 set background color (3-bit and 4-bit)
    • 38 and 48 are used to access more colors
      • They are followed by 5;n (8-bit) or 2;r;g;b (24-bit, "true color").
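A small sketch of how these parameters are assembled into escape sequences. The helper names (sgr, fg_256, fg_rgb) are hypothetical, and whether the colors actually render depends on the terminal emulator.

```python
CSI = "\x1b["


def sgr(*params: int) -> str:
    # Select Graphic Rendition: parameters joined by ';' and terminated by 'm'.
    return CSI + ";".join(str(p) for p in params) + "m"


def fg_256(n: int) -> str:
    # 8-bit foreground color: 38;5;n
    return sgr(38, 5, n)


def fg_rgb(r: int, g: int, b: int) -> str:
    # 24-bit foreground color: 38;2;r;g;b
    return sgr(38, 2, r, g, b)


RESET = sgr(0)

if __name__ == "__main__":
    print(sgr(1, 31) + "bold red" + RESET)
    print(fg_rgb(255, 128, 0) + "orange" + RESET)
```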

5. Unicode

5.1. Code Point

  • Unicode defines code points for graphemes. A grapheme may correspond to multiple code points; for example, é can be written as e (U+0065) followed by the combining acute accent (U+0301).
  • The first 128 code points coincide with ASCII.
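Both points can be checked with Python's standard unicodedata module. The variable names are only for illustration.

```python
import unicodedata

# One grapheme, two code points: e followed by a combining acute accent.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits U+00E9 back into two code points.
redecomposed = unicodedata.normalize("NFD", composed)

# The first 128 code points coincide with ASCII.
ascii_matches = all(chr(i).encode("ascii")[0] == i for i in range(128))
```

The two strings render identically but compare unequal until they are normalized to the same form.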

5.2. Encoding

  • Unlike other encodings such as EUC-KR, CP-949, EUC-JP, etc., which map graphemes to bytes, Unicode encoding schemes map code points to bytes. So there is another layer of abstraction.

5.2.1. UTF-8

5.2.1.1. Encoding
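  • 0 + {7} (ASCII)
  • 110 + {5}, 10 + {6}
  • 1110 + {4}, 10 + {6}, 10 + {6}
  • …
  • 1111110 + {1}, 10 + {6}, …, 10 + {6}
  • {n} stands for n payload bits. The five- and six-byte forms belong to the original design; RFC 3629 limits UTF-8 to four bytes.
  • Moving back and forth by a character is easy this way, because every continuation byte starts with 10 and no leading byte does.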
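The bit patterns can be turned into a hand-written encoder, shown here as a sketch limited to the four-byte range of current UTF-8. utf8_encode and char_start are illustrative names; real code would use str.encode, and this sketch does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    # Pick the pattern by the size of the code point, then fill in payload bits.
    if cp < 0x80:
        return bytes([cp])                                   # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),                      # 110xxxxx
                      0x80 | (cp & 0x3F)])                   # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),                     # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),                     # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def char_start(data: bytes, index: int) -> int:
    # Step back over continuation bytes (10xxxxxx) to the start of the character.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index
```

char_start shows why moving by one character is cheap: from any byte offset, at most three steps back reach a leading byte.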

5.2.2. UTF-16

  • Variable-length encoding in which each code point takes one or two 16-bit units (two or four bytes).
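Code points above U+FFFF are split into a pair of 16-bit units, called a surrogate pair. A minimal sketch follows; the name utf16_units is an assumption made for this example.

```python
def utf16_units(cp: int) -> list:
    # Code points up to U+FFFF fit into a single 16-bit unit.
    if cp < 0x10000:
        return [cp]
    # Otherwise subtract 0x10000 and split the remaining 20 bits into two halves.
    cp -= 0x10000
    high = 0xD800 | (cp >> 10)     # high (leading) surrogate
    low = 0xDC00 | (cp & 0x3FF)    # low (trailing) surrogate
    return [high, low]


def to_bytes_be(units: list) -> bytes:
    # Serialize the units in big-endian byte order.
    return b"".join(unit.to_bytes(2, "big") for unit in units)
```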

5.2.3. UTF-32

  • Fixed length encoding of four bytes.

5.2.4. Byte Order Mark

BOM

U+FEFF zero width no-break space is used as a magic number at the start of a text stream.

It allows the editor to figure out the encoding and the byte order (endianness).
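A sketch of BOM sniffing using the constants from Python's standard codecs module. sniff_bom is a hypothetical helper; note that the UTF-32 marks have to be tested before the UTF-16 ones, because the UTF-32 little-endian mark begins with the UTF-16 little-endian mark.

```python
import codecs

# Longer marks first: BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    # Return the encoding announced by a leading U+FEFF, or None if there is none.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```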

5.3. Catalog

  • 0300-036F: Latin Combining
  • 0483-0489: Cyrillic Combining
    • U+0489 COMBINING CYRILLIC MILLIONS SIGN: ҉
  • '​' ZERO WIDTH SPACE, ' ' NO BREAK SPACE, ' ' THIN SPACE, ' ' HAIR SPACE, ' ' FIGURE SPACE, ' ' PUNCTUATION SPACE
  • '‍' ZERO WIDTH JOINER, '⁠' WORD JOINER
    • It is used right beside space to represent that the space is part of a word.

5.4. History

5.4.1. |

  • These Keys Shouldn't Exist | Nostalgia Nerd - YouTube
  • In the early days of punch cards there was no standard. Character encodings of different bit sizes also emerged. Some of them were chosen as the encodings for information exchange.
  • 12 May, 1966 The ASCII standard was established.
  • But the IBM user group (PL/1 programmers) proposed that 0x23 and 0x24 be | and ¬, so it became permissible to stylize them as such. Moreover, the original pipe symbol 0x7C was broken to distinguish it.
  • 5 July 1967 the broken bar was officially part of the ASCII
  • ISO-646.1973 and ASCII-1977 removed the stylized 0x23 and 0x24 and replaced the broken bar with a solid bar. But some vendors, especially IBM, kept the broken bar under code page 437.
  • ISO-8859-1 and ECMA-94 introduced Latin-1, which included the broken bar again in the high-bit range as 0xA6. DOS has them in code page 850. This led to disagreement about where to put the broken bar and the solid bar. There were even encodings with two different solid bars or two different broken bars.

6. Compression

6.1. AV1

  • Compression for videos
  • About 40% smaller in size, but slower to encode and decode.

6.2. Lempel-Ziv Compression

It generates a temporary codebook as it compresses a file. Starting from the beginning of the file, the token grows until it forms a new sequence that is not in the codebook. The token is then encoded using the codes stored in the codebook and written to the archive file, and the new sequence is recorded in the temporary codebook.

Decompression works similarly. It reads each token and expands it using a temporary codebook that it rebuilds on its own, so the codebook never has to be stored in the archive.
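The description fits the LZ78 variant of the family, so the sketch below assumes LZ78: each output item is a pair of (codebook index of the known prefix, next character). The function names are illustrative, and real archivers pack these pairs into bits rather than Python tuples.

```python
def lz78_compress(text: str) -> list:
    codebook = {"": 0}           # temporary codebook, built while compressing
    output = []
    token = ""
    for ch in text:
        if token + ch in codebook:
            token += ch          # keep growing while the sequence is known
        else:
            output.append((codebook[token], ch))
            codebook[token + ch] = len(codebook)   # record the new sequence
            token = ""
    if token:
        output.append((codebook[token], ""))       # flush a known trailing token
    return output


def lz78_decompress(pairs: list) -> str:
    entries = [""]               # the same codebook, rebuilt from the pairs alone
    pieces = []
    for index, ch in pairs:
        sequence = entries[index] + ch
        entries.append(sequence)
        pieces.append(sequence)
    return "".join(pieces)
```

The decompressor receives only the pairs and rebuilds the codebook entry by entry in the same order the compressor created it.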

Created: 2025-05-06 Tue 23:25