tags : System Design, Systems, Codec
The descriptions here are themselves vague, so don’t get too pedantic; this is for personal reference.
FAQ
Why is Little Endian useful?
- It lets the CPU widen a number easily: appending zero bytes to a little-endian value leaves a non-negative number unchanged.
44 33 22 11 (little endian, int32) becomes 44 33 22 11 00 00 00 00 (little endian, int64)
- You can’t do this with big endian; appending zero bytes there changes the number.
- This is mostly useful at the compiler/CPU level, not so much at the protocol level.
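A minimal sketch of the widening trick above, using Python's struct module. The value 0x11223344 is just the example number from the note; nothing here is specific to any CPU.

```python
import struct

value = 0x11223344

# Little-endian: least significant byte first.
le32 = struct.pack("<i", value)   # 44 33 22 11
le64 = struct.pack("<q", value)   # 44 33 22 11 00 00 00 00

# Big-endian: most significant byte first.
be32 = struct.pack(">i", value)   # 11 22 33 44

# Widening little-endian bytes is just appending zero bytes
# (this only holds for non-negative values).
widened_le = le32 + b"\x00" * 4

# Appending the same zero bytes to big-endian bytes shifts the value up.
widened_be = be32 + b"\x00" * 4

print(le32.hex(" "), "->", le64.hex(" "))
print(struct.unpack("<q", widened_le)[0] == value)  # same number
print(struct.unpack(">q", widened_be)[0] == value)  # different number
```

For negative numbers the padding bytes would have to be FF (sign extension) rather than 00, so the sketch sticks to a positive value.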
What’s the deal with variable-length encoding?
- Main idea: small numbers should take up less space than big numbers.
- There are different schemes for doing so. Golang’s binary package (encoding/binary) does the same thing that Protocol Buffers does.
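The "small numbers take less space" idea can be sketched as a base-128 varint, the scheme Protocol Buffers documents for unsigned integers. The function names below are made up for illustration; this is a sketch, not the code of either library.

```python
def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if n < 0:
        raise ValueError("uvarint only encodes non-negative integers")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)  # final byte, continuation bit clear
    return bytes(out)


def decode_uvarint(data: bytes) -> tuple[int, int]:
    """Return (value, number_of_bytes_consumed)."""
    value = 0
    shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


for n in (1, 300, 2**32):
    print(n, encode_uvarint(n).hex(" "))
```

Each byte carries 7 bits of the number, and the top bit says whether another byte follows, so 1 takes one byte while 2**32 takes five.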
More resources
Relationship between encoding, serialization, formats and protocols
In essence: You encode data elements, arrange them according to a format to represent a structure, and exchange messages containing this formatted data according to the rules of a protocol.
1. Encoding
- What it is: The most fundamental layer. Encoding is the process of transforming data from one representation to another. It’s about how individual pieces of data (like numbers, characters, or raw bytes) are represented.
- Purpose: Can be for efficiency (variable-length integers), compatibility (binary-to-text like Base64), or fundamental representation (how a CPU stores numbers, like Little Endian vs. Big Endian; how text characters are represented as bytes, like UTF-8).
- Examples from the text:
  - Little Endian vs. Big Endian: How the bytes of a multi-byte number are ordered in memory. Little Endian helps CPUs extend numbers easily (e.g., 11 22 33 44 becomes 11 22 33 44 00 00 00 00).
  - Variable-length encoding: Schemes where smaller numbers use fewer bytes than larger numbers (used in Protocol Buffers, Go’s binary package).
  - Binary-to-text encoding (e.g., Base64): Representing arbitrary binary data using only printable characters, often for transmission through systems designed for text.
  - Text-to-binary encoding: How text characters (like in a string) are represented as sequences of bytes (e.g., ASCII, UTF-8). This is a fundamental computer operation.
- Key Idea: Transformation of representation, often at a low level or for specific constraints.
2. Format (Data Serialization Format)
- What it is: A defined structure or set of rules for organizing and arranging data (which has been encoded) into a coherent message or file. It specifies how different data elements (numbers, strings, lists, objects/structs) are laid out sequentially.
- Purpose: To take complex data structures from application memory and turn them into a sequence of bytes (or characters) suitable for storage (files) or transmission (networks), and allow them to be reconstructed later (deserialization).
- Examples from the text:
  - Binary Formats: BSON, MessagePack, Protocol Buffers (Protobuf), Thrift, Avro, Parquet, CBOR. These use compact binary encodings for data types and structure.
  - Text Formats: JSON (mentioned implicitly via BSON comparison), XML, YAML, Markdown, .txt. These use human-readable characters to represent data and structure.
- Relationship to Encoding: A format uses specific encodings for its constituent parts (e.g., Protobuf uses variable-length encoding for integers, UTF-8 for strings, and specific byte markers for structure). The format defines the overall grammar of the serialized data.
- Key Idea: Structure and rules for laying out serialized data.
3. Protocol
- What it is: A set of rules and conventions governing the communication and interaction between systems. It defines the sequence of messages, their meaning, expected actions, error handling, and how a conversation proceeds.
- Purpose: To enable different systems (or components) to exchange information and coordinate actions reliably and predictably over a network or other communication channel.
- Examples from the text:
  - Text Protocols: HTTP, SMTP, SIP. These primarily use text-based commands and structures for control messages, even if they transfer binary payloads. They are often designed to be somewhat human-readable.
  - Binary Protocols: TCP, IP, TLS, SSH, MQTT, RTP. These use binary structures for their messages for efficiency and compactness.
  - gRPC: Described as an RPC framework that acts like a protocol. It defines how remote procedures are called, including aspects like streaming and cancellation, and uses HTTP/2 (an underlying protocol) for transport and Protocol Buffers (a format) for data.
- Relationship to Format: A protocol often uses a specific data format to structure the payload (the actual data being sent) within its messages. For instance, HTTP (protocol) often carries payloads formatted as HTML, JSON, or XML (formats). gRPC (protocol/framework) uses Protocol Buffers (format) by default.
- Distinction from Format: The protocol is about the entire interaction (the handshake, request/response sequence, headers, status codes, error handling, session management), whereas the format is just about the structure of the data payload itself. The text notes a protocol isn’t primarily about how binary blobs are encoded, but whether the interaction logic is text-oriented or binary-oriented.
- Key Idea: Rules of engagement for communication between systems.
Mapping it Out
- Foundation: Encoding is the most basic concept – how individual data items are represented.
- Structure: Formats build on encodings to define how structured data (like objects or records) is laid out sequentially. Serialization is the process of encoding data structures into a specific format.
- Interaction: Protocols define the rules for exchanging messages between systems. These messages often contain data structured according to a specific format.
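The three layers above can be sketched in a few lines. JSON and UTF-8 are real; the framing "protocol" here (one type byte plus a 4-byte length) and the names MSG_REQUEST, frame and unframe are hypothetical, invented only to show where each layer sits.

```python
import json
import struct

# Format + encoding: JSON lays out the structure, UTF-8 turns its characters into bytes.
def serialize(record: dict) -> bytes:
    return json.dumps(record, separators=(",", ":")).encode("utf-8")

# Protocol (hypothetical): every message is [1-byte type][4-byte big-endian length][payload].
MSG_REQUEST = 1

def frame(msg_type: int, payload: bytes) -> bytes:
    return struct.pack(">BI", msg_type, len(payload)) + payload

def unframe(message: bytes) -> tuple[int, bytes]:
    msg_type, length = struct.unpack(">BI", message[:5])
    return msg_type, message[5:5 + length]

payload = serialize({"user": "ada", "id": 7})   # format layer
message = frame(MSG_REQUEST, payload)           # protocol layer

msg_type, body = unframe(message)
record = json.loads(body.decode("utf-8"))
print(msg_type, record)
```

Swapping JSON for another format would only touch serialize; the framing rules stay the same, which is the separation the mapping describes.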
Overlaps
- Serialization: As the text states, “Serialization is a specific instance of encoding.” It specifically refers to encoding data structures into a format for storage/transmission. So, it sits between the general concept of encoding and the resulting format.
- RPC Frameworks (like gRPC): These often blend protocol and format concepts. gRPC is a protocol in the sense that it defines rules for remote calls, streaming, etc., but it relies on another protocol (HTTP/2) for transport and defaults to a specific format (Protocol Buffers) for the data.
- Binary vs. Text Protocols: The distinction isn’t always absolute. A “text protocol” like HTTP can carry raw binary data (like an image) in its body. The “text” part refers more to the control messages (headers, methods, status lines) being human-readable strings, whereas binary protocols use byte fields for these. Parsing text protocols often involves state machines looking for delimiters, while binary protocols often read fixed-size fields or length-prefixed data.
- File Formats: These often use data serialization formats as their building blocks (e.g., Apache Parquet uses Thrift structures). The file format standard might add extra layers like metadata, magic numbers, indexing structures, etc., beyond what the basic serialization format defines.
Encoding
Serialization is a specific instance of encoding. How serialization relates to the term “encoding” is slightly vague.
Binary to Text (Eg. Base64)
Encoding of binary data in a sequence of printable characters.
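A small sketch with Python's standard base64 module. The four input bytes are arbitrary and are not valid UTF-8 text, which is the situation where a binary-to-text encoding is needed.

```python
import base64

raw = bytes([0x00, 0xFF, 0x10, 0x80])          # arbitrary bytes, not decodable as UTF-8
text = base64.b64encode(raw).decode("ascii")   # printable ASCII characters only
restored = base64.b64decode(text)

print(text)
print(restored == raw)
```

Base64 maps every 3 input bytes to 4 output characters, so the encoded form is about a third larger than the original.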
- Binary-to-text encoding - Wikipedia
- Binary to text encoding — state of the art and missed opportunities | Lobsters
- Base32 Encoding Explained | ptrchm
- GitHub - qntm/base2048: Binary encoding optimised for Twitter
Text to Binary
This is what computers do anyway.
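As a tiny sketch of that, the same two-character string becomes different bytes depending on which text encoding is chosen. The string itself is an arbitrary example.

```python
text = "hé"

utf8 = text.encode("utf-8")      # 'h' is 1 byte, 'é' is 2 bytes in UTF-8
latin1 = text.encode("latin-1")  # same text, different encoding, 1 byte per character

print(utf8.hex(" "))
print(latin1.hex(" "))
print(utf8.decode("utf-8") == latin1.decode("latin-1"))
```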
Fun experiments
- keith-turner/ecoji
- https://goshify.tny.im/
- https://github.com/alcor/itty-bitty
- Writing an MP4 Muxer for Fun and Profit | Hacker News
Data serialization
Data serialization refers to the process of translating data structures or object state (from memory) into a format that can be stored (in memory or in a file) or transmitted, and then reconstructed at a different point.
Use-cases
- Transferring Data through the Wires
- Creating file formats
- Network Transmission
- RPC via an Interface Definition Language (IDL)
- Other uses
Formats
Tools
- File formats can be implemented using data serialization formats. Eg. Apache Parquet is implemented using the Apache Thrift framework.
Serialization and De-serialization
When you use serialization you also expect de-serialization. The code that does either one is not part of the serialization format itself; the format and the tools that read or write it are separate things.
Serialization/Writing/Generation
- Eg. Construct (Python only) does both parsing and serialization. Boost.Serialization, cereal, bitsery, msgpack also do serialization and deserialization, but they have their own binary format.
Deserialization/Reading/Parsing
- Eg. Kaitai just does parsing.
Serialization formats
These can be binary or plaintext. They make different tradeoffs around things like fixed-width integers, varints, length-prefixed strings, circular references, weak references, suppressing repeated values, efficient encoding, special-cased encodings, headers or the lack of them, having to define schemas in advance, etc.
Binary formats
- BSON, MessagePack, Protocol Buffers, Thrift, Parquet (column based), Avro (row based), …, https://cbor.io/
Text formats
- .txt, markdown, yaml, xml
Protocols
- When we design a protocol, we need to design a protocol handler.
- The binary vs. text distinction is not about how binary blobs are encoded. (They’re always binary for networking.)
- It is about whether the protocol is oriented around data structures or around text strings. The question to ask is whether it is supposed to be readable by humans or by machines. Eg. HTTP (at least HTTP/1.x) is a text protocol: even when it sends a JPEG (binary) image, it just sends the raw bytes, not a text encoding of them.
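The HTTP point can be sketched with a hand-built HTTP/1.1 response. No real server or client library is involved; the four body bytes are just the usual first bytes of a JPEG file, standing in for an image.

```python
# Text status line and headers, followed by a raw binary body.
body = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # sent as-is, no Base64 or other text encoding

head = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: image/jpeg\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("ascii")
response = head + body

# The receiver reads text up to the blank line, then takes Content-Length raw bytes.
raw_head, _, rest = response.partition(b"\r\n\r\n")
status_line, *header_lines = raw_head.decode("ascii").split("\r\n")
headers = dict(line.split(": ", 1) for line in header_lines)
received_body = rest[: int(headers["Content-Length"])]

print(status_line)
print(headers)
print(received_body == body)
```

The control part is readable with a plain text tool, while the payload stays binary, which is what makes this a text protocol rather than a text payload.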
Binary protocols
- Will generally be more space efficient than text protocols.
- Examples: RTP, TCP, IP, TLS, SSH, MQTT.
- RITP: Reliable Immutable Transfer Protocol — binarycat
Text protocols
- Examples: SMTP, HTTP, SIP.
Binary vs Text protocols
Not a strict definition, but the general idea.
- How a computer parses JSON:
  - You’re kind of like advancing one rune at a time
  - And kind of maintaining some look back, looking for a bunch of object delimiters
  - Keeping state for how deeply nested in this object you are etc.
  - A complicated, stateful process.
- How a computer parses binary data:
  - It’ll say, hey, the next field coming up is a string, and it’s 70 bytes long.
  - Then the parser just grabs the next 70 bytes
  - Interprets them as a string in memory and is done.
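The contrast above can be sketched as two small functions. Both are illustrative only: scan_json_object merely finds where a JSON object ends (it is not a full JSON parser), and the binary layout (a 4-byte little-endian length followed by UTF-8 bytes) is an assumed, made-up wire format.

```python
import struct

def scan_json_object(text: str, start: int = 0) -> int:
    """Return the index just past the object that starts at text[start].

    Walks one character at a time, tracking nesting depth and whether we
    are inside a string (where braces do not count as delimiters).
    """
    if text[start] != "{":
        raise ValueError("expected '{'")
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unterminated object")


def read_string_field(data: bytes, offset: int = 0) -> tuple[str, int]:
    """Read a u32 length prefix, then grab exactly that many bytes."""
    (length,) = struct.unpack_from("<I", data, offset)
    start = offset + 4
    return data[start:start + length].decode("utf-8"), start + length


doc = '{"a": {"b": "}"}} trailing'
print(doc[:scan_json_object(doc)])

packet = struct.pack("<I", 5) + b"hello" + b"rest"
print(read_string_field(packet))
```

The text scanner needs three pieces of state just to find the end of the object; the binary reader needs none beyond the offset.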
Some Theory
- What are the prospects of process calculi and the actor model? : compsci
- A Theory of Composing Protocols (2023) | Hacker News
- Process calculus - Wikipedia
RPC
- Data Serialization Formats
- Protocols
- RPC sort of combines both.
- Not really, depends on implementation but you get the idea.
- Some data of some format is getting transferred via some protocol
gRPC
- gRPC (2016) is an RPC framework
- PB/Protocol Buffers(2001) is a data serialization format.
- See grpc, protocol buffers and friends
What about gRPC?
- A wrapper on top of an HTTP/2 (atm) server that communicates using PB
- Offers features such as streaming, cancellation, circuit breaking, load balancing, tracing, metric collection, header propagation, authorization, IDL etc.
REST vs gRPC
REST is not an RPC framework at all; it is an architectural style for constructing web services. So in a way, it’s an apples-to-oranges comparison. But there are use cases where you might want to use gRPC over REST and vice versa. Here’s a table from the internets.
- See REpresentational State Transfer
- With an RPC system, peculiarities of serialization (like, say, JSON’s lack of 64-bit numbers) are a non-issue
- REST can be implemented without HTTP; a home-grown binary substitute can be used and you can still be RESTful.
- You can deploy a RESTful service over ordinary email exchange, for instance.
- But using HTTP has benefits, such as having HTTP’s caching infrastructure at your disposal.
- A detailed comparison of REST and gRPC | Hacker News
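The note above about JSON’s lack of 64-bit numbers can be sketched as follows. JSON text itself does not fix a number width; the problem shows up in parsers that read every number as a double (JavaScript’s does). Passing parse_int=float to Python’s json.loads is only a way to imitate such a parser here.

```python
import json
import struct

big = 2**63 - 1  # largest signed 64-bit integer

# Imitate a parser that stores all numbers as IEEE-754 doubles.
via_double = int(json.loads(json.dumps(big), parse_int=float))

# A fixed-width binary field, as a binary serialization format might use, keeps all 64 bits.
via_binary = struct.unpack("<q", struct.pack("<q", big))[0]

print(big)
print(via_double)   # rounded to the nearest representable double
print(via_binary)
```

A double has a 53-bit significand, so integers above 2**53 can silently change on the way through such a parser, while the 8-byte binary field round-trips exactly.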