tags : System Design, Systems, Codec
The descriptions here are themselves vague, so don’t get too pedantic; this is for personal reference.
FAQ
Why is little endian useful?
- It lets the CPU widen a number easily: append zero bytes and the (unsigned) value stays the same.
44 33 22 11 (little end, int32)
44 33 22 11 00 00 00 00 (little end, int64)
- You can’t do this with big endian; appending zero bytes changes the number.
- This is mostly useful at the compiler/CPU level, not so much at the protocol level.
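A minimal sketch of the widening trick above, using Python's standard `struct` module. The value `0x11223344` is just the example from the list; the variable names are made up for illustration.

```python
import struct

value = 0x11223344

# Pack the same number as a 4-byte unsigned integer in both byte orders.
le32 = struct.pack("<I", value)   # bytes 44 33 22 11
be32 = struct.pack(">I", value)   # bytes 11 22 33 44

# "Widen" to 8 bytes by appending four zero bytes.
le64 = le32 + b"\x00" * 4
be64 = be32 + b"\x00" * 4

# Little endian: the widened bytes still decode to the same number.
widened_le = struct.unpack("<Q", le64)[0]

# Big endian: the appended zeros become the low-order bytes,
# so the value is shifted left by 32 bits.
widened_be = struct.unpack(">Q", be64)[0]

print(le32.hex(), "->", le64.hex())
print(widened_le == value)        # True
print(widened_be == value << 32)  # True
```

With big endian the zeros would have to be prepended instead, which means moving the existing bytes.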
What’s the deal with variable-length encoding?
- Main idea: small numbers should take up less space than big numbers.
- There are different schemes for doing so. Go’s binary package (encoding/binary) uses the same varint scheme as Protocol Buffers.
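To make the "small numbers take less space" idea concrete, here is a rough sketch of an unsigned base-128 varint (7 payload bits per byte, high bit set when more bytes follow), which is the general shape Protocol Buffers uses. The function names `encode_uvarint` and `decode_uvarint` are hypothetical, not from any library.

```python
from typing import Tuple


def encode_uvarint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if n < 0:
        raise ValueError("uvarint encodes non-negative integers only")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)                      # last byte, continuation bit clear
    return bytes(out)


def decode_uvarint(data: bytes) -> Tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    result = 0
    shift = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if byte < 0x80:                # continuation bit clear: done
            return result, i + 1
        shift += 7
    raise ValueError("truncated varint")


for n in (1, 127, 128, 300, 2**32):
    encoded = encode_uvarint(n)
    print(n, len(encoded), encoded.hex())
```

Values below 128 fit in one byte, while a fixed-width int64 always costs eight.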
More resources
Encoding
Serialization is a specific instance of encoding. How serialization relates to the term “encoding” is slightly vague.
Binary to Text (Eg. Base64)
Encoding of binary data in a sequence of printable characters.
- Binary-to-text encoding - Wikipedia
- Binary to text encoding — state of the art and missed opportunities | Lobsters
- Base32 Encoding Explained | ptrchm
- GitHub - qntm/base2048: Binary encoding optimised for Twitter
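As a quick illustration of binary-to-text encoding, the sketch below round-trips a few arbitrary non-printable bytes through Base64 using Python's standard `base64` module. The sample bytes are made up.

```python
import base64

raw = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary bytes, not valid text

# Encode to printable ASCII characters.
text = base64.b64encode(raw).decode("ascii")
print(text)

# Decoding gives back exactly the original bytes.
restored = base64.b64decode(text)
print(restored == raw)  # True

# Base64 emits 4 characters for every 3 input bytes (with padding).
print(len(raw), "bytes ->", len(text), "characters")
```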
Text to Binary
This is what computers do anyway.
Fun experiments
- keith-turner/ecoji
- https://goshify.tny.im/
- https://github.com/alcor/itty-bitty
- Writing an MP4 Muxer for Fun and Profit | Hacker News
Data serialization
Data serialization is the process of translating data structures or object state (from memory) into a format that can be stored (in memory or in a file) or transmitted, and then reconstructed at a different point.
Use-cases
- Transferring Data through the Wires
- Creating file formats
- Network Transmission
- RPC via an Interface Definition Language (IDL)
- Other uses
Formats
How to
- Designing File Formats
- Binary formats and protocols: LTV is better than TLV | Lobsters
- Visual Programming with Elixir: Learning to Write Binary Parsers
- Zip – How not to design a file format (2021)
- carlmjohnson/lich: A port of Wolf Rentzsch’s Lich binary file format
- binary_io : Not a data serialization library but a library to help write binary formats
- bincode
- File formats can be implemented using data serialization formats. Eg. Apache Parquet is implemented using the Apache Thrift framework.
Serialization and De-serialization
When you use serialization you also expect deserialization. The code that does the writing and the reading is not part of the serialization format itself; the tooling and the format are separate things.
Serialization/Writing/Generation
- Eg. Construct (Python only) does both parsing and serialization.
- Boost.Serialization, cereal, bitsery, msgpack also do serialization and deserialization, but they have their own binary formats.
Deserialization/Reading/Parsing
- Eg. Kaitai just does parsing.
Serialization formats
These can be binary or plaintext. They make different tradeoffs: fixed-width integers vs varints, length-prefixed strings, circular references, weak references, suppressing repeated values, efficient encoding, special-cased encodings, headers or the lack of them, having to define schemas in advance, etc.
Binary formats
- BSON, MessagePack, Protocol Buffers, Thrift, Parquet(column based), Avro(row based), …, https://cbor.io/
Text formats
- .txt, markdown, yaml, xml
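To get a feel for the binary vs plaintext tradeoff, the sketch below writes the same small record once as JSON text and once in a packed binary layout. The record and the layout (uint32 id, float64 temp, one-byte name length, then the name bytes) are invented for this example and do not correspond to any of the formats listed.

```python
import json
import struct

record = {"id": 7, "temp": 21.5, "name": "probe"}

# Text: self-describing, field names travel with every record.
as_text = json.dumps(record).encode("utf-8")

# Binary (hypothetical layout): little-endian uint32, float64,
# uint8 length prefix, then the UTF-8 name. No field names on the wire,
# so reader and writer must agree on the layout in advance.
name_bytes = record["name"].encode("utf-8")
as_binary = struct.pack("<IdB", record["id"], record["temp"], len(name_bytes)) + name_bytes

print(len(as_text), "bytes as JSON")
print(len(as_binary), "bytes as packed binary")

# Reading it back requires knowing the same layout.
rec_id, temp, name_len = struct.unpack_from("<IdB", as_binary)
name = as_binary[13:13 + name_len].decode("utf-8")
print(rec_id, temp, name)
```

The binary form is smaller here, but it is unreadable without the schema, which is the tradeoff the schema-first formats accept.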
Protocols
How to
- custom binary protocol library implementation
- Visual Programming with Elixir: Learning to Write Binary Parsers
- Bare Metal Programming Series 7.1 - YouTube
- When we design a protocol, we need to design a protocol handler.
- Binary vs text is not about how binary blobs are encoded. (They’re always binary for networking.)
- It is about whether the protocol is oriented around data structures or around text strings. The question to ask is whether it is supposed to be readable by humans or by machines. Eg. HTTP is a text protocol, even though when it sends a jpeg (binary) image, it just sends the raw bytes, not a text encoding of it.
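The HTTP point can be sketched as follows, assuming HTTP/1.1 framing: the status line and headers are human-readable text, and the body is appended as raw bytes without any text encoding. The four body bytes are just the start of a JPEG file used as a stand-in for a real image.

```python
# Stand-in for image data: the first bytes of a JPEG file.
body = bytes([0xFF, 0xD8, 0xFF, 0xE0])

# The protocol part is plain text lines separated by CRLF.
head = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: image/jpeg\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("ascii")

# The binary payload is appended as-is.
message = head + body

# A receiver splits at the first blank line: text before, raw bytes after.
raw_head, _, raw_body = message.partition(b"\r\n\r\n")
print(raw_head.decode("ascii"))
print(raw_body.hex())
```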
Binary protocols
- Are usually more space efficient than text protocols.
- Examples: RTP, TCP, IP, TLS, SSH, MQTT.
Text protocols
- Examples: SMTP, HTTP, SIP.
Binary vs Text protocols
Not a strict definition, but the general idea.
- How a computer parses JSON:
- You’re kind of like advancing one rune at a time
- And kind of maintaining some look back, looking for a bunch of object delimiters
- Keeping state for how deeply nested in this object you are etc.
- complicated, stateful process.
- How a computer parses Binary data:
- It’ll say, hey, the next field coming up is a string, and it’s 70 bytes long.
- Then the parser just like grabs the next 70 bytes
- Interprets them as a string in memory and is done.
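The binary case above can be sketched roughly like this. The wire layout (a 2-byte little-endian length followed by UTF-8 bytes) and the `read_string` helper are assumptions made up for the example, not a specific real format.

```python
import struct


def read_string(buf: bytes, offset: int = 0):
    """Read one length-prefixed string; return (text, offset of the next field)."""
    # The prefix says how long the string is.
    (length,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("buffer shorter than declared length")
    # Grab exactly that many bytes and interpret them as a string.
    return buf[start:end].decode("utf-8"), end


payload = "x" * 70
message = struct.pack("<H", len(payload)) + payload.encode("utf-8") + b"next-field"

text, next_offset = read_string(message)
print(len(text), next_offset)  # 70 72
```

There is no scanning for delimiters and no nesting state: the parser jumps straight to the next field.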
Some Theory
- What are the prospects of process calculi and the actor model? : compsci
- A Theory of Composing Protocols (2023) | Hacker News
- Process calculus - Wikipedia
RPC
- Data Serialization Formats
- Protocols
- RPC sort of combines both.
- Not really, depends on implementation but you get the idea.
- Some data of some format is getting transferred via some protocol
gRPC
- gRPC (2016) is an RPC framework.
- PB/Protocol Buffers(2001) is a data serialization format.
- See grpc, protocol buffers and friends
What about GRPC?
- A wrapper on top of an HTTP/2 (atm) server that communicates using PB
- Offers features such as streaming, cancellation, circuit breaking, load balancing, tracing, metric collection, header propagation, authorization, IDL etc.
REST vs GRPC
REST is not an RPC framework at all; it is an architectural style for constructing web services. So in a way, it’s an apples to oranges comparison. But there are use cases where you might want to use gRPC over REST and vice versa. Here’s a table from the internets.
- See REpresentational State Transfer
- With an RPC system, peculiarities of serialization (like, say, JSON’s lack of 64-bit numbers) are a non-issue.
- REST can be implemented without HTTP; a home-grown binary substitute can be used and you can still be RESTful.
- You can deploy a RESTful service over ordinary email exchange for instance.
- But using HTTP has benefits, such as having HTTP’s caching infrastructure at your disposal.
- A detailed comparison of REST and gRPC | Hacker News
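The JSON 64-bit remark in the list above can be illustrated with a small sketch. Python's own `json` module keeps integers exact, so a parser that stores every JSON number as a double (which is what JavaScript's `JSON.parse` does) is simulated here by passing `parse_int=float`. The `id` field is made up.

```python
import json

big = 2**53 + 1  # fits easily in a signed 64-bit integer

wire = json.dumps({"id": big})

# Python's json module parses integers exactly by default.
exact = json.loads(wire)["id"]

# Simulate a double-only parser by converting every integer to float.
lossy = json.loads(wire, parse_int=float)["id"]

print(exact == big)       # True
print(int(lossy) == big)  # False: the double cannot represent 2**53 + 1
print(int(lossy))
```

A schema-typed format with a real int64 field does not have this ambiguity, which is the point being made about RPC systems.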