diff options
Diffstat (limited to 'docs/specs/CompactBinary.md')
| -rw-r--r-- | docs/specs/CompactBinary.md | 663 |
1 files changed, 663 insertions, 0 deletions
diff --git a/docs/specs/CompactBinary.md b/docs/specs/CompactBinary.md new file mode 100644 index 000000000..d8cccbd1e --- /dev/null +++ b/docs/specs/CompactBinary.md @@ -0,0 +1,663 @@ +# Compact Binary Format Specification + +**Version:** 1.0 + +## Overview + +Compact Binary (CB) is a binary serialization format designed for efficient storage and +transmission of structured data. It is self-describing, supports nested objects and arrays, +and optimizes for minimal size through variable-length encoding and uniform container +optimizations. + +Key design goals: + +- **Compact representation** — variable-length integers, type elision in uniform containers +- **Self-describing** — every field carries its own type; no external schema required +- **Safe to traverse** — a reader can skip any field without understanding its type +- **Deterministic** — a canonical encoding exists so that byte-identical payloads compare equal +- **Hashable** — every field and object can be content-addressed with a stable hash + +## 1. Notation + +| Symbol | Meaning | +|---------------|---------| +| `byte` | An unsigned 8-bit integer (octet). | +| `VarUInt(v)` | A variable-length unsigned integer encoding of value `v` (see §2). | +| `BE32(v)` | A 32-bit value stored in big-endian (network) byte order. | +| `BE64(v)` | A 64-bit value stored in big-endian (network) byte order. | +| `+` | Concatenation of byte sequences. | + +All multi-byte numeric values in this specification are stored in **big-endian** byte order +unless stated otherwise. + +--- + +## 2. Variable-Length Unsigned Integer (VarUInt) + +VarUInt encodes a 64-bit unsigned integer in 1–9 bytes. The number of leading 1-bits in the +first byte indicates how many *additional* bytes follow. The remaining bits of the first byte, +concatenated with the additional bytes in big-endian order, form the integer value. + +### 2.1 Encoding table + +| Leading 1-bits | Total bytes | First byte pattern | Value range | +|:-:|:-:|---|---| +| 0 | 1 | `0b0_______` | `0x00` – `0x7F` | +| 1 | 2 | `0b10______` | `0x80` – `0x3FFF` | +| 2 | 3 | `0b110_____` | `0x4000` – `0x1F_FFFF` | +| 3 | 4 | `0b1110____` | `0x20_0000` – `0x0FFF_FFFF` | +| 4 | 5 | `0b11110___` | `0x1000_0000` – `0x07_FFFF_FFFF` | +| 5 | 6 | `0b111110__` | `0x08_0000_0000` – `0x03FF_FFFF_FFFF` | +| 6 | 7 | `0b1111110_` | `0x0400_0000_0000` – `0x01_FFFF_FFFF_FFFF` | +| 7 | 8 | `0b11111110` | `0x02_0000_0000_0000` – `0xFF_FFFF_FFFF_FFFF` | +| 8 | 9 | `0b11111111` | `0x0100_0000_0000_0000` – `0xFFFF_FFFF_FFFF_FFFF` | + +### 2.2 Measuring the byte count from encoded data + +Count the number of leading 1-bits in the first byte (equivalently, count leading zeros of the +bitwise complement). The byte count is that number plus one: + +``` +ByteCount = CountLeadingOnes(FirstByte) + 1 +``` + +### 2.3 Reading + +1. Determine `ByteCount` from the first byte. +2. Mask the first byte: `Value = FirstByte & (0xFF >> ByteCount)`. +3. For each subsequent byte (in order): `Value = (Value << 8) | NextByte`. + +### 2.4 Writing + +1. Determine `ByteCount` from the value magnitude (see table above). +2. Store `ByteCount - 1` trailing bytes from the value in big-endian order. +3. Set the first byte to the remaining most-significant bits of the value, OR'd with a prefix + mask of `0xFF << (9 - ByteCount)`. + +### 2.5 Canonical form + +A VarUInt is canonical when it uses the minimum number of bytes required for its value. Format +validation (§9) rejects non-canonical VarUInt encodings. + +### 2.6 Byte-order preservation + +Encoded VarUInt values sort identically in a byte-wise (lexicographic) comparison as when their +decoded unsigned values are compared numerically. This property does **not** hold for signed +integers encoded via ZigZag. + +### 2.7 Encoding examples + +| Value | Encoded bytes | +|-------|---------------| +| `0x01` | `01` | +| `0x7F` | `7F` | +| `0x80` | `80 80` | +| `0x123` | `81 23` | +| `0x1234` | `92 34` | +| `0x12345` | `C1 23 45` | +| `0x123456` | `D2 34 56` | +| `0x1234567` | `E1 23 45 67` | +| `0x12345678` | `F0 12 34 56 78` | +| `0x123456789ABCDEF0` | `FF 12 34 56 78 9A BC DE F0` | + +--- + +## 3. Field Type + +Every field has a type, stored as a single byte. The low 6 bits identify the type; the upper +2 bits are flags. + +### 3.1 Type byte layout + +``` + Bit 7 Bit 6 Bits 5..0 +┌───────────┬───────────┬──────────────────┐ +│ HasFieldName │ HasFieldType │ Type ID (0x00–0x3F) │ +└───────────┴───────────┴──────────────────┘ +``` + +### 3.2 Flags + +| Flag | Value | Meaning | +|------|-------|---------| +| `HasFieldType` | `0x40` | **Transient.** The type byte is stored inline before the field payload. Set on fields in non-uniform containers. This flag is **not** persisted when hashing or serializing the type for comparison purposes. | +| `HasFieldName` | `0x80` | **Persisted.** The field has a name (used inside objects). | + +### 3.3 Type identifiers + +> **Stability notice:** Type values are fixed for backward compatibility and must never change. + +| ID | Name | Payload | +|----|------|---------| +| `0x00` | None | *(invalid — must not appear in valid data)* | +| `0x01` | Null | Empty (0 bytes) | +| `0x02` | Object | `VarUInt(Size)` + fields (see §5) | +| `0x03` | UniformObject | `VarUInt(Size)` + `FieldType(1)` + fields (see §5) | +| `0x04` | Array | `VarUInt(Size)` + `VarUInt(Count)` + fields (see §6) | +| `0x05` | UniformArray | `VarUInt(Size)` + `VarUInt(Count)` + `FieldType(1)` + fields (see §6) | +| `0x06` | Binary | `VarUInt(Size)` + raw bytes | +| `0x07` | String | `VarUInt(Size)` + UTF-8 bytes (no null terminator) | +| `0x08` | IntegerPositive | `VarUInt(Value)` — non-negative integer (0 to 2^64−1) | +| `0x09` | IntegerNegative | `VarUInt(~Value)` — negative integer (−1 to −2^63) | +| `0x0A` | Float32 | `BE32(IEEE 754 binary32)` — 4 bytes | +| `0x0B` | Float64 | `BE64(IEEE 754 binary64)` — 8 bytes | +| `0x0C` | BoolFalse | Empty (0 bytes) | +| `0x0D` | BoolTrue | Empty (0 bytes) | +| `0x0E` | ObjectAttachment | 20 raw bytes — hash of a compact binary attachment | +| `0x0F` | BinaryAttachment | 20 raw bytes — hash of a binary attachment | +| `0x10` | Hash | 20 raw bytes — hash digest | +| `0x11` | Uuid | 16 bytes — UUID (see §4.10) | +| `0x12` | DateTime | `BE64(int64 ticks)` — 8 bytes (see §4.11) | +| `0x13` | TimeSpan | `BE64(int64 ticks)` — 8 bytes (see §4.12) | +| `0x14` | ObjectId | 12 raw bytes — opaque identifier | +| `0x1E` | CustomById | `VarUInt(Size)` + `VarUInt(TypeId)` + payload (see §4.14) | +| `0x1F` | CustomByName | `VarUInt(Size)` + `VarUInt(NameLen)` + name + payload (see §4.14) | +| `0x20` | *(Reserved)* | Reserved for future flags. Do not define types in this range. | + +### 3.4 Type family classification + +Several types form families that can be recognized by bitmask tests on the type ID (low 6 bits): + +| Family | Mask | Base | Members | +|--------|------|------|---------| +| Object | `0x3E` | `0x02` | Object, UniformObject | +| Array | `0x3E` | `0x04` | Array, UniformArray | +| Integer | `0x3E` | `0x08` | IntegerPositive, IntegerNegative | +| Float | `0x3C` | `0x08` | Float32, Float64, IntegerPositive, IntegerNegative | +| Bool | `0x3E` | `0x0C` | BoolFalse, BoolTrue | +| Attachment | `0x3E` | `0x0E` | ObjectAttachment, BinaryAttachment | + +A type belongs to a family when `(TypeID & Mask) == Base`. + +Note that the Float family intentionally includes integer types because integers can be +implicitly converted to floating-point when reading. + +--- + +## 4. Field Types in Detail + +### 4.1 Null (`0x01`) + +Represents an absent or null value. Payload is empty. + +### 4.2 Binary (`0x06`) + +An arbitrary byte sequence. + +``` +VarUInt(ByteCount) + Bytes[ByteCount] +``` + +### 4.3 String (`0x07`) + +A UTF-8 encoded text string, **not** null-terminated. `Size` is the byte length, not the +character count. + +``` +VarUInt(ByteCount) + UTF8Bytes[ByteCount] +``` + +Canonical form requires valid UTF-8 (validated in Format mode, §9). + +### 4.4 IntegerPositive (`0x08`) + +A non-negative integer in the range [0, 2^64−1]. + +``` +VarUInt(Value) +``` + +### 4.5 IntegerNegative (`0x09`) + +A negative integer in the range [−2^63, −1]. The payload is the ones' complement of the +value encoded as a VarUInt: + +``` +VarUInt(~Value) +``` + +Where `~` is bitwise NOT. For example, −1 is encoded as `VarUInt(0)`, −42 is encoded as +`VarUInt(41)`. + +To decode: read the VarUInt magnitude `M`, then `Value = ~M` (equivalently, `Value = M ^ -1`, +or `Value = -(M + 1)`). + +> **Important:** This is ones' complement encoding, **not** ZigZag encoding. The VarInt +> functions in the codebase (which use ZigZag) are a separate encoding used elsewhere; Compact +> Binary integer fields use the type-tag approach with ones' complement for negatives. + +### 4.6 Float32 (`0x0A`) + +A 32-bit IEEE 754 binary32 floating-point value in big-endian byte order. + +``` +BE32(float32_bits) — 4 bytes +``` + +### 4.7 Float64 (`0x0B`) + +A 64-bit IEEE 754 binary64 floating-point value in big-endian byte order. + +``` +BE64(float64_bits) — 8 bytes +``` + +**Canonical form:** A Float64 value that can be represented exactly as Float32 (i.e., where +`(double)(float)value == value`) should be encoded as Float32 instead. Format validation +(§9) flags this as `InvalidFloat`. + +### 4.8 Bool (`0x0C` / `0x0D`) + +Boolean values are encoded purely by their type — there is no payload. + +- `BoolFalse` (`0x0C`): payload is empty. +- `BoolTrue` (`0x0D`): payload is empty. + +### 4.9 Hash (`0x10`), ObjectAttachment (`0x0E`), BinaryAttachment (`0x0F`) + +All three are 20 raw bytes representing a hash digest. There is no length prefix — the size is +fixed. + +``` +Bytes[20] +``` + +- **Hash** — a general-purpose hash digest. +- **ObjectAttachment** — a hash referencing an external Compact Binary object. +- **BinaryAttachment** — a hash referencing external raw binary data. + +The hash algorithm is determined by the application context (the format itself does not +prescribe a specific hash algorithm, though the reference implementation uses BLAKE3 truncated +to 160 bits). + +### 4.10 Uuid (`0x11`) + +A 128-bit UUID/GUID, stored as four 32-bit unsigned integers in big-endian byte order. + +``` +BE32(A) + BE32(B) + BE32(C) + BE32(D) — 16 bytes total +``` + +The four components (A, B, C, D) correspond to the four 32-bit segments of the UUID when +read as a sequence of big-endian 32-bit words. For an RFC 4122 UUID string +`"aabbccdd-eeff-0011-2233-445566778899"`: + +- `A = 0xAABBCCDD` +- `B = 0xEEFF0011` +- `C = 0x22334455` +- `D = 0x66778899` + +### 4.11 DateTime (`0x12`) + +A date and time value encoded as a big-endian signed 64-bit integer counting 100-nanosecond +ticks since the epoch **0001-01-01 00:00:00.0000000**. + +``` +BE64(int64 Ticks) — 8 bytes +``` + +Valid range: 0001-01-01 00:00:00.0000000 through 9999-12-31 23:59:59.9999999. + +Reference tick constants: + +| Unit | Ticks | +|------|-------| +| Microsecond | 10 | +| Millisecond | 10,000 | +| Second | 10,000,000 | +| Minute | 600,000,000 | +| Hour | 36,000,000,000 | +| Day | 864,000,000,000 | + +### 4.12 TimeSpan (`0x13`) + +A duration encoded as a big-endian signed 64-bit integer counting 100-nanosecond ticks. May be +negative. + +``` +BE64(int64 Ticks) — 8 bytes +``` + +Uses the same tick unit as DateTime (§4.11). + +### 4.13 ObjectId (`0x14`) + +A 12-byte opaque identifier. There is no length prefix — the size is fixed. + +``` +Bytes[12] +``` + +### 4.14 Custom types + +Custom types allow extending the format with application-specific types. + +**CustomById (`0x1E`):** + +``` +VarUInt(TotalSize) + VarUInt(TypeId) + Payload[TotalSize - sizeof(VarUInt(TypeId))] +``` + +`TotalSize` is the combined byte count of the encoded TypeId VarUInt and the Payload. + +**CustomByName (`0x1F`):** + +``` +VarUInt(TotalSize) + VarUInt(NameByteCount) + Name[NameByteCount] + Payload[remainder] +``` + +`TotalSize` is the combined byte count of the encoded name-length VarUInt, the name bytes, and +the payload. The name is UTF-8 encoded, not null-terminated. + +--- + +## 5. Objects + +An object is an unordered collection of uniquely named fields. There are two encoding forms: + +### 5.1 Non-uniform Object (`0x02`) + +Used when fields have different types (or when the object is empty). + +``` +VarUInt(PayloadSize) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` is the total byte count of all encoded fields (not including the `PayloadSize` +VarUInt itself or the container's own type byte). + +Each field is encoded as: + +``` +TypeByte + VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload +``` + +The `TypeByte` includes both `HasFieldType` (`0x40`) and `HasFieldName` (`0x80`) flags OR'd +with the type ID — i.e., the stored type byte is `TypeID | 0xC0`. + +### 5.2 Uniform Object (`0x03`) + +Used when every field has the same type. The shared type is stored once in the header and +omitted from individual fields. + +``` +VarUInt(PayloadSize) + FieldType(1 byte) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` includes the 1-byte field type and all field bytes. + +Each field is encoded as: + +``` +VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload +``` + +The individual fields do **not** include a type byte. They do retain the `HasFieldName` flag +behavior (names are present), but the type is provided by the container header. + +### 5.3 Empty Object + +An empty non-uniform object is 2 bytes: type byte + `VarUInt(0)`. + +``` +0x02 0x00 +``` + +(A uniform object cannot be empty because there is no type to store.) + +### 5.4 Object field constraints + +- Field names must be non-empty. +- Field names must be unique within the object (case-sensitive comparison). +- Field names are UTF-8 encoded, not null-terminated. +- Field ordering is not prescribed by the format but is significant for equality comparison — + two objects with the same fields in a different order are byte-wise different. + +--- + +## 6. Arrays + +An array is an ordered collection of unnamed fields. There are two encoding forms: + +### 6.1 Non-uniform Array (`0x04`) + +Used when items have different types. + +``` +VarUInt(PayloadSize) + VarUInt(ItemCount) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` is the total byte count of `VarUInt(ItemCount)` plus all encoded fields. + +Each field is encoded as: + +``` +TypeByte + FieldPayload +``` + +The `TypeByte` includes the `HasFieldType` flag (`0x40`) OR'd with the type ID. Fields in +arrays do **not** have names. + +### 6.2 Uniform Array (`0x05`) + +Used when every item has the same type **and** every item has a non-zero-byte encoding. + +``` +VarUInt(PayloadSize) + VarUInt(ItemCount) + FieldType(1 byte) + Field₁ + … + Fieldₙ +``` + +`PayloadSize` includes `VarUInt(ItemCount)`, the 1-byte field type, and all field bytes. + +Each field is encoded as just its payload — no type byte, no name. + +### 6.3 Empty Array + +An empty non-uniform array: + +``` +0x04 0x01 0x00 +``` + +That is: type `0x04` + `VarUInt(1)` (payload size = 1 byte for the count) + `VarUInt(0)` +(item count = 0). + +### 6.4 Uniform array constraints + +A uniform array **must not** be used when items have zero-byte payloads (e.g., all Null or all +Bool fields). Because such items encode as zero bytes each, they would be indistinguishable, +and the container would have no way to address individual items. Use a non-uniform array in +these cases. + +--- + +## 7. Top-Level Fields + +A Compact Binary payload at the top level is typically a single field. This field may or may +not include its type byte, depending on context: + +- **With type:** The field starts with its type byte (with `HasFieldType` flag set). This is + the self-describing form used when the consumer does not know the type in advance. +- **Without type:** The type is communicated out of band (e.g., the consumer knows to expect an + Object). The field begins directly with its payload. + +A top-level object field (the most common case) is encoded as: + +``` +TypeByte(0x02) + ObjectPayload +``` + +or without the type byte, just: + +``` +ObjectPayload +``` + +--- + +## 8. Packages + +A package bundles a Compact Binary object with its external attachments (referenced via +ObjectAttachment and BinaryAttachment fields). It is serialized as a sequence of unnamed +top-level fields: + +### 8.1 Package structure + +``` +[Attachment₁] [Attachment₂] … [Object] [ObjectHash] [Attachment₃] … [Null] +``` + +- **Object** — An `Object` field containing the root compact binary data. +- **ObjectHash** — An `ObjectAttachment` field (`0x0E`) containing the 20-byte hash of the + serialized root object. Omitted when the object is empty. +- **Attachments** — Each attachment is a pair of fields: + 1. A `Binary` field containing the attachment data. + 2. A `BinaryAttachment` or `ObjectAttachment` field containing the hash of that data. + The hash field is omitted when the binary data is empty. +- **Null** — A `Null` field (`0x01`) terminates the package. + +### 8.2 Ordering + +The canonical order is: + +1. Root object + its hash +2. Attachments ordered by hash +3. Null terminator + +However, it is valid for components to appear in any order as long as: +- There is at most one root object. +- The Null terminator is last. + +### 8.3 Package constraints + +- At most one root object per package. +- No duplicate attachments (by hash). +- No null/empty attachments. +- Attachment hashes must match their data. + +--- + +## 9. Validation + +Implementations should support the following validation modes: + +| Mode | Checks | +|------|--------| +| **Default** | All fields are within bounds and have recognized types. Minimum required for safe reading. | +| **Names** | Object fields have unique, non-empty names. Array fields have no names. | +| **Format** | Canonical encoding: minimal VarUInt sizes, Float64→Float32 demotion, uniform containers used when possible, valid UTF-8 in names and strings. | +| **Padding** | No trailing bytes after the top-level field. | +| **Package** | Package/attachment structure is well-formed. | +| **PackageHash** | Stored hashes match computed hashes. | + +--- + +## 10. Hashing + +The canonical hash of a field is computed over: + +1. The **serialized type byte** (type ID | `HasFieldName` flag; the `HasFieldType` flag is + stripped). +2. The **name** (if present): `VarUInt(NameByteCount) + NameBytes`. +3. The **payload**. + +This allows deterministic content-addressing of any field, object, or array. + +--- + +## 11. Complete Encoding Examples + +### 11.1 Simple object + +An object with fields `"name": "Alice"` (String) and `"age": 30` (IntegerPositive): + +``` +02 -- Object type + 17 -- VarUInt PayloadSize = 23 bytes + C7 -- String | HasFieldType | HasFieldName + 04 -- VarUInt NameLen = 4 + 6E 61 6D 65 -- "name" + 05 -- VarUInt StringLen = 5 + 41 6C 69 63 65 -- "Alice" + C8 -- IntegerPositive | HasFieldType | HasFieldName + 03 -- VarUInt NameLen = 3 + 61 67 65 -- "age" + 1E -- VarUInt Value = 30 +``` + +Total: 25 bytes. + +### 11.2 Uniform array of integers + +An array of three positive integers `[1, 2, 3]`: + +``` +05 -- UniformArray type + 06 -- VarUInt PayloadSize = 6 + 03 -- VarUInt ItemCount = 3 + 08 -- FieldType = IntegerPositive + 01 -- VarUInt 1 + 02 -- VarUInt 2 + 03 -- VarUInt 3 +``` + +Total: 8 bytes. + +### 11.3 Negative integer + +The value −42 encoded as a standalone field: + +``` +09 -- IntegerNegative type +29 -- VarUInt(~(-42)) = VarUInt(41) = 0x29 +``` + +### 11.4 Nested object + +An object containing a nested object: + +``` +02 -- Outer Object type + 0E -- VarUInt PayloadSize = 14 + C2 -- Object | HasFieldType | HasFieldName + 05 -- VarUInt NameLen = 5 + 69 6E 6E 65 72 -- "inner" + 05 -- VarUInt inner PayloadSize = 5 + C8 -- IntegerPositive | HasFieldType | HasFieldName + 01 -- VarUInt NameLen = 1 + 78 -- "x" + 0A -- VarUInt Value = 10 +``` + +--- + +## 12. Summary of Fixed-Size Payloads + +| Type | Payload size | +|------|-------------| +| Null | 0 | +| BoolFalse | 0 | +| BoolTrue | 0 | +| Float32 | 4 | +| Float64 | 8 | +| Hash | 20 | +| ObjectAttachment | 20 | +| BinaryAttachment | 20 | +| Uuid | 16 | +| DateTime | 8 | +| TimeSpan | 8 | +| ObjectId | 12 | + +## 13. Summary of Variable-Size Payloads + +| Type | Payload structure | +|------|-------------------| +| Binary | `VarUInt(Size) + Bytes` | +| String | `VarUInt(Size) + UTF8Bytes` | +| IntegerPositive | `VarUInt(Value)` | +| IntegerNegative | `VarUInt(~Value)` | +| Object | `VarUInt(Size) + Fields` | +| UniformObject | `VarUInt(Size) + Type + Fields` | +| Array | `VarUInt(Size) + VarUInt(Count) + Fields` | +| UniformArray | `VarUInt(Size) + VarUInt(Count) + Type + Fields` | +| CustomById | `VarUInt(Size) + VarUInt(TypeId) + Data` | +| CustomByName | `VarUInt(Size) + VarUInt(NameLen) + Name + Data` | |