# Compact Binary Format Specification **Version:** 1.0 ## Overview Compact Binary (CB) is a binary serialization format designed for efficient storage and transmission of structured data. It is self-describing, supports nested objects and arrays, and optimizes for minimal size through variable-length encoding and uniform container optimizations. Key design goals: - **Compact representation** — variable-length integers, type elision in uniform containers - **Self-describing** — every field carries its own type; no external schema required - **Safe to traverse** — a reader can skip any field without understanding its type - **Deterministic** — a canonical encoding exists so that byte-identical payloads compare equal - **Hashable** — every field and object can be content-addressed with a stable hash ## 1. Notation | Symbol | Meaning | |---------------|---------| | `byte` | An unsigned 8-bit integer (octet). | | `VarUInt(v)` | A variable-length unsigned integer encoding of value `v` (see §2). | | `BE32(v)` | A 32-bit value stored in big-endian (network) byte order. | | `BE64(v)` | A 64-bit value stored in big-endian (network) byte order. | | `+` | Concatenation of byte sequences. | All multi-byte numeric values in this specification are stored in **big-endian** byte order unless stated otherwise. --- ## 2. Variable-Length Unsigned Integer (VarUInt) VarUInt encodes a 64-bit unsigned integer in 1–9 bytes. The number of leading 1-bits in the first byte indicates how many *additional* bytes follow. The remaining bits of the first byte, concatenated with the additional bytes in big-endian order, form the integer value. ### 2.1 Encoding table | Leading 1-bits | Total bytes | First byte pattern | Value range | |:-:|:-:|---|---| | 0 | 1 | `0b0_______` | `0x00` – `0x7F` | | 1 | 2 | `0b10______` | `0x80` – `0x3FFF` | | 2 | 3 | `0b110_____` | `0x4000` – `0x1F_FFFF` | | 3 | 4 | `0b1110____` | `0x20_0000` – `0x0FFF_FFFF` | | 4 | 5 | `0b11110___` | `0x1000_0000` – `0x07_FFFF_FFFF` | | 5 | 6 | `0b111110__` | `0x08_0000_0000` – `0x03FF_FFFF_FFFF` | | 6 | 7 | `0b1111110_` | `0x0400_0000_0000` – `0x01_FFFF_FFFF_FFFF` | | 7 | 8 | `0b11111110` | `0x02_0000_0000_0000` – `0xFF_FFFF_FFFF_FFFF` | | 8 | 9 | `0b11111111` | `0x0100_0000_0000_0000` – `0xFFFF_FFFF_FFFF_FFFF` | ### 2.2 Measuring the byte count from encoded data Count the number of leading 1-bits in the first byte (equivalently, count leading zeros of the bitwise complement). The byte count is that number plus one: ``` ByteCount = CountLeadingOnes(FirstByte) + 1 ``` ### 2.3 Reading 1. Determine `ByteCount` from the first byte. 2. Mask the first byte: `Value = FirstByte & (0xFF >> ByteCount)`. 3. For each subsequent byte (in order): `Value = (Value << 8) | NextByte`. ### 2.4 Writing 1. Determine `ByteCount` from the value magnitude (see table above). 2. Store `ByteCount - 1` trailing bytes from the value in big-endian order. 3. Set the first byte to the remaining most-significant bits of the value, OR'd with a prefix mask of `0xFF << (9 - ByteCount)`. ### 2.5 Canonical form A VarUInt is canonical when it uses the minimum number of bytes required for its value. Format validation (§9) rejects non-canonical VarUInt encodings. ### 2.6 Byte-order preservation Encoded VarUInt values sort identically in a byte-wise (lexicographic) comparison as when their decoded unsigned values are compared numerically. This property does **not** hold for signed integers encoded via ZigZag. ### 2.7 Encoding examples | Value | Encoded bytes | |-------|---------------| | `0x01` | `01` | | `0x7F` | `7F` | | `0x80` | `80 80` | | `0x123` | `81 23` | | `0x1234` | `92 34` | | `0x12345` | `C1 23 45` | | `0x123456` | `D2 34 56` | | `0x1234567` | `E1 23 45 67` | | `0x12345678` | `F0 12 34 56 78` | | `0x123456789ABCDEF0` | `FF 12 34 56 78 9A BC DE F0` | --- ## 3. Field Type Every field has a type, stored as a single byte. The low 6 bits identify the type; the upper 2 bits are flags. ### 3.1 Type byte layout ``` Bit 7 Bit 6 Bits 5..0 ┌───────────┬───────────┬──────────────────┐ │ HasFieldName │ HasFieldType │ Type ID (0x00–0x3F) │ └───────────┴───────────┴──────────────────┘ ``` ### 3.2 Flags | Flag | Value | Meaning | |------|-------|---------| | `HasFieldType` | `0x40` | **Transient.** The type byte is stored inline before the field payload. Set on fields in non-uniform containers. This flag is **not** persisted when hashing or serializing the type for comparison purposes. | | `HasFieldName` | `0x80` | **Persisted.** The field has a name (used inside objects). | ### 3.3 Type identifiers > **Stability notice:** Type values are fixed for backward compatibility and must never change. | ID | Name | Payload | |----|------|---------| | `0x00` | None | *(invalid — must not appear in valid data)* | | `0x01` | Null | Empty (0 bytes) | | `0x02` | Object | `VarUInt(Size)` + fields (see §5) | | `0x03` | UniformObject | `VarUInt(Size)` + `FieldType(1)` + fields (see §5) | | `0x04` | Array | `VarUInt(Size)` + `VarUInt(Count)` + fields (see §6) | | `0x05` | UniformArray | `VarUInt(Size)` + `VarUInt(Count)` + `FieldType(1)` + fields (see §6) | | `0x06` | Binary | `VarUInt(Size)` + raw bytes | | `0x07` | String | `VarUInt(Size)` + UTF-8 bytes (no null terminator) | | `0x08` | IntegerPositive | `VarUInt(Value)` — non-negative integer (0 to 2^64−1) | | `0x09` | IntegerNegative | `VarUInt(~Value)` — negative integer (−1 to −2^63) | | `0x0A` | Float32 | `BE32(IEEE 754 binary32)` — 4 bytes | | `0x0B` | Float64 | `BE64(IEEE 754 binary64)` — 8 bytes | | `0x0C` | BoolFalse | Empty (0 bytes) | | `0x0D` | BoolTrue | Empty (0 bytes) | | `0x0E` | ObjectAttachment | 20 raw bytes — hash of a compact binary attachment | | `0x0F` | BinaryAttachment | 20 raw bytes — hash of a binary attachment | | `0x10` | Hash | 20 raw bytes — hash digest | | `0x11` | Uuid | 16 bytes — UUID (see §4.10) | | `0x12` | DateTime | `BE64(int64 ticks)` — 8 bytes (see §4.11) | | `0x13` | TimeSpan | `BE64(int64 ticks)` — 8 bytes (see §4.12) | | `0x14` | ObjectId | 12 raw bytes — opaque identifier | | `0x1E` | CustomById | `VarUInt(Size)` + `VarUInt(TypeId)` + payload (see §4.14) | | `0x1F` | CustomByName | `VarUInt(Size)` + `VarUInt(NameLen)` + name + payload (see §4.14) | | `0x20` | *(Reserved)* | Reserved for future flags. Do not define types in this range. | ### 3.4 Type family classification Several types form families that can be recognized by bitmask tests on the type ID (low 6 bits): | Family | Mask | Base | Members | |--------|------|------|---------| | Object | `0x3E` | `0x02` | Object, UniformObject | | Array | `0x3E` | `0x04` | Array, UniformArray | | Integer | `0x3E` | `0x08` | IntegerPositive, IntegerNegative | | Float | `0x3C` | `0x08` | Float32, Float64, IntegerPositive, IntegerNegative | | Bool | `0x3E` | `0x0C` | BoolFalse, BoolTrue | | Attachment | `0x3E` | `0x0E` | ObjectAttachment, BinaryAttachment | A type belongs to a family when `(TypeID & Mask) == Base`. Note that the Float family intentionally includes integer types because integers can be implicitly converted to floating-point when reading. --- ## 4. Field Types in Detail ### 4.1 Null (`0x01`) Represents an absent or null value. Payload is empty. ### 4.2 Binary (`0x06`) An arbitrary byte sequence. ``` VarUInt(ByteCount) + Bytes[ByteCount] ``` ### 4.3 String (`0x07`) A UTF-8 encoded text string, **not** null-terminated. `Size` is the byte length, not the character count. ``` VarUInt(ByteCount) + UTF8Bytes[ByteCount] ``` Canonical form requires valid UTF-8 (validated in Format mode, §9). ### 4.4 IntegerPositive (`0x08`) A non-negative integer in the range [0, 2^64−1]. ``` VarUInt(Value) ``` ### 4.5 IntegerNegative (`0x09`) A negative integer in the range [−2^63, −1]. The payload is the ones' complement of the value encoded as a VarUInt: ``` VarUInt(~Value) ``` Where `~` is bitwise NOT. For example, −1 is encoded as `VarUInt(0)`, −42 is encoded as `VarUInt(41)`. To decode: read the VarUInt magnitude `M`, then `Value = ~M` (equivalently, `Value = M ^ -1`, or `Value = -(M + 1)`). > **Important:** This is ones' complement encoding, **not** ZigZag encoding. The VarInt > functions in the codebase (which use ZigZag) are a separate encoding used elsewhere; Compact > Binary integer fields use the type-tag approach with ones' complement for negatives. ### 4.6 Float32 (`0x0A`) A 32-bit IEEE 754 binary32 floating-point value in big-endian byte order. ``` BE32(float32_bits) — 4 bytes ``` ### 4.7 Float64 (`0x0B`) A 64-bit IEEE 754 binary64 floating-point value in big-endian byte order. ``` BE64(float64_bits) — 8 bytes ``` **Canonical form:** A Float64 value that can be represented exactly as Float32 (i.e., where `(double)(float)value == value`) should be encoded as Float32 instead. Format validation (§9) flags this as `InvalidFloat`. ### 4.8 Bool (`0x0C` / `0x0D`) Boolean values are encoded purely by their type — there is no payload. - `BoolFalse` (`0x0C`): payload is empty. - `BoolTrue` (`0x0D`): payload is empty. ### 4.9 Hash (`0x10`), ObjectAttachment (`0x0E`), BinaryAttachment (`0x0F`) All three are 20 raw bytes representing a hash digest. There is no length prefix — the size is fixed. ``` Bytes[20] ``` - **Hash** — a general-purpose hash digest. - **ObjectAttachment** — a hash referencing an external Compact Binary object. - **BinaryAttachment** — a hash referencing external raw binary data. The hash algorithm is determined by the application context (the format itself does not prescribe a specific hash algorithm, though the reference implementation uses BLAKE3 truncated to 160 bits). ### 4.10 Uuid (`0x11`) A 128-bit UUID/GUID, stored as four 32-bit unsigned integers in big-endian byte order. ``` BE32(A) + BE32(B) + BE32(C) + BE32(D) — 16 bytes total ``` The four components (A, B, C, D) correspond to the four 32-bit segments of the UUID when read as a sequence of big-endian 32-bit words. For an RFC 4122 UUID string `"aabbccdd-eeff-0011-2233-445566778899"`: - `A = 0xAABBCCDD` - `B = 0xEEFF0011` - `C = 0x22334455` - `D = 0x66778899` ### 4.11 DateTime (`0x12`) A date and time value encoded as a big-endian signed 64-bit integer counting 100-nanosecond ticks since the epoch **0001-01-01 00:00:00.0000000**. ``` BE64(int64 Ticks) — 8 bytes ``` Valid range: 0001-01-01 00:00:00.0000000 through 9999-12-31 23:59:59.9999999. Reference tick constants: | Unit | Ticks | |------|-------| | Microsecond | 10 | | Millisecond | 10,000 | | Second | 10,000,000 | | Minute | 600,000,000 | | Hour | 36,000,000,000 | | Day | 864,000,000,000 | ### 4.12 TimeSpan (`0x13`) A duration encoded as a big-endian signed 64-bit integer counting 100-nanosecond ticks. May be negative. ``` BE64(int64 Ticks) — 8 bytes ``` Uses the same tick unit as DateTime (§4.11). ### 4.13 ObjectId (`0x14`) A 12-byte opaque identifier. There is no length prefix — the size is fixed. ``` Bytes[12] ``` ### 4.14 Custom types Custom types allow extending the format with application-specific types. **CustomById (`0x1E`):** ``` VarUInt(TotalSize) + VarUInt(TypeId) + Payload[TotalSize - sizeof(VarUInt(TypeId))] ``` `TotalSize` is the combined byte count of the encoded TypeId VarUInt and the Payload. **CustomByName (`0x1F`):** ``` VarUInt(TotalSize) + VarUInt(NameByteCount) + Name[NameByteCount] + Payload[remainder] ``` `TotalSize` is the combined byte count of the encoded name-length VarUInt, the name bytes, and the payload. The name is UTF-8 encoded, not null-terminated. --- ## 5. Objects An object is an unordered collection of uniquely named fields. There are two encoding forms: ### 5.1 Non-uniform Object (`0x02`) Used when fields have different types (or when the object is empty). ``` VarUInt(PayloadSize) + Field₁ + Field₂ + … + Fieldₙ ``` `PayloadSize` is the total byte count of all encoded fields (not including the `PayloadSize` VarUInt itself or the container's own type byte). Each field is encoded as: ``` TypeByte + VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload ``` The `TypeByte` includes both `HasFieldType` (`0x40`) and `HasFieldName` (`0x80`) flags OR'd with the type ID — i.e., the stored type byte is `TypeID | 0xC0`. ### 5.2 Uniform Object (`0x03`) Used when every field has the same type. The shared type is stored once in the header and omitted from individual fields. ``` VarUInt(PayloadSize) + FieldType(1 byte) + Field₁ + Field₂ + … + Fieldₙ ``` `PayloadSize` includes the 1-byte field type and all field bytes. Each field is encoded as: ``` VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload ``` The individual fields do **not** include a type byte. They do retain the `HasFieldName` flag behavior (names are present), but the type is provided by the container header. ### 5.3 Empty Object An empty non-uniform object is 2 bytes: type byte + `VarUInt(0)`. ``` 0x02 0x00 ``` (A uniform object cannot be empty because there is no type to store.) ### 5.4 Object field constraints - Field names must be non-empty. - Field names must be unique within the object (case-sensitive comparison). - Field names are UTF-8 encoded, not null-terminated. - Field ordering is not prescribed by the format but is significant for equality comparison — two objects with the same fields in a different order are byte-wise different. --- ## 6. Arrays An array is an ordered collection of unnamed fields. There are two encoding forms: ### 6.1 Non-uniform Array (`0x04`) Used when items have different types. ``` VarUInt(PayloadSize) + VarUInt(ItemCount) + Field₁ + Field₂ + … + Fieldₙ ``` `PayloadSize` is the total byte count of `VarUInt(ItemCount)` plus all encoded fields. Each field is encoded as: ``` TypeByte + FieldPayload ``` The `TypeByte` includes the `HasFieldType` flag (`0x40`) OR'd with the type ID. Fields in arrays do **not** have names. ### 6.2 Uniform Array (`0x05`) Used when every item has the same type **and** every item has a non-zero-byte encoding. ``` VarUInt(PayloadSize) + VarUInt(ItemCount) + FieldType(1 byte) + Field₁ + … + Fieldₙ ``` `PayloadSize` includes `VarUInt(ItemCount)`, the 1-byte field type, and all field bytes. Each field is encoded as just its payload — no type byte, no name. ### 6.3 Empty Array An empty non-uniform array: ``` 0x04 0x01 0x00 ``` That is: type `0x04` + `VarUInt(1)` (payload size = 1 byte for the count) + `VarUInt(0)` (item count = 0). ### 6.4 Uniform array constraints A uniform array **must not** be used when items have zero-byte payloads (e.g., all Null or all Bool fields). Because such items encode as zero bytes each, they would be indistinguishable, and the container would have no way to address individual items. Use a non-uniform array in these cases. --- ## 7. Top-Level Fields A Compact Binary payload at the top level is typically a single field. This field may or may not include its type byte, depending on context: - **With type:** The field starts with its type byte (with `HasFieldType` flag set). This is the self-describing form used when the consumer does not know the type in advance. - **Without type:** The type is communicated out of band (e.g., the consumer knows to expect an Object). The field begins directly with its payload. A top-level object field (the most common case) is encoded as: ``` TypeByte(0x02) + ObjectPayload ``` or without the type byte, just: ``` ObjectPayload ``` --- ## 8. Packages A package bundles a Compact Binary object with its external attachments (referenced via ObjectAttachment and BinaryAttachment fields). It is serialized as a sequence of unnamed top-level fields: ### 8.1 Package structure ``` [Attachment₁] [Attachment₂] … [Object] [ObjectHash] [Attachment₃] … [Null] ``` - **Object** — An `Object` field containing the root compact binary data. - **ObjectHash** — An `ObjectAttachment` field (`0x0E`) containing the 20-byte hash of the serialized root object. Omitted when the object is empty. - **Attachments** — Each attachment is a pair of fields: 1. A `Binary` field containing the attachment data. 2. A `BinaryAttachment` or `ObjectAttachment` field containing the hash of that data. The hash field is omitted when the binary data is empty. - **Null** — A `Null` field (`0x01`) terminates the package. ### 8.2 Ordering The canonical order is: 1. Root object + its hash 2. Attachments ordered by hash 3. Null terminator However, it is valid for components to appear in any order as long as: - There is at most one root object. - The Null terminator is last. ### 8.3 Package constraints - At most one root object per package. - No duplicate attachments (by hash). - No null/empty attachments. - Attachment hashes must match their data. --- ## 9. Validation Implementations should support the following validation modes: | Mode | Checks | |------|--------| | **Default** | All fields are within bounds and have recognized types. Minimum required for safe reading. | | **Names** | Object fields have unique, non-empty names. Array fields have no names. | | **Format** | Canonical encoding: minimal VarUInt sizes, Float64→Float32 demotion, uniform containers used when possible, valid UTF-8 in names and strings. | | **Padding** | No trailing bytes after the top-level field. | | **Package** | Package/attachment structure is well-formed. | | **PackageHash** | Stored hashes match computed hashes. | --- ## 10. Hashing The canonical hash of a field is computed over: 1. The **serialized type byte** (type ID | `HasFieldName` flag; the `HasFieldType` flag is stripped). 2. The **name** (if present): `VarUInt(NameByteCount) + NameBytes`. 3. The **payload**. This allows deterministic content-addressing of any field, object, or array. --- ## 11. Complete Encoding Examples ### 11.1 Simple object An object with fields `"name": "Alice"` (String) and `"age": 30` (IntegerPositive): ``` 02 -- Object type 17 -- VarUInt PayloadSize = 23 bytes C7 -- String | HasFieldType | HasFieldName 04 -- VarUInt NameLen = 4 6E 61 6D 65 -- "name" 05 -- VarUInt StringLen = 5 41 6C 69 63 65 -- "Alice" C8 -- IntegerPositive | HasFieldType | HasFieldName 03 -- VarUInt NameLen = 3 61 67 65 -- "age" 1E -- VarUInt Value = 30 ``` Total: 25 bytes. ### 11.2 Uniform array of integers An array of three positive integers `[1, 2, 3]`: ``` 05 -- UniformArray type 06 -- VarUInt PayloadSize = 6 03 -- VarUInt ItemCount = 3 08 -- FieldType = IntegerPositive 01 -- VarUInt 1 02 -- VarUInt 2 03 -- VarUInt 3 ``` Total: 8 bytes. ### 11.3 Negative integer The value −42 encoded as a standalone field: ``` 09 -- IntegerNegative type 29 -- VarUInt(~(-42)) = VarUInt(41) = 0x29 ``` ### 11.4 Nested object An object containing a nested object: ``` 02 -- Outer Object type 0E -- VarUInt PayloadSize = 14 C2 -- Object | HasFieldType | HasFieldName 05 -- VarUInt NameLen = 5 69 6E 6E 65 72 -- "inner" 05 -- VarUInt inner PayloadSize = 5 C8 -- IntegerPositive | HasFieldType | HasFieldName 01 -- VarUInt NameLen = 1 78 -- "x" 0A -- VarUInt Value = 10 ``` --- ## 12. Summary of Fixed-Size Payloads | Type | Payload size | |------|-------------| | Null | 0 | | BoolFalse | 0 | | BoolTrue | 0 | | Float32 | 4 | | Float64 | 8 | | Hash | 20 | | ObjectAttachment | 20 | | BinaryAttachment | 20 | | Uuid | 16 | | DateTime | 8 | | TimeSpan | 8 | | ObjectId | 12 | ## 13. Summary of Variable-Size Payloads | Type | Payload structure | |------|-------------------| | Binary | `VarUInt(Size) + Bytes` | | String | `VarUInt(Size) + UTF8Bytes` | | IntegerPositive | `VarUInt(Value)` | | IntegerNegative | `VarUInt(~Value)` | | Object | `VarUInt(Size) + Fields` | | UniformObject | `VarUInt(Size) + Type + Fields` | | Array | `VarUInt(Size) + VarUInt(Count) + Fields` | | UniformArray | `VarUInt(Size) + VarUInt(Count) + Type + Fields` | | CustomById | `VarUInt(Size) + VarUInt(TypeId) + Data` | | CustomByName | `VarUInt(Size) + VarUInt(NameLen) + Name + Data` |