| author | Stefan Boberg <[email protected]> | 2026-03-18 11:19:10 +0100 |
|---|---|---|
| committer | GitHub Enterprise <[email protected]> | 2026-03-18 11:19:10 +0100 |
| commit | eba410c4168e23d7908827eb34b7cf0c58a5dc48 (patch) | |
| tree | 3cda8e8f3f81941d3bb5b84a8155350c5bb2068c /docs | |
| parent | bugfix release - v5.7.23 (#851) (diff) | |
Compute batching (#849)
### Compute Batch Submission
- Consolidate duplicated action submission logic in `httpcomputeservice` into a single `HandleSubmitAction` supporting both single-action and batch (actions array) payloads
- Group actions by queue in `RemoteHttpRunner` and submit as batches with configurable chunk size, falling back to individual submission on failure
- Extract shared helpers: `MakeErrorResult`, `ValidateQueueForEnqueue`, `ActivateActionInQueue`, `RemoveActionFromActiveMaps`
### Retracted Action State
- Add `Retracted` state to `RunnerAction` for retry-free rescheduling — an explicit request to pull an action back and reschedule it on a different runner without incrementing `RetryCount`
- Implement idempotent `RetractAction()` on `RunnerAction` and `ComputeServiceSession`
- Add `POST jobs/{lsn}/retract` and `queues/{queueref}/jobs/{lsn}/retract` HTTP endpoints
- Add state machine documentation and per-state comments to `RunnerAction`
### Compute Race Fixes
- Fix race in `HandleActionUpdates` where actions enqueued between session abandon and scheduler tick were never abandoned, causing `GetActionResult` to return 202 indefinitely
- Fix queue `ActiveCount` race where `NotifyQueueActionComplete` was called after releasing `m_ResultsLock`, allowing callers to observe stale counters immediately after `GetActionResult` returned OK
### Logging Optimization and ANSI Improvements
- Improve `AnsiColorStdoutSink` write efficiency — single write call, dirty-flag flush, `RwLock` instead of `std::mutex`
- Move ANSI color emission from sink into formatters via `Formatter::SetColorEnabled()`; remove `ColorRangeStart`/`End` from `LogMessage`
- Extract color helpers (`AnsiColorForLevel`, `StripAnsiSgrSequences`) into `helpers.h`
- Strip upstream ANSI SGR escapes in non-color output mode. This enables color in log messages without polluting log files with ANSI control sequences
- Move `RotatingFileSink`, `JsonFormatter`, and `FullFormatter` from header-only to pimpl with `.cpp` files
### CLI / Exec Refactoring
- Extract `ExecSessionRunner` class from ~920-line `ExecUsingSession` into focused methods and an `ExecSessionConfig` struct
- Replace monolithic `ExecCommand` with subcommand-based architecture (`http`, `inproc`, `beacon`, `dump`, `buildlog`)
- Allow parent options to appear after subcommand name by parsing subcommand args permissively and forwarding unmatched tokens to the parent parser
### Testing Improvements
- Fix `--test-suite` filter being ignored due to accumulation with default wildcard filter
- Add test suite banners to test listener output
- Make `function.session.abandon_pending` test more robust
### Startup / Reliability Fixes
- Fix silent exit when a second zenserver instance detects a port conflict — use `ZEN_CONSOLE_*` for log calls that precede `InitializeLogging()`
- Fix two potential SIGSEGV paths during early startup: guard `sentry_options_new()` returning nullptr, and throw on `ZenServerState::Register()` returning nullptr instead of dereferencing
- Fail on unrecognized zenserver `--mode` instead of silently defaulting to store
### Other
- Show host details (hostname, platform, CPU count, memory) when discovering new compute workers
- Move frontend `html.zip` from source tree into build directory
- Add format specifications for Compact Binary and Compressed Buffer wire formats
- Add `WriteCompactBinaryObject` to zencore
- Extend `ConsoleTui` with additional functionality
- Add `--vscode` option to `xmake sln` for clangd / `compile_commands.json` support
- Disable compute/horde/nomad in release builds (not yet production-ready)
- Disable unintended `ASIO_HAS_IO_URING` enablement
- Fix crashpad patch missing leading whitespace
- Clean up code triggering gcc false positives
Diffstat (limited to 'docs')
| -rw-r--r-- | docs/specs/CompactBinary.md | 663 |
| -rw-r--r-- | docs/specs/CompressedBuffer.md | 185 |
2 files changed, 848 insertions, 0 deletions
diff --git a/docs/specs/CompactBinary.md b/docs/specs/CompactBinary.md new file mode 100644 index 000000000..d8cccbd1e --- /dev/null +++ b/docs/specs/CompactBinary.md @@ -0,0 +1,663 @@ +# Compact Binary Format Specification + +**Version:** 1.0 + +## Overview + +Compact Binary (CB) is a binary serialization format designed for efficient storage and +transmission of structured data. It is self-describing, supports nested objects and arrays, +and optimizes for minimal size through variable-length encoding and uniform container +optimizations. + +Key design goals: + +- **Compact representation** — variable-length integers, type elision in uniform containers +- **Self-describing** — every field carries its own type; no external schema required +- **Safe to traverse** — a reader can skip any field without understanding its type +- **Deterministic** — a canonical encoding exists so that byte-identical payloads compare equal +- **Hashable** — every field and object can be content-addressed with a stable hash + +## 1. Notation + +| Symbol | Meaning | +|---------------|---------| +| `byte` | An unsigned 8-bit integer (octet). | +| `VarUInt(v)` | A variable-length unsigned integer encoding of value `v` (see §2). | +| `BE32(v)` | A 32-bit value stored in big-endian (network) byte order. | +| `BE64(v)` | A 64-bit value stored in big-endian (network) byte order. | +| `+` | Concatenation of byte sequences. | + +All multi-byte numeric values in this specification are stored in **big-endian** byte order +unless stated otherwise. + +--- + +## 2. Variable-Length Unsigned Integer (VarUInt) + +VarUInt encodes a 64-bit unsigned integer in 1–9 bytes. The number of leading 1-bits in the +first byte indicates how many *additional* bytes follow. The remaining bits of the first byte, +concatenated with the additional bytes in big-endian order, form the integer value. 
+ +### 2.1 Encoding table + +| Leading 1-bits | Total bytes | First byte pattern | Value range | +|:-:|:-:|---|---| +| 0 | 1 | `0b0_______` | `0x00` – `0x7F` | +| 1 | 2 | `0b10______` | `0x80` – `0x3FFF` | +| 2 | 3 | `0b110_____` | `0x4000` – `0x1F_FFFF` | +| 3 | 4 | `0b1110____` | `0x20_0000` – `0x0FFF_FFFF` | +| 4 | 5 | `0b11110___` | `0x1000_0000` – `0x07_FFFF_FFFF` | +| 5 | 6 | `0b111110__` | `0x08_0000_0000` – `0x03FF_FFFF_FFFF` | +| 6 | 7 | `0b1111110_` | `0x0400_0000_0000` – `0x01_FFFF_FFFF_FFFF` | +| 7 | 8 | `0b11111110` | `0x02_0000_0000_0000` – `0xFF_FFFF_FFFF_FFFF` | +| 8 | 9 | `0b11111111` | `0x0100_0000_0000_0000` – `0xFFFF_FFFF_FFFF_FFFF` | + +### 2.2 Measuring the byte count from encoded data + +Count the number of leading 1-bits in the first byte (equivalently, count leading zeros of the +bitwise complement). The byte count is that number plus one: + +``` +ByteCount = CountLeadingOnes(FirstByte) + 1 +``` + +### 2.3 Reading + +1. Determine `ByteCount` from the first byte. +2. Mask the first byte: `Value = FirstByte & (0xFF >> ByteCount)`. +3. For each subsequent byte (in order): `Value = (Value << 8) | NextByte`. + +### 2.4 Writing + +1. Determine `ByteCount` from the value magnitude (see table above). +2. Store `ByteCount - 1` trailing bytes from the value in big-endian order. +3. Set the first byte to the remaining most-significant bits of the value, OR'd with a prefix + mask of `0xFF << (9 - ByteCount)`. + +### 2.5 Canonical form + +A VarUInt is canonical when it uses the minimum number of bytes required for its value. Format +validation (§9) rejects non-canonical VarUInt encodings. + +### 2.6 Byte-order preservation + +Encoded VarUInt values sort identically in a byte-wise (lexicographic) comparison as when their +decoded unsigned values are compared numerically. This property does **not** hold for signed +integers encoded via ZigZag. 
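
Editorial sketch (not part of the committed file): the VarUInt rules in §2.2 through §2.5 can be illustrated with a small Python encoder and decoder. The function names `varuint_encode` and `varuint_decode` are hypothetical and are not taken from zencore; the decoder shown here does not reject non-canonical encodings, which Format validation (§9) would.

```python
def varuint_encode(value: int) -> bytes:
    """Encode an unsigned 64-bit integer with the minimal (canonical) byte count."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of range for VarUInt")
    # A total of n bytes carries 7*n value bits; the 9-byte form carries all 64.
    byte_count = 1
    while byte_count < 9 and value >= 1 << (7 * byte_count):
        byte_count += 1
    # Prefix mask from section 2.4, truncated to one byte: (byte_count - 1) leading 1-bits.
    prefix = (0xFF << (9 - byte_count)) & 0xFF
    raw = value.to_bytes(byte_count, "big")
    return bytes([raw[0] | prefix]) + raw[1:]


def varuint_decode(data: bytes, offset: int = 0) -> tuple[int, int]:
    """Decode a VarUInt at `offset`; return (value, bytes_consumed)."""
    first = data[offset]
    # Leading 1-bits of the first byte equal the leading 0-bits of its complement.
    leading_ones = 8 - (first ^ 0xFF).bit_length()
    byte_count = leading_ones + 1
    if offset + byte_count > len(data):
        raise ValueError("truncated VarUInt")
    value = first & (0xFF >> byte_count)
    for next_byte in data[offset + 1 : offset + byte_count]:
        value = (value << 8) | next_byte
    return value, byte_count
```

Because shorter encodings always start with a smaller first byte, sorting the encoded byte strings lexicographically gives the same order as sorting the decoded values, which is the property described in §2.6.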
+ +### 2.7 Encoding examples + +| Value | Encoded bytes | +|-------|---------------| +| `0x01` | `01` | +| `0x7F` | `7F` | +| `0x80` | `80 80` | +| `0x123` | `81 23` | +| `0x1234` | `92 34` | +| `0x12345` | `C1 23 45` | +| `0x123456` | `D2 34 56` | +| `0x1234567` | `E1 23 45 67` | +| `0x12345678` | `F0 12 34 56 78` | +| `0x123456789ABCDEF0` | `FF 12 34 56 78 9A BC DE F0` | + +--- + +## 3. Field Type + +Every field has a type, stored as a single byte. The low 6 bits identify the type; the upper +2 bits are flags. + +### 3.1 Type byte layout + +``` + Bit 7 Bit 6 Bits 5..0 +┌───────────┬───────────┬──────────────────┐ +│ HasFieldName │ HasFieldType │ Type ID (0x00–0x3F) │ +└───────────┴───────────┴──────────────────┘ +``` + +### 3.2 Flags + +| Flag | Value | Meaning | +|------|-------|---------| +| `HasFieldType` | `0x40` | **Transient.** The type byte is stored inline before the field payload. Set on fields in non-uniform containers. This flag is **not** persisted when hashing or serializing the type for comparison purposes. | +| `HasFieldName` | `0x80` | **Persisted.** The field has a name (used inside objects). | + +### 3.3 Type identifiers + +> **Stability notice:** Type values are fixed for backward compatibility and must never change. 
+ +| ID | Name | Payload | +|----|------|---------| +| `0x00` | None | *(invalid — must not appear in valid data)* | +| `0x01` | Null | Empty (0 bytes) | +| `0x02` | Object | `VarUInt(Size)` + fields (see §5) | +| `0x03` | UniformObject | `VarUInt(Size)` + `FieldType(1)` + fields (see §5) | +| `0x04` | Array | `VarUInt(Size)` + `VarUInt(Count)` + fields (see §6) | +| `0x05` | UniformArray | `VarUInt(Size)` + `VarUInt(Count)` + `FieldType(1)` + fields (see §6) | +| `0x06` | Binary | `VarUInt(Size)` + raw bytes | +| `0x07` | String | `VarUInt(Size)` + UTF-8 bytes (no null terminator) | +| `0x08` | IntegerPositive | `VarUInt(Value)` — non-negative integer (0 to 2^64−1) | +| `0x09` | IntegerNegative | `VarUInt(~Value)` — negative integer (−1 to −2^63) | +| `0x0A` | Float32 | `BE32(IEEE 754 binary32)` — 4 bytes | +| `0x0B` | Float64 | `BE64(IEEE 754 binary64)` — 8 bytes | +| `0x0C` | BoolFalse | Empty (0 bytes) | +| `0x0D` | BoolTrue | Empty (0 bytes) | +| `0x0E` | ObjectAttachment | 20 raw bytes — hash of a compact binary attachment | +| `0x0F` | BinaryAttachment | 20 raw bytes — hash of a binary attachment | +| `0x10` | Hash | 20 raw bytes — hash digest | +| `0x11` | Uuid | 16 bytes — UUID (see §4.10) | +| `0x12` | DateTime | `BE64(int64 ticks)` — 8 bytes (see §4.11) | +| `0x13` | TimeSpan | `BE64(int64 ticks)` — 8 bytes (see §4.12) | +| `0x14` | ObjectId | 12 raw bytes — opaque identifier | +| `0x1E` | CustomById | `VarUInt(Size)` + `VarUInt(TypeId)` + payload (see §4.14) | +| `0x1F` | CustomByName | `VarUInt(Size)` + `VarUInt(NameLen)` + name + payload (see §4.14) | +| `0x20` | *(Reserved)* | Reserved for future flags. Do not define types in this range. 
| + +### 3.4 Type family classification + +Several types form families that can be recognized by bitmask tests on the type ID (low 6 bits): + +| Family | Mask | Base | Members | +|--------|------|------|---------| +| Object | `0x3E` | `0x02` | Object, UniformObject | +| Array | `0x3E` | `0x04` | Array, UniformArray | +| Integer | `0x3E` | `0x08` | IntegerPositive, IntegerNegative | +| Float | `0x3C` | `0x08` | Float32, Float64, IntegerPositive, IntegerNegative | +| Bool | `0x3E` | `0x0C` | BoolFalse, BoolTrue | +| Attachment | `0x3E` | `0x0E` | ObjectAttachment, BinaryAttachment | + +A type belongs to a family when `(TypeID & Mask) == Base`. + +Note that the Float family intentionally includes integer types because integers can be +implicitly converted to floating-point when reading. + +--- + +## 4. Field Types in Detail + +### 4.1 Null (`0x01`) + +Represents an absent or null value. Payload is empty. + +### 4.2 Binary (`0x06`) + +An arbitrary byte sequence. + +``` +VarUInt(ByteCount) + Bytes[ByteCount] +``` + +### 4.3 String (`0x07`) + +A UTF-8 encoded text string, **not** null-terminated. `Size` is the byte length, not the +character count. + +``` +VarUInt(ByteCount) + UTF8Bytes[ByteCount] +``` + +Canonical form requires valid UTF-8 (validated in Format mode, §9). + +### 4.4 IntegerPositive (`0x08`) + +A non-negative integer in the range [0, 2^64−1]. + +``` +VarUInt(Value) +``` + +### 4.5 IntegerNegative (`0x09`) + +A negative integer in the range [−2^63, −1]. The payload is the ones' complement of the +value encoded as a VarUInt: + +``` +VarUInt(~Value) +``` + +Where `~` is bitwise NOT. For example, −1 is encoded as `VarUInt(0)`, −42 is encoded as +`VarUInt(41)`. + +To decode: read the VarUInt magnitude `M`, then `Value = ~M` (equivalently, `Value = M ^ -1`, +or `Value = -(M + 1)`). + +> **Important:** This is ones' complement encoding, **not** ZigZag encoding. 
The VarInt +> functions in the codebase (which use ZigZag) are a separate encoding used elsewhere; Compact +> Binary integer fields use the type-tag approach with ones' complement for negatives. + +### 4.6 Float32 (`0x0A`) + +A 32-bit IEEE 754 binary32 floating-point value in big-endian byte order. + +``` +BE32(float32_bits) — 4 bytes +``` + +### 4.7 Float64 (`0x0B`) + +A 64-bit IEEE 754 binary64 floating-point value in big-endian byte order. + +``` +BE64(float64_bits) — 8 bytes +``` + +**Canonical form:** A Float64 value that can be represented exactly as Float32 (i.e., where +`(double)(float)value == value`) should be encoded as Float32 instead. Format validation +(§9) flags this as `InvalidFloat`. + +### 4.8 Bool (`0x0C` / `0x0D`) + +Boolean values are encoded purely by their type — there is no payload. + +- `BoolFalse` (`0x0C`): payload is empty. +- `BoolTrue` (`0x0D`): payload is empty. + +### 4.9 Hash (`0x10`), ObjectAttachment (`0x0E`), BinaryAttachment (`0x0F`) + +All three are 20 raw bytes representing a hash digest. There is no length prefix — the size is +fixed. + +``` +Bytes[20] +``` + +- **Hash** — a general-purpose hash digest. +- **ObjectAttachment** — a hash referencing an external Compact Binary object. +- **BinaryAttachment** — a hash referencing external raw binary data. + +The hash algorithm is determined by the application context (the format itself does not +prescribe a specific hash algorithm, though the reference implementation uses BLAKE3 truncated +to 160 bits). + +### 4.10 Uuid (`0x11`) + +A 128-bit UUID/GUID, stored as four 32-bit unsigned integers in big-endian byte order. + +``` +BE32(A) + BE32(B) + BE32(C) + BE32(D) — 16 bytes total +``` + +The four components (A, B, C, D) correspond to the four 32-bit segments of the UUID when +read as a sequence of big-endian 32-bit words. 
For an RFC 4122 UUID string +`"aabbccdd-eeff-0011-2233-445566778899"`: + +- `A = 0xAABBCCDD` +- `B = 0xEEFF0011` +- `C = 0x22334455` +- `D = 0x66778899` + +### 4.11 DateTime (`0x12`) + +A date and time value encoded as a big-endian signed 64-bit integer counting 100-nanosecond +ticks since the epoch **0001-01-01 00:00:00.0000000**. + +``` +BE64(int64 Ticks) — 8 bytes +``` + +Valid range: 0001-01-01 00:00:00.0000000 through 9999-12-31 23:59:59.9999999. + +Reference tick constants: + +| Unit | Ticks | +|------|-------| +| Microsecond | 10 | +| Millisecond | 10,000 | +| Second | 10,000,000 | +| Minute | 600,000,000 | +| Hour | 36,000,000,000 | +| Day | 864,000,000,000 | + +### 4.12 TimeSpan (`0x13`) + +A duration encoded as a big-endian signed 64-bit integer counting 100-nanosecond ticks. May be +negative. + +``` +BE64(int64 Ticks) — 8 bytes +``` + +Uses the same tick unit as DateTime (§4.11). + +### 4.13 ObjectId (`0x14`) + +A 12-byte opaque identifier. There is no length prefix — the size is fixed. + +``` +Bytes[12] +``` + +### 4.14 Custom types + +Custom types allow extending the format with application-specific types. + +**CustomById (`0x1E`):** + +``` +VarUInt(TotalSize) + VarUInt(TypeId) + Payload[TotalSize - sizeof(VarUInt(TypeId))] +``` + +`TotalSize` is the combined byte count of the encoded TypeId VarUInt and the Payload. + +**CustomByName (`0x1F`):** + +``` +VarUInt(TotalSize) + VarUInt(NameByteCount) + Name[NameByteCount] + Payload[remainder] +``` + +`TotalSize` is the combined byte count of the encoded name-length VarUInt, the name bytes, and +the payload. The name is UTF-8 encoded, not null-terminated. + +--- + +## 5. Objects + +An object is an unordered collection of uniquely named fields. There are two encoding forms: + +### 5.1 Non-uniform Object (`0x02`) + +Used when fields have different types (or when the object is empty). 
+ +``` +VarUInt(PayloadSize) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` is the total byte count of all encoded fields (not including the `PayloadSize` +VarUInt itself or the container's own type byte). + +Each field is encoded as: + +``` +TypeByte + VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload +``` + +The `TypeByte` includes both `HasFieldType` (`0x40`) and `HasFieldName` (`0x80`) flags OR'd +with the type ID — i.e., the stored type byte is `TypeID | 0xC0`. + +### 5.2 Uniform Object (`0x03`) + +Used when every field has the same type. The shared type is stored once in the header and +omitted from individual fields. + +``` +VarUInt(PayloadSize) + FieldType(1 byte) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` includes the 1-byte field type and all field bytes. + +Each field is encoded as: + +``` +VarUInt(NameByteCount) + Name[NameByteCount] + FieldPayload +``` + +The individual fields do **not** include a type byte. They do retain the `HasFieldName` flag +behavior (names are present), but the type is provided by the container header. + +### 5.3 Empty Object + +An empty non-uniform object is 2 bytes: type byte + `VarUInt(0)`. + +``` +0x02 0x00 +``` + +(A uniform object cannot be empty because there is no type to store.) + +### 5.4 Object field constraints + +- Field names must be non-empty. +- Field names must be unique within the object (case-sensitive comparison). +- Field names are UTF-8 encoded, not null-terminated. +- Field ordering is not prescribed by the format but is significant for equality comparison — + two objects with the same fields in a different order are byte-wise different. + +--- + +## 6. Arrays + +An array is an ordered collection of unnamed fields. There are two encoding forms: + +### 6.1 Non-uniform Array (`0x04`) + +Used when items have different types. 
+ +``` +VarUInt(PayloadSize) + VarUInt(ItemCount) + Field₁ + Field₂ + … + Fieldₙ +``` + +`PayloadSize` is the total byte count of `VarUInt(ItemCount)` plus all encoded fields. + +Each field is encoded as: + +``` +TypeByte + FieldPayload +``` + +The `TypeByte` includes the `HasFieldType` flag (`0x40`) OR'd with the type ID. Fields in +arrays do **not** have names. + +### 6.2 Uniform Array (`0x05`) + +Used when every item has the same type **and** every item has a non-zero-byte encoding. + +``` +VarUInt(PayloadSize) + VarUInt(ItemCount) + FieldType(1 byte) + Field₁ + … + Fieldₙ +``` + +`PayloadSize` includes `VarUInt(ItemCount)`, the 1-byte field type, and all field bytes. + +Each field is encoded as just its payload — no type byte, no name. + +### 6.3 Empty Array + +An empty non-uniform array: + +``` +0x04 0x01 0x00 +``` + +That is: type `0x04` + `VarUInt(1)` (payload size = 1 byte for the count) + `VarUInt(0)` +(item count = 0). + +### 6.4 Uniform array constraints + +A uniform array **must not** be used when items have zero-byte payloads (e.g., all Null or all +Bool fields). Because such items encode as zero bytes each, they would be indistinguishable, +and the container would have no way to address individual items. Use a non-uniform array in +these cases. + +--- + +## 7. Top-Level Fields + +A Compact Binary payload at the top level is typically a single field. This field may or may +not include its type byte, depending on context: + +- **With type:** The field starts with its type byte (with `HasFieldType` flag set). This is + the self-describing form used when the consumer does not know the type in advance. +- **Without type:** The type is communicated out of band (e.g., the consumer knows to expect an + Object). The field begins directly with its payload. + +A top-level object field (the most common case) is encoded as: + +``` +TypeByte(0x02) + ObjectPayload +``` + +or without the type byte, just: + +``` +ObjectPayload +``` + +--- + +## 8. 
Packages + +A package bundles a Compact Binary object with its external attachments (referenced via +ObjectAttachment and BinaryAttachment fields). It is serialized as a sequence of unnamed +top-level fields: + +### 8.1 Package structure + +``` +[Attachment₁] [Attachment₂] … [Object] [ObjectHash] [Attachment₃] … [Null] +``` + +- **Object** — An `Object` field containing the root compact binary data. +- **ObjectHash** — An `ObjectAttachment` field (`0x0E`) containing the 20-byte hash of the + serialized root object. Omitted when the object is empty. +- **Attachments** — Each attachment is a pair of fields: + 1. A `Binary` field containing the attachment data. + 2. A `BinaryAttachment` or `ObjectAttachment` field containing the hash of that data. + The hash field is omitted when the binary data is empty. +- **Null** — A `Null` field (`0x01`) terminates the package. + +### 8.2 Ordering + +The canonical order is: + +1. Root object + its hash +2. Attachments ordered by hash +3. Null terminator + +However, it is valid for components to appear in any order as long as: +- There is at most one root object. +- The Null terminator is last. + +### 8.3 Package constraints + +- At most one root object per package. +- No duplicate attachments (by hash). +- No null/empty attachments. +- Attachment hashes must match their data. + +--- + +## 9. Validation + +Implementations should support the following validation modes: + +| Mode | Checks | +|------|--------| +| **Default** | All fields are within bounds and have recognized types. Minimum required for safe reading. | +| **Names** | Object fields have unique, non-empty names. Array fields have no names. | +| **Format** | Canonical encoding: minimal VarUInt sizes, Float64→Float32 demotion, uniform containers used when possible, valid UTF-8 in names and strings. | +| **Padding** | No trailing bytes after the top-level field. | +| **Package** | Package/attachment structure is well-formed. 
| + | **PackageHash** | Stored hashes match computed hashes. | + +--- + +## 10. Hashing + +The canonical hash of a field is computed over: + +1. The **serialized type byte** (type ID | `HasFieldName` flag; the `HasFieldType` flag is + stripped). +2. The **name** (if present): `VarUInt(NameByteCount) + NameBytes`. +3. The **payload**. + +This allows deterministic content-addressing of any field, object, or array. + +--- + +## 11. Complete Encoding Examples + +### 11.1 Simple object + +An object with fields `"name": "Alice"` (String) and `"age": 30` (IntegerPositive): + +``` +02 -- Object type + 12 -- VarUInt PayloadSize = 18 bytes + C7 -- String | HasFieldType | HasFieldName + 04 -- VarUInt NameLen = 4 + 6E 61 6D 65 -- "name" + 05 -- VarUInt StringLen = 5 + 41 6C 69 63 65 -- "Alice" + C8 -- IntegerPositive | HasFieldType | HasFieldName + 03 -- VarUInt NameLen = 3 + 61 67 65 -- "age" + 1E -- VarUInt Value = 30 +``` + +Total: 20 bytes. + +### 11.2 Uniform array of integers + +An array of three positive integers `[1, 2, 3]`: + +``` +05 -- UniformArray type + 05 -- VarUInt PayloadSize = 5 + 03 -- VarUInt ItemCount = 3 + 08 -- FieldType = IntegerPositive + 01 -- VarUInt 1 + 02 -- VarUInt 2 + 03 -- VarUInt 3 +``` + +Total: 7 bytes. + +### 11.3 Negative integer + +The value −42 encoded as a standalone field: + +``` +09 -- IntegerNegative type +29 -- VarUInt(~(-42)) = VarUInt(41) = 0x29 +``` + +### 11.4 Nested object + +An object containing a nested object: + +``` +02 -- Outer Object type + 0C -- VarUInt PayloadSize = 12 + C2 -- Object | HasFieldType | HasFieldName + 05 -- VarUInt NameLen = 5 + 69 6E 6E 65 72 -- "inner" + 04 -- VarUInt inner PayloadSize = 4 + C8 -- IntegerPositive | HasFieldType | HasFieldName + 01 -- VarUInt NameLen = 1 + 78 -- "x" + 0A -- VarUInt Value = 10 +``` + +--- + +## 12. 
Summary of Fixed-Size Payloads + +| Type | Payload size | +|------|-------------| +| Null | 0 | +| BoolFalse | 0 | +| BoolTrue | 0 | +| Float32 | 4 | +| Float64 | 8 | +| Hash | 20 | +| ObjectAttachment | 20 | +| BinaryAttachment | 20 | +| Uuid | 16 | +| DateTime | 8 | +| TimeSpan | 8 | +| ObjectId | 12 | + +## 13. Summary of Variable-Size Payloads + +| Type | Payload structure | +|------|-------------------| +| Binary | `VarUInt(Size) + Bytes` | +| String | `VarUInt(Size) + UTF8Bytes` | +| IntegerPositive | `VarUInt(Value)` | +| IntegerNegative | `VarUInt(~Value)` | +| Object | `VarUInt(Size) + Fields` | +| UniformObject | `VarUInt(Size) + Type + Fields` | +| Array | `VarUInt(Size) + VarUInt(Count) + Fields` | +| UniformArray | `VarUInt(Size) + VarUInt(Count) + Type + Fields` | +| CustomById | `VarUInt(Size) + VarUInt(TypeId) + Data` | +| CustomByName | `VarUInt(Size) + VarUInt(NameLen) + Name + Data` | diff --git a/docs/specs/CompressedBuffer.md b/docs/specs/CompressedBuffer.md new file mode 100644 index 000000000..11787e3e9 --- /dev/null +++ b/docs/specs/CompressedBuffer.md @@ -0,0 +1,185 @@ +# Compressed Buffer Format Specification + +**Version:** 1.0 + +## Overview + +Compressed Buffer is a self-describing binary container for compressed data. It encodes the +compression method, block layout, and integrity checksums so that a reader can decompress the +payload without any external metadata. + +Key design goals: + +- **Self-describing** -- decompression requires no out-of-band knowledge of the compression method or original size +- **Block-based** -- data is split into independently-decompressible blocks for random access and parallel processing +- **Integrity-checked** -- CRC-32 on the header and BLAKE3 hash on the raw data +- **Method-agnostic** -- supports multiple compression backends (None, Oodle, LZ4) + +## 1. Notation + +| Symbol | Meaning | +|--------------|---------| +| `byte` | An unsigned 8-bit integer (octet). 
| +| `BE32(v)` | A 32-bit value stored in big-endian byte order. | +| `BE64(v)` | A 64-bit value stored in big-endian byte order. | +| `+` | Concatenation of byte sequences. | + +All multi-byte numeric values are stored in **big-endian** byte order. + +--- + +## 2. Magic Number + +Every compressed buffer begins with the 4-byte magic value: + +``` +0xb7756362 +``` + +Stored big-endian. This corresponds to the ASCII bytes `.ucb`. + +--- + +## 3. Header Layout (64 bytes) + +The header is a fixed 64-byte structure at offset 0: + +| Offset | Field | Type | Size | Description | +|--------|--------------------|----------|------|-------------| +| 0 | Magic | uint32 | 4 | `0xb7756362` (big-endian) | +| 4 | Crc32 | uint32 | 4 | CRC-32 of header bytes 8..63 (polynomial `0x04c11db7`) | +| 8 | Method | uint8 | 1 | Compression method (see below) | +| 9 | Compressor | uint8 | 1 | Method-specific compressor ID | +| 10 | CompressionLevel | uint8 | 1 | Method-specific compression level | +| 11 | BlockSizeExponent | uint8 | 1 | Block size as a power of two: `BlockSize = 1 << BlockSizeExponent` | +| 12 | BlockCount | uint32 | 4 | Number of compressed blocks | +| 16 | TotalRawSize | uint64 | 8 | Total uncompressed data size in bytes | +| 24 | TotalCompressedSize| uint64 | 8 | Total buffer size including header | +| 32 | RawHash | byte[32] | 32 | BLAKE3 hash of the uncompressed data | + +### Header CRC-32 + +The `Crc32` field covers bytes 8 through 63 of the header (56 bytes). Readers should verify +this checksum before trusting any other header field. + +--- + +## 4. Compression Methods + +### Method 0: None (Uncompressed) + +Data is stored without compression. Used as a fallback when compression would increase size. + +**Compressor**: Ignored (0). + +**Layout**: + +``` +[Header (64 bytes)] [Raw Data] +``` + +`TotalCompressedSize = 64 + TotalRawSize`. There is no block size array; the payload is a +single uncompressed span. + +### Method 3: Oodle + +Block-based compression using Oodle. 
The `Compressor` field selects the algorithm: + +| Value | Compressor | +|-------|------------| +| 1 | Selkie | +| 2 | Mermaid | +| 3 | Kraken | +| 4 | Leviathan | + +`CompressionLevel` maps to Oodle compression levels (typically -4 through +8, from +HyperFast4 to Optimal4). The default compressor is Mermaid. + +### Method 4: LZ4 + +Block-based compression using LZ4. `Compressor` and `CompressionLevel` are method-specific. + +--- + +## 5. Block-Based Layout (Methods 3, 4) + +For block-based methods the data following the header is structured as: + +``` +[Header (64 bytes)] +[Block Size Array: BlockCount x BE32] +[Compressed Block 0] +[Compressed Block 1] +... +[Compressed Block N-1] +``` + +### Block Size Array + +Immediately after the header at offset 64. Each entry is a `BE32` giving the **compressed +size** of the corresponding block. Total metadata size: `BlockCount * 4` bytes. + +Compressed block data begins at offset `64 + BlockCount * 4`. + +### Block Sizing + +- All blocks except the last decompress to `1 << BlockSizeExponent` bytes (default: 256 KB, + exponent 18). +- The last block decompresses to `TotalRawSize - (BlockCount - 1) * BlockSize` bytes. +- If a block's compressed size equals or exceeds its raw size, the block is stored + **uncompressed** (the raw bytes are used directly). + +### Total Size Invariant + +``` +TotalCompressedSize = 64 + BlockCount * 4 + sum(CompressedBlockSize[i] for i in 0..BlockCount-1) +``` + +--- + +## 6. Decompression + +1. **Read header** at offset 0 and verify the magic number. +2. **Verify CRC-32** over bytes 8..63. +3. **Dispatch on Method**: + - Method 0: Copy `TotalRawSize` bytes starting at offset 64. + - Methods 3/4: Continue with block-based decompression. +4. **Read block size array** (`BlockCount` x `BE32` at offset 64). +5. **Decompress each block** sequentially: + - If `CompressedBlockSize[i] < RawBlockSize[i]`, decompress using the indicated method. + - Otherwise, copy the block data verbatim. +6. 
**Optionally verify** the BLAKE3 hash of the reassembled raw data against `RawHash`. + +### Random-Access Decompression + +Because blocks are independent, a reader can decompress an arbitrary byte range by: + +1. Computing the first and last block indices that overlap the range. +2. Summing compressed block sizes to seek to the correct offset. +3. Decompressing only the required blocks. +4. Trimming the first and last block outputs to the requested range. + +--- + +## 7. Range Extraction + +A compressed buffer can be sliced into a sub-range without full decompression. The result is +a new compressed buffer whose blocks are a subset of the original: + +1. Compute the first and last block indices covering the requested raw range. +2. Emit a new 64-byte header with updated `BlockCount`, `TotalRawSize`, and + `TotalCompressedSize`. The `RawHash` is zeroed (not recalculated for sub-ranges). +3. Copy the corresponding entries from the block size array. +4. Reference or copy the compressed block data for the selected blocks. + +This enables efficient sub-range serving without decompressing and recompressing. + +--- + +## 8. Constants + +| Name | Value | Description | +|-------------------|--------------|-------------| +| Magic | `0xb7756362` | Header magic number | +| HeaderSize | 64 | Fixed header size in bytes | +| DefaultBlockSize | 262144 | Default raw block size (256 KB, exponent 18) | |
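
Editorial sketch (not part of the committed file): the header layout in §3 and the block arithmetic in §5 of the Compressed Buffer spec can be illustrated in Python with the standard `struct` module. The names `Header`, `parse_header`, and `block_spans` are hypothetical and do not come from zencore. CRC-32 verification is deliberately left out, since the spec gives the polynomial but not the initial value or bit reflection, so any concrete check here would be a guess.

```python
import struct
from dataclasses import dataclass

HEADER_SIZE = 64
MAGIC = 0xB7756362
# Big-endian fields in order: Magic, Crc32, Method, Compressor, CompressionLevel,
# BlockSizeExponent, BlockCount, TotalRawSize, TotalCompressedSize, RawHash.
HEADER_STRUCT = struct.Struct(">IIBBBBIQQ32s")


@dataclass
class Header:
    method: int
    compressor: int
    compression_level: int
    block_size_exponent: int
    block_count: int
    total_raw_size: int
    total_compressed_size: int
    raw_hash: bytes


def parse_header(buf: bytes) -> Header:
    """Unpack the fixed 64-byte header and check the magic (CRC is not verified)."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer shorter than header")
    magic, _crc32, *fields = HEADER_STRUCT.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    return Header(*fields)


def block_spans(buf: bytes, header: Header) -> list[tuple[int, int, int]]:
    """For block-based methods, return (offset, compressed_size, raw_size) per block."""
    block_size = 1 << header.block_size_exponent
    sizes = struct.unpack_from(f">{header.block_count}I", buf, HEADER_SIZE)
    offset = HEADER_SIZE + header.block_count * 4  # first compressed block
    spans = []
    for index, compressed_size in enumerate(sizes):
        is_last = index == header.block_count - 1
        raw_size = header.total_raw_size - index * block_size if is_last else block_size
        spans.append((offset, compressed_size, raw_size))
        offset += compressed_size
    # Total size invariant from section 5.
    if offset != header.total_compressed_size:
        raise ValueError("block sizes do not add up to TotalCompressedSize")
    return spans
```

A block whose `compressed_size` is not smaller than its `raw_size` would be copied verbatim rather than decompressed, following step 5 of §6; a random-access reader would only need the spans whose raw ranges overlap the requested byte range.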