author Stefan Boberg <[email protected]> 2026-03-18 11:19:10 +0100
committer GitHub Enterprise <[email protected]> 2026-03-18 11:19:10 +0100
commit eba410c4168e23d7908827eb34b7cf0c58a5dc48
tree 3cda8e8f3f81941d3bb5b84a8155350c5bb2068c /docs/specs/CompressedBuffer.md
parent bugfix release - v5.7.23 (#851)
Compute batching (#849)
### Compute Batch Submission

- Consolidate duplicated action submission logic in `httpcomputeservice` into a single `HandleSubmitAction` supporting both single-action and batch (actions array) payloads
- Group actions by queue in `RemoteHttpRunner` and submit as batches with configurable chunk size, falling back to individual submission on failure
- Extract shared helpers: `MakeErrorResult`, `ValidateQueueForEnqueue`, `ActivateActionInQueue`, `RemoveActionFromActiveMaps`

### Retracted Action State

- Add `Retracted` state to `RunnerAction` for retry-free rescheduling: an explicit request to pull an action back and reschedule it on a different runner without incrementing `RetryCount`
- Implement idempotent `RetractAction()` on `RunnerAction` and `ComputeServiceSession`
- Add `POST jobs/{lsn}/retract` and `queues/{queueref}/jobs/{lsn}/retract` HTTP endpoints
- Add state machine documentation and per-state comments to `RunnerAction`

### Compute Race Fixes

- Fix race in `HandleActionUpdates` where actions enqueued between session abandon and scheduler tick were never abandoned, causing `GetActionResult` to return 202 indefinitely
- Fix queue `ActiveCount` race where `NotifyQueueActionComplete` was called after releasing `m_ResultsLock`, allowing callers to observe stale counters immediately after `GetActionResult` returned OK

### Logging Optimization and ANSI improvements

- Improve `AnsiColorStdoutSink` write efficiency: single write call, dirty-flag flush, `RwLock` instead of `std::mutex`
- Move ANSI color emission from sink into formatters via `Formatter::SetColorEnabled()`; remove `ColorRangeStart`/`End` from `LogMessage`
- Extract color helpers (`AnsiColorForLevel`, `StripAnsiSgrSequences`) into `helpers.h`
- Strip upstream ANSI SGR escapes in non-color output mode. This enables colour in log messages without polluting log files with ANSI control sequences
- Move `RotatingFileSink`, `JsonFormatter`, and `FullFormatter` from header-only to pimpl with `.cpp` files

### CLI / Exec Refactoring

- Extract `ExecSessionRunner` class from ~920-line `ExecUsingSession` into focused methods and a `ExecSessionConfig` struct
- Replace monolithic `ExecCommand` with subcommand-based architecture (`http`, `inproc`, `beacon`, `dump`, `buildlog`)
- Allow parent options to appear after subcommand name by parsing subcommand args permissively and forwarding unmatched tokens to the parent parser

### Testing Improvements

- Fix `--test-suite` filter being ignored due to accumulation with default wildcard filter
- Add test suite banners to test listener output
- Made `function.session.abandon_pending` test more robust

### Startup / Reliability Fixes

- Fix silent exit when a second zenserver instance detects a port conflict: use `ZEN_CONSOLE_*` for log calls that precede `InitializeLogging()`
- Fix two potential SIGSEGV paths during early startup: guard `sentry_options_new()` returning nullptr, and throw on `ZenServerState::Register()` returning nullptr instead of dereferencing
- Fail on unrecognized zenserver `--mode` instead of silently defaulting to store

### Other

- Show host details (hostname, platform, CPU count, memory) when discovering new compute workers
- Move frontend `html.zip` from source tree into build directory
- Add format specifications for Compact Binary and Compressed Buffer wire formats
- Add `WriteCompactBinaryObject` to zencore
- Extended `ConsoleTui` with additional functionality
- Add `--vscode` option to `xmake sln` for clangd / `compile_commands.json` support
- Disable compute/horde/nomad in release builds (not yet production-ready)
- Disable unintended `ASIO_HAS_IO_URING` enablement
- Fix crashpad patch missing leading whitespace
- Clean up code triggering gcc false positives
Diffstat (limited to 'docs/specs/CompressedBuffer.md')
-rw-r--r-- docs/specs/CompressedBuffer.md 185
1 files changed, 185 insertions, 0 deletions
diff --git a/docs/specs/CompressedBuffer.md b/docs/specs/CompressedBuffer.md
new file mode 100644
index 000000000..11787e3e9
--- /dev/null
+++ b/docs/specs/CompressedBuffer.md
@@ -0,0 +1,185 @@
+# Compressed Buffer Format Specification
+
+**Version:** 1.0
+
+## Overview
+
+Compressed Buffer is a self-describing binary container for compressed data. It encodes the
+compression method, block layout, and integrity checksums so that a reader can decompress the
+payload without any external metadata.
+
+Key design goals:
+
+- **Self-describing** -- decompression requires no out-of-band knowledge of the compression method or original size
+- **Block-based** -- data is split into independently-decompressible blocks for random access and parallel processing
+- **Integrity-checked** -- CRC-32 on the header and BLAKE3 hash on the raw data
+- **Method-agnostic** -- supports multiple compression backends (None, Oodle, LZ4)
+
+## 1. Notation
+
+| Symbol | Meaning |
+|--------------|---------|
+| `byte` | An unsigned 8-bit integer (octet). |
+| `BE32(v)` | A 32-bit value stored in big-endian byte order. |
+| `BE64(v)` | A 64-bit value stored in big-endian byte order. |
+| `+` | Concatenation of byte sequences. |
+
+All multi-byte numeric values are stored in **big-endian** byte order.
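As a small illustration of the notation, `BE32`, `BE64`, and `+` can be modelled with Python's `struct` module. The helper names `be32` and `be64` are chosen for this sketch and do not come from the specification.

```python
import struct


def be32(value: int) -> bytes:
    """Encode an unsigned 32-bit value in big-endian order (BE32)."""
    return struct.pack(">I", value)


def be64(value: int) -> bytes:
    """Encode an unsigned 64-bit value in big-endian order (BE64)."""
    return struct.pack(">Q", value)


# "+" in the notation table is plain byte-string concatenation.
encoded = be32(1) + be64(2)
```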
+
+---
+
+## 2. Magic Number
+
+Every compressed buffer begins with the 4-byte magic value:
+
+```
+0xb7756362
+```
+
+Stored big-endian, the magic is the byte sequence `B7 75 63 62`: the non-ASCII byte `0xB7` followed by the ASCII characters `ucb`.
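A reader might check the magic as follows. This is a minimal sketch; `has_magic` is a hypothetical helper name, not part of any published API.

```python
MAGIC = 0xB7756362


def has_magic(buf: bytes) -> bool:
    """Return True if buf starts with the 4-byte big-endian magic value."""
    return len(buf) >= 4 and int.from_bytes(buf[:4], "big") == MAGIC
```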
+
+---
+
+## 3. Header Layout (64 bytes)
+
+The header is a fixed 64-byte structure at offset 0:
+
+| Offset | Field | Type | Size | Description |
+|--------|--------------------|----------|------|-------------|
+| 0 | Magic | uint32 | 4 | `0xb7756362` (big-endian) |
+| 4 | Crc32 | uint32 | 4 | CRC-32 of header bytes 8..63 (polynomial `0x04c11db7`) |
+| 8 | Method | uint8 | 1 | Compression method (see below) |
+| 9 | Compressor | uint8 | 1 | Method-specific compressor ID |
+| 10 | CompressionLevel | uint8 | 1 | Method-specific compression level |
+| 11 | BlockSizeExponent | uint8 | 1 | Block size as a power of two: `BlockSize = 1 << BlockSizeExponent` |
+| 12 | BlockCount | uint32 | 4 | Number of compressed blocks |
+| 16 | TotalRawSize | uint64 | 8 | Total uncompressed data size in bytes |
+| 24 | TotalCompressedSize| uint64 | 8 | Total buffer size including header |
+| 32 | RawHash | byte[32] | 32 | BLAKE3 hash of the uncompressed data |
+
+### Header CRC-32
+
+The `Crc32` field covers bytes 8 through 63 of the header (56 bytes). Readers should verify
+this checksum before trusting any other header field.
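The header table maps directly onto a fixed `struct` format. The sketch below parses and validates a header under one important assumption: the specification gives only the CRC polynomial, so the code assumes the common reflected CRC-32 variant implemented by `zlib.crc32`; a real implementation may use different initial or final XOR values. `Header`, `header_crc`, and `parse_header` are illustrative names.

```python
import struct
import zlib
from typing import NamedTuple

HEADER_FORMAT = ">IIBBBBIQQ32s"  # big-endian, no padding
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 64 bytes
MAGIC = 0xB7756362


class Header(NamedTuple):
    magic: int
    crc32: int
    method: int
    compressor: int
    compression_level: int
    block_size_exponent: int
    block_count: int
    total_raw_size: int
    total_compressed_size: int
    raw_hash: bytes


def header_crc(buf: bytes) -> int:
    """CRC-32 over header bytes 8..63.

    ASSUMPTION: the zlib/IEEE CRC-32 variant; the spec names only the polynomial.
    """
    return zlib.crc32(buf[8:HEADER_SIZE]) & 0xFFFFFFFF


def parse_header(buf: bytes) -> Header:
    """Unpack the 64-byte header and verify magic and CRC before returning it."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer shorter than the 64-byte header")
    header = Header(*struct.unpack_from(HEADER_FORMAT, buf, 0))
    if header.magic != MAGIC:
        raise ValueError("bad magic number")
    if header.crc32 != header_crc(buf):
        raise ValueError("header CRC-32 mismatch")
    return header
```

Verifying the CRC inside `parse_header` means callers never see field values from a corrupted header.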
+
+---
+
+## 4. Compression Methods
+
+### Method 0: None (Uncompressed)
+
+Data is stored without compression. Used as a fallback when compression would increase size.
+
+**Compressor**: Ignored (0).
+
+**Layout**:
+
+```
+[Header (64 bytes)] [Raw Data]
+```
+
+`TotalCompressedSize = 64 + TotalRawSize`. There is no block size array; the payload is a
+single uncompressed span.
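For Method 0 the payload can be sliced straight out of the buffer. The sketch below checks only the size relation stated above and omits the magic and CRC checks for brevity; `read_uncompressed_payload` is a hypothetical helper name.

```python
import struct

HEADER_SIZE = 64


def read_uncompressed_payload(buf: bytes) -> bytes:
    """Return the raw payload of a Method 0 buffer."""
    if buf[8] != 0:  # Method field at offset 8
        raise ValueError("not a Method 0 (uncompressed) buffer")
    # TotalRawSize at offset 16, TotalCompressedSize at offset 24.
    total_raw_size, total_compressed_size = struct.unpack_from(">QQ", buf, 16)
    if total_compressed_size != HEADER_SIZE + total_raw_size:
        raise ValueError("TotalCompressedSize != 64 + TotalRawSize")
    if len(buf) < total_compressed_size:
        raise ValueError("buffer is truncated")
    return buf[HEADER_SIZE:HEADER_SIZE + total_raw_size]
```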
+
+### Method 3: Oodle
+
+Block-based compression using Oodle. The `Compressor` field selects the algorithm:
+
+| Value | Compressor |
+|-------|------------|
+| 1 | Selkie |
+| 2 | Mermaid |
+| 3 | Kraken |
+| 4 | Leviathan |
+
+`CompressionLevel` maps to Oodle compression levels (typically -4 through +8, from
+HyperFast4 to Optimal4). The default compressor is Mermaid.
+
+### Method 4: LZ4
+
+Block-based compression using LZ4. `Compressor` and `CompressionLevel` are method-specific.
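The method and compressor identifiers listed in this section could be modelled as enumerations. This is a sketch only: the enum and function names are illustrative, and the LZ4 compressor values are left uninterpreted because the specification does not list them.

```python
from enum import IntEnum


class Method(IntEnum):
    NONE = 0
    OODLE = 3
    LZ4 = 4


class OodleCompressor(IntEnum):
    SELKIE = 1
    MERMAID = 2
    KRAKEN = 3
    LEVIATHAN = 4


METHOD_NAMES = {Method.NONE: "None", Method.OODLE: "Oodle", Method.LZ4: "LZ4"}


def describe(method: int, compressor: int) -> str:
    """Human-readable name for a Method/Compressor pair.

    Raises ValueError for values that are not listed in the specification.
    """
    m = Method(method)
    if m is Method.OODLE:
        return "Oodle/" + OodleCompressor(compressor).name.capitalize()
    return METHOD_NAMES[m]
```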
+
+---
+
+## 5. Block-Based Layout (Methods 3, 4)
+
+For block-based methods the data following the header is structured as:
+
+```
+[Header (64 bytes)]
+[Block Size Array: BlockCount x BE32]
+[Compressed Block 0]
+[Compressed Block 1]
+...
+[Compressed Block N-1]
+```
+
+### Block Size Array
+
+The block size array starts at offset 64, immediately after the header. Each entry is a `BE32`
+giving the **compressed size** of the corresponding block, so the array occupies `BlockCount * 4` bytes.
+
+Compressed block data begins at offset `64 + BlockCount * 4`.
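Reading the block size array and deriving the file offset of each compressed block might look like this. It is a sketch; `read_block_table` is a hypothetical name, and `block_count` is taken from an already validated header.

```python
import struct
from itertools import accumulate

HEADER_SIZE = 64


def read_block_table(buf: bytes, block_count: int):
    """Return (compressed_sizes, block_offsets) for a block-based buffer."""
    sizes = list(struct.unpack_from(f">{block_count}I", buf, HEADER_SIZE))
    data_start = HEADER_SIZE + block_count * 4
    # Running sum of the compressed sizes gives the start of each block;
    # the final running total (the end of the last block) is dropped.
    offsets = list(accumulate(sizes, initial=data_start))[:-1]
    return sizes, offsets
```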
+
+### Block Sizing
+
+- All blocks except the last decompress to `1 << BlockSizeExponent` bytes (default: 256 KiB,
+ exponent 18).
+- The last block decompresses to `TotalRawSize - (BlockCount - 1) * BlockSize` bytes.
+- If a block's compressed size equals or exceeds its raw size, the block is stored
+ **uncompressed** (the raw bytes are used directly).
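The raw size of every block follows from three header fields. A minimal sketch, with `raw_block_sizes` as a hypothetical helper; the range check on the last block is an added sanity check, not a rule stated by the specification.

```python
from typing import List


def raw_block_sizes(total_raw_size: int, block_size_exponent: int,
                    block_count: int) -> List[int]:
    """Return the decompressed size of each block, in order."""
    if block_count == 0:
        return []
    block_size = 1 << block_size_exponent
    last = total_raw_size - (block_count - 1) * block_size
    # Sanity check (an assumption of this sketch): the last block must fit
    # within one full block.
    if last < 0 or last > block_size:
        raise ValueError("header fields are inconsistent")
    return [block_size] * (block_count - 1) + [last]
```

For example, 600,000 raw bytes at the default exponent 18 yield two full blocks of 262,144 bytes and a final block of 75,712 bytes.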
+
+### Total Size Invariant
+
+```
+TotalCompressedSize = 64 + BlockCount * 4 + sum(CompressedBlockSize[i] for i in 0..BlockCount-1)
+```
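A reader could use the invariant as a cheap consistency check before touching any block data. A sketch, with a hypothetical function name:

```python
from typing import Sequence

HEADER_SIZE = 64


def expected_total_compressed_size(compressed_block_sizes: Sequence[int]) -> int:
    """Header + block size array + all compressed block payloads."""
    return (HEADER_SIZE
            + 4 * len(compressed_block_sizes)
            + sum(compressed_block_sizes))
```

Comparing this value with the header's `TotalCompressedSize` detects a truncated or inconsistent block size array.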
+
+---
+
+## 6. Decompression
+
+1. **Read header** at offset 0 and verify the magic number.
+2. **Verify CRC-32** over bytes 8..63.
+3. **Dispatch on Method**:
+ - Method 0: Copy `TotalRawSize` bytes starting at offset 64.
+ - Methods 3/4: Continue with block-based decompression.
+4. **Read block size array** (`BlockCount` x `BE32` at offset 64).
+5. **Decompress each block** sequentially:
+ - If `CompressedBlockSize[i] < RawBlockSize[i]`, decompress using the indicated method.
+ - Otherwise, copy the block data verbatim.
+6. **Optionally verify** the BLAKE3 hash of the reassembled raw data against `RawHash`.
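The steps above can be sketched as a single function. Several things here are assumptions of the sketch rather than part of the specification: the CRC uses the `zlib.crc32` variant, block decoders are passed in by the caller as a `decoders` mapping (Oodle and LZ4 bindings are not in the Python standard library), and the optional BLAKE3 check in step 6 is omitted for the same reason.

```python
import struct
import zlib
from typing import Callable, Dict

HEADER_FORMAT = ">IIBBBBIQQ32s"
HEADER_SIZE = 64
MAGIC = 0xB7756362

# Hypothetical decoder signature: (compressed bytes, expected raw size) -> raw bytes
BlockDecoder = Callable[[bytes, int], bytes]


def decompress(buf: bytes, decoders: Dict[int, BlockDecoder]) -> bytes:
    (magic, crc, method, _compressor, _level, exponent, block_count,
     raw_size, _total_size, _raw_hash) = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    # ASSUMPTION: zlib's CRC-32 variant over header bytes 8..63.
    if crc != zlib.crc32(buf[8:HEADER_SIZE]):
        raise ValueError("header CRC-32 mismatch")
    if method == 0:
        return buf[HEADER_SIZE:HEADER_SIZE + raw_size]
    decode = decoders.get(method)
    if decode is None:
        raise ValueError(f"no decoder registered for method {method}")

    sizes = struct.unpack_from(f">{block_count}I", buf, HEADER_SIZE)
    block_size = 1 << exponent
    offset = HEADER_SIZE + 4 * block_count
    out = bytearray()
    for index, compressed_size in enumerate(sizes):
        # Every block is full-sized except possibly the last one.
        raw_block_size = min(block_size, raw_size - index * block_size)
        block = buf[offset:offset + compressed_size]
        if compressed_size < raw_block_size:
            out += decode(block, raw_block_size)
        else:
            out += block  # block was stored verbatim
        offset += compressed_size
    if len(out) != raw_size:
        raise ValueError("decompressed size does not match TotalRawSize")
    return bytes(out)
```

Passing decoders in keeps the container logic independent of any particular codec binding.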
+
+### Random-Access Decompression
+
+Because blocks are independent, a reader can decompress an arbitrary byte range by:
+
+1. Computing the first and last block indices that overlap the range.
+2. Summing the compressed sizes of all preceding blocks and adding `64 + BlockCount * 4` to find the offset of the first required block.
+3. Decompressing only the required blocks.
+4. Trimming the first and last block outputs to the requested range.
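Steps 1 and 2 reduce to a few integer operations. In this sketch `locate_range` is a hypothetical helper, and the caller is assumed to have already checked that the requested range lies within `TotalRawSize`.

```python
from typing import Sequence, Tuple

HEADER_SIZE = 64


def locate_range(raw_offset: int, raw_length: int, block_size_exponent: int,
                 compressed_sizes: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (first_block, last_block, offset_of_first_block, skip_in_first_block)."""
    if raw_offset < 0 or raw_length <= 0:
        raise ValueError("range must be non-empty and non-negative")
    first = raw_offset >> block_size_exponent
    last = (raw_offset + raw_length - 1) >> block_size_exponent
    if last >= len(compressed_sizes):
        raise ValueError("range extends past the last block")
    # Header, then the block size array, then every block before `first`.
    offset = HEADER_SIZE + 4 * len(compressed_sizes) + sum(compressed_sizes[:first])
    # Bytes to discard from the start of the first decompressed block.
    skip = raw_offset - (first << block_size_exponent)
    return first, last, offset, skip
```

After decompressing blocks `first` through `last`, the requested bytes are the `raw_length` bytes starting `skip` bytes into the concatenated output.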
+
+---
+
+## 7. Range Extraction
+
+A compressed buffer can be sliced into a sub-range without full decompression. The result is
+a new compressed buffer whose blocks are a subset of the original:
+
+1. Compute the first and last block indices covering the requested raw range.
+2. Emit a new 64-byte header with updated `BlockCount`, `TotalRawSize`, and
+ `TotalCompressedSize`. The `RawHash` is zeroed (not recalculated for sub-ranges).
+3. Copy the corresponding entries from the block size array.
+4. Reference or copy the compressed block data for the selected blocks.
+
+This enables efficient sub-range serving without decompressing and recompressing.
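A block-granular extraction might be sketched as below. The specification does not spell out every detail, so the code makes several assumptions: the new `TotalRawSize` is the raw size of the selected blocks, the header `Crc32` is recomputed with the `zlib.crc32` variant, and all other header fields are copied unchanged. `extract_blocks` is a hypothetical name.

```python
import struct
import zlib

HEADER_FORMAT = ">IIBBBBIQQ32s"
HEADER_SIZE = 64


def extract_blocks(buf: bytes, first: int, last: int) -> bytes:
    """Build a new compressed buffer holding blocks first..last (inclusive)."""
    (magic, _crc, method, compressor, level, exponent, block_count,
     raw_size, _total, _raw_hash) = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if method == 0:
        raise ValueError("Method 0 buffers have no blocks")
    if not 0 <= first <= last < block_count:
        raise ValueError("block range out of bounds")

    sizes = struct.unpack_from(f">{block_count}I", buf, HEADER_SIZE)
    kept = sizes[first:last + 1]
    start = HEADER_SIZE + 4 * block_count + sum(sizes[:first])
    data = buf[start:start + sum(kept)]

    block_size = 1 << exponent
    # ASSUMPTION: TotalRawSize covers exactly the selected blocks.
    new_raw_size = min(raw_size, (last + 1) * block_size) - first * block_size
    new_total = HEADER_SIZE + 4 * len(kept) + len(data)
    # RawHash is zeroed for sub-ranges, as described above.
    body = struct.pack(">BBBBIQQ32s", method, compressor, level, exponent,
                       len(kept), new_raw_size, new_total, bytes(32))
    # ASSUMPTION: the CRC is recomputed over the new header bytes 8..63.
    crc = zlib.crc32(body)
    table = struct.pack(f">{len(kept)}I", *kept)
    return struct.pack(">II", magic, crc) + body + table + data
```

The compressed block bytes are copied untouched, which is what makes the operation cheap compared with decompressing and recompressing.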
+
+---
+
+## 8. Constants
+
+| Name | Value | Description |
+|-------------------|--------------|-------------|
+| Magic | `0xb7756362` | Header magic number |
+| HeaderSize | 64 | Fixed header size in bytes |
+| DefaultBlockSize | 262144 | Default raw block size (256 KiB, exponent 18) |