# Compressed Buffer Format Specification

**Version:** 1.0

## Overview

Compressed Buffer is a self-describing binary container for compressed data. It encodes the compression method, block layout, and integrity checksums so that a reader can decompress the payload without any external metadata.

Key design goals:

- **Self-describing** -- decompression requires no out-of-band knowledge of the compression method or original size
- **Block-based** -- data is split into independently-decompressible blocks for random access and parallel processing
- **Integrity-checked** -- CRC-32 on the header and BLAKE3 hash on the raw data
- **Method-agnostic** -- supports multiple compression backends (None, Oodle, LZ4)

## 1. Notation

| Symbol    | Meaning |
|-----------|---------|
| `byte`    | An unsigned 8-bit integer (octet). |
| `BE32(v)` | A 32-bit value stored in big-endian byte order. |
| `BE64(v)` | A 64-bit value stored in big-endian byte order. |
| `+`       | Concatenation of byte sequences. |

All multi-byte numeric values are stored in **big-endian** byte order.

---

## 2. Magic Number

Every compressed buffer begins with the 4-byte magic value:

```
0xb7756362
```

Stored big-endian, so the on-disk bytes are `B7 75 63 62`: the non-ASCII byte `0xB7` followed by the ASCII characters `ucb`.

---

## 3. Header Layout (64 bytes)

The header is a fixed 64-byte structure at offset 0:

| Offset | Field               | Type     | Size | Description |
|--------|---------------------|----------|------|-------------|
| 0      | Magic               | uint32   | 4    | `0xb7756362` (big-endian) |
| 4      | Crc32               | uint32   | 4    | CRC-32 of header bytes 8..63 (polynomial `0x04c11db7`) |
| 8      | Method              | uint8    | 1    | Compression method (see below) |
| 9      | Compressor          | uint8    | 1    | Method-specific compressor ID |
| 10     | CompressionLevel    | uint8    | 1    | Method-specific compression level |
| 11     | BlockSizeExponent   | uint8    | 1    | Block size as a power of two: `BlockSize = 1 << BlockSizeExponent` |
| 12     | BlockCount          | uint32   | 4    | Number of compressed blocks |
| 16     | TotalRawSize        | uint64   | 8    | Total uncompressed data size in bytes |
| 24     | TotalCompressedSize | uint64   | 8    | Total buffer size including header |
| 32     | RawHash             | byte[32] | 32   | BLAKE3 hash of the uncompressed data |

### Header CRC-32

The `Crc32` field covers bytes 8 through 63 of the header (56 bytes). Readers should verify this checksum before trusting any other header field.

---

## 4. Compression Methods

### Method 0: None (Uncompressed)

Data is stored without compression. Used as a fallback when compression would increase size.

**Compressor**: Ignored (0).

**Layout**:

```
[Header (64 bytes)]
[Raw Data]
```

`TotalCompressedSize = 64 + TotalRawSize`. There is no block size array; the payload is a single uncompressed span.

### Method 3: Oodle

Block-based compression using Oodle. The `Compressor` field selects the algorithm:

| Value | Compressor |
|-------|------------|
| 1     | Selkie     |
| 2     | Mermaid    |
| 3     | Kraken     |
| 4     | Leviathan  |

`CompressionLevel` maps to Oodle compression levels (typically -4 through +8, from HyperFast4 to Optimal4). The default compressor is Mermaid.

### Method 4: LZ4

Block-based compression using LZ4. `Compressor` and `CompressionLevel` are method-specific.

---

## 5. Block-Based Layout (Methods 3, 4)

For block-based methods the data following the header is structured as:

```
[Header (64 bytes)]
[Block Size Array: BlockCount x BE32]
[Compressed Block 0]
[Compressed Block 1]
...
[Compressed Block N-1]
```

### Block Size Array

The block size array sits immediately after the header, at offset 64. Each entry is a `BE32` giving the **compressed size** of the corresponding block. Total metadata size: `BlockCount * 4` bytes.

Compressed block data begins at offset `64 + BlockCount * 4`.

### Block Sizing

- All blocks except the last decompress to `1 << BlockSizeExponent` bytes (default: 256 KB, exponent 18).
- The last block decompresses to `TotalRawSize - (BlockCount - 1) * BlockSize` bytes.
- If a block's compressed size equals or exceeds its raw size, the block is stored **uncompressed** (the raw bytes are used directly).

### Total Size Invariant

```
TotalCompressedSize = 64 + BlockCount * 4 + sum(CompressedBlockSize[i] for i in 0..BlockCount-1)
```

---

## 6. Decompression

1. **Read header** at offset 0 and verify the magic number.
2. **Verify CRC-32** over bytes 8..63.
3. **Dispatch on Method**:
   - Method 0: Copy `TotalRawSize` bytes starting at offset 64.
   - Methods 3/4: Continue with block-based decompression.
4. **Read block size array** (`BlockCount` x `BE32` at offset 64).
5. **Decompress each block** sequentially:
   - If `CompressedBlockSize[i] < RawBlockSize[i]`, decompress using the indicated method.
   - Otherwise, copy the block data verbatim.
6. **Optionally verify** the BLAKE3 hash of the reassembled raw data against `RawHash`.

### Random-Access Decompression

Because blocks are independent, a reader can decompress an arbitrary byte range by:

1. Computing the first and last block indices that overlap the range.
2. Summing compressed block sizes to seek to the correct offset.
3. Decompressing only the required blocks.
4. Trimming the first and last block outputs to the requested range.

---

## 7. Range Extraction

A compressed buffer can be sliced into a sub-range without full decompression. The result is a new compressed buffer whose blocks are a subset of the original:

1. Compute the first and last block indices covering the requested raw range.
2. Emit a new 64-byte header with updated `BlockCount`, `TotalRawSize`, and `TotalCompressedSize`. The `RawHash` is zeroed (not recalculated for sub-ranges).
3. Copy the corresponding entries from the block size array.
4. Reference or copy the compressed block data for the selected blocks.

This enables efficient sub-range serving without decompressing and recompressing.

---

## 8. Constants

| Name             | Value        | Description |
|------------------|--------------|-------------|
| Magic            | `0xb7756362` | Header magic number |
| HeaderSize       | 64           | Fixed header size in bytes |
| DefaultBlockSize | 262144       | Default raw block size (256 KB, exponent 18) |
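
The header layout (Section 3), block layout (Section 5), and decompression steps (Section 6) can be sketched in Python as follows. This is a minimal, non-normative sketch under stated assumptions: the names `parse_header`, `block_table`, and `decompress` are hypothetical; `zlib.crc32` is assumed to match the header CRC-32 variant, which this specification identifies only by its polynomial; and the block codec is supplied by the caller as a `decompress_block(data, raw_size)` callback, because Oodle and LZ4 bindings are outside the standard library. BLAKE3 verification of `RawHash` is omitted for the same reason.

```python
import struct
import zlib

MAGIC = 0xB7756362
HEADER_SIZE = 64
# Big-endian fields: Magic, Crc32, Method, Compressor, CompressionLevel,
# BlockSizeExponent, BlockCount, TotalRawSize, TotalCompressedSize, RawHash.
HEADER = struct.Struct(">IIBBBBIQQ32s")


def parse_header(buf):
    """Parse and validate the 64-byte header; returns a dict of fields."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer shorter than the 64-byte header")
    (magic, crc, method, compressor, level, exponent,
     block_count, raw_size, compressed_size, raw_hash) = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    # ASSUMPTION: the header CRC is the common zlib/IEEE CRC-32 variant.
    if zlib.crc32(buf[8:HEADER_SIZE]) != crc:
        raise ValueError("header CRC-32 mismatch")
    return {
        "method": method,
        "compressor": compressor,
        "compression_level": level,
        "block_size_exponent": exponent,
        "block_count": block_count,
        "total_raw_size": raw_size,
        "total_compressed_size": compressed_size,
        "raw_hash": raw_hash,
    }


def block_table(buf, hdr):
    """Return (offset, compressed_size, raw_size) for each block."""
    count = hdr["block_count"]
    block_size = 1 << hdr["block_size_exponent"]
    sizes = struct.unpack_from(f">{count}I", buf, HEADER_SIZE)
    offset = HEADER_SIZE + count * 4  # block data starts after the size array
    table = []
    for i, csize in enumerate(sizes):
        is_last = i == count - 1
        rsize = hdr["total_raw_size"] - i * block_size if is_last else block_size
        table.append((offset, csize, rsize))
        offset += csize
    if offset != hdr["total_compressed_size"]:
        raise ValueError("total size invariant violated")
    return table


def decompress(buf, decompress_block):
    """Reassemble the raw data; decompress_block is a caller-supplied codec."""
    hdr = parse_header(buf)
    if hdr["method"] == 0:
        return bytes(buf[HEADER_SIZE:HEADER_SIZE + hdr["total_raw_size"]])
    out = bytearray()
    for offset, csize, rsize in block_table(buf, hdr):
        data = buf[offset:offset + csize]
        # Blocks that did not shrink are stored verbatim.
        out += decompress_block(data, rsize) if csize < rsize else data
    return bytes(out)
```

The `(offset, compressed_size, raw_size)` tuples from `block_table` are also what a random-access reader needs: given a raw byte range, the overlapping block indices are `start >> BlockSizeExponent` through `(end - 1) >> BlockSizeExponent`, and only those entries have to be read and decompressed.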