aboutsummaryrefslogtreecommitdiff
path: root/src/zenserver
Commit message (Collapse)AuthorAgeFilesLines
* sessions: persist to disk, prune, track client liveness, accept UE_LOGFMT ↵HEADmainStefan Boberg2026-05-0525-519/+3795
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (#1014) Branch started as a sessions-service overhaul (persistence, client liveness, UE_LOGFMT intake) and grew to pick up adjacent infrastructure work: an early-startup log backlog, a hardened `MemoryArena`, the `zen trace serve` viewer gaining a counter view + compact timeline + tabbed callsite panel, defensive fixes in the third-party `tourist` trace parser, a series of allocation reductions across the HTTP and compact-binary hot paths, and a new `zen sessions` CLI command tree. ## Sessions service **Persistence.** Each session lives on disk under `<DataRoot>/sessions/<id>/` as `info.cb` (metadata) plus `log.bin` (length-prefixed CbObject log records). On startup the service scans that directory and loads prior sessions as ended sessions, preloading the tail of each log so historical views work after a restart. `SessionLog` is noexcept-constructed and falls back to a disabled state on disk errors, so a bad disk can't take down `RegisterSession`. `GetSession` falls back to the ended-sessions list (fixes historical log fetches over HTTP). `LoadTail` counts only successfully-parsed records. **Pruning.** Periodic cleanup task drops ended sessions once any of three caps is exceeded: age (default 1 year), count (default 1000), or total on-disk footprint (default 50 MiB). Runs 30 s after startup, hourly thereafter. Active sessions never pruned; disk removal and directory stat happen outside the exclusive lock so a slow filesystem can't stall lookups. **Client liveness.** Sessions carry a `ProcessHandle` for the client-reported pid, captured at registration time so Windows pid recycling can't produce false positives. A 30 s asio timer probes liveness and ends dead sessions through the normal remove path, producing a synthetic `Session ended: process exited (...)` line persisted to `log.bin`. Windows decodes common NTSTATUS exit codes to human names (Ctrl-C, access violation, stack overflow, ...); POSIX stays at plain `process exited`. Clients auto-fill `ClientPid` only for local targets (unix socket / loopback); the server defensively accepts pids only from `IsLocalMachineRequest()` peers. zenserver also reports its own pid when registering its self-session, so it shows up with a real pid in the dashboard and `zen sessions ls`. **Synthetic end-of-session line.** `RemoveSession` takes an optional reason; before the session moves to the ended list it appends an Info-level `Session ended[: reason]` entry through the normal log path (released outside `m_Lock`). Current reasons: `client request` (HTTP DELETE), `server shutdown` (self-session), `process exited (...)` (liveness). **UE_LOGFMT structured entries.** `POST /sessions/{id}/log` now accepts `{level, logger, format, fields}` alongside the existing `{level, logger, message}` shape. New `logtemplate.{h,cpp}` implements UE's `StructuredLog.cpp` template grammar (field paths with `.name` / `[N]`, `{{`/`}}` escapes, `$text` / `$format` / `$locformat` object conventions, bounded recursion). Renders to a displayable message at intake while persisting raw format + fields so a future UI can drill into fields without another schema bump. Hot path is zero-alloc — renders into `ExtendableStringBuilder<256>` using stack-buffered `Oid::ToString` / `IoHash::ToHexString` overloads. UI shows a `{…}` marker with the raw template + JSON-pretty fields on hover. **Parent sessions.** `SessionInfo` gains `parent_session_id`; hub-managed storage server child processes inherit the hub's session id via `--parent-session=<id>`. `ZEN_SESSIONS_URL` env var becomes a fallback for `--sessions-url` / config when neither is provided. The in-process session log sink is disabled when a remote sessions target is configured (logs flow through `SessionsServiceClient` instead). The sessions UI groups child sessions under their parent (collapsible/expandable, sorts as a unit, supports nesting). **Platform reporting.** `SessionInfo` gains `Platform`, flowed end-to-end: client auto-fills via `GetRuntimePlatformName()`, server persists in `info.cb` (`plat`) and emits on GET. UI renders as a SimpleIcons-style inline SVG (windows / macOS / iOS / linux / wine / android / playstation / xbox / nintendo) with case-insensitive alias resolution (Win32/Win64, PS4/PS5, XSX/XSS, NintendoSwitch, iPhone/iPad, Darwin/OSX). Unknown values fall back to text; sorting runs on the underlying string. **WebSocket log streaming.** Sessions UI moves from 2 s polling to a WebSocket push model. New `WsSubscriber` has a stable id + helper methods. UI caps the log-line DOM at 5 000 entries with a shared cursor-regression helper, factored out of two call sites. Per-broadcast allocations trimmed on the push path; fixed a stack overrun in the WS log broadcast hex-id buffer. **Log memory.** `LogEntry::Level` is now `logging::LogLevel` (1 byte) instead of `std::string` (~32 B) — saves ~310 KB per full 10 k-entry deque and eliminates a per-message allocation in the in-proc sink. On-disk format writes an int32 and accepts either int or legacy string on read. `LogEntry` strings now live in a `MemoryArena`; logger names are interned across the deque. `SessionLog::Append` and `WriteSessionInfoFile` drop their `UniqueBuffer` round-trip and write `CbObject::GetView()` straight through `BasicFile` / `SafeWriteFile`. Multi-entry `POST /log` batched under one lock + one push. **In-proc log timestamps.** `InProcSessionLogSink::TimePointToDateTime` previously preserved only whole seconds, so every in-proc entry rendered at `.000` ms in the dashboard and `zen sessions tail`. It now adds the sub-second part (nanoseconds → 100 ns ticks) to keep ms precision end-to-end. **UI.** Side "Session Details" panel is gone — its info is inline in the table (appname, mode, platform, id, timestamps, this/log pills, active dot). Bottom panel is a tabbed `Log | Metadata` view with a right-side "Session Information" panel beside metadata; log-only controls (filter, newest-first, follow, log-level filter, expand/collapse) hide when Metadata is active, polling keeps running across tab switches. Wide-mode toggle fills the viewport edge-to-edge. Log lines show the logger category; timestamps render in 24 h with zero-padded fields regardless of locale. Sessions list defaults to All / 10 per page / created-desc, gains click-to-sort headers on the full dataset, a header filter box, and a pager aligned to the table's right edge. Duplicate auto-injected `<h1>Sessions</h1>` removed. ## `zen sessions` CLI New command tree on the `zen` client for inspecting the sessions service from the terminal: - **`zen sessions ls`** — lists sessions (active first, ended next; newest-first within each group) with id, status, app/mode, pid, created, duration, and log count. Supports `--status active|ended|all` (default `all`). - **`zen sessions status`** — prints the sessions service summary: self id, active / ended counts, and the read/write/delete/list/request/bad-request counters from `/stats/sessions`. - **`zen sessions tail [session]`** — tails a session's log. With no argument it tails zenserver's own session (resolved via `/sessions/list`'s `self_id`); an explicit 24-hex id targets any session, including ended ones (historical replay). `--lines N` (default 50, 0 = all buffered) trims the initial dump client-side. `--follow` prefers a WebSocket push subscription on `/sessions/ws` for sub-second latency; on upgrade failure (older server, blocked port, unix-socket transport) it falls back to HTTP cursor polling at `--interval-ms` (default 500), with sleeps chunked to 50 ms so Ctrl-C reacts quickly. Output matches `zen::logging::FullFormatter` (`[YY-MM-DD HH:MM:SS.mmm] [lvl] [logger] message`); on a TTY the level is colored and the logger is bold, with continuation lines indented under the message column using the *visible* prefix width. 404 surfaces as `(session ended)` and connection errors as `(server gone)` — both clean exits, so stopping the server mid-tail no longer prints a stack trace. - **`zen sessions ui`** — opens `<host>/dashboard/?page=sessions` in the user's default browser. Rejects unix-socket hosts. A small `ZenServiceClient::IsUnixSocket()` helper now wraps the unix-socket check used by `ui`, `sessions tail` (WS path), and `sessions ui`. ## Logging `BacklogSink` captures early-startup log entries in a fixed-capacity ring so late-attached sinks (session sink, file sink) can replay them. Detaches from the broadcast list when disabled; backed by destructor-only cleanup (no `unique_ptr` indirection per entry). Tuned defaults so the backlog covers typical bring-up without unbounded growth. ## `zen trace serve` viewer - Compact timeline mode for high-density views. - New `TRACE_INT_VALUE` / `TRACE_FLOAT_VALUE` counter trace points + a counters page in the viewer. - Callsite tables collapsed into a single tabbed panel. - Lossless `Oid <-> Guid` bridge for trace session ids; trace `SessionId` plumbed through. - `tourist` parser hardening: bounds-check `BufferStream::read`, validate `Type::info_size` before `patch()`, convert `parse_important_aux` to a loop (avoids deep recursion), widen `ParserPool` index to `uint32`, bounds-check field offsets in the dispatcher, pin `Types::parse` buffer up-front. ## `MemoryArena` Configurable chunk size, inline chunk list, oversize requests routed to truly-dedicated chunks (no slack waste, no fragmentation when one allocation is much larger than the chunk). ## Allocation cleanups across hot paths - `zenhttp::HttpRequestRouter::HandleRequest` and `FormatPackageMessageInternal`: drop heap allocations. - Compact-binary validation: `eastl::fixed_vector` + `eastl::sort`; eliminate `std::vector` churn. - `zenserverprocess`: trim transient allocations in spawn paths. - Sessions HTTP intake / broadcast: drop transient `std::string` allocs.
* hub async s3 client (#1024)Dan Engelbrecht2026-05-0510-332/+4212
| | | | | | | | | | | | | | - Feature: `AsyncHttpClient` adds cancellable request tokens, streaming GET to a file (`AsyncDownload`), zero-copy chunk-callback GET (`AsyncStream`), pull-mode body source for streaming `AsyncPut`, retry layer mirroring the synchronous client, and a submit-side in-flight cap (`HttpClientSettings::MaxConcurrentRequests`) so hub-scale fanout against a single host cannot stall queued handles into curl's connect-timeout window - Feature: Hub hydration can route S3 transfers through a non-blocking `AsyncHttpClient` (curl_multi + asio) backed by a single io thread; hydrate and dehydrate now pipeline requests instead of blocking worker threads - `--hub-hydration-async-enabled` (Lua: `hub.hydration.async.enabled`, default true) - `--hub-hydration-async-max-concurrent-requests` (Lua: `hub.hydration.async.maxconcurrentrequests`, default `clamp(cpu*4, 128, 512)`) - Feature: Hub provision/deprovision/obliterate now run as two phases on separate worker pools so per-module hydration cannot starve child-process spawn/despawn (and vice versa) - New `--hub-instance-spawn-threads` (Lua: `hub.instance.spawnthreads`, default `clamp(cpu/8, 4, 16)`) drives child-process spawn/despawn - `--hub-instance-provision-threads` (Lua: `hub.instance.provisionthreads`) now drives per-module hydrate/dehydrate scheduling only; default changed from `max(cpu/4, 2)` to `clamp(cpu/8, 4, 12)` - `--hub-hydration-threads` (Lua: `hub.hydration.threads`) now controls per-file workers inside a single hydrate/dehydrate; default changed from `max(cpu/4, 2)` to `clamp(cpu/8, 4, 12)` - Feature: `AsyncHttpClient` owns its `asio::io_context` and one io thread by default; the `(BaseUri, io_context&)` constructor is preserved for callers that want to share an externally-driven `io_context` across clients (caller MUST keep the loop running until the client destructs) - Feature: `Hub::Configuration` C++ struct fields renamed (`OptionalProvisionWorkerPool`/`OptionalHydrationWorkerPool` -> `OptionalProvisionPool`/`OptionalSpawnPool`/`OptionalHydrationPool`). Embedders constructing `Hub` directly must update field names; provision and spawn pools must both be set or both null (asserted at construction). - Bugfix: `S3Client` signing-key cache no longer returns stale signatures after IMDS-rotated credentials change `AccessKeyId`; cache is now keyed on `(DateStamp, AccessKeyId)`
* watchdog ephemeral port exhaust (#1022)Dan Engelbrecht2026-05-042-23/+55
| | | - Improvement: Hub pools HTTP connections to managed instances so provision/deprovision churn no longer exhausts Windows ephemeral ports
* zenhttp improvements (robustness / correctness) (#968)Stefan Boberg2026-05-048-80/+276
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A collection of security, correctness, and robustness fixes in `zenhttp` and `zencore` surfaced by security review. Most items are small, independent commits grouped here because they all tighten trust boundaries or fix UB along the same code paths. ## WebSocket protocol hardening (RFC 6455) - **Enforce the client-side mask bit**. Server-side frame loops now reject unmasked frames with close code 1002 per §5.1. Prevents HTTP intermediary smuggling. - **Validate control frames and RSV bits**. Fragmented control frames, oversized (>125 B) control payloads, and any non-zero RSV bit now fail the connection before allocation. - **Lower per-frame payload cap** from 256 MB → 4 MB. Bounds per-connection accumulator memory. - **Implement message fragmentation**. Continuation frames are coalesced and delivered as a single message; interleaved non-control frames close with 1002; assembled messages are capped at 4 MB (1009 on overflow). Previously partial fragments were delivered to handlers, bypassing payload validation. - **Parse the 101 handshake response properly** in `HttpWsClient`. Status-line, `Upgrade`, `Connection`, and `Sec-WebSocket-Accept` are now matched exactly rather than via substring searches against the full body. ## Auth / OIDC hardening - **Constant-time password compare** in `PasswordSecurity::IsAllowed` (closes a remote length/content timing oracle). Adds a shared `ConstantTimeEquals` helper. - **Harden Basic-auth header parsing**: trim trailing LWS, reject control bytes and DEL in the credential. - **OIDC discovery pinning**: require HTTPS (loopback exempt), verify `issuer` matches `BaseUrl`, require `token_endpoint` / `userinfo_endpoint` / `jwks_uri` to share origin with `BaseUrl`, reject empty `token_endpoint`. - **Restrict `POST /auth/oidc/refreshtoken`** to local-machine requests. Previously unauthenticated in default deployments — remote callers could evict or replace cached tokens. - **Stop logging OIDC provider response bodies** on refresh failure (IdPs echo `refresh_token` back in error bodies). - **Drop the unused `IdentityToken` field** from `OidcClient` / `OpenIdToken` so nothing in the tree accidentally trusts an unverified JWT. ## Auth state encryption migration - Add `AesGcm` AEAD primitive (BCrypt / OpenSSL backends, mbedTLS stubbed) and `CryptoRandom::Fill` CSPRNG helper in `zencore/crypto.h`. - Migrate authstate file from AES-256-CBC with a fixed IV to AES-GCM with a fresh 12-byte random nonce per write and the 4-byte `ZEN1` magic bound as AAD. Legacy-CBC files are transparently read once and rewritten in the new format. ## Filesystem / IO robustness - `IoBufferExtendedCore::Materialize` now checks `MAP_FAILED` on POSIX (was comparing to `nullptr`, which let the failure sentinel propagate into later reads and `munmap(MAP_FAILED, ...)`). - `IoBufferBuilder::MakeFromFile / MakeFromTemporaryFile`: close the FD/HANDLE on exception via a dismissable `ScopeGuard`; actually check the `fstat()` return value (previously used an uninitialized `FileSize`). - `ReadFromFileMaybe`: loop short reads, retry `EINTR`, chunk Windows `ReadFile` at `0xFFFFFFFF` bytes (fixes silent truncation of multi-GiB reads). - `WipeDirectory`: compare `FindFirstFileW` handle against `INVALID_HANDLE_VALUE` rather than `nullptr`. - `RemoveFileNative` (Linux/macOS): report non-`ENOENT` stat failures via the `std::error_code` out-param and stop reading `st_mode` after a failed stat. ## Buffer / compression correctness - Avoid per-copy `IoBufferCore` heap allocations in `CompositeBuffer::CopyTo / ViewOrCopyRange` iterators; add fast path for `BufferHeader::Read` when the 64-byte header fits in the first plain-memory segment. - `BufferHeader`: add `IsHeaderValid()` gate covering `BlockSizeExponent` range, `BlockCount * BlockSize` overflow, and `TotalRawSize` bounds before any arithmetic uses them. Defends against attacker-controlled headers that can pass the CRC and trigger OOB writes in `DecompressBlock`.
* GetEnvVariable: return std::optional<std::string> (#1017)Stefan Boberg2026-04-273-8/+8
| | | | | | | - `GetEnvVariable` now returns `std::optional<std::string>` so callers can distinguish an unset variable from one set to an empty value. - Windows path uses `SetLastError(ERROR_SUCCESS)` + `ERROR_ENVVAR_NOT_FOUND` to detect "not found"; POSIX path returns `nullopt` when `getenv` returns `nullptr`. - All call sites migrated. Most use `.value_or("")` to preserve current empty-or-unset behavior. The diagnostic helpers in `zen-test/artifactprovider-tests.cpp` now report `<unset>` vs `<empty>` distinctly. - Added a check in the `ExpandEnvironmentVariables` test confirming `nullopt` for an unset variable; PATH/HOME lookups in that test use `REQUIRE(has_value())` so a missing var fails cleanly instead of throwing `bad_optional_access`.
* hydration with pack (#1016)Dan Engelbrecht2026-04-279-516/+1963
| | | | | | | | | | | | | | | - Feature: Hub hydration packs small files into raw CAS pack blobs to reduce request count for modules dominated by tiny metadata files - `--hub-hydration-enable-pack` (Lua: `hub.hydration.enablepack`, default true) - `--hub-hydration-pack-threshold-bytes` (Lua: `hub.hydration.packthresholdbytes`, default 256 KiB) - `--hub-hydration-max-pack-bytes` (Lua: `hub.hydration.maxpackbytes`, default 4 MiB) - Feature: Hub hydration and dehydration can be disabled per direction - `--hub-enable-hydration` (Lua: `hub.enablehydration`, default true) - `--hub-enable-dehydration` (Lua: `hub.enabledehydration`, default true) - Feature: Hub hydration accepts a configurable file exclude list via `HydrationOptions` `excludes` (array of wildcards). Built-in defaults skip transient runtime files (`.lock`, `.sentry-native/*`, `state_marker`, `*.bak`, `gc/reserve.gc`, `auth/*`) so they no longer participate in dehydrate scans. Override semantics: a present field replaces the default outright; explicit `[]` opts out of all defaults. - Improvement: Hub hydration completion logs now report per-request average and max latency, peak in-flight workers, queue wait, and hash-cache hit percentage; loose and pack-blob transfers are reported separately - Improvement: Hub hydration pre-creates unique parent directories before scheduling parallel writes - Improvement: S3 hydration retries transient HTTP failures (timeouts, 429 throttling, 5xx server errors, connection errors) up to 3 times via the HTTP client retry layer - Improvement: S3 hydration multipart chunk size is persisted in `state.cbo` per module so hydrate replays the partitioning used at dehydrate; default raised to 64 MiB (was 32 MiB) - Improvement: Hub hydration `Obliterate` retries backend delete once before falling back to local cleanup
* hub execution stats (#1011)Dan Engelbrecht2026-04-224-264/+488
| | | | | - Improvement: Hub hydration and dehydration completion logs now include per-phase wall time, bytes transferred, bits/s throughput, number of unique worker threads used, and the storage source/target URI - Improvement: Hub storage server instance lifecycle logs now report elapsed time for spawn and shutdown - Improvement: Hub deprovisioning now logs GC completion status and elapsed time; a GC that does not complete within the 5s deadline is logged as a warning and shutdown proceeds anyway
* dashboard prominent version (#1009)Dan Engelbrecht2026-04-222-2/+31
| | | - Improvement: Dashboard banner displays the zenserver version next to the wordmark
* hub provision acceptable states (#1008)Dan Engelbrecht2026-04-221-0/+2
| | | - Bugfix: Hub provision requests now return 202 Accepted when the module is `Recovering` or `Waking` instead of rejecting
* chunk-size -> chunksize for lua config (#1004)Dan Engelbrecht2026-04-211-1/+1
|
* filesystem.h surface error codes (#998)Dan Engelbrecht2026-04-211-2/+10
| | | - Improvement: File copy, scan, clone, and move operations now report the underlying OS error in failure messages
* improved s3 hydration (#997)Dan Engelbrecht2026-04-216-1149/+1111
| | | | | | | | | - Improvement: Hub shares a single S3 client and IMDS credential provider across all modules, reducing IMDS load and surviving transient IMDS blips during bulk provisioning - Improvement: Hub validates hydration config at startup; bad `--hub-hydration-target-spec` or `--hub-hydration-target-config` now fails `zen hub` at boot instead of per-module at first hydrate - Improvement: S3 hydration multipart chunk size configurable via `settings.chunk-size` (default 32 MiB) - Improvement: S3 client extracts `<Error><Code>` and `<Message>` from XML error bodies (previously logged as `<unhandled content format>`) - Improvement: S3 client fails fast with a "no credentials available" error when AWS credentials are missing, instead of sending an unsigned request that S3 rejects with a generic 400 - Improvement: IMDS credential provider retries transient connection failures (up to 3 attempts with backoff) - Improvement: HTTP clients with `RetryCount > 0` also retry on `CURLE_COULDNT_CONNECT`
* async consul register/deregister (#992)Dan Engelbrecht2026-04-211-39/+49
| | | - Improvement: Hub Consul service registration and deregistration are now dispatched on a dedicated background thread so instance state transitions no longer stall when the Consul agent is slow or unreachable
* structured cache: fix some minor issues (#995)Stefan Boberg2026-04-211-11/+16
| | | | | | | | | | | | | | - Fix double WriteResponse in PUT record failure path; the detail-body branch now short-circuits instead of falling through to a second WriteResponse call - Return 405 Method Not Allowed for unsupported verbs in the root, namespace, bucket, record, and chunk handlers (previously fell through to no response) - Clamp exec$/replay-recording thread_count so a bogus query value cannot spawn an unbounded worker pool ## Performance / cleanup - NamespaceMap now uses TransparentStringHash + std::equal_to<>, so Get/Put/Find/Drop can probe the map with a std::string_view directly instead of constructing a temporary std::string on every request - Replace insert_or_assign with try_emplace under the exclusive lock in GetNamespace; the find() re-check already guarantees the key is absent, so try_emplace matches intent better ## Reverted - The earlier change to erase the pinned entry from m_DroppedNamespaces after DropNamespace's post-drop work was reverted: other threads may still hold pointers into a dropped namespace, so tearing it down eagerly is unsafe. Dropped namespaces remain pinned for the lifetime of the process as before.
* Zen CLI common server interface (#920)Stefan Boberg2026-04-202-5/+17
| | | | | | | | | | | | | | | | | | Introduces a common `ZenServiceClient` RAII wrapper for zen CLI commands that interact with a zenserver instance. CLI operations (admin, builds, cache, exec, hub, info, projectstore, trace, ui, version, vfs, workspaces) automatically register sessions so they become visible in the server's session list, and forward log output to the server's session log endpoint. All session HTTP I/O (announce, remove, log batches) runs on a single background worker thread, so CLI startup and shutdown never block on server availability. ### Key changes - **`ZenServiceClient`** — new RAII class that wraps host resolution, HTTP client creation, and session lifecycle (register on connect, remove on exit). Replaces ad-hoc boilerplate across all command files that talk to a server, including the new `trace` subcommands (`start`, `stop`, `status`). - **Async session I/O** — `SessionsServiceClient` now owns a single worker thread and command queue. `Announce()`, `Remove()`, and `UpdateMetadata()` enqueue commands and return immediately. The worker creates one `HttpClient` with a 5-second total timeout, bounding any individual request. Eliminates main-thread stalls when the server is unreachable. - **Session log forwarding** — `SessionLogSink` is a thin enqueuer that posts log messages to the same worker queue (no separate thread or HTTP client). Log levels are serialized as integers; the server-side ingest handles both string and integer formats for backwards compatibility, with bounds checking on integer values. - **Build & projectstore session registration** — Long-running `builds` and projectstore cache (oplog-download) connections register sessions too, making them visible alongside regular CLI command sessions. ### Cleanup - Extract `SetupCacheSession` helper on `StorageInstance` to reduce duplication. - Remove unused `HttpClient` reference in ui command.
* Rename logging::ToStringView to ToString for consistency (#993)Stefan Boberg2026-04-202-3/+3
| | | | | | | - Renames `logging::ToStringView` → `ToString` and `ShortToStringView` → `ShortToString` for consistency with the rest of the codebase, where `ToString` is the convention for enum-to-string conversions (return type already communicates it's a view). - Updates all call sites in logbase, logging helpers, session log sink, admin service, and tcplogstreamsink. Split off from the `sb/zen-monitor` branch so the ZenServiceClient refactor PR stays focused.
* hide secrets from log and sentry (#989)Dan Engelbrecht2026-04-201-2/+6
| | | * scrub sensitive command line options from log and sentry
* zen trace analysis support (#945)Stefan Boberg2026-04-201-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Integrates the **tourist** trace analysis library and builds a full `zen trace` command suite for working with Unreal Engine `.utrace` files. ### Trace analysis library (`thirdparty/tourist/`) - Adds the tourist library as a third-party dependency with three modules: **foundation** (platform primitives, memory, scheduling), **trace** (UE Trace protocol decoding), and **analysis** (event dispatching and analyzer framework). - Cross-platform support for Windows, Linux, and macOS. ### `zen trace` CLI commands (`src/zen/cmds/`, `src/zen/trace/`) - **`zen trace analyze`** — Summarize a `.utrace` file: session metadata, thread inventory, command line + build configuration, CPU profiling scopes, timing, event rates, log messages, and (with symbols) memory allocation metrics including live-allocs dumps, callstack-keyed aggregation, and allocation churn. Optional HTML output for memory reports. - **`zen trace inspect`** — Dump the event schema (declared types, fields, sizes) from a trace file. - **`zen trace trim`** — Extract a time-window from a trace into a new `.utrace` file. - **`zen trace serve`** — Launch a local HTTP server hosting an interactive trace viewer; opens in the default browser. ### Symbolication (`src/zen/trace/symbol_resolver.*`, `thirdparty/raw_pdb/`) - Pluggable resolver with multiple backends: `pdb` (in-tree raw_pdb), `dbghelp` (Windows), `llvm-symbolizer` (all platforms), `atos` (macOS). An `auto` backend picks the best available tool per platform. - Microsoft Symbol Server support: downloads PDBs on demand using a redirect-aware HTTP client. - Local PDB cache keyed by image GUID preserves symbols across binary recompilation. - Callstack trimming heuristic strips UE internal noise from reports. - Binary analysis cache (`.ucache_z`) avoids re-resolving the same trace. ### Interactive trace viewer (`src/zen/frontend/html/`, `src/zen/trace/trace_viewer_service.*`) - Timeline: scope-level detail, horizontal zoom/pan, vertical scrolling, viewport-driven loading with pre-computed LOD for responsive navigation of large traces. - Thread grouping (collapsible sidebar sections) synthesized from name suffixes, natural sort order, visual distinction between lane threads and OS threads. - Bookmark and region annotations; region categories with per-category toggles; bookmark marker toggle in the toolbar. - Filterable Logs tab showing captured `UE_LOG` output. - Stats tab with per-scope aggregate statistics. - Memory tab with interactive allocation analysis and an allocation size histogram. - CsvProfiler event parsing and chart UI. ### Other in-branch supporting changes - **Cross-platform browser launcher** (`browser_launcher.{h,cpp}`) used by `trace serve`. - **`ReciprocalU64`** fast 64-bit integer division (zencore/intmath) for trace analyzers. - **`parallelsort`** cross-platform parallel sort helper (zenutil). - Frontend zip build rule so the viewer's HTML assets are bundled into `zen.exe`. - `/Zo` flag for better optimized debug info on Windows release builds. - `trace-tests.cpp` in the `zen-test` harness (harness itself landed on main via #985).
* Use eastl::deque for queues with many small elements (#991)Stefan Boberg2026-04-201-7/+6
| | | | | | | | | | | Switch several deque-based queues from `std::deque` to `eastl::deque` to reduce per-element heap allocation overhead. MSVC's `std::deque` allocates one node per element for anything larger than ~16 bytes; `eastl::deque` groups 4, 8, or 32 elements per block depending on element size. Converted call sites: - `BlockingQueue` and `WorkerThreadPool` (generic — downstream callers benefit automatically) - Session log entry buffer (~10k-entry ring of large log records — 4 per block vs 1) - Job queue (`Ref<Job>` — 32 per block vs 2) - RPC recording request queue (large `QueuedRequest` struct — 4 per block vs 1) - StatsD client message queues (~32-byte buffers — 8 per block vs 1)
* s3 dehydration touch cas (#977)Dan Engelbrecht2026-04-201-19/+57
| | | | * add Touch() function to s3 client * touch all used cas files in s3 dehydration path
* zen history command (#987)Dan Engelbrecht2026-04-203-3/+53
| | | | | | | | | - Feature: Per-user invocation history for `zen` and `zenserver`; each startup appends a record to a JSONL file capped at the most recent 100 entries. Location: `%LOCALAPPDATA%\Epic\Zen\History\invocations.jsonl` on Windows, `~/.zen/History/invocations.jsonl` on POSIX - `zen history` opens an interactive picker; selecting a zen row re-runs it inline and forwards the exit code, selecting a zenserver row spawns it detached - `zen history --list` (`-l`) prints the table to stdout instead of showing the picker - `zen history --filter zen|zenserver` restricts the listing to one executable - `zen history --print` prints the reconstructed command line of the selected row instead of launching it - `--enable-execution-history` global option on both binaries (default `true`) to opt out per invocation - The history file is attached to Sentry crash reports (alongside the existing zenserver log)
* Move ZipFs from zenserver frontend into zenhttp (#980)Stefan Boberg2026-04-205-479/+1
| | | Moves `ZipFs` from `src/zenserver/frontend/` to `src/zenhttp/` so any binary linking `zenhttp` can serve a bundled web UI from a zip archive (motivator: the upcoming `zen trace serve` subcommand).
* zencore: promote ScopedEnvVar to a shared filesystem helper (#979)Stefan Boberg2026-04-201-56/+0
| | | | | - Moves the RAII `ScopedEnvVar` helper out of `hydration.cpp`'s anonymous test namespace and into `zencore/filesystem.{h,cpp}` next to `GetEnvVariable` so it can be reused by other subsystems. - Makes the class non-copyable/non-movable and moves its members to `private`.
* builds cmd refactor (#975)Dan Engelbrecht2026-04-201-5/+1
| | | | | | | | | - Bugfix: `builds download` partial-block fetch decisions now account for build storage host latency - Bugfix: Transfer rate displays in `builds` commands now smooth correctly - Split `buildstorageoperations.cpp` (8.5k lines) into per-operation TUs: buildinspect, buildprimecache, buildstorageresolve, buildupdatefolder, builduploadfolder, buildvalidatebuildpart; stats moved to buildstoragestats.h. - FilteredRate extracted to zenutil. - BuildsCommand shared state consolidated into a BuildsConfiguration struct; subcommands inherit from BuildsSubCmdBase holding a `const BuildsConfiguration&` instead of a `BuildsCommand&`. - `ProgressBar` renamed to `ConsoleProgressBar`; mode enum (`ConsoleProgressMode`) lifted to namespace scope; `PushLogOperation`/`PopLogOperation`/`ForceLinebreak` promoted to virtuals on `ProgressBase`. - Free-function wrappers (`UploadFolder`, `DownloadFolder`, `ValidateBuildPart`) added around the existing operation classes so callers stop reimplementing setup + stats logging.
* zenbase hardening (#971)Stefan Boberg2026-04-173-14/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A series of correctness and API hygiene fixes to the intrusive refcount primitives in `zenbase`, culminating in the removal of `RefPtr<T>` in favour of a single unified `Ref<T>` smart pointer. The changes are motivated by two pieces of latent UB sitting under every `Ref<T>` / `TRefCounted<T>` in the codebase, plus a handful of API footguns on the smart-pointer side (silent raw-pointer decay, missing converting moves, unconstrained conversions from unrelated types). ## Correctness fixes - **Strict-aliasing UB in atomic helpers** — `AtomicIncrement`/`Decrement`/`Add` took a `volatile uint32_t&` and reinterpret-cast it to `std::atomic<T>*`. The object was never constructed as a `std::atomic`, so the access was type-punning UB. Fixed by changing `m_RefCount` to `std::atomic<uint32_t>` directly in `RefCounted`, `TRefCounted<T>` and `IoBufferCore`. The helpers (and `zenbase/atomic.h`) are later removed entirely — the three callers now invoke `fetch_add`/`fetch_sub` directly. - **const_cast of non-mutable member** — `AddRef()` / `Release()` are `const` but mutated `m_RefCount` via `const_cast`. Since `m_RefCount` wasn't `mutable`, writing through the cast was UB for any `const`-qualified holder (e.g. a `static const` refcounted singleton). Fixed by marking `m_RefCount` `mutable` and dropping the `const_cast` in `AddRef`/`Release`. - **Public non-virtual `TRefCounted` destructor** — allowed `delete basePtr;` to slice past the CRTP `DeleteThis()` contract. Moved to `protected`. ## Memory-ordering cleanup - `AddRef` weakened from seq_cst to **relaxed** (a thread can only take a new reference via one it already holds; nothing needs to synchronize). - `Release` weakened from seq_cst to **acq_rel** (sufficient to order prior writes before the destructor, and make the decrement visible to observers). - Diagnostic `RefCount()` / `GetRefCount()` reads made **relaxed** and spelled out as explicit `.load()` — the returned value is stale the moment it's observed, so stronger ordering gives no guarantee. - No-op on x86 (`lock xadd` either way), but removes a full barrier on every `Ref<T>` copy on ARM64 (Apple silicon / Windows-on-ARM). ## `RefPtr` / `Ref` unification Before this branch, `RefPtr<T>` and `Ref<T>` were subtly different in ways that made the safer of the two (`Ref`) harder to use and the looser one (`RefPtr`) dangerous: - `RefPtr::operator T*()` was implicit — `delete refPtr;` compiled silently (double-delete), and the raw pointer could outlive the temporary `RefPtr` it was extracted from. Made `explicit`, then removed entirely once call sites were migrated to `.Get()`. - `RefPtr(T*)` was implicit while `RefPtr(RefPtr<Derived>&&)` was `explicit` — exactly the opposite of the safety intent. Reversed. - `RefPtr`'s converting move was unconstrained (any `RefPtr<U>` with an implicitly-convertible `U*` satisfied it, including `void*` and multiple-inheritance base offsets). Added a `DerivedFrom<U, T>` constraint matching `Ref<T>`. - `Ref<T>` was missing a converting move ctor / move-assignment from `Ref<Derived>` — upcasts of rvalues were going through `AddRef`+`Release` instead of a pointer steal. Added. - `Release()` and the non-move smart-pointer ops were not `noexcept`, despite being so in practice. Marked `noexcept` throughout. After all of the above, the two types were functionally identical. The final commit deletes `RefPtr` and rewrites the ~10 consumer files to use `Ref`.
* log cleanup (#969)Dan Engelbrecht2026-04-171-4/+7
| | | | - Improvement: New `ZEN_SCOPED_LOG(Expr)` macro routes `ZEN_INFO`/`ZEN_WARN`/`ZEN_DEBUG` in the enclosing block through the given logger expression instead of the default - Improvement: `BuildContainer`, `SaveOplog`, and `LoadOplogContext` now take a caller-provided `LoggerRef` so diagnostic messages route through the caller's logger
* operationlogoutput refactor (#967)Dan Engelbrecht2026-04-171-4/+1
| | | - Improvement: Replaced `OperationLogOutput` with `ProgressBase` in `zenutil`; logging and progress reporting are now separate concerns. Operation classes receive a `LoggerRef` for logging and a `ProgressBase&` for progress bars
* add --buildstore-disksizelimit-percent to zenstorageserver (#966)Dan Engelbrecht2026-04-163-3/+44
| | | - Improvement: Add option `--buildstore-disksizelimit-percent` - Max percentage of total drive capacity (of --data-dir drive) for build storage. When combined with `--buildstore-disksizelimit`, the lower value wins.
* add dashboard copy button on select information lines (#962)Dan Engelbrecht2026-04-168-9/+103
|
* Fix strings passed to RegisterRoute in HttpProjectStore: strings are … (#963)Matt Peters2026-04-151-4/+4
| | | | Fix strings passed to RegisterRoute in HttpProjectStore: strings are literals now instead of regexes, so '\$' needs to be changed to just '$'.
* hub reset inactivity timer (#961)Dan Engelbrecht2026-04-151-0/+2
| | | * update last activity time for hub instances that are asked to be provisioned but are already provisioned
* add sessions to hub and proxy (#960)Dan Engelbrecht2026-04-157-43/+78
| | | | * move session service to zenserver base class and make it available in all zenserver modes * fix deadlock in sessionsclient shutdown
* Hub proxy returns graceful responses when an instance is unavailable instead ↵Dan Engelbrecht2026-04-141-3/+27
| | | | of a generic bad gateway error (#956)
* fix utf characters in source code (#953)Dan Engelbrecht2026-04-138-15/+15
|
* Compute OIDC auth, async Horde agents, and orchestrator improvements (#913)Stefan Boberg2026-04-138-1610/+342
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rework of the Horde agent subsystem from synchronous per-thread I/O to an async ASIO-driven architecture, plus provisioner scale-down with graceful draining, OIDC authentication, scheduler improvements, and dashboard UI for provisioner control. ### Async Horde Agent Rewrite - Replace synchronous `HordeAgent` (one thread per agent, blocking I/O) with `AsyncHordeAgent` — an ASIO state machine running on a shared `io_context` thread pool - Replace `TcpComputeTransport`/`AesComputeTransport` with `AsyncTcpComputeTransport`/`AsyncAesComputeTransport` - Replace `AgentMessageChannel` with `AsyncAgentMessageChannel` using frame queuing and ASIO timers - Delete `ComputeBuffer` and `ComputeChannel` ring-buffer classes (no longer needed) ### Provisioner Drain / Scale-Down - `HordeProvisioner` can now drain agents when target core count is lowered: queries each agent's `/compute/session/status` for workload, selects candidates by largest-fit/lowest-workload, and sends `/compute/session/drain` - Configurable `--horde-drain-grace-period` (default 300s) before force-kill - Implement `IProvisionerStateProvider` interface to expose provisioner state to the orchestrator HTTP layer - Forward `--coordinator-session`, `--provision-clean`, and `--provision-tracehost` through both Horde and Nomad provisioners to spawned workers ### OIDC Authentication - `HordeClient` accepts an `AccessTokenProvider` (refreshable token function) as alternative to static `--horde-token` - Wire up `OidcToken.exe` auto-discovery via `httpclientauth::CreateFromOidcTokenExecutable` with `--HordeUrl` mode - New `--horde-oidctoken-exe-path` CLI option for explicit path override ### Orchestrator & Scheduler - Orchestrator generates a session ID at startup; workers include `coordinator_session` in announcements so the orchestrator can reject stale-session workers - New `Rejected` action state — when a remote runner declines at capacity, the action is rescheduled without retry count increment - Reduce scheduler lock contention: snapshot pending actions under shared lock, sort/trim outside the lock - Parallelize remote action submission across runners via `WorkerThreadPool` with slow-submit warnings - New action field `FailureReason` populated by all runner types (exit codes, sandbox failures, exceptions) - New endpoints: `session/drain`, `session/status`, `session/sunset`, `provisioner/status`, `provisioner/target` ### Remote Execution - Eager-attach mode for `RemoteHttpRunner` — bundles all attachments upfront in a `CbPackage` for single-roundtrip submits - Track in-flight submissions to prevent over-queuing - Show remote runner hostname in `GetDisplayName()` - `--announce-url` to override the endpoint announced to the coordinator (e.g. relay-visible address) ### Frontend Dashboard - Delete standalone `compute.html` (925 lines) and `orchestrator.html` (669 lines), consolidated into JS page modules - Add provisioner panel to orchestrator dashboard: target/active/estimated core counts, draining agent count - Editable target-cores input with debounced POST to `/orch/provisioner/target` - Per-agent provisioning status badges (active / draining / deallocated) in the agents table - Active vs total CPU counts in agents summary row ### CLI - New `zen compute record-start` / `record-stop` subcommands - `zen exec` progress bar with submit and completion phases, atomic work counters, `--progress` mode (Pretty/Plain/Quiet) ### Other - `DataDir` supports environment variable expansion - Worker manifest validation checks for `worker.zcb` marker to detect incomplete cached directories - Linux/Mac runners `nice(5)` child processes to avoid starving the main server - `ComputeService::SetShutdownCallback` wired to `RequestExit` via `session/sunset` - Curl HTTP client logs effective URL on failure - `MachineInfo` carries `Pool` and `Mode` from Horde response - Horde bundle creation includes `.pdb` on Windows
* Add MemoryCidStore and ChunkStore interface (#940)Stefan Boberg2026-04-133-8/+11
| | | | | | | | | | This PR introduces an in-memory `CidStore` option primarily for use with compute, to avoid hitting disk for ephemeral data which is not really worth persisting. And in particular not worth paying the critical path cost of persistence. - **MemoryCidStore**: In-memory CidStore implementation backed by a hash map, optionally layered over a standard CidStore. Writes to the backing store are dispatched asynchronously via a dedicated flush thread to avoid blocking callers on disk I/O. Reads check memory first, then fall back to the backing store without caching the result. - **ChunkStore interface**: Extract `ChunkStore` abstract class (`AddChunk`, `ContainsChunk`, `FilterChunks`) and `FallbackChunkResolver` into `zenstore.h` so `HttpComputeService` can accept different storage backends for action inputs vs worker binaries. `CidStore` and `MemoryCidStore` both implement `ChunkStore`. - **Compute service wiring**: `HttpComputeService` takes two `ChunkStore&` params (action + worker). The compute server uses `MemoryCidStore` for actions (no disk persistence needed) and disk-backed `CidStore` for workers (cross-action reuse). The storage server passes its `CidStore` for both (unchanged behavior).
* minor fixups (#948)Dan Engelbrecht2026-04-132-11/+26
| | | | | | | * objectstore.cpp - m_TotalBytesServed now tracks all range cases (single, multi, 416) * async http: docstring corrected: curl_multi_socket_action() / ASIO socket async_wait remove non-ascii characters * fix singlethreaded gc option in lua to not use dash * fix changelog order
* Logging and diagnostics improvements (#941)Stefan Boberg2026-04-131-1/+1
| | | | | | | | | | | | | | | | Core logging and system diagnostics improvements, extracted from the compute branch. ### Logging - **Elapsed timestamps**: Console log now shows elapsed time since launch `[HH:MM:SS.mmm]` instead of full date/time; file logging is unchanged - **Short level names**: 3-letter short level names (`trc`/`dbg`/`inf`/`wrn`/`err`/`crt`) used by both console and file formatters via `ShortToStringView()` - **Consistent field order**: Standardized to `[timestamp] [level] [logger]` across both console and file formatters - **Slim LogMessage/LogPoint**: Remove redundant fields from `LogMessage` (derive level/source from `LogPoint`), flatten `LogPoint` to inline filename/line fields, shrink `LogLevel` to `int8_t` with `static_assert(sizeof(LogPoint) <= 32)` - **Remove default member initializers** and static default `LogPoint` from `LogMessage` — all fields initialized by constructor - **LoggerRef string constructor**: Convenience constructor accepting a string directly - **Fix SendMessage macro collision**: Replace `thread.h` include in `logmsg.h` with a forward declaration of `GetCurrentThreadId()` to avoid pulling in `windows.h` transitively ### System Diagnostics - **Cache static system metrics**: Add `RefreshDynamicSystemMetrics()` that only queries values that change at runtime (available memory, uptime, swap). `SystemMetricsTracker` snapshots full `GetSystemMetrics()` once at construction and reuses cached topology/total memory on each `Query()`, avoiding repeated `GetLogicalProcessorInformationEx` traversal on Windows, `/proc/cpuinfo` parsing on Linux, and `sysctl` topology calls on macOS
* hub instance malloc trace (#946)Dan Engelbrecht2026-04-136-0/+82
| | | | | | | `--hub-instance-malloc` selects the memory allocator for child instances `--hub-instance-trace` sets trace channels for child instances `--hub-instance-tracehost` sets the trace streaming host for child instances `--hub-instance-tracefile` sets the trace output file for child instances add {moduleid} and {port} placeholder support for tracefile
* Dashboard stats tiles no longer flicker (#943)Dan Engelbrecht2026-04-116-190/+197
|
* `--consul-register-hub` option to disable hub parent service Consul ↵Dan Engelbrecht2026-04-112-17/+34
| | | | registration (#939)
* hub deprovision all (#938)Dan Engelbrecht2026-04-112-0/+81
| | | * implement "deprovision all" for hub
* dashboard search (#936)Dan Engelbrecht2026-04-116-13/+184
| | | | | - Improvement: Dashboard paginated lists now include a search input that jumps to the page containing the first match and highlights the row - Improvement: Dashboard paginated lists show a loading indicator while fetching data - Improvement: Hub dashboard navigates to and highlights newly provisioned instances
* HTTP range responses (RFC 7233) - httpobjectstore (#928)Dan Engelbrecht2026-04-102-116/+78
| | | | | | | | | - Improvement: HTTP range responses (RFC 7233) are now fully compliant across the object store and build store - 206 Partial Content responses now include a `Content-Range` header; previously absent for single-range requests, which broke `HttpClient::GetRanges()` - 416 Range Not Satisfiable responses now include `Content-Range: bytes */N` as required by RFC 7233 - Out-of-bounds range requests return 416 Range Not Satisfiable (was 400 Bad Request) - Single-byte ranges (`bytes=N-N`) are now correctly accepted (were previously rejected) - Range byte positions widened from 32-bit to 64-bit; RFC 7233 imposes no size limit on byte range values - Build store binary GET requests with a Range header now return 206 Partial Content with `Content-Range` (previously returned 200 OK without it)
* reduce test runtime (#933)Dan Engelbrecht2026-04-102-2/+2
| | | | | | | | * reduce zenserver spawns in tests * fix filesystemutils wrong test suite name * tweak tests for faster runtime * reduce more test runtime * more wall time improvements * fast http and processmanager tests
* migrate from http_parser to llhttp (#929)Dan Engelbrecht2026-04-093-44/+41
|
* fully provisioned hub instances now sets initial check status to "passing" ↵Dan Engelbrecht2026-04-081-6/+3
| | | | in consul (#930)
* use correct return code for unsupported multirange requests in objectstore ↵Dan Engelbrecht2026-04-081-2/+2
| | | | (#927)
* hydration data obliteration (#923)Dan Engelbrecht2026-04-0812-145/+714
| | | | - Feature: Hub obliterate operation deletes all local and backend hydration data for a module - Improvement: Hub dashboard adds obliterate button for individual, bulk, and by-name module deletion
* sort items on dashboard (#924)Dan Engelbrecht2026-04-072-79/+113
| | | * add pagination and consistent sorting on cache and projects ui pages