aboutsummaryrefslogtreecommitdiff
path: root/src/zenserver
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'main' into sb/reduce-allocsStefan Boberg15 hours18-227/+351
|\
| * Add MemoryCidStore and ChunkStore interface (#940)Stefan Boberg15 hours3-8/+11
| | | | | | | | | | | | | | | | | | | | This PR introduces an in-memory `CidStore` option primarily for use with compute, to avoid hitting disk for ephemeral data which is not really worth persisting. And in particular not worth paying the critical path cost of persistence. - **MemoryCidStore**: In-memory CidStore implementation backed by a hash map, optionally layered over a standard CidStore. Writes to the backing store are dispatched asynchronously via a dedicated flush thread to avoid blocking callers on disk I/O. Reads check memory first, then fall back to the backing store without caching the result. - **ChunkStore interface**: Extract `ChunkStore` abstract class (`AddChunk`, `ContainsChunk`, `FilterChunks`) and `FallbackChunkResolver` into `zenstore.h` so `HttpComputeService` can accept different storage backends for action inputs vs worker binaries. `CidStore` and `MemoryCidStore` both implement `ChunkStore`. - **Compute service wiring**: `HttpComputeService` takes two `ChunkStore&` params (action + worker). The compute server uses `MemoryCidStore` for actions (no disk persistence needed) and disk-backed `CidStore` for workers (cross-action reuse). The storage server passes its `CidStore` for both (unchanged behavior).
| * minor fixups (#948)Dan Engelbrecht16 hours2-11/+26
| | | | | | | | | | | | | | * objectstore.cpp - m_TotalBytesServed now tracks all range cases (single, multi, 416) * async http: docstring corrected: curl_multi_socket_action() / ASIO socket async_wait remove non-ascii characters * fix singlethreaded gc option in lua to not use dash * fix changelog order
| * Logging and diagnostics improvements (#941)Stefan Boberg16 hours1-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Core logging and system diagnostics improvements, extracted from the compute branch. ### Logging - **Elapsed timestamps**: Console log now shows elapsed time since launch `[HH:MM:SS.mmm]` instead of full date/time; file logging is unchanged - **Short level names**: 3-letter short level names (`trc`/`dbg`/`inf`/`wrn`/`err`/`crt`) used by both console and file formatters via `ShortToStringView()` - **Consistent field order**: Standardized to `[timestamp] [level] [logger]` across both console and file formatters - **Slim LogMessage/LogPoint**: Remove redundant fields from `LogMessage` (derive level/source from `LogPoint`), flatten `LogPoint` to inline filename/line fields, shrink `LogLevel` to `int8_t` with `static_assert(sizeof(LogPoint) <= 32)` - **Remove default member initializers** and static default `LogPoint` from `LogMessage` — all fields initialized by constructor - **LoggerRef string constructor**: Convenience constructor accepting a string directly - **Fix SendMessage macro collision**: Replace `thread.h` include in `logmsg.h` with a forward declaration of `GetCurrentThreadId()` to avoid pulling in `windows.h` transitively ### System Diagnostics - **Cache static system metrics**: Add `RefreshDynamicSystemMetrics()` that only queries values that change at runtime (available memory, uptime, swap). `SystemMetricsTracker` snapshots full `GetSystemMetrics()` once at construction and reuses cached topology/total memory on each `Query()`, avoiding repeated `GetLogicalProcessorInformationEx` traversal on Windows, `/proc/cpuinfo` parsing on Linux, and `sysctl` topology calls on macOS
| * hub instance malloc trace (#946)Dan Engelbrecht17 hours6-0/+82
| | | | | | | | | | | | | | `--hub-instance-malloc` selects the memory allocator for child instances `--hub-instance-trace` sets trace channels for child instances `--hub-instance-tracehost` sets the trace streaming host for child instances `--hub-instance-tracefile` sets the trace output file for child instances add {moduleid} and {port} placeholder support for tracefile
| * Dashboard stats tiles no longer flicker (#943)Dan Engelbrecht3 days6-190/+197
| |
| * `--consul-register-hub` option to disable hub parent service Consul ↵Dan Engelbrecht3 days2-17/+34
| | | | | | | | registration (#939)
* | Merge branch 'main' into sb/reduce-allocsStefan Boberg3 days8-13/+265
|\|
| * hub deprovision all (#938)Dan Engelbrecht3 days2-0/+81
| | | | | | * implement "deprovision all" for hub
| * dashboard search (#936)Dan Engelbrecht3 days6-13/+184
| | | | | | | | | | - Improvement: Dashboard paginated lists now include a search input that jumps to the page containing the first match and highlights the row - Improvement: Dashboard paginated lists show a loading indicator while fetching data - Improvement: Hub dashboard navigates to and highlights newly provisioned instances
* | Reduce short-lived heap allocations in zenserverStefan Boberg3 days4-14/+19
|/ | | | | | | - Replace fmt::format float temporaries in CbXmlWriter with fmt::format_to via StringBuilderAppender - Replace std::vector<std::string> URL tokenizer with eastl::fixed_vector<std::string_view, 5> in cache details handler - Replace std::strtoull/strtoul(std::string(...).c_str()) with zen::ParseInt in session log handler - Replace StrCaseCompare(std::string(sv).c_str(), ...) with string_view overload in workspace handlers
* HTTP range responses (RFC 7233) - httpobjectstore (#928)Dan Engelbrecht3 days2-116/+78
| | | | | | | | | - Improvement: HTTP range responses (RFC 7233) are now fully compliant across the object store and build store - 206 Partial Content responses now include a `Content-Range` header; previously absent for single-range requests, which broke `HttpClient::GetRanges()` - 416 Range Not Satisfiable responses now include `Content-Range: bytes */N` as required by RFC 7233 - Out-of-bounds range requests return 416 Range Not Satisfiable (was 400 Bad Request) - Single-byte ranges (`bytes=N-N`) are now correctly accepted (were previously rejected) - Range byte positions widened from 32-bit to 64-bit; RFC 7233 imposes no size limit on byte range values - Build store binary GET requests with a Range header now return 206 Partial Content with `Content-Range` (previously returned 200 OK without it)
* reduce test runtime (#933)Dan Engelbrecht3 days2-2/+2
| | | | | | | | * reduce zenserver spawns in tests * fix filesystemutils wrong test suite name * tweak tests for faster runtime * reduce more test runtime * more wall time improvements * fast http and processmanager tests
* migrate from http_parser to llhttp (#929)Dan Engelbrecht5 days3-44/+41
|
* fully provisioned hub instances now sets initial check status to "passing" ↵Dan Engelbrecht5 days1-6/+3
| | | | in consul (#930)
* use correct return code for unsupported multirange requests in objectstore ↵Dan Engelbrecht6 days1-2/+2
| | | | (#927)
* hydration data obliteration (#923)Dan Engelbrecht6 days12-145/+714
| | | | - Feature: Hub obliterate operation deletes all local and backend hydration data for a module - Improvement: Hub dashboard adds obliterate button for individual, bulk, and by-name module deletion
* sort items on dashboard (#924)Dan Engelbrecht6 days2-79/+113
| | | * add pagination and consistent sorting on cache and projects ui pages
* add pagination of cooked projects and caches on dashboard front page (#922)Dan Engelbrecht6 days2-65/+171
|
* incremental dehydrate (#921)Dan Engelbrecht6 days9-964/+1509
| | | | | | | | | | | | | | | - Feature: Incremental CAS-based hydration/dehydration replacing the previous full-copy approach - Feature: S3 hydration backend with multipart upload/download support - Feature: Configurable thread pools for hub instance provisioning and hydration `--hub-instance-provision-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation. `--hub-hydration-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation. - Improvement: Hub triggers GC on instance before deprovisioning to compact storage before dehydration - Improvement: GC status now reports pending triggers as running - Improvement: S3 client debug logging gated behind verbose mode to reduce log noise at default verbosity - Improvement: Hub dashboard Resources tile now shows total memory - Improvement: `filesystemutils` moved from `zenremotestore` to `zenutil` for broader reuse - Improvement: Hub uses separate provision and hydration worker pools to avoid deadlocks - Improvement: Hibernate/wake/deprovision on non-existent or already-in-target-state modules are idempotent - Improvement: `ScopedTemporaryDirectory` with empty path now creates a temporary directory instead of asserting
* fix hub consule health endpoint registration (#917)Dan Engelbrecht11 days2-1/+2
| | | | * use correct health endpoint for zenhubserver consul registration * add total disk space on hub resource pane
* s3 and consul fixes (#916)Dan Engelbrecht12 days1-1/+1
| | | | | | | | | | | * fix endpoint for stats/hub in compute/hub.html page * fix api token call failure for imds (using wrong overload for Put) * add "localhost" to healt check url in consul when no address is given * add consul fallback deregister if normal deregister fails * add consul registration unit test
* add provision button to hub ui (#915)Dan Engelbrecht12 days1-0/+133
|
* hub instance dashboard proxy (#914)Dan Engelbrecht12 days12-19/+682
| | | - Feature: Hub dashboard proxy - instance dashboards are accessible through the hub server at `/hub/proxy/{port}/` without requiring direct port access
* consul env token refresh (#912)Dan Engelbrecht13 days1-2/+5
| | | - Improvement: Consul token is now re-read from the environment variable on every request, allowing token rotation without restarting the service
* add lua config options for all zenhubserver command line options (#904)Dan Engelbrecht2026-03-301-3/+81
| | | | | | | | | | | - Improvement: Hub server now supports Lua config file for all hub-specific options - `hub.upstreamnotification.*` - upstream notification endpoint and instance ID - `hub.consul.*` - service registration endpoint, token, health interval, deregister timeout - `hub.instance.*` - base port, HTTP class, thread count, core limit, config path - `hub.instance.limits.*` - instance count cap, disk and memory usage limits - `hub.hydration.*` - hydration target spec and config path - `hub.watchdog.*` - cycle timing, inactivity timeouts, and activity check timeouts - Improvement: Added `--hub-instance-base-port-number` as an alias for `--hub-base-port-number`, and `--upstream-notification-instance-id` as an alias for `--instance-id` - Improvement: Added hub mode documentation at docs/hub.md
* Request validation and resilience improvements (#864)Stefan Boberg2026-03-3010-29/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### Security: Input validation & path safety - **Reject local file references by default** in package parsing — only allow when explicitly opted in by the service (`ParseFlags::kAllowLocalReferences`) and validated by an `ILocalRefPolicy` (fail-closed: no policy = rejected) - **`DataRootLocalRefPolicy`** restricts local ref paths to the server's data root via canonical path prefix matching - **Validate attachment hashes** in compute HTTP handlers — decompresses and re-hashes each attachment at ingestion time to reject tampered payloads - **Path traversal validation** for worker descriptions (`pathvalidation.h`) — rejects absolute paths, `..` components, Windows reserved device names, and invalid filename characters - **Harden CbPackage parsing** against corrupt inputs — overflow-safe attachment count, bounds checks on local ref offset/size, graceful failure instead of `ZEN_ASSERT` for untrusted data - **Harden legacy package parser** — reject zero-size binary fields, missing mappers, and optionally validate resolved attachment hashes - **Bounds check in `CbPackageReader::MarshalLocalChunkReference`** — detect when `MakeFromFile` silently clamps offset+size to file size ### Reliability: Lock consolidation & bug fixes - **Consolidate three action map locks into one** (`m_ActionMapLock`) — eliminates deadlock risk from multi-lock ordering, simplifies state transitions, and fixes a race where newly enqueued actions were briefly invisible to `GetActionResult`/`FindActionResult` - **Fix infinite loop in `BaseRunnerGroup::SubmitActions`** when actions exceed total runner capacity — cap round-robin at `TotalCapacity` and default unassigned results to "No capacity" - **Fix `MakeSafeAbsolutePathInPlace` for UNC paths** — `\server\share` now correctly becomes `\?\UNC\server\share` instead of `\?\server\share` - **Fix `max_retries=0`** — previously fell through to the default of 3; now correctly means "no retries" ### New: ManagedProcessRunner - Cross-platform process runner backed by `SubprocessManager` — uses async exit callbacks instead of polling, delegates CPU/memory metrics to the manager's built-in sampler - `ProcessGroup` (JobObject on Windows, process group on POSIX) for bulk cancellation on shutdown - `--managed` flag on `zen exec inproc` to select this runner - Refactored monitor thread lifecycle — `StartMonitorThread()` now called from derived constructors to avoid calling virtual functions from base constructor ### Process management - **Suppress crash dialogs** via `JOB_OBJECT_UILIMIT_ERRORMODE` + `SEM_NOGPFAULTERRORBOX` in both `WindowsProcessRunner` and `JobObject::Initialize` — prevents WER/Dr. Watson modal dialogs from blocking the monitor thread - **CREATE_SUSPENDED → AssignProcessToJobObject → ResumeThread** pattern in `WindowsProcessRunner` — ensures job object assignment before process execution - **Move stdout/stderr callbacks to `Spawn()` parameters** in `SubprocessManager` — prevents race where early output could be missed before callback installation - Consistent PID logging across all runner types ### Test infrastructure - **`zentest-appstub`**: Added `Fail` (configurable exit code) and `Crash` (abort / nullptr deref) test functions - **Compute integration tests**: exit code handling, auto-retry exhaustion, manual reschedule after failure, mixed success/failure queues, crash handling (abort + nullptr), crash auto-retry, immediate query visibility after enqueue - **Package format tests**: truncated header, bad magic, attachment count overflow, truncated data, local ref rejection/acceptance, policy enforcement (inside/outside root, traversal, no-policy fail-closed) - **Legacy package parser tests**: empty input, zero-size binary, hash resolution with/without mapper, hash mismatch detection - **UNC path tests** for `MakeSafeAbsolutePath` ### Misc - ANSI color helper macros (`ZEN_RED`, `ZEN_BRIGHT_WHITE`, etc.) and `ZEN_BOLD`/`ZEN_DIM`/etc. - Generic `fmt::formatter` for types with free `ToString` functions - Compute dashboard: truncated hash display with monospace font and hover for full value - Renamed `usonpackage_forcelink` → `cbpackage_forcelink` - Compute enabled by default in xmake config (releases still explicitly disable)
* hub s3 hydrate improvements (#902)Dan Engelbrecht2026-03-308-99/+374
| | | | | | | | | | | | | | | | | | | | | | | | - Feature: Added `--hub-hydration-target-config` option to specify the hydration target via a JSON config file (mutually exclusive with `--hub-hydration-target-spec`); supports `file` and `s3` types with structured settings ```json { "type": "file", "settings": { "path": "/path/to/hydration/storage" } } ``` ```json { "type": "s3", "settings": { "uri": "s3://bucket[/prefix]", "region": "us-east-1", "endpoint": "http://localhost:9000", "path-style": true } } ``` - Improvement: Hub hydration dehydration skips the `.sentry-native` directory - Bugfix: Fixed `MakeSafeAbsolutePathInPlace` when a UNC prefix is present but path uses mixed delimiters
* hub resource limits (#900)Dan Engelbrecht2026-03-3016-306/+499
| | | | | | | | | | | | - Feature: Hub dashboard now shows a Resources tile with disk and memory usage against configured limits - Feature: Hub module listing now shows state-change timestamps and duration for each instance - Improvement: Hub provisioning rejects new instances when disk or memory usage exceeds configurable thresholds; limits are disabled by default (0 = no limit) - `--hub-provision-disk-limit-bytes` - Reject provisioning when used disk exceeds this many bytes - `--hub-provision-disk-limit-percent` - Reject provisioning when used disk exceeds this percentage of total disk - `--hub-provision-memory-limit-bytes` - Reject provisioning when used memory exceeds this many bytes - `--hub-provision-memory-limit-percent` - Reject provisioning when used memory exceeds this percentage of total RAM - Improvement: Hub process metrics are now tracked atomically per active instance slot, eliminating per-query process handle lookups - Improvement: Hub, Build Store, and Workspaces service stats sections in the dashboard are now collapsible - Bugfix: Hub watchdog loop did not check `m_ShutdownFlag`, causing it to spin indefinitely on shutdown
* Misc small fixes (#897)Stefan Boberg2026-03-271-1/+1
| | | | | | | | | | - **Eliminate `<regex>` usage** — Replaced `std::regex`-based URL parsing in `jupiterbuildstorage.cpp` with manual `string_view` parsing. Added `CXXOPTS_NO_REGEX` to disable regex in cxxopts. Includes comprehensive tests for the new URL parser. - **Add missing HTTP response codes** — Added `102`, `103`, `203`, `207`, `208`, `226`, `306`, `421`, `425`, `451` to the enum and reason string lookup. - **Add `ForceColor` support to zen CLI** — Plumbed the `ForceColor` logging option through to the zen client. - **Add `.clangd` config** — Strips MSVC-specific flags clangd can't handle and suppresses noisy clang-tidy checks. - **Generic `fmt::formatter` for `ToString`** — Concept-based formatter that auto-formats any type with a free `ToString()` function, removing the need for per-type specializations. - **Fix OpenSSL dependency** — Changed `zenhorde` to use `openssl3` package on Linux/macOS. - **Add `<cmath>` include** — Missing include in `hyperloglog.h`. - **GCC compile fix** — Moved `static constinit` variable inside lambda in `logging.cpp`.
* dashboard improvements (#896)Dan Engelbrecht2026-03-2717-250/+633
| | | | | | - Feature: Added Workspaces dashboard page with HTTP request stats and per-workspace metrics - Feature: Added Build Storage dashboard page with service-specific HTTP request stats - Improvement: Front page now shows Hub and Object Store activity tiles; HTTP panel is fixed above the tiles grid - Improvement: HTTP stats tiles now include 5m/15m rates and p999/max latency across all service pages
* remove CPR HTTP client backend (#894)Stefan Boberg2026-03-272-2/+2
| | | CPR is no longer needed now that HttpClient has fully transitioned to raw libcurl. This removes the CPR library, its build integration, implementation files, and all conditional compilation guards, leaving curl as the sole HTTP client backend.
* idle deprovision in hub (#895)Dan Engelbrecht2026-03-2726-295/+813
| | | | | | | | | | | | | - Feature: Hub watchdog automatically deprovisions inactive provisioned and hibernated instances - Feature: Added `stats/activity_counters` endpoint to measure server activity - Feature: Added configuration options for hub watchdog - `--hub-watchdog-provisioned-inactivity-timeout-seconds` Inactivity timeout before a provisioned instance is deprovisioned - `--hub-watchdog-hibernated-inactivity-timeout-seconds` Inactivity timeout before a hibernated instance is deprovisioned - `--hub-watchdog-inactivity-check-margin-seconds` Margin before timeout at which an activity check is issued - `--hub-watchdog-cycle-interval-ms` Watchdog poll interval in milliseconds - `--hub-watchdog-cycle-processing-budget-ms` Maximum time budget per watchdog cycle in milliseconds - `--hub-watchdog-instance-check-throttle-ms` Minimum delay between checks on a single instance - `--hub-watchdog-activity-check-connect-timeout-ms` Connect timeout for activity check requests - `--hub-watchdog-activity-check-request-timeout-ms` Request timeout for activity check requests
* hub instance state refactor (#892)Dan Engelbrecht2026-03-277-682/+721
| | | | | | - Improvement: Provisioning a hibernated instance now automatically wakes it instead of requiring an explicit wake call first - Improvement: Deprovisioning now accepts instances in Crashed or Hibernated states, not just Provisioned - Improvement: Added `--consul-health-interval-seconds` and `--consul-deregister-after-seconds` options to control Consul health check behavior (defaults: 10s and 30s) - Improvement: Consul registration now occurs when provisioning starts; health check intervals are applied once provisioning completes
* hub async provision/deprovision/hibernate/wake (#891)Dan Engelbrecht2026-03-244-364/+947
| | | | | - Improvement: Hub provision, deprovision, hibernate, and wake operations are now async. HTTP requests returns 202 Accepted while the operation completes in the background - Improvement: Hub returns 202 Accepted (instead of 409 Conflict) when the same async operation is already in progress for a module - Improvement: Hub returns 200 OK when a requested state transition is already satisfied
* refactor hub notifications (#888)Dan Engelbrecht2026-03-237-223/+371
| | | | * refactor hub callbacks * improve http responses
* Cross-platform process metrics support (#887)Stefan Boberg2026-03-233-24/+8
| | | | | | | - **Cross-platform `GetProcessMetrics`**: Implement Linux (`/proc/{pid}/stat`, `/proc/{pid}/statm`, `/proc/{pid}/status`) and macOS (`proc_pidinfo(PROC_PIDTASKINFO)`) support for CPU times and memory metrics. Fix Windows to populate the `MemoryBytes` field (was always 0). All platforms now set `MemoryBytes = WorkingSetSize`. - **`ProcessMetricsTracker`**: Experimental utility class (`zenutil`) that periodically samples resource usage for a set of tracked child processes. Supports both a dedicated background thread and an ASIO steady_timer mode. Computes delta-based CPU usage percentage across samples, with batched sampling (8 processes per tick) to limit per-cycle overhead. - **`ProcessHandle` documentation**: Add Doxygen comments to all public methods describing platform-specific behavior. - **Cleanup**: Remove unused `ZEN_RUN_TESTS` macro (inlined at its single call site in `zenserver/main.cpp`), remove dead `#if 0` thread-shutdown workaround block. - **Minor fixes**: Use `HttpClientAccessToken` constructor in hordeclient instead of setting private members directly. Log ASIO version at startup and include it in the server settings list.
* add tests for s3 and file hydrators (#886)Dan Engelbrecht2026-03-232-4/+699
|
* Dashboard refresh (logs, storage, network, object store, docs) (#835)Stefan Boberg2026-03-2343-331/+9229
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## Summary This PR adds a session management service, several new dashboard pages, and a number of infrastructure improvements. ### Sessions Service - `SessionsServiceClient` in `zenutil` announces sessions to a remote zenserver with a 15s heartbeat (POST/PUT/DELETE lifecycle) - Storage server registers itself with its own local sessions service on startup - Session mode attribute coupled to server mode (Compute, Proxy, Hub, etc.) - Ended sessions tracked with `ended_at` timestamp; status filtering (Active/Ended/All) - `--sessions-url` config option for remote session announcement - In-process log sink (`InProcSessionLogSink`) forwards server log output to the server's own session, visible in the dashboard ### Session Log Viewer - POST/GET endpoints for session logs (`/sessions/{id}/log`) supporting raw text and structured JSON/CbObject with batch `entries` array - In-memory log storage per session (capped at 10k entries) with cursor-based pagination for efficient incremental fetching - Log panel in the sessions dashboard with incremental DOM updates, auto-scroll (Follow toggle), newest-first toggle, text filter, and log-level coloring - Auto-selects the server's own session on page load ### TCP Log Streaming - `LogStreamListener` and `TcpLogStreamSink` for log delivery over TCP - Sequence numbers on each message with drop detection and synthetic "dropped" notice on gaps - Gathered buffer writes to reduce syscall overhead when flushing batches - Tests covering basic delivery, multi-line splitting, drop detection, and sequencing ### New Dashboard Pages - **Sessions**: master-detail layout with selectable rows, metadata panel, live WebSocket updates, paging, abbreviated date formatting, and "this" pill for the local session - **Object Store**: summary stats tiles and bucket table with click-to-expand inline object listing (`GET /obj/`) - **Storage**: per-volume disk usage breakdown (`GET /admin/storage`), Garbage Collection status section (next-run countdown, last-run stats), and GC History table with paginated rows and expandable detail panels - **Network**: overview tiles, per-service request table, proxy connections, and live WebSocket updates; distinct client IPs and session counts via HyperLogLog ### Documentation Page - In-dashboard Docs page with sidebar navigation, markdown rendering (via `marked`), Mermaid diagram support (theme-aware), collapsible sections, text filtering with highlighting, and cross-document linking - New user-facing docs: `overview.md` (with architecture and per-mode diagrams), `sessions.md`, `cache.md`, `projects.md`; updated `compute.md` - Dev docs moved to `docs/dev/` ### Infrastructure & Bug Fixes - **Deflate compression** for the embedded frontend zip (~3.4MB → ~950KB); zlib inflate support added to `ZipFs` with cached decompressed buffers - **Local IP addresses**: `GetLocalIpAddresses()` (Windows via `GetAdaptersAddresses`, Linux/Mac via `getifaddrs`); surfaced in `/status/status`, `/health/info`, and the dashboard banner - **Dashboard nav**: unified into `zen-nav` web component with `MutationObserver` for dynamically added links, CSS `::part()` to merge banner/nav border radii, and prefix-based active link detection - Stats broadcast refactored from manual JSON string concatenation to `CbObjectWriter`; `CbObject`-to-JS conversion improved for `TimeSpan`, `DateTime`, and large integers - Stats WebSocket boilerplate consolidated into `ZenPage.connect_stats_ws()`
* add hub instance crash recovery (#885)Dan Engelbrecht2026-03-238-35/+317
| | | * add hub instance crash recovery
* Logger simplification (#883)Stefan Boberg2026-03-232-8/+14
| | | | | | | | | | | - **`Logger` now holds a single `SinkPtr`** instead of a `std::vector<SinkPtr>`. The `SetSinks`/`AddSink` API is replaced with a single `SetSink`. This removes complexity from `Logger` itself and makes `Clone()` cheaper (no vector copy). - **New `BroadcastSink`** (`zencore/logging/broadcastsink.h`) acts as a thread-safe, shared indirection point that fans out to a dynamic list of child sinks. Adding or removing a child sink via `AddSink`/`RemoveSink` is immediately visible to every `Logger` that holds a reference to it — including cloned loggers — without requiring each logger to be updated individually. - **`GetDefaultBroadcastSink()`** (exposed from `zenutil/logging.h`) gives server-layer code access to the shared broadcast sink so it can register optional sinks (OTel, TCP log stream) after logging is initialized, without going through `Default()->AddSink()`. ### Motivation Previously, dynamically adding sinks post-initialization mutated the default logger's internal sink vector directly. This was fragile: cloned loggers (created before `AddSink` was called) would not pick up the new sinks. `BroadcastSink` fixes this by making the sink list a shared, mutable object that all loggers sharing the same broadcast instance observe uniformly.
* Process management improvements (#881)Stefan Boberg2026-03-231-5/+27
| | | | | | | | | | | This PR improves process lifecycle handling and resilience across several areas: - **Reclaim stale shared-memory entries instead of exiting** (`zenserver.cpp`): When a zenserver instance fails to attach as a sponsor to an existing process (e.g. because the PID was reused by an unrelated process), the server now clears the stale shared-memory entry and proceeds with normal startup instead of calling `std::exit(1)`. - **Wait for child process exit in `Kill()` and `Terminate()` on Unix** (`process.cpp`): After sending `SIGTERM` in `Kill()`, the code now waits up to 5s for graceful shutdown (escalating to `SIGKILL` on timeout), matching the Windows behavior. `Terminate()` also waits after `SIGKILL` so the child is properly reaped and doesn't linger as a zombie clogging up the process table. - **Fix sysctl buffer race in macOS `FindProcess`** (`process.cpp`): The macOS process enumeration now retries the `sysctl` call (up to 3 attempts with 25% buffer padding) to handle the race where the process list changes between the sizing call and the data-fetching call. Also flattens the nesting and fixes the guard/free scoping. - **Terminate stale processes before integration tests** (`zenserver-test.cpp`, `test.lua`): The integration test runner now accepts a `--kill-stale-processes` flag (passed automatically by `test.lua`) that scans for and terminates any leftover `zenserver`, `zenserver-test`, and `zentest-appstub` processes from previous test runs, logging the executable name and PID of each. This addresses flaky test failures caused by stale processes from prior runs holding ports or other resources.
* improve zenserver startup time (#879)Dan Engelbrecht2026-03-221-1/+1
| | | | - Improvement: Lazy initialize CpuSampler - reduces startup time by ~600 ms - Bugfix: Don't try to wipe .sentry-native folder at missing manifest - sentry is already running. Reduces startup time by ~450 ms when data folder is empty
* S3 hydration backend for hub mode (#873)Dan Engelbrecht2026-03-221-0/+379
| | | | | | - Feature: Added S3 hydration backend for hub mode (`--hub-hydration-target-spec s3://<bucket>[/<prefix>]`) - Credentials resolved from `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` env vars, falling back to EC2 instance profile via IMDS - Each dehydration uploads to a new timestamped folder and commits a `current-state.json` pointer on success, so a failed upload never invalidates the previous state - Hydration downloads to a temp directory first and only replaces the server state on full success; failures leave the existing state intact
* hub web UI improvements (#878)Dan Engelbrecht2026-03-225-20/+746
| | | | | | | | | | - Improvement: Hub dashboard module list improved - Animated state dots, with flashing transitions for hibernating, waking, provisioning, and deprovisioning - Per-row port used by the provisioned instance - Per-row hibernate, wake, and deprovision actions with confirmation for destructive operations - Per-row button to open the instance dashboard in a new window - Multi-select with bulk hibernate/wake/deprovision and select-all - Pagination at 50 lines per page - Inline fold-out panel per row with human-readable process metrics
* Upgrade mimalloc to v2.2.7 and log active memory allocator (#876)Stefan Boberg2026-03-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Upgrade mimalloc from v2.1.2 to v2.2.7. Note that mimalloc is no longer the default allocator so this only impacts users who somehow opt into mimalloc via `--malloc=mimalloc` or compile with different defaults - Add all available mimalloc versions (1.6.7–3.2.8) to the package definition for testing - Log the active memory allocator (with version where available) at server startup - Annotate vendored rpmalloc with its source commit and version ## Notable changes in mimalloc 2.1.2 → 2.2.7 - **Memory release fix** (2.2.4): fix case where OS memory was not always fully released - **Race condition fix** (2.2.6): fixed rare race condition and potential buffer overflow in debug statistics - **Windows arm64 support** (2.1.9) - **Guarded build** (2.1.9): new build mode that places OS guard pages behind objects to catch buffer overflows - **THP awareness** (2.2.6): auto-detects transparent huge pages and adjusts purge size to avoid fragmentation - **Faster TLS access on Windows** (2.2.6) - **Improved calloc and aligned allocation performance** (2.2.6) - **New diagnostic APIs** (2.2.2): `mi_options_print`, `mi_arenas_print`, `mi_stat_get` / `mi_stat_get_json` - **macOS**: use `MADV_FREE_REUSABLE` for better memory behavior (2.2.4) - **Build fixes**: Android, Xbox, musl, mingw, arm32, Debian 32-bit, non-BMI1 x64 systems ## Allocator logging Added `FMalloc::GetName()` pure virtual so the server logs which allocator is active at startup: ``` zenserver - memory allocator: mimalloc 2.2.7 ``` Allocator names include version where available: - `mimalloc 2.2.7` (runtime version via `mi_version()`) - `rpmalloc 1.5.0-dev.20250810` (ad-hoc version from vendored develop branch commit) - `ansi`, `stomp` (no version info available) ## Test plan - [x] Builds successfully on Windows (release) - [x] Verify server startup log shows allocator name - [x] Test with `--malloc=mimalloc` (default) and `--malloc=rpmalloc` - [x] Run test suites to check for regressions
* zen hub command (#877)Dan Engelbrecht2026-03-215-145/+353
| | | | | | | | | | | | | | | - Feature: Added `zen hub` command for managing a hub server and its provisioned module instances: - `zen hub up` - Start a hub server (equivalent to `zen up` in hub mode) - `zen hub down` - Shut down a hub server - `zen hub provision <moduleid>` - Provision a storage server instance for a module - `zen hub deprovision <moduleid>` - Deprovision a storage server instance - `zen hub hibernate <moduleid>` - Hibernate a provisioned instance (shut down, data preserved) - `zen hub wake <moduleid>` - Wake a hibernated instance - `zen hub status [moduleid]` - Show state of all instances or a specific module - Feature: Added new hub HTTP endpoints for instance lifecycle management: - `POST /hub/modules/{moduleid}/hibernate` - Hibernate the instance for the given module - `POST /hub/modules/{moduleid}/wake` - Wake a hibernated instance for the given module - Improvement: `zen up` refactored to use shared `StartupZenServer`/`ShutdownZenServer` helpers (also used by `zen hub up`/`zen hub down`) - Bugfix: Fixed shutdown event not being cleared after the server process exits in `ZenServerInstance::Shutdown()`, which could cause stale state on reuse
* Interprocess pipe support (for stdout/stderr capture) (#866)Stefan Boberg2026-03-212-4/+23
| | | | | | | | | | | | | | | | | - **RAII pipe handles for child process stdout/stderr capture**: `StdoutPipeHandles` is now a proper RAII type with automatic cleanup, move semantics, and partial close support. This makes it safe to use pipes for capturing child process output without risking handle/fd leaks. - **Optional separate stderr pipe**: `CreateProcOptions` now accepts a `StderrPipe` field so callers can capture stdout and stderr independently. When null (default), stderr shares the stdout pipe as before. - **LogStreamListener with pluggable handler**: The TCP log stream listener accepts connections from remote processes and delivers parsed log lines through a `LogStreamHandler` interface, set dynamically via `SetHandler()`. This allows any client to receive log messages without depending on a specific console implementation. - **TcpLogStreamSink for zen::logging**: A logging sink that forwards log messages to a `LogStreamListener` over TCP, using the native `zen::logging::Sink` infrastructure with proper thread-safe synchronization. - **Reliable child process exit codes on Linux**: `waitpid` result handling is fixed so `ProcessHandle::GetExitCode()` returns the real exit code. `ProcessHandle::Reset()` reaps zombies directly, replacing the global `IgnoreChildSignals()` which prevented exit code collection entirely. Also fixes a TOCTOU race in `ProcessHandle::Wait()` on Linux/Mac. - **Pipe capture test suite**: Tests covering stdout/stderr capture via pipes (both shared and separate modes), RAII cleanup, move semantics, and exit code propagation using `zentest-appstub` as the child process. - **Service command integration tests**: Shell-based integration tests for `zen service` covering the full lifecycle (install, status, start, stop, uninstall) on all three platforms — Linux (systemd), macOS (launchd), and Windows (SCM via PowerShell). - **Test script reorganization**: Platform-specific test scripts moved from `scripts/test_scripts/` into `scripts/test_linux/`, `test_mac/`, and `test_windows/`.
* fix null stats provider crash when build store is not configured (#875)v5.7.25-pre0Stefan Boberg2026-03-212-1/+8
| | | | | | - When build store is not configured, `m_BuildCidStore` is null but was unconditionally added as a stats provider, causing a null pointer dereference crash in `StatsReporter::ReportStats` - Added a null guard in `AddProvider` to reject null pointers defensively - Added a conditional check at the call site in `InitializeStructuredCache`
* add hub instance info (#869)Dan Engelbrecht2026-03-2010-180/+798
| | | | | | | - Improvement: Hub module listing now includes per-instance process metrics (memory, CPU time, working set, pagefile usage) - Improvement: Hub now monitors provisioned instance health in the background and refreshes process metrics periodically - Improvement: Hub no longer exposes raw `StorageServerInstance` pointers to callers; instance state is returned as value snapshots (`Hub::InstanceInfo`) - Improvement: Hub instance access is now guarded by RAII per-instance locks (`SharedLockedPtr`/`ExclusiveLockedPtr`), preventing concurrent modifications during provisioning and deprovisioning - Improvement: Hub instance lifecycle is now tracked as a `HubInstanceState` enum covering transitional states (Provisioning, Deprovisioning, Hibernating, Waking); exposed as a string in the HTTP API and dashboard