| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
| |
* use correct health endpoint for zenhubserver consul registration
* add total disk space on hub resource pane
|
| |
|
|
|
|
|
|
|
|
|
| |
* fix endpoint for stats/hub in compute/hub.html page
* fix api token call failure for imds (using wrong overload for Put)
* add "localhost" to healt check url in consul when no address is given
* add consul fallback deregister if normal deregister fails
* add consul registration unit test
|
| | |
|
| |
|
| |
- Feature: Hub dashboard proxy - instance dashboards are accessible through the hub server at `/hub/proxy/{port}/` without requiring direct port access
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### Security: Input validation & path safety
- **Reject local file references by default** in package parsing — only allow when explicitly opted in by the service (`ParseFlags::kAllowLocalReferences`) and validated by an `ILocalRefPolicy` (fail-closed: no policy = rejected)
- **`DataRootLocalRefPolicy`** restricts local ref paths to the server's data root via canonical path prefix matching
- **Validate attachment hashes** in compute HTTP handlers — decompresses and re-hashes each attachment at ingestion time to reject tampered payloads
- **Path traversal validation** for worker descriptions (`pathvalidation.h`) — rejects absolute paths, `..` components, Windows reserved device names, and invalid filename characters
- **Harden CbPackage parsing** against corrupt inputs — overflow-safe attachment count, bounds checks on local ref offset/size, graceful failure instead of `ZEN_ASSERT` for untrusted data
- **Harden legacy package parser** — reject zero-size binary fields, missing mappers, and optionally validate resolved attachment hashes
- **Bounds check in `CbPackageReader::MarshalLocalChunkReference`** — detect when `MakeFromFile` silently clamps offset+size to file size
### Reliability: Lock consolidation & bug fixes
- **Consolidate three action map locks into one** (`m_ActionMapLock`) — eliminates deadlock risk from multi-lock ordering, simplifies state transitions, and fixes a race where newly enqueued actions were briefly invisible to `GetActionResult`/`FindActionResult`
- **Fix infinite loop in `BaseRunnerGroup::SubmitActions`** when actions exceed total runner capacity — cap round-robin at `TotalCapacity` and default unassigned results to "No capacity"
- **Fix `MakeSafeAbsolutePathInPlace` for UNC paths** — `\server\share` now correctly becomes `\?\UNC\server\share` instead of `\?\server\share`
- **Fix `max_retries=0`** — previously fell through to the default of 3; now correctly means "no retries"
### New: ManagedProcessRunner
- Cross-platform process runner backed by `SubprocessManager` — uses async exit callbacks instead of polling, delegates CPU/memory metrics to the manager's built-in sampler
- `ProcessGroup` (JobObject on Windows, process group on POSIX) for bulk cancellation on shutdown
- `--managed` flag on `zen exec inproc` to select this runner
- Refactored monitor thread lifecycle — `StartMonitorThread()` now called from derived constructors to avoid calling virtual functions from base constructor
### Process management
- **Suppress crash dialogs** via `JOB_OBJECT_UILIMIT_ERRORMODE` + `SEM_NOGPFAULTERRORBOX` in both `WindowsProcessRunner` and `JobObject::Initialize` — prevents WER/Dr. Watson modal dialogs from blocking the monitor thread
- **CREATE_SUSPENDED → AssignProcessToJobObject → ResumeThread** pattern in `WindowsProcessRunner` — ensures job object assignment before process execution
- **Move stdout/stderr callbacks to `Spawn()` parameters** in `SubprocessManager` — prevents race where early output could be missed before callback installation
- Consistent PID logging across all runner types
### Test infrastructure
- **`zentest-appstub`**: Added `Fail` (configurable exit code) and `Crash` (abort / nullptr deref) test functions
- **Compute integration tests**: exit code handling, auto-retry exhaustion, manual reschedule after failure, mixed success/failure queues, crash handling (abort + nullptr), crash auto-retry, immediate query visibility after enqueue
- **Package format tests**: truncated header, bad magic, attachment count overflow, truncated data, local ref rejection/acceptance, policy enforcement (inside/outside root, traversal, no-policy fail-closed)
- **Legacy package parser tests**: empty input, zero-size binary, hash resolution with/without mapper, hash mismatch detection
- **UNC path tests** for `MakeSafeAbsolutePath`
### Misc
- ANSI color helper macros (`ZEN_RED`, `ZEN_BRIGHT_WHITE`, etc.) and `ZEN_BOLD`/`ZEN_DIM`/etc.
- Generic `fmt::formatter` for types with free `ToString` functions
- Compute dashboard: truncated hash display with monospace font and hover for full value
- Renamed `usonpackage_forcelink` → `cbpackage_forcelink`
- Compute enabled by default in xmake config (releases still explicitly disable)
|
| |
|
|
|
|
|
|
|
|
|
|
| |
- Feature: Hub dashboard now shows a Resources tile with disk and memory usage against configured limits
- Feature: Hub module listing now shows state-change timestamps and duration for each instance
- Improvement: Hub provisioning rejects new instances when disk or memory usage exceeds configurable thresholds; limits are disabled by default (0 = no limit)
- `--hub-provision-disk-limit-bytes` - Reject provisioning when used disk exceeds this many bytes
- `--hub-provision-disk-limit-percent` - Reject provisioning when used disk exceeds this percentage of total disk
- `--hub-provision-memory-limit-bytes` - Reject provisioning when used memory exceeds this many bytes
- `--hub-provision-memory-limit-percent` - Reject provisioning when used memory exceeds this percentage of total RAM
- Improvement: Hub process metrics are now tracked atomically per active instance slot, eliminating per-query process handle lookups
- Improvement: Hub, Build Store, and Workspaces service stats sections in the dashboard are now collapsible
- Bugfix: Hub watchdog loop did not check `m_ShutdownFlag`, causing it to spin indefinitely on shutdown
|
| |
|
|
|
|
| |
- Feature: Added Workspaces dashboard page with HTTP request stats and per-workspace metrics
- Feature: Added Build Storage dashboard page with service-specific HTTP request stats
- Improvement: Front page now shows Hub and Object Store activity tiles; HTTP panel is fixed above the tiles grid
- Improvement: HTTP stats tiles now include 5m/15m rates and p999/max latency across all service pages
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
- Feature: Hub watchdog automatically deprovisions inactive provisioned and hibernated instances
- Feature: Added `stats/activity_counters` endpoint to measure server activity
- Feature: Added configuration options for hub watchdog
- `--hub-watchdog-provisioned-inactivity-timeout-seconds` Inactivity timeout before a provisioned instance is deprovisioned
- `--hub-watchdog-hibernated-inactivity-timeout-seconds` Inactivity timeout before a hibernated instance is deprovisioned
- `--hub-watchdog-inactivity-check-margin-seconds` Margin before timeout at which an activity check is issued
- `--hub-watchdog-cycle-interval-ms` Watchdog poll interval in milliseconds
- `--hub-watchdog-cycle-processing-budget-ms` Maximum time budget per watchdog cycle in milliseconds
- `--hub-watchdog-instance-check-throttle-ms` Minimum delay between checks on a single instance
- `--hub-watchdog-activity-check-connect-timeout-ms` Connect timeout for activity check requests
- `--hub-watchdog-activity-check-request-timeout-ms` Request timeout for activity check requests
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## Summary
This PR adds a session management service, several new dashboard pages, and a number of infrastructure improvements.
### Sessions Service
- `SessionsServiceClient` in `zenutil` announces sessions to a remote zenserver with a 15s heartbeat (POST/PUT/DELETE lifecycle)
- Storage server registers itself with its own local sessions service on startup
- Session mode attribute coupled to server mode (Compute, Proxy, Hub, etc.)
- Ended sessions tracked with `ended_at` timestamp; status filtering (Active/Ended/All)
- `--sessions-url` config option for remote session announcement
- In-process log sink (`InProcSessionLogSink`) forwards server log output to the server's own session, visible in the dashboard
### Session Log Viewer
- POST/GET endpoints for session logs (`/sessions/{id}/log`) supporting raw text and structured JSON/CbObject with batch `entries` array
- In-memory log storage per session (capped at 10k entries) with cursor-based pagination for efficient incremental fetching
- Log panel in the sessions dashboard with incremental DOM updates, auto-scroll (Follow toggle), newest-first toggle, text filter, and log-level coloring
- Auto-selects the server's own session on page load
### TCP Log Streaming
- `LogStreamListener` and `TcpLogStreamSink` for log delivery over TCP
- Sequence numbers on each message with drop detection and synthetic "dropped" notice on gaps
- Gathered buffer writes to reduce syscall overhead when flushing batches
- Tests covering basic delivery, multi-line splitting, drop detection, and sequencing
### New Dashboard Pages
- **Sessions**: master-detail layout with selectable rows, metadata panel, live WebSocket updates, paging, abbreviated date formatting, and "this" pill for the local session
- **Object Store**: summary stats tiles and bucket table with click-to-expand inline object listing (`GET /obj/`)
- **Storage**: per-volume disk usage breakdown (`GET /admin/storage`), Garbage Collection status section (next-run countdown, last-run stats), and GC History table with paginated rows and expandable detail panels
- **Network**: overview tiles, per-service request table, proxy connections, and live WebSocket updates; distinct client IPs and session counts via HyperLogLog
### Documentation Page
- In-dashboard Docs page with sidebar navigation, markdown rendering (via `marked`), Mermaid diagram support (theme-aware), collapsible sections, text filtering with highlighting, and cross-document linking
- New user-facing docs: `overview.md` (with architecture and per-mode diagrams), `sessions.md`, `cache.md`, `projects.md`; updated `compute.md`
- Dev docs moved to `docs/dev/`
### Infrastructure & Bug Fixes
- **Deflate compression** for the embedded frontend zip (~3.4MB → ~950KB); zlib inflate support added to `ZipFs` with cached decompressed buffers
- **Local IP addresses**: `GetLocalIpAddresses()` (Windows via `GetAdaptersAddresses`, Linux/Mac via `getifaddrs`); surfaced in `/status/status`, `/health/info`, and the dashboard banner
- **Dashboard nav**: unified into `zen-nav` web component with `MutationObserver` for dynamically added links, CSS `::part()` to merge banner/nav border radii, and prefix-based active link detection
- Stats broadcast refactored from manual JSON string concatenation to `CbObjectWriter`; `CbObject`-to-JS conversion improved for `TimeSpan`, `DateTime`, and large integers
- Stats WebSocket boilerplate consolidated into `ZenPage.connect_stats_ws()`
|
| |
|
| |
* add hub instance crash recovery
|
| |
|
|
|
|
|
|
|
|
| |
- Improvement: Hub dashboard module list improved
- Animated state dots, with flashing transitions for hibernating, waking, provisioning, and deprovisioning
- Per-row port used by the provisioned instance
- Per-row hibernate, wake, and deprovision actions with confirmation for destructive operations
- Per-row button to open the instance dashboard in a new window
- Multi-select with bulk hibernate/wake/deprovision and select-all
- Pagination at 50 lines per page
- Inline fold-out panel per row with human-readable process metrics
|
| |
|
|
|
|
|
| |
- Improvement: Hub module listing now includes per-instance process metrics (memory, CPU time, working set, pagefile usage)
- Improvement: Hub now monitors provisioned instance health in the background and refreshes process metrics periodically
- Improvement: Hub no longer exposes raw `StorageServerInstance` pointers to callers; instance state is returned as value snapshots (`Hub::InstanceInfo`)
- Improvement: Hub instance access is now guarded by RAII per-instance locks (`SharedLockedPtr`/`ExclusiveLockedPtr`), preventing concurrent modifications during provisioning and deprovisioning
- Improvement: Hub instance lifecycle is now tracked as a `HubInstanceState` enum covering transitional states (Provisioning, Deprovisioning, Hibernating, Waking); exposed as a string in the HTTP API and dashboard
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### Compute Batch Submission
- Consolidate duplicated action submission logic in `httpcomputeservice` into a single `HandleSubmitAction` supporting both single-action and batch (actions array) payloads
- Group actions by queue in `RemoteHttpRunner` and submit as batches with configurable chunk size, falling back to individual submission on failure
- Extract shared helpers: `MakeErrorResult`, `ValidateQueueForEnqueue`, `ActivateActionInQueue`, `RemoveActionFromActiveMaps`
### Retracted Action State
- Add `Retracted` state to `RunnerAction` for retry-free rescheduling — an explicit request to pull an action back and reschedule it on a different runner without incrementing `RetryCount`
- Implement idempotent `RetractAction()` on `RunnerAction` and `ComputeServiceSession`
- Add `POST jobs/{lsn}/retract` and `queues/{queueref}/jobs/{lsn}/retract` HTTP endpoints
- Add state machine documentation and per-state comments to `RunnerAction`
### Compute Race Fixes
- Fix race in `HandleActionUpdates` where actions enqueued between session abandon and scheduler tick were never abandoned, causing `GetActionResult` to return 202 indefinitely
- Fix queue `ActiveCount` race where `NotifyQueueActionComplete` was called after releasing `m_ResultsLock`, allowing callers to observe stale counters immediately after `GetActionResult` returned OK
### Logging Optimization and ANSI improvements
- Improve `AnsiColorStdoutSink` write efficiency — single write call, dirty-flag flush, `RwLock` instead of `std::mutex`
- Move ANSI color emission from sink into formatters via `Formatter::SetColorEnabled()`; remove `ColorRangeStart`/`End` from `LogMessage`
- Extract color helpers (`AnsiColorForLevel`, `StripAnsiSgrSequences`) into `helpers.h`
- Strip upstream ANSI SGR escapes in non-color output mode. This enables colour in log messages without polluting log files with ANSI control sequences
- Move `RotatingFileSink`, `JsonFormatter`, and `FullFormatter` from header-only to pimpl with `.cpp` files
### CLI / Exec Refactoring
- Extract `ExecSessionRunner` class from ~920-line `ExecUsingSession` into focused methods and a `ExecSessionConfig` struct
- Replace monolithic `ExecCommand` with subcommand-based architecture (`http`, `inproc`, `beacon`, `dump`, `buildlog`)
- Allow parent options to appear after subcommand name by parsing subcommand args permissively and forwarding unmatched tokens to the parent parser
### Testing Improvements
- Fix `--test-suite` filter being ignored due to accumulation with default wildcard filter
- Add test suite banners to test listener output
- Made `function.session.abandon_pending` test more robust
### Startup / Reliability Fixes
- Fix silent exit when a second zenserver instance detects a port conflict — use `ZEN_CONSOLE_*` for log calls that precede `InitializeLogging()`
- Fix two potential SIGSEGV paths during early startup: guard `sentry_options_new()` returning nullptr, and throw on `ZenServerState::Register()` returning nullptr instead of dereferencing
- Fail on unrecognized zenserver `--mode` instead of silently defaulting to store
### Other
- Show host details (hostname, platform, CPU count, memory) when discovering new compute workers
- Move frontend `html.zip` from source tree into build directory
- Add format specifications for Compact Binary and Compressed Buffer wire formats
- Add `WriteCompactBinaryObject` to zencore
- Extended `ConsoleTui` with additional functionality
- Add `--vscode` option to `xmake sln` for clangd / `compile_commands.json` support
- Disable compute/horde/nomad in release builds (not yet production-ready)
- Disable unintended `ASIO_HAS_IO_URING` enablement
- Fix crashpad patch missing leading whitespace
- Clean up code triggering gcc false positives
|
| |
|
|
|
|
|
| |
- Fix potential crash on startup caused by logging macros being invoked before the logging system is initialized (null logger dereference in `ZenServerState::Sweep()`). `LoggerRef::ShouldLog` now guards against a null logger pointer.
- Make CPR an optional dependency (`--zencpr` build option, enabled by default) so builds can proceed without it
- Make zenvfs Windows-only (platform-specific target)
- Generate the frontend zip at build time from source HTML files instead of checking in a binary blob which would accumulate with every single update
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds a **transparent TCP proxy mode** to zenserver (activated via `zenserver proxy`), allowing it to sit between clients and upstream Zen servers to inspect and monitor HTTP/1.x traffic in real time. Primarily useful during development, to be able to observe multi-server/client interactions in one place.
- **Dedicated proxy port** -- Proxy mode defaults to port 8118 with its own data directory to avoid collisions with a normal zenserver instance.
- **TCP proxy core** (`src/zenserver/proxy/`) -- A new transparent TCP proxy that forwards connections to upstream targets, with support for both TCP/IP and Unix socket listeners. Multi-threaded I/O for connection handling. Supports Unix domain sockets for both upstream/downstream.
- **HTTP traffic inspection** -- Parses HTTP/1.x request/response streams inline to extract method, path, status, content length, and WebSocket upgrades without breaking the proxied data.
- **Proxy dashboard** -- A web UI showing live connection stats, per-target request counts, active connections, bytes transferred, and client IP/session ID rollups.
- **Server mode display** -- Dashboard banner now shows the running server mode (Zen Proxy, Zen Compute, etc.).
Supporting changes included in this branch:
- **Wildcard log level matching** -- Log levels can now be set per-category using wildcard patterns (e.g. `proxy.*=debug`).
- **`zen down --all`** -- New flag to shut down all running zenserver instances; also used by the new `xmake kill` task.
- Minor test stability fixes (flaky hash collisions, per-thread RNG seeds).
- Support ZEN_MALLOC environment variable for default allocator selection and switch default to rpmalloc
- Fixed sentry-native build to allow LTO on Windows
|
| |
|
|
|
|
|
|
|
|
| |
- **Frontend dashboard overhaul**: Unified compute/main dashboards into a single shared UI. Added new pages for cache, projects, metrics, sessions, info (build/runtime config, system stats). Added live-update via WebSockets with pause control, sortable detail tables, themed styling. Refactored compute/hub/orchestrator pages into modular JS.
- **HTTP server fixes and stats**: Fixed http.sys local-only fallback when default port is in use, implemented root endpoint redirect for http.sys, fixed Linux/Mac port reuse. Added /stats endpoint exposing HTTP server metrics (bytes transferred, request rates). Added WebSocket stats tracking.
- **OTEL/diagnostics hardening**: Improved OTLP HTTP exporter with better error handling and resilience. Extended diagnostics services configuration.
- **Session management**: Added new sessions service with HTTP endpoints for registering, updating, querying, and removing sessions. Includes session log file support. This is still WIP.
- **CLI subcommand support**: Added support for commands with subcommands in the zen CLI tool, with improved command dispatch.
- **Misc**: Exposed CPU usage/hostname to frontend, fixed JS compact binary float32/float64 decoding, limited projects displayed on front page to 25 sorted by last access, added vscode:// link support.
Also contains some fixes from TSAN analysis.
|
| |
|
|
|
|
|
|
|
|
| |
- Added local process runners for Linux/Wine, Mac with some sandboxing support
- Horde & Nomad provisioning for development and testing
- Client session queues with lifecycle management (active/draining/cancelled), automatic retry with configurable limits, and manual reschedule API
- Improved web UI for orchestrator, compute, and hub dashboards with WebSocket push updates
- Some security hardening
- Improved scalability and `zen exec` command
Still experimental - compute support is disabled by default
|
| | |
|
| |\ |
|
| | |
| |
| |
| |
| |
| |
| | |
- zenhttp: added `GetServiceUri()`/`GetExternalHost()`
- enables code to quickly generate an externally reachable URI for a given service
- frontend: improved Uri handling (better defaults)
- added support for 404 page (to make it easier to find a good URL)
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
zencore fixes:
- filesystem.cpp: ReadFile error reporting logic
- compactbinaryvalue.h: CbValue::As*String error reporting logic
zenhttp fixes:
- httpasio BindAcceptor would `return 0;` in a function returning `std::string` (UB)
- httpsys async workpool initialization race
zenstore fixes:
- cas.cpp: GetFileCasResults Results param passed by value instead of reference (large chunk results were silently lost)
- structuredcachestore.cpp: MissCount unconditionally incremented (counted hits as misses)
- cacherpc.cpp: Wrong boolean in Incomplete response array (all entries marked incomplete)
- cachedisklayer.cpp: sizeof(sizeof(...)) in two validation checks computed sizeof(size_t) instead of struct size
- buildstore.cpp: Wrong hash tracked in GC key list (BlobHash pushed twice instead of MetadataHash)
- buildstore.cpp: Removed duplicate m_LastAccessTimeUpdateCount increment in PutBlob
zenserver fixes:
- httpbuildstore.cpp: Reversed subtraction in HTTP range calculation (unsigned underflow)
- hubservice.cpp: Deadlock in Provision() calling Wake() while holding m_Lock (extracted WakeLocked helper)
- zipfs.cpp: Data race in GetFile() lazy initialization (added RwLock with shared/exclusive paths)
|
| | |
| |
| |
| | |
This reverts commit 3c89c486338890ce39ddebe5be4722a09e85701a.
|
| | |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
zenstore fixes:
- cas.cpp: GetFileCasResults Results param passed by value instead of reference (large chunk results were silently lost)
- structuredcachestore.cpp: MissCount unconditionally incremented (counted hits as misses)
- cacherpc.cpp: Wrong boolean in Incomplete response array (all entries marked incomplete)
- cachedisklayer.cpp: sizeof(sizeof(...)) in two validation checks computed sizeof(size_t) instead of struct size
- buildstore.cpp: Wrong hash tracked in GC key list (BlobHash pushed twice instead of MetadataHash)
- buildstore.cpp: Removed duplicate m_LastAccessTimeUpdateCount increment in PutBlob
zenserver fixes:
- httpbuildstore.cpp: Reversed subtraction in HTTP range calculation (unsigned underflow)
- hubservice.cpp: Deadlock in Provision() calling Wake() while holding m_Lock (extracted WakeLocked helper)
- zipfs.cpp: Data race in GetFile() lazy initialization (added RwLock with shared/exclusive paths)
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
| | | |
|
| |/ |
|
| |\ |
|
| | |
| |
| |
| |
| |
| |
| |
| |
| | |
this change adds the `zencompute` component, which can be used to distribute work dispatched from UE using the DDB (Derived Data Build) APIs via zenserver
this change also adds a distinct zenserver compute mode (`zenserver compute`) which is intended to be used for leaf compute nodes
to exercise the compute functionality without directly involving UE, a `zen exec` subcommand is also added, which can be used to feed replays through the system
all new functionality is considered *experimental* and disabled by default at this time, behind the `zencompute` option in xmake config
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| | | |
|
| |/ |
|
| |
|
| |
* Fix formatting of stat pages
|
| |
|
| |
* Fix incorrect oplog navigation symbols
|
| |
|
| |
* Avoid rendering user text input as HTML
|
| |
|
|
|
|
|
|
|
|
|
| |
* Save references to the project and zcache tables
* Add an attribute to table rows with the actionable project/namespace id
* Drop-all option for projects and cache namespaces
* Updated frontend .zip archive
* Edited changelog
|
| |
|
|
|
|
|
|
|
|
|
| |
* Method to get plain text from an async request
* Include server's version on the dashboard start page
* Same paragraph style as the rest of the method
* Updated changelog
* Update frontend archive
|
| |
|
|
|
|
|
| |
* More robust dashboard stats summary
* Updated changelog
* Updated frontend zip archive
|
| |
|
|
|
|
|
| |
* Sorting oplog tree view by size would raise an error
* Updated frontend .zip archive
* Updated changelog
|
| |
|
|
|
| |
* clean up trace command line options
explicitly shut down worker pools
* some additional startup trace scopes
|
| | |
|
| | |
|
| | |
|
| | |
|