path: root/src/zenserver/compute/computeserver.cpp
Commit message [Author, Age, Files, Lines]
* Separate action and worker chunk stores for compute service (sb/compute-oidc-auth) [Stefan Boberg, 2 hours, 1 file, -5/+7]
  Extract ChunkStore interface from CidStore so HttpComputeService can accept different storage backends for action inputs vs worker binaries. Action inputs use MemoryCidStore (no disk persistence) while workers use a disk-backed CidStore for cross-action reuse.
  - Add ChunkStore abstract class (AddChunk, ContainsChunk, FilterChunks) and FallbackChunkResolver to zenstore.h
  - CidStore and MemoryCidStore both implement ChunkStore
  - HttpComputeService takes two ChunkStore& params (action + worker)
  - Compute server wires MemoryCidStore for actions, CidStore for workers
  - Storage server passes its CidStore for both (unchanged behavior)
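The split described in this commit can be pictured roughly as follows. This is only a sketch under stated assumptions: the method names `AddChunk`, `ContainsChunk`, and `FilterChunks` come from the commit message, but their signatures, the `ChunkHash`/`ChunkData` aliases, `InMemoryStore`, and `ComputeServiceSketch` are hypothetical stand-ins, and `FilterChunks` is assumed to drop the hashes the store already holds.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real zenstore.h types are not shown in the log.
using ChunkHash = std::string;
using ChunkData = std::vector<std::uint8_t>;

// Abstract storage interface (signatures are assumptions).
class ChunkStore
{
public:
    virtual ~ChunkStore() = default;
    // Returns true if the chunk was newly added.
    virtual bool AddChunk(const ChunkHash& Hash, ChunkData Data) = 0;
    virtual bool ContainsChunk(const ChunkHash& Hash) const = 0;
    // Assumed semantics: remove every hash the store already contains.
    virtual void FilterChunks(std::vector<ChunkHash>& Hashes) const = 0;
};

// Memory-only backend, playing the role the log assigns to MemoryCidStore.
class InMemoryStore final : public ChunkStore
{
public:
    bool AddChunk(const ChunkHash& Hash, ChunkData Data) override
    {
        return m_Chunks.emplace(Hash, std::move(Data)).second;
    }
    bool ContainsChunk(const ChunkHash& Hash) const override
    {
        return m_Chunks.count(Hash) != 0;
    }
    void FilterChunks(std::vector<ChunkHash>& Hashes) const override
    {
        Hashes.erase(std::remove_if(Hashes.begin(), Hashes.end(),
                                    [this](const ChunkHash& H) { return ContainsChunk(H); }),
                     Hashes.end());
    }

private:
    std::unordered_map<ChunkHash, ChunkData> m_Chunks;
};

// The service takes two independent stores: one for action inputs, one for workers.
struct ComputeServiceSketch
{
    ComputeServiceSketch(ChunkStore& InActionStore, ChunkStore& InWorkerStore)
    : ActionStore(InActionStore), WorkerStore(InWorkerStore)
    {
    }
    ChunkStore& ActionStore;
    ChunkStore& WorkerStore;
};
```

Because the service only sees `ChunkStore&`, a storage server could pass the same store for both parameters, while a compute server could pass a memory-only store for actions and a disk-backed one for workers.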
* Add --provision-tracehost to forward trace collection to provisioned workers [Stefan Boberg, 16 hours, 1 file, -2/+11]
  Allows the orchestrator to pass --tracehost=<ip> to Horde/Nomad-spawned zenserver instances so their UE traces can be collected remotely for performance analysis.
* Add --provision-clean option to pass --clean to provisioned workers [Stefan Boberg, 17 hours, 1 file, -3/+13]
  Allows the orchestrator to tell Horde/Nomad-spawned zenserver instances to wipe their data directory on startup, avoiding stale cached state from previous runs on the same machine.
* Add orchestrator session ID to reject stale worker announcements [Stefan Boberg, 18 hours, 1 file, -2/+20]
  Horde/Nomad-spawned zenserver instances from a previous orchestrator session could remain alive and post announcements to a new orchestrator, creating ghost workers. The orchestrator now uses its process session ID and passes it to spawned workers via --coordinator-session. Workers include it in announce payloads, and the orchestrator rejects mismatches with 409 Conflict. Announcements without a session field are still accepted for backwards compatibility.
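The acceptance rule in this commit could look something like the sketch below. The function name `CheckAnnounceSession` and the use of 200 for accepted announcements are assumptions made for illustration; only the 409 Conflict on mismatch and the acceptance of announcements without a session field come from the commit message.

```cpp
#include <optional>
#include <string>

// Returns the HTTP status the orchestrator would answer with (hypothetical helper).
// - No session field in the payload: accepted, for backwards compatibility.
// - Session field matches the orchestrator's session ID: accepted.
// - Session field differs: rejected with 409 Conflict (stale worker).
int CheckAnnounceSession(const std::string& OrchestratorSessionId,
                         const std::optional<std::string>& AnnouncedSessionId)
{
    if (!AnnouncedSessionId.has_value())
    {
        return 200; // assumed success code; the log only specifies the 409 case
    }
    return (*AnnouncedSessionId == OrchestratorSessionId) ? 200 : 409;
}
```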
* Fix relay mode: case-insensitive mode parsing, AES send/recv deadlock, and endpoint routing [Stefan Boberg, 27 hours, 1 file, -0/+12]
  - Make FromString for ConnectionMode and Encryption case-insensitive so PascalCase values from the Horde API (e.g. "Relay") are recognized.
  - Split AesComputeTransport's single mutex into separate send/recv mutexes to prevent deadlock where the recv thread blocks on TCP while holding the lock, starving the send thread from sending Fork.
  - Add MachineInfo::GetZenServiceEndpoint() to resolve the relay-mapped address and port for the Zen service, used by the provisioner for both its own health-check endpoint and the remote zenserver's announce URL.
  - Add --announce-url CLI option so the provisioner can tell the remote zenserver which externally-visible URL to announce to the orchestrator (instead of its unreachable private IP in relay mode).
  - Log connection mode in machine-assigned message for diagnostics.
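As an illustration of the first fix, a case-insensitive parser might be written as below. This is a sketch, not the code from the commit: the enumerator set (`Direct`, `Tunnel`, `Relay`) and the function names are assumptions, with only "Relay" taken from the commit message.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string_view>

// Assumed enumerators; the log only names "Relay".
enum class ConnectionMode
{
    Direct,
    Tunnel,
    Relay
};

// ASCII case-insensitive comparison.
static bool EqualsIgnoreCase(std::string_view A, std::string_view B)
{
    if (A.size() != B.size())
    {
        return false;
    }
    for (std::size_t i = 0; i < A.size(); ++i)
    {
        const int Ca = std::tolower(static_cast<unsigned char>(A[i]));
        const int Cb = std::tolower(static_cast<unsigned char>(B[i]));
        if (Ca != Cb)
        {
            return false;
        }
    }
    return true;
}

// Accepts "relay", "Relay", "RELAY", and so on; returns nullopt for unknown text.
std::optional<ConnectionMode> ConnectionModeFromString(std::string_view Text)
{
    if (EqualsIgnoreCase(Text, "direct")) return ConnectionMode::Direct;
    if (EqualsIgnoreCase(Text, "tunnel")) return ConnectionMode::Tunnel;
    if (EqualsIgnoreCase(Text, "relay")) return ConnectionMode::Relay;
    return std::nullopt;
}
```

Returning `std::optional` rather than a default value keeps an unrecognized mode visible to the caller instead of silently falling back.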
* Improve OidcToken auth diagnostics and use --HordeUrl for Horde servers [Stefan Boberg, 27 hours, 1 file, -1/+2]
  - Log full command line on OidcToken failure instead of just the exe path
  - Use --HordeUrl flag for Horde server URLs, --AuthConfigUrl for others (e.g. Jupiter), controlled by new IsHordeUrl parameter
  - Add specific warnings in HordeConfig::Validate for each failure reason
* Add provisioner target control and graceful agent deprovisioning [Stefan Boberg, 39 hours, 1 file, -2/+20]
  - Add IProvisionerStateProvider interface for decoupling orchestrator HTTP layer from provisioner implementations
  - HordeProvisioner implements the interface directly, exposing target, active, estimated core counts and per-agent provisioning status
  - Add orchestrator dashboard provisioner panel with editable target core count, active/estimated/agents/draining metric tiles
  - Add per-agent status badges in agents table (active/draining/deallocated) with dimming for non-provisioned or deallocated workers
  - Add GET/POST /orch/provisioner/status and /orch/provisioner/target endpoints; include provisioner stats in WebSocket push
  - Add session lifecycle HTTP endpoints on compute nodes: GET /compute/session/status, POST /compute/session/drain, POST /compute/session/sunset (with shutdown callback)
  - Implement graceful deprovisioning in HordeProvisioner:
    - Best-fit agent selection: largest agent that fits the remaining excess, breaking ties by workload (least busy first)
    - Never deprovision below target core count
    - Drain via HTTP: signal drain, poll status, send sunset on completion
    - Configurable grace period (--horde-drain-grace-period, default 300s)
    - Anti-oscillation: scale-up requires full agent-sized gap below target; estimated cores adjusted at drain-mark time, not agent-exit time
  - Fix JSON-to-CbObject number parsing: integral JSON numbers are now stored as integers instead of Float64 so AsInt32/AsInt64 work
  - Add effective URL to curl HTTP client error logging
  - Add WorkerAnnotator callback to OrchestratorService::GetWorkerList for per-worker provisioner status injection
  - Remove standalone compute/orchestrator.html and compute/compute.html; redirect /dashboard/compute/ to SPA
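The best-fit selection rule from this commit can be sketched as follows. The `Agent` struct, its fields, and `PickAgentToDrain` are hypothetical names; "workload" is approximated here by an active job count, which is an assumption.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical agent description; field names are illustrative only.
struct Agent
{
    std::string Name;
    int         Cores;
    int         ActiveJobs; // stand-in for "workload"
};

// Picks the agent to drain next, or nullopt if draining any agent would
// take the active core count below the target.
std::optional<std::size_t> PickAgentToDrain(const std::vector<Agent>& Agents,
                                            int ActiveCores,
                                            int TargetCores)
{
    const int Excess = ActiveCores - TargetCores;
    if (Excess <= 0)
    {
        return std::nullopt; // never deprovision below target
    }
    std::optional<std::size_t> Best;
    for (std::size_t i = 0; i < Agents.size(); ++i)
    {
        const Agent& Candidate = Agents[i];
        if (Candidate.Cores > Excess)
        {
            continue; // does not fit in the remaining excess
        }
        if (!Best)
        {
            Best = i;
            continue;
        }
        const Agent& Current = Agents[*Best];
        const bool   Larger = Candidate.Cores > Current.Cores;
        const bool   SameSizeLessBusy =
            Candidate.Cores == Current.Cores && Candidate.ActiveJobs < Current.ActiveJobs;
        if (Larger || SameSizeLessBusy)
        {
            Best = i; // largest fit first, then least busy
        }
    }
    return Best;
}
```

For example, with agents of 8, 16, and 8 cores, 32 active cores, and a target of 20, the excess is 12, so only the 8-core agents fit and the less busy of the two is chosen.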
* Add OidcToken-based authentication for HordeProvisioner [Stefan Boberg, 45 hours, 1 file, -0/+32]
  Support automatic token acquisition and refresh for Horde compute provisioning via the OidcToken executable, matching the existing auth flow used by the builds command. When no static --horde-token is provided, the compute server auto-discovers OidcToken.exe and uses it to obtain and refresh Bearer tokens.
* idle deprovision in hub (#895) [Dan Engelbrecht, 7 days, 1 file, -1/+1]
  - Feature: Hub watchdog automatically deprovisions inactive provisioned and hibernated instances
  - Feature: Added `stats/activity_counters` endpoint to measure server activity
  - Feature: Added configuration options for hub watchdog
    - `--hub-watchdog-provisioned-inactivity-timeout-seconds` Inactivity timeout before a provisioned instance is deprovisioned
    - `--hub-watchdog-hibernated-inactivity-timeout-seconds` Inactivity timeout before a hibernated instance is deprovisioned
    - `--hub-watchdog-inactivity-check-margin-seconds` Margin before timeout at which an activity check is issued
    - `--hub-watchdog-cycle-interval-ms` Watchdog poll interval in milliseconds
    - `--hub-watchdog-cycle-processing-budget-ms` Maximum time budget per watchdog cycle in milliseconds
    - `--hub-watchdog-instance-check-throttle-ms` Minimum delay between checks on a single instance
    - `--hub-watchdog-activity-check-connect-timeout-ms` Connect timeout for activity check requests
    - `--hub-watchdog-activity-check-request-timeout-ms` Request timeout for activity check requests
* Compute batching (#849) [Stefan Boberg, 2026-03-18, 1 file, -15/+1]
  ### Compute Batch Submission
  - Consolidate duplicated action submission logic in `httpcomputeservice` into a single `HandleSubmitAction` supporting both single-action and batch (actions array) payloads
  - Group actions by queue in `RemoteHttpRunner` and submit as batches with configurable chunk size, falling back to individual submission on failure
  - Extract shared helpers: `MakeErrorResult`, `ValidateQueueForEnqueue`, `ActivateActionInQueue`, `RemoveActionFromActiveMaps`
  ### Retracted Action State
  - Add `Retracted` state to `RunnerAction` for retry-free rescheduling: an explicit request to pull an action back and reschedule it on a different runner without incrementing `RetryCount`
  - Implement idempotent `RetractAction()` on `RunnerAction` and `ComputeServiceSession`
  - Add `POST jobs/{lsn}/retract` and `queues/{queueref}/jobs/{lsn}/retract` HTTP endpoints
  - Add state machine documentation and per-state comments to `RunnerAction`
  ### Compute Race Fixes
  - Fix race in `HandleActionUpdates` where actions enqueued between session abandon and scheduler tick were never abandoned, causing `GetActionResult` to return 202 indefinitely
  - Fix queue `ActiveCount` race where `NotifyQueueActionComplete` was called after releasing `m_ResultsLock`, allowing callers to observe stale counters immediately after `GetActionResult` returned OK
  ### Logging Optimization and ANSI improvements
  - Improve `AnsiColorStdoutSink` write efficiency: single write call, dirty-flag flush, `RwLock` instead of `std::mutex`
  - Move ANSI color emission from sink into formatters via `Formatter::SetColorEnabled()`; remove `ColorRangeStart`/`End` from `LogMessage`
  - Extract color helpers (`AnsiColorForLevel`, `StripAnsiSgrSequences`) into `helpers.h`
  - Strip upstream ANSI SGR escapes in non-color output mode. This enables colour in log messages without polluting log files with ANSI control sequences
  - Move `RotatingFileSink`, `JsonFormatter`, and `FullFormatter` from header-only to pimpl with `.cpp` files
  ### CLI / Exec Refactoring
  - Extract `ExecSessionRunner` class from ~920-line `ExecUsingSession` into focused methods and an `ExecSessionConfig` struct
  - Replace monolithic `ExecCommand` with subcommand-based architecture (`http`, `inproc`, `beacon`, `dump`, `buildlog`)
  - Allow parent options to appear after subcommand name by parsing subcommand args permissively and forwarding unmatched tokens to the parent parser
  ### Testing Improvements
  - Fix `--test-suite` filter being ignored due to accumulation with default wildcard filter
  - Add test suite banners to test listener output
  - Make `function.session.abandon_pending` test more robust
  ### Startup / Reliability Fixes
  - Fix silent exit when a second zenserver instance detects a port conflict: use `ZEN_CONSOLE_*` for log calls that precede `InitializeLogging()`
  - Fix two potential SIGSEGV paths during early startup: guard `sentry_options_new()` returning nullptr, and throw on `ZenServerState::Register()` returning nullptr instead of dereferencing
  - Fail on unrecognized zenserver `--mode` instead of silently defaulting to store
  ### Other
  - Show host details (hostname, platform, CPU count, memory) when discovering new compute workers
  - Move frontend `html.zip` from source tree into build directory
  - Add format specifications for Compact Binary and Compressed Buffer wire formats
  - Add `WriteCompactBinaryObject` to zencore
  - Extend `ConsoleTui` with additional functionality
  - Add `--vscode` option to `xmake sln` for clangd / `compile_commands.json` support
  - Disable compute/horde/nomad in release builds (not yet production-ready)
  - Disable unintended `ASIO_HAS_IO_URING` enablement
  - Fix crashpad patch missing leading whitespace
  - Clean up code triggering gcc false positives
* zen hub port reuse (#850) [Dan Engelbrecht, 2026-03-17, 1 file, -0/+1]
  - Feature: Added `--allow-port-probing` option to control whether zenserver searches for a free port on startup (default: true, automatically false when --dedicated is set)
  - Feature: Added new hub options for controlling provisioned storage server instances:
    - `--hub-instance-http` - HTTP server implementation for instances (asio/httpsys)
    - `--hub-instance-http-threads` - Number of HTTP connection threads per instance
    - `--hub-instance-corelimit` - Limit CPU concurrency per instance
  - Improvement: Hub now manages a deterministic port pool for provisioned instances allowing reuse of unused ports
* URI decoding, process env, compiler info, httpasio strands, regex route removal (#841) [Stefan Boberg, 2026-03-16, 1 file, -1/+1]
  - Percent-decode URIs in ASIO HTTP server to match http.sys CookedUrl behavior, ensuring consistent decoded paths across backends
  - Add Environment field to CreateProcOptions for passing extra env vars to child processes (Windows: merged into Unicode environment block; Unix: setenv in fork)
  - Add GetCompilerName() and include it in build options startup logging
  - Suppress Windows CRT error dialogs in test harness for headless/CI runs
  - Fix mimalloc package: pass CMAKE_BUILD_TYPE, skip cfuncs test for cross-compile
  - Add virtual destructor to SentryAssertImpl to fix debug-mode warning
  - Simplify object store path handling now that URIs arrive pre-decoded
  - Add URI decoding test coverage for percent-encoded paths and query params
  - Simplify httpasio request handling by using strands (guarantees no parallel handlers per connection)
  - Removed deprecated regex-based route matching support
  - Fix full GC never triggering after cross-toolchain builds: The `gc_state` file stores `system_clock` ticks, but the tick resolution differs between toolchains (nanoseconds on GCC/standard clang, microseconds on UE clang). A nanosecond timestamp misinterpreted as microseconds appears far in the future (~year 58,000), bypassing the staleness check and preventing time-based full GC from ever running. Fixed by also resetting when the stored timestamp is in the future.
  - Clamp GC countdown display to configured interval: Prevents nonsensical log output (e.g. "Full GC in 492128002h") caused by the above or any other clock anomaly. The clamp applies to both the scheduler log and the status API.
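The two GC fixes above amount to a pair of small guards, sketched here with `std::chrono`. The function names are hypothetical, and "resetting" is assumed to mean treating the last full GC time as zero so that the next check sees it as stale.

```cpp
#include <algorithm>
#include <chrono>

// If the stored "last full GC" time lies in the future (for example a
// nanosecond tick count that was read back as microseconds), discard it.
// Assumption: "reset" means falling back to time zero, i.e. "never ran".
std::chrono::microseconds EffectiveLastGcTime(std::chrono::microseconds Stored,
                                              std::chrono::microseconds Now)
{
    return (Stored > Now) ? std::chrono::microseconds{0} : Stored;
}

// Clamp the displayed countdown into [0, Interval] so a clock anomaly can
// never produce output such as "Full GC in 492128002h".
std::chrono::seconds ClampCountdown(std::chrono::seconds Remaining,
                                    std::chrono::seconds Interval)
{
    return std::clamp(Remaining, std::chrono::seconds{0}, Interval);
}
```

A tick count of roughly 1.77e18 is a plausible nanosecond timestamp for 2026; read as microseconds it is about 1.77e12 seconds, or roughly 56,000 years past 1970, which is consistent with the "~year 58,000" figure in the commit message.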
* Transparent proxy mode (#823) [Stefan Boberg, 2026-03-12, 1 file, -0/+1]
  Adds a **transparent TCP proxy mode** to zenserver (activated via `zenserver proxy`), allowing it to sit between clients and upstream Zen servers to inspect and monitor HTTP/1.x traffic in real time. Primarily useful during development, to be able to observe multi-server/client interactions in one place.
  - **Dedicated proxy port** -- Proxy mode defaults to port 8118 with its own data directory to avoid collisions with a normal zenserver instance.
  - **TCP proxy core** (`src/zenserver/proxy/`) -- A new transparent TCP proxy that forwards connections to upstream targets, with support for both TCP/IP and Unix socket listeners. Multi-threaded I/O for connection handling. Supports Unix domain sockets for both upstream/downstream.
  - **HTTP traffic inspection** -- Parses HTTP/1.x request/response streams inline to extract method, path, status, content length, and WebSocket upgrades without breaking the proxied data.
  - **Proxy dashboard** -- A web UI showing live connection stats, per-target request counts, active connections, bytes transferred, and client IP/session ID rollups.
  - **Server mode display** -- Dashboard banner now shows the running server mode (Zen Proxy, Zen Compute, etc.).
  Supporting changes included in this branch:
  - **Wildcard log level matching** -- Log levels can now be set per-category using wildcard patterns (e.g. `proxy.*=debug`).
  - **`zen down --all`** -- New flag to shut down all running zenserver instances; also used by the new `xmake kill` task.
  - Minor test stability fixes (flaky hash collisions, per-thread RNG seeds).
  - Support ZEN_MALLOC environment variable for default allocator selection and switch default to rpmalloc
  - Fixed sentry-native build to allow LTO on Windows
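To make the wildcard log level item concrete, here is one way a `pattern=level` spec such as `proxy.*=debug` could be parsed and matched. It is a sketch under assumptions: `*` is taken to match any run of characters (including none), and `LevelRule`, `ParseLevelRule`, and `MatchesWildcard` are hypothetical names, not zenserver APIs.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Glob-style match where '*' matches any (possibly empty) run of characters.
bool MatchesWildcard(std::string_view Pattern, std::string_view Text)
{
    std::size_t P = 0, T = 0;
    std::size_t Star = std::string_view::npos; // position of the last '*' seen
    std::size_t Mark = 0;                      // text position to retry from
    while (T < Text.size())
    {
        if (P < Pattern.size() && Pattern[P] == '*')
        {
            Star = P++;
            Mark = T;
        }
        else if (P < Pattern.size() && Pattern[P] == Text[T])
        {
            ++P;
            ++T;
        }
        else if (Star != std::string_view::npos)
        {
            P = Star + 1; // backtrack: let the last '*' absorb one more character
            T = ++Mark;
        }
        else
        {
            return false;
        }
    }
    while (P < Pattern.size() && Pattern[P] == '*')
    {
        ++P;
    }
    return P == Pattern.size();
}

struct LevelRule
{
    std::string Pattern;
    std::string Level;
};

// Splits "pattern=level"; returns nullopt if '=' is missing or the pattern is empty.
std::optional<LevelRule> ParseLevelRule(std::string_view Spec)
{
    const std::size_t Eq = Spec.find('=');
    if (Eq == std::string_view::npos || Eq == 0)
    {
        return std::nullopt;
    }
    return LevelRule{std::string(Spec.substr(0, Eq)), std::string(Spec.substr(Eq + 1))};
}
```

Under these assumptions, `proxy.*` matches `proxy.http` but not the bare category `proxy`, since the pattern requires the literal dot.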
* HttpClient using libcurl, Unix Sockets for HTTP. HTTPS support (#770) [Stefan Boberg, 2026-03-10, 1 file, -1/+1]
  The main goal of this change is to eliminate the cpr back-end altogether and replace it with the curl implementation. I would expect to drop cpr as soon as we feel happy with the libcurl back-end. That would leave us with a direct dependency on libcurl only, and cpr can be eliminated as a dependency.
  ### HttpClient Backend Overhaul
  - Implemented a new **libcurl-based HttpClient** backend (`httpclientcurl.cpp`, ~2000 lines) as an alternative to the cpr-based one
  - Made HttpClient backend **configurable at runtime** via constructor arguments and `-httpclient=...` CLI option (for zen, zenserver, and tests)
  - Extended HttpClient test suite to cover multipart/content-range scenarios
  ### Unix Domain Socket Support
  - Added Unix domain socket support to **httpasio** (server side)
  - Added Unix domain socket support to **HttpClient**
  - Added Unix domain socket support to **HttpWsClient** (WebSocket client)
  - Templatized `HttpServerConnectionT<SocketType>` and `WsAsioConnectionT<SocketType>` to handle TCP, Unix, and SSL sockets uniformly via `if constexpr` dispatch
  ### HTTPS Support
  - Added **preliminary HTTPS support to httpasio** (for Mac/Linux via OpenSSL)
  - Added **basic HTTPS support for http.sys** (Windows)
  - Implemented HTTPS test for httpasio
  - Split `InitializeServer` into smaller sub-functions for http.sys
  ### Other Notable Changes
  - Improved **zenhttp-test stability** with dynamic port allocation
  - Enhanced port retry logic in http.sys (handles ERROR_ACCESS_DENIED)
  - Fatal signal/exception handlers for backtrace generation in tests
  - Added `zen bench http` subcommand to exercise network + HTTP client/server communication stack
* Dashboard overhaul, compute integration (#814) [Stefan Boberg, 2026-03-09, 1 file, -2/+2]
  - **Frontend dashboard overhaul**: Unified compute/main dashboards into a single shared UI. Added new pages for cache, projects, metrics, sessions, info (build/runtime config, system stats). Added live-update via WebSockets with pause control, sortable detail tables, themed styling. Refactored compute/hub/orchestrator pages into modular JS.
  - **HTTP server fixes and stats**: Fixed http.sys local-only fallback when default port is in use, implemented root endpoint redirect for http.sys, fixed Linux/Mac port reuse. Added /stats endpoint exposing HTTP server metrics (bytes transferred, request rates). Added WebSocket stats tracking.
  - **OTEL/diagnostics hardening**: Improved OTLP HTTP exporter with better error handling and resilience. Extended diagnostics services configuration.
  - **Session management**: Added new sessions service with HTTP endpoints for registering, updating, querying, and removing sessions. Includes session log file support. This is still WIP.
  - **CLI subcommand support**: Added support for commands with subcommands in the zen CLI tool, with improved command dispatch.
  - **Misc**: Exposed CPU usage/hostname to frontend, fixed JS compact binary float32/float64 decoding, limited projects displayed on front page to 25 sorted by last access, added vscode:// link support.
  Also contains some fixes from TSAN analysis.
* compute orchestration (#763) [Stefan Boberg, 2026-03-04, 1 file, -17/+708]
  - Added local process runners for Linux/Wine, Mac with some sandboxing support
  - Horde & Nomad provisioning for development and testing
  - Client session queues with lifecycle management (active/draining/cancelled), automatic retry with configurable limits, and manual reschedule API
  - Improved web UI for orchestrator, compute, and hub dashboards with WebSocket push updates
  - Some security hardening
  - Improved scalability and `zen exec` command
  Still experimental - compute support is disabled by default
* GC - fix handling of attachment ranges, http access token expiration, lock file retry logic (#766) [Stefan Boberg, 2026-02-20, 1 file, -3/+3]
  * GC - fix handling of attachment ranges
  * fix trace/log strings
  * fix HTTP access token expiration time logic
  * added missing lock retry in zenserver startup
* structured compute basics (#714) [Stefan Boberg, 2026-02-18, 1 file, -0/+330]
  This change adds the `zencompute` component, which can be used to distribute work dispatched from UE using the DDB (Derived Data Build) APIs via zenserver.
  It also adds a distinct zenserver compute mode (`zenserver compute`), which is intended to be used for leaf compute nodes.
  To exercise the compute functionality without directly involving UE, a `zen exec` subcommand is also added, which can be used to feed replays through the system.
  All new functionality is considered *experimental* and disabled by default at this time, behind the `zencompute` option in xmake config.