Commit message | Author | Age | Files | Lines
* Add Overmind provisioner alongside Horde and Nomad (sb/compute-overmind) | Stefan Boberg | 8 days | 12 | -0/+991
|   Introduces the zenovermind module with an HTTP client targeting the Overmind REST gateway (/v1/jobs) and a management-thread provisioner that schedules, polls, and cancels jobs following the same pattern as the existing Nomad provisioner.
|   Wired into the compute server with full CLI options (--overmind-*), lifecycle management, and maintenance tick support behind the ZEN_WITH_OVERMIND compile flag.
|
* 5.8.4 (v5.8.4) | Dan Engelbrecht | 8 days | 1 | -1/+1
|
* Hub proxy returns graceful responses when an instance is unavailable instead of a generic bad gateway error (#956) | Dan Engelbrecht | 8 days | 2 | -3/+28
|
* 5.8.4-pre3 (v5.8.4-pre3) | Dan Engelbrecht | 9 days | 1 | -1/+1
|
* Merge pull request #955 from ue-foundation/zs/shared-memory-open-flags-fix | Zousar Shaker | 9 days | 3 | -6/+7
|\
| |   Stop using O_CLOEXEC in shm_open
| |
| * Removing CLOEXEC use on shared memory descriptors | zousar | 9 days | 2 | -5/+0
| |   According to documentation, shm_open already sets O_CLOEXEC.
| |
| * Changelog | zousar | 9 days | 1 | -0/+1
| |
| * Fix copy and paste errors | zousar | 9 days | 1 | -3/+3
| |
| * Stop using O_CLOEXEC in shm_open | zousar | 9 days | 2 | -6/+11
|/
* fix utf characters in source code (#953) | Dan Engelbrecht | 9 days | 88 | -321/+321
|
* use mimalloc by default (#952) | Dan Engelbrecht | 9 days | 2 | -2/+3
|   * make mimalloc default again
|
* Compute OIDC auth, async Horde agents, and orchestrator improvements (#913) | Stefan Boberg | 9 days | 62 | -3970/+3649
|   Rework of the Horde agent subsystem from synchronous per-thread I/O to an async ASIO-driven architecture, plus provisioner scale-down with graceful draining, OIDC authentication, scheduler improvements, and dashboard UI for provisioner control.
|
|   ### Async Horde Agent Rewrite
|   - Replace synchronous `HordeAgent` (one thread per agent, blocking I/O) with `AsyncHordeAgent`, an ASIO state machine running on a shared `io_context` thread pool
|   - Replace `TcpComputeTransport`/`AesComputeTransport` with `AsyncTcpComputeTransport`/`AsyncAesComputeTransport`
|   - Replace `AgentMessageChannel` with `AsyncAgentMessageChannel` using frame queuing and ASIO timers
|   - Delete `ComputeBuffer` and `ComputeChannel` ring-buffer classes (no longer needed)
|
|   ### Provisioner Drain / Scale-Down
|   - `HordeProvisioner` can now drain agents when target core count is lowered: queries each agent's `/compute/session/status` for workload, selects candidates by largest-fit/lowest-workload, and sends `/compute/session/drain`
|   - Configurable `--horde-drain-grace-period` (default 300s) before force-kill
|   - Implement `IProvisionerStateProvider` interface to expose provisioner state to the orchestrator HTTP layer
|   - Forward `--coordinator-session`, `--provision-clean`, and `--provision-tracehost` through both Horde and Nomad provisioners to spawned workers
|
|   ### OIDC Authentication
|   - `HordeClient` accepts an `AccessTokenProvider` (refreshable token function) as alternative to static `--horde-token`
|   - Wire up `OidcToken.exe` auto-discovery via `httpclientauth::CreateFromOidcTokenExecutable` with `--HordeUrl` mode
|   - New `--horde-oidctoken-exe-path` CLI option for explicit path override
|
|   ### Orchestrator & Scheduler
|   - Orchestrator generates a session ID at startup; workers include `coordinator_session` in announcements so the orchestrator can reject stale-session workers
|   - New `Rejected` action state: when a remote runner declines at capacity, the action is rescheduled without retry count increment
|   - Reduce scheduler lock contention: snapshot pending actions under shared lock, sort/trim outside the lock
|   - Parallelize remote action submission across runners via `WorkerThreadPool` with slow-submit warnings
|   - New action field `FailureReason` populated by all runner types (exit codes, sandbox failures, exceptions)
|   - New endpoints: `session/drain`, `session/status`, `session/sunset`, `provisioner/status`, `provisioner/target`
|
|   ### Remote Execution
|   - Eager-attach mode for `RemoteHttpRunner`: bundles all attachments upfront in a `CbPackage` for single-roundtrip submits
|   - Track in-flight submissions to prevent over-queuing
|   - Show remote runner hostname in `GetDisplayName()`
|   - `--announce-url` to override the endpoint announced to the coordinator (e.g. relay-visible address)
|
|   ### Frontend Dashboard
|   - Delete standalone `compute.html` (925 lines) and `orchestrator.html` (669 lines), consolidated into JS page modules
|   - Add provisioner panel to orchestrator dashboard: target/active/estimated core counts, draining agent count
|   - Editable target-cores input with debounced POST to `/orch/provisioner/target`
|   - Per-agent provisioning status badges (active / draining / deallocated) in the agents table
|   - Active vs total CPU counts in agents summary row
|
|   ### CLI
|   - New `zen compute record-start` / `record-stop` subcommands
|   - `zen exec` progress bar with submit and completion phases, atomic work counters, `--progress` mode (Pretty/Plain/Quiet)
|
|   ### Other
|   - `DataDir` supports environment variable expansion
|   - Worker manifest validation checks for `worker.zcb` marker to detect incomplete cached directories
|   - Linux/Mac runners `nice(5)` child processes to avoid starving the main server
|   - `ComputeService::SetShutdownCallback` wired to `RequestExit` via `session/sunset`
|   - Curl HTTP client logs effective URL on failure
|   - `MachineInfo` carries `Pool` and `Mode` from Horde response
|   - Horde bundle creation includes `.pdb` on Windows
|
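The scheduler item in #913 ("snapshot pending actions under shared lock, sort/trim outside the lock") can be illustrated with a minimal sketch. `PendingAction`, `PendingQueue`, `SnapshotTopN`, and the priority field are hypothetical names chosen for illustration; the commit does not show the real types or the ordering criterion.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical pending-action record; the real type is not shown in the commit.
struct PendingAction
{
    int Id = 0;
    int Priority = 0;  // assumption: higher priority is scheduled first
};

class PendingQueue
{
public:
    void Add(PendingAction Action)
    {
        std::unique_lock<std::shared_mutex> Lock(m_Mutex);  // exclusive lock for writers
        m_Pending.push_back(Action);
    }

    // Copy the pending list under a shared lock, then sort and trim the
    // private copy after the lock has been released.
    std::vector<PendingAction> SnapshotTopN(std::size_t MaxCount) const
    {
        std::vector<PendingAction> Snapshot;
        {
            std::shared_lock<std::shared_mutex> Lock(m_Mutex);  // readers do not block each other
            Snapshot = m_Pending;
        }
        std::stable_sort(Snapshot.begin(), Snapshot.end(),
                         [](const PendingAction& A, const PendingAction& B) { return A.Priority > B.Priority; });
        if (Snapshot.size() > MaxCount)
        {
            Snapshot.resize(MaxCount);
        }
        return Snapshot;
    }

private:
    mutable std::shared_mutex  m_Mutex;
    std::vector<PendingAction> m_Pending;
};
```

The cost of the sort is paid on a private copy, so writers calling `Add` only wait for the duration of the vector copy.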
* 5.8.4-pre2 (v5.8.4-pre2) | Dan Engelbrecht | 9 days | 1 | -1/+1
|
* log curl raw error on retry, add retry on CURLE_PARTIAL_FILE error (#951) | Dan Engelbrecht | 9 days | 2 | -1/+4
|   * log curl raw error on retry, add retry on CURLE_PARTIAL_FILE error
|
* silence errors due to abort (#950) | Dan Engelbrecht | 9 days | 2 | -188/+334
|   * silence exceptions in threaded requests to build storage if already aborted
|
* Some minor polish from tourist branch (#949) | Stefan Boberg | 9 days | 13 | -44/+413
|   - Replace per-type fmt::formatter specializations (StringBuilderBase, NiceBase) with a single generic formatter using a HasStringViewConversion concept
|   - Add ThousandsNum for comma-separated integer formatting (e.g. "1,234,567")
|   - Thread naming now accepts a sort hint for trace ordering
|   - Fix main thread trace registration to use actual thread ID and sort first
|   - Add ExpandEnvironmentVariables() for expanding %VAR% references in strings, with tests
|   - Add ParseHexBytes() overload with expected byte count validation
|   - Add Flag_BelowNormalPriority to CreateProcOptions (BELOW_NORMAL_PRIORITY_CLASS on Windows, setpriority on POSIX)
|   - Add PrettyScroll progress bar mode that pins the status line to the bottom of the terminal using scroll regions, with signal handler cleanup for Ctrl+C/SIGTERM
|
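For the comma-separated formatting mentioned in #949, a rough standalone sketch of the idea is shown here. `FormatWithThousands` is a hypothetical helper written for illustration and is not the `ThousandsNum` implementation, whose interface is not shown in the log.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: inserts a comma between every group of three digits,
// counted from the right.
std::string FormatWithThousands(std::uint64_t Value)
{
    const std::string Digits = std::to_string(Value);
    std::string       Result;
    Result.reserve(Digits.size() + Digits.size() / 3);

    // Number of digits in the leading (possibly short) group.
    const std::size_t Lead = Digits.size() % 3;
    for (std::size_t i = 0; i < Digits.size(); ++i)
    {
        if (i != 0 && (i % 3) == Lead)
        {
            Result.push_back(',');
        }
        Result.push_back(Digits[i]);
    }
    return Result;
}
```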
* Add MemoryCidStore and ChunkStore interface (#940) | Stefan Boberg | 9 days | 11 | -41/+331
|   This PR introduces an in-memory `CidStore` option primarily for use with compute, to avoid hitting disk for ephemeral data which is not really worth persisting. And in particular not worth paying the critical path cost of persistence.
|
|   - **MemoryCidStore**: In-memory CidStore implementation backed by a hash map, optionally layered over a standard CidStore. Writes to the backing store are dispatched asynchronously via a dedicated flush thread to avoid blocking callers on disk I/O. Reads check memory first, then fall back to the backing store without caching the result.
|   - **ChunkStore interface**: Extract `ChunkStore` abstract class (`AddChunk`, `ContainsChunk`, `FilterChunks`) and `FallbackChunkResolver` into `zenstore.h` so `HttpComputeService` can accept different storage backends for action inputs vs worker binaries. `CidStore` and `MemoryCidStore` both implement `ChunkStore`.
|   - **Compute service wiring**: `HttpComputeService` takes two `ChunkStore&` params (action + worker). The compute server uses `MemoryCidStore` for actions (no disk persistence needed) and disk-backed `CidStore` for workers (cross-action reuse). The storage server passes its `CidStore` for both (unchanged behavior).
|
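The read path described for `MemoryCidStore` in #940 (memory first, then the backing store, without caching the fallback result) might look roughly like the sketch below. `LayeredMemoryStore`, `ChunkMap`, `FindChunk`, and `IsInMemory` are hypothetical names, chunks are plain strings here, and the asynchronous flush thread for writes is left out entirely.

```cpp
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a disk-backed store, keyed by content hash.
using ChunkMap = std::unordered_map<std::string, std::string>;

class LayeredMemoryStore
{
public:
    explicit LayeredMemoryStore(const ChunkMap* Backing = nullptr) : m_Backing(Backing) {}

    // Writes land in memory only; write-through to the backing store is omitted.
    void AddChunk(const std::string& Hash, std::string Data) { m_Memory[Hash] = std::move(Data); }

    bool IsInMemory(const std::string& Hash) const { return m_Memory.count(Hash) != 0; }

    // Memory first, then the optional backing store. A hit in the backing
    // store is returned as-is and is not copied into the in-memory map.
    std::optional<std::string> FindChunk(const std::string& Hash) const
    {
        if (auto It = m_Memory.find(Hash); It != m_Memory.end())
        {
            return It->second;
        }
        if (m_Backing != nullptr)
        {
            if (auto It = m_Backing->find(Hash); It != m_Backing->end())
            {
                return It->second;
            }
        }
        return std::nullopt;
    }

private:
    ChunkMap        m_Memory;
    const ChunkMap* m_Backing;
};
```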
* 5.8.4-pre1 (v5.8.4-pre1) | Dan Engelbrecht | 9 days | 1 | -1/+1
|
* minor fixups (#948) | Dan Engelbrecht | 9 days | 6 | -39/+54
|   * objectstore.cpp - m_TotalBytesServed now tracks all range cases (single, multi, 416)
|   * async http: docstring corrected: curl_multi_socket_action() / ASIO socket async_wait remove non-ascii characters
|   * fix singlethreaded gc option in lua to not use dash
|   * fix changelog order
|
* Logging and diagnostics improvements (#941) | Stefan Boberg | 9 days | 15 | -128/+290
|   Core logging and system diagnostics improvements, extracted from the compute branch.
|
|   ### Logging
|   - **Elapsed timestamps**: Console log now shows elapsed time since launch `[HH:MM:SS.mmm]` instead of full date/time; file logging is unchanged
|   - **Short level names**: 3-letter short level names (`trc`/`dbg`/`inf`/`wrn`/`err`/`crt`) used by both console and file formatters via `ShortToStringView()`
|   - **Consistent field order**: Standardized to `[timestamp] [level] [logger]` across both console and file formatters
|   - **Slim LogMessage/LogPoint**: Remove redundant fields from `LogMessage` (derive level/source from `LogPoint`), flatten `LogPoint` to inline filename/line fields, shrink `LogLevel` to `int8_t` with `static_assert(sizeof(LogPoint) <= 32)`
|   - **Remove default member initializers** and static default `LogPoint` from `LogMessage`; all fields initialized by constructor
|   - **LoggerRef string constructor**: Convenience constructor accepting a string directly
|   - **Fix SendMessage macro collision**: Replace `thread.h` include in `logmsg.h` with a forward declaration of `GetCurrentThreadId()` to avoid pulling in `windows.h` transitively
|
|   ### System Diagnostics
|   - **Cache static system metrics**: Add `RefreshDynamicSystemMetrics()` that only queries values that change at runtime (available memory, uptime, swap). `SystemMetricsTracker` snapshots full `GetSystemMetrics()` once at construction and reuses cached topology/total memory on each `Query()`, avoiding repeated `GetLogicalProcessorInformationEx` traversal on Windows, `/proc/cpuinfo` parsing on Linux, and `sysctl` topology calls on macOS
|
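The elapsed-time console prefix `[HH:MM:SS.mmm]` from #941 can be produced along the following lines. `FormatElapsed` is a hypothetical function, and the assumptions that the hour field keeps counting past 24 and that the duration is non-negative are mine, not taken from the commit.

```cpp
#include <chrono>
#include <cstdio>
#include <string>

// Hypothetical formatter for a non-negative elapsed duration.
// Assumption: hours are not wrapped at 24, they simply keep counting.
std::string FormatElapsed(std::chrono::milliseconds Elapsed)
{
    const long long Total   = static_cast<long long>(Elapsed.count());
    const long long Hours   = Total / 3600000LL;
    const long long Minutes = (Total / 60000LL) % 60;
    const long long Seconds = (Total / 1000LL) % 60;
    const long long Millis  = Total % 1000LL;

    char Buffer[32];
    std::snprintf(Buffer, sizeof(Buffer), "[%02lld:%02lld:%02lld.%03lld]", Hours, Minutes, Seconds, Millis);
    return Buffer;
}
```

A logger would typically call this with the difference between a steady-clock reading and the reading captured at process launch.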
* update minio (#947) | Dan Engelbrecht | 9 days | 6 | -0/+2
|
* hub instance malloc trace (#946) | Dan Engelbrecht | 9 days | 8 | -0/+91
|   `--hub-instance-malloc` selects the memory allocator for child instances
|   `--hub-instance-trace` sets trace channels for child instances
|   `--hub-instance-tracehost` sets the trace streaming host for child instances
|   `--hub-instance-tracefile` sets the trace output file for child instances
|   add {moduleid} and {port} placeholder support for tracefile
|
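The `{moduleid}` and `{port}` placeholders for the trace file path in #946 amount to a plain token substitution. The sketch below assumes literal, case-sensitive replacement of every occurrence; `ReplaceAll`, `ExpandTraceFilePath`, and the example file name are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Replace every occurrence of Token in Text with Value.
std::string ReplaceAll(std::string Text, std::string_view Token, std::string_view Value)
{
    if (Token.empty())
    {
        return Text;
    }
    std::size_t Pos = 0;
    while ((Pos = Text.find(Token, Pos)) != std::string::npos)
    {
        Text.replace(Pos, Token.size(), Value);
        Pos += Value.size();  // continue after the inserted value
    }
    return Text;
}

// Hypothetical expansion of the two placeholders named in the commit.
std::string ExpandTraceFilePath(std::string Pattern, std::string_view ModuleId, int Port)
{
    Pattern = ReplaceAll(std::move(Pattern), "{moduleid}", ModuleId);
    return ReplaceAll(std::move(Pattern), "{port}", std::to_string(Port));
}
```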
* Add manual test workflow with configurable sanitizers and allocators (#944) | Stefan Boberg | 11 days | 1 | -0/+298
|   - Adds a `workflow_dispatch` workflow ("Manual Test Run") that can be triggered from the Actions tab
|   - Configurable options: platform, memory allocator (`--malloc=stomp`/mimalloc/rpmalloc), sanitizer (asan/tsan/msan), test suite, and freeform extra arguments
|   - Mirrors the build & test steps from `validate.yml` but always builds debug with sentry disabled, and with longer timeout (40min) to accommodate sanitizer overhead
|
* Dashboard stats tiles no longer flicker (#943) | Dan Engelbrecht | 11 days | 7 | -190/+199
|
* removed s3 test program (#942) | Stefan Boberg | 11 days | 3 | -535/+0
|   Remove the `zens3-testbed` target and source files. This was a standalone test harness for S3 operations that is no longer needed.
|
* `--consul-register-hub` option to disable hub parent service Consul registration (#939) | Dan Engelbrecht | 11 days | 4 | -18/+37
|
* hub deprovision all (#938) | Dan Engelbrecht | 11 days | 4 | -5/+110
|   * implement "deprovision all" for hub
|
* dashboard search (#936) | Dan Engelbrecht | 11 days | 7 | -13/+187
|   - Improvement: Dashboard paginated lists now include a search input that jumps to the page containing the first match and highlights the row
|   - Improvement: Dashboard paginated lists show a loading indicator while fetching data
|   - Improvement: Hub dashboard navigates to and highlights newly provisioned instances
|
* improve messaging when zen builds download target disk does not have enough space (#935) | Dan Engelbrecht | 11 days | 2 | -1/+5
|
* update rpmalloc and tweak for commit/decommit churn (#934) | Dan Engelbrecht | 11 days | 5 | -88/+226
|   - Improvement: Updated rpmalloc to develop branch commit feb43aee0d4d (2025-10-26), which fixes `VirtualAlloc(MEM_COMMIT)` failures being silently ignored under memory pressure
|   - Improvement: Increased rpmalloc page decommit thresholds to reduce commit/decommit churn under high allocation turnover
|
* HTTP range responses (RFC 7233) - httpobjectstore (#928) | Dan Engelbrecht | 12 days | 11 | -137/+515
|   - Improvement: HTTP range responses (RFC 7233) are now fully compliant across the object store and build store
|     - 206 Partial Content responses now include a `Content-Range` header; previously absent for single-range requests, which broke `HttpClient::GetRanges()`
|     - 416 Range Not Satisfiable responses now include `Content-Range: bytes */N` as required by RFC 7233
|     - Out-of-bounds range requests return 416 Range Not Satisfiable (was 400 Bad Request)
|     - Single-byte ranges (`bytes=N-N`) are now correctly accepted (were previously rejected)
|     - Range byte positions widened from 32-bit to 64-bit; RFC 7233 imposes no size limit on byte range values
|     - Build store binary GET requests with a Range header now return 206 Partial Content with `Content-Range` (previously returned 200 OK without it)
|
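The status codes and `Content-Range` values listed in #928 can be illustrated with a small resolver for a single byte range. This is a simplified sketch, not the object store code: `RangeResult` and `ResolveSingleRange` are hypothetical names, only the `bytes=N-M` and `bytes=N-` forms are handled, and suffix ranges, multi-range requests, and malformed headers are treated as "ignore the header and answer 200".

```cpp
#include <algorithm>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <system_error>

struct RangeResult
{
    int           Status = 200;  // 200 (header ignored), 206 or 416
    std::uint64_t First  = 0;    // meaningful when Status == 206
    std::uint64_t Last   = 0;    // inclusive, meaningful when Status == 206
    std::string   ContentRange;  // empty when Status == 200
};

// Parse a whole string_view as an unsigned 64-bit decimal number.
static std::optional<std::uint64_t> ParseU64(std::string_view Text)
{
    std::uint64_t Value = 0;
    const char*   End   = Text.data() + Text.size();
    auto [Ptr, Ec]      = std::from_chars(Text.data(), End, Value);
    if (Text.empty() || Ec != std::errc{} || Ptr != End)
    {
        return std::nullopt;
    }
    return Value;
}

RangeResult ResolveSingleRange(std::string_view Header, std::uint64_t Size)
{
    RangeResult                Result;
    constexpr std::string_view Prefix = "bytes=";
    if (Header.substr(0, Prefix.size()) != Prefix)
    {
        return Result;
    }
    Header.remove_prefix(Prefix.size());

    const std::size_t Dash = Header.find('-');
    if (Dash == std::string_view::npos || Header.find(',') != std::string_view::npos)
    {
        return Result;  // malformed or multi-range: not handled in this sketch
    }
    const std::string_view LastText = Header.substr(Dash + 1);
    const auto             First    = ParseU64(Header.substr(0, Dash));
    if (!First)
    {
        return Result;  // suffix form ("bytes=-N") or malformed: not handled here
    }

    if (*First >= Size)
    {
        Result.Status       = 416;
        Result.ContentRange = "bytes */" + std::to_string(Size);
        return Result;
    }

    std::uint64_t Last = Size - 1;  // open-ended range runs to the end
    if (!LastText.empty())
    {
        const auto Parsed = ParseU64(LastText);
        if (!Parsed || *Parsed < *First)
        {
            return Result;  // malformed: header ignored
        }
        Last = std::min(*Parsed, Last);  // clamp to the last valid byte
    }

    Result.Status       = 206;
    Result.First        = *First;
    Result.Last         = Last;
    Result.ContentRange = "bytes " + std::to_string(*First) + "-" + std::to_string(Last) + "/" + std::to_string(Size);
    return Result;
}
```

Using `std::uint64_t` for the positions mirrors the widening from 32-bit to 64-bit mentioned in the commit.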
* reduce test runtime (#933) | Dan Engelbrecht | 12 days | 15 | -1386/+1242
|   * reduce zenserver spawns in tests
|   * fix filesystemutils wrong test suite name
|   * tweak tests for faster runtime
|   * reduce more test runtime
|   * more wall time improvements
|   * fast http and processmanager tests
|
* Update CHANGELOG.md | Stefan Boberg | 13 days | 1 | -0/+4
|
* Fix ZenServerState stale entry detection on PID reuse (k8s) (#932) | Stefan Boberg | 13 days | 1 | -0/+31
|   - Detect stale shared-memory entries whose PID matches the current process but predate our registration (m_OurEntry == nullptr)
|   - Sweep() now reclaims such entries instead of skipping them
|   - Lookup() and LookupByEffectivePort() skip stale same-PID entries
|   - Fixes startup failure on k8s where PID 1 is always reused after an unclean shutdown
|
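The staleness rule in #932 (an entry carrying our own PID while we have not registered yet must be left over from a previous process) can be sketched as below. `StateEntry`, `IsStaleSamePidEntry`, and `LookupByPort` are hypothetical simplifications; the real shared-memory layout and the exact checks in `ZenServerState` are not shown in the log.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one shared-memory state entry.
struct StateEntry
{
    std::uint32_t Pid  = 0;  // 0 means the slot is unused (assumption)
    std::uint16_t Port = 0;
};

// An entry with our PID that exists before we have registered (OurEntry is
// null) cannot belong to this process, so it is treated as stale.
bool IsStaleSamePidEntry(const StateEntry& Entry, std::uint32_t CurrentPid, const StateEntry* OurEntry)
{
    return Entry.Pid != 0 && Entry.Pid == CurrentPid && OurEntry == nullptr;
}

// Lookup that skips stale same-PID entries, as the commit describes for Lookup().
const StateEntry* LookupByPort(const std::vector<StateEntry>& Entries,
                               std::uint16_t                  Port,
                               std::uint32_t                  CurrentPid,
                               const StateEntry*              OurEntry)
{
    for (const StateEntry& Entry : Entries)
    {
        if (Entry.Port != Port || IsStaleSamePidEntry(Entry, CurrentPid, OurEntry))
        {
            continue;
        }
        return &Entry;
    }
    return nullptr;
}
```

In a container where the server always runs as PID 1, the first branch is what lets a fresh process ignore the entry left behind by the previous PID 1.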
* Add async HTTP client (curl_multi + ASIO) (#918) | Stefan Boberg | 13 days | 6 | -269/+1776
|   - Adds `AsyncHttpClient`, an asynchronous HTTP client using `curl_multi_socket_action` integrated with ASIO for event-driven I/O. Supports GET, POST, PUT, DELETE, HEAD with both callback-based and `std::future`-based APIs.
|   - Extracts shared curl helpers (callbacks, URL encoding, header construction, error mapping) into `httpclientcurlhelpers.h`, eliminating duplication between the sync and async implementations.
|
|   ## Design
|   - All curl_multi state is serialized on an `asio::strand`, safe with multi-threaded io_contexts.
|   - Two construction modes: owned io_context (creates internal thread) or external io_context (caller runs the loop).
|   - Socket readiness is detected via `asio::ip::tcp::socket::async_wait` driven by curl's `CURLMOPT_SOCKETFUNCTION`/`CURLMOPT_TIMERFUNCTION`: no polling, sub-millisecond latency.
|   - Completion callbacks are dispatched off the strand onto the io_context so slow callbacks don't starve the curl event loop. Exceptions in callbacks are caught and logged.
|
|   ## Files
|   | File | Change |
|   |------|--------|
|   | `zenhttp/include/zenhttp/asynchttpclient.h` | New public header |
|   | `zenhttp/clients/asynchttpclient.cpp` | Implementation (~1000 lines) |
|   | `zenhttp/clients/httpclientcurlhelpers.h` | Shared curl helpers extracted from sync client |
|   | `zenhttp/clients/httpclientcurl.cpp` | Removed duplicated helpers, uses shared header |
|   | `zenhttp/asynchttpclient_test.cpp` | 8 test cases: verbs, payloads, callbacks, concurrency, external io_context, connection errors |
|   | `zenhttp/zenhttp.cpp` | Forcelink registration for new tests |
|
* migrate from http_parser to llhttp (#929) | Dan Engelbrecht | 13 days | 12 | -143/+423
|
* 5.8.3 (v5.8.3) | Dan Engelbrecht | 14 days | 1 | -1/+1
|
* 5.8.3-pre2 (v5.8.3-pre2) | Dan Engelbrecht | 14 days | 1 | -1/+1
|
* fully provisioned hub instances now set initial check status to "passing" in consul (#930) | Dan Engelbrecht | 14 days | 4 | -7/+15
|
* use correct return code for unsupported multirange requests in objectstore (#927) | Dan Engelbrecht | 2026-04-08 | 3 | -2/+65
|
* don't hard fail if .pending folder is not empty on oplog export (#926) | Dan Engelbrecht | 2026-04-08 | 3 | -2/+11
|
* fix missing chunk in oplog export (#925) | Dan Engelbrecht | 2026-04-08 | 2 | -0/+161
|   * add reused block to oplog during export
|
* hydration data obliteration (#923) | Dan Engelbrecht | 2026-04-08 | 16 | -153/+738
|   - Feature: Hub obliterate operation deletes all local and backend hydration data for a module
|   - Improvement: Hub dashboard adds obliterate button for individual, bulk, and by-name module deletion
|
* sort items on dashboard (#924) | Dan Engelbrecht | 2026-04-07 | 3 | -79/+114
|   * add pagination and consistent sorting on cache and projects ui pages
|
* add pagination of cooked projects and caches on dashboard front page (#922) | Dan Engelbrecht | 2026-04-07 | 3 | -65/+173
|
* incremental dehydrate (#921) | Dan Engelbrecht | 2026-04-07 | 22 | -1148/+1910
|   - Feature: Incremental CAS-based hydration/dehydration replacing the previous full-copy approach
|   - Feature: S3 hydration backend with multipart upload/download support
|   - Feature: Configurable thread pools for hub instance provisioning and hydration
|     `--hub-instance-provision-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation.
|     `--hub-hydration-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation.
|   - Improvement: Hub triggers GC on instance before deprovisioning to compact storage before dehydration
|   - Improvement: GC status now reports pending triggers as running
|   - Improvement: S3 client debug logging gated behind verbose mode to reduce log noise at default verbosity
|   - Improvement: Hub dashboard Resources tile now shows total memory
|   - Improvement: `filesystemutils` moved from `zenremotestore` to `zenutil` for broader reuse
|   - Improvement: Hub uses separate provision and hydration worker pools to avoid deadlocks
|   - Improvement: Hibernate/wake/deprovision on non-existent or already-in-target-state modules are idempotent
|   - Improvement: `ScopedTemporaryDirectory` with empty path now creates a temporary directory instead of asserting
|
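The thread-pool default `max(cpu_count / 4, 2)` quoted in #921 works out as in the short sketch below, assuming integer division. `DefaultWorkerThreads` is a hypothetical name, and taking `cpu_count` from `std::thread::hardware_concurrency()` is an assumption about where the value comes from.

```cpp
#include <algorithm>
#include <thread>

// max(cpu_count / 4, 2) with integer division (assumption).
unsigned DefaultWorkerThreads(unsigned CpuCount)
{
    return std::max(CpuCount / 4u, 2u);
}

// Convenience overload; hardware_concurrency() may return 0 when the count is
// unknown, in which case the floor of 2 applies.
unsigned DefaultWorkerThreads()
{
    return DefaultWorkerThreads(std::thread::hardware_concurrency());
}
```

So a 16-core host gets 4 threads per pool, while anything with 8 cores or fewer falls back to the floor of 2. The commit treats an explicit 0 separately, as a request for synchronous operation.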
* disable zencompute in bundle step | Stefan Boberg | 2026-04-03 | 1 | -0/+3
|
* 5.8.3-pre0 (v5.8.3-pre0) | Dan Engelbrecht | 2026-04-02 | 1 | -1/+1
|
* fix hub consul health endpoint registration (#917) | Dan Engelbrecht | 2026-04-02 | 3 | -1/+6
|   * use correct health endpoint for zenhubserver consul registration
|   * add total disk space on hub resource pane
|
* 5.8.2 (v5.8.2) | Dan Engelbrecht | 2026-04-02 | 1 | -1/+1
|