Commit message [refs] (Author, Age, Files, Lines)
* if a downloaded blob should be sent to cache, make sure it is disk based [de/fix-memory-usage-for-cache-upload] (Dan Engelbrecht, 12 hours, files 4, lines -179/+154)
|   keeping it in memory overloads memory when boost-worker-memory is enabled
* remove obsolete prime-cache-only flag (Dan Engelbrecht, 13 hours, files 4, lines -318/+174)
|
* fix OAuth client credentials content type override (#957) [HEAD, main] (Joakim Lindqvist, 19 hours, files 4, lines -2/+20)
|   - Bugfix: OAuth client credentials token request now sends correct `application/x-www-form-urlencoded` content type
|   - Improvement: HTTP client Content-Type in additional headers now overrides the payload content type
* 5.8.4 [v5.8.4] (Dan Engelbrecht, 25 hours, files 1, lines -1/+1)
|
* Hub proxy returns graceful responses when an instance is unavailable instead of a generic bad gateway error (#956) (Dan Engelbrecht, 26 hours, files 2, lines -3/+28)
|
* 5.8.4-pre3 [v5.8.4-pre3] (Dan Engelbrecht, 36 hours, files 1, lines -1/+1)
|
* Merge pull request #955 from ue-foundation/zs/shared-memory-open-flags-fix (Zousar Shaker, 37 hours, files 3, lines -6/+7)
|\    Stop using O_CLOEXEC in shm_open
| * Removing CLOEXEC use on shared memory descriptors (zousar, 38 hours, files 2, lines -5/+0)
| |   According to documentation, shm_open already sets O_CLOEXEC.
| * Changelog (zousar, 38 hours, files 1, lines -0/+1)
| |
| * Fix copy and paste errors (zousar, 38 hours, files 1, lines -3/+3)
| |
| * Stop using O_CLOEXEC in shm_open (zousar, 38 hours, files 2, lines -6/+11)
|/
* fix utf characters in source code (#953) (Dan Engelbrecht, 41 hours, files 88, lines -321/+321)
|
* use mimalloc by default (#952) (Dan Engelbrecht, 42 hours, files 2, lines -2/+3)
|   * make mimalloc default again
* Compute OIDC auth, async Horde agents, and orchestrator improvements (#913) (Stefan Boberg, 44 hours, files 62, lines -3970/+3649)
|   Rework of the Horde agent subsystem from synchronous per-thread I/O to an async ASIO-driven architecture, plus provisioner scale-down with graceful draining, OIDC authentication, scheduler improvements, and dashboard UI for provisioner control.
|
|   ### Async Horde Agent Rewrite
|   - Replace synchronous `HordeAgent` (one thread per agent, blocking I/O) with `AsyncHordeAgent`, an ASIO state machine running on a shared `io_context` thread pool
|   - Replace `TcpComputeTransport`/`AesComputeTransport` with `AsyncTcpComputeTransport`/`AsyncAesComputeTransport`
|   - Replace `AgentMessageChannel` with `AsyncAgentMessageChannel` using frame queuing and ASIO timers
|   - Delete `ComputeBuffer` and `ComputeChannel` ring-buffer classes (no longer needed)
|
|   ### Provisioner Drain / Scale-Down
|   - `HordeProvisioner` can now drain agents when target core count is lowered: queries each agent's `/compute/session/status` for workload, selects candidates by largest-fit/lowest-workload, and sends `/compute/session/drain`
|   - Configurable `--horde-drain-grace-period` (default 300s) before force-kill
|   - Implement `IProvisionerStateProvider` interface to expose provisioner state to the orchestrator HTTP layer
|   - Forward `--coordinator-session`, `--provision-clean`, and `--provision-tracehost` through both Horde and Nomad provisioners to spawned workers
|
|   ### OIDC Authentication
|   - `HordeClient` accepts an `AccessTokenProvider` (refreshable token function) as alternative to static `--horde-token`
|   - Wire up `OidcToken.exe` auto-discovery via `httpclientauth::CreateFromOidcTokenExecutable` with `--HordeUrl` mode
|   - New `--horde-oidctoken-exe-path` CLI option for explicit path override
|
|   ### Orchestrator & Scheduler
|   - Orchestrator generates a session ID at startup; workers include `coordinator_session` in announcements so the orchestrator can reject stale-session workers
|   - New `Rejected` action state: when a remote runner declines at capacity, the action is rescheduled without retry count increment
|   - Reduce scheduler lock contention: snapshot pending actions under shared lock, sort/trim outside the lock
|   - Parallelize remote action submission across runners via `WorkerThreadPool` with slow-submit warnings
|   - New action field `FailureReason` populated by all runner types (exit codes, sandbox failures, exceptions)
|   - New endpoints: `session/drain`, `session/status`, `session/sunset`, `provisioner/status`, `provisioner/target`
|
|   ### Remote Execution
|   - Eager-attach mode for `RemoteHttpRunner`: bundles all attachments upfront in a `CbPackage` for single-roundtrip submits
|   - Track in-flight submissions to prevent over-queuing
|   - Show remote runner hostname in `GetDisplayName()`
|   - `--announce-url` to override the endpoint announced to the coordinator (e.g. relay-visible address)
|
|   ### Frontend Dashboard
|   - Delete standalone `compute.html` (925 lines) and `orchestrator.html` (669 lines), consolidated into JS page modules
|   - Add provisioner panel to orchestrator dashboard: target/active/estimated core counts, draining agent count
|   - Editable target-cores input with debounced POST to `/orch/provisioner/target`
|   - Per-agent provisioning status badges (active / draining / deallocated) in the agents table
|   - Active vs total CPU counts in agents summary row
|
|   ### CLI
|   - New `zen compute record-start` / `record-stop` subcommands
|   - `zen exec` progress bar with submit and completion phases, atomic work counters, `--progress` mode (Pretty/Plain/Quiet)
|
|   ### Other
|   - `DataDir` supports environment variable expansion
|   - Worker manifest validation checks for `worker.zcb` marker to detect incomplete cached directories
|   - Linux/Mac runners `nice(5)` child processes to avoid starving the main server
|   - `ComputeService::SetShutdownCallback` wired to `RequestExit` via `session/sunset`
|   - Curl HTTP client logs effective URL on failure
|   - `MachineInfo` carries `Pool` and `Mode` from Horde response
|   - Horde bundle creation includes `.pdb` on Windows
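The scheduler item in #913 ("snapshot pending actions under shared lock, sort/trim outside the lock") can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `PendingAction`, `PendingQueue`, and the priority ordering are hypothetical names and are not taken from the zen sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <vector>

// Hypothetical pending-action record; the real scheduler's type is not shown in the log.
struct PendingAction
{
    int Id;
    int Priority;  // assumed ordering: higher priority is scheduled first
};

class PendingQueue
{
public:
    void Add(PendingAction Action)
    {
        std::unique_lock Lock(m_Mutex);  // writers take the exclusive lock
        m_Pending.push_back(Action);
    }

    // Copy the pending list under a shared lock, then sort and trim the copy
    // with no lock held, so readers and writers are blocked only for the copy.
    std::vector<PendingAction> PickNext(std::size_t MaxCount) const
    {
        std::vector<PendingAction> Snapshot;
        {
            std::shared_lock Lock(m_Mutex);
            Snapshot = m_Pending;
        }
        std::stable_sort(Snapshot.begin(), Snapshot.end(),
                         [](const PendingAction& A, const PendingAction& B) { return A.Priority > B.Priority; });
        if (Snapshot.size() > MaxCount)
        {
            Snapshot.resize(MaxCount);
        }
        return Snapshot;
    }

private:
    mutable std::shared_mutex m_Mutex;
    std::vector<PendingAction> m_Pending;
};
```

The trade-off in this sketch is an extra copy of the pending list per scheduling pass in exchange for keeping the sort outside the critical section.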
* 5.8.4-pre2 [v5.8.4-pre2] (Dan Engelbrecht, 46 hours, files 1, lines -1/+1)
|
* log curl raw error on retry, add retry on CURLE_PARTIAL_FILE error (#951) (Dan Engelbrecht, 46 hours, files 2, lines -1/+4)
|   * log curl raw error on retry, add retry on CURLE_PARTIAL_FILE error
* silence errors due to abort (#950) (Dan Engelbrecht, 46 hours, files 2, lines -188/+334)
|   * silence exceptions in threaded requests to build storage if already aborted
* Some minor polish from tourist branch (#949) (Stefan Boberg, 46 hours, files 13, lines -44/+413)
|   - Replace per-type fmt::formatter specializations (StringBuilderBase, NiceBase) with a single generic formatter using a HasStringViewConversion concept
|   - Add ThousandsNum for comma-separated integer formatting (e.g. "1,234,567")
|   - Thread naming now accepts a sort hint for trace ordering
|   - Fix main thread trace registration to use actual thread ID and sort first
|   - Add ExpandEnvironmentVariables() for expanding %VAR% references in strings, with tests
|   - Add ParseHexBytes() overload with expected byte count validation
|   - Add Flag_BelowNormalPriority to CreateProcOptions (BELOW_NORMAL_PRIORITY_CLASS on Windows, setpriority on POSIX)
|   - Add PrettyScroll progress bar mode that pins the status line to the bottom of the terminal using scroll regions, with signal handler cleanup for Ctrl+C/SIGTERM
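As an illustration of the comma-separated integer formatting mentioned in #949, here is a minimal sketch. `FormatThousands` is a hypothetical free function written for this example; it is not the actual `ThousandsNum` type, whose interface the log does not show.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the real ThousandsNum): inserts a comma between
// each group of three digits, counted from the right.
std::string FormatThousands(std::uint64_t Value)
{
    const std::string Digits = std::to_string(Value);
    std::string Out;
    Out.reserve(Digits.size() + Digits.size() / 3);

    // Number of digits in the leading (possibly short) group; 0 means the
    // digit count is an exact multiple of three.
    const std::size_t Lead = Digits.size() % 3;
    for (std::size_t i = 0; i < Digits.size(); ++i)
    {
        if (i != 0 && (i % 3) == Lead)
        {
            Out.push_back(',');
        }
        Out.push_back(Digits[i]);
    }
    return Out;
}
```

With this sketch, `FormatThousands(1234567)` yields the same shape as the example in the commit message, `"1,234,567"`.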
* Add MemoryCidStore and ChunkStore interface (#940) (Stefan Boberg, 47 hours, files 11, lines -41/+331)
|   This PR introduces an in-memory `CidStore` option primarily for use with compute, to avoid hitting disk for ephemeral data which is not really worth persisting. And in particular not worth paying the critical path cost of persistence.
|
|   - **MemoryCidStore**: In-memory CidStore implementation backed by a hash map, optionally layered over a standard CidStore. Writes to the backing store are dispatched asynchronously via a dedicated flush thread to avoid blocking callers on disk I/O. Reads check memory first, then fall back to the backing store without caching the result.
|   - **ChunkStore interface**: Extract `ChunkStore` abstract class (`AddChunk`, `ContainsChunk`, `FilterChunks`) and `FallbackChunkResolver` into `zenstore.h` so `HttpComputeService` can accept different storage backends for action inputs vs worker binaries. `CidStore` and `MemoryCidStore` both implement `ChunkStore`.
|   - **Compute service wiring**: `HttpComputeService` takes two `ChunkStore&` params (action + worker). The compute server uses `MemoryCidStore` for actions (no disk persistence needed) and disk-backed `CidStore` for workers (cross-action reuse). The storage server passes its `CidStore` for both (unchanged behavior).
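The read path described for `MemoryCidStore` in #940 (check memory first, then fall back to the backing store without caching the result) can be sketched as below. This is a simplified illustration under stated assumptions: `SimpleStore`, `MapStore`, and `LayeredMemoryStore` are hypothetical types keyed by plain strings, and writes are forwarded synchronously here, whereas the commit describes an asynchronous flush thread.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical minimal interface; the real ChunkStore exposes AddChunk,
// ContainsChunk and FilterChunks, whose signatures are not shown in the log.
class SimpleStore
{
public:
    virtual ~SimpleStore() = default;
    virtual void Put(const std::string& Hash, const std::string& Data) = 0;
    virtual std::optional<std::string> Get(const std::string& Hash) const = 0;
};

// Stand-in for a disk-backed store.
class MapStore : public SimpleStore
{
public:
    void Put(const std::string& Hash, const std::string& Data) override { m_Map[Hash] = Data; }
    std::optional<std::string> Get(const std::string& Hash) const override
    {
        auto It = m_Map.find(Hash);
        return It == m_Map.end() ? std::nullopt : std::optional<std::string>(It->second);
    }

private:
    std::unordered_map<std::string, std::string> m_Map;
};

// In-memory layer, optionally over a backing store.
class LayeredMemoryStore : public SimpleStore
{
public:
    explicit LayeredMemoryStore(SimpleStore* Backing = nullptr) : m_Backing(Backing) {}

    void Put(const std::string& Hash, const std::string& Data) override
    {
        m_Memory[Hash] = Data;
        // Assumption: forwarded synchronously to keep the sketch short; the
        // commit message describes an asynchronous flush thread instead.
        if (m_Backing) m_Backing->Put(Hash, Data);
    }

    std::optional<std::string> Get(const std::string& Hash) const override
    {
        auto It = m_Memory.find(Hash);
        if (It != m_Memory.end()) return It->second;      // memory hit
        if (m_Backing) return m_Backing->Get(Hash);       // fallback, result not cached
        return std::nullopt;
    }

    std::size_t MemoryCount() const { return m_Memory.size(); }

private:
    SimpleStore* m_Backing;
    std::unordered_map<std::string, std::string> m_Memory;
};
```

Not caching fallback reads keeps the memory layer limited to data written through it, which matches the stated goal of holding only ephemeral compute data in memory.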
* 5.8.4-pre1 [v5.8.4-pre1] (Dan Engelbrecht, 48 hours, files 1, lines -1/+1)
|
* minor fixups (#948) (Dan Engelbrecht, 48 hours, files 6, lines -39/+54)
|   * objectstore.cpp - m_TotalBytesServed now tracks all range cases (single, multi, 416)
|   * async http: docstring corrected: curl_multi_socket_action() / ASIO socket async_wait remove non-ascii characters
|   * fix singlethreaded gc option in lua to not use dash
|   * fix changelog order
* Logging and diagnostics improvements (#941) (Stefan Boberg, 2 days, files 15, lines -128/+290)
|   Core logging and system diagnostics improvements, extracted from the compute branch.
|
|   ### Logging
|   - **Elapsed timestamps**: Console log now shows elapsed time since launch `[HH:MM:SS.mmm]` instead of full date/time; file logging is unchanged
|   - **Short level names**: 3-letter short level names (`trc`/`dbg`/`inf`/`wrn`/`err`/`crt`) used by both console and file formatters via `ShortToStringView()`
|   - **Consistent field order**: Standardized to `[timestamp] [level] [logger]` across both console and file formatters
|   - **Slim LogMessage/LogPoint**: Remove redundant fields from `LogMessage` (derive level/source from `LogPoint`), flatten `LogPoint` to inline filename/line fields, shrink `LogLevel` to `int8_t` with `static_assert(sizeof(LogPoint) <= 32)`
|   - **Remove default member initializers** and static default `LogPoint` from `LogMessage`; all fields initialized by constructor
|   - **LoggerRef string constructor**: Convenience constructor accepting a string directly
|   - **Fix SendMessage macro collision**: Replace `thread.h` include in `logmsg.h` with a forward declaration of `GetCurrentThreadId()` to avoid pulling in `windows.h` transitively
|
|   ### System Diagnostics
|   - **Cache static system metrics**: Add `RefreshDynamicSystemMetrics()` that only queries values that change at runtime (available memory, uptime, swap). `SystemMetricsTracker` snapshots full `GetSystemMetrics()` once at construction and reuses cached topology/total memory on each `Query()`, avoiding repeated `GetLogicalProcessorInformationEx` traversal on Windows, `/proc/cpuinfo` parsing on Linux, and `sysctl` topology calls on macOS
* update minio (#947) (Dan Engelbrecht, 2 days, files 6, lines -0/+2)
|
* hub instance malloc trace (#946) (Dan Engelbrecht, 2 days, files 8, lines -0/+91)
|   `--hub-instance-malloc` selects the memory allocator for child instances
|   `--hub-instance-trace` sets trace channels for child instances
|   `--hub-instance-tracehost` sets the trace streaming host for child instances
|   `--hub-instance-tracefile` sets the trace output file for child instances
|   add {moduleid} and {port} placeholder support for tracefile
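The `{moduleid}` and `{port}` placeholder support for the trace file path in #946 amounts to a token substitution. The sketch below shows one way to do it; `ExpandTracePlaceholders` and the example file name are hypothetical, and the log does not say how the real expansion is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Replace every occurrence of Token in Text with Value.
static void ReplaceAll(std::string& Text, std::string_view Token, std::string_view Value)
{
    std::size_t Pos = 0;
    while ((Pos = Text.find(Token, Pos)) != std::string::npos)
    {
        Text.replace(Pos, Token.size(), Value);
        Pos += Value.size();  // skip past the inserted value to avoid re-matching it
    }
}

// Hypothetical helper: expands the two placeholders named in the commit message.
std::string ExpandTracePlaceholders(std::string Template, std::string_view ModuleId, int Port)
{
    ReplaceAll(Template, "{moduleid}", ModuleId);
    ReplaceAll(Template, "{port}", std::to_string(Port));
    return Template;
}
```

Per-instance placeholders let several child instances share one `--hub-instance-tracefile` template without writing to the same file.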
* Add manual test workflow with configurable sanitizers and allocators (#944) (Stefan Boberg, 4 days, files 1, lines -0/+298)
|   - Adds a `workflow_dispatch` workflow ("Manual Test Run") that can be triggered from the Actions tab
|   - Configurable options: platform, memory allocator (`--malloc=stomp`/mimalloc/rpmalloc), sanitizer (asan/tsan/msan), test suite, and freeform extra arguments
|   - Mirrors the build & test steps from `validate.yml` but always builds debug with sentry disabled, and with longer timeout (40min) to accommodate sanitizer overhead
* Dashboard stats tiles no longer flicker (#943) (Dan Engelbrecht, 4 days, files 7, lines -190/+199)
|
* removed s3 test program (#942) (Stefan Boberg, 4 days, files 3, lines -535/+0)
|   Remove the `zens3-testbed` target and source files. This was a standalone test harness for S3 operations that is no longer needed.
* `--consul-register-hub` option to disable hub parent service Consul registration (#939) (Dan Engelbrecht, 4 days, files 4, lines -18/+37)
|
* hub deprovision all (#938) (Dan Engelbrecht, 4 days, files 4, lines -5/+110)
|   * implement "deprovision all" for hub
* dashboard search (#936) (Dan Engelbrecht, 4 days, files 7, lines -13/+187)
|   - Improvement: Dashboard paginated lists now include a search input that jumps to the page containing the first match and highlights the row
|   - Improvement: Dashboard paginated lists show a loading indicator while fetching data
|   - Improvement: Hub dashboard navigates to and highlights newly provisioned instances
* improve messaging when zen builds download target disk does not have enough space (#935) (Dan Engelbrecht, 4 days, files 2, lines -1/+5)
|
* update rpmalloc and tweak for commit/decommit churn (#934) (Dan Engelbrecht, 4 days, files 5, lines -88/+226)
|   - Improvement: Updated rpmalloc to develop branch commit feb43aee0d4d (2025-10-26), which fixes `VirtualAlloc(MEM_COMMIT)` failures being silently ignored under memory pressure
|   - Improvement: Increased rpmalloc page decommit thresholds to reduce commit/decommit churn under high allocation turnover
* HTTP range responses (RFC 7233) - httpobjectstore (#928) (Dan Engelbrecht, 5 days, files 11, lines -137/+515)
|   - Improvement: HTTP range responses (RFC 7233) are now fully compliant across the object store and build store
|   - 206 Partial Content responses now include a `Content-Range` header; previously absent for single-range requests, which broke `HttpClient::GetRanges()`
|   - 416 Range Not Satisfiable responses now include `Content-Range: bytes */N` as required by RFC 7233
|   - Out-of-bounds range requests return 416 Range Not Satisfiable (was 400 Bad Request)
|   - Single-byte ranges (`bytes=N-N`) are now correctly accepted (were previously rejected)
|   - Range byte positions widened from 32-bit to 64-bit; RFC 7233 imposes no size limit on byte range values
|   - Build store binary GET requests with a Range header now return 206 Partial Content with `Content-Range` (previously returned 200 OK without it)
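The range behaviour listed in #928 can be made concrete with a small resolver for a single byte range. This is a sketch under assumptions, not the object store's code: `RangeResult` and `ResolveSingleRange` are hypothetical names, multi-range requests are not handled, and the choice to ignore a malformed header (status 200) is this sketch's own.

```cpp
#include <algorithm>
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical result type: Status 200 means "ignore the Range header".
struct RangeResult
{
    int Status = 200;
    std::string ContentRange;  // value for the Content-Range header, if any
    std::uint64_t First = 0;
    std::uint64_t Last = 0;
};

static bool ParseU64(std::string_view Text, std::uint64_t& Out)
{
    if (Text.empty()) return false;
    const char* End = Text.data() + Text.size();
    auto [Ptr, Ec] = std::from_chars(Text.data(), End, Out);
    return Ec == std::errc{} && Ptr == End;
}

// Resolves "bytes=N-M", "bytes=N-" or "bytes=-S" against a resource of Size bytes.
RangeResult ResolveSingleRange(std::string_view Header, std::uint64_t Size)
{
    constexpr std::string_view Prefix = "bytes=";
    if (Header.substr(0, Prefix.size()) != Prefix) return {};
    const std::string_view Spec = Header.substr(Prefix.size());
    const std::size_t Dash = Spec.find('-');
    if (Dash == std::string_view::npos) return {};
    const std::string_view FirstText = Spec.substr(0, Dash);
    const std::string_view LastText = Spec.substr(Dash + 1);

    // 416 responses carry "bytes */N" so the client learns the full length.
    const RangeResult Unsatisfiable{416, "bytes */" + std::to_string(Size), 0, 0};

    std::uint64_t First = 0;
    std::uint64_t Last = 0;
    if (FirstText.empty())
    {
        // Suffix form: the last S bytes of the resource.
        std::uint64_t Suffix = 0;
        if (!ParseU64(LastText, Suffix)) return {};
        if (Suffix == 0 || Size == 0) return Unsatisfiable;
        First = Size > Suffix ? Size - Suffix : 0;
        Last = Size - 1;
    }
    else
    {
        if (!ParseU64(FirstText, First)) return {};
        if (LastText.empty())
        {
            Last = std::numeric_limits<std::uint64_t>::max();  // open-ended, clamped below
        }
        else if (!ParseU64(LastText, Last) || Last < First)
        {
            return {};
        }
        if (First >= Size) return Unsatisfiable;  // out of bounds: 416, not 400
        Last = std::min(Last, Size - 1);
    }
    return {206,
            "bytes " + std::to_string(First) + "-" + std::to_string(Last) + "/" + std::to_string(Size),
            First,
            Last};
}
```

All positions are 64-bit, so offsets beyond 4 GiB resolve the same way as small ones, and a single-byte range such as `bytes=0-0` is accepted.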
* reduce test runtime (#933) (Dan Engelbrecht, 5 days, files 15, lines -1386/+1242)
|   * reduce zenserver spawns in tests
|   * fix filesystemutils wrong test suite name
|   * tweak tests for faster runtime
|   * reduce more test runtime
|   * more wall time improvements
|   * fast http and processmanager tests
* Update CHANGELOG.md (Stefan Boberg, 6 days, files 1, lines -0/+4)
|
* Fix ZenServerState stale entry detection on PID reuse (k8s) (#932) (Stefan Boberg, 6 days, files 1, lines -0/+31)
|   - Detect stale shared-memory entries whose PID matches the current process but predate our registration (m_OurEntry == nullptr)
|   - Sweep() now reclaims such entries instead of skipping them
|   - Lookup() and LookupByEffectivePort() skip stale same-PID entries
|   - Fixes startup failure on k8s where PID 1 is always reused after an unclean shutdown
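The stale-entry rule from #932 can be sketched as a predicate plus a sweep. The types and function names below (`StateEntry`, `IsStaleSamePidEntry`, `SweepStale`) are hypothetical stand-ins; the real `ZenServerState` layout and `Sweep()` logic are not shown in the log, and treating PID 0 as a free slot is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shared-memory table entry; Pid 0 marks a free slot (assumption).
struct StateEntry
{
    std::uint32_t Pid = 0;
    std::uint16_t Port = 0;
};

// An entry carrying our own PID before we have registered (OurEntry == nullptr)
// cannot be ours, so it must be left over from a previous process that had the
// same PID, for example PID 1 in a restarted container.
bool IsStaleSamePidEntry(const StateEntry& Entry, std::uint32_t CurrentPid, const StateEntry* OurEntry)
{
    return Entry.Pid != 0 && Entry.Pid == CurrentPid && OurEntry == nullptr;
}

// Reclaims stale same-PID entries and returns how many slots were freed.
std::size_t SweepStale(std::vector<StateEntry>& Table, std::uint32_t CurrentPid, const StateEntry* OurEntry)
{
    std::size_t Reclaimed = 0;
    for (StateEntry& Entry : Table)
    {
        if (IsStaleSamePidEntry(Entry, CurrentPid, OurEntry))
        {
            Entry = StateEntry{};
            ++Reclaimed;
        }
    }
    return Reclaimed;
}
```

Once the process has registered, `OurEntry` is non-null and the sweep in this sketch leaves same-PID entries alone.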
* Add async HTTP client (curl_multi + ASIO) (#918) (Stefan Boberg, 6 days, files 6, lines -269/+1776)
|   - Adds `AsyncHttpClient`, an asynchronous HTTP client using `curl_multi_socket_action` integrated with ASIO for event-driven I/O. Supports GET, POST, PUT, DELETE, HEAD with both callback-based and `std::future`-based APIs.
|   - Extracts shared curl helpers (callbacks, URL encoding, header construction, error mapping) into `httpclientcurlhelpers.h`, eliminating duplication between the sync and async implementations.
|
|   ## Design
|   - All curl_multi state is serialized on an `asio::strand`, safe with multi-threaded io_contexts.
|   - Two construction modes: owned io_context (creates internal thread) or external io_context (caller runs the loop).
|   - Socket readiness is detected via `asio::ip::tcp::socket::async_wait` driven by curl's `CURLMOPT_SOCKETFUNCTION`/`CURLMOPT_TIMERFUNCTION`; no polling, sub-millisecond latency.
|   - Completion callbacks are dispatched off the strand onto the io_context so slow callbacks don't starve the curl event loop. Exceptions in callbacks are caught and logged.
|
|   ## Files
|   | File | Change |
|   |------|--------|
|   | `zenhttp/include/zenhttp/asynchttpclient.h` | New public header |
|   | `zenhttp/clients/asynchttpclient.cpp` | Implementation (~1000 lines) |
|   | `zenhttp/clients/httpclientcurlhelpers.h` | Shared curl helpers extracted from sync client |
|   | `zenhttp/clients/httpclientcurl.cpp` | Removed duplicated helpers, uses shared header |
|   | `zenhttp/asynchttpclient_test.cpp` | 8 test cases: verbs, payloads, callbacks, concurrency, external io_context, connection errors |
|   | `zenhttp/zenhttp.cpp` | Forcelink registration for new tests |
* migrate from http_parser to llhttp (#929) (Dan Engelbrecht, 6 days, files 12, lines -143/+423)
|
* 5.8.3 [v5.8.3] (Dan Engelbrecht, 7 days, files 1, lines -1/+1)
|
* 5.8.3-pre2 [v5.8.3-pre2] (Dan Engelbrecht, 7 days, files 1, lines -1/+1)
|
* fully provisioned hub instances now sets initial check status to "passing" in consul (#930) (Dan Engelbrecht, 7 days, files 4, lines -7/+15)
|
* use correct return code for unsupported multirange requests in objectstore (#927) (Dan Engelbrecht, 7 days, files 3, lines -2/+65)
|
* don't hard fail if .pending folder is not empty on oplog export (#926) (Dan Engelbrecht, 7 days, files 3, lines -2/+11)
|
* fix missing chunk in oplog export (#925) (Dan Engelbrecht, 7 days, files 2, lines -0/+161)
|   * add reused block to oplog during export
* hydration data obliteration (#923) (Dan Engelbrecht, 7 days, files 16, lines -153/+738)
|   - Feature: Hub obliterate operation deletes all local and backend hydration data for a module
|   - Improvement: Hub dashboard adds obliterate button for individual, bulk, and by-name module deletion
* sort items on dashboard (#924) (Dan Engelbrecht, 8 days, files 3, lines -79/+114)
|   * add pagination and consistent sorting on cache and projects ui pages
* add pagination of cooked projects and caches on dashboard front page (#922) (Dan Engelbrecht, 8 days, files 3, lines -65/+173)
|
* incremental dehydrate (#921) (Dan Engelbrecht, 8 days, files 22, lines -1148/+1910)
|   - Feature: Incremental CAS-based hydration/dehydration replacing the previous full-copy approach
|   - Feature: S3 hydration backend with multipart upload/download support
|   - Feature: Configurable thread pools for hub instance provisioning and hydration
|     `--hub-instance-provision-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation.
|     `--hub-hydration-threads` defaults to `max(cpu_count / 4, 2)`. Set to 0 for synchronous operation.
|   - Improvement: Hub triggers GC on instance before deprovisioning to compact storage before dehydration
|   - Improvement: GC status now reports pending triggers as running
|   - Improvement: S3 client debug logging gated behind verbose mode to reduce log noise at default verbosity
|   - Improvement: Hub dashboard Resources tile now shows total memory
|   - Improvement: `filesystemutils` moved from `zenremotestore` to `zenutil` for broader reuse
|   - Improvement: Hub uses separate provision and hydration worker pools to avoid deadlocks
|   - Improvement: Hibernate/wake/deprovision on non-existent or already-in-target-state modules are idempotent
|   - Improvement: `ScopedTemporaryDirectory` with empty path now creates a temporary directory instead of asserting
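The thread-count defaults in #921 follow a simple formula, `max(cpu_count / 4, 2)`, with 0 meaning synchronous operation. A minimal sketch of that rule follows; `DefaultPoolThreads`, `PoolConfig`, and `ResolvePool` are hypothetical names, and how the real options are parsed is not shown in the log.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Default from the commit message: max(cpu_count / 4, 2).
unsigned DefaultPoolThreads(unsigned CpuCount)
{
    return std::max(CpuCount / 4u, 2u);
}

// Hypothetical resolved configuration for one worker pool.
struct PoolConfig
{
    bool Synchronous;  // true when the option was explicitly set to 0
    unsigned Threads;  // worker count when not synchronous
};

// Configured is the value given on the command line, if any (assumed shape).
PoolConfig ResolvePool(std::optional<unsigned> Configured, unsigned CpuCount)
{
    if (!Configured.has_value())
    {
        return {false, DefaultPoolThreads(CpuCount)};
    }
    if (*Configured == 0)
    {
        return {true, 0};
    }
    return {false, *Configured};
}
```

On a 16-core machine the default in this sketch is 4 threads per pool, and machines with 8 or fewer cores get the floor of 2.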
* disable zencompute in bundle step (Stefan Boberg, 12 days, files 1, lines -0/+3)
|
* 5.8.3-pre0 [v5.8.3-pre0] (Dan Engelbrecht, 13 days, files 1, lines -1/+1)
|