aboutsummaryrefslogtreecommitdiff
path: root/src/zenhorde/hordeagent.cpp
Commit message (Collapse)AuthorAgeFilesLines
* [zenhorde] security + robustness fixes (#972)Stefan Boberg2026-04-171-11/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | ### Critical (cryptographic correctness) - AES-GCM nonce: replace homebrew `N32[0]++; N32[1]--; N32[2] = ^` scheme with NIST SP 800-38D ยง8.2.1 deterministic construction (64-bit big-endian counter). Session tears down on counter exhaustion instead of reusing a nonce. - Remove `std::random_device` / `mt19937` nonce seed - the deterministic construction from the previous commit doesn't need an RNG, and `std::random_device` isn't guaranteed to be a CSPRNG. - BCrypt return values: check every `BCRYPT_SUCCESS`, cache the `BCRYPT_KEY_HANDLE` on the context instead of re-creating it per message, destroy under null-guards. Closes the silent-downgrade-to-non-GCM path. ### High - OpenSSL: check `EVP_CIPHER_CTX_new` / `EVP_EncryptInit_ex` / `EVP_DecryptInit_ex` return values in the constructor and set `HasErrors` on failure. - Log AES-GCM tag-verification failures distinctly from other decrypt errors (BCrypt `STATUS_AUTH_TAG_MISMATCH` / OpenSSL `EVP_DecryptFinal_ex` post-set-tag), with a sequence counter for correlation. - Thread a bounds-checked `ReadCursor` through every `Read*` parser helper; `ReadException` / `ReadExecuteResult` / `ReadBlobRequest` now return `bool` and callers treat malformed frames as protocol errors. Closes the `0xFF` varint OOB-read. - Validate `ReadBlobRequest` locator as a safe filename component (reject path separators, `..`, NUL/control, drive colons, leading/trailing dot/space, length > 255). Closes the path-traversal attack on the `BundleDir / (Locator + ".blob")` join. - Bind `AsyncAgentMessageChannel`'s timer and `AsyncReadResponse` entry onto the socket's strand; expose `AsyncComputeSocket::GetStrand()`. Removes the race between the bare-io_context timer completion and `OnFrame` on `m_PendingHandler` under the 3-thread pool. - Drop the long-lived `m_EncryptBuffer` member - encrypt into a fresh per-write buffer shared with the completion handler. Also fixes thread-safety of the encrypt path. - Validate server-returned `ClusterId` against `[A-Za-z0-9._-]{1,64}` before concatenating into the `api/v2/compute/<ClusterId>` URL. ### Medium - `EVP_CIPHER_CTX_reset` + re-bind cipher on every encrypt/decrypt so stale state cannot bleed across messages. Also logs EVP failures. - Malformed `ExecuteResult` (size != 4) now tears down the agent instead of silently reporting `ExitCode = -1`. - Replace `assert(Eq != nullptr)` on env var parsing with a `zen::runtime_error` - assert is compiled out in release and `*(Eq+1)` was UB. - Blob name uses `zen::Oid::NewOid()` (24 hex chars, seeded from `random_device` run-id + monotonic serial) instead of predictable `<pid>_<ms>_<counter>`. Refuse to overwrite an existing blob path. - Cap `m_RecentlyDrainedWorkerIds` at 256 entries with an FIFO eviction queue. - `Blob(Data, Length)` rejects `Length > INT32_MAX` instead of wrapping the int32 wire fields. - Static `AuthToken` path uses `HttpClientAccessToken::TimePoint::max()` (never-expires sentinel) instead of synthesizing `now + 24h`. - Remove dead `m_Transport` field and `else if (m_Transport)` branch in `AsyncHordeAgent::Cancel()`.
* fix utf characters in source code (#953)Dan Engelbrecht2026-04-131-4/+4
|
* Compute OIDC auth, async Horde agents, and orchestrator improvements (#913)Stefan Boberg2026-04-131-192/+359
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rework of the Horde agent subsystem from synchronous per-thread I/O to an async ASIO-driven architecture, plus provisioner scale-down with graceful draining, OIDC authentication, scheduler improvements, and dashboard UI for provisioner control. ### Async Horde Agent Rewrite - Replace synchronous `HordeAgent` (one thread per agent, blocking I/O) with `AsyncHordeAgent` โ€” an ASIO state machine running on a shared `io_context` thread pool - Replace `TcpComputeTransport`/`AesComputeTransport` with `AsyncTcpComputeTransport`/`AsyncAesComputeTransport` - Replace `AgentMessageChannel` with `AsyncAgentMessageChannel` using frame queuing and ASIO timers - Delete `ComputeBuffer` and `ComputeChannel` ring-buffer classes (no longer needed) ### Provisioner Drain / Scale-Down - `HordeProvisioner` can now drain agents when target core count is lowered: queries each agent's `/compute/session/status` for workload, selects candidates by largest-fit/lowest-workload, and sends `/compute/session/drain` - Configurable `--horde-drain-grace-period` (default 300s) before force-kill - Implement `IProvisionerStateProvider` interface to expose provisioner state to the orchestrator HTTP layer - Forward `--coordinator-session`, `--provision-clean`, and `--provision-tracehost` through both Horde and Nomad provisioners to spawned workers ### OIDC Authentication - `HordeClient` accepts an `AccessTokenProvider` (refreshable token function) as alternative to static `--horde-token` - Wire up `OidcToken.exe` auto-discovery via `httpclientauth::CreateFromOidcTokenExecutable` with `--HordeUrl` mode - New `--horde-oidctoken-exe-path` CLI option for explicit path override ### Orchestrator & Scheduler - Orchestrator generates a session ID at startup; workers include `coordinator_session` in announcements so the orchestrator can reject stale-session workers - New `Rejected` action state โ€” when a remote runner declines at capacity, the action is rescheduled without retry count increment - Reduce scheduler lock contention: snapshot pending actions under shared lock, sort/trim outside the lock - Parallelize remote action submission across runners via `WorkerThreadPool` with slow-submit warnings - New action field `FailureReason` populated by all runner types (exit codes, sandbox failures, exceptions) - New endpoints: `session/drain`, `session/status`, `session/sunset`, `provisioner/status`, `provisioner/target` ### Remote Execution - Eager-attach mode for `RemoteHttpRunner` โ€” bundles all attachments upfront in a `CbPackage` for single-roundtrip submits - Track in-flight submissions to prevent over-queuing - Show remote runner hostname in `GetDisplayName()` - `--announce-url` to override the endpoint announced to the coordinator (e.g. relay-visible address) ### Frontend Dashboard - Delete standalone `compute.html` (925 lines) and `orchestrator.html` (669 lines), consolidated into JS page modules - Add provisioner panel to orchestrator dashboard: target/active/estimated core counts, draining agent count - Editable target-cores input with debounced POST to `/orch/provisioner/target` - Per-agent provisioning status badges (active / draining / deallocated) in the agents table - Active vs total CPU counts in agents summary row ### CLI - New `zen compute record-start` / `record-stop` subcommands - `zen exec` progress bar with submit and completion phases, atomic work counters, `--progress` mode (Pretty/Plain/Quiet) ### Other - `DataDir` supports environment variable expansion - Worker manifest validation checks for `worker.zcb` marker to detect incomplete cached directories - Linux/Mac runners `nice(5)` child processes to avoid starving the main server - `ComputeService::SetShutdownCallback` wired to `RequestExit` via `session/sunset` - Curl HTTP client logs effective URL on failure - `MachineInfo` carries `Pool` and `Mode` from Horde response - Horde bundle creation includes `.pdb` on Windows
* compute orchestration (#763)Stefan Boberg2026-03-041-0/+297
- Added local process runners for Linux/Wine, Mac with some sandboxing support - Horde & Nomad provisioning for development and testing - Client session queues with lifecycle management (active/draining/cancelled), automatic retry with configurable limits, and manual reschedule API - Improved web UI for orchestrator, compute, and hub dashboards with WebSocket push updates - Some security hardening - Improved scalability and `zen exec` command Still experimental - compute support is disabled by default