| author | Stefan Boberg <[email protected]> | 2026-04-13 16:38:16 +0200 |
|---|---|---|
| committer | GitHub Enterprise <[email protected]> | 2026-04-13 16:38:16 +0200 |
| commit | 795345e5fd7974a1f5227d507a58bb3ed75eafd5 (patch) | |
| tree | 7a0f142bf562c3590400586c82b0e7a1b5ad6493 /src/zencompute/runners/functionrunner.cpp | |
| parent | 5.8.4-pre2 (diff) | |
| download | zen-795345e5fd7974a1f5227d507a58bb3ed75eafd5.tar.xz zen-795345e5fd7974a1f5227d507a58bb3ed75eafd5.zip | |
Compute OIDC auth, async Horde agents, and orchestrator improvements (#913)
Reworks the Horde agent subsystem from synchronous per-thread blocking I/O to an async ASIO-driven architecture, and adds provisioner scale-down with graceful draining, OIDC authentication, scheduler improvements, and a dashboard UI for provisioner control.
### Async Horde Agent Rewrite
- Replace synchronous `HordeAgent` (one thread per agent, blocking I/O) with `AsyncHordeAgent` — an ASIO state machine running on a shared `io_context` thread pool
- Replace `TcpComputeTransport`/`AesComputeTransport` with `AsyncTcpComputeTransport`/`AsyncAesComputeTransport`
- Replace `AgentMessageChannel` with `AsyncAgentMessageChannel` using frame queuing and ASIO timers
- Delete `ComputeBuffer` and `ComputeChannel` ring-buffer classes (no longer needed)
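To make the "ASIO state machine" framing concrete, here is a much-simplified, hypothetical model of the kind of transition table such an agent drives from its completion handlers. The state and event names below are illustrative assumptions, not the actual `AsyncHordeAgent` enums; the real implementation runs these transitions from ASIO callbacks on a shared `io_context` thread pool.

```cpp
#include <cstdint>

// Hypothetical, simplified agent lifecycle: each I/O completion feeds an
// event into a pure transition function instead of blocking a dedicated
// thread. Names are assumptions for illustration only.
enum class AgentState : uint8_t { Disconnected, Connecting, Handshaking, Running, Draining };
enum class AgentEvent : uint8_t { StartConnect, ConnectOk, HandshakeOk, DrainRequested, IoError };

inline AgentState Next(AgentState S, AgentEvent E)
{
    switch (E)
    {
        case AgentEvent::StartConnect:   return S == AgentState::Disconnected ? AgentState::Connecting : S;
        case AgentEvent::ConnectOk:      return S == AgentState::Connecting ? AgentState::Handshaking : S;
        case AgentEvent::HandshakeOk:    return S == AgentState::Handshaking ? AgentState::Running : S;
        case AgentEvent::DrainRequested: return S == AgentState::Running ? AgentState::Draining : S;
        case AgentEvent::IoError:        return AgentState::Disconnected; // any I/O failure resets the machine
    }
    return S;
}
```

Because the machine is data plus a pure function, many agents can share one pool of threads calling `io_context::run()`, rather than one blocking thread per agent.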
### Provisioner Drain / Scale-Down
- `HordeProvisioner` can now drain agents when the target core count is lowered: it queries each agent's `/compute/session/status` for current workload, selects candidates by largest fit and lowest workload, and sends `/compute/session/drain`
- Configurable `--horde-drain-grace-period` (default 300s) before force-kill
- Implement `IProvisionerStateProvider` interface to expose provisioner state to the orchestrator HTTP layer
- Forward `--coordinator-session`, `--provision-clean`, and `--provision-tracehost` through both Horde and Nomad provisioners to spawned workers
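A minimal sketch of the largest-fit/lowest-workload candidate selection described above, under assumed names (the real `HordeProvisioner` types are not shown in this change): agents are drained biggest-first so the fewest agents are disturbed, with workload as the tie-breaker so the least busy agent of a given size goes first.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative agent record; Workload stands in for what the provisioner
// reads back from /compute/session/status.
struct AgentInfo
{
    int    Id;
    size_t Cores;
    size_t Workload;
};

// Returns ids of agents to drain so that the remaining core count drops
// to TargetCores or below: largest core count first, lowest workload on ties.
inline std::vector<int> SelectDrainCandidates(std::vector<AgentInfo> Agents, size_t TargetCores)
{
    size_t ActiveCores = 0;
    for (const AgentInfo& A : Agents)
        ActiveCores += A.Cores;

    std::sort(Agents.begin(), Agents.end(), [](const AgentInfo& L, const AgentInfo& R) {
        if (L.Cores != R.Cores)
            return L.Cores > R.Cores;   // largest fit first
        return L.Workload < R.Workload; // then least busy
    });

    std::vector<int> Drain;
    for (const AgentInfo& A : Agents)
    {
        if (ActiveCores <= TargetCores)
            break;
        Drain.push_back(A.Id);
        ActiveCores -= A.Cores;
    }
    return Drain;
}
```

Each selected agent would then receive `/compute/session/drain` and be force-killed only after the configured grace period expires.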
### OIDC Authentication
- `HordeClient` accepts an `AccessTokenProvider` (a refreshable token function) as an alternative to the static `--horde-token`
- Wire up `OidcToken.exe` auto-discovery via `httpclientauth::CreateFromOidcTokenExecutable` with `--HordeUrl` mode
- New `--horde-oidctoken-exe-path` CLI option for explicit path override
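The shape of a refreshable token source can be sketched as follows. This is an assumption-laden illustration, not the actual `AccessTokenProvider` type: the `Fetch` callback stands in for invoking `OidcToken.exe`, and the provider caches the token until shortly before its expiry.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

// Hedged sketch: returns a cached OIDC token, refreshing it when missing
// or within RefreshMargin of expiry. Names are illustrative.
class RefreshingTokenProvider
{
public:
    using Clock = std::chrono::steady_clock;
    struct Token
    {
        std::string       Value;
        Clock::time_point ExpiresAt;
    };

    explicit RefreshingTokenProvider(std::function<Token()> Fetch,
                                     std::chrono::seconds   RefreshMargin = std::chrono::seconds{60})
    : m_Fetch(std::move(Fetch)), m_Margin(RefreshMargin)
    {
    }

    std::string GetToken()
    {
        std::lock_guard Lock(m_Mutex);
        if (m_Cached.Value.empty() || Clock::now() + m_Margin >= m_Cached.ExpiresAt)
            m_Cached = m_Fetch(); // refresh when missing or about to expire
        return m_Cached.Value;
    }

private:
    std::function<Token()> m_Fetch;
    std::chrono::seconds   m_Margin;
    std::mutex             m_Mutex;
    Token                  m_Cached;
};
```

A client holding such a provider never sends an expired bearer token, while the expensive external fetch runs only when needed.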
### Orchestrator & Scheduler
- Orchestrator generates a session ID at startup; workers include `coordinator_session` in announcements so the orchestrator can reject stale-session workers
- New `Rejected` action state — when a remote runner declines at capacity, the action is rescheduled without retry count increment
- Reduce scheduler lock contention: snapshot pending actions under shared lock, sort/trim outside the lock
- Parallelize remote action submission across runners via `WorkerThreadPool` with slow-submit warnings
- New action field `FailureReason` populated by all runner types (exit codes, sandbox failures, exceptions)
- New endpoints: `session/drain`, `session/status`, `session/sunset`, `provisioner/status`, `provisioner/target`
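The stale-session rejection from the first bullet reduces to a simple check, sketched here with assumed names (the actual announcement handling is not shown in this change). Treating a missing `coordinator_session` as a rejection is also an assumption, modeling workers spawned before the orchestrator restarted.

```cpp
#include <string>

// Illustrative announcement payload; the real struct has more fields.
struct Announcement
{
    std::string WorkerUrl;
    std::string CoordinatorSession;
};

// The orchestrator generates CurrentSession once at startup. A worker
// spawned by a previous orchestrator run carries the old session id and
// is rejected rather than scheduled onto.
inline bool AcceptAnnouncement(const Announcement& A, const std::string& CurrentSession)
{
    return !A.CoordinatorSession.empty() && A.CoordinatorSession == CurrentSession;
}
```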
### Remote Execution
- Eager-attach mode for `RemoteHttpRunner` — bundles all attachments upfront in a `CbPackage` for single-roundtrip submits
- Track in-flight submissions to prevent over-queuing
- Show remote runner hostname in `GetDisplayName()`
- `--announce-url` to override the endpoint announced to the coordinator (e.g. relay-visible address)
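The in-flight tracking idea above can be sketched as a small counter that nets submitted-but-unacknowledged work out of the capacity a runner advertises, so the scheduler cannot over-queue it. The member and method names here are assumptions, not the actual `RemoteHttpRunner` internals.

```cpp
#include <atomic>
#include <cstddef>

// Illustrative tracker: capacity reported to the scheduler is free slots
// minus submissions still in flight.
class InFlightTracker
{
public:
    explicit InFlightTracker(size_t MaxSlots) : m_MaxSlots(MaxSlots) {}

    size_t AvailableCapacity() const
    {
        size_t InFlight = m_InFlight.load(std::memory_order_relaxed);
        return InFlight >= m_MaxSlots ? 0 : m_MaxSlots - InFlight;
    }

    void OnSubmit(size_t Count) { m_InFlight.fetch_add(Count, std::memory_order_relaxed); }
    void OnCompleted(size_t Count) { m_InFlight.fetch_sub(Count, std::memory_order_relaxed); }

private:
    size_t              m_MaxSlots;
    std::atomic<size_t> m_InFlight{0};
};
```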
### Frontend Dashboard
- Delete the standalone `compute.html` (925 lines) and `orchestrator.html` (669 lines), consolidating them into JS page modules
- Add provisioner panel to orchestrator dashboard: target/active/estimated core counts, draining agent count
- Editable target-cores input with debounced POST to `/orch/provisioner/target`
- Per-agent provisioning status badges (active / draining / deallocated) in the agents table
- Active vs total CPU counts in agents summary row
### CLI
- New `zen compute record-start` / `record-stop` subcommands
- `zen exec` progress bar with submit and completion phases, atomic work counters, `--progress` mode (Pretty/Plain/Quiet)
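The atomic work counters behind the progress bar can be modeled as below. This is a simplified sketch: the rendered string is illustrative, and the real Pretty/Plain/Quiet modes format differently. The point is that any runner thread can bump the counters lock-free while a single UI thread renders them.

```cpp
#include <atomic>
#include <cstddef>
#include <string>

// Illustrative progress state for a two-phase (submit, then complete) run.
struct ExecProgress
{
    std::atomic<size_t> Submitted{0};
    std::atomic<size_t> Completed{0};
    size_t              Total = 0;

    std::string Render() const
    {
        size_t S = Submitted.load(std::memory_order_relaxed);
        size_t C = Completed.load(std::memory_order_relaxed);
        return "submitted " + std::to_string(S) + "/" + std::to_string(Total) +
               ", completed " + std::to_string(C) + "/" + std::to_string(Total);
    }
};
```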
### Other
- `DataDir` supports environment variable expansion
- Worker manifest validation checks for `worker.zcb` marker to detect incomplete cached directories
- Linux/Mac runners `nice(5)` child processes to avoid starving the main server
- `ComputeService::SetShutdownCallback` wired to `RequestExit` via `session/sunset`
- Curl HTTP client logs effective URL on failure
- `MachineInfo` carries `Pool` and `Mode` from Horde response
- Horde bundle creation includes `.pdb` on Windows
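For the `DataDir` bullet, here is a sketch of `${VAR}`-style expansion. The accepted syntax in the real implementation is not stated in these notes, so handling only the `${NAME}` form, expanding unset variables to the empty string, and keeping unterminated references literal are all assumptions of this example.

```cpp
#include <cstdlib>
#include <string>

// Expands ${NAME} references against the process environment.
inline std::string ExpandEnvVars(const std::string& In)
{
    std::string Out;
    size_t      Pos = 0;
    while (Pos < In.size())
    {
        size_t Open = In.find("${", Pos);
        if (Open == std::string::npos)
        {
            Out.append(In, Pos, std::string::npos);
            break;
        }
        size_t Close = In.find('}', Open + 2);
        if (Close == std::string::npos)
        {
            Out.append(In, Pos, std::string::npos); // unterminated reference, keep literal
            break;
        }
        Out.append(In, Pos, Open - Pos);
        std::string Name = In.substr(Open + 2, Close - Open - 2);
        if (const char* Value = std::getenv(Name.c_str()))
            Out += Value;
        Pos = Close + 1;
    }
    return Out;
}
```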
Diffstat (limited to 'src/zencompute/runners/functionrunner.cpp')
| -rw-r--r-- | src/zencompute/runners/functionrunner.cpp | 120 |
1 file changed, 99 insertions, 21 deletions
```diff
diff --git a/src/zencompute/runners/functionrunner.cpp b/src/zencompute/runners/functionrunner.cpp
index 67e12b84e..ab22c6363 100644
--- a/src/zencompute/runners/functionrunner.cpp
+++ b/src/zencompute/runners/functionrunner.cpp
@@ -6,9 +6,15 @@
 # include <zencore/compactbinary.h>
 # include <zencore/filesystem.h>
+# include <zencore/fmtutils.h>
+# include <zencore/logging.h>
+# include <zencore/string.h>
+# include <zencore/timer.h>
 # include <zencore/trace.h>
+# include <zencore/workthreadpool.h>
 
 # include <fmt/format.h>
+# include <future>
 # include <vector>
 
 namespace zen::compute {
@@ -118,23 +124,34 @@
 std::vector<SubmitResult>
 BaseRunnerGroup::SubmitActions(const std::vector<Ref<RunnerAction>>& Actions)
 {
     ZEN_TRACE_CPU("BaseRunnerGroup::SubmitActions");
 
-    RwLock::SharedLockScope _(m_RunnersLock);
-    const int RunnerCount = gsl::narrow<int>(m_Runners.size());
+    // Snapshot runners and query capacity under the lock, then release
+    // before submitting — HTTP submissions to remote runners can take
+    // hundreds of milliseconds and we must not hold m_RunnersLock during I/O.
-    if (RunnerCount == 0)
-    {
-        return std::vector<SubmitResult>(Actions.size(), SubmitResult{.IsAccepted = false, .Reason = "No runners available"});
-    }
+    std::vector<Ref<FunctionRunner>> Runners;
+    std::vector<size_t> Capacities;
+    std::vector<std::vector<Ref<RunnerAction>>> PerRunnerActions;
+    size_t TotalCapacity = 0;
 
-    // Query capacity per runner and compute total
-    std::vector<size_t> Capacities(RunnerCount);
-    size_t TotalCapacity = 0;
+    m_RunnersLock.WithSharedLock([&] {
+        const int RunnerCount = gsl::narrow<int>(m_Runners.size());
+        Runners.assign(m_Runners.begin(), m_Runners.end());
+        Capacities.resize(RunnerCount);
+        PerRunnerActions.resize(RunnerCount);
 
-    for (int i = 0; i < RunnerCount; ++i)
+        for (int i = 0; i < RunnerCount; ++i)
+        {
+            Capacities[i] = Runners[i]->QueryCapacity();
+            TotalCapacity += Capacities[i];
+        }
+    });
+
+    const int RunnerCount = gsl::narrow<int>(Runners.size());
+
+    if (RunnerCount == 0)
     {
-        Capacities[i] = m_Runners[i]->QueryCapacity();
-        TotalCapacity += Capacities[i];
+        return std::vector<SubmitResult>(Actions.size(), SubmitResult{.IsAccepted = false, .Reason = "No runners available"});
     }
 
     if (TotalCapacity == 0)
@@ -143,9 +160,8 @@ BaseRunnerGroup::SubmitActions(const std::vector<Ref<RunnerAction>>& Actions)
     }
 
     // Distribute actions across runners proportionally to their available capacity
-    std::vector<std::vector<Ref<RunnerAction>>> PerRunnerActions(RunnerCount);
-    std::vector<size_t> ActionRunnerIndex(Actions.size());
-    size_t ActionIdx = 0;
+    std::vector<size_t> ActionRunnerIndex(Actions.size());
+    size_t ActionIdx = 0;
 
     for (int i = 0; i < RunnerCount; ++i)
     {
@@ -176,14 +192,74 @@ BaseRunnerGroup::SubmitActions(const std::vector<Ref<RunnerAction>>& Actions)
         }
     }
 
-    // Submit batches per runner
+    // Submit batches per runner — in parallel when a worker pool is available
+    std::vector<std::vector<SubmitResult>> PerRunnerResults(RunnerCount);
+    int ActiveRunnerCount = 0;
     for (int i = 0; i < RunnerCount; ++i)
     {
         if (!PerRunnerActions[i].empty())
         {
-            PerRunnerResults[i] = m_Runners[i]->SubmitActions(PerRunnerActions[i]);
+            ++ActiveRunnerCount;
+        }
+    }
+
+    static constexpr uint64_t SubmitWarnThresholdMs = 500;
+
+    auto SubmitToRunner = [&](int RunnerIndex) {
+        auto& Runner = Runners[RunnerIndex];
+        Runner->m_LastSubmitStats.Reset();
+
+        Stopwatch Timer;
+
+        PerRunnerResults[RunnerIndex] = Runner->SubmitActions(PerRunnerActions[RunnerIndex]);
+
+        uint64_t ElapsedMs = Timer.GetElapsedTimeMs();
+        if (ElapsedMs >= SubmitWarnThresholdMs)
+        {
+            size_t Attachments = Runner->m_LastSubmitStats.TotalAttachments.load(std::memory_order_relaxed);
+            uint64_t AttachmentBytes = Runner->m_LastSubmitStats.TotalAttachmentBytes.load(std::memory_order_relaxed);
+
+            ZEN_WARN("submit of {} actions ({} attachments, {}) to '{}' took {}ms",
+                     PerRunnerActions[RunnerIndex].size(),
+                     Attachments,
+                     NiceBytes(AttachmentBytes),
+                     Runner->GetDisplayName(),
+                     ElapsedMs);
+        }
+    };
+
+    if (m_WorkerPool && ActiveRunnerCount > 1)
+    {
+        std::vector<std::future<void>> Futures(RunnerCount);
+
+        for (int i = 0; i < RunnerCount; ++i)
+        {
+            if (!PerRunnerActions[i].empty())
+            {
+                std::packaged_task<void()> Task([&SubmitToRunner, i]() { SubmitToRunner(i); });
+
+                Futures[i] = m_WorkerPool->EnqueueTask(std::move(Task), WorkerThreadPool::EMode::EnableBacklog);
+            }
+        }
+
+        for (int i = 0; i < RunnerCount; ++i)
+        {
+            if (Futures[i].valid())
+            {
+                Futures[i].get();
+            }
+        }
+    }
+    else
+    {
+        for (int i = 0; i < RunnerCount; ++i)
+        {
+            if (!PerRunnerActions[i].empty())
+            {
+                SubmitToRunner(i);
+            }
         }
     }
@@ -309,10 +385,11 @@ RunnerAction::RetractAction()
 bool
 RunnerAction::ResetActionStateToPending()
 {
-    // Only allow reset from Failed, Abandoned, or Retracted states
+    // Only allow reset from Failed, Abandoned, Rejected, or Retracted states
     State CurrentState = m_ActionState.load();
 
-    if (CurrentState != State::Failed && CurrentState != State::Abandoned && CurrentState != State::Retracted)
+    if (CurrentState != State::Failed && CurrentState != State::Abandoned && CurrentState != State::Rejected &&
+        CurrentState != State::Retracted)
     {
         return false;
     }
@@ -333,11 +410,12 @@ RunnerAction::ResetActionStateToPending()
     // Clear execution fields
     ExecutionLocation.clear();
+    FailureReason.clear();
     CpuUsagePercent.store(-1.0f, std::memory_order_relaxed);
     CpuSeconds.store(0.0f, std::memory_order_relaxed);
 
-    // Increment retry count (skip for Retracted — nothing failed)
-    if (CurrentState != State::Retracted)
+    // Increment retry count (skip for Retracted/Rejected — nothing failed)
+    if (CurrentState != State::Retracted && CurrentState != State::Rejected)
     {
         RetryCount.fetch_add(1, std::memory_order_relaxed);
     }
```