path: root/src/zencompute/computeservice.cpp
Commit message | Author | Age | Files | Lines
* Add Retracted state to RunnerAction for retry-free rescheduling
  Author: Stefan Boberg | Age: 5 hours | Files: 1 | Lines: -0/+61

  Retracted is an explicit, instigator-initiated request to pull an action back and reschedule it on a different runner (e.g. capacity opened up elsewhere). Unlike Failed/Abandoned auto-retry, rescheduling from Retracted does not increment RetryCount since nothing went wrong.

  - Add Retracted enum value after Cancelled with static_assert guarding ordinal placement so runner-side transitions cannot override it
  - Implement idempotent RetractAction() CAS method on RunnerAction
  - Extend ResetActionStateToPending() to accept Retracted without incrementing RetryCount
  - Add RetractAction() to ComputeServiceSession with pending/running map lookup and runner cancellation for running actions
  - Handle Retracted in HandleActionUpdates() scheduler loop (remove from active maps, reset to Pending, no history/results entry)
  - Add POST jobs/{lsn}/retract and queues/{queueref}/jobs/{lsn}/retract HTTP endpoints
  - Bump ActionHistoryEntry::Timestamps array from [8] to [9]
  - Add state machine documentation and per-state comments to RunnerAction
  - Add tests: retract_pending, retract_not_terminal, retract_http
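The retract semantics described in this commit can be illustrated with a small, self-contained sketch. It is not the code from computeservice.cpp: the type names (`SketchState`, `ActionSketch`), the reduced set of states and their relative order are assumptions made for illustration. Only the ideas come from the commit message: an enum value placed after Cancelled and guarded by a static_assert, an idempotent compare-and-swap retract, and a reset to Pending that leaves RetryCount alone when coming from Retracted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, reduced state set. The real RunnerAction enum has more
// states and may order the earlier ones differently.
enum class SketchState : uint8_t
{
    Pending,
    Running,
    Completed,
    Failed,
    Abandoned,
    Cancelled,
    Retracted,
};

// Guard the ordinal placement: if runner-side transitions are only allowed
// to move "forward", nothing a runner reports can override Retracted.
static_assert(static_cast<uint8_t>(SketchState::Retracted) > static_cast<uint8_t>(SketchState::Cancelled),
              "Retracted must be ordered after Cancelled");

struct ActionSketch
{
    std::atomic<SketchState> CurrentState{SketchState::Pending};
    int                      RetryCount = 0;  // assumed to be touched by one thread only

    // Idempotent retract: true if the action is now (or already was) Retracted,
    // false if it had already left the Pending/Running states.
    bool Retract()
    {
        SketchState Expected = CurrentState.load();
        for (;;)
        {
            if (Expected == SketchState::Retracted)
            {
                return true;
            }
            if (Expected != SketchState::Pending && Expected != SketchState::Running)
            {
                return false;
            }
            if (CurrentState.compare_exchange_weak(Expected, SketchState::Retracted))
            {
                return true;
            }
            // A failed CAS reloads Expected; re-check it on the next iteration.
        }
    }

    // Reset to Pending. Failure-driven resets count as a retry, Retracted does not.
    bool ResetToPending()
    {
        SketchState Expected = CurrentState.load();
        for (;;)
        {
            const bool Resettable = Expected == SketchState::Failed || Expected == SketchState::Abandoned ||
                                    Expected == SketchState::Retracted;
            if (!Resettable)
            {
                return false;
            }
            const bool CountsAsRetry = Expected != SketchState::Retracted;
            if (CurrentState.compare_exchange_weak(Expected, SketchState::Pending))
            {
                if (CountsAsRetry)
                {
                    ++RetryCount;
                }
                return true;
            }
        }
    }
};
```

Calling `Retract()` twice on a pending action succeeds both times, and a following `ResetToPending()` leaves `RetryCount` at 0, whereas a reset from Failed bumps it to 1.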
* Fix queue ActiveCount race in HandleActionUpdates terminal path
  Author: Stefan Boberg | Age: 6 hours | Files: 1 | Lines: -1/+7

  NotifyQueueActionComplete (which decrements ActiveCount) was called after releasing m_ResultsLock. GetActionResult acquires m_ResultsLock to consume the result, so a caller could observe ActiveCount still at 1 immediately after GetActionResult returned OK if the scheduler thread was preempted between releasing m_ResultsLock and reaching NotifyQueueActionComplete.

  Fix by calling NotifyQueueActionComplete before the m_ResultsLock block that publishes the result into m_ResultsMap. This guarantees that by the time GetActionResult can return OK, the queue counters are already updated.
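The ordering fix above can be sketched as follows. This is a simplified model and not the actual scheduler code: `QueueSketch`, `SchedulerSketch`, `CompleteAction` and the `int`/`std::string` payload types are invented for illustration. What it shows is only the ordering the commit describes, where the counter is updated before the result becomes visible under the results lock.

```cpp
#include <atomic>
#include <cassert>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a queue with an active-action counter.
struct QueueSketch
{
    std::atomic<int> ActiveCount{0};
};

struct SchedulerSketch
{
    QueueSketch                Queue;
    std::mutex                 ResultsLock;
    std::map<int, std::string> ResultsMap;

    // Terminal path: decrement the counter first, then publish the result.
    // A reader that finds the result under ResultsLock is therefore
    // guaranteed to see the already-decremented counter.
    void CompleteAction(int Lsn, std::string Result)
    {
        Queue.ActiveCount.fetch_sub(1);  // plays the role of NotifyQueueActionComplete

        std::lock_guard<std::mutex> Guard(ResultsLock);
        ResultsMap.emplace(Lsn, std::move(Result));
    }

    // Consumes the result if it has been published, otherwise returns nullopt.
    std::optional<std::string> GetActionResult(int Lsn)
    {
        std::lock_guard<std::mutex> Guard(ResultsLock);
        auto                        It = ResultsMap.find(Lsn);
        if (It == ResultsMap.end())
        {
            return std::nullopt;
        }
        std::string Out = std::move(It->second);
        ResultsMap.erase(It);
        return Out;
    }
};
```

With the two steps in `CompleteAction` swapped, a caller could receive the result and still read the old counter value, which is the window the commit closes.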
* Fix race in HandleActionUpdates causing abandon_pending test to flake
  Author: Stefan Boberg | Age: 6 hours | Files: 1 | Lines: -1/+17

  AbandonAllActions() scans m_PendingActions to mark actions as Abandoned, but EnqueueAction posts actions to m_UpdatedActions first; the scheduler inserts them into m_PendingActions on its next tick. If the session transitions to Abandoned in that window, AbandonAllActions() sees an empty m_PendingActions and the actions are later inserted as Pending with no one left to abandon them, causing GetActionResult to return 202 indefinitely.

  Fix: in HandleActionUpdates, when processing a Pending-state action, check if the session is already Abandoned and if so call SetActionState(Abandoned) immediately rather than inserting into the pending map. SetActionState calls PostUpdate internally, so the action re-enters m_UpdatedActions as Abandoned and flows into m_ResultsMap on the next scheduler pass.
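A single-threaded model makes the window and the fix easier to follow. The sketch below is an assumption-laden simplification: `SessionSketch`, `ActSketch` and the container choices are invented, and all locking is omitted. It only mirrors the flow described above, where an action enqueued before the session is abandoned is converted to Abandoned by the scheduler pass instead of being parked in the pending map.

```cpp
#include <cassert>
#include <deque>
#include <map>

enum class ActState
{
    Pending,
    Abandoned,
};

// Hypothetical minimal action record.
struct ActSketch
{
    int      Lsn;
    ActState State;
};

struct SessionSketch
{
    bool                     SessionAbandoned = false;
    std::deque<ActSketch>    UpdatedActions;  // models m_UpdatedActions
    std::map<int, ActSketch> PendingActions;  // models m_PendingActions
    std::map<int, ActSketch> ResultsMap;      // models m_ResultsMap

    // New actions are only posted here; the scheduler files them later.
    void Enqueue(int Lsn) { UpdatedActions.push_back({Lsn, ActState::Pending}); }

    // Only sees actions the scheduler has already moved into PendingActions.
    void AbandonAll()
    {
        SessionAbandoned = true;
        for (auto& [Lsn, Action] : PendingActions)
        {
            UpdatedActions.push_back({Lsn, ActState::Abandoned});
        }
        PendingActions.clear();
    }

    // One scheduler pass over the updates posted so far.
    void HandleActionUpdates()
    {
        std::deque<ActSketch> Batch;
        Batch.swap(UpdatedActions);
        for (ActSketch& Action : Batch)
        {
            if (Action.State == ActState::Pending)
            {
                if (SessionAbandoned)
                {
                    // The fix: abandon immediately and re-post the update,
                    // so the next pass moves it into the results map.
                    Action.State = ActState::Abandoned;
                    UpdatedActions.push_back(Action);
                }
                else
                {
                    PendingActions.emplace(Action.Lsn, Action);
                }
            }
            else
            {
                ResultsMap.emplace(Action.Lsn, Action);
            }
        }
    }
};
```

Enqueueing an action, abandoning the session and then running two scheduler passes leaves the action in `ResultsMap` as Abandoned; without the `SessionAbandoned` check it would sit in `PendingActions` indefinitely.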
* Show host details when discovering new compute workers
  Author: Stefan Boberg | Age: 22 hours | Files: 1 | Lines: -1/+18

  Log hostname, platform, CPU count and total memory alongside the worker URI so operators can identify machines at a glance.
* Batch action submission for compute service
  Author: Stefan Boberg | Age: 24 hours | Files: 1 | Lines: -75/+63

  - Consolidate duplicated action submission logic in httpcomputeservice into a single HandleSubmitAction method supporting both single-action and batch (actions array) payloads
  - Group actions by queue in RemoteHttpRunner and submit as batches with configurable chunk size, falling back to individual submission on failure
  - Extract shared helpers in computeservice: MakeErrorResult, ValidateQueueForEnqueue, ActivateActionInQueue, RemoveActionFromActiveMaps
  - Add WriteCompactBinaryObject to zencore
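The group-by-queue batching with per-action fallback mentioned in this commit might look roughly like the sketch below. Nothing here is taken from RemoteHttpRunner: `PendingSubmit`, `SubmitBatched` and the two sender callbacks are hypothetical names, and the transport is abstracted into callbacks so the example stays self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Hypothetical record: which queue an action belongs to, and its LSN.
struct PendingSubmit
{
    int QueueId;
    int Lsn;
};

// Assumed transport hooks; each returns true when the remote side accepted.
using BatchSender  = std::function<bool(int QueueId, const std::vector<int>& Lsns)>;
using SingleSender = std::function<bool(int QueueId, int Lsn)>;

// Groups actions by queue, submits them in chunks of at most ChunkSize and
// falls back to one-by-one submission for any chunk whose batch call fails.
// Returns the number of accepted actions.
std::size_t SubmitBatched(const std::vector<PendingSubmit>& Actions,
                          std::size_t                       ChunkSize,
                          const BatchSender&                SendBatch,
                          const SingleSender&               SendOne)
{
    ChunkSize = std::max<std::size_t>(ChunkSize, 1);

    std::map<int, std::vector<int>> ByQueue;
    for (const PendingSubmit& Action : Actions)
    {
        ByQueue[Action.QueueId].push_back(Action.Lsn);
    }

    std::size_t Accepted = 0;
    for (const auto& [QueueId, Lsns] : ByQueue)
    {
        for (std::size_t Begin = 0; Begin < Lsns.size(); Begin += ChunkSize)
        {
            const std::size_t End = std::min(Begin + ChunkSize, Lsns.size());
            std::vector<int>  Chunk(Lsns.begin() + static_cast<std::ptrdiff_t>(Begin),
                                   Lsns.begin() + static_cast<std::ptrdiff_t>(End));

            if (SendBatch(QueueId, Chunk))
            {
                Accepted += Chunk.size();
                continue;
            }
            // Batch rejected: retry each action of this chunk individually.
            for (int Lsn : Chunk)
            {
                if (SendOne(QueueId, Lsn))
                {
                    ++Accepted;
                }
            }
        }
    }
    return Accepted;
}
```

Falling back per chunk rather than per run keeps one bad batch from forcing every other action onto the slower individual path.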
* compute orchestration (#763)
  Author: Stefan Boberg | Age: 13 days | Files: 1 | Lines: -0/+2236

  - Added local process runners for Linux/Wine, Mac with some sandboxing support
  - Horde & Nomad provisioning for development and testing
  - Client session queues with lifecycle management (active/draining/cancelled), automatic retry with configurable limits, and manual reschedule API
  - Improved web UI for orchestrator, compute, and hub dashboards with WebSocket push updates
  - Some security hardening
  - Improved scalability and `zen exec` command

  Still experimental - compute support is disabled by default