# Zen Compute

> **Note:** Compute interfaces are a work in progress and are not yet included
> in official binary releases.

Zen server implements a compute interface for managing distributed processing
via the UE5 DDC 2.0 Build API.

## Compute Model

The compute interface implements a "pure function" model for distributing work,
similar in spirit to serverless compute paradigms like AWS Lambda.

Clients implement transformation [Functions](#functions) in
[worker executables](#workers) and dispatch [Actions](#actions) to them via a
message-based interface.

Actions and workers must be described explicitly and fully up front — work is
submitted as self-contained objects to the compute service. This is more
constrained than general-purpose serverless platforms, which allows for
optimizations and tight integration with Zen's storage model.


## Actions

An action is the unit of work in the compute model. It is described by an action
descriptor — a Compact Binary object containing a self-contained description of
the inputs and the function to apply to produce an output.

### Sample Action Descriptor

```
work item 4857714dee2383b50b2e7d72afd79848ab5d13f8 (2 attachments):
Function: CompileShaderJobs
FunctionVersion: '83027356-2cf7-41ca-aba5-c81ab0ff2129'
BuildSystemVersion: '17fe280d-ccd8-4be8-a9d1-89c944a70969'
Inputs:
  Input:
    RawHash: 0c01d9f19033256ca974fced523d1e15b27c1b0a
    RawSize: 4482
  Virtual0:
    RawHash: dd9bbcb8763badd2f015f94f8f6e360362e2bce0
    RawSize: 3334
```
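The descriptor above can be modeled as a plain data structure. The sketch below mirrors the sample in Python dict form (the real wire format is a Compact Binary object, not JSON or Python literals) and shows the tuple the compute service later matches on:

```python
# Illustrative only: the action descriptor as a plain dict. The real encoding
# is Compact Binary; the field names and hashes are copied from the sample.
action_descriptor = {
    "Function": "CompileShaderJobs",
    "FunctionVersion": "83027356-2cf7-41ca-aba5-c81ab0ff2129",
    "BuildSystemVersion": "17fe280d-ccd8-4be8-a9d1-89c944a70969",
    "Inputs": {
        "Input": {
            "RawHash": "0c01d9f19033256ca974fced523d1e15b27c1b0a",
            "RawSize": 4482,
        },
        "Virtual0": {
            "RawHash": "dd9bbcb8763badd2f015f94f8f6e360362e2bce0",
            "RawSize": 3334,
        },
    },
}

# The tuple used by the compute service to find a matching worker:
match_key = (
    action_descriptor["Function"],
    action_descriptor["FunctionVersion"],
    action_descriptor["BuildSystemVersion"],
)
```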

## Functions

Functions are identified by a name and a version specification; a build system
version specification is also used for matching. When workers are registered
with the compute service, they are entered into a lookup table. As actions
stream in, the compute subsystem tries to find a worker that implements the
required function, matching on the `[Function,FunctionVersion,BuildSystemVersion]`
tuple. In practice more than one worker may match, and it is up to the compute
service to pick one.

```
=== Known functions ===========================
function                       version                              build system                         worker id
CompileShaderJobs              83027356-2cf7-41ca-aba5-c81ab0ff2129 17fe280d-ccd8-4be8-a9d1-89c944a70969 69cb9bb50e9600b5bd5e5ca4ba0f9187b118069a
```
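The matching described above can be sketched as a registry keyed by the function tuple. This is a simplified illustration, not the service's actual data structures, and the "pick the first match" policy is an assumption standing in for whatever selection logic the service applies:

```python
# Simplified sketch of worker matching: a registry keyed by the
# [Function, FunctionVersion, BuildSystemVersion] tuple. The real compute
# service may hold several workers per key and apply its own selection policy.
from collections import defaultdict

registry = defaultdict(list)  # (function, function_version, buildsystem_version) -> [worker_id]

def register_worker(worker_id, buildsystem_version, functions):
    """Enter each function a worker implements into the lookup table."""
    for name, version in functions:
        registry[(name, version, buildsystem_version)].append(worker_id)

def find_worker(action):
    """Resolve an action to a worker id, or None when nothing matches."""
    key = (action["Function"], action["FunctionVersion"], action["BuildSystemVersion"])
    candidates = registry.get(key, [])
    return candidates[0] if candidates else None  # assumed policy: first match wins

# Using the values from the samples in this document:
register_worker(
    "69cb9bb50e9600b5bd5e5ca4ba0f9187b118069a",
    "17fe280d-ccd8-4be8-a9d1-89c944a70969",
    [("CompileShaderJobs", "83027356-2cf7-41ca-aba5-c81ab0ff2129")],
)
```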

## Workers

A worker is an executable which accepts command line options carrying the
information required to execute an action. There are two modes: a legacy
file-based mode and a streaming mode.

In the file-based mode the option is simply `-Build=<action file>` which points to an action
descriptor in compact binary format (see above). By convention, the referenced inputs are in a folder
named `Inputs` where any input blobs are stored as `CompressedBuffer`-format files named 
after the `IoHash` of the uncompressed contents.
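The file-based invocation and layout convention can be sketched as follows. These helpers are illustrative only, not part of any real worker SDK; only the `-Build=` option name and the `Inputs` folder convention come from the description above:

```python
# Sketch of the file-based mode: `-Build=<action file>` points at the action
# descriptor, and input blobs live in an `Inputs` folder, named after the
# IoHash of the uncompressed contents.
from pathlib import Path

def parse_build_option(argv):
    """Extract the action file path from a `-Build=<action file>` argument."""
    for arg in argv:
        if arg.startswith("-Build="):
            return arg.split("=", 1)[1]
    return None

def input_blob_path(action_dir, raw_hash):
    """Conventional location of an input blob for a given raw-content hash."""
    return Path(action_dir) / "Inputs" / raw_hash

blob = input_blob_path("work/action0", "0c01d9f19033256ca974fced523d1e15b27c1b0a")
```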

In the streaming mode, the data is provided through a streaming socket interface instead
of using the file system. This eliminates process spawning overheads and enables intra-process 
pipelining for greater efficiency. The streaming mode is not yet implemented fully.

### Worker Descriptors

Workers are declared by passing a worker descriptor to the compute service. The descriptor
contains information about which executable files are required to execute the worker and how 
they need to be laid out. You can optionally also provide additional non-executable files to
go along with the executables.

The descriptor also lists the functions implemented by the worker. Each function defines
a version which is used when matching actions (the function version is passed in as the 
`FunctionVersion` in the action descriptor).

Each worker links in a small set of common support code which is used to handle the 
communication with the invoking program (the 'build system'). To be able to evolve this
interface, each worker also indicates the version of the build system using the
`BuildSystemVersion` attribute.

### Sample Worker Descriptor

```
worker 69cb9bb50e9600b5bd5e5ca4ba0f9187b118069a:
name: ShaderBuildWorker
path: Engine/Binaries/Win64/ShaderBuildWorker.exe
host: Win64
buildsystem_version: '17fe280d-ccd8-4be8-a9d1-89c944a70969'
timeout: 300
cores: 1
environment: []
executables:
  - name: 'Engine/Binaries/Win64/ShaderBuildWorker-DerivedDataBuildWorker.dll'
    hash: f4dbec80e549bae2916288f1b9428c2878d9ae7a
    size: 166912
  - name: 'Engine/Binaries/Win64/ShaderBuildWorker-DerivedDataCache.dll'
    hash: 8025d561ede05db19b235fc2ef290e2b029c1b8c
    size: 4339200
  - name: Engine/Binaries/Win64/ShaderBuildWorker.exe
    hash: b85862fca2ce04990470f27bae9ead7f31d9b27e
    size: 60928
  - name: Engine/Binaries/Win64/ShaderBuildWorker.modules
    hash: 7b05741a69a2ea607c5578668a8de50b04259668
    size: 3739
  - name: Engine/Binaries/Win64/ShaderBuildWorker.version
    hash: 8fdfd9f825febf2191b555393e69b32a1d78c24f
    size: 259
files: []
dirs:
  - Engine/Binaries/Win64
functions:
  - name: CompileShaderJobs
    version: '83027356-2cf7-41ca-aba5-c81ab0ff2129'
```

## API

The compute interfaces are exposed on the `/compute` endpoint. The LSN
APIs below are intended to replace the action ID oriented APIs.

The POST APIs typically involve a two-step dance where a descriptor is POSTed and
the service responds with a list of `needs` chunks (identified via `IoHash`) which
it does not have yet. The client can then follow up with a POST of a Compact Binary
Package containing the descriptor along with the needed chunks.
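The two-step dance can be sketched as the following negotiation. The service is stubbed as a callable so the flow is self-contained; a real client would issue HTTP POSTs, and the response shape beyond the `needs` list is an assumption for illustration:

```python
# Sketch of the two-step upload: POST the descriptor, receive the `needs`
# hashes the service is missing, then POST a package with only those chunks.
def submit_with_needs(post, descriptor, chunks):
    """`post` simulates the service; `chunks` maps IoHash -> blob bytes."""
    response = post(descriptor, None)        # step 1: descriptor only
    needs = response.get("needs", [])
    if not needs:
        return response                      # service already has everything
    package = {h: chunks[h] for h in needs}  # step 2: send only missing chunks
    return post(descriptor, package)

# Stub service that already holds one of the two referenced chunks.
have = {"aaaa"}
uploaded = {}
def fake_post(descriptor, package):
    if package is None:
        missing = [h for h in descriptor["chunks"] if h not in have]
        return {"needs": missing} if missing else {"status": "accepted"}
    uploaded.update(package)
    have.update(package)
    return {"status": "accepted"}

result = submit_with_needs(fake_post, {"chunks": ["aaaa", "bbbb"]},
                           {"aaaa": b"x", "bbbb": b"y"})
```

Note that only the chunk the service reported as missing is transferred, which is what makes the second step cheap when most inputs are already stored.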

`/compute/ready` - health check endpoint; returns HTTP 200 OK when ready, HTTP 503 otherwise

`/compute/sysinfo` - system information endpoint

`/compute/record/start`, `/compute/record/stop` - start/stop action recording

`/compute/workers/{worker}` - GET/POST worker descriptors and payloads

`/compute/jobs/completed` - GET list of completed actions

`/compute/jobs/{lsn}` - GET completed action results from LSN, POST action cancellation by LSN, priority changes by LSN

`/compute/jobs/{worker}/{action}` - GET completed action (job) results by action ID

`/compute/jobs/{worker}` - GET pending/running jobs for worker, POST requests to schedule action as a job

`/compute/jobs` - POST request to schedule action as a job

### Queues

Queues provide a way to logically group actions submitted by a client session. This enables
per-session cancellation and completion polling without affecting actions submitted by other
sessions.

#### Local access (integer ID routes)

These routes use sequential integer queue IDs and are restricted to local (loopback)
connections only. Remote requests receive HTTP 403 Forbidden.

`/compute/queues` - POST to create a new queue. Returns a `queue_id` which is used to
reference the queue in subsequent requests.

`/compute/queues/{queue}` - GET queue status (active, completed, failed, and cancelled
action counts, plus `is_complete` flag indicating all actions have finished). DELETE to
cancel all pending and running actions in the queue.

`/compute/queues/{queue}/completed` - GET list of completed action LSNs for this queue
whose results have not yet been retired. A queue-scoped alternative to `/compute/jobs/completed`.

`/compute/queues/{queue}/jobs` - POST to submit an action to a queue with automatic worker
resolution. Accepts an optional `priority` query parameter.

`/compute/queues/{queue}/jobs/{worker}` - POST to submit an action to a queue targeting a
specific worker. Accepts an optional `priority` query parameter.

`/compute/queues/{queue}/jobs/{lsn}` - GET action result by LSN, scoped to the queue

#### Remote access (OID token routes)

These routes use cryptographically generated 24-character hex tokens (OIDs) instead of
integer queue IDs. Tokens are unguessable and safe to use over the network. The token
mapping lives entirely in the HTTP service layer; the underlying compute service only
knows about integer queue IDs.

`/compute/queues/remote` - POST to create a new queue with token-based access. Returns
`queue_token` (24-char hex string) and `queue_id` (integer, for internal visibility).

`/compute/queues/{oidtoken}` - GET queue status or DELETE to cancel, same semantics as
the integer ID variant but using the OID token for identification.

`/compute/queues/{oidtoken}/completed` - GET list of completed action LSNs for this queue.

`/compute/queues/{oidtoken}/jobs` - POST to submit an action to a queue with automatic
worker resolution.

`/compute/queues/{oidtoken}/jobs/{worker}` - POST to submit an action targeting a specific
worker.

`/compute/queues/{oidtoken}/jobs/{lsn}` - GET action result by LSN, scoped to the queue

## Relationship to Unreal Build Accelerator

Zen Compute is designed to complement Unreal Build Accelerator (UBA), not
replace it. The two systems target different workload characteristics:

- **Zen Compute** — suited to workloads where the inputs and function are fully
  known before execution begins. All data is declared up front in the action
  descriptor, and the worker runs as a self-contained transformation. This
  enables content-addressed caching of results and efficient scheduling.

- **UBA** — suited to workloads where inputs are discovered dynamically as the
  process runs. The remote process and its dependencies are resolved on the fly,
  with inputs and results exchanged via high-frequency RPCs throughout execution.

In practice, Zen Compute handles workloads like shader compilation where the
inputs are well-defined, while UBA handles more complex build processes with
dynamic dependency graphs.

## Execution Flow

```mermaid
sequenceDiagram
  participant C as Client
  participant G as Zen Server
  participant Q as Runner
  participant W as Worker

  C->>G: POST /jobs
  G-->>C: 202 Accepted (job_id)
  G->>Q: enqueue(action)
  Q-->>G: job_id

  C->>G: GET /jobs/job_id
  G-->>C: 202 Accepted (job_id)

  Q->>W: spawn()
  Q-->>W: action
  W->>W: process
  W->>Q: complete(job_id)

  C->>G: GET /jobs/job_id
  G->>Q: status(job_id)
  Q-->>G: done
  G-->>C: 200 OK (result)
```