Motivation#
- Immediate, synchronous API methods work for quick operations but fail for work that is long, resource-intensive, or involves external services.
- Treating slow work by “just waiting longer” causes bad developer experience (hangs, debugging loops) and poor UX.
- We need an API equivalent of language-level Promises/Futures so clients can start work and then track, wait for, or cancel it.
Overview#
- Introduce Long-Running Operations (LROs) as an API-level analogue to Promises/Futures.
- API methods return an LRO as their return type instead of returning the final result directly.
- LROs are remote, persistent resources (unlike in-process Promises) and must be treated as API resources with identifiers and storage.
- LROs carry both the eventual result type and metadata about the operation (e.g., progress, timestamps).
Implementation (what you must provide)#
Operation resource type
- Identifier (`id`).
- Optional `result` (parameterized `ResultT`) or an `OperationError` (code, message, details).
- Boolean `done` to indicate completion (not success/failure).
- `metadata` field (parameterized `MetadataT`) for progress and operational information.
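The fields listed above could be written down roughly as follows. This is a minimal sketch, not a definitive schema: `Operation`, `OperationError`, `ResultT`, and `MetadataT` come from the text, while `ImportResult`, `ImportMetadata`, the example IDs, and the `INVALID_FILE` code are hypothetical names chosen for illustration. Modeling `result` as a union of `ResultT` and `OperationError` is one reading of "result or an OperationError"; separate `result` and `error` fields would also fit.

```typescript
// Failure details carried inside the Operation itself.
interface OperationError {
  code: string;       // unique, machine-readable
  message: string;    // human-readable
  details?: unknown;  // structured extra data
}

// An LRO parameterized by its eventual result type and its metadata type.
interface Operation<ResultT, MetadataT> {
  id: string;
  done: boolean;                      // completion only, not success or failure
  result?: ResultT | OperationError;  // absent until the operation is done
  metadata?: MetadataT;               // progress and operational information
}

// Hypothetical types for an import job, used only for illustration.
interface ImportResult { importedRecords: number; }
interface ImportMetadata { startTime: string; recordsProcessed: number; }

const running: Operation<ImportResult, ImportMetadata> = {
  id: "op-123",
  done: false,
  metadata: { startTime: "2024-01-01T00:00:00Z", recordsProcessed: 40 },
};

const failed: Operation<ImportResult, ImportMetadata> = {
  id: "op-124",
  done: true, // done, even though the work did not succeed
  result: { code: "INVALID_FILE", message: "The uploaded file could not be parsed." },
};
```

Note that `failed.done` is `true`: the flag records that the work has stopped, and the outcome is read from `result`.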
API methods to manage LROs
- Methods to create operations implicitly (return `Operation<ResultT, MetadataT>`).
- `GetOperation(id)`: fetch current Operation resource.
- `WaitOperation(id)`: block until operation resolves/rejects (server-side wait).
- `CancelOperation(id)`: request cancellation and block until cancel completes.
- Optional `PauseOperation` / `ResumeOperation` when supported.
- `ListOperations()` with filtering (including metadata queries) to discover operations.
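To make implicit creation concrete, here is a small in-memory sketch. Everything except `Operation`, `OperationError`, and the idea of `GetOperation` is an assumption made for illustration: `ExportService`, `startExport`, `completeExport`, `ExportResult`, and `ExportMetadata` are hypothetical names, and a real service would persist operations rather than keep them in a `Map`.

```typescript
interface OperationError { code: string; message: string; details?: unknown; }
interface Operation<ResultT, MetadataT> {
  id: string;
  done: boolean;
  result?: ResultT | OperationError;
  metadata?: MetadataT;
}

// Hypothetical result and metadata types for an export job.
interface ExportResult { url: string; }
interface ExportMetadata { recordsProcessed: number; }
type ExportOperation = Operation<ExportResult, ExportMetadata>;

class ExportService {
  private operations = new Map<string, ExportOperation>();
  private nextId = 1;

  // Long-running method: creates an Operation implicitly and returns it
  // right away instead of returning the ExportResult.
  startExport(): ExportOperation {
    const op: ExportOperation = {
      id: `op-${this.nextId++}`,
      done: false,
      metadata: { recordsProcessed: 0 },
    };
    this.operations.set(op.id, op);
    return { ...op }; // hand out a snapshot, not the stored object
  }

  // GetOperation(id): fetch the current state of the Operation resource.
  getOperation(id: string): ExportOperation {
    const op = this.operations.get(id);
    if (op === undefined) throw new Error(`operation ${id} not found`);
    return { ...op };
  }

  // Stand-in for the background worker finishing the job.
  completeExport(id: string, result: ExportResult): void {
    const op = this.operations.get(id);
    if (op === undefined) throw new Error(`operation ${id} not found`);
    op.done = true;
    op.result = result;
  }
}
```

The caller never creates an Operation explicitly; it only receives one from `startExport()` and later looks it up by `id`.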
Resource hierarchy#
Prefer a centralized top-level collection of Operation resources (e.g., `/operations/{id}`) rather than nesting under specific resources.
- Nesting complicates discovery (you must know parent IDs) and prevents easy system-wide querying.
Resolution — how clients obtain final results#
Two primary styles adapted from language Promises:
Polling: client repeatedly calls `GetOperation` (possibly with backoff) until `done` is true.
- Simple, client-driven; wastes network/compute if polled aggressively.
Waiting (blocking via WaitOperation): client issues `WaitOperation` once; server holds the connection and returns when done.
- Cleaner client code; server must manage long connections and resources.
Use a separate `WaitOperation` method (not a boolean flag on `GetOperation`) to keep semantics clear and to allow distinct monitoring/SLA treatment.
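The polling style can be sketched as a client-side loop with exponential backoff. This is an illustration under assumptions, not a prescribed client: `pollUntilDone`, `GetOperation` (as a function type), `Sleep`, and the delay values are hypothetical, and a real `getOperation` would issue a request for the Operation resource. The `sleep` parameter is injectable so the loop can be exercised without real delays.

```typescript
interface OperationError { code: string; message: string; details?: unknown; }
interface Operation<ResultT, MetadataT> {
  id: string;
  done: boolean;
  result?: ResultT | OperationError;
  metadata?: MetadataT;
}

// Hypothetical fetcher type; stands in for a call to GetOperation(id).
type GetOperation<R, M> = (id: string) => Promise<Operation<R, M>>;
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Call getOperation repeatedly, doubling the delay (up to maxDelayMs)
// after every response that is not yet done.
async function pollUntilDone<R, M>(
  getOperation: GetOperation<R, M>,
  id: string,
  initialDelayMs = 100,
  maxDelayMs = 5000,
  sleep: Sleep = realSleep,
): Promise<Operation<R, M>> {
  let delayMs = initialDelayMs;
  while (true) {
    const op = await getOperation(id);
    if (op.done) return op;
    await sleep(delayMs);
    delayMs = Math.min(delayMs * 2, maxDelayMs);
  }
}
```

With `WaitOperation`, the same client code would shrink to a single awaited call, at the cost of the server holding the connection open.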
Error handling#
Do not conflate transport/GET errors with operation result errors.
- `GetOperation` / `WaitOperation` should only return HTTP errors if retrieving the Operation resource itself failed.
- The operation’s failure is modeled inside the Operation as an `OperationError` (machine-readable code, human message, `details` structure).
Error codes must be unique and machine-readable so that client code can rely on them instead of parsing text messages.
Use `details` (structured) for extra error data clients may need.
Monitoring progress#
- Use the `metadata` field (MetadataT) to expose progress: timestamps, processed counts, estimated remaining time, etc.
- Percent complete is optional and sometimes misleading; prefer meaningful domain metrics (records processed, bytes processed, tasks completed).
- Clients can poll `GetOperation` to read `metadata` and report progress.
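As an illustration of domain-oriented progress metadata, the sketch below reports record counts and avoids a percentage. `ImportMetadata`, its field names, and `formatProgress` are hypothetical; the total is optional because it may not be known when the operation starts.

```typescript
// Hypothetical metadata for an import job.
interface ImportMetadata {
  startTime: string;        // ISO 8601 timestamp
  recordsProcessed: number; // domain metric clients can display directly
  recordsTotal?: number;    // may be unknown up front
}

// Turn metadata read from GetOperation into a progress line for users.
function formatProgress(m: ImportMetadata): string {
  if (m.recordsTotal === undefined) {
    return `${m.recordsProcessed} records processed`;
  }
  return `${m.recordsProcessed} of ${m.recordsTotal} records processed`;
}
```

For example, metadata with `recordsProcessed: 250` and `recordsTotal: 1000` yields `250 of 1000 records processed`.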
Canceling, pausing, resuming#
CancelOperation: should return the Operation (with `done=true`) only after cancellation is complete; attempt to remove intermediate artifacts when possible or include cleanup references in `metadata`.
PauseOperation / ResumeOperation:
- Only supported when it makes sense and is implementable.
- Do not repurpose `done` to indicate paused state; instead require `metadata` to include a `paused: boolean` field if pause/resume is supported.
- Methods should block until the pause or resume action is fully applied.
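The state changes described above can be sketched as pure functions over an Operation. This is a simplified, synchronous model: each function returns only once the new state exists, which mirrors the "block until applied" rule. `PausableMetadata`, `PausableOperation`, and the function names are hypothetical, and because the text does not say what a canceled operation carries in `result`, the sketch leaves it unset.

```typescript
interface OperationError { code: string; message: string; details?: unknown; }
interface Operation<ResultT, MetadataT> {
  id: string;
  done: boolean;
  result?: ResultT | OperationError;
  metadata?: MetadataT;
}

// Metadata for operations that support pause/resume.
interface PausableMetadata { paused: boolean; }
type PausableOperation<R> = Operation<R, PausableMetadata>;

// Pausing flips metadata.paused; done stays false.
function pauseOperation<R>(op: PausableOperation<R>): PausableOperation<R> {
  if (op.done) throw new Error(`operation ${op.id} is already done`);
  return { ...op, metadata: { ...op.metadata, paused: true } };
}

function resumeOperation<R>(op: PausableOperation<R>): PausableOperation<R> {
  if (op.done) throw new Error(`operation ${op.id} is already done`);
  return { ...op, metadata: { ...op.metadata, paused: false } };
}

// Canceling marks the operation done; cleanup of intermediate
// artifacts would happen before this returns.
function cancelOperation<R>(op: PausableOperation<R>): PausableOperation<R> {
  if (op.done) return op; // nothing left to cancel
  return { ...op, done: true };
}
```

A paused operation is therefore still `done: false`; only completion or cancellation sets `done` to `true`.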
Exploring operations#
- Because creators of LROs may crash or lose local state, provide `ListOperations()` to discover operations without knowing IDs ahead of time.
- `ListOperations()` must support filtering and querying `metadata` (e.g., list all non-done operations, list all paused operations).
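A filter over `done` and over a metadata field might look like the sketch below. The `ListFilter` shape and the in-memory array are assumptions for illustration; a real API would more likely accept a filter expression and evaluate it against stored operations.

```typescript
interface OperationError { code: string; message: string; details?: unknown; }
interface Operation<ResultT, MetadataT> {
  id: string;
  done: boolean;
  result?: ResultT | OperationError;
  metadata?: MetadataT;
}

// Hypothetical filter: every field that is set must match.
interface ListFilter { done?: boolean; paused?: boolean; }
type AnyOperation = Operation<unknown, { paused?: boolean }>;

function listOperations(all: AnyOperation[], filter: ListFilter = {}): AnyOperation[] {
  return all.filter((op) => {
    if (filter.done !== undefined && op.done !== filter.done) return false;
    // Operations without a paused flag are treated as not paused.
    const isPaused = op.metadata?.paused ?? false;
    if (filter.paused !== undefined && isPaused !== filter.paused) return false;
    return true;
  });
}
```

A client that lost its local state could call `listOperations(ops, { done: false })` to rediscover everything still in flight.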
Persistence#
LROs are implicitly created and quickly transition from useful (while running) to stale (after completion).
Persistence strategies:
- Keep Operation resources forever (simplest; default unless storage cost is prohibitive).
- Rolling window / expiration: purge based on `expireTime` (use completion time, not creation time). Common policy: delete if older than a fixed window (e.g., 30 days).
- Avoid complex schemes (type-dependent TTLs, cascading deletes, archive-then-delete) as they cause confusion.
Expiration policy should not depend on the operation’s result type; use a uniform expiration so that operations do not disappear unexpectedly.
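A uniform rolling-window policy could be sketched as follows. The names `StoredOperation`, `completeTime`, and `purgeExpired` are hypothetical; the 30-day default mirrors the example window mentioned above. Expiration is computed from completion time, so operations that are still running are never purged.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface StoredOperation {
  id: string;
  done: boolean;
  completeTime?: Date; // set when the operation finishes (hypothetical field)
}

// expireTime = completion time + retention window.
// Running operations have no expireTime.
function expireTime(op: StoredOperation, retentionMs: number): Date | undefined {
  if (!op.done || op.completeTime === undefined) return undefined;
  return new Date(op.completeTime.getTime() + retentionMs);
}

// Return the operations that should be kept as of `now`.
// The same window applies to every operation, whatever its result type.
function purgeExpired(
  ops: StoredOperation[],
  now: Date,
  retentionMs: number = 30 * DAY_MS,
): StoredOperation[] {
  return ops.filter((op) => {
    const expires = expireTime(op, retentionMs);
    return expires === undefined || expires.getTime() > now.getTime();
  });
}
```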
Trade-offs and guidance#
- Letting requests block until complete is simplest for clients but poor for distributed systems and fragile to connection loss.
- WaitOperation simplifies client code but shifts complexity to the server (connection management, SLAs).
- LROs introduce new resource patterns (implicit creation, parameterized types, operational metadata) that are more complex than traditional resources but provide powerful, flexible behavior.
- Keep a synchronous variant of API methods available if users rely on immediate results.