Long-running Operations - Summary

Motivation

Immediate, synchronous API methods work for quick operations but fail for work that is long, resource-intensive, or involves external services.
Handling slow work by “just waiting longer” gives developers a bad experience (hung requests, repeated debugging cycles) and gives end users poor UX.
We need an API equivalent of language-level Promises/Futures so clients can start work and then track, wait for, or cancel it.

Introduce Long-Running Operations (LROs) as an API-level analogue to Promises/Futures.
API methods that start slow work return an LRO instead of returning the final result directly.
LROs are remote, persistent resources (unlike in-process Promises) and must be treated as API resources with identifiers and storage.
LROs carry both the eventual result type and metadata about the operation (e.g., progress, timestamps).

Operation resource type
- Identifier (id).
- Optional result (parameterized ResultT) or an OperationError (code, message, details).
- Boolean done to indicate completion (not success/failure).
- metadata field (parameterized MetadataT) for progress and operational information.
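
The fields above could be modeled roughly as follows. This is a minimal sketch, not a definitive schema: the attribute name `error` and the helper `succeeded()` are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

ResultT = TypeVar("ResultT")
MetadataT = TypeVar("MetadataT")


@dataclass
class OperationError:
    code: str                      # unique, machine-readable
    message: str                   # human-readable
    details: Optional[Any] = None  # structured extra data


@dataclass
class Operation(Generic[ResultT, MetadataT]):
    id: str
    done: bool = False                      # completion only, not success or failure
    result: Optional[ResultT] = None        # set when the work succeeded
    error: Optional[OperationError] = None  # assumed field name; set when the work failed
    metadata: Optional[MetadataT] = None    # progress and operational information

    def succeeded(self) -> bool:
        # Hypothetical helper: success means "done and no error".
        return self.done and self.error is None
```

Keeping `done` separate from success means a client checks `done` first and only then looks at `result` or `error`.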

API methods to manage LROs
- Methods to create operations implicitly (return Operation<ResultT, MetadataT>).
- GetOperation(id) — fetch current Operation resource.
- WaitOperation(id) — block until operation resolves/rejects (server-side wait).
- CancelOperation(id) — request cancellation and block until cancel completes.
- Optional PauseOperation / ResumeOperation when supported.
- ListOperations() with filtering (including metadata queries) to discover operations.
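
Implicit creation can be sketched with an in-memory service. Everything here is hypothetical (the `BackupService` class, the `backup_database` method, the `operations/...` ID format, and `complete()` standing in for a background worker); it only illustrates that the method returns an Operation right away and that the Operation can be fetched later by ID.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class Operation:
    id: str
    done: bool = False
    result: Optional[Any] = None
    metadata: Optional[dict] = None


class BackupService:
    """Hypothetical service used only for illustration."""

    def __init__(self) -> None:
        # Stand-in for persistent storage of Operation resources.
        self._operations: Dict[str, Operation] = {}

    def backup_database(self, database: str) -> Operation:
        # Returns an Operation immediately instead of the final result.
        op = Operation(id=f"operations/{uuid.uuid4().hex}",
                       metadata={"database": database})
        self._operations[op.id] = op
        # A real service would hand the work to a background worker here.
        return op

    def get_operation(self, op_id: str) -> Operation:
        return self._operations[op_id]

    def complete(self, op_id: str, result: Any) -> None:
        # Stand-in for the background worker finishing the job.
        op = self._operations[op_id]
        op.result = result
        op.done = True
```

The caller never creates the Operation explicitly; it appears as a side effect of calling the slow method.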

Prefer a centralized top-level collection of Operation resources (e.g., /operations/{id}) rather than nesting under specific resources.
- Nesting complicates discovery (you must know parent IDs) and prevents easy system-wide querying.

Two primary styles adapted from language Promises:
1. Polling: client repeatedly calls GetOperation (possibly with backoff) until done is true.
  - Simple, client-driven; wastes network/compute if polled aggressively.
2. Waiting (blocking via WaitOperation): client issues WaitOperation once; server holds the connection and returns when done.
  - Cleaner client code; server must manage long connections and resources.
Use a separate WaitOperation method (not a boolean flag on GetOperation) to keep semantics clear and to allow distinct monitoring/SLA treatment.
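
A polling client with exponential backoff might look like the sketch below. The `get_operation` callable stands in for a GetOperation call and is assumed to return a dict with a `done` key; the delay values and the injectable `sleep` parameter are illustrative choices, not part of the pattern.

```python
import time
from typing import Callable, Optional


def poll_until_done(get_operation: Callable[[], dict],
                    initial_delay: float = 0.5,
                    max_delay: float = 30.0,
                    timeout: Optional[float] = None,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call get_operation repeatedly, doubling the delay, until done is true."""
    delay = initial_delay
    waited = 0.0
    while True:
        op = get_operation()
        if op["done"]:
            return op
        # Give up before sleeping past the caller's deadline.
        if timeout is not None and waited + delay > timeout:
            raise TimeoutError("operation still running after %.1fs" % waited)
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off, but cap the delay
```

With WaitOperation the same loop lives on the server, and the client makes a single call.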

Do not conflate transport/GET errors with operation result errors.
- GetOperation / WaitOperation should only return HTTP errors if retrieving the Operation resource itself failed.
- The operation’s failure is modeled inside the Operation as an OperationError (machine-readable code, human message, details structure).
Error codes must be unique and machine-readable so that client code can branch on them; clients should never have to parse the human-readable message.
Use details (structured) for extra error data clients may need.
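
On the client side, the separation could be handled as in this sketch. The exception classes, the `interpret` function, and the dict layout of the Operation (keys `done`, `error`, `result`) are assumptions made for the example.

```python
from typing import Any, Optional


class FetchError(Exception):
    """Retrieving the Operation resource itself failed (e.g., 404 or 500)."""


class StillRunning(Exception):
    """The Operation was retrieved but has not finished yet."""


class OperationFailed(Exception):
    """The Operation finished, and the work it represents failed."""

    def __init__(self, code: str, message: str, details: Any = None) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.details = details


def interpret(http_status: int, operation: Optional[dict]) -> Any:
    # 1. Transport-level problem: we never got an Operation at all.
    if http_status != 200 or operation is None:
        raise FetchError(f"could not retrieve operation (HTTP {http_status})")
    # 2. The Operation exists but is not finished.
    if not operation.get("done", False):
        raise StillRunning(operation.get("id", "<unknown>"))
    # 3. The work failed: HTTP was fine, the error lives inside the resource.
    error = operation.get("error")
    if error is not None:
        raise OperationFailed(error["code"], error["message"], error.get("details"))
    return operation.get("result")
```

Client code can then branch on `OperationFailed.code` without ever parsing the message text.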

Use the metadata field (MetadataT) to expose progress: timestamps, processed counts, estimated remaining time, etc.
Percent complete is optional and sometimes misleading; prefer meaningful domain metrics (records processed, bytes processed, tasks completed).
Clients can poll GetOperation to read metadata and report progress.
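
As an illustration, a metadata type for a hypothetical record-import job could expose a start timestamp and record counts, and a client could derive a progress line from them. The `ImportMetadata` fields and the linear estimate of remaining time are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImportMetadata:
    start_time: float                    # seconds since epoch
    records_processed: int
    records_total: Optional[int] = None  # may be unknown up front


def describe_progress(meta: ImportMetadata, now: float) -> str:
    elapsed = now - meta.start_time
    line = f"{meta.records_processed} records processed in {elapsed:.0f}s"
    # Estimate remaining time only when the total is known and work has started.
    if meta.records_total and meta.records_processed > 0:
        remaining = meta.records_total - meta.records_processed
        eta = elapsed / meta.records_processed * remaining
        line += f", about {eta:.0f}s remaining"
    return line
```

Reporting the raw counts keeps the output meaningful even when the total, and therefore a percentage, is not known.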

CancelOperation should return the Operation (with done=true) only after cancellation is complete. It should attempt to remove intermediate artifacts where possible; otherwise it should include references to them in metadata so that clients can clean up.
PauseOperation / ResumeOperation:
- Only supported when it makes sense and is implementable.
- Do not repurpose done to indicate paused state; instead require metadata to include a paused: boolean field if pause/resume is supported.
- Methods should block until the pause or resume action is fully applied.
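
The state changes described above can be sketched in memory as follows. The dict-based metadata and the `leftoverArtifacts` key are hypothetical, and the real blocking (waiting for a worker to stop) is only indicated in comments.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Operation:
    id: str
    done: bool = False
    # Assumption: metadata is a plain dict; "paused" is the required flag.
    metadata: Dict[str, Any] = field(default_factory=lambda: {"paused": False})


def pause_operation(op: Operation) -> Operation:
    if op.done:
        raise ValueError("cannot pause a finished operation")
    # A real server would block here until the worker has actually stopped.
    op.metadata["paused"] = True
    return op  # done stays False: paused is not the same as finished


def resume_operation(op: Operation) -> Operation:
    if op.done:
        raise ValueError("cannot resume a finished operation")
    op.metadata["paused"] = False
    return op


def cancel_operation(op: Operation, leftover_artifacts: List[str]) -> Operation:
    if op.done:
        return op  # nothing left to cancel
    # A real server would block until the work has stopped and cleanup has run.
    op.metadata["paused"] = False
    # Hypothetical key: artifacts that could not be removed automatically.
    op.metadata["leftoverArtifacts"] = list(leftover_artifacts)
    op.done = True
    return op
```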

Because creators of LROs may crash or lose local state, provide ListOperations() to discover operations without knowing IDs ahead of time.
ListOperations() must support filtering and querying metadata (e.g., list all non-done operations, list all paused operations).
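
A minimal in-process sketch of such filtering is shown below. A real API would more likely accept a filter expression in the request; the predicate functions here (`not_done`, `paused`) are stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Optional


@dataclass
class Operation:
    id: str
    done: bool = False
    metadata: Dict[str, Any] = field(default_factory=dict)


def list_operations(operations: Iterable[Operation],
                    matches: Optional[Callable[[Operation], bool]] = None
                    ) -> List[Operation]:
    """Return all operations, or only those accepted by the predicate."""
    return [op for op in operations if matches is None or matches(op)]


def not_done(op: Operation) -> bool:
    return not op.done


def paused(op: Operation) -> bool:
    # Filters on a metadata field rather than a top-level field.
    return bool(op.metadata.get("paused", False))
```

A client that lost its local state could call `list_operations(ops, not_done)` to rediscover the work it started.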

LROs are implicitly created and quickly transition from useful (while running) to stale (after completion).
Persistence strategies:
- Keep Operation resources forever (simplest; default unless storage cost is prohibitive).
- Rolling window / expiration: purge based on expireTime (use completion time, not creation time). Common policy: delete if older than a fixed window (e.g., 30 days).
- Avoid complex schemes (type-dependent TTLs, cascading deletes, archive-then-delete) as they cause confusion.
Expiration policy should not depend on the operation’s result type—use uniform expiration to avoid surprising disappearances.
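
The rolling-window policy could be implemented along these lines. The 30-day window comes from the example above; the function names and the representation of stored operations as a mapping from ID to completion time are assumptions for the sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

RETENTION = timedelta(days=30)  # one uniform window for every operation type


def expire_time(complete_time: Optional[datetime]) -> Optional[datetime]:
    # Based on completion time, so a running operation never expires.
    if complete_time is None:
        return None
    return complete_time + RETENTION


def purge_expired(operations: Dict[str, Optional[datetime]],
                  now: datetime) -> List[str]:
    """Delete expired entries from {id: completion time}; return their IDs."""
    expired = [op_id for op_id, completed in operations.items()
               if completed is not None and completed + RETENTION <= now]
    for op_id in expired:
        del operations[op_id]
    return expired
```

Measuring from completion rather than creation avoids purging an operation that ran for a long time and only just finished.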

Letting requests block until complete is simplest for clients but poor for distributed systems and fragile to connection loss.
WaitOperation simplifies client code but shifts complexity to the server (connection management, SLAs).
LROs introduce new resource patterns (implicit creation, parameterized types, operational metadata) that are more complex than traditional resources but provide powerful, flexible behavior.
Keep a synchronous variant of API methods available if users rely on immediate results.