Long-running Operations - Summary

Motivation

  • Immediate, synchronous API methods work for quick operations but fail for work that is long, resource-intensive, or involves external services.
  • Handling slow work by “just waiting longer” makes for a poor developer experience (hung requests, repeated debugging cycles) and a poor user experience.
  • We need an API equivalent of language-level Promises/Futures so clients can start work and then track, wait for, or cancel it.

Overview

  • Introduce Long-Running Operations (LROs) as an API-level analogue to Promises/Futures.
  • API methods that start long-running work return an LRO in place of the final result.
  • LROs are remote, persistent resources (unlike in-process Promises) and must be treated as API resources with identifiers and storage.
  • LROs carry both the eventual result type and metadata about the operation (e.g., progress, timestamps).

Implementation (what you must provide)

  1. Operation resource type

    • Identifier (id).
    • Optional result (parameterized ResultT) or an OperationError (code, message, details).
    • Boolean done flag indicating that the operation has finished; it says nothing about whether the operation succeeded or failed.
    • metadata field (parameterized MetadataT) for progress and operational information.
  2. API methods to manage LROs

    • Methods that create operations implicitly: any method that starts long-running work returns Operation<ResultT, MetadataT>.
    • GetOperation(id) — fetch current Operation resource.
    • WaitOperation(id) — block until operation resolves/rejects (server-side wait).
    • CancelOperation(id) — request cancellation and block until cancel completes.
    • Optional PauseOperation / ResumeOperation when supported.
    • ListOperations() with filtering (including metadata queries) to discover operations.
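
The resource type and the basic accessor methods listed above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `OperationStore`, its in-memory dictionary, and the `create`, `resolve`, and `reject` helpers are hypothetical names introduced here, and a real service would back them with durable storage.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Generic, List, Optional, TypeVar

ResultT = TypeVar("ResultT")
MetadataT = TypeVar("MetadataT")


@dataclass
class OperationError:
    code: str                        # machine-readable error code
    message: str                     # human-readable description
    details: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Operation(Generic[ResultT, MetadataT]):
    id: str
    done: bool = False               # finished; says nothing about success
    result: Optional[ResultT] = None
    error: Optional[OperationError] = None
    metadata: Optional[MetadataT] = None


class OperationStore:
    """Hypothetical in-memory stand-in for durable Operation storage."""

    def __init__(self) -> None:
        self._ops: Dict[str, Operation] = {}

    def create(self, metadata: Any = None) -> Operation:
        # Called by any API method that starts long-running work.
        op = Operation(id=uuid.uuid4().hex, metadata=metadata)
        self._ops[op.id] = op
        return op

    def get_operation(self, op_id: str) -> Operation:
        # A KeyError here means the Operation resource itself is missing.
        return self._ops[op_id]

    def list_operations(self, only_pending: bool = False) -> List[Operation]:
        return [op for op in self._ops.values()
                if not (only_pending and op.done)]

    def resolve(self, op_id: str, result: Any) -> None:
        op = self._ops[op_id]
        op.result, op.done = result, True

    def reject(self, op_id: str, error: OperationError) -> None:
        op = self._ops[op_id]
        op.error, op.done = error, True
```

Note that `done` flips to true in both `resolve` and `reject`; callers tell the two apart by checking whether `error` is set.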

Resource hierarchy

  • Prefer a centralized top-level collection of Operation resources (e.g., /operations/{id}) rather than nesting under specific resources.

    • Nesting complicates discovery (you must know parent IDs) and prevents easy system-wide querying.

Resolution — how clients obtain final results

  • Two primary styles adapted from language Promises:

    1. Polling: client repeatedly calls GetOperation (possibly with backoff) until done is true.

      • Simple, client-driven; wastes network/compute if polled aggressively.
    2. Waiting (blocking via WaitOperation): client issues WaitOperation once; server holds the connection and returns when done.

      • Cleaner client code; server must manage long connections and resources.
  • Use a separate WaitOperation method (not a boolean flag on GetOperation) to keep semantics clear and to allow distinct monitoring/SLA treatment.
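
The polling style can be sketched as a client-side loop with exponential backoff. The `get_operation` callable, the dict-shaped operation, and the delay and timeout defaults are illustrative assumptions, not part of any particular API.

```python
import time


def poll_until_done(get_operation, op_id, initial_delay=0.5, max_delay=30.0,
                    multiplier=2.0, timeout=600.0, sleep=time.sleep):
    """Call get_operation(op_id) until the returned operation is done.

    The delay between calls grows by `multiplier` up to `max_delay`, so a
    slow operation is not polled aggressively.
    """
    delay = initial_delay
    deadline = time.monotonic() + timeout
    while True:
        op = get_operation(op_id)
        if op["done"]:
            return op
        # Give up rather than sleep past the overall deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"operation {op_id} not done after {timeout}s")
        sleep(delay)
        delay = min(delay * multiplier, max_delay)
```

With WaitOperation, this whole loop collapses into a single call on the client, and the equivalent waiting happens on the server instead.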

Error handling

  • Do not conflate transport/GET errors with operation result errors.

    • GetOperation / WaitOperation should only return HTTP errors if retrieving the Operation resource itself failed.
    • The operation’s failure is modeled inside the Operation as an OperationError (machine-readable code, human message, details structure).
  • Error codes must be unique and machine-readable so that client code can branch on them; clients should never have to parse the message text.

  • Use details (structured) for extra error data clients may need.
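
A client-side sketch of this separation follows. The `OperationNotFound` exception, the `QUOTA_EXCEEDED` code, and the `limit_bytes` detail are hypothetical names chosen for illustration; the point is only that a retrieval failure arrives as an exception, while an operation failure arrives as data inside a successfully fetched Operation.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class OperationError:
    code: str
    message: str
    details: dict = field(default_factory=dict)


@dataclass
class Operation:
    id: str
    done: bool = False
    result: Any = None
    error: Optional[OperationError] = None


class OperationNotFound(Exception):
    """Retrieval failure: the Operation resource could not be fetched."""


def describe_outcome(fetch, op_id: str) -> str:
    try:
        op = fetch(op_id)                 # may fail like an HTTP 404 would
    except OperationNotFound:
        return "retrieval failed: no such operation"
    if not op.done:
        return "still running"
    if op.error is not None:
        # Branch on the machine-readable code, never on the message text.
        if op.error.code == "QUOTA_EXCEEDED":
            limit = op.error.details.get("limit_bytes")
            return f"operation failed: quota exceeded (limit {limit} bytes)"
        return f"operation failed: {op.error.code}"
    return f"succeeded: {op.result}"
```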

Monitoring progress

  • Use the metadata field (MetadataT) to expose progress: timestamps, processed counts, estimated remaining time, etc.
  • Percent complete is optional and sometimes misleading; prefer meaningful domain metrics (records processed, bytes processed, tasks completed).
  • Clients can poll GetOperation to read metadata and report progress.
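
As a rough illustration of domain metrics in metadata, the sketch below assumes a hypothetical `ExportMetadata` type with a start time and record counts, and derives a progress line from it. The remaining-time figure is a simple linear extrapolation, which is itself an assumption that may not hold for real workloads.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExportMetadata:
    start_time: datetime
    records_processed: int
    records_total: Optional[int] = None    # may be unknown up front


def progress_line(meta: ExportMetadata, now: datetime) -> str:
    elapsed = (now - meta.start_time).total_seconds()
    line = f"{meta.records_processed} records processed in {elapsed:.0f}s"
    # Estimate remaining time only when a total and a processing rate exist.
    if meta.records_total and meta.records_processed and elapsed > 0:
        rate = meta.records_processed / elapsed
        remaining = (meta.records_total - meta.records_processed) / rate
        line += f", ~{remaining:.0f}s remaining"
    return line
```

A client would call something like `progress_line` on the metadata returned by each GetOperation poll.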

Canceling, pausing, resuming
#

  • CancelOperation should return the Operation (with done=true) only after cancellation is complete. The server should remove intermediate artifacts where possible; otherwise, metadata should reference what is left behind so that clients can clean it up.

  • PauseOperation / ResumeOperation:

    • Only supported when it makes sense and is implementable.
    • Do not repurpose done to indicate paused state; instead require metadata to include a paused: boolean field if pause/resume is supported.
    • Methods should block until the pause or resume action is fully applied.
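
The state rules above can be sketched as plain functions over an in-memory Operation. Everything here is an assumption made for illustration: the `JobMetadata` type, the `leftover_artifacts` field, and recording cancellation as a `CANCELLED` error code are not prescribed by the text. Because the functions run synchronously, they trivially satisfy "block until applied"; a real server would wait for the worker to acknowledge the change before returning.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobMetadata:
    paused: bool = False                     # pause state lives here, not in done
    leftover_artifacts: List[str] = field(default_factory=list)


@dataclass
class Operation:
    id: str
    done: bool = False
    error_code: Optional[str] = None
    metadata: JobMetadata = field(default_factory=JobMetadata)


def pause_operation(op: Operation) -> Operation:
    if op.done:
        raise ValueError("cannot pause a finished operation")
    op.metadata.paused = True
    return op


def resume_operation(op: Operation) -> Operation:
    if op.done:
        raise ValueError("cannot resume a finished operation")
    op.metadata.paused = False
    return op


def cancel_operation(op: Operation, leftover_artifacts=()) -> Operation:
    # Cancelling an already finished operation is treated as a no-op.
    if not op.done:
        op.metadata.paused = False
        op.metadata.leftover_artifacts = list(leftover_artifacts)
        op.error_code = "CANCELLED"          # assumed convention
        op.done = True
    return op
```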

Exploring operations

  • Because creators of LROs may crash or lose local state, provide ListOperations() to discover operations without knowing IDs ahead of time.
  • ListOperations() must support filtering and querying metadata (e.g., list all non-done operations, list all paused operations).
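
A minimal sketch of such filtering, assuming operations are represented as dictionaries with `done` and `metadata` keys and that metadata queries are simple equality matches. Both are simplifications; a real API would more likely accept a filter expression string.

```python
def list_operations(operations, done=None, metadata_filter=None):
    """Return the operations matching every filter that was supplied.

    done: if True or False, keep only operations in that state.
    metadata_filter: dict of metadata keys to required values.
    """
    matches = []
    for op in operations:
        if done is not None and op["done"] != done:
            continue
        metadata = op.get("metadata") or {}
        if metadata_filter and any(
            metadata.get(key) != value for key, value in metadata_filter.items()
        ):
            continue
        matches.append(op)
    return matches
```

For example, `list_operations(ops, done=False)` answers "what is still running?" after a client crash, without the client needing any stored IDs.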

Persistence

  • LROs are implicitly created and quickly transition from useful (while running) to stale (after completion).

  • Persistence strategies:

    • Keep Operation resources forever (simplest; default unless storage cost is prohibitive).
    • Rolling window / expiration: give each Operation an expireTime measured from its completion time, not its creation time, and purge it once that time has passed. A common policy is a fixed window after completion (e.g., 30 days).
    • Avoid complex schemes (type-dependent TTLs, cascading deletes, archive-then-delete) as they cause confusion.
  • Expiration policy should not depend on the operation’s result type—use uniform expiration to avoid surprising disappearances.
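
The rolling-window strategy can be sketched as follows, assuming dictionary-shaped operations with an optional `completion_time` key and the 30-day window from the example above. The key names and the `purge_expired` helper are hypothetical.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=30)    # one uniform window for every result type


def expire_time(completion_time: datetime) -> datetime:
    return completion_time + RETENTION


def purge_expired(operations, now: datetime):
    """Return the operations that should be kept at time `now`."""
    kept = []
    for op in operations:
        completed = op.get("completion_time")
        # Running operations have no completion time and never expire,
        # no matter how long ago they were created.
        if completed is not None and expire_time(completed) <= now:
            continue
        kept.append(op)
    return kept
```

Measuring from completion rather than creation means an operation that ran for 29 days is still retrievable for a full window after it finishes.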

Trade-offs and guidance

  • Letting requests block until the work completes is simplest for clients, but it works poorly in distributed systems and fails whenever the connection drops.
  • WaitOperation simplifies client code but shifts complexity to the server (connection management, SLAs).
  • LROs introduce new resource patterns (implicit creation, parameterized types, operational metadata) that are more complex than traditional resources but provide powerful, flexible behavior.
  • Keep a synchronous variant of API methods available if users rely on immediate results.