Terraform State Is a Graph, Not a File

Terraform’s file-based state model imposes a global lock and whole-file reads even for small, localized changes, causing contention and slow refreshes at scale. Stategraph treats state as a graph and applies established distributed systems techniques (MVCC, subgraph isolation, and ordered locking) to enable parallelism and scope refreshes to affected subgraphs. Implemented on PostgreSQL and compatible with Terraform’s remote backend, it removes the global-lock bottleneck and makes state queryable and auditable.

Key Points

Terraform’s file-based state takes a global lock even though an operation typically touches only a small subgraph of resources, which creates systemic contention and limits scalability.
Splitting state files is not a real fix; it multiplies coordination problems and introduces distributed transaction complexity across states.
Representing state as a graph enables subgraph isolation, precise row/edge-level locking, and MVCC, allowing safe parallelism and non-blocking reads.
Graph-aware refresh limits work to the affected change cone instead of traversing the entire state, yielding significant performance gains.
Stategraph implements this model using PostgreSQL (resources, dependencies, transactions) while remaining protocol-compatible with Terraform/OpenTofu and requiring no config changes.
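The change-cone and subgraph-isolation points above can be sketched in a few lines. The summary does not define the cone precisely, so this sketch assumes it means the changed resources plus everything that transitively depends on them. The `change_cone` function and the resource addresses are hypothetical, chosen only for illustration.

```python
from collections import defaultdict, deque


def change_cone(dependencies, changed):
    """Return the changed resources plus all of their transitive dependents.

    `dependencies` maps each resource to the resources it depends on.
    """
    # Invert the edges: for each resource, record who depends on it.
    dependents = defaultdict(set)
    for resource, deps in dependencies.items():
        for dep in deps:
            dependents[dep].add(resource)

    # Breadth-first walk from the changed resources along dependent edges.
    cone = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in dependents[node]:
            if child not in cone:
                cone.add(child)
                queue.append(child)
    return cone


# Hypothetical state graph with three independent parts.
deps = {
    "aws_vpc.main": [],
    "aws_subnet.a": ["aws_vpc.main"],
    "aws_instance.web": ["aws_subnet.a"],
    "aws_s3_bucket.logs": [],
    "aws_iam_role.app": [],
}

cone_a = change_cone(deps, ["aws_subnet.a"])
cone_b = change_cone(deps, ["aws_s3_bucket.logs"])

# Operations whose cones share no resources need no common lock.
can_run_in_parallel = cone_a.isdisjoint(cone_b)
```

Here a change to the subnet pulls in only the instance that depends on it, and a change to the log bucket pulls in nothing else, so under this assumption the two operations could proceed concurrently and each refresh would visit only its own cone.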

Sentiment

Cautiously positive: many find the graph-based, fine-grained locking approach compelling for large, shared infrastructures, while others prefer the simplicity of current state workflows and question the need outside big-scale scenarios.

In Agreement

Global state locks and full-state refreshes become serious bottlenecks as repositories and teams grow; locking only the affected resources is the right fix.
Splitting state files reduces contention but fractures the dependency graph and adds cross-state orchestration complexity; a single cohesive graph view is preferable.
A graph-aware, transactional backend that permits concurrent operations on disjoint subgraphs matches Terraform’s real access patterns.
Maintaining inspectability (via CLI/UI) while improving concurrency is valuable; state should remain queryable and auditable.
A drop-in backend that imports existing tfstate and doesn’t require refactoring code addresses a common pain point for large orgs.

Opposed

The simplicity and transparency of a single JSON state file is a strength: it’s easy to understand, troubleshoot, and manually repair when drift occurs, so it shouldn’t be turned into a black box.
Many teams avoid contention by splitting state per microservice/module (Terragrunt, Atmos) and haven’t experienced scaling problems.
Existing remote backends with coarse locks (e.g., DynamoDB) or splitting states are ‘good enough’ for most use cases.
Toolchain compatibility is a concern (e.g., with Spacelift/Env0), as is whether this introduces a new runner/agent surface area.
Why not use simpler storage like SQLite in S3 with S3 locking, rather than a more complex Postgres-backed system?