docs: restructure LLM documentation for better PR quality (#961)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Author: Cory LaNou (committed by GitHub)
Date: 2026-01-06 13:34:01 -06:00
Parent: fc1c254922
Commit: b946989553
7 changed files with 766 additions and 1262 deletions

AGENTS.md

@@ -1,942 +1,63 @@
# AGENT.md - Litestream AI Agent Documentation
# AGENTS.md - Litestream AI Agent Guide
This document provides comprehensive guidance for AI agents working with the Litestream codebase. Read this document carefully before making any modifications.
Litestream is a disaster recovery tool for SQLite that runs as a background process, monitors the WAL, converts changes to immutable LTX files, and replicates them to cloud storage. It uses `modernc.org/sqlite` (pure Go, no CGO required).
## Table of Contents
## Before You Start
- [Overview](#overview)
- [Fundamental Concepts](#fundamental-concepts)
- [Core Architecture](#core-architecture)
- [Critical Concepts](#critical-concepts)
- [Architectural Boundaries and Patterns](#architectural-boundaries-and-patterns)
- [Common Pitfalls](#common-pitfalls)
- [Component Guide](#component-guide)
- [Performance Considerations](#performance-considerations)
- [Testing Requirements](#testing-requirements)
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) for contribution requirements
2. Check [CONTRIBUTING.md](CONTRIBUTING.md) for what we accept (bug fixes welcome, features need discussion)
3. Review recent PRs for current patterns
## Overview
## Critical Rules
Litestream is a **disaster recovery tool for SQLite** that runs as a background process and safely replicates changes incrementally to various storage backends. It monitors SQLite's Write-Ahead Log (WAL), converts changes to an immutable LTX format, and replicates these to configured destinations.
- **Lock page at 1GB**: SQLite reserves page at 0x40000000. Always skip it. See [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md)
- **LTX files are immutable**: Never modify after creation. See [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md)
- **Single replica per database**: Each DB replicates to exactly one destination
- **Use `litestream ltx`**: Not `litestream wal` (deprecated)
**Current Architecture Highlights:**
- **LTX Format**: Page-level replication format replaces direct WAL mirroring
- **Multi-level Compaction**: Hierarchical compaction keeps storage efficient (30s → 5m → 1h → snapshots)
- **Single Replica Constraint**: Each database is replicated to exactly one remote destination
- **Pure Go Build**: Uses `modernc.org/sqlite`, so no CGO dependency for the main binary
- **Optional NATS JetStream Support**: Additional replica backend alongside S3/GCS/ABS/OSS/File/SFTP
- **Snapshot Compatibility**: Only LTX-based backups are supported—keep legacy v0.3.x binaries to restore old WAL snapshots
## Layer Boundaries
**Key Design Principles:**
- **Non-invasive**: Uses only SQLite API, never directly manipulates database files
- **Incremental**: Replicates only changes, not full databases
- **Single-destination**: Exactly one replica destination per database
- **Eventually Consistent**: Handles storage backends with eventual consistency
- **Safe**: Maintains long-running read transactions for consistency
| Layer | File | Responsibility |
|-------|------|----------------|
| DB | `db.go` | Database state, restoration, WAL monitoring |
| Replica | `replica.go` | Replication mechanics only |
| Storage | `**/replica_client.go` | Backend implementations |
## Fundamental Concepts
**CRITICAL**: Understanding SQLite internals and the LTX format is essential for working with Litestream.
### Required Reading
1. **[SQLite Internals](docs/SQLITE_INTERNALS.md)** - Understand WAL, pages, transactions, and the 1GB lock page
2. **[LTX Format](docs/LTX_FORMAT.md)** - Learn the custom replication format Litestream uses
### Key SQLite Concepts
- **WAL (Write-Ahead Log)**: Temporary file containing uncommitted changes
- **Pages**: Fixed-size blocks (typically 4KB) that make up the database
- **Lock Page at 1GB**: Special page at 0x40000000 that MUST be skipped
- **Checkpoints**: Process of merging WAL back into main database
- **Transaction Isolation**: Long-running read transaction for consistency
### Key LTX Concepts
- **Immutable Files**: Once written, LTX files are never modified
- **TXID Ranges**: Each file covers a range of transaction IDs
- **Page Index**: Binary search tree for efficient page lookup
- **Compaction Levels**: Time-based merging to reduce storage (30s → 5min → 1hr)
- **Checksums**: CRC-64 integrity verification at multiple levels
- **CLI Command**: Use `litestream ltx` (not `wal`) for LTX operations
### The Replication Flow
```mermaid
graph LR
App[Application] -->|SQL| SQLite
SQLite -->|Writes| WAL[WAL File]
WAL -->|Monitor| Litestream
Litestream -->|Convert| LTX[LTX Format]
LTX -->|Upload| Storage[Cloud Storage]
Storage -->|Restore| Database[New Database]
```
## Core Architecture
```mermaid
graph TB
subgraph "SQLite Layer"
SQLite[SQLite Database]
WAL[WAL File]
SQLite -->|Writes| WAL
end
subgraph "Litestream Core"
DB[DB Component<br/>db.go]
Replica[Replica Manager<br/>replica.go]
Store[Store<br/>store.go]
DB -->|Manages| Replica
Store -->|Coordinates| DB
end
subgraph "Storage Layer"
RC[ReplicaClient Interface<br/>replica_client.go]
S3[S3 Client]
GCS[GCS Client]
File[File Client]
SFTP[SFTP Client]
Replica -->|Uses| RC
RC -->|Implements| S3
RC -->|Implements| GCS
RC -->|Implements| File
RC -->|Implements| SFTP
end
WAL -->|Monitor Changes| DB
DB -->|Checkpoint| SQLite
```
### Data Flow Sequence
```mermaid
sequenceDiagram
participant App
participant SQLite
participant WAL
participant DB
participant Replica
participant Storage
App->>SQLite: Write Transaction
SQLite->>WAL: Append Changes
loop Monitor (1s interval)
DB->>WAL: Check Size/Changes
WAL-->>DB: Current State
alt WAL Has Changes
DB->>WAL: Read Pages
DB->>DB: Convert to LTX Format
DB->>Replica: Queue LTX File
loop Sync (configurable interval)
Replica->>Storage: WriteLTXFile()
Storage-->>Replica: FileInfo
Replica->>Replica: Update Position
end
end
end
alt Checkpoint Needed
DB->>SQLite: PRAGMA wal_checkpoint
SQLite->>WAL: Merge to Main DB
end
```
## Critical Concepts
### 1. SQLite Lock Page at 1GB Boundary ⚠️
**CRITICAL**: SQLite reserves a special lock page at exactly 1GB (0x40000000 bytes).
```go
// db.go:951-953 - Must skip lock page during replication
lockPgno := ltx.LockPgno(pageSize) // Page number varies by page size
if pgno == lockPgno {
continue // Skip this page - it's reserved by SQLite
}
```
**Lock Page Numbers by Page Size:**
- 4KB pages: 262145 (most common)
- 8KB pages: 131073
- 16KB pages: 65537
- 32KB pages: 32769
**Testing Requirement**: Any changes affecting page iteration MUST be tested with >1GB databases.
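As a sanity check, the table above follows from the formula `LockPgno = (0x40000000 / pageSize) + 1`; production code should always call `ltx.LockPgno(pageSize)` rather than computing it inline. A minimal standalone sketch:

```go
package main

import "fmt"

// lockPgno reproduces the lock-page numbers listed above.
// Real code should use ltx.LockPgno(pageSize) from the LTX library instead.
func lockPgno(pageSize int64) int64 {
	return 0x40000000/pageSize + 1 // the page containing byte offset 1GB
}

func main() {
	for _, ps := range []int64{4096, 8192, 16384, 32768} {
		fmt.Printf("page size %5d => lock page %d\n", ps, lockPgno(ps))
	}
}
```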
### 2. LTX File Format
LTX (Log Transaction) files are **immutable**, append-only files containing:
- Header with transaction IDs (MinTXID, MaxTXID)
- Page data with checksums
- Page index for efficient seeking
- Trailer with metadata
**Important**: LTX files are NOT SQLite WAL files - they're a custom format for efficient replication.
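To make the structure above concrete, here is a rough logical sketch. The field names are illustrative assumptions, not the actual types from the LTX library; see docs/LTX_FORMAT.md for the real specification.

```go
// Illustrative only: a logical view of an LTX file, not the on-disk encoding.
type ltxFileSketch struct {
	Header struct {
		MinTXID, MaxTXID uint64 // transaction ID range covered by this file
		PageSize         uint32 // database page size
	}
	Pages []struct {
		Pgno     uint32 // database page number
		Data     []byte // page contents
		Checksum uint64 // per-page integrity check
	}
	PageIndex []byte // lookup structure for efficient page seeking
	Trailer   struct {
		Checksum uint64 // whole-file integrity check (CRC-64)
	}
}
```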
### 3. Compaction Process
Compaction merges multiple LTX files to reduce storage overhead:
```mermaid
flowchart LR
subgraph "Level 0 (Raw)"
L0A[0000000001-0000000100.ltx]
L0B[0000000101-0000000200.ltx]
L0C[0000000201-0000000300.ltx]
end
subgraph "Level 1 (30 seconds)"
L1[0000000001-0000000300.ltx]
end
subgraph "Level 2 (5 minutes)"
L2[0000000001-0000001000.ltx]
end
subgraph "Level 3 (1 hour)"
L3[0000000001-0000002000.ltx]
end
subgraph "Snapshot (24h)"
Snap[snapshot.ltx]
end
L0A -->|Merge| L1
L0B -->|Merge| L1
L0C -->|Merge| L1
L1 -->|30s window| L2
L2 -->|5min window| L3
L3 -->|Hourly| Snap
```
**Critical Compaction Rule**: When compacting with eventually consistent storage:
```go
// db.go:1280-1294 - ALWAYS read from local disk when available
f, err := os.Open(db.LTXPath(info.Level, info.MinTXID, info.MaxTXID))
if err == nil {
// Use local file - it's complete and consistent
return f, nil
}
// Only fall back to remote if local doesn't exist
return replica.Client.OpenLTXFile(...)
```
### 4. Eventual Consistency Handling
Many storage backends (S3, R2, etc.) are eventually consistent. This means:
- A file you just wrote might not be immediately readable
- A file might be listed but only partially available
- Reads might return stale or incomplete data
**Solution**: Always prefer local files during compaction.
## Architectural Boundaries and Patterns
**CRITICAL**: Understanding proper architectural boundaries is essential for successful contributions.
### Layer Responsibilities
```mermaid
graph TB
subgraph "DB Layer (db.go)"
DBInit[DB.init&#40;&#41;]
DBPos[DB position tracking]
DBRestore[Database state validation]
DBSnapshot[Snapshot triggering via verify&#40;&#41;]
end
subgraph "Replica Layer (replica.go)"
ReplicaStart[Replica.Start&#40;&#41;]
ReplicaSync[Sync operations]
ReplicaPos[Replica position tracking]
ReplicaClient[Storage interaction]
end
subgraph "Storage Layer"
S3[S3/GCS/Azure]
LTXFiles[LTX Files]
end
DBInit -->|Initialize| ReplicaStart
DBInit -->|Check positions| DBPos
DBInit -->|Validate state| DBRestore
ReplicaStart -->|Focus on replication only| ReplicaSync
ReplicaSync -->|Upload/Download| ReplicaClient
ReplicaClient -->|Read/Write| S3
S3 -->|Store| LTXFiles
```
### ✅ DO: Handle database state in DB layer
**Principle**: Database restoration logic belongs in the DB layer, not the Replica layer.
**Pattern**: When the database is behind the replica (local TXID < remote TXID):
1. **Clear local L0 cache**: Remove the entire L0 directory and recreate it
- Use `os.RemoveAll()` on the L0 directory path
- Recreate with proper permissions using `internal.MkdirAll()`
2. **Fetch latest L0 file from replica**: Download the most recent L0 LTX file
- Call `replica.Client.OpenLTXFile()` with the remote min/max TXID
- Stream the file contents (don't load into memory)
3. **Write using atomic file operations**: Prevent partial/corrupted files
- Write to temporary file with `.tmp` suffix
- Call `Sync()` to ensure data is on disk
- Atomically rename temp file to final path
**Why this matters**: If the database state is not synchronized before replication starts, the system will attempt to apply WAL segments that are ahead of the database's current position, leading to restore failures.
**Reference Implementation**: See `DB.checkDatabaseBehindReplica()` in db.go:670-737
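A condensed sketch of those three steps follows. The L0 directory derivation, argument values, and error handling are simplified assumptions; the authoritative version is `DB.checkDatabaseBehindReplica()`.

```go
// Simplified sketch only; see DB.checkDatabaseBehindReplica() in db.go for the
// real implementation. The L0 directory is derived from LTXPath here for brevity.
func (db *DB) syncL0FromReplica(ctx context.Context, r *Replica, minTXID, maxTXID ltx.TXID) error {
	// 1. Clear the local L0 cache and recreate the directory.
	l0Dir := filepath.Dir(db.LTXPath(0, minTXID, maxTXID))
	if err := os.RemoveAll(l0Dir); err != nil {
		return fmt.Errorf("clear l0 dir: %w", err)
	}
	if err := os.MkdirAll(l0Dir, 0o755); err != nil { // real code uses internal.MkdirAll()
		return fmt.Errorf("recreate l0 dir: %w", err)
	}

	// 2. Stream the latest L0 LTX file from the replica (never buffer it in memory).
	rc, err := r.Client.OpenLTXFile(ctx, 0, minTXID, maxTXID, 0, 0)
	if err != nil {
		return fmt.Errorf("open remote ltx: %w", err)
	}
	defer rc.Close()

	// 3. Write atomically: temp file, sync, then rename.
	dst := db.LTXPath(0, minTXID, maxTXID)
	tmp := dst + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return fmt.Errorf("create temp: %w", err)
	}
	defer os.Remove(tmp) // no-op after a successful rename
	if _, err := io.Copy(f, rc); err != nil {
		f.Close()
		return fmt.Errorf("copy ltx: %w", err)
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return fmt.Errorf("sync ltx: %w", err)
	}
	if err := f.Close(); err != nil {
		return fmt.Errorf("close ltx: %w", err)
	}
	return os.Rename(tmp, dst)
}
```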
### ❌ DON'T: Put database state logic in Replica layer
```go
// WRONG - Replica should only handle replication concerns
func (r *Replica) Start() error {
// DON'T check database state here
if needsRestore() { // ❌ Wrong layer!
restoreDatabase() // ❌ Wrong layer!
}
// Replica should focus only on replication mechanics
}
```
### Atomic File Operations Pattern
**CRITICAL**: Always use atomic writes to prevent partial/corrupted files.
### ✅ DO: Write to temp file, then rename
```go
// CORRECT - Atomic file write pattern
func writeFileAtomic(path string, data []byte) error {
// Create temp file in same directory (for atomic rename)
dir := filepath.Dir(path)
tmpFile, err := os.CreateTemp(dir, ".tmp-*")
if err != nil {
return fmt.Errorf("create temp file: %w", err)
}
tmpPath := tmpFile.Name()
// Clean up temp file on error
defer func() {
if tmpFile != nil {
tmpFile.Close()
os.Remove(tmpPath)
}
}()
// Write data to temp file
if _, err := tmpFile.Write(data); err != nil {
return fmt.Errorf("write temp file: %w", err)
}
// Sync to ensure data is on disk
if err := tmpFile.Sync(); err != nil {
return fmt.Errorf("sync temp file: %w", err)
}
// Close before rename
if err := tmpFile.Close(); err != nil {
return fmt.Errorf("close temp file: %w", err)
}
tmpFile = nil // Prevent defer cleanup
// Atomic rename (on same filesystem)
if err := os.Rename(tmpPath, path); err != nil {
os.Remove(tmpPath)
return fmt.Errorf("rename to final path: %w", err)
}
return nil
}
```
### ❌ DON'T: Write directly to final location
```go
// WRONG - Can leave partial files on failure
func writeFileDirect(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // ❌ Not atomic!
}
```
### Error Handling Patterns
### ✅ DO: Return errors immediately
```go
// CORRECT - Return error for caller to handle
func (db *DB) validatePosition() error {
dpos, err := db.Pos()
if err != nil {
return err
}
rpos := replica.Pos()
if dpos.TXID < rpos.TXID {
return fmt.Errorf("database position (%v) behind replica (%v)", dpos, rpos)
}
return nil
}
```
### ❌ DON'T: Continue on critical errors
```go
// WRONG - Silently continuing can cause data corruption
func (db *DB) validatePosition() {
if dpos, _ := db.Pos(); dpos.TXID < replica.Pos().TXID {
log.Printf("warning: position mismatch") // ❌ Don't just log!
// Continuing here is dangerous
}
}
```
### Leveraging Existing Mechanisms
### ✅ DO: Use verify() for snapshot triggering
```go
// CORRECT - Leverage existing snapshot mechanism
func (db *DB) ensureSnapshot() error {
// Use existing verify() which already handles snapshot logic
if err := db.verify(); err != nil {
return fmt.Errorf("verify for snapshot: %w", err)
}
// verify() will trigger snapshot if needed
return nil
}
```
### ❌ DON'T: Reimplement existing functionality
```go
// WRONG - Don't recreate what already exists
func (db *DB) customSnapshot() error {
// ❌ Don't write custom snapshot logic
// when verify() already does this correctly
}
```
## Common Pitfalls
### ❌ DON'T: Mix architectural concerns
```go
// WRONG - Database state logic in Replica layer
func (r *Replica) Start() error {
if db.needsRestore() { // ❌ Wrong layer for DB state!
r.restoreDatabase() // ❌ Replica shouldn't manage DB state!
}
return r.sync()
}
```
### ✅ DO: Keep concerns in proper layers
```go
// CORRECT - Each layer handles its own concerns
func (db *DB) init() error {
// DB layer handles database state
if db.needsRestore() {
if err := db.restore(); err != nil {
return err
}
}
// Then start replica for replication only
return db.replica.Start()
}
func (r *Replica) Start() error {
// Replica focuses only on replication
return r.startSync()
}
```
### ❌ DON'T: Read from remote during compaction
```go
// WRONG - Can get partial/corrupt data
f, err := client.OpenLTXFile(ctx, level, minTXID, maxTXID, 0, 0)
```
### ✅ DO: Read from local when available
```go
// CORRECT - Check local first
if f, err := os.Open(localPath); err == nil {
defer f.Close()
// Use local file
} else {
// Fall back to remote only if necessary
}
```
### ❌ DON'T: Use RLock for write operations
```go
// WRONG - Race condition in replica.go:217
r.mu.RLock() // Should be Lock() for writes
defer r.mu.RUnlock()
r.pos = pos // Writing with RLock!
```
### ✅ DO: Use proper lock types
```go
// CORRECT
r.mu.Lock()
defer r.mu.Unlock()
r.pos = pos
```
### ❌ DON'T: Ignore CreatedAt preservation
```go
// WRONG - Loses timestamp granularity
info := &ltx.FileInfo{
CreatedAt: time.Now(), // Don't use current time
}
```
### ✅ DO: Preserve earliest timestamp
```go
// CORRECT - Preserve temporal information
info, err := replica.Client.WriteLTXFile(ctx, level, minTXID, maxTXID, r)
if err != nil {
return fmt.Errorf("write ltx: %w", err)
}
info.CreatedAt = oldestSourceFile.CreatedAt
```
### ❌ DON'T: Write files without atomic operations
```go
// WRONG - Can leave partial files on failure
func saveLTXFile(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // ❌ Not atomic!
}
```
### ✅ DO: Use atomic write pattern
```go
// CORRECT - Write to temp, then rename
func saveLTXFileAtomic(path string, data []byte) error {
tmpPath := path + ".tmp"
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
return err
}
return os.Rename(tmpPath, path) // Atomic on same filesystem
}
```
### ❌ DON'T: Ignore errors and continue
```go
// WRONG - Continuing after error can corrupt state
func (db *DB) processFiles() {
for _, file := range files {
if err := processFile(file); err != nil {
log.Printf("error: %v", err) // ❌ Just logging!
// Continuing to next file is dangerous
}
}
}
```
### ✅ DO: Return errors for proper handling
```go
// CORRECT - Let caller decide how to handle errors
func (db *DB) processFiles() error {
for _, file := range files {
if err := processFile(file); err != nil {
return fmt.Errorf("process file %s: %w", file, err)
}
}
return nil
}
```
### ❌ DON'T: Recreate existing functionality
```go
// WRONG - Don't reimplement what already exists
func customSnapshotTrigger() {
// Complex custom logic to trigger snapshots
// when db.verify() already does this!
}
```
### ✅ DO: Leverage existing mechanisms
```go
// CORRECT - Use what's already there
func triggerSnapshot() error {
return db.verify() // Already handles snapshot logic correctly
}
```
## Component Guide
### DB Component (db.go)
**Responsibilities:**
- Manages SQLite database connection (via `modernc.org/sqlite` - no CGO)
- Monitors WAL for changes
- Performs checkpoints
- Maintains long-running read transaction
- Converts WAL pages to LTX format
**Key Fields:**
```go
type DB struct {
path string // Database file path
db *sql.DB // SQLite connection
rtx *sql.Tx // Long-running read transaction
pageSize int // Database page size (critical for lock page)
notify chan struct{} // Notifies on WAL changes
}
```
**Initialization Sequence:**
1. Open database connection
2. Read page size from database
3. Initialize long-running read transaction
4. Start monitor goroutine
5. Initialize replicas
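A hedged sketch of that ordering is shown below; the `monitor` loop name and the exact PRAGMA handling are assumptions for illustration, and the real logic lives in the DB open/init path.

```go
// Illustrative ordering only, matching the five steps above.
func (db *DB) initSketch(ctx context.Context) (err error) {
	// 1. Open the database connection (modernc.org/sqlite registers the "sqlite" driver).
	if db.db, err = sql.Open("sqlite", db.path); err != nil {
		return err
	}
	// 2. Read the page size; it determines the 1GB lock page number.
	if err := db.db.QueryRowContext(ctx, `PRAGMA page_size`).Scan(&db.pageSize); err != nil {
		return err
	}
	// 3. Hold a long-running read transaction for a consistent view.
	if db.rtx, err = db.db.Begin(); err != nil {
		return err
	}
	// 4. Start the WAL monitor goroutine (assumed name).
	go db.monitor(ctx)
	// 5. Start the replica so replication can begin.
	return db.replica.Start()
}
```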
### Replica Component (replica.go)
**Responsibilities:**
- Manages replication to a single destination (one replica per DB)
- Tracks replication position (ltx.Pos)
- Handles sync intervals
- Manages encryption (if configured)
**Key Operations:**
- `Sync()`: Synchronizes pending changes
- `SetPos()`: Updates replication position (must use Lock, not RLock!)
- `Snapshot()`: Creates full database snapshot
### ReplicaClient Interface (replica_client.go)
**Required Methods:**
```go
type ReplicaClient interface {
Type() string // Client type identifier
// File operations
LTXFiles(ctx context.Context, level int, seek ltx.TXID, useMetadata bool) (ltx.FileIterator, error)
OpenLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, offset, size int64) (io.ReadCloser, error)
WriteLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, r io.Reader) (*ltx.FileInfo, error)
DeleteLTXFiles(ctx context.Context, files []*ltx.FileInfo) error
DeleteAll(ctx context.Context) error
}
```
**LTXFiles useMetadata Parameter:**
- **`useMetadata=true`**: Fetch accurate timestamps from backend metadata (required for point-in-time restores)
- Slower but provides correct CreatedAt timestamps
- Use when restoring to specific timestamp
- **`useMetadata=false`**: Use fast timestamps (LastModified/ModTime) for normal operations
- Faster enumeration, suitable for synchronization
- Use during replication monitoring
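For example, a small sketch of choosing the flag; the helper and its arguments are illustrative, while the interface signature is the one shown above.

```go
// listL0Files is an illustrative helper; a seek of 0 means "start from the beginning".
func listL0Files(ctx context.Context, client ReplicaClient, forPointInTimeRestore bool) (ltx.FileIterator, error) {
	if forPointInTimeRestore {
		// Accurate CreatedAt timestamps are required, so pay the cost of a metadata fetch.
		return client.LTXFiles(ctx, 0, 0, true)
	}
	// Normal sync/monitoring: fast LastModified/ModTime timestamps are sufficient.
	return client.LTXFiles(ctx, 0, 0, false)
}
```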
**Implementation Requirements:**
- Handle partial reads gracefully
- Implement proper error types (os.ErrNotExist)
- Support seek/offset for efficient page fetching
- Preserve file timestamps when `useMetadata=true`
### Store Component (store.go)
**Responsibilities:**
- Coordinates multiple databases
- Manages compaction schedules
- Controls resource usage
- Handles retention policies
**Default Compaction Levels:**
```go
var defaultLevels = CompactionLevels{
{Level: 0, Interval: 0}, // Raw LTX files (no compaction)
{Level: 1, Interval: 30 * time.Second},
{Level: 2, Interval: 5 * time.Minute},
{Level: 3, Interval: 1 * time.Hour},
// Snapshots created daily (24h retention)
}
```
## Performance Considerations
### O(n) Operations to Watch
1. **Page Iteration**: Linear scan through all pages
- Cache page index when possible
- Use binary search on sorted page lists (see the sketch after this list)
2. **File Listing**: Directory scans can be expensive
- Cache file listings when unchanged
- Use seek parameter to skip old files
3. **Compaction**: Reads all input files
- Limit concurrent compactions
- Use appropriate level intervals
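For the binary-search advice in item 1, a small sketch using the standard library; the `pageIndexEntry` type and its fields are assumptions about how a sorted index might be held in memory.

```go
// pageIndexEntry is an assumed in-memory representation of one index entry.
type pageIndexEntry struct {
	Pgno   uint32 // database page number (index is sorted ascending by Pgno)
	Offset int64  // byte offset of the page's data within the LTX file
}

// findPage binary-searches a sorted page index instead of scanning it linearly.
func findPage(index []pageIndexEntry, target uint32) (pageIndexEntry, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].Pgno >= target })
	if i < len(index) && index[i].Pgno == target {
		return index[i], true
	}
	return pageIndexEntry{}, false
}
```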
### Caching Strategy
```go
// Page index caching example
const DefaultEstimatedPageIndexSize = 32 * 1024 // 32KB
// Fetch end of file first for page index
offset := info.Size - DefaultEstimatedPageIndexSize // info.Size is the total LTX file size
if offset < 0 {
offset = 0
}
// Read page index once, cache for duration of operation
```
### Batch Operations
- Group small writes into larger LTX files
- Batch delete operations for old files
- Use prepared statements for repeated queries
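As one concrete example of the prepared-statement advice, a hedged sketch using `database/sql`; the table, columns, and `Metric` type are illustrative.

```go
// insertMetricsBatch reuses one prepared statement for a batch of inserts
// instead of re-parsing the SQL for every row.
func insertMetricsBatch(sqldb *sql.DB, batch []Metric) error {
	stmt, err := sqldb.Prepare(`INSERT INTO metrics (name, value) VALUES (?, ?)`)
	if err != nil {
		return err
	}
	defer stmt.Close()

	for _, m := range batch {
		if _, err := stmt.Exec(m.Name, m.Value); err != nil {
			return fmt.Errorf("insert %s: %w", m.Name, err)
		}
	}
	return nil
}
```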
## Testing Requirements
### For Any DB Changes
```bash
# Test with various page sizes
./bin/litestream-test populate -db test.db -page-size 4096 -target-size 2GB
./bin/litestream-test populate -db test.db -page-size 8192 -target-size 2GB
# Test lock page handling
./bin/litestream-test validate -source-db test.db -replica-url file:///tmp/replica
```
### For Replica Client Changes
```bash
# Test eventual consistency
go test -v ./replica_client_test.go -integration [s3|gcs|abs|oss|sftp]
# Test partial reads
# (Example) add targeted partial-read tests in your backend package
go test -v -run TestReplicaClient_PartialRead ./...
```
### For Compaction Changes
```bash
# Test with store compaction
go test -v -run TestStore_CompactDB ./...
# Test with eventual consistency mock
go test -v -run TestStore_CompactDB_RemotePartialRead ./...
```
### Race Condition Testing
```bash
# Always run with race detector
go test -race -v ./...
# Specific race-prone areas
go test -race -v -run TestReplica_Sync ./...
go test -race -v -run TestDB_Sync ./...
go test -race -v -run TestStore_CompactDB ./...
```
Database state logic belongs in DB layer, not Replica layer.
## Quick Reference
### File Paths
**Build:**
- **Database**: `/path/to/database.db`
- **Metadata**: `/path/to/database.db-litestream/`
- **LTX Files**: `/path/to/database.db-litestream/ltx/LEVEL/MIN-MAX.ltx`
- **Snapshots**: `/path/to/database.db-litestream/snapshots/TIMESTAMP.ltx`
### Key Configuration
```yaml
l0-retention: 5m # Minimum time to keep compacted L0 files
l0-retention-check-interval: 15s # Frequency for enforcing L0 retention
dbs:
- path: /path/to/db.sqlite
replica:
type: s3
bucket: my-bucket
path: db-backup
sync-interval: 10s # How often to sync
# Compaction configuration (default)
levels:
- level: 1
interval: 30s # 30-second windows
- level: 2
interval: 5m # 5-minute windows
- level: 3
interval: 1h # 1-hour windows
```
```bash
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
```
### Important Constants
**Code quality:**
```go
DefaultMonitorInterval = 1 * time.Second // WAL check frequency
DefaultCheckpointInterval = 1 * time.Minute // Time-based passive checkpoint frequency
DefaultMinCheckpointPageN = 1000 // Min pages before passive checkpoint
DefaultTruncatePageN = 121359 // ~500MB truncate threshold
// Note: DefaultMaxCheckpointPageN was removed (RESTART checkpoint mode permanently removed due to #724).
```
```bash
pre-commit run --all-files
```
## Getting Help
## Documentation
For complex architectural questions, consult:
1. **`docs/SQLITE_INTERNALS.md`** - SQLite fundamentals, WAL format, lock page details
2. **`docs/LTX_FORMAT.md`** - LTX file format specification and operations
3. `docs/ARCHITECTURE.md` - Deep technical details of Litestream components
4. `docs/REPLICA_CLIENT_GUIDE.md` - Storage backend implementation guide
5. `docs/TESTING_GUIDE.md` - Comprehensive testing strategies
6. Review recent PRs for current patterns and best practices
| Document | When to Read |
|----------|--------------|
| [docs/PATTERNS.md](docs/PATTERNS.md) | Code patterns and anti-patterns |
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Deep component details |
| [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) | WAL format, lock page |
| [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md) | Replication format |
| [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) | Test strategies |
| [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md) | Adding storage backends |
## Future Roadmap
## Checklist
**Planned Features:**
- **Litestream VFS**: Virtual File System for read replicas
- Instantly spin up database copies
- Background hydration from S3
- Enables scaling read operations without full database downloads
- **Enhanced read replica support**: Direct reads from remote storage
Before submitting changes:
## Important Constraints
1. **Single Replica Authority**: Each database is replicated to exactly one remote target—configure redundancy at the storage layer if needed.
2. **Legacy Backups**: Pre-LTX (v0.3.x) WAL snapshots cannot be restored with current binaries; keep an old binary around to hydrate those backups before re-replicating.
3. **CLI Changes**: Use `litestream ltx` for LTX inspection; `litestream wal` is deprecated.
4. **Pure Go Build**: The default build is CGO-free via `modernc.org/sqlite`; enable CGO only for optional VFS tooling.
5. **Page-Level Compaction**: Expect compaction to merge files across 30s/5m/1h windows plus daily snapshots.
## Final Checklist Before Making Changes
- [ ] Read this entire document
- [ ] Read `docs/SQLITE_INTERNALS.md` for SQLite fundamentals
- [ ] Read `docs/LTX_FORMAT.md` for replication format details
- [ ] Understand current constraints (single replica authority, LTX-only restores)
- [ ] Understand the component you're modifying
- [ ] Understand architectural boundaries (DB vs Replica responsibilities)
- [ ] Check for eventual consistency implications
- [ ] Consider >1GB database edge cases (lock page at 0x40000000)
- [ ] Use atomic file operations (temp file + rename)
- [ ] Return errors properly (don't just log and continue)
- [ ] Leverage existing mechanisms (e.g., verify() for snapshots)
- [ ] Plan appropriate tests
- [ ] Review recent similar PRs for patterns
- [ ] Use proper locking (Lock vs RLock)
- [ ] Preserve timestamps where applicable
- [ ] Test with race detector enabled
## Agent-Specific Instructions
This document serves as the universal source of truth for all AI coding assistants. Different agents may access it through various paths:
- **Claude**: Reads `AGENTS.md` directly (also loads `CLAUDE.md` if present)
- **GitHub Copilot**: Via `.github/copilot-instructions.md` symlink
- **Cursor**: Via `.cursorrules` symlink
- **Gemini**: Reads `AGENTS.md` and respects `.aiexclude` patterns
- **Other agents**: Check for `AGENTS.md` or `llms.txt` in repository root
### GitHub Copilot / OpenAI Codex
**Context Window**: 64k tokens (upgrading to 1M with GPT-4.1)
**Best Practices**:
- Use `/explain` command for SQLite internals
- Reference patterns in Common Pitfalls section
- Switch to GPT-5-Codex model for complex refactoring
- Focus on architectural boundaries and anti-patterns
- Leverage workspace indexing for multi-file operations
**Model Selection**:
- Use GPT-4o for quick completions
- Switch to GPT-5 or Claude Opus 4.1 for complex tasks
### Cursor
**Context Window**: Configurable based on model selection
**Best Practices**:
- Enable "codebase indexing" for full repository context
- Use Claude 3.5 Sonnet for architectural questions
- Use GPT-4o for quick inline completions
- Split complex rules into `.cursor/rules/*.mdc` files if needed
- Leverage workspace search before asking questions
**Model Recommendations**:
- **Architecture changes**: Claude 3.5 Sonnet
- **Quick fixes**: GPT-4o or cursor-small
- **Test generation**: Any model with codebase context
### Claude / Claude Code
**Context Window**: 200k tokens standard (1M in beta)
**Best Practices**:
- Full documentation can be loaded (5k lines fits easily)
- Reference `docs/` subdirectory for deep technical details
- Use structured note-taking for complex multi-step tasks
- Leverage MCP tools when available
- Check `CLAUDE.md` for project-specific configuration
**Strengths**:
- Deep architectural reasoning
- Complex system analysis
- Large context window utilization
### Google Gemini / Gemini Code Assist
**Context Window**: Varies by tier
**Best Practices**:
- Check `.aiexclude` for files to ignore
- Enable local codebase awareness
- Excellent for test generation and documentation
- Use for code review and security scanning
- Leverage code customization features
**Configuration**:
- Respects `.aiexclude` patterns (like `.gitignore`)
- Can use custom AI rules files
### General Multi-Agent Guidelines
1. **Always start with this document** (AGENTS.md) for project understanding
2. **Check `llms.txt`** for quick navigation to other documentation
3. **Respect architectural boundaries** (DB layer vs Replica layer)
4. **Follow the patterns** in Common Pitfalls section
5. **Test with race detector** for any concurrent code changes
6. **Preserve backward compatibility** with current constraints
### Documentation Hierarchy
```text
Tier 1 (Always read):
- AGENTS.md (this file)
- llms.txt (if you need navigation)
Tier 2 (Read when relevant):
- docs/SQLITE_INTERNALS.md (for WAL/page work)
- docs/LTX_FORMAT.md (for replication work)
- docs/ARCHITECTURE.md (for major changes)
Tier 3 (Reference only):
- docs/TESTING_GUIDE.md (for test scenarios)
- docs/REPLICA_CLIENT_GUIDE.md (for new backends)
```
- [ ] Read relevant docs above
- [ ] Follow patterns in [docs/PATTERNS.md](docs/PATTERNS.md)
- [ ] Test with race detector (`go test -race`)
- [ ] Run `pre-commit run --all-files`
- [ ] For page iteration: test with >1GB databases
- [ ] Show investigation evidence in PR (see [AI_PR_GUIDE.md](AI_PR_GUIDE.md))

AI_PR_GUIDE.md

@@ -0,0 +1,178 @@
# AI-Assisted Contribution Guide
This guide helps AI assistants (and humans using them) submit high-quality PRs to Litestream.
## TL;DR Checklist
Before submitting a PR:
- [ ] **Show your investigation** - Include logs, file patterns, or debug output proving the problem
- [ ] **Define scope clearly** - State what this PR does AND does not do
- [ ] **Include runnable test commands** - Not just descriptions, actual `go test` commands
- [ ] **Reference related issues/PRs** - Show awareness of related work
## What Makes PRs Succeed
Analysis of recent PRs shows successful submissions share these patterns:
### 1. Investigation Artifacts
Show evidence, don't just describe the fix.
**Good:**
```markdown
## Problem
File patterns show excessive snapshot creation after checkpoint:
- 21:43 5.2G snapshot.ltx
- 21:47 5.2G snapshot.ltx (after checkpoint - should not trigger new snapshot)
Debug logs show `verify()` incorrectly detecting position mismatch...
```
**Bad:**
```markdown
## Problem
Snapshots are created too often. This PR fixes it.
```
### 2. Clear Scope Definition
Explicitly state boundaries.
**Good:**
```markdown
## Scope
This PR adds the lease client interface only.
**In scope:**
- LeaseClient interface definition
- Mock implementation for testing
**Not in scope (future PRs):**
- Integration with Store
- Distributed coordination logic
```
**Bad:**
```markdown
## Changes
Added leasing support and also fixed a checkpoint bug I noticed.
```
### 3. Runnable Test Commands
**Good:** Include actual commands that can be run:
```bash
# Unit tests
go test -race -v -run TestDB_CheckpointDoesNotTriggerSnapshot ./...
# Integration test with file backend
go test -v ./replica_client_test.go -integration file
```
**Bad:** Vague descriptions like "Manual testing with file backend" or "Verified it works"
### 4. Before/After Comparison
For behavior changes, show the difference:
**Good:**
```markdown
## Behavior Change
| Scenario | Before | After |
|----------|--------|-------|
| Checkpoint with no changes | Creates snapshot | No snapshot |
| Checkpoint with changes | Creates snapshot | Creates snapshot |
```
## Common Mistakes
### Scope Creep
**Problem:** Mixing unrelated changes in one PR.
**Example:** PR titled "Add lease client" also includes a fix for checkpoint timing.
**Fix:** Split into separate PRs. Reference them: "This PR adds the lease client. The checkpoint fix is in #XXX."
### Missing Root Cause Analysis
**Problem:** Implementing a fix without proving the problem exists.
**Example:** "Add exponential backoff" without showing what's filling disk.
**Fix:** Include investigation showing the actual cause before proposing solution.
### Vague Test Plans
**Problem:** "Tested manually" or "Verified it works."
**Fix:** Include exact commands:
```bash
go test -race -v -run TestSpecificFunction ./...
```
### No Integration Context
**Problem:** Large features without explaining how they fit.
**Fix:** For multi-PR work, explain the phases:
```markdown
This is Phase 1 of 3 for distributed leasing:
1. **This PR**: Lease client interface
2. Future: Store integration
3. Future: Distributed coordination
```
## PR Description Template
Use this structure for PR descriptions:
```text
## Summary
[1-2 sentences: what this PR does]
## Problem
[Evidence of the problem - logs, file patterns, user reports]
## Solution
[Brief explanation of the approach]
## Scope
**In scope:**
- [item]
**Not in scope:**
- [item]
## Test Plan
[Include actual go test commands here]
## Related
- Fixes #XXX
- Related to #YYY
```
## What We Accept
From [CONTRIBUTING.md](CONTRIBUTING.md):
- **Bug fixes** - Welcome, especially with evidence
- **Small improvements** - Performance, code cleanup
- **Documentation** - Always welcome
- **Features** - Discuss in issue first; large features typically implemented internally
## Resources
- [AGENTS.md](AGENTS.md) - Project overview and checklist
- [docs/PATTERNS.md](docs/PATTERNS.md) - Code patterns
- [CONTRIBUTING.md](CONTRIBUTING.md) - Contribution guidelines

CLAUDE.md

@@ -1,225 +1,40 @@
# CLAUDE.md - Claude Code Optimizations for Litestream
# CLAUDE.md - Claude Code Configuration
This file is automatically loaded by Claude Code and provides Claude-specific optimizations. For comprehensive project documentation, see AGENTS.md.
Claude-specific optimizations for Litestream. See [AGENTS.md](AGENTS.md) for project documentation.
## Claude-Specific Optimizations
## Context Window
**Primary Documentation**: See AGENTS.md for comprehensive architectural guidance, patterns, and anti-patterns.
With Claude's large context window, load documentation as needed:
### Context Window Advantages
- Start with [AGENTS.md](AGENTS.md) for overview and checklist
- Load [docs/PATTERNS.md](docs/PATTERNS.md) when writing code
- Load [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) for WAL/page work
With Claude's 200k token context window, you can load the entire documentation suite:
- Full AGENTS.md for patterns and anti-patterns
- All docs/ subdirectory files for deep technical understanding
- Multiple source files simultaneously for cross-referencing
## Claude-Specific Resources
### Key Focus Areas for Claude
### Specialized Agents (.claude/agents/)
1. **Architectural Reasoning**: Leverage deep understanding of DB vs Replica layer boundaries
2. **Complex Analysis**: Use full context for multi-file refactoring
3. **SQLite Internals**: Reference docs/SQLITE_INTERNALS.md for WAL format details
4. **LTX Format**: Reference docs/LTX_FORMAT.md for replication specifics
- `sqlite-expert.md` - SQLite WAL and page management
- `replica-client-developer.md` - Storage backend implementation
- `ltx-compaction-specialist.md` - LTX format and compaction
- `test-engineer.md` - Testing strategies
- `performance-optimizer.md` - Performance optimization
### Claude-Specific Resources
### Commands (.claude/commands/)
#### Specialized Agents (.claude/agents/)
- `/analyze-ltx` - Analyze LTX file structure
- `/debug-wal` - Debug WAL replication issues
- `/test-compaction` - Test compaction scenarios
- `/trace-replication` - Trace replication flow
- `/validate-replica` - Validate replica client
- `/add-storage-backend` - Create new storage backend
- `/fix-common-issues` - Diagnose common problems
- `/run-comprehensive-tests` - Execute full test suite
- **sqlite-expert.md**: SQLite WAL and page management expertise
- **replica-client-developer.md**: Storage backend implementation
- **ltx-compaction-specialist.md**: LTX format and compaction
- **test-engineer.md**: Comprehensive testing strategies
- **performance-optimizer.md**: Performance and resource optimization
#### Commands (.claude/commands/)
- `/analyze-ltx`: Analyze LTX file structure and contents
- `/debug-wal`: Debug WAL replication issues
- `/test-compaction`: Test compaction scenarios
- `/trace-replication`: Trace replication flow
- `/validate-replica`: Validate replica client implementation
- `/add-storage-backend`: Create new storage backend
- `/fix-common-issues`: Diagnose and fix common problems
- `/run-comprehensive-tests`: Execute full test suite
Use these commands with: `<command> [arguments]` in Claude Code.
## Overview
Litestream is a standalone disaster recovery tool for SQLite that runs as a background process and safely replicates changes incrementally to another file or S3. It works through the SQLite API to prevent database corruption.
## Build and Development Commands
### Building
## Quick Commands
```bash
# Build the main binary
go build ./cmd/litestream
# Install the binary
go install ./cmd/litestream
# Build for specific platforms (using Makefile)
make docker # Build Docker image
make dist-linux # Build Linux AMD64 distribution
make dist-linux-arm # Build Linux ARM distribution
make dist-linux-arm64 # Build Linux ARM64 distribution
make dist-macos # Build macOS distribution (requires LITESTREAM_VERSION env var)
```
### Testing
```bash
# Run all tests
go test -v ./...
# Run tests with coverage
go test -v -cover ./...
# Test VFS functionality (requires CGO and explicit vfs build tag)
go test -tags vfs ./cmd/litestream-vfs -v
# Test builds before committing (always use -o bin/ to avoid committing binaries)
go build -o bin/litestream ./cmd/litestream # Test main build (no CGO required)
CGO_ENABLED=1 go build -tags vfs -o bin/litestream-vfs ./cmd/litestream-vfs # Test VFS with CGO
# Run specific integration tests (requires environment setup)
go test -v ./replica_client_test.go -integration s3
go test -v ./replica_client_test.go -integration gcs
go test -v ./replica_client_test.go -integration abs
go test -v ./replica_client_test.go -integration oss
go test -v ./replica_client_test.go -integration sftp
```
### Code Quality
```bash
# Format code
go fmt ./...
goimports -local github.com/benbjohnson/litestream -w .
# Run linters
go vet ./...
staticcheck ./...
# Run pre-commit hooks (includes trailing whitespace, goimports, go-vet, staticcheck)
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
pre-commit run --all-files
```
## Architecture
### Core Components
**DB (`db.go`)**: Manages a SQLite database instance with WAL monitoring, checkpoint management, and metrics. Handles replication coordination and maintains long-running read transactions for consistency.
**Replica (`replica.go`)**: Connects a database to replication destinations via ReplicaClient interface. Manages periodic synchronization and maintains replication position.
**ReplicaClient Interface** (`replica_client.go`): Abstraction for different storage backends (S3, GCS, Azure Blob Storage, OSS, SFTP, file system, NATS). Each implementation handles snapshot/WAL segment upload and restoration. The `LTXFiles` method includes a `useMetadata` parameter: when true, it fetches accurate timestamps from backend metadata (required for point-in-time restores); when false, it uses fast timestamps for normal operations. During compaction, the system preserves the earliest CreatedAt timestamp from source files to maintain temporal granularity for restoration.
**WAL Processing**: The system monitors SQLite WAL files for changes, segments them into LTX format files, and replicates these segments to configured destinations. Uses SQLite checksums for integrity verification.
### Storage Backends
- **S3** (`s3/replica_client.go`): AWS S3 and compatible storage
- **GCS** (`gs/replica_client.go`): Google Cloud Storage
- **ABS** (`abs/replica_client.go`): Azure Blob Storage
- **OSS** (`oss/replica_client.go`): Alibaba Cloud Object Storage Service
- **SFTP** (`sftp/replica_client.go`): SSH File Transfer Protocol
- **File** (`file/replica_client.go`): Local file system replication
- **NATS** (`nats/replica_client.go`): NATS JetStream object storage
### Command Structure
Main entry point (`cmd/litestream/main.go`) provides subcommands:
- `replicate`: Primary replication daemon mode
- `restore`: Restore database from replica
- `databases`: List configured databases
- `ltx`: WAL/LTX file utilities (renamed from 'wal')
- `version`: Display version information
- `mcp`: Model Context Protocol support
## Key Design Patterns
1. **Non-invasive monitoring**: Uses SQLite API exclusively, no direct file manipulation
2. **Incremental replication**: Segments WAL into small chunks for efficient transfer
3. **Single remote authority**: Each database replicates to exactly one destination
4. **Age encryption**: Optional end-to-end encryption using age identities/recipients
5. **Prometheus metrics**: Built-in observability for monitoring replication health
6. **Timestamp preservation**: Compaction preserves earliest CreatedAt timestamp from source files to maintain temporal granularity for point-in-time restoration
## Configuration
Primary configuration via YAML file (`etc/litestream.yml`) or environment variables. Supports:
- Database paths and replica destinations
- Sync intervals and checkpoint settings
- Authentication credentials for cloud storage
- Encryption keys for age encryption
## Important Notes
- External contributions accepted for bug fixes only (not features)
- Uses pre-commit hooks for code quality enforcement
- Requires Go 1.24+ for build
- Main binary does NOT require CGO
- VFS functionality requires explicit `-tags vfs` build flag AND CGO_ENABLED=1
- **ALWAYS build binaries into `bin/` directory** which is gitignored (e.g., `go build -o bin/litestream`)
- Always test builds with different configurations before committing
## Workflows and Best Practices
- Any time you create/edit markdown files, lint and fix them with markdownlint
## Testing Considerations
### SQLite Lock Page at 1GB Boundary
Litestream handles a critical SQLite edge case: the lock page at exactly 1GB
(offset 0x40000000). This page is reserved by SQLite for file locking and
cannot contain data. The code skips this page during replication (see
db.go:951-953).
**Key Implementation Details:**
- Lock page calculation: `LockPgno = (0x40000000 / pageSize) + 1`
- Located in LTX library: `ltx.LockPgno(pageSize)`
- Must be skipped when iterating through database pages
- Affects databases larger than 1GB regardless of page size
**Testing Requirements:**
1. **Create databases >1GB** to ensure lock page handling works
2. **Test with various page sizes** as lock page number changes:
- 4KB: page 262145 (default, most common)
- 8KB: page 131073
- 16KB: page 65537
- 32KB: page 32769
3. **Verify replication** correctly skips the lock page
4. **Test restoration** to ensure databases restore properly across 1GB boundary
**Quick Test Script:**
```bash
# Create a >1GB test database
sqlite3 large.db <<EOF
PRAGMA page_size=4096;
CREATE TABLE test(data BLOB);
-- Insert enough data to exceed 1GB
WITH RECURSIVE generate_series(value) AS (
SELECT 1 UNION ALL SELECT value+1 FROM generate_series LIMIT 300000
)
INSERT INTO test SELECT randomblob(4000) FROM generate_series;
EOF
# Verify it crosses the 1GB boundary
echo "File size: $(stat -f%z large.db 2>/dev/null || stat -c%s large.db)"
echo "Page count: $(sqlite3 large.db 'PRAGMA page_count')"
echo "Lock page should be at: $((0x40000000 / 4096 + 1))"
# Test replication
./bin/litestream replicate large.db file:///tmp/replica
# Test restoration
./bin/litestream restore -o restored.db file:///tmp/replica
sqlite3 restored.db "PRAGMA integrity_check;"
```

CONTRIBUTING.md

@@ -22,6 +22,22 @@ Thank you for your interest in contributing to Litestream! We value community co
- **Large external feature contributions**: Features carry a long-term maintenance burden. To reduce burnout and maintain code quality, we typically implement major features internally. This allows us to ensure consistency with the overall architecture and maintain the high reliability that Litestream users depend on for disaster recovery
- **Breaking changes**: Changes that break backward compatibility require extensive discussion
## AI-Assisted Contributions
We welcome AI-assisted contributions for bug fixes and small improvements. Whether you're using Claude, Copilot, Cursor, or other AI tools:
**Requirements:**
- **Show your investigation** - Include evidence (logs, file patterns, debug output) proving the problem exists
- **Define scope clearly** - State what the PR does and does not do
- **Include runnable test commands** - Actual `go test` commands, not just descriptions
- **Human review before submission** - You're responsible for the code you submit
**Resources:**
- [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - Detailed guide with templates and examples
- [AGENTS.md](AGENTS.md) - Project overview for AI assistants
## How to Contribute
### Reporting Bugs

GEMINI.md

@@ -1,81 +1,42 @@
# GEMINI.md - Gemini Code Assist Configuration for Litestream
# GEMINI.md - Gemini Code Assist Configuration
This file provides Gemini-specific configuration and notes. For comprehensive project documentation, see AGENTS.md.
Gemini-specific configuration for Litestream. See [AGENTS.md](AGENTS.md) for project documentation.
## Primary Documentation
## Before Contributing
**See AGENTS.md** for complete architectural guidance, patterns, and anti-patterns for working with Litestream.
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - PR quality requirements
2. Read [AGENTS.md](AGENTS.md) - Project overview and checklist
3. Check [CONTRIBUTING.md](CONTRIBUTING.md) - What we accept
## Gemini-Specific Configuration
## File Exclusions
### File Exclusions
Check `.aiexclude` file for patterns of files that should not be shared with Gemini (similar to `.gitignore`).
Check `.aiexclude` for patterns of files that should not be shared with Gemini.
### Strengths for This Project
## Gemini Strengths for This Project
1. **Test Generation**: Excellent at creating comprehensive test suites
2. **Documentation**: Strong at generating and updating documentation
3. **Code Review**: Good at identifying potential issues and security concerns
4. **Local Codebase Awareness**: Enable for full repository understanding
- **Test generation** - Creating comprehensive test suites
- **Documentation** - Generating and updating docs
- **Code review** - Identifying issues and security concerns
- **Local codebase awareness** - Enable for full repository understanding
## Key Project Concepts
## Documentation
### SQLite Lock Page
- Must skip page at 1GB boundary (0x40000000)
- Page number varies by page size (262145 for 4KB pages)
- See docs/SQLITE_INTERNALS.md for details
Load as needed:
### LTX Format
- Immutable replication files
- Named by transaction ID ranges
- See docs/LTX_FORMAT.md for specification
- [docs/PATTERNS.md](docs/PATTERNS.md) - Code patterns when writing code
- [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) - For WAL/page work
- [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) - For test generation
### Architectural Boundaries
- DB layer (db.go): Database state and restoration
- Replica layer (replica.go): Replication only
- Storage layer: ReplicaClient implementations
## Critical Rules
## Testing Focus
- **Lock page at 1GB** - Skip page at 0x40000000
- **LTX files are immutable** - Never modify after creation
- **Layer boundaries** - DB handles state, Replica handles replication
When generating tests:
- Include >1GB database tests for lock page verification
- Add race condition tests with -race flag
- Test various page sizes (4KB, 8KB, 16KB, 32KB)
- Include eventual consistency scenarios
## Common Tasks
### Adding Storage Backend
1. Implement ReplicaClient interface
2. Follow existing patterns (s3/, gs/, abs/)
3. Handle eventual consistency
4. Generate comprehensive tests
### Refactoring
1. Respect layer boundaries (DB vs Replica)
2. Maintain current constraints (single replica authority, LTX-only restores)
3. Use atomic file operations
4. Return errors properly (don't just log)
## Build and Test Commands
## Quick Commands
```bash
# Build without CGO
go build -o bin/litestream ./cmd/litestream
# Test with race detection
go test -race -v ./...
# Test specific backend
go test -v ./replica_client_test.go -integration s3
pre-commit run --all-files
```
## Configuration Reference
See `etc/litestream.yml` for configuration examples. Remember: each database replicates to exactly one remote destination.
## Additional Resources
- llms.txt: Quick navigation index
- docs/: Deep technical documentation
- .claude/commands/: Task-specific commands (if using with Claude Code)

docs/PATTERNS.md

@@ -0,0 +1,433 @@
# Litestream Code Patterns and Anti-Patterns
This document contains detailed code patterns, examples, and anti-patterns for working with Litestream. For a quick overview, see [AGENTS.md](../AGENTS.md).
## Table of Contents
- [Architectural Boundaries](#architectural-boundaries)
- [Atomic File Operations](#atomic-file-operations)
- [Error Handling](#error-handling)
- [Locking Patterns](#locking-patterns)
- [Compaction and Eventual Consistency](#compaction-and-eventual-consistency)
- [Timestamp Preservation](#timestamp-preservation)
- [Common Pitfalls](#common-pitfalls)
- [Component Reference](#component-reference)
## Architectural Boundaries
### Layer Responsibilities
```text
DB Layer (db.go) → Database state, restoration, monitoring
Replica Layer (replica.go) → Replication mechanics only
Storage Layer → ReplicaClient implementations
```
### DO: Handle database state in DB layer
Database restoration logic belongs in the DB layer, not the Replica layer.
When the database is behind the replica (local TXID < remote TXID):
1. **Clear local L0 cache**: Remove the entire L0 directory and recreate it
2. **Fetch latest L0 file from replica**: Download the most recent L0 LTX file
3. **Write using atomic file operations**: Prevent partial/corrupted files
```go
// CORRECT - DB layer handles database state
func (db *DB) init() error {
// DB layer handles database state
if db.needsRestore() {
if err := db.restore(); err != nil {
return err
}
}
// Then start replica for replication only
return db.replica.Start()
}
func (r *Replica) Start() error {
// Replica focuses only on replication
return r.startSync()
}
```
Reference: `DB.checkDatabaseBehindReplica()` in db.go:670-737
### DON'T: Put database state logic in Replica layer
```go
// WRONG - Replica should only handle replication concerns
func (r *Replica) Start() error {
// DON'T check database state here
if needsRestore() { // Wrong layer!
restoreDatabase() // Wrong layer!
}
// Replica should focus only on replication mechanics
}
```
## Atomic File Operations
Always use atomic writes to prevent partial/corrupted files.
### DO: Write to temp file, then rename
```go
// CORRECT - Atomic file write pattern
func writeFileAtomic(path string, data []byte) error {
// Create temp file in same directory (for atomic rename)
dir := filepath.Dir(path)
tmpFile, err := os.CreateTemp(dir, ".tmp-*")
if err != nil {
return fmt.Errorf("create temp file: %w", err)
}
tmpPath := tmpFile.Name()
// Clean up temp file on error
defer func() {
if tmpFile != nil {
tmpFile.Close()
os.Remove(tmpPath)
}
}()
// Write data to temp file
if _, err := tmpFile.Write(data); err != nil {
return fmt.Errorf("write temp file: %w", err)
}
// Sync to ensure data is on disk
if err := tmpFile.Sync(); err != nil {
return fmt.Errorf("sync temp file: %w", err)
}
// Close before rename
if err := tmpFile.Close(); err != nil {
return fmt.Errorf("close temp file: %w", err)
}
tmpFile = nil // Prevent defer cleanup
// Atomic rename (on same filesystem)
if err := os.Rename(tmpPath, path); err != nil {
os.Remove(tmpPath)
return fmt.Errorf("rename to final path: %w", err)
}
return nil
}
```
### DON'T: Write directly to final location
```go
// WRONG - Can leave partial files on failure
func writeFileDirect(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // Not atomic!
}
```
## Error Handling
### DO: Return errors immediately
```go
// CORRECT - Return error for caller to handle
func (db *DB) validatePosition() error {
dpos, err := db.Pos()
if err != nil {
return err
}
rpos := replica.Pos()
if dpos.TXID < rpos.TXID {
return fmt.Errorf("database position (%v) behind replica (%v)", dpos, rpos)
}
return nil
}
```
### DON'T: Continue on critical errors
```go
// WRONG - Silently continuing can cause data corruption
func (db *DB) validatePosition() {
if dpos, _ := db.Pos(); dpos.TXID < replica.Pos().TXID {
log.Printf("warning: position mismatch") // Don't just log!
// Continuing here is dangerous
}
}
```
### DON'T: Ignore errors and continue in loops
```go
// WRONG - Continuing after error can corrupt state
func (db *DB) processFiles() {
for _, file := range files {
if err := processFile(file); err != nil {
log.Printf("error: %v", err) // Just logging!
// Continuing to next file is dangerous
}
}
}
```
### DO: Return errors properly in loops
```go
// CORRECT - Let caller decide how to handle errors
func (db *DB) processFiles() error {
for _, file := range files {
if err := processFile(file); err != nil {
return fmt.Errorf("process file %s: %w", file, err)
}
}
return nil
}
```
## Locking Patterns
### DO: Use proper lock types
```go
// CORRECT - Use Lock() for writes
r.mu.Lock()
defer r.mu.Unlock()
r.pos = pos
```
### DON'T: Use RLock for write operations
```go
// WRONG - Race condition
r.mu.RLock() // Should be Lock() for writes
defer r.mu.RUnlock()
r.pos = pos // Writing with RLock!
```
## Compaction and Eventual Consistency
Many storage backends (S3, R2, etc.) are eventually consistent:
- A file you just wrote might not be immediately readable
- A file might be listed but only partially available
- Reads might return stale or incomplete data
### DO: Read from local when available
```go
// CORRECT - Check local first during compaction
// db.go:1280-1294 - ALWAYS read from local disk when available
f, err := os.Open(db.LTXPath(info.Level, info.MinTXID, info.MaxTXID))
if err == nil {
// Use local file - it's complete and consistent
return f, nil
}
// Only fall back to remote if local doesn't exist
return replica.Client.OpenLTXFile(...)
```
### DON'T: Read from remote during compaction
```go
// WRONG - Can get partial/corrupt data from eventually consistent storage
f, err := client.OpenLTXFile(ctx, level, minTXID, maxTXID, 0, 0)
```
## Timestamp Preservation
During compaction, preserve the earliest CreatedAt timestamp from source files to maintain temporal granularity for point-in-time restoration.
### DO: Preserve earliest timestamp
```go
// CORRECT - Preserve temporal information
info, err := replica.Client.WriteLTXFile(ctx, level, minTXID, maxTXID, r)
if err != nil {
return fmt.Errorf("write ltx: %w", err)
}
info.CreatedAt = oldestSourceFile.CreatedAt
```
### DON'T: Ignore CreatedAt preservation
```go
// WRONG - Loses timestamp granularity for point-in-time restores
info := &ltx.FileInfo{
CreatedAt: time.Now(), // Don't use current time during compaction
}
```
## Common Pitfalls
### 1. Mixing architectural concerns
```go
// WRONG - Database state logic in Replica layer
func (r *Replica) Start() error {
if db.needsRestore() { // Wrong layer for DB state!
r.restoreDatabase() // Replica shouldn't manage DB state!
}
return r.sync()
}
```
### 2. Recreating existing functionality
```go
// WRONG - Don't reimplement what already exists
func customSnapshotTrigger() {
// Complex custom logic to trigger snapshots
// when db.verify() already does this!
}
```
### DO: Leverage existing mechanisms
```go
// CORRECT - Use what's already there
func triggerSnapshot() error {
return db.verify() // Already handles snapshot logic correctly
}
```
### 3. Skipping the lock page
The lock page at 1GB (0x40000000) must always be skipped:
```go
// db.go:951-953 - Must skip lock page during replication
lockPgno := ltx.LockPgno(pageSize)
if pgno == lockPgno {
continue // Skip this page - it's reserved by SQLite
}
```
Lock page numbers by page size:
| Page Size | Lock Page Number |
|-----------|------------------|
| 4KB | 262145 |
| 8KB | 131073 |
| 16KB | 65537 |
| 32KB | 32769 |
## Component Reference
### DB Component (db.go)
**Responsibilities:**
- Manages SQLite database connection (via `modernc.org/sqlite` - no CGO)
- Monitors WAL for changes
- Performs checkpoints
- Maintains long-running read transaction
- Converts WAL pages to LTX format
**Key Fields:**
```go
type DB struct {
path string // Database file path
db *sql.DB // SQLite connection
rtx *sql.Tx // Long-running read transaction
pageSize int // Database page size (critical for lock page)
notify chan struct{} // Notifies on WAL changes
}
```
**Initialization Sequence:**
1. Open database connection
2. Read page size from database
3. Initialize long-running read transaction
4. Start monitor goroutine
5. Initialize replicas
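The standalone sketch below mirrors steps 1–3 using `database/sql` and the `modernc.org/sqlite` driver. It is not the actual `db.go` code; the WAL monitor and replica startup (steps 4–5) are elided, and it only illustrates the ordering.
```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite"
)

func main() {
	// 1. Open the database connection (pure Go driver, no CGO).
	db, err := sql.Open("sqlite", "test.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 2. Read the page size; it determines the lock page number.
	var pageSize int
	if err := db.QueryRow(`PRAGMA page_size`).Scan(&pageSize); err != nil {
		log.Fatal(err)
	}

	// 3. Begin a long-lived read transaction so reads stay consistent
	//    while pages are replicated.
	rtx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer rtx.Rollback()

	fmt.Println("page size:", pageSize)
	// 4–5. The real DB then starts its WAL monitor goroutine and replica sync.
}
```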
### Replica Component (replica.go)
**Responsibilities:**
- Manages replication to a single destination (one replica per DB)
- Tracks replication position (ltx.Pos)
- Handles sync intervals
- Manages encryption (if configured)
**Key Operations:**
- `Sync()`: Synchronizes pending changes
- `SetPos()`: Updates replication position (must use Lock, not RLock!)
- `Snapshot()`: Creates full database snapshot
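A minimal sketch of the read/write lock pairing around the replication position. The type and method names are illustrative, not the exact `replica.go` API; only the locking discipline matters.
```go
package example

import (
	"sync"

	"github.com/superfly/ltx"
)

// trackedPos illustrates guarding the replication position with an RWMutex.
type trackedPos struct {
	mu  sync.RWMutex
	pos ltx.Pos
}

// Pos returns the current position; a read lock is sufficient.
func (p *trackedPos) Pos() ltx.Pos {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.pos
}

// SetPos updates the position; writes require the exclusive lock, never RLock.
func (p *trackedPos) SetPos(pos ltx.Pos) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pos = pos
}
```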
### ReplicaClient Interface (replica_client.go)
**Required Methods:**
```go
type ReplicaClient interface {
Type() string // Client type identifier
// File operations
LTXFiles(ctx context.Context, level int, seek ltx.TXID, useMetadata bool) (ltx.FileIterator, error)
OpenLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, offset, size int64) (io.ReadCloser, error)
WriteLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, r io.Reader) (*ltx.FileInfo, error)
DeleteLTXFiles(ctx context.Context, files []*ltx.FileInfo) error
DeleteAll(ctx context.Context) error
}
```
**useMetadata Parameter:**
- `useMetadata=true`: Fetch accurate timestamps from backend metadata (required for point-in-time restores)
- `useMetadata=false`: Use fast timestamps for normal operations
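A hedged usage sketch: listing level-0 files with `useMetadata=true` so `CreatedAt` is accurate enough for point-in-time restores. The iterator methods (`Next`, `Item`, `Err`, `Close`) are assumed from the `ltx` package's `FileIterator`, and the function name is illustrative.
```go
import (
	"context"
	"fmt"

	"github.com/benbjohnson/litestream"
)

func listLevel0(ctx context.Context, client litestream.ReplicaClient) error {
	itr, err := client.LTXFiles(ctx, 0, 0, true) // useMetadata=true
	if err != nil {
		return err
	}
	defer itr.Close()

	for itr.Next() {
		info := itr.Item()
		fmt.Printf("%v-%v created at %v\n", info.MinTXID, info.MaxTXID, info.CreatedAt)
	}
	return itr.Err()
}
```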
### Store Component (store.go)
**Default Compaction Levels:**
```go
var defaultLevels = CompactionLevels{
	{Level: 0, Interval: 0},                // Raw LTX files (no compaction)
	{Level: 1, Interval: 30 * time.Second},
	{Level: 2, Interval: 5 * time.Minute},
	{Level: 3, Interval: time.Hour},
	// Snapshots created daily (24h retention)
}
```
## Testing Patterns
### Race Condition Testing
```bash
# Always run with race detector
go test -race -v ./...
# Specific race-prone areas
go test -race -v -run TestReplica_Sync ./...
go test -race -v -run TestDB_Sync ./...
go test -race -v -run TestStore_CompactDB ./...
```
### Lock Page Testing
```bash
# Test with various page sizes
./bin/litestream-test populate -db test.db -page-size 4096 -target-size 2GB
./bin/litestream-test populate -db test.db -page-size 8192 -target-size 2GB
# Validate lock page handling
./bin/litestream-test validate -source-db test.db -replica-url file:///tmp/replica
```
### Integration Testing
```bash
# Test specific backend
go test -v ./replica_client_test.go -integration s3
go test -v ./replica_client_test.go -integration gcs
go test -v ./replica_client_test.go -integration abs
go test -v ./replica_client_test.go -integration oss
go test -v ./replica_client_test.go -integration sftp
```

llms.txt
@@ -1,83 +1,63 @@
# Litestream
Disaster recovery tool for SQLite that runs as a background process and safely replicates changes incrementally to S3, GCS, Azure Blob Storage, SFTP, or another file system.
Disaster recovery tool for SQLite. Replicates WAL changes to S3, GCS, Azure, SFTP, or local filesystem.
## Core Documentation
## Quick Start for AI Contributors
- [AGENTS.md](AGENTS.md): AI agent instructions, architectural patterns, and anti-patterns
- [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md): Critical SQLite knowledge including WAL format and 1GB lock page
- [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md): LTX (Log Transaction) format specification for replication
- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md): Deep technical details of Litestream components
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - PR quality requirements
2. Read [AGENTS.md](AGENTS.md) - Project overview and checklist
3. Check [CONTRIBUTING.md](CONTRIBUTING.md) - What we accept
4. Show investigation evidence in PRs
## Implementation Guides
## PR Checklist
- [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md): Guide for implementing storage backends
- [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md): Comprehensive testing strategies including >1GB database tests
- [ ] Evidence of problem (logs, file patterns)
- [ ] Clear scope (what PR does/doesn't do)
- [ ] Runnable test commands
- [ ] Race detector tested (`go test -race`)
## Core Components
## Documentation
- [db.go](db.go): Database monitoring, WAL reading, checkpoint management
- [replica.go](replica.go): Replication management, position tracking, synchronization
- [store.go](store.go): Multi-database coordination, compaction scheduling
- [replica_client.go](replica_client.go): Interface definition for storage backends
| Document | Purpose |
|----------|---------|
| [AGENTS.md](AGENTS.md) | Project overview, critical rules |
| [AI_PR_GUIDE.md](AI_PR_GUIDE.md) | PR templates, common mistakes |
| [docs/PATTERNS.md](docs/PATTERNS.md) | Code patterns and anti-patterns |
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Component details |
| [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) | WAL format, 1GB lock page |
| [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md) | Replication format |
| [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) | Test strategies |
| [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md) | Storage backends |
## Core Files
| File | Purpose |
|------|---------|
| `db.go` | Database monitoring, WAL, checkpoints |
| `replica.go` | Replication management |
| `store.go` | Multi-database coordination |
| `replica_client.go` | Storage backend interface |
## Storage Backends
- [s3/replica_client.go](s3/replica_client.go): AWS S3 and compatible storage implementation
- [gs/replica_client.go](gs/replica_client.go): Google Cloud Storage implementation
- [abs/replica_client.go](abs/replica_client.go): Azure Blob Storage implementation
- [sftp/replica_client.go](sftp/replica_client.go): SFTP implementation
- [file/replica_client.go](file/replica_client.go): Local file system implementation
- [nats/replica_client.go](nats/replica_client.go): NATS JetStream implementation
- `s3/replica_client.go` - AWS S3
- `gs/replica_client.go` - Google Cloud Storage
- `abs/replica_client.go` - Azure Blob Storage
- `sftp/replica_client.go` - SFTP
- `file/replica_client.go` - Local filesystem
- `nats/replica_client.go` - NATS JetStream
## Critical Concepts
### SQLite Lock Page
The lock page at exactly 1GB (0x40000000) must always be skipped during replication. Page number varies by page size: 262145 for 4KB pages, 131073 for 8KB pages.
- **Lock page at 1GB** - Always skip page at 0x40000000
- **LTX files are immutable** - Never modify after creation
- **Single replica per DB** - One destination per database
- **Layer boundaries** - DB handles state, Replica handles replication
### LTX Format
Immutable, append-only files containing database changes. Files are named by transaction ID ranges (e.g., 0000000001-0000000064.ltx).
## Build
### Compaction Levels
- Level 0: Raw LTX files (no compaction)
- Level 1: 30-second windows
- Level 2: 5-minute windows
- Level 3: 1-hour windows
- Snapshots: Daily full database state
### Architectural Boundaries
- **DB Layer (db.go)**: Handles database state, restoration logic, monitoring
- **Replica Layer (replica.go)**: Focuses solely on replication concerns
- **Storage Layer**: Implements ReplicaClient interface for various backends
## Key Patterns
### Atomic File Operations
Always write to temporary file then rename for atomicity.
### Error Handling
Return errors immediately, don't log and continue.
### Eventual Consistency
Always prefer local files during compaction to handle eventually consistent storage.
### Locking
Use Lock() for writes, RLock() for reads. Never use RLock() when modifying state.
## Testing Requirements
- Test with databases >1GB to verify lock page handling
- Run with race detector enabled (-race flag)
- Test with various page sizes (4KB, 8KB, 16KB, 32KB)
- Verify eventual consistency handling with storage backends
## Configuration
Primary configuration via YAML file (etc/litestream.yml) or environment variables. Each database replicates to exactly one remote destination.
## Build Requirements
- Go 1.24+
- No CGO required for main binary (uses modernc.org/sqlite)
- CGO required only for VFS functionality (build with -tags vfs)
- Always build binaries into bin/ directory (gitignored)
```bash
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
pre-commit run --all-files
```