docs: restructure LLM documentation for better PR quality (#961)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Author: Cory LaNou (committed by GitHub)
Date: 2026-01-06 13:34:01 -06:00
Parent: fc1c254922
Commit: b946989553
7 changed files with 766 additions and 1262 deletions

AGENTS.md

@@ -1,942 +1,63 @@
# AGENT.md - Litestream AI Agent Documentation
# AGENTS.md - Litestream AI Agent Guide
This document provides comprehensive guidance for AI agents working with the Litestream codebase. Read this document carefully before making any modifications.
Litestream is a disaster recovery tool for SQLite that runs as a background process, monitors the WAL, converts changes to immutable LTX files, and replicates them to cloud storage. It uses `modernc.org/sqlite` (pure Go, no CGO required).
## Table of Contents
## Before You Start
- [Overview](#overview)
- [Fundamental Concepts](#fundamental-concepts)
- [Core Architecture](#core-architecture)
- [Critical Concepts](#critical-concepts)
- [Architectural Boundaries and Patterns](#architectural-boundaries-and-patterns)
- [Common Pitfalls](#common-pitfalls)
- [Component Guide](#component-guide)
- [Performance Considerations](#performance-considerations)
- [Testing Requirements](#testing-requirements)
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) for contribution requirements
2. Check [CONTRIBUTING.md](CONTRIBUTING.md) for what we accept (bug fixes welcome, features need discussion)
3. Review recent PRs for current patterns
## Overview
## Critical Rules
Litestream is a **disaster recovery tool for SQLite** that runs as a background process and safely replicates changes incrementally to various storage backends. It monitors SQLite's Write-Ahead Log (WAL), converts changes to an immutable LTX format, and replicates these to configured destinations.
- **Lock page at 1GB**: SQLite reserves page at 0x40000000. Always skip it. See [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md)
- **LTX files are immutable**: Never modify after creation. See [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md)
- **Single replica per database**: Each DB replicates to exactly one destination
- **Use `litestream ltx`**: Not `litestream wal` (deprecated)
**Current Architecture Highlights:**
- **LTX Format**: Page-level replication format replaces direct WAL mirroring
- **Multi-level Compaction**: Hierarchical compaction keeps storage efficient (30s → 5m → 1h → snapshots)
- **Single Replica Constraint**: Each database is replicated to exactly one remote destination
- **Pure Go Build**: Uses `modernc.org/sqlite`, so no CGO dependency for the main binary
- **Optional NATS JetStream Support**: Additional replica backend alongside S3/GCS/ABS/OSS/File/SFTP
- **Snapshot Compatibility**: Only LTX-based backups are supported—keep legacy v0.3.x binaries to restore old WAL snapshots
## Layer Boundaries
**Key Design Principles:**
- **Non-invasive**: Uses only SQLite API, never directly manipulates database files
- **Incremental**: Replicates only changes, not full databases
- **Single-destination**: Exactly one replica destination per database
- **Eventually Consistent**: Handles storage backends with eventual consistency
- **Safe**: Maintains long-running read transactions for consistency
| Layer | File | Responsibility |
|-------|------|----------------|
| DB | `db.go` | Database state, restoration, WAL monitoring |
| Replica | `replica.go` | Replication mechanics only |
| Storage | `**/replica_client.go` | Backend implementations |
## Fundamental Concepts
**CRITICAL**: Understanding SQLite internals and the LTX format is essential for working with Litestream.
### Required Reading
1. **[SQLite Internals](docs/SQLITE_INTERNALS.md)** - Understand WAL, pages, transactions, and the 1GB lock page
2. **[LTX Format](docs/LTX_FORMAT.md)** - Learn the custom replication format Litestream uses
### Key SQLite Concepts
- **WAL (Write-Ahead Log)**: Temporary file containing uncommitted changes
- **Pages**: Fixed-size blocks (typically 4KB) that make up the database
- **Lock Page at 1GB**: Special page at 0x40000000 that MUST be skipped
- **Checkpoints**: Process of merging WAL back into main database
- **Transaction Isolation**: Long-running read transaction for consistency
### Key LTX Concepts
- **Immutable Files**: Once written, LTX files are never modified
- **TXID Ranges**: Each file covers a range of transaction IDs
- **Page Index**: Binary search tree for efficient page lookup
- **Compaction Levels**: Time-based merging to reduce storage (30s → 5min → 1hr)
- **Checksums**: CRC-64 integrity verification at multiple levels
- **CLI Command**: Use `litestream ltx` (not `wal`) for LTX operations
### The Replication Flow
```mermaid
graph LR
App[Application] -->|SQL| SQLite
SQLite -->|Writes| WAL[WAL File]
WAL -->|Monitor| Litestream
Litestream -->|Convert| LTX[LTX Format]
LTX -->|Upload| Storage[Cloud Storage]
Storage -->|Restore| Database[New Database]
```
## Core Architecture
```mermaid
graph TB
subgraph "SQLite Layer"
SQLite[SQLite Database]
WAL[WAL File]
SQLite -->|Writes| WAL
end
subgraph "Litestream Core"
DB[DB Component<br/>db.go]
Replica[Replica Manager<br/>replica.go]
Store[Store<br/>store.go]
DB -->|Manages| Replica
Store -->|Coordinates| DB
end
subgraph "Storage Layer"
RC[ReplicaClient Interface<br/>replica_client.go]
S3[S3 Client]
GCS[GCS Client]
File[File Client]
SFTP[SFTP Client]
Replica -->|Uses| RC
RC -->|Implements| S3
RC -->|Implements| GCS
RC -->|Implements| File
RC -->|Implements| SFTP
end
WAL -->|Monitor Changes| DB
DB -->|Checkpoint| SQLite
```
### Data Flow Sequence
```mermaid
sequenceDiagram
participant App
participant SQLite
participant WAL
participant DB
participant Replica
participant Storage
App->>SQLite: Write Transaction
SQLite->>WAL: Append Changes
loop Monitor (1s interval)
DB->>WAL: Check Size/Changes
WAL-->>DB: Current State
alt WAL Has Changes
DB->>WAL: Read Pages
DB->>DB: Convert to LTX Format
DB->>Replica: Queue LTX File
loop Sync (configurable interval)
Replica->>Storage: WriteLTXFile()
Storage-->>Replica: FileInfo
Replica->>Replica: Update Position
end
end
end
alt Checkpoint Needed
DB->>SQLite: PRAGMA wal_checkpoint
SQLite->>WAL: Merge to Main DB
end
```
## Critical Concepts
### 1. SQLite Lock Page at 1GB Boundary ⚠️
**CRITICAL**: SQLite reserves a special lock page at exactly 1GB (0x40000000 bytes).
```go
// db.go:951-953 - Must skip lock page during replication
lockPgno := ltx.LockPgno(pageSize) // Page number varies by page size
if pgno == lockPgno {
continue // Skip this page - it's reserved by SQLite
}
```
**Lock Page Numbers by Page Size:**
- 4KB pages: 262145 (most common)
- 8KB pages: 131073
- 16KB pages: 65537
- 32KB pages: 32769
**Testing Requirement**: Any changes affecting page iteration MUST be tested with >1GB databases.
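As a sanity check, the table above follows from the formula `LockPgno = (0x40000000 / pageSize) + 1`; production code should always call `ltx.LockPgno(pageSize)` rather than computing it inline. A minimal standalone sketch:

```go
package main

import "fmt"

// lockPgno reproduces the lock-page numbers listed above.
// Real code should use ltx.LockPgno(pageSize) from the LTX library instead.
func lockPgno(pageSize int64) int64 {
	return 0x40000000/pageSize + 1 // the page containing byte offset 1GB
}

func main() {
	for _, ps := range []int64{4096, 8192, 16384, 32768} {
		fmt.Printf("page size %5d => lock page %d\n", ps, lockPgno(ps))
	}
}
```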
### 2. LTX File Format
LTX (Log Transaction) files are **immutable**, append-only files containing:
- Header with transaction IDs (MinTXID, MaxTXID)
- Page data with checksums
- Page index for efficient seeking
- Trailer with metadata
**Important**: LTX files are NOT SQLite WAL files - they're a custom format for efficient replication.
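To make the structure above concrete, here is a rough logical sketch. The field names are illustrative assumptions, not the actual types from the LTX library; see docs/LTX_FORMAT.md for the real specification.

```go
// Illustrative only: a logical view of an LTX file, not the on-disk encoding.
type ltxFileSketch struct {
	Header struct {
		MinTXID, MaxTXID uint64 // transaction ID range covered by this file
		PageSize         uint32 // database page size
	}
	Pages []struct {
		Pgno     uint32 // database page number
		Data     []byte // page contents
		Checksum uint64 // per-page integrity check
	}
	PageIndex []byte // lookup structure for efficient page seeking
	Trailer   struct {
		Checksum uint64 // whole-file integrity check (CRC-64)
	}
}
```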
### 3. Compaction Process
Compaction merges multiple LTX files to reduce storage overhead:
```mermaid
flowchart LR
subgraph "Level 0 (Raw)"
L0A[0000000001-0000000100.ltx]
L0B[0000000101-0000000200.ltx]
L0C[0000000201-0000000300.ltx]
end
subgraph "Level 1 (30 seconds)"
L1[0000000001-0000000300.ltx]
end
subgraph "Level 2 (5 minutes)"
L2[0000000001-0000001000.ltx]
end
subgraph "Level 3 (1 hour)"
L3[0000000001-0000002000.ltx]
end
subgraph "Snapshot (24h)"
Snap[snapshot.ltx]
end
L0A -->|Merge| L1
L0B -->|Merge| L1
L0C -->|Merge| L1
L1 -->|30s window| L2
L2 -->|5min window| L3
L3 -->|Hourly| Snap
```
**Critical Compaction Rule**: When compacting with eventually consistent storage:
```go
// db.go:1280-1294 - ALWAYS read from local disk when available
f, err := os.Open(db.LTXPath(info.Level, info.MinTXID, info.MaxTXID))
if err == nil {
// Use local file - it's complete and consistent
return f, nil
}
// Only fall back to remote if local doesn't exist
return replica.Client.OpenLTXFile(...)
```
### 4. Eventual Consistency Handling
Many storage backends (S3, R2, etc.) are eventually consistent. This means:
- A file you just wrote might not be immediately readable
- A file might be listed but only partially available
- Reads might return stale or incomplete data
**Solution**: Always prefer local files during compaction.
## Architectural Boundaries and Patterns
**CRITICAL**: Understanding proper architectural boundaries is essential for successful contributions.
### Layer Responsibilities
```mermaid
graph TB
subgraph "DB Layer (db.go)"
DBInit[DB.init&#40;&#41;]
DBPos[DB position tracking]
DBRestore[Database state validation]
DBSnapshot[Snapshot triggering via verify&#40;&#41;]
end
subgraph "Replica Layer (replica.go)"
ReplicaStart[Replica.Start&#40;&#41;]
ReplicaSync[Sync operations]
ReplicaPos[Replica position tracking]
ReplicaClient[Storage interaction]
end
subgraph "Storage Layer"
S3[S3/GCS/Azure]
LTXFiles[LTX Files]
end
DBInit -->|Initialize| ReplicaStart
DBInit -->|Check positions| DBPos
DBInit -->|Validate state| DBRestore
ReplicaStart -->|Focus on replication only| ReplicaSync
ReplicaSync -->|Upload/Download| ReplicaClient
ReplicaClient -->|Read/Write| S3
S3 -->|Store| LTXFiles
```
### ✅ DO: Handle database state in DB layer
**Principle**: Database restoration logic belongs in the DB layer, not the Replica layer.
**Pattern**: When the database is behind the replica (local TXID < remote TXID):
1. **Clear local L0 cache**: Remove the entire L0 directory and recreate it
- Use `os.RemoveAll()` on the L0 directory path
- Recreate with proper permissions using `internal.MkdirAll()`
2. **Fetch latest L0 file from replica**: Download the most recent L0 LTX file
- Call `replica.Client.OpenLTXFile()` with the remote min/max TXID
- Stream the file contents (don't load into memory)
3. **Write using atomic file operations**: Prevent partial/corrupted files
- Write to temporary file with `.tmp` suffix
- Call `Sync()` to ensure data is on disk
- Atomically rename temp file to final path
**Why this matters**: If the database state is not synchronized before replication starts, the system will attempt to apply WAL segments that are ahead of the database's current position, leading to restore failures.
**Reference Implementation**: See `DB.checkDatabaseBehindReplica()` in db.go:670-737
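A condensed sketch of those three steps follows. The L0 directory derivation, argument values, and error handling are simplified assumptions; the authoritative version is `DB.checkDatabaseBehindReplica()`.

```go
// Simplified sketch only; see DB.checkDatabaseBehindReplica() in db.go for the
// real implementation. The L0 directory is derived from LTXPath here for brevity.
func (db *DB) syncL0FromReplica(ctx context.Context, r *Replica, minTXID, maxTXID ltx.TXID) error {
	// 1. Clear the local L0 cache and recreate the directory.
	l0Dir := filepath.Dir(db.LTXPath(0, minTXID, maxTXID))
	if err := os.RemoveAll(l0Dir); err != nil {
		return fmt.Errorf("clear l0 dir: %w", err)
	}
	if err := os.MkdirAll(l0Dir, 0o755); err != nil { // real code uses internal.MkdirAll()
		return fmt.Errorf("recreate l0 dir: %w", err)
	}

	// 2. Stream the latest L0 LTX file from the replica (never buffer it in memory).
	rc, err := r.Client.OpenLTXFile(ctx, 0, minTXID, maxTXID, 0, 0)
	if err != nil {
		return fmt.Errorf("open remote ltx: %w", err)
	}
	defer rc.Close()

	// 3. Write atomically: temp file, sync, then rename.
	dst := db.LTXPath(0, minTXID, maxTXID)
	tmp := dst + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return fmt.Errorf("create temp: %w", err)
	}
	defer os.Remove(tmp) // no-op after a successful rename
	if _, err := io.Copy(f, rc); err != nil {
		f.Close()
		return fmt.Errorf("copy ltx: %w", err)
	}
	if err := f.Sync(); err != nil {
		f.Close()
		return fmt.Errorf("sync ltx: %w", err)
	}
	if err := f.Close(); err != nil {
		return fmt.Errorf("close ltx: %w", err)
	}
	return os.Rename(tmp, dst)
}
```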
### ❌ DON'T: Put database state logic in Replica layer
```go
// WRONG - Replica should only handle replication concerns
func (r *Replica) Start() error {
// DON'T check database state here
if needsRestore() { // ❌ Wrong layer!
restoreDatabase() // ❌ Wrong layer!
}
// Replica should focus only on replication mechanics
}
```
### Atomic File Operations Pattern
**CRITICAL**: Always use atomic writes to prevent partial/corrupted files.
### ✅ DO: Write to temp file, then rename
```go
// CORRECT - Atomic file write pattern
func writeFileAtomic(path string, data []byte) error {
// Create temp file in same directory (for atomic rename)
dir := filepath.Dir(path)
tmpFile, err := os.CreateTemp(dir, ".tmp-*")
if err != nil {
return fmt.Errorf("create temp file: %w", err)
}
tmpPath := tmpFile.Name()
// Clean up temp file on error
defer func() {
if tmpFile != nil {
tmpFile.Close()
os.Remove(tmpPath)
}
}()
// Write data to temp file
if _, err := tmpFile.Write(data); err != nil {
return fmt.Errorf("write temp file: %w", err)
}
// Sync to ensure data is on disk
if err := tmpFile.Sync(); err != nil {
return fmt.Errorf("sync temp file: %w", err)
}
// Close before rename
if err := tmpFile.Close(); err != nil {
return fmt.Errorf("close temp file: %w", err)
}
tmpFile = nil // Prevent defer cleanup
// Atomic rename (on same filesystem)
if err := os.Rename(tmpPath, path); err != nil {
os.Remove(tmpPath)
return fmt.Errorf("rename to final path: %w", err)
}
return nil
}
```
### ❌ DON'T: Write directly to final location
```go
// WRONG - Can leave partial files on failure
func writeFileDirect(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // ❌ Not atomic!
}
```
### Error Handling Patterns
### ✅ DO: Return errors immediately
```go
// CORRECT - Return error for caller to handle
func (db *DB) validatePosition() error {
dpos, err := db.Pos()
if err != nil {
return err
}
rpos := replica.Pos()
if dpos.TXID < rpos.TXID {
return fmt.Errorf("database position (%v) behind replica (%v)", dpos, rpos)
}
return nil
}
```
### ❌ DON'T: Continue on critical errors
```go
// WRONG - Silently continuing can cause data corruption
func (db *DB) validatePosition() {
if dpos, _ := db.Pos(); dpos.TXID < replica.Pos().TXID {
log.Printf("warning: position mismatch") // ❌ Don't just log!
// Continuing here is dangerous
}
}
```
### Leveraging Existing Mechanisms
### ✅ DO: Use verify() for snapshot triggering
```go
// CORRECT - Leverage existing snapshot mechanism
func (db *DB) ensureSnapshot() error {
// Use existing verify() which already handles snapshot logic
if err := db.verify(); err != nil {
return fmt.Errorf("verify for snapshot: %w", err)
}
// verify() will trigger snapshot if needed
return nil
}
```
### ❌ DON'T: Reimplement existing functionality
```go
// WRONG - Don't recreate what already exists
func (db *DB) customSnapshot() error {
// ❌ Don't write custom snapshot logic
// when verify() already does this correctly
}
```
## Common Pitfalls
### ❌ DON'T: Mix architectural concerns
```go
// WRONG - Database state logic in Replica layer
func (r *Replica) Start() error {
if db.needsRestore() { // ❌ Wrong layer for DB state!
r.restoreDatabase() // ❌ Replica shouldn't manage DB state!
}
return r.sync()
}
```
### ✅ DO: Keep concerns in proper layers
```go
// CORRECT - Each layer handles its own concerns
func (db *DB) init() error {
// DB layer handles database state
if db.needsRestore() {
if err := db.restore(); err != nil {
return err
}
}
// Then start replica for replication only
return db.replica.Start()
}
func (r *Replica) Start() error {
// Replica focuses only on replication
return r.startSync()
}
```
### ❌ DON'T: Read from remote during compaction
```go
// WRONG - Can get partial/corrupt data
f, err := client.OpenLTXFile(ctx, level, minTXID, maxTXID, 0, 0)
```
### ✅ DO: Read from local when available
```go
// CORRECT - Check local first
if f, err := os.Open(localPath); err == nil {
defer f.Close()
// Use local file
} else {
// Fall back to remote only if necessary
}
```
### ❌ DON'T: Use RLock for write operations
```go
// WRONG - Race condition in replica.go:217
r.mu.RLock() // Should be Lock() for writes
defer r.mu.RUnlock()
r.pos = pos // Writing with RLock!
```
### ✅ DO: Use proper lock types
```go
// CORRECT
r.mu.Lock()
defer r.mu.Unlock()
r.pos = pos
```
### ❌ DON'T: Ignore CreatedAt preservation
```go
// WRONG - Loses timestamp granularity
info := &ltx.FileInfo{
CreatedAt: time.Now(), // Don't use current time
}
```
### ✅ DO: Preserve earliest timestamp
```go
// CORRECT - Preserve temporal information
info, err := replica.Client.WriteLTXFile(ctx, level, minTXID, maxTXID, r)
if err != nil {
return fmt.Errorf("write ltx: %w", err)
}
info.CreatedAt = oldestSourceFile.CreatedAt
```
### ❌ DON'T: Write files without atomic operations
```go
// WRONG - Can leave partial files on failure
func saveLTXFile(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // ❌ Not atomic!
}
```
### ✅ DO: Use atomic write pattern
```go
// CORRECT - Write to temp, then rename
func saveLTXFileAtomic(path string, data []byte) error {
tmpPath := path + ".tmp"
if err := os.WriteFile(tmpPath, data, 0644); err != nil {
return err
}
return os.Rename(tmpPath, path) // Atomic on same filesystem
}
```
### ❌ DON'T: Ignore errors and continue
```go
// WRONG - Continuing after error can corrupt state
func (db *DB) processFiles() {
for _, file := range files {
if err := processFile(file); err != nil {
log.Printf("error: %v", err) // ❌ Just logging!
// Continuing to next file is dangerous
}
}
}
```
### ✅ DO: Return errors for proper handling
```go
// CORRECT - Let caller decide how to handle errors
func (db *DB) processFiles() error {
for _, file := range files {
if err := processFile(file); err != nil {
return fmt.Errorf("process file %s: %w", file, err)
}
}
return nil
}
```
### ❌ DON'T: Recreate existing functionality
```go
// WRONG - Don't reimplement what already exists
func customSnapshotTrigger() {
// Complex custom logic to trigger snapshots
// when db.verify() already does this!
}
```
### ✅ DO: Leverage existing mechanisms
```go
// CORRECT - Use what's already there
func triggerSnapshot() error {
return db.verify() // Already handles snapshot logic correctly
}
```
## Component Guide
### DB Component (db.go)
**Responsibilities:**
- Manages SQLite database connection (via `modernc.org/sqlite` - no CGO)
- Monitors WAL for changes
- Performs checkpoints
- Maintains long-running read transaction
- Converts WAL pages to LTX format
**Key Fields:**
```go
type DB struct {
path string // Database file path
db *sql.DB // SQLite connection
rtx *sql.Tx // Long-running read transaction
pageSize int // Database page size (critical for lock page)
notify chan struct{} // Notifies on WAL changes
}
```
**Initialization Sequence:**
1. Open database connection
2. Read page size from database
3. Initialize long-running read transaction
4. Start monitor goroutine
5. Initialize replicas
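A hedged sketch of that ordering is shown below; the `monitor` loop name and the exact PRAGMA handling are assumptions for illustration, and the real logic lives in the DB open/init path.

```go
// Illustrative ordering only, matching the five steps above.
func (db *DB) initSketch(ctx context.Context) (err error) {
	// 1. Open the database connection (modernc.org/sqlite registers the "sqlite" driver).
	if db.db, err = sql.Open("sqlite", db.path); err != nil {
		return err
	}
	// 2. Read the page size; it determines the 1GB lock page number.
	if err := db.db.QueryRowContext(ctx, `PRAGMA page_size`).Scan(&db.pageSize); err != nil {
		return err
	}
	// 3. Hold a long-running read transaction for a consistent view.
	if db.rtx, err = db.db.Begin(); err != nil {
		return err
	}
	// 4. Start the WAL monitor goroutine (assumed name).
	go db.monitor(ctx)
	// 5. Start the replica so replication can begin.
	return db.replica.Start()
}
```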
### Replica Component (replica.go)
**Responsibilities:**
- Manages replication to a single destination (one replica per DB)
- Tracks replication position (ltx.Pos)
- Handles sync intervals
- Manages encryption (if configured)
**Key Operations:**
- `Sync()`: Synchronizes pending changes
- `SetPos()`: Updates replication position (must use Lock, not RLock!)
- `Snapshot()`: Creates full database snapshot
### ReplicaClient Interface (replica_client.go)
**Required Methods:**
```go
type ReplicaClient interface {
Type() string // Client type identifier
// File operations
LTXFiles(ctx context.Context, level int, seek ltx.TXID, useMetadata bool) (ltx.FileIterator, error)
OpenLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, offset, size int64) (io.ReadCloser, error)
WriteLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, r io.Reader) (*ltx.FileInfo, error)
DeleteLTXFiles(ctx context.Context, files []*ltx.FileInfo) error
DeleteAll(ctx context.Context) error
}
```
**LTXFiles useMetadata Parameter:**
- **`useMetadata=true`**: Fetch accurate timestamps from backend metadata (required for point-in-time restores)
- Slower but provides correct CreatedAt timestamps
- Use when restoring to specific timestamp
- **`useMetadata=false`**: Use fast timestamps (LastModified/ModTime) for normal operations
- Faster enumeration, suitable for synchronization
- Use during replication monitoring
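For example, a small sketch of choosing the flag; the helper and its arguments are illustrative, while the interface signature is the one shown above.

```go
// listL0Files is an illustrative helper; a seek of 0 means "start from the beginning".
func listL0Files(ctx context.Context, client ReplicaClient, forPointInTimeRestore bool) (ltx.FileIterator, error) {
	if forPointInTimeRestore {
		// Accurate CreatedAt timestamps are required, so pay the cost of a metadata fetch.
		return client.LTXFiles(ctx, 0, 0, true)
	}
	// Normal sync/monitoring: fast LastModified/ModTime timestamps are sufficient.
	return client.LTXFiles(ctx, 0, 0, false)
}
```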
**Implementation Requirements:**
- Handle partial reads gracefully
- Implement proper error types (os.ErrNotExist)
- Support seek/offset for efficient page fetching
- Preserve file timestamps when `useMetadata=true`
### Store Component (store.go)
**Responsibilities:**
- Coordinates multiple databases
- Manages compaction schedules
- Controls resource usage
- Handles retention policies
**Default Compaction Levels:**
```go
var defaultLevels = CompactionLevels{
{Level: 0, Interval: 0}, // Raw LTX files (no compaction)
{Level: 1, Interval: 30 * time.Second},
{Level: 2, Interval: 5 * time.Minute},
{Level: 3, Interval: 1 * time.Hour},
// Snapshots created daily (24h retention)
}
```
## Performance Considerations
### O(n) Operations to Watch
1. **Page Iteration**: Linear scan through all pages
- Cache page index when possible
- Use binary search on sorted page lists (see the sketch after this list)
2. **File Listing**: Directory scans can be expensive
- Cache file listings when unchanged
- Use seek parameter to skip old files
3. **Compaction**: Reads all input files
- Limit concurrent compactions
- Use appropriate level intervals
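For the binary-search advice in item 1, a small sketch using the standard library; the `pageIndexEntry` type and its fields are assumptions about how a sorted index might be held in memory.

```go
// pageIndexEntry is an assumed in-memory representation of one index entry.
type pageIndexEntry struct {
	Pgno   uint32 // database page number (index is sorted ascending by Pgno)
	Offset int64  // byte offset of the page's data within the LTX file
}

// findPage binary-searches a sorted page index instead of scanning it linearly.
func findPage(index []pageIndexEntry, target uint32) (pageIndexEntry, bool) {
	i := sort.Search(len(index), func(i int) bool { return index[i].Pgno >= target })
	if i < len(index) && index[i].Pgno == target {
		return index[i], true
	}
	return pageIndexEntry{}, false
}
```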
### Caching Strategy
```go
// Page index caching example
const DefaultEstimatedPageIndexSize = 32 * 1024 // 32KB
// Fetch end of file first for page index
offset := info.Size - DefaultEstimatedPageIndexSize // info.Size is the total LTX file size
if offset < 0 {
offset = 0
}
// Read page index once, cache for duration of operation
```
### Batch Operations
- Group small writes into larger LTX files
- Batch delete operations for old files
- Use prepared statements for repeated queries
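As one concrete example of the prepared-statement advice, a hedged sketch using `database/sql`; the table, columns, and `Metric` type are illustrative.

```go
// insertMetricsBatch reuses one prepared statement for a batch of inserts
// instead of re-parsing the SQL for every row.
func insertMetricsBatch(sqldb *sql.DB, batch []Metric) error {
	stmt, err := sqldb.Prepare(`INSERT INTO metrics (name, value) VALUES (?, ?)`)
	if err != nil {
		return err
	}
	defer stmt.Close()

	for _, m := range batch {
		if _, err := stmt.Exec(m.Name, m.Value); err != nil {
			return fmt.Errorf("insert %s: %w", m.Name, err)
		}
	}
	return nil
}
```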
## Testing Requirements
### For Any DB Changes
```bash
# Test with various page sizes
./bin/litestream-test populate -db test.db -page-size 4096 -target-size 2GB
./bin/litestream-test populate -db test.db -page-size 8192 -target-size 2GB
# Test lock page handling
./bin/litestream-test validate -source-db test.db -replica-url file:///tmp/replica
```
### For Replica Client Changes
```bash
# Test eventual consistency
go test -v ./replica_client_test.go -integration [s3|gcs|abs|oss|sftp]
# Test partial reads
# (Example) add targeted partial-read tests in your backend package
go test -v -run TestReplicaClient_PartialRead ./...
```
### For Compaction Changes
```bash
# Test with store compaction
go test -v -run TestStore_CompactDB ./...
# Test with eventual consistency mock
go test -v -run TestStore_CompactDB_RemotePartialRead ./...
```
### Race Condition Testing
```bash
# Always run with race detector
go test -race -v ./...
# Specific race-prone areas
go test -race -v -run TestReplica_Sync ./...
go test -race -v -run TestDB_Sync ./...
go test -race -v -run TestStore_CompactDB ./...
```
Database state logic belongs in DB layer, not Replica layer.
## Quick Reference
### File Paths
**Build:**
- **Database**: `/path/to/database.db`
- **Metadata**: `/path/to/database.db-litestream/`
- **LTX Files**: `/path/to/database.db-litestream/ltx/LEVEL/MIN-MAX.ltx`
- **Snapshots**: `/path/to/database.db-litestream/snapshots/TIMESTAMP.ltx`
### Key Configuration
```yaml
l0-retention: 5m # Minimum time to keep compacted L0 files
l0-retention-check-interval: 15s # Frequency for enforcing L0 retention
dbs:
- path: /path/to/db.sqlite
replica:
type: s3
bucket: my-bucket
path: db-backup
sync-interval: 10s # How often to sync
# Compaction configuration (default)
levels:
- level: 1
interval: 30s # 30-second windows
- level: 2
interval: 5m # 5-minute windows
- level: 3
interval: 1h # 1-hour windows
```
```bash
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
```
### Important Constants
**Code quality:**
```go
DefaultMonitorInterval = 1 * time.Second // WAL check frequency
DefaultCheckpointInterval = 1 * time.Minute // Time-based passive checkpoint frequency
DefaultMinCheckpointPageN = 1000 // Min pages before passive checkpoint
DefaultTruncatePageN = 121359 // ~500MB truncate threshold
// Note: DefaultMaxCheckpointPageN was removed (RESTART checkpoint mode permanently removed due to #724).
```
```bash
pre-commit run --all-files
```
## Getting Help
## Documentation
For complex architectural questions, consult:
1. **`docs/SQLITE_INTERNALS.md`** - SQLite fundamentals, WAL format, lock page details
2. **`docs/LTX_FORMAT.md`** - LTX file format specification and operations
3. `docs/ARCHITECTURE.md` - Deep technical details of Litestream components
4. `docs/REPLICA_CLIENT_GUIDE.md` - Storage backend implementation guide
5. `docs/TESTING_GUIDE.md` - Comprehensive testing strategies
6. Review recent PRs for current patterns and best practices
| Document | When to Read |
|----------|--------------|
| [docs/PATTERNS.md](docs/PATTERNS.md) | Code patterns and anti-patterns |
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Deep component details |
| [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) | WAL format, lock page |
| [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md) | Replication format |
| [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) | Test strategies |
| [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md) | Adding storage backends |
## Future Roadmap
## Checklist
**Planned Features:**
- **Litestream VFS**: Virtual File System for read replicas
- Instantly spin up database copies
- Background hydration from S3
- Enables scaling read operations without full database downloads
- **Enhanced read replica support**: Direct reads from remote storage
Before submitting changes:
## Important Constraints
1. **Single Replica Authority**: Each database is replicated to exactly one remote target—configure redundancy at the storage layer if needed.
2. **Legacy Backups**: Pre-LTX (v0.3.x) WAL snapshots cannot be restored with current binaries; keep an old binary around to hydrate those backups before re-replicating.
3. **CLI Changes**: Use `litestream ltx` for LTX inspection; `litestream wal` is deprecated.
4. **Pure Go Build**: The default build is CGO-free via `modernc.org/sqlite`; enable CGO only for optional VFS tooling.
5. **Page-Level Compaction**: Expect compaction to merge files across 30s/5m/1h windows plus daily snapshots.
## Final Checklist Before Making Changes
- [ ] Read this entire document
- [ ] Read `docs/SQLITE_INTERNALS.md` for SQLite fundamentals
- [ ] Read `docs/LTX_FORMAT.md` for replication format details
- [ ] Understand current constraints (single replica authority, LTX-only restores)
- [ ] Understand the component you're modifying
- [ ] Understand architectural boundaries (DB vs Replica responsibilities)
- [ ] Check for eventual consistency implications
- [ ] Consider >1GB database edge cases (lock page at 0x40000000)
- [ ] Use atomic file operations (temp file + rename)
- [ ] Return errors properly (don't just log and continue)
- [ ] Leverage existing mechanisms (e.g., verify() for snapshots)
- [ ] Plan appropriate tests
- [ ] Review recent similar PRs for patterns
- [ ] Use proper locking (Lock vs RLock)
- [ ] Preserve timestamps where applicable
- [ ] Test with race detector enabled
## Agent-Specific Instructions
This document serves as the universal source of truth for all AI coding assistants. Different agents may access it through various paths:
- **Claude**: Reads `AGENTS.md` directly (also loads `CLAUDE.md` if present)
- **GitHub Copilot**: Via `.github/copilot-instructions.md` symlink
- **Cursor**: Via `.cursorrules` symlink
- **Gemini**: Reads `AGENTS.md` and respects `.aiexclude` patterns
- **Other agents**: Check for `AGENTS.md` or `llms.txt` in repository root
### GitHub Copilot / OpenAI Codex
**Context Window**: 64k tokens (upgrading to 1M with GPT-4.1)
**Best Practices**:
- Use `/explain` command for SQLite internals
- Reference patterns in Common Pitfalls section
- Switch to GPT-5-Codex model for complex refactoring
- Focus on architectural boundaries and anti-patterns
- Leverage workspace indexing for multi-file operations
**Model Selection**:
- Use GPT-4o for quick completions
- Switch to GPT-5 or Claude Opus 4.1 for complex tasks
### Cursor
**Context Window**: Configurable based on model selection
**Best Practices**:
- Enable "codebase indexing" for full repository context
- Use Claude 3.5 Sonnet for architectural questions
- Use GPT-4o for quick inline completions
- Split complex rules into `.cursor/rules/*.mdc` files if needed
- Leverage workspace search before asking questions
**Model Recommendations**:
- **Architecture changes**: Claude 3.5 Sonnet
- **Quick fixes**: GPT-4o or cursor-small
- **Test generation**: Any model with codebase context
### Claude / Claude Code
**Context Window**: 200k tokens standard (1M in beta)
**Best Practices**:
- Full documentation can be loaded (5k lines fits easily)
- Reference `docs/` subdirectory for deep technical details
- Use structured note-taking for complex multi-step tasks
- Leverage MCP tools when available
- Check `CLAUDE.md` for project-specific configuration
**Strengths**:
- Deep architectural reasoning
- Complex system analysis
- Large context window utilization
### Google Gemini / Gemini Code Assist
**Context Window**: Varies by tier
**Best Practices**:
- Check `.aiexclude` for files to ignore
- Enable local codebase awareness
- Excellent for test generation and documentation
- Use for code review and security scanning
- Leverage code customization features
**Configuration**:
- Respects `.aiexclude` patterns (like `.gitignore`)
- Can use custom AI rules files
### General Multi-Agent Guidelines
1. **Always start with this document** (AGENTS.md) for project understanding
2. **Check `llms.txt`** for quick navigation to other documentation
3. **Respect architectural boundaries** (DB layer vs Replica layer)
4. **Follow the patterns** in Common Pitfalls section
5. **Test with race detector** for any concurrent code changes
6. **Preserve backward compatibility** with current constraints
### Documentation Hierarchy
```text
Tier 1 (Always read):
- AGENTS.md (this file)
- llms.txt (if you need navigation)
Tier 2 (Read when relevant):
- docs/SQLITE_INTERNALS.md (for WAL/page work)
- docs/LTX_FORMAT.md (for replication work)
- docs/ARCHITECTURE.md (for major changes)
Tier 3 (Reference only):
- docs/TESTING_GUIDE.md (for test scenarios)
- docs/REPLICA_CLIENT_GUIDE.md (for new backends)
```
- [ ] Read relevant docs above
- [ ] Follow patterns in [docs/PATTERNS.md](docs/PATTERNS.md)
- [ ] Test with race detector (`go test -race`)
- [ ] Run `pre-commit run --all-files`
- [ ] For page iteration: test with >1GB databases
- [ ] Show investigation evidence in PR (see [AI_PR_GUIDE.md](AI_PR_GUIDE.md))

AI_PR_GUIDE.md

@@ -0,0 +1,178 @@
# AI-Assisted Contribution Guide
This guide helps AI assistants (and humans using them) submit high-quality PRs to Litestream.
## TL;DR Checklist
Before submitting a PR:
- [ ] **Show your investigation** - Include logs, file patterns, or debug output proving the problem
- [ ] **Define scope clearly** - State what this PR does AND does not do
- [ ] **Include runnable test commands** - Not just descriptions, actual `go test` commands
- [ ] **Reference related issues/PRs** - Show awareness of related work
## What Makes PRs Succeed
Analysis of recent PRs shows successful submissions share these patterns:
### 1. Investigation Artifacts
Show evidence, don't just describe the fix.
**Good:**
```markdown
## Problem
File patterns show excessive snapshot creation after checkpoint:
- 21:43 5.2G snapshot.ltx
- 21:47 5.2G snapshot.ltx (after checkpoint - should not trigger new snapshot)
Debug logs show `verify()` incorrectly detecting position mismatch...
```
**Bad:**
```markdown
## Problem
Snapshots are created too often. This PR fixes it.
```
### 2. Clear Scope Definition
Explicitly state boundaries.
**Good:**
```markdown
## Scope
This PR adds the lease client interface only.
**In scope:**
- LeaseClient interface definition
- Mock implementation for testing
**Not in scope (future PRs):**
- Integration with Store
- Distributed coordination logic
```
**Bad:**
```markdown
## Changes
Added leasing support and also fixed a checkpoint bug I noticed.
```
### 3. Runnable Test Commands
**Good:** Include actual commands that can be run:
```bash
# Unit tests
go test -race -v -run TestDB_CheckpointDoesNotTriggerSnapshot ./...
# Integration test with file backend
go test -v ./replica_client_test.go -integration file
```
**Bad:** Vague descriptions like "Manual testing with file backend" or "Verified it works"
### 4. Before/After Comparison
For behavior changes, show the difference:
**Good:**
```markdown
## Behavior Change
| Scenario | Before | After |
|----------|--------|-------|
| Checkpoint with no changes | Creates snapshot | No snapshot |
| Checkpoint with changes | Creates snapshot | Creates snapshot |
```
## Common Mistakes
### Scope Creep
**Problem:** Mixing unrelated changes in one PR.
**Example:** PR titled "Add lease client" also includes a fix for checkpoint timing.
**Fix:** Split into separate PRs. Reference them: "This PR adds the lease client. The checkpoint fix is in #XXX."
### Missing Root Cause Analysis
**Problem:** Implementing a fix without proving the problem exists.
**Example:** "Add exponential backoff" without showing what's filling disk.
**Fix:** Include investigation showing the actual cause before proposing solution.
### Vague Test Plans
**Problem:** "Tested manually" or "Verified it works."
**Fix:** Include exact commands:
```bash
go test -race -v -run TestSpecificFunction ./...
```
### No Integration Context
**Problem:** Large features without explaining how they fit.
**Fix:** For multi-PR work, explain the phases:
```markdown
This is Phase 1 of 3 for distributed leasing:
1. **This PR**: Lease client interface
2. Future: Store integration
3. Future: Distributed coordination
```
## PR Description Template
Use this structure for PR descriptions:
```text
## Summary
[1-2 sentences: what this PR does]
## Problem
[Evidence of the problem - logs, file patterns, user reports]
## Solution
[Brief explanation of the approach]
## Scope
**In scope:**
- [item]
**Not in scope:**
- [item]
## Test Plan
[Include actual go test commands here]
## Related
- Fixes #XXX
- Related to #YYY
```
## What We Accept
From [CONTRIBUTING.md](CONTRIBUTING.md):
- **Bug fixes** - Welcome, especially with evidence
- **Small improvements** - Performance, code cleanup
- **Documentation** - Always welcome
- **Features** - Discuss in issue first; large features typically implemented internally
## Resources
- [AGENTS.md](AGENTS.md) - Project overview and checklist
- [docs/PATTERNS.md](docs/PATTERNS.md) - Code patterns
- [CONTRIBUTING.md](CONTRIBUTING.md) - Contribution guidelines

CLAUDE.md

@@ -1,225 +1,40 @@
# CLAUDE.md - Claude Code Optimizations for Litestream
# CLAUDE.md - Claude Code Configuration
This file is automatically loaded by Claude Code and provides Claude-specific optimizations. For comprehensive project documentation, see AGENTS.md.
Claude-specific optimizations for Litestream. See [AGENTS.md](AGENTS.md) for project documentation.
## Claude-Specific Optimizations
## Context Window
**Primary Documentation**: See AGENTS.md for comprehensive architectural guidance, patterns, and anti-patterns.
With Claude's large context window, load documentation as needed:
### Context Window Advantages
- Start with [AGENTS.md](AGENTS.md) for overview and checklist
- Load [docs/PATTERNS.md](docs/PATTERNS.md) when writing code
- Load [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) for WAL/page work
With Claude's 200k token context window, you can load the entire documentation suite:
- Full AGENTS.md for patterns and anti-patterns
- All docs/ subdirectory files for deep technical understanding
- Multiple source files simultaneously for cross-referencing
## Claude-Specific Resources
### Key Focus Areas for Claude
### Specialized Agents (.claude/agents/)
1. **Architectural Reasoning**: Leverage deep understanding of DB vs Replica layer boundaries
2. **Complex Analysis**: Use full context for multi-file refactoring
3. **SQLite Internals**: Reference docs/SQLITE_INTERNALS.md for WAL format details
4. **LTX Format**: Reference docs/LTX_FORMAT.md for replication specifics
- `sqlite-expert.md` - SQLite WAL and page management
- `replica-client-developer.md` - Storage backend implementation
- `ltx-compaction-specialist.md` - LTX format and compaction
- `test-engineer.md` - Testing strategies
- `performance-optimizer.md` - Performance optimization
### Claude-Specific Resources
### Commands (.claude/commands/)
#### Specialized Agents (.claude/agents/)
- `/analyze-ltx` - Analyze LTX file structure
- `/debug-wal` - Debug WAL replication issues
- `/test-compaction` - Test compaction scenarios
- `/trace-replication` - Trace replication flow
- `/validate-replica` - Validate replica client
- `/add-storage-backend` - Create new storage backend
- `/fix-common-issues` - Diagnose common problems
- `/run-comprehensive-tests` - Execute full test suite
- **sqlite-expert.md**: SQLite WAL and page management expertise
- **replica-client-developer.md**: Storage backend implementation
- **ltx-compaction-specialist.md**: LTX format and compaction
- **test-engineer.md**: Comprehensive testing strategies
- **performance-optimizer.md**: Performance and resource optimization
#### Commands (.claude/commands/)
- `/analyze-ltx`: Analyze LTX file structure and contents
- `/debug-wal`: Debug WAL replication issues
- `/test-compaction`: Test compaction scenarios
- `/trace-replication`: Trace replication flow
- `/validate-replica`: Validate replica client implementation
- `/add-storage-backend`: Create new storage backend
- `/fix-common-issues`: Diagnose and fix common problems
- `/run-comprehensive-tests`: Execute full test suite
Use these commands with: `<command> [arguments]` in Claude Code.
## Overview
Litestream is a standalone disaster recovery tool for SQLite that runs as a background process and safely replicates changes incrementally to another file or S3. It works through the SQLite API to prevent database corruption.
## Build and Development Commands
### Building
## Quick Commands
```bash
# Build the main binary
go build ./cmd/litestream
# Install the binary
go install ./cmd/litestream
# Build for specific platforms (using Makefile)
make docker # Build Docker image
make dist-linux # Build Linux AMD64 distribution
make dist-linux-arm # Build Linux ARM distribution
make dist-linux-arm64 # Build Linux ARM64 distribution
make dist-macos # Build macOS distribution (requires LITESTREAM_VERSION env var)
```
### Testing
```bash
# Run all tests
go test -v ./...
# Run tests with coverage
go test -v -cover ./...
# Test VFS functionality (requires CGO and explicit vfs build tag)
go test -tags vfs ./cmd/litestream-vfs -v
# Test builds before committing (always use -o bin/ to avoid committing binaries)
go build -o bin/litestream ./cmd/litestream # Test main build (no CGO required)
CGO_ENABLED=1 go build -tags vfs -o bin/litestream-vfs ./cmd/litestream-vfs # Test VFS with CGO
# Run specific integration tests (requires environment setup)
go test -v ./replica_client_test.go -integration s3
go test -v ./replica_client_test.go -integration gcs
go test -v ./replica_client_test.go -integration abs
go test -v ./replica_client_test.go -integration oss
go test -v ./replica_client_test.go -integration sftp
```
### Code Quality
```bash
# Format code
go fmt ./...
goimports -local github.com/benbjohnson/litestream -w .
# Run linters
go vet ./...
staticcheck ./...
# Run pre-commit hooks (includes trailing whitespace, goimports, go-vet, staticcheck)
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
pre-commit run --all-files
```
## Architecture
### Core Components
**DB (`db.go`)**: Manages a SQLite database instance with WAL monitoring, checkpoint management, and metrics. Handles replication coordination and maintains long-running read transactions for consistency.
**Replica (`replica.go`)**: Connects a database to replication destinations via ReplicaClient interface. Manages periodic synchronization and maintains replication position.
**ReplicaClient Interface** (`replica_client.go`): Abstraction for different storage backends (S3, GCS, Azure Blob Storage, OSS, SFTP, file system, NATS). Each implementation handles snapshot/WAL segment upload and restoration. The `LTXFiles` method includes a `useMetadata` parameter: when true, it fetches accurate timestamps from backend metadata (required for point-in-time restores); when false, it uses fast timestamps for normal operations. During compaction, the system preserves the earliest CreatedAt timestamp from source files to maintain temporal granularity for restoration.
**WAL Processing**: The system monitors SQLite WAL files for changes, segments them into LTX format files, and replicates these segments to configured destinations. Uses SQLite checksums for integrity verification.
### Storage Backends
- **S3** (`s3/replica_client.go`): AWS S3 and compatible storage
- **GCS** (`gs/replica_client.go`): Google Cloud Storage
- **ABS** (`abs/replica_client.go`): Azure Blob Storage
- **OSS** (`oss/replica_client.go`): Alibaba Cloud Object Storage Service
- **SFTP** (`sftp/replica_client.go`): SSH File Transfer Protocol
- **File** (`file/replica_client.go`): Local file system replication
- **NATS** (`nats/replica_client.go`): NATS JetStream object storage
### Command Structure
Main entry point (`cmd/litestream/main.go`) provides subcommands:
- `replicate`: Primary replication daemon mode
- `restore`: Restore database from replica
- `databases`: List configured databases
- `ltx`: WAL/LTX file utilities (renamed from 'wal')
- `version`: Display version information
- `mcp`: Model Context Protocol support
## Key Design Patterns
1. **Non-invasive monitoring**: Uses SQLite API exclusively, no direct file manipulation
2. **Incremental replication**: Segments WAL into small chunks for efficient transfer
3. **Single remote authority**: Each database replicates to exactly one destination
4. **Age encryption**: Optional end-to-end encryption using age identities/recipients
5. **Prometheus metrics**: Built-in observability for monitoring replication health
6. **Timestamp preservation**: Compaction preserves earliest CreatedAt timestamp from source files to maintain temporal granularity for point-in-time restoration
## Configuration
Primary configuration via YAML file (`etc/litestream.yml`) or environment variables. Supports:
- Database paths and replica destinations
- Sync intervals and checkpoint settings
- Authentication credentials for cloud storage
- Encryption keys for age encryption
## Important Notes
- External contributions accepted for bug fixes only (not features)
- Uses pre-commit hooks for code quality enforcement
- Requires Go 1.24+ for build
- Main binary does NOT require CGO
- VFS functionality requires explicit `-tags vfs` build flag AND CGO_ENABLED=1
- **ALWAYS build binaries into `bin/` directory** which is gitignored (e.g., `go build -o bin/litestream`)
- Always test builds with different configurations before committing
## Workflows and Best Practices
- Any time you create/edit markdown files, lint and fix them with markdownlint
## Testing Considerations
### SQLite Lock Page at 1GB Boundary
Litestream handles a critical SQLite edge case: the lock page at exactly 1GB
(offset 0x40000000). This page is reserved by SQLite for file locking and
cannot contain data. The code skips this page during replication (see
db.go:951-953).
**Key Implementation Details:**
- Lock page calculation: `LockPgno = (0x40000000 / pageSize) + 1`
- Located in LTX library: `ltx.LockPgno(pageSize)`
- Must be skipped when iterating through database pages
- Affects databases larger than 1GB regardless of page size
**Testing Requirements:**
1. **Create databases >1GB** to ensure lock page handling works
2. **Test with various page sizes** as lock page number changes:
- 4KB: page 262145 (default, most common)
- 8KB: page 131073
- 16KB: page 65537
- 32KB: page 32769
3. **Verify replication** correctly skips the lock page
4. **Test restoration** to ensure databases restore properly across 1GB boundary
**Quick Test Script:**
```bash
# Create a >1GB test database
sqlite3 large.db <<EOF
PRAGMA page_size=4096;
CREATE TABLE test(data BLOB);
-- Insert enough data to exceed 1GB
WITH RECURSIVE generate_series(value) AS (
SELECT 1 UNION ALL SELECT value+1 FROM generate_series LIMIT 300000
)
INSERT INTO test SELECT randomblob(4000) FROM generate_series;
EOF
# Verify it crosses the 1GB boundary
echo "File size: $(stat -f%z large.db 2>/dev/null || stat -c%s large.db)"
echo "Page count: $(sqlite3 large.db 'PRAGMA page_count')"
echo "Lock page should be at: $((0x40000000 / 4096 + 1))"
# Test replication
./bin/litestream replicate large.db file:///tmp/replica
# Test restoration
./bin/litestream restore -o restored.db file:///tmp/replica
sqlite3 restored.db "PRAGMA integrity_check;"
```

CONTRIBUTING.md

@@ -22,6 +22,22 @@ Thank you for your interest in contributing to Litestream! We value community co
- **Large external feature contributions**: Features carry a long-term maintenance burden. To reduce burnout and maintain code quality, we typically implement major features internally. This allows us to ensure consistency with the overall architecture and maintain the high reliability that Litestream users depend on for disaster recovery
- **Breaking changes**: Changes that break backward compatibility require extensive discussion
## AI-Assisted Contributions
We welcome AI-assisted contributions for bug fixes and small improvements. Whether you're using Claude, Copilot, Cursor, or other AI tools:
**Requirements:**
- **Show your investigation** - Include evidence (logs, file patterns, debug output) proving the problem exists
- **Define scope clearly** - State what the PR does and does not do
- **Include runnable test commands** - Actual `go test` commands, not just descriptions
- **Human review before submission** - You're responsible for the code you submit
**Resources:**
- [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - Detailed guide with templates and examples
- [AGENTS.md](AGENTS.md) - Project overview for AI assistants
## How to Contribute
### Reporting Bugs

GEMINI.md

@@ -1,81 +1,42 @@
# GEMINI.md - Gemini Code Assist Configuration for Litestream
# GEMINI.md - Gemini Code Assist Configuration
This file provides Gemini-specific configuration and notes. For comprehensive project documentation, see AGENTS.md.
Gemini-specific configuration for Litestream. See [AGENTS.md](AGENTS.md) for project documentation.
## Primary Documentation
## Before Contributing
**See AGENTS.md** for complete architectural guidance, patterns, and anti-patterns for working with Litestream.
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - PR quality requirements
2. Read [AGENTS.md](AGENTS.md) - Project overview and checklist
3. Check [CONTRIBUTING.md](CONTRIBUTING.md) - What we accept
## Gemini-Specific Configuration
## File Exclusions
### File Exclusions
Check `.aiexclude` file for patterns of files that should not be shared with Gemini (similar to `.gitignore`).
Check `.aiexclude` for patterns of files that should not be shared with Gemini.
### Strengths for This Project
## Gemini Strengths for This Project
1. **Test Generation**: Excellent at creating comprehensive test suites
2. **Documentation**: Strong at generating and updating documentation
3. **Code Review**: Good at identifying potential issues and security concerns
4. **Local Codebase Awareness**: Enable for full repository understanding
- **Test generation** - Creating comprehensive test suites
- **Documentation** - Generating and updating docs
- **Code review** - Identifying issues and security concerns
- **Local codebase awareness** - Enable for full repository understanding
## Key Project Concepts
## Documentation
### SQLite Lock Page
- Must skip page at 1GB boundary (0x40000000)
- Page number varies by page size (262145 for 4KB pages)
- See docs/SQLITE_INTERNALS.md for details
Load as needed:
### LTX Format
- Immutable replication files
- Named by transaction ID ranges
- See docs/LTX_FORMAT.md for specification
- [docs/PATTERNS.md](docs/PATTERNS.md) - Code patterns when writing code
- [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) - For WAL/page work
- [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) - For test generation
### Architectural Boundaries
- DB layer (db.go): Database state and restoration
- Replica layer (replica.go): Replication only
- Storage layer: ReplicaClient implementations
## Critical Rules
## Testing Focus
- **Lock page at 1GB** - Skip page at 0x40000000
- **LTX files are immutable** - Never modify after creation
- **Layer boundaries** - DB handles state, Replica handles replication
When generating tests:
- Include >1GB database tests for lock page verification
- Add race condition tests with -race flag
- Test various page sizes (4KB, 8KB, 16KB, 32KB)
- Include eventual consistency scenarios
## Common Tasks
### Adding Storage Backend
1. Implement ReplicaClient interface
2. Follow existing patterns (s3/, gs/, abs/)
3. Handle eventual consistency
4. Generate comprehensive tests
### Refactoring
1. Respect layer boundaries (DB vs Replica)
2. Maintain current constraints (single replica authority, LTX-only restores)
3. Use atomic file operations
4. Return errors properly (don't just log)
## Build and Test Commands
## Quick Commands
```bash
# Build without CGO
go build -o bin/litestream ./cmd/litestream
# Test with race detection
go test -race -v ./...
# Test specific backend
go test -v ./replica_client_test.go -integration s3
pre-commit run --all-files
```
## Configuration Reference
See `etc/litestream.yml` for configuration examples. Remember: each database replicates to exactly one remote destination.
## Additional Resources
- llms.txt: Quick navigation index
- docs/: Deep technical documentation
- .claude/commands/: Task-specific commands (if using with Claude Code)

docs/PATTERNS.md

@@ -0,0 +1,433 @@
# Litestream Code Patterns and Anti-Patterns
This document contains detailed code patterns, examples, and anti-patterns for working with Litestream. For a quick overview, see [AGENTS.md](../AGENTS.md).
## Table of Contents
- [Architectural Boundaries](#architectural-boundaries)
- [Atomic File Operations](#atomic-file-operations)
- [Error Handling](#error-handling)
- [Locking Patterns](#locking-patterns)
- [Compaction and Eventual Consistency](#compaction-and-eventual-consistency)
- [Timestamp Preservation](#timestamp-preservation)
- [Common Pitfalls](#common-pitfalls)
- [Component Reference](#component-reference)
## Architectural Boundaries
### Layer Responsibilities
```text
DB Layer (db.go) → Database state, restoration, monitoring
Replica Layer (replica.go) → Replication mechanics only
Storage Layer → ReplicaClient implementations
```
### DO: Handle database state in DB layer
Database restoration logic belongs in the DB layer, not the Replica layer.
When the database is behind the replica (local TXID < remote TXID):
1. **Clear local L0 cache**: Remove the entire L0 directory and recreate it
2. **Fetch latest L0 file from replica**: Download the most recent L0 LTX file
3. **Write using atomic file operations**: Prevent partial/corrupted files
```go
// CORRECT - DB layer handles database state
func (db *DB) init() error {
// DB layer handles database state
if db.needsRestore() {
if err := db.restore(); err != nil {
return err
}
}
// Then start replica for replication only
return db.replica.Start()
}
func (r *Replica) Start() error {
// Replica focuses only on replication
return r.startSync()
}
```
Reference: `DB.checkDatabaseBehindReplica()` in db.go:670-737
### DON'T: Put database state logic in Replica layer
```go
// WRONG - Replica should only handle replication concerns
func (r *Replica) Start() error {
// DON'T check database state here
if needsRestore() { // Wrong layer!
restoreDatabase() // Wrong layer!
}
// Replica should focus only on replication mechanics
}
```
## Atomic File Operations
Always use atomic writes to prevent partial/corrupted files.
### DO: Write to temp file, then rename
```go
// CORRECT - Atomic file write pattern
func writeFileAtomic(path string, data []byte) error {
// Create temp file in same directory (for atomic rename)
dir := filepath.Dir(path)
tmpFile, err := os.CreateTemp(dir, ".tmp-*")
if err != nil {
return fmt.Errorf("create temp file: %w", err)
}
tmpPath := tmpFile.Name()
// Clean up temp file on error
defer func() {
if tmpFile != nil {
tmpFile.Close()
os.Remove(tmpPath)
}
}()
// Write data to temp file
if _, err := tmpFile.Write(data); err != nil {
return fmt.Errorf("write temp file: %w", err)
}
// Sync to ensure data is on disk
if err := tmpFile.Sync(); err != nil {
return fmt.Errorf("sync temp file: %w", err)
}
// Close before rename
if err := tmpFile.Close(); err != nil {
return fmt.Errorf("close temp file: %w", err)
}
tmpFile = nil // Prevent defer cleanup
// Atomic rename (on same filesystem)
if err := os.Rename(tmpPath, path); err != nil {
os.Remove(tmpPath)
return fmt.Errorf("rename to final path: %w", err)
}
return nil
}
```
### DON'T: Write directly to final location
```go
// WRONG - Can leave partial files on failure
func writeFileDirect(path string, data []byte) error {
return os.WriteFile(path, data, 0644) // Not atomic!
}
```
## Error Handling
### DO: Return errors immediately
```go
// CORRECT - Return error for caller to handle
func (db *DB) validatePosition() error {
dpos, err := db.Pos()
if err != nil {
return err
}
rpos := replica.Pos()
if dpos.TXID < rpos.TXID {
return fmt.Errorf("database position (%v) behind replica (%v)", dpos, rpos)
}
return nil
}
```
### DON'T: Continue on critical errors
```go
// WRONG - Silently continuing can cause data corruption
func (db *DB) validatePosition() {
if dpos, _ := db.Pos(); dpos.TXID < replica.Pos().TXID {
log.Printf("warning: position mismatch") // Don't just log!
// Continuing here is dangerous
}
}
```
### DON'T: Ignore errors and continue in loops
```go
// WRONG - Continuing after error can corrupt state
func (db *DB) processFiles() {
for _, file := range files {
if err := processFile(file); err != nil {
log.Printf("error: %v", err) // Just logging!
// Continuing to next file is dangerous
}
}
}
```
### DO: Return errors properly in loops
```go
// CORRECT - Let caller decide how to handle errors
func (db *DB) processFiles() error {
for _, file := range files {
if err := processFile(file); err != nil {
return fmt.Errorf("process file %s: %w", file, err)
}
}
return nil
}
```
## Locking Patterns
### DO: Use proper lock types
```go
// CORRECT - Use Lock() for writes
r.mu.Lock()
defer r.mu.Unlock()
r.pos = pos
```
### DON'T: Use RLock for write operations
```go
// WRONG - Race condition
r.mu.RLock() // Should be Lock() for writes
defer r.mu.RUnlock()
r.pos = pos // Writing with RLock!
```
## Compaction and Eventual Consistency
Many storage backends (S3, R2, etc.) are eventually consistent:
- A file you just wrote might not be immediately readable
- A file might be listed but only partially available
- Reads might return stale or incomplete data
### DO: Read from local when available
```go
// CORRECT - Check local first during compaction
// db.go:1280-1294 - ALWAYS read from local disk when available
f, err := os.Open(db.LTXPath(info.Level, info.MinTXID, info.MaxTXID))
if err == nil {
// Use local file - it's complete and consistent
return f, nil
}
// Only fall back to remote if local doesn't exist
return replica.Client.OpenLTXFile(...)
```
### DON'T: Read from remote during compaction
```go
// WRONG - Can get partial/corrupt data from eventually consistent storage
f, err := client.OpenLTXFile(ctx, level, minTXID, maxTXID, 0, 0)
```
## Timestamp Preservation
During compaction, preserve the earliest CreatedAt timestamp from source files to maintain temporal granularity for point-in-time restoration.
### DO: Preserve earliest timestamp
```go
// CORRECT - Preserve temporal information
info, err := replica.Client.WriteLTXFile(ctx, level, minTXID, maxTXID, r)
if err != nil {
return fmt.Errorf("write ltx: %w", err)
}
info.CreatedAt = oldestSourceFile.CreatedAt
```
### DON'T: Ignore CreatedAt preservation
```go
// WRONG - Loses timestamp granularity for point-in-time restores
info := &ltx.FileInfo{
CreatedAt: time.Now(), // Don't use current time during compaction
}
```
## Common Pitfalls
### 1. Mixing architectural concerns
```go
// WRONG - Database state logic in Replica layer
func (r *Replica) Start() error {
if db.needsRestore() { // Wrong layer for DB state!
r.restoreDatabase() // Replica shouldn't manage DB state!
}
return r.sync()
}
```
### 2. Recreating existing functionality
```go
// WRONG - Don't reimplement what already exists
func customSnapshotTrigger() {
// Complex custom logic to trigger snapshots
// when db.verify() already does this!
}
```
### DO: Leverage existing mechanisms
```go
// CORRECT - Use what's already there
func triggerSnapshot() error {
return db.verify() // Already handles snapshot logic correctly
}
```
### 3. Skipping the lock page
The lock page at 1GB (0x40000000) must always be skipped:
```go
// db.go:951-953 - Must skip lock page during replication
lockPgno := ltx.LockPgno(pageSize)
if pgno == lockPgno {
continue // Skip this page - it's reserved by SQLite
}
```
Lock page numbers by page size:
| Page Size | Lock Page Number |
|-----------|------------------|
| 4KB | 262145 |
| 8KB | 131073 |
| 16KB | 65537 |
| 32KB | 32769 |
## Component Reference
### DB Component (db.go)
**Responsibilities:**
- Manages SQLite database connection (via `modernc.org/sqlite` - no CGO)
- Monitors WAL for changes
- Performs checkpoints
- Maintains long-running read transaction
- Converts WAL pages to LTX format
**Key Fields:**
```go
type DB struct {
path string // Database file path
db *sql.DB // SQLite connection
rtx *sql.Tx // Long-running read transaction
pageSize int // Database page size (critical for lock page)
notify chan struct{} // Notifies on WAL changes
}
```
**Initialization Sequence:**
1. Open database connection
2. Read page size from database
3. Initialize long-running read transaction
4. Start monitor goroutine
5. Initialize replicas
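The standalone sketch below mirrors steps 1–3 using `database/sql` and the `modernc.org/sqlite` driver. It is not the actual `db.go` code; the WAL monitor and replica startup (steps 4–5) are elided, and it only illustrates the ordering.
```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "modernc.org/sqlite"
)

func main() {
	// 1. Open the database connection (pure Go driver, no CGO).
	db, err := sql.Open("sqlite", "test.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// 2. Read the page size; it determines the lock page number.
	var pageSize int
	if err := db.QueryRow(`PRAGMA page_size`).Scan(&pageSize); err != nil {
		log.Fatal(err)
	}

	// 3. Begin a long-lived read transaction so reads stay consistent
	//    while pages are replicated.
	rtx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer rtx.Rollback()

	fmt.Println("page size:", pageSize)
	// 4–5. The real DB then starts its WAL monitor goroutine and replica sync.
}
```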
### Replica Component (replica.go)
**Responsibilities:**
- Manages replication to a single destination (one replica per DB)
- Tracks replication position (ltx.Pos)
- Handles sync intervals
- Manages encryption (if configured)
**Key Operations:**
- `Sync()`: Synchronizes pending changes
- `SetPos()`: Updates replication position (must use Lock, not RLock!)
- `Snapshot()`: Creates full database snapshot
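A minimal sketch of the read/write lock pairing around the replication position. The type and method names are illustrative, not the exact `replica.go` API; only the locking discipline matters.
```go
package example

import (
	"sync"

	"github.com/superfly/ltx"
)

// trackedPos illustrates guarding the replication position with an RWMutex.
type trackedPos struct {
	mu  sync.RWMutex
	pos ltx.Pos
}

// Pos returns the current position; a read lock is sufficient.
func (p *trackedPos) Pos() ltx.Pos {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return p.pos
}

// SetPos updates the position; writes require the exclusive lock, never RLock.
func (p *trackedPos) SetPos(pos ltx.Pos) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pos = pos
}
```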
### ReplicaClient Interface (replica_client.go)
**Required Methods:**
```go
type ReplicaClient interface {
Type() string // Client type identifier
// File operations
LTXFiles(ctx context.Context, level int, seek ltx.TXID, useMetadata bool) (ltx.FileIterator, error)
OpenLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, offset, size int64) (io.ReadCloser, error)
WriteLTXFile(ctx context.Context, level int, minTXID, maxTXID ltx.TXID, r io.Reader) (*ltx.FileInfo, error)
DeleteLTXFiles(ctx context.Context, files []*ltx.FileInfo) error
DeleteAll(ctx context.Context) error
}
```
**useMetadata Parameter:**
- `useMetadata=true`: Fetch accurate timestamps from backend metadata (required for point-in-time restores)
- `useMetadata=false`: Use fast timestamps for normal operations
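A hedged usage sketch: listing level-0 files with `useMetadata=true` so `CreatedAt` is accurate enough for point-in-time restores. The iterator methods (`Next`, `Item`, `Err`, `Close`) are assumed from the `ltx` package's `FileIterator`, and the function name is illustrative.
```go
import (
	"context"
	"fmt"

	"github.com/benbjohnson/litestream"
)

func listLevel0(ctx context.Context, client litestream.ReplicaClient) error {
	itr, err := client.LTXFiles(ctx, 0, 0, true) // useMetadata=true
	if err != nil {
		return err
	}
	defer itr.Close()

	for itr.Next() {
		info := itr.Item()
		fmt.Printf("%v-%v created at %v\n", info.MinTXID, info.MaxTXID, info.CreatedAt)
	}
	return itr.Err()
}
```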
### Store Component (store.go)
**Default Compaction Levels:**
```go
var defaultLevels = CompactionLevels{
	{Level: 0, Interval: 0},                // Raw LTX files (no compaction)
	{Level: 1, Interval: 30 * time.Second},
	{Level: 2, Interval: 5 * time.Minute},
	{Level: 3, Interval: time.Hour},
	// Snapshots created daily (24h retention)
}
```
## Testing Patterns
### Race Condition Testing
```bash
# Always run with race detector
go test -race -v ./...
# Specific race-prone areas
go test -race -v -run TestReplica_Sync ./...
go test -race -v -run TestDB_Sync ./...
go test -race -v -run TestStore_CompactDB ./...
```
### Lock Page Testing
```bash
# Test with various page sizes
./bin/litestream-test populate -db test.db -page-size 4096 -target-size 2GB
./bin/litestream-test populate -db test.db -page-size 8192 -target-size 2GB
# Validate lock page handling
./bin/litestream-test validate -source-db test.db -replica-url file:///tmp/replica
```
### Integration Testing
```bash
# Test specific backend
go test -v ./replica_client_test.go -integration s3
go test -v ./replica_client_test.go -integration gcs
go test -v ./replica_client_test.go -integration abs
go test -v ./replica_client_test.go -integration oss
go test -v ./replica_client_test.go -integration sftp
```

llms.txt
@@ -1,83 +1,63 @@
# Litestream
Disaster recovery tool for SQLite that runs as a background process and safely replicates changes incrementally to S3, GCS, Azure Blob Storage, SFTP, or another file system.
Disaster recovery tool for SQLite. Replicates WAL changes to S3, GCS, Azure, SFTP, or local filesystem.
## Core Documentation
## Quick Start for AI Contributors
- [AGENTS.md](AGENTS.md): AI agent instructions, architectural patterns, and anti-patterns
- [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md): Critical SQLite knowledge including WAL format and 1GB lock page
- [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md): LTX (Log Transaction) format specification for replication
- [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md): Deep technical details of Litestream components
1. Read [AI_PR_GUIDE.md](AI_PR_GUIDE.md) - PR quality requirements
2. Read [AGENTS.md](AGENTS.md) - Project overview and checklist
3. Check [CONTRIBUTING.md](CONTRIBUTING.md) - What we accept
4. Show investigation evidence in PRs
## Implementation Guides
## PR Checklist
- [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md): Guide for implementing storage backends
- [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md): Comprehensive testing strategies including >1GB database tests
- [ ] Evidence of problem (logs, file patterns)
- [ ] Clear scope (what PR does/doesn't do)
- [ ] Runnable test commands
- [ ] Race detector tested (`go test -race`)
## Core Components
## Documentation
- [db.go](db.go): Database monitoring, WAL reading, checkpoint management
- [replica.go](replica.go): Replication management, position tracking, synchronization
- [store.go](store.go): Multi-database coordination, compaction scheduling
- [replica_client.go](replica_client.go): Interface definition for storage backends
| Document | Purpose |
|----------|---------|
| [AGENTS.md](AGENTS.md) | Project overview, critical rules |
| [AI_PR_GUIDE.md](AI_PR_GUIDE.md) | PR templates, common mistakes |
| [docs/PATTERNS.md](docs/PATTERNS.md) | Code patterns and anti-patterns |
| [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) | Component details |
| [docs/SQLITE_INTERNALS.md](docs/SQLITE_INTERNALS.md) | WAL format, 1GB lock page |
| [docs/LTX_FORMAT.md](docs/LTX_FORMAT.md) | Replication format |
| [docs/TESTING_GUIDE.md](docs/TESTING_GUIDE.md) | Test strategies |
| [docs/REPLICA_CLIENT_GUIDE.md](docs/REPLICA_CLIENT_GUIDE.md) | Storage backends |
## Core Files
| File | Purpose |
|------|---------|
| `db.go` | Database monitoring, WAL, checkpoints |
| `replica.go` | Replication management |
| `store.go` | Multi-database coordination |
| `replica_client.go` | Storage backend interface |
## Storage Backends
- [s3/replica_client.go](s3/replica_client.go): AWS S3 and compatible storage implementation
- [gs/replica_client.go](gs/replica_client.go): Google Cloud Storage implementation
- [abs/replica_client.go](abs/replica_client.go): Azure Blob Storage implementation
- [sftp/replica_client.go](sftp/replica_client.go): SFTP implementation
- [file/replica_client.go](file/replica_client.go): Local file system implementation
- [nats/replica_client.go](nats/replica_client.go): NATS JetStream implementation
- `s3/replica_client.go` - AWS S3
- `gs/replica_client.go` - Google Cloud Storage
- `abs/replica_client.go` - Azure Blob Storage
- `sftp/replica_client.go` - SFTP
- `file/replica_client.go` - Local filesystem
- `nats/replica_client.go` - NATS JetStream
## Critical Concepts
### SQLite Lock Page
The lock page at exactly 1GB (0x40000000) must always be skipped during replication. Page number varies by page size: 262145 for 4KB pages, 131073 for 8KB pages.
- **Lock page at 1GB** - Always skip page at 0x40000000
- **LTX files are immutable** - Never modify after creation
- **Single replica per DB** - One destination per database
- **Layer boundaries** - DB handles state, Replica handles replication
### LTX Format
Immutable, append-only files containing database changes. Files are named by transaction ID ranges (e.g., 0000000001-0000000064.ltx).
## Build
### Compaction Levels
- Level 0: Raw LTX files (no compaction)
- Level 1: 30-second windows
- Level 2: 5-minute windows
- Level 3: 1-hour windows
- Snapshots: Daily full database state
### Architectural Boundaries
- **DB Layer (db.go)**: Handles database state, restoration logic, monitoring
- **Replica Layer (replica.go)**: Focuses solely on replication concerns
- **Storage Layer**: Implements ReplicaClient interface for various backends
## Key Patterns
### Atomic File Operations
Always write to temporary file then rename for atomicity.
### Error Handling
Return errors immediately, don't log and continue.
### Eventual Consistency
Always prefer local files during compaction to handle eventually consistent storage.
### Locking
Use Lock() for writes, RLock() for reads. Never use RLock() when modifying state.
## Testing Requirements
- Test with databases >1GB to verify lock page handling
- Run with race detector enabled (-race flag)
- Test with various page sizes (4KB, 8KB, 16KB, 32KB)
- Verify eventual consistency handling with storage backends
## Configuration
Primary configuration via YAML file (etc/litestream.yml) or environment variables. Each database replicates to exactly one remote destination.
## Build Requirements
- Go 1.24+
- No CGO required for main binary (uses modernc.org/sqlite)
- CGO required only for VFS functionality (build with -tags vfs)
- Always build binaries into bin/ directory (gitignored)
```bash
go build -o bin/litestream ./cmd/litestream
go test -race -v ./...
pre-commit run --all-files
```