mirror of https://github.com/benbjohnson/litestream.git synced 2026-01-25 05:06:30 +00:00

Cory LaNou ee36d3e8ca feat: Add litestream-test harness for comprehensive database testing (#748)

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Ben Johnson <benbjohnson@yahoo.com>

2025-09-25 16:51:26 -05:00

S3 LTX File Retention Testing Guide

Overview

This document describes the S3 LTX file retention test scripts, which validate that old LTX files are cleaned up once their retention period expires. The tests run against the local Python S3 mock server (moto), so they are isolated and repeatable.

Key Focus Areas

1. Small Database Testing

Database Size: 50MB
Retention Period: 2 minutes
Focus: Basic retention behavior with minimal data

2. Large Database Testing (Critical)

Database Size: 1.5GB (crosses SQLite lock page boundary)
Page Size: 4KB (lock page at #262145)
Retention Period: 3 minutes
Focus: SQLite lock page edge case + retention cleanup at scale

3. Comprehensive Analysis

Side-by-side comparison of retention behavior
Performance metrics analysis
Best practices verification

Test Scripts

1. `test-s3-retention-small-db.sh`

Purpose: Test S3 LTX retention cleanup with small databases

Features:

Creates 50MB database with structured test data
Uses local S3 mock (moto) for isolation
2-minute retention period for quick testing
Generates multiple LTX files over time
Monitors cleanup activity in logs
Validates restoration integrity

Usage:

./cmd/litestream-test/scripts/test-s3-retention-small-db.sh

Duration: ~8 minutes

2. `test-s3-retention-large-db.sh`

Purpose: Test S3 LTX retention cleanup with large databases crossing the 1GB SQLite lock page boundary

Features:

Creates 1.5GB database (crosses lock page at 1GB)
Specifically tests SQLite lock page handling
3-minute retention period
Extended monitoring for large database patterns
Comprehensive validation including lock page verification
Tests restoration of large databases

Usage:

./cmd/litestream-test/scripts/test-s3-retention-large-db.sh

Duration: ~15-20 minutes

3. `test-s3-retention-comprehensive.sh`

Purpose: Comprehensive test runner and analysis tool

Features:

Runs both small and large database tests
Provides comparative analysis
Generates detailed reports
Configurable test execution
Best practices verification

Usage:

# Run all tests
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh

# Run only small database test
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --small-only

# Run only large database test
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --large-only

# Keep test files after completion
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --no-cleanup

Duration: ~25-30 minutes for full suite

SQLite Lock Page Testing

Why It Matters

SQLite reserves the page containing byte offset 0x40000000 (1GB) as a lock page, and that page never holds data. Any database larger than 1GB therefore contains a page that Litestream must skip rather than replicate, which makes this a critical edge case to test.

What We Test

Lock Page Location: Page #262145 (with 4KB page size)
Boundary Crossing: Databases that grow beyond 1GB
Replication Integrity: Ensure lock page is properly skipped
Restoration Accuracy: Verify restored databases maintain integrity

Lock Page Numbers by Page Size

Page Size	Lock Page #	Test Coverage
4KB	262145	✅ Tested
8KB	131073	🔄 Possible
16KB	65537	🔄 Possible
32KB	32769	🔄 Possible
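
The values in the table follow from one formula: the lock page is the page that contains byte offset 0x40000000, so its 1-based page number is 0x40000000 / page_size + 1. A minimal sketch of that arithmetic follows; the function name `lock_page_number` is illustrative and is not taken from the test scripts.

```shell
# Compute the 1-based number of the SQLite lock page for a given page size.
# The lock page is the page that contains byte offset 0x40000000 (1073741824).
# NOTE: lock_page_number is a hypothetical helper name, not part of the scripts.
lock_page_number() {
    page_size="$1"
    echo $(( 1073741824 / page_size + 1 ))
}

# Print the lock page for each page size listed in the table.
for size in 4096 8192 16384 32768; do
    echo "page size $size -> lock page $(lock_page_number "$size")"
done
```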

Local S3 Mock Setup

Why Use Local Mock

Isolation: No external dependencies or costs
Repeatability: Consistent test environment
Speed: No network latency
Safety: No risk of affecting production data

How It Works

The tests use the existing ./etc/s3_mock.py script, which:

Starts a local moto S3 server
Creates a test bucket with a unique name
Runs Litestream with the S3 configuration
Automatically cleans up after test completion

Environment Variables Set by Mock

LITESTREAM_S3_ACCESS_KEY_ID="lite"
LITESTREAM_S3_SECRET_ACCESS_KEY="stream"
LITESTREAM_S3_BUCKET="test{timestamp}"
LITESTREAM_S3_ENDPOINT="http://127.0.0.1:5000"
LITESTREAM_S3_FORCE_PATH_STYLE="true"
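
When debugging a script that runs under the mock, it can help to confirm that these variables actually reached the child process. The following is a sketch only: `check_mock_env` is a hypothetical helper and is not part of s3_mock.py or the test scripts.

```shell
# Verify that the variables the mock is expected to export are present.
# NOTE: check_mock_env is a hypothetical helper, not part of s3_mock.py.
check_mock_env() {
    missing=0
    for name in LITESTREAM_S3_ACCESS_KEY_ID LITESTREAM_S3_SECRET_ACCESS_KEY \
                LITESTREAM_S3_BUCKET LITESTREAM_S3_ENDPOINT \
                LITESTREAM_S3_FORCE_PATH_STYLE; do
        # Indirect lookup of the variable named by $name; empty if unset.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Exit status 0 when every variable is set, 1 otherwise.
    return "$missing"
}
```

Calling `check_mock_env || exit 1` near the top of a script would stop it early when the mock environment is incomplete.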

Test Execution Flow

Small Database Test Flow

Setup: Build binaries, check dependencies
Database Creation: 50MB with indexed tables
Replication Start: Begin S3 mock and Litestream
Data Generation: Create LTX files over time (6 batches, 20s apart)
Retention Monitoring: Watch for cleanup activity (4 minutes)
Validation: Test restoration and integrity
Analysis: Generate detailed report
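
The data generation step above (6 batches, 20 seconds apart) amounts to a paced loop. The sketch below shows only the pacing: `run_batches` and `write_batch` are hypothetical names, and the real scripts generate data with the litestream-test CLI, whose exact invocation is not reproduced here.

```shell
# Run a command a fixed number of times with a pause between runs.
# NOTE: run_batches and write_batch are hypothetical names for illustration;
# the real scripts generate data with the litestream-test CLI.
run_batches() {
    batches="$1"
    interval="$2"
    shift 2
    i=1
    while [ "$i" -le "$batches" ]; do
        "$@" "$i"                      # pass the batch number to the command
        if [ "$i" -lt "$batches" ]; then
            sleep "$interval"          # no pause after the final batch
        fi
        i=$(( i + 1 ))
    done
}

# Placeholder for the real data-generation step.
write_batch() {
    echo "writing batch $1"
}
```

With the values from the flow above, the call would be `run_batches 6 20 write_batch`, which spends 100 seconds in pauses (five gaps of 20 seconds).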

Large Database Test Flow

Setup: Build binaries, verify lock page calculations
Database Creation: 1.5GB crossing lock page boundary
Replication Start: Begin S3 mock (longer initial sync)
Data Generation: Add incremental data around lock page
Extended Monitoring: Watch cleanup patterns (6 minutes)
Comprehensive Validation: Test large database restoration
Analysis: Generate lock page specific report

Monitoring Retention Cleanup

What to Look For

The scripts monitor logs for these cleanup indicators:

Direct: "clean", "delete", "expire", "retention", "removed", "purge"
Indirect: "old file", "ttl", "sweep", "vacuum", "evict"
LTX-specific: "ltx.*old", "snapshot.*old", "compress", "archive"
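
One way to scan a log for the indicators listed above is a single case-insensitive extended regex. This is a sketch under assumptions: `count_cleanup_lines` is a hypothetical helper, and the patterns and matching rules used by the actual scripts may differ.

```shell
# Count log lines that match any of the cleanup indicators listed above.
# NOTE: count_cleanup_lines is a hypothetical helper; the scripts' actual
# patterns and matching rules may differ.
count_cleanup_lines() {
    log_file="$1"
    pattern='clean|delete|expire|retention|removed|purge'
    pattern="$pattern|old file|ttl|sweep|vacuum|evict"
    pattern="$pattern|ltx.*old|snapshot.*old|compress|archive"
    # -E: extended regex, -i: case-insensitive, -c: print the match count.
    # grep exits non-zero when nothing matches, so mask that with || true.
    grep -Eic "$pattern" "$log_file" || true
}
```

For example, `count_cleanup_lines /tmp/small-retention-test.log` prints the number of matching lines. Broad substrings such as `ttl` or `clean` can match unrelated lines, so a non-zero count is a hint rather than proof that cleanup ran.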

Expected Behavior

Initial Period: LTX files accumulate normally
Retention Trigger: Cleanup begins after retention period
Ongoing: Old files removed, new files continue to accumulate
Stabilization: File count stabilizes at recent files only

Warning Signs

No Cleanup: Files accumulate indefinitely
Cleanup Failures: Error messages about S3 DELETE operations
Retention Ignored: Files older than retention period remain
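
The "No Cleanup" warning sign can be checked mechanically if the LTX file count is sampled at fixed intervals. The sketch below only evaluates the samples; how they are collected (for example, by listing the bucket) is left out, and `always_growing` is a hypothetical helper name.

```shell
# Succeed when each sample is strictly greater than the one before it,
# i.e. the file count never levelled off or dropped while being observed.
# NOTE: always_growing is a hypothetical helper, not part of the scripts.
always_growing() {
    # At least two samples are needed to say anything about a trend.
    [ "$#" -ge 2 ] || return 1
    prev=""
    for count in "$@"; do
        if [ -n "$prev" ] && [ "$count" -le "$prev" ]; then
            return 1
        fi
        prev="$count"
    done
    return 0
}
```

A run such as `always_growing 3 5 8 12` succeeds, which would suggest that no cleanup happened during the window, while `always_growing 3 5 5 4` fails because the count levelled off and then dropped.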

Dependencies

Required Tools

Go: For building Litestream binaries
Python 3: For S3 mock server
sqlite3: For database operations
bc: For calculations

Python Packages

pip3 install moto boto3

Auto-Installation

The scripts automatically:

Build missing Litestream binaries
Install missing Python packages
Check for required tools
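
A tool check of this kind is commonly written with `command -v`. The following is a sketch of that pattern, not the scripts' actual code; `missing_tools` is a hypothetical helper name.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# NOTE: missing_tools is a hypothetical helper, not part of the scripts.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Report on the tools listed under "Required Tools".
missing=$(missing_tools go python3 sqlite3 bc)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing >&2
fi
```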

Output and Artifacts

Log Files

/tmp/small-retention-test.log - Small database replication log
/tmp/large-retention-test.log - Large database replication log
/tmp/small-retention-config.yml - Small database config
/tmp/large-retention-config.yml - Large database config

Database Files

/tmp/small-retention-test.db - Small test database
/tmp/large-retention-test.db - Large test database
/tmp/small-retention-restored.db - Restored small database
/tmp/large-retention-restored.db - Restored large database

Analysis Output

Each test generates:

Operation Counts: Sync, upload, LTX operations
Cleanup Indicators: Number of cleanup-related log entries
Error Analysis: Any errors or warnings encountered
Performance Metrics: Duration, throughput, file counts
Validation Results: Integrity checks, restoration success

Integration with Existing Framework

Relationship to Existing Tests

These tests complement the existing test infrastructure:

test-s3-retention-cleanup.sh: Original retention test (more basic)
test-754-s3-scenarios.sh: Issue #754 specific testing
Testing Framework: Uses litestream-test CLI for data generation

Consistent Patterns

Use existing etc/s3_mock.py for S3 simulation
Follow naming conventions from existing scripts
Integrate with litestream-test populate/load/validate commands
Generate structured output for analysis

Production Validation Recommendations

After Local Testing

Real S3 Testing: Run against actual S3/GCS/Azure endpoints
Network Scenarios: Test with network interruptions
Scale Testing: Test with production-sized databases
Cost Analysis: Monitor S3 API calls and storage costs
Concurrent Testing: Multiple databases simultaneously

Retention Period Guidelines

Local Testing: 2-3 minutes for quick feedback
Staging: 1-2 hours for realistic behavior
Production: Days to weeks based on recovery requirements

Monitoring in Production

LTX File Counts: Should stabilize after retention period
Storage Growth: Should level off, not grow indefinitely
API Costs: DELETE operations should occur regularly
Performance: Cleanup shouldn't impact replication performance

Troubleshooting

Common Issues

1. Python Dependencies Missing

pip3 install moto boto3

2. Binaries Not Found

go build -o bin/litestream ./cmd/litestream
go build -o bin/litestream-test ./cmd/litestream-test

3. Large Database Test Slow

Expected: 1.5GB takes time to create and replicate
Monitor progress in logs
Increase timeouts if needed

4. No Cleanup Activity Detected

May be normal: Litestream might clean up silently
Check S3 bucket contents manually (if using real S3)
Verify retention period has elapsed

5. Lock Page Boundary Not Crossed

Check final page count vs. lock page number
Increase target database size if needed
Verify page size settings
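
The first check above can be sketched as a comparison between the database's page count and the lock page number for its page size. `crosses_lock_page` is a hypothetical helper name; the lock page formula (0x40000000 / page_size + 1) matches the table in "Lock Page Numbers by Page Size".

```shell
# Succeed when a database with the given page count extends past the
# SQLite lock page for the given page size.
# NOTE: crosses_lock_page is a hypothetical helper, not part of the scripts.
crosses_lock_page() {
    page_count="$1"
    page_size="$2"
    # Lock page = page containing byte offset 0x40000000 (1073741824).
    lock_page=$(( 1073741824 / page_size + 1 ))
    [ "$page_count" -gt "$lock_page" ]
}
```

Against a real file, the two arguments can be read with `sqlite3 /tmp/large-retention-test.db 'PRAGMA page_count;'` and `sqlite3 /tmp/large-retention-test.db 'PRAGMA page_size;'`. A 1.5GB database with 4KB pages has 393216 pages, which is past lock page 262145; a 50MB database has 12800 pages and is not.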

Debug Mode

For more verbose output:

# Enable debug logging
export LITESTREAM_DEBUG=1

# Run with debug
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh

Summary

These retention testing scripts provide comprehensive validation of Litestream's S3 LTX file cleanup behavior across different database sizes and scenarios. They specifically address:

Ben's Requirements: Local testing with Python S3 mock
SQLite Edge Cases: Lock page boundary at 1GB
Scale Scenarios: Both small (50MB) and large (1.5GB) databases
Retention Verification: Multiple retention periods and monitoring
Production Readiness: Detailed analysis and recommendations

The scripts are designed to run reliably in isolation while providing detailed insights into Litestream's retention cleanup behavior.
