# S3 LTX File Retention Testing Guide

## Overview

This document describes the comprehensive S3 LTX file retention testing scripts created to validate that old LTX files are properly cleaned up after their retention period expires. These tests use the local Python S3 mock server for isolated, repeatable testing.
## Key Focus Areas

### 1. Small Database Testing
- Database Size: 50MB
- Retention Period: 2 minutes
- Focus: Basic retention behavior with minimal data

### 2. Large Database Testing (Critical)
- Database Size: 1.5GB (crosses the SQLite lock page boundary)
- Page Size: 4KB (lock page at #262145)
- Retention Period: 3 minutes
- Focus: SQLite lock page edge case + retention cleanup at scale

### 3. Comprehensive Analysis
- Side-by-side comparison of retention behavior
- Performance metrics analysis
- Best practices verification
## Test Scripts

### 1. test-s3-retention-small-db.sh

**Purpose:** Test S3 LTX retention cleanup with small databases.

**Features:**
- Creates a 50MB database with structured test data
- Uses the local S3 mock (moto) for isolation
- 2-minute retention period for quick testing
- Generates multiple LTX files over time
- Monitors cleanup activity in logs
- Validates restoration integrity

**Usage:**
```bash
./cmd/litestream-test/scripts/test-s3-retention-small-db.sh
```

**Duration:** ~8 minutes
### 2. test-s3-retention-large-db.sh

**Purpose:** Test S3 LTX retention cleanup with large databases that cross the 1GB SQLite lock page boundary.

**Features:**
- Creates a 1.5GB database (crosses the lock page at 1GB)
- Specifically tests SQLite lock page handling
- 3-minute retention period
- Extended monitoring for large database patterns
- Comprehensive validation, including lock page verification
- Tests restoration of large databases

**Usage:**
```bash
./cmd/litestream-test/scripts/test-s3-retention-large-db.sh
```

**Duration:** ~15-20 minutes
### 3. test-s3-retention-comprehensive.sh

**Purpose:** Comprehensive test runner and analysis tool.

**Features:**
- Runs both the small and large database tests
- Provides comparative analysis
- Generates detailed reports
- Configurable test execution
- Best practices verification

**Usage:**
```bash
# Run all tests
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh

# Run only the small database test
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --small-only

# Run only the large database test
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --large-only

# Keep test files after completion
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh --no-cleanup
```

**Duration:** ~25-30 minutes for the full suite
## SQLite Lock Page Testing

### Why It Matters

SQLite reserves a special lock page at exactly 1GB (byte offset 0x40000000) that cannot contain data. This creates a critical edge case that Litestream must handle correctly.

### What We Test
- Lock Page Location: Page #262145 (with a 4KB page size)
- Boundary Crossing: Databases that grow beyond 1GB
- Replication Integrity: Ensure the lock page is properly skipped
- Restoration Accuracy: Verify restored databases maintain integrity

### Lock Page Numbers by Page Size

| Page Size | Lock Page # | Test Coverage |
|---|---|---|
| 4KB | 262145 | ✅ Tested |
| 8KB | 131073 | 🔄 Possible |
| 16KB | 65537 | 🔄 Possible |
| 32KB | 32769 | 🔄 Possible |
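
The numbers above follow directly from the 1GiB lock page offset. A minimal sanity check (not part of the test scripts):

```bash
# Compute the SQLite lock page number for each page size.
# The lock (pending-byte) page starts at byte offset 0x40000000 (1 GiB),
# so its 1-based page number is offset / page_size + 1.
for PAGE_SIZE in 4096 8192 16384 32768; do
  LOCK_PAGE=$(( 0x40000000 / PAGE_SIZE + 1 ))
  echo "page size ${PAGE_SIZE}: lock page #${LOCK_PAGE}"
done
```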
## Local S3 Mock Setup

### Why Use a Local Mock
- Isolation: No external dependencies or costs
- Repeatability: Consistent test environment
- Speed: No network latency
- Safety: No risk of affecting production data

### How It Works

The tests use the existing `./etc/s3_mock.py` script, which:
- Starts a local moto S3 server
- Creates a test bucket with a unique name
- Runs Litestream with the S3 configuration
- Automatically cleans up after test completion
### Environment Variables Set by Mock

```bash
LITESTREAM_S3_ACCESS_KEY_ID="lite"
LITESTREAM_S3_SECRET_ACCESS_KEY="stream"
LITESTREAM_S3_BUCKET="test{timestamp}"
LITESTREAM_S3_ENDPOINT="http://127.0.0.1:5000"
LITESTREAM_S3_FORCE_PATH_STYLE="true"
```
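
Because moto accepts any credentials, the mock bucket can also be inspected by hand with the standard AWS CLI pointed at the local endpoint. A minimal sketch, assuming the AWS CLI is installed and the mock's environment variables are available in your shell:

```bash
# Reuse the mock's credentials to inspect replicated objects directly.
export AWS_ACCESS_KEY_ID="lite"
export AWS_SECRET_ACCESS_KEY="stream"
export AWS_DEFAULT_REGION="us-east-1"

# List all objects (LTX files, snapshots) with timestamps and sizes.
aws --endpoint-url "http://127.0.0.1:5000" s3 ls "s3://${LITESTREAM_S3_BUCKET}/" --recursive
```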
## Test Execution Flow

### Small Database Test Flow
- Setup: Build binaries, check dependencies
- Database Creation: 50MB with indexed tables
- Replication Start: Begin the S3 mock and Litestream (a configuration sketch follows this list)
- Data Generation: Create LTX files over time (6 batches, 20s apart)
- Retention Monitoring: Watch for cleanup activity (4 minutes)
- Validation: Test restoration and integrity
- Analysis: Generate a detailed report
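
The retention period itself is wired in through the Litestream configuration the script writes before starting replication. The sketch below is illustrative only: the key names follow Litestream's commonly documented replica settings (`retention`, `retention-check-interval`, `force-path-style`), and the config actually generated at `/tmp/small-retention-config.yml` may use different keys or values.

```bash
# Hypothetical sketch of the kind of config the small-database test writes.
# Credentials, bucket, and endpoint come from the mock's environment variables.
cat > /tmp/small-retention-config.yml <<EOF
dbs:
  - path: /tmp/small-retention-test.db
    replicas:
      - type: s3
        bucket: ${LITESTREAM_S3_BUCKET}
        path: small-retention-test
        endpoint: ${LITESTREAM_S3_ENDPOINT}
        force-path-style: true
        access-key-id: ${LITESTREAM_S3_ACCESS_KEY_ID}
        secret-access-key: ${LITESTREAM_S3_SECRET_ACCESS_KEY}
        retention: 2m
        retention-check-interval: 1m
EOF

# Start replication against the mock using that config.
bin/litestream replicate -config /tmp/small-retention-config.yml
```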
### Large Database Test Flow
- Setup: Build binaries, verify lock page calculations
- Database Creation: 1.5GB, crossing the lock page boundary
- Replication Start: Begin the S3 mock (longer initial sync)
- Data Generation: Add incremental data around the lock page
- Extended Monitoring: Watch cleanup patterns (6 minutes)
- Comprehensive Validation: Test large database restoration (see the sketch after this list)
- Analysis: Generate a lock-page-specific report
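
A hedged sketch of the validation step: restore from the mock-backed replica, run an integrity check, and confirm the lock page boundary was crossed. The flags mirror Litestream's standard `restore` command; the exact paths and replica settings used by the script may differ.

```bash
# Restore the large database from the replica described in the config.
bin/litestream restore -config /tmp/large-retention-config.yml \
  -o /tmp/large-retention-restored.db /tmp/large-retention-test.db

# Verify integrity of the restored copy.
sqlite3 /tmp/large-retention-restored.db "PRAGMA integrity_check;"

# Confirm the database actually crossed the 1GB lock page boundary
# (page #262145 with a 4KB page size).
PAGES=$(sqlite3 /tmp/large-retention-restored.db "PRAGMA page_count;")
echo "page count: ${PAGES} (lock page is #262145)"
```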
## Monitoring Retention Cleanup

### What to Look For

The scripts monitor the logs for these cleanup indicators:
- Direct: "clean", "delete", "expire", "retention", "removed", "purge"
- Indirect: "old file", "ttl", "sweep", "vacuum", "evict"
- LTX-specific: "ltx.*old", "snapshot.*old", "compress", "archive"
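
For a quick manual pass over a run's log, the same keywords can be grepped directly (a minimal sketch; the scripts do a more structured version of this analysis):

```bash
# Count cleanup-related lines in a replication log.
LOG=/tmp/small-retention-test.log
grep -icE "clean|delete|expire|retention|removed|purge" "$LOG"

# Show the most recent matching lines for closer inspection.
grep -inE "clean|delete|expire|retention|removed|purge" "$LOG" | tail -20
```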
### Expected Behavior
- Initial Period: LTX files accumulate normally
- Retention Trigger: Cleanup begins after the retention period
- Ongoing: Old files are removed while new files continue to accumulate
- Stabilization: The file count stabilizes at recent files only

### Warning Signs
- No Cleanup: Files accumulate indefinitely
- Cleanup Failures: Error messages about S3 DELETE operations
- Retention Ignored: Files older than the retention period remain
## Dependencies

### Required Tools
- Go: For building Litestream binaries
- Python 3: For the S3 mock server
- sqlite3: For database operations
- bc: For calculations

### Python Packages

```bash
pip3 install moto boto3
```
### Auto-Installation

The scripts automatically (roughly as sketched below):
- Build missing Litestream binaries
- Install missing Python packages
- Check for required tools
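
A rough sketch of what that auto-setup amounts to (illustrative only; the real scripts include more error handling):

```bash
# Check that the required tools are on PATH before running the tests.
for tool in go python3 sqlite3 bc; do
  command -v "$tool" >/dev/null || { echo "missing: $tool"; exit 1; }
done

# Build the Litestream binaries if they are not already present.
[ -x bin/litestream ]      || go build -o bin/litestream ./cmd/litestream
[ -x bin/litestream-test ] || go build -o bin/litestream-test ./cmd/litestream-test

# Install the Python packages needed by the S3 mock if they are missing.
python3 -c "import moto, boto3" 2>/dev/null || pip3 install moto boto3
```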
## Output and Artifacts

### Log Files
- `/tmp/small-retention-test.log` - Small database replication log
- `/tmp/large-retention-test.log` - Large database replication log
- `/tmp/small-retention-config.yml` - Small database config
- `/tmp/large-retention-config.yml` - Large database config

### Database Files
- `/tmp/small-retention-test.db` - Small test database
- `/tmp/large-retention-test.db` - Large test database
- `/tmp/small-retention-restored.db` - Restored small database
- `/tmp/large-retention-restored.db` - Restored large database
### Analysis Output

Each test generates:
- Operation Counts: Sync, upload, LTX operations
- Cleanup Indicators: Number of cleanup-related log entries
- Error Analysis: Any errors or warnings encountered
- Performance Metrics: Duration, throughput, file counts
- Validation Results: Integrity checks, restoration success
## Integration with Existing Framework

### Relationship to Existing Tests

These tests complement the existing test infrastructure:
- `test-s3-retention-cleanup.sh`: Original retention test (more basic)
- `test-754-s3-scenarios.sh`: Issue #754 specific testing
- Testing Framework: Uses the `litestream-test` CLI for data generation

### Consistent Patterns
- Use the existing `etc/s3_mock.py` for S3 simulation
- Follow naming conventions from existing scripts
- Integrate with the `litestream-test` populate/load/validate commands
- Generate structured output for analysis
## Production Validation Recommendations

### After Local Testing
- Real S3 Testing: Run against actual S3/GCS/Azure endpoints
- Network Scenarios: Test with network interruptions
- Scale Testing: Test with production-sized databases
- Cost Analysis: Monitor S3 API calls and storage costs
- Concurrent Testing: Multiple databases simultaneously

### Retention Period Guidelines
- Local Testing: 2-3 minutes for quick feedback
- Staging: 1-2 hours for realistic behavior
- Production: Days to weeks, based on recovery requirements
### Monitoring in Production
- LTX File Counts: Should stabilize after the retention period
- Storage Growth: Should level off, not grow indefinitely
- API Costs: DELETE operations should occur regularly
- Performance: Cleanup shouldn't impact replication performance
## Troubleshooting

### Common Issues

**1. Python Dependencies Missing**

```bash
pip3 install moto boto3
```

**2. Binaries Not Found**

```bash
go build -o bin/litestream ./cmd/litestream
go build -o bin/litestream-test ./cmd/litestream-test
```

**3. Large Database Test Slow**
- Expected: a 1.5GB database takes time to create and replicate
- Monitor progress in the logs
- Increase timeouts if needed

**4. No Cleanup Activity Detected**
- May be normal: Litestream might clean up silently
- Check the S3 bucket contents manually (if using real S3)
- Verify that the retention period has elapsed

**5. Lock Page Boundary Not Crossed**
- Check the final page count against the lock page number (see the sketch below)
- Increase the target database size if needed
- Verify the page size settings
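
To check whether the boundary was actually crossed, compare the source database's page count to its lock page number (a minimal check; the path matches the large-database test's artifacts):

```bash
DB=/tmp/large-retention-test.db

# Derive the lock page number from the database's actual page size.
PAGE_SIZE=$(sqlite3 "$DB" "PRAGMA page_size;")
PAGE_COUNT=$(sqlite3 "$DB" "PRAGMA page_count;")
LOCK_PAGE=$(( 0x40000000 / PAGE_SIZE + 1 ))

if [ "$PAGE_COUNT" -gt "$LOCK_PAGE" ]; then
  echo "OK: page count ${PAGE_COUNT} crosses lock page #${LOCK_PAGE}"
else
  echo "NOT CROSSED: page count ${PAGE_COUNT} <= lock page #${LOCK_PAGE}"
fi
```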
### Debug Mode

For more verbose output:

```bash
# Enable debug logging
export LITESTREAM_DEBUG=1

# Run with debug
./cmd/litestream-test/scripts/test-s3-retention-comprehensive.sh
```
## Summary

These retention testing scripts provide comprehensive validation of Litestream's S3 LTX file cleanup behavior across different database sizes and scenarios. They specifically address:
- Ben's Requirements: Local testing with the Python S3 mock
- SQLite Edge Cases: Lock page boundary at 1GB
- Scale Scenarios: Both small (50MB) and large (1.5GB) databases
- Retention Verification: Multiple retention periods and monitoring
- Production Readiness: Detailed analysis and recommendations
The scripts are designed to run reliably in isolation while providing detailed insights into Litestream's retention cleanup behavior.