Performance Optimization Guide¶

RAGVersion is optimized for high-throughput document tracking with support for batch operations and efficient storage backends. This guide covers performance best practices, benchmarks, and optimization techniques.

Overview¶

Key Performance Features:

- ✅ Batch operations - 10-50x faster bulk inserts, depending on backend
- ✅ Async-first architecture - Non-blocking I/O throughout
- ✅ Optimized indexes - Fast queries for common operations
- ✅ Connection pooling - Reuse database connections
- ✅ Concurrent processing - Parallel file tracking with semaphores
- ✅ Content compression - 60-80% storage reduction

Batch Operations¶

What Are Batch Operations?¶

Batch operations allow you to insert multiple documents or versions in a single database transaction, significantly reducing overhead.

Performance Improvement:

- SQLite: 10-15x faster (91% time reduction)
- Supabase: 20-50x faster (fewer network round trips)
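To make the single-transaction idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not RAGVersion's implementation; the `documents` table and the helper names (`make_db`, `insert_individually`, `insert_batch`, `count_rows`) are assumptions chosen for illustration.

```python
import sqlite3

INSERT_SQL = "INSERT INTO documents (path, content_hash) VALUES (?, ?)"


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical documents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (path TEXT, content_hash TEXT)")
    return conn


def insert_individually(conn: sqlite3.Connection, rows) -> None:
    """One transaction (and one commit) per row."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def insert_batch(conn: sqlite3.Connection, rows) -> None:
    """All rows in a single transaction with a single commit."""
    with conn:  # commits once on success, rolls back on error
        conn.executemany(INSERT_SQL, rows)


def count_rows(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```

Both functions store the same rows. The batch variant pays the commit cost once rather than once per row, which matters most on disk-backed or remote databases.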

When to Use Batch Operations¶

✅ Use batch operations for:

- Bulk imports from existing systems
- Initial directory tracking (1000+ files)
- Programmatic document creation
- Migration scripts
- Data seeding

❌ Don't use batch operations for:

- Real-time single-file tracking
- Interactive CLI usage (use ragversion track)
- Incremental updates

How to Use Batch Operations¶

Batch Document Creation¶

from ragversion.storage import SQLiteStorage
from ragversion.models import Document
from uuid import uuid4
from datetime import datetime

async def bulk_import():
    storage = SQLiteStorage()
    await storage.initialize()

    # Prepare documents
    documents = []
    for i in range(1000):
        doc = Document(
            id=uuid4(),
            file_path=f"/data/file_{i}.txt",
            file_name=f"file_{i}.txt",
            file_type=".txt",
            file_size=1024,
            content_hash=f"hash_{i}",
            created_at=datetime.utcnow(),
            updated_at=datetime.utcnow(),
            version_count=1,
            current_version=1,
            metadata={"imported": True}
        )
        documents.append(doc)

    # Batch insert - 10x faster than individual inserts
    await storage.batch_create_documents(documents)
    print(f"Imported {len(documents)} documents")

    await storage.close()

Batch Version Creation¶

from datetime import datetime
from uuid import UUID, uuid4

from ragversion.models import ChangeType, Version
from ragversion.storage import SQLiteStorage

async def bulk_versions(document_id: UUID):
    # document_id is the ID of the existing document these versions belong to
    storage = SQLiteStorage()
    await storage.initialize()

    # Prepare versions
    versions = []
    for i in range(1000):
        ver = Version(
            id=uuid4(),
            document_id=document_id,
            version_number=i + 1,
            content_hash=f"hash_{i}",
            content=f"Content for version {i}",
            file_size=1024,
            change_type=ChangeType.MODIFIED,
            created_at=datetime.utcnow(),
            metadata={}
        )
        versions.append(ver)

    # Batch insert with content storage
    await storage.batch_create_versions(versions)
    print(f"Created {len(versions)} versions")

    await storage.close()

Benchmarks¶

Batch Insert Performance¶

Test Environment:

- MacBook Pro M1 (2021)
- Python 3.12
- SQLite with WAL mode

Results:

Documents	Individual	Batch	Speedup
100	8.6ms	0.8ms	11x
500	42ms	4.7ms	9x
1000	85ms	8.5ms	10x

Throughput:

- Individual inserts: ~11,800 docs/sec
- Batch inserts: ~118,000 docs/sec
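The throughput and speedup figures follow directly from the timings in the results table. The short sketch below reproduces that arithmetic; the helper names are illustrative and the timings are copied from the table.

```python
# (documents, individual insert time in ms, batch insert time in ms)
RESULTS = [(100, 8.6, 0.8), (500, 42.0, 4.7), (1000, 85.0, 8.5)]


def docs_per_second(documents: int, elapsed_ms: float) -> float:
    """Convert a document count and an elapsed time in milliseconds to docs/sec."""
    return documents / (elapsed_ms / 1000.0)


def speedup(individual_ms: float, batch_ms: float) -> float:
    """How many times faster the batch insert is than individual inserts."""
    return individual_ms / batch_ms


def summarize(results=RESULTS):
    """Return (documents, rounded speedup) pairs for each benchmark row."""
    return [(n, round(speedup(ind, bat))) for n, ind, bat in results]
```

For the 1000-document row, `docs_per_second(1000, 85.0)` is roughly 11,800 and `docs_per_second(1000, 8.5)` is roughly 118,000, matching the figures quoted above.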

File Tracking Performance¶

Single File:

- Parse + detect change: ~5-10ms
- Create document + version: ~2ms
- Total: ~10-15ms per file

Directory (1000 files, 4 workers):

- Total time: ~5-8 seconds
- Throughput: ~150-200 files/sec
- Includes: file parsing, change detection, database writes

Directory (1000 files, 8 workers):

- Total time: ~3-5 seconds
- Throughput: ~200-300 files/sec

Query Performance¶

Operation	SQLite	Supabase
Get document by ID	<1ms	50-100ms
Get document by path	<1ms	50-100ms
List 100 documents	~5ms	100-200ms
List 1000 documents	~50ms	200-400ms
Get version history (100 versions)	~10ms	100-200ms
Compute diff	~20ms	150-300ms
Get statistics	~30ms	200-400ms

Optimization Techniques¶

1. Use Appropriate Worker Count¶

The max_workers parameter controls concurrent file processing.
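As a rough illustration of how a worker limit of this kind can be enforced with a semaphore, consider the sketch below. It is not RAGVersion's internal code; `process_all` and the `worker` callable are hypothetical names.

```python
import asyncio


async def process_all(items, worker, max_workers: int = 4):
    """Run worker(item) for every item, with at most max_workers running at once."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(item):
        # Each task waits here until one of the max_workers slots is free
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(item) for item in items))
```

Raising `max_workers` lets more items overlap their I/O waits, at the cost of more open files, more memory, and more contention for the database.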

Guidelines:

- CPU-bound tasks: workers = CPU cores
- I/O-bound tasks: workers = 2-4x CPU cores
- Default: 4 workers (good for most cases)
- Maximum recommended: 16 workers

# Default (good for most cases)
await tracker.track_directory("./docs", max_workers=4)

# For large directories on powerful machines
await tracker.track_directory("./docs", max_workers=8)

# For very large directories (10,000+ files)
await tracker.track_directory("./docs", max_workers=16)

Benchmark Results:

| Workers | 1000 Files | Throughput |
|---------|------------|------------|
| 1 | 15s | 67 files/sec |
| 2 | 8s | 125 files/sec |
| 4 | 5s | 200 files/sec |
| 8 | 3s | 333 files/sec |
| 16 | 2.5s | 400 files/sec |

Diminishing Returns: Beyond 8 workers, performance gains are minimal due to:

- Database connection limits
- Disk I/O bottlenecks
- Context switching overhead
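The worker-count guidelines can be folded into a small helper. This is a sketch under stated assumptions: `suggest_workers` is a hypothetical function, not part of RAGVersion, and it picks the low end (2x) of the 2-4x range for I/O-bound work.

```python
import os
from typing import Optional

MAX_RECOMMENDED_WORKERS = 16  # upper bound taken from the guidelines


def suggest_workers(io_bound: bool = True, cpu_cores: Optional[int] = None) -> int:
    """Suggest a max_workers value from the core count and workload type."""
    cores = cpu_cores or os.cpu_count() or 1
    # I/O-bound work tolerates oversubscription; CPU-bound work does not
    workers = cores * 2 if io_bound else cores
    return max(1, min(workers, MAX_RECOMMENDED_WORKERS))
```

Treat the result as a starting point and confirm it against your own benchmarks, since the best value depends on disk speed and the storage backend.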

2. Disable Content Storage (If Not Needed)¶

If you only need change detection (not full content):

tracker = AsyncVersionTracker(
    storage=storage,
    store_content=False  # Only store hashes
)

Performance Impact:

- 50-70% faster writes (no content compression)
- 80-90% less storage space used
- Faster queries (smaller database)

Trade-offs:

- ❌ Can't restore previous versions
- ❌ Can't compute diffs
- ✅ Can still detect changes (via hashes)
- ✅ Can still track version history
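Hash-only tracking works because a content hash is enough to tell whether a file changed, even though it cannot reproduce the old content. The sketch below assumes SHA-256 for illustration; this guide does not state which hash RAGVersion uses, and the function names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that identifies this exact content."""
    return hashlib.sha256(data).hexdigest()


def has_changed(stored_hash: str, new_data: bytes) -> bool:
    """Compare the stored hash with the hash of the current content."""
    return content_hash(new_data) != stored_hash
```

Only the fixed-size digest is kept per version, which is why storage shrinks so sharply and why diffs and restores are no longer possible.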

3. Optimize File Patterns¶

Use specific patterns to avoid tracking unnecessary files:

# Bad: Tracks everything (slow)
await tracker.track_directory("./project")

# Better: Specific patterns
await tracker.track_directory(
    "./project",
    patterns=["*.py", "*.md", "*.json"],
    recursive=True
)

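# Best: Narrow the patterns and keep large directories out of the scan
await tracker.track_directory(
    "./project",
    patterns=["*.py"],
    recursive=True
)
# No exclude option is shown in this guide, so skip node_modules, .git,
# and .venv yourself (for example, by tracking only the subdirectories you need)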

Performance Impact:

- Reduces file scanning time
- Avoids parsing large binary files
- Focuses tracking on relevant documents
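One way to skip directories such as node_modules, .git, and .venv manually is to build the file list yourself before tracking. The sketch below uses only the standard library; `find_files` and `EXCLUDED_DIRS` are hypothetical names, not RAGVersion APIs.

```python
import fnmatch
import os

EXCLUDED_DIRS = frozenset({"node_modules", ".git", ".venv"})


def find_files(root, patterns, excluded_dirs=EXCLUDED_DIRS):
    """Return sorted paths under root that match any pattern, skipping excluded dirs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded directories
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if any(fnmatch.fnmatch(name, pattern) for pattern in patterns):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

The resulting list could then be passed to a file-level call such as the `track_files` usage shown under High Memory Usage.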

4. Batch Processing for Bulk Operations¶

For programmatic bulk operations, use batch methods:

# Slow: Individual operations
for doc in documents:
    await storage.create_document(doc)

# Fast: Batch operation
await storage.batch_create_documents(documents)

When to Batch:

- Importing 100+ documents
- Initial setup
- Migration scripts

5. Reuse Tracker Instances¶

Don't create a new tracker for each operation:

# Bad: Creates new connection each time
for batch in batches:
    async with AsyncVersionTracker(storage=storage) as tracker:
        await tracker.track_directory(batch)  # Reconnects each time

# Good: Reuse tracker
async with AsyncVersionTracker(storage=storage) as tracker:
    for batch in batches:
        await tracker.track_directory(batch)

Performance Impact:

- Avoids connection overhead
- Reuses indexes and caches
- Faster initialization

6. Use In-Memory Database for Testing¶

For unit tests, use in-memory SQLite:

storage = SQLiteStorage(db_path=":memory:")
await storage.initialize()

# Run tests (very fast - no disk I/O)
...

await storage.close()

Performance Impact:

- 100-1000x faster than disk-based
- No file I/O overhead
- Perfect for CI/CD pipelines

7. Enable Compression¶

Content compression is enabled by default:

# Default: Compression enabled (recommended)
storage = SQLiteStorage(content_compression=True)

# Disable for faster writes (larger database)
storage = SQLiteStorage(content_compression=False)

Trade-offs:

| Metric | Compressed | Uncompressed |
|--------|------------|--------------|
| Storage | 100 MB | 400-500 MB |
| Write speed | Baseline | 30% faster |
| Read speed | Baseline | 20% faster |
| Recommended | ✅ Yes | ❌ No |

When to Disable:

- Very small files (<1KB)
- Already compressed files (images, PDFs)
- Speed > storage (rare)
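The cases for disabling compression can be checked empirically. The sketch below uses `zlib` purely as a stand-in, since this guide does not say which codec RAGVersion uses; `compression_ratio` is a hypothetical helper.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


# Repetitive text compresses well
text_sample = b"RAGVersion tracks document versions.\n" * 500

# Random bytes stand in for already compressed data such as images
random_sample = os.urandom(4096)

# Very small inputs can grow because of the fixed header and checksum
tiny_sample = b"hi"
```

A ratio at or above 1.0 means compression only adds CPU cost, which is the situation for tiny and already compressed files.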

8. Optimize Database Location¶

SQLite Performance by Location:

Location	Performance	Use Case
`:memory:`	⚡️ Fastest	Testing only
SSD (local)	🚀 Very fast	Development, production
HDD (local)	⏱️ Moderate	OK for small scale
Network drive	🐌 Slow	Avoid

Recommendation:

- Development: Local SSD (./ragversion.db)
- Testing: In-memory (:memory:)
- Production: Local SSD with backups
- Never: Network drives or cloud-mounted filesystems

Scaling Guidelines¶

Small Scale (< 1,000 documents)¶

Configuration:

storage = SQLiteStorage()
tracker = AsyncVersionTracker(
    storage=storage,
    store_content=True,
    max_file_size_mb=50
)
await tracker.track_directory("./docs", max_workers=4)

Expected Performance:

- Track directory: <5 seconds
- Query latency: <10ms
- Storage: 50-200 MB

Medium Scale (1,000 - 10,000 documents)¶

Configuration:

storage = SQLiteStorage(
    db_path="/var/lib/ragversion/ragversion.db"
)
tracker = AsyncVersionTracker(
    storage=storage,
    store_content=True,
    max_file_size_mb=50
)
await tracker.track_directory("./docs", max_workers=8)

Expected Performance:

- Track directory: 30-60 seconds
- Query latency: <50ms
- Storage: 500 MB - 2 GB

Optimizations:

- Use 8 workers
- Consider selective content storage
- Regular cleanup of old versions

Large Scale (10,000 - 100,000 documents)¶

Configuration:

# Consider Supabase for this scale
storage = SupabaseStorage.from_env()

tracker = AsyncVersionTracker(
    storage=storage,
    store_content=False,  # Or selective
    max_file_size_mb=50
)
await tracker.track_directory("./docs", max_workers=16)

Expected Performance:

- Track directory: 5-10 minutes
- Query latency: 50-200ms
- Storage: 2-10 GB

Optimizations:

- Consider migrating to Supabase
- Disable content storage for large files
- Use batch operations for bulk imports
- Implement cleanup policies
- Add full-text search indexes

Very Large Scale (> 100,000 documents)¶

Recommendation: Use Supabase

Configuration:

import os

storage = SupabaseStorage(
    url=os.getenv("SUPABASE_URL"),
    key=os.getenv("SUPABASE_SERVICE_KEY"),
    connection_pool_size=20  # Increase pool
)

tracker = AsyncVersionTracker(
    storage=storage,
    store_content=False,  # Hash-only tracking
    max_file_size_mb=100
)

Expected Performance:

- Track directory: 15-30 minutes (with pagination)
- Query latency: 100-500ms
- Storage: 10+ GB

Optimizations:

- Use Supabase for scalability
- Implement sharding strategies
- Use content-addressable storage for deduplication
- Add CDN for content delivery
- Monitor query performance
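The four tiers can be summarized as a lookup from document count to a starting configuration. This is a sketch, not a RAGVersion API: `ScaleProfile` and `profile_for` are hypothetical, the tier boundaries overlap at exactly 10,000 documents (assigned to the large tier here), and `max_workers` is `None` for the very large tier because no value is given for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScaleProfile:
    name: str
    backend: str                 # "sqlite" or "supabase"
    store_content: bool          # False means hash-only tracking
    max_workers: Optional[int]   # None where the guide gives no value


def profile_for(document_count: int) -> ScaleProfile:
    """Map a document count to the starting configuration for its tier."""
    if document_count < 1_000:
        return ScaleProfile("small", "sqlite", True, 4)
    if document_count < 10_000:
        return ScaleProfile("medium", "sqlite", True, 8)
    if document_count <= 100_000:
        return ScaleProfile("large", "supabase", False, 16)
    return ScaleProfile("very large", "supabase", False, None)
```

The profile is only a starting point; the large tier in particular lists Supabase and disabled content storage as options to consider rather than requirements.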

Monitoring & Profiling¶

Built-in Statistics¶

# Get overall statistics
stats = await tracker.get_statistics()
print(f"Total documents: {stats.total_documents}")
print(f"Total versions: {stats.total_versions}")
print(f"Storage used: {stats.total_storage_bytes / 1024 / 1024:.2f} MB")
print(f"Avg versions/doc: {stats.average_versions_per_document:.1f}")

Custom Benchmarking¶

import time

async def benchmark_operation():
    # perf_counter is monotonic and higher resolution than time.time
    start = time.perf_counter()

    # Your operation (tracker is an existing AsyncVersionTracker)
    result = await tracker.track_directory("./docs")

    elapsed = time.perf_counter() - start

    print(f"Processed {len(result.successful)} files in {elapsed:.2f}s")
    print(f"Throughput: {len(result.successful)/elapsed:.0f} files/sec")
    print(f"Success rate: {result.success_rate:.1f}%")
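When several operations need timing, a small context manager avoids repeating the start and stop bookkeeping. The sketch below is generic Python, not part of RAGVersion; `timed` and `throughput` are hypothetical helpers.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are still measured
        results[label] = time.perf_counter() - start


def throughput(count: int, elapsed_seconds: float) -> float:
    """Items per second, returning 0.0 when no time was measured."""
    return count / elapsed_seconds if elapsed_seconds > 0 else 0.0
```

Because it is a plain context manager, the `with timed(...)` block can wrap `await` calls inside a coroutine as well as synchronous code.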

Profiling with cProfile¶

import asyncio
import cProfile
import pstats

async def main():
    async with AsyncVersionTracker(storage=storage) as tracker:
        await tracker.track_directory("./docs")

# Profile
profiler = cProfile.Profile()
profiler.enable()

asyncio.run(main())

profiler.disable()
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

Common Performance Issues¶

Issue: Slow Directory Tracking¶

Symptoms:

- Takes >1 minute for 1000 files
- CPU usage low

Causes & Solutions:

Too few workers

# Fix: Increase workers
await tracker.track_directory("./docs", max_workers=8)

Database on network drive

# Fix: Move to local SSD
storage = SQLiteStorage(db_path="/var/local/ragversion.db")

Tracking unnecessary files

# Fix: Use specific patterns
await tracker.track_directory(
    "./docs",
    patterns=["*.md", "*.txt"],
    recursive=True
)

Issue: High Memory Usage¶

Symptoms:

- Memory grows during large batches
- OOM errors on large directories

Solutions:

Process in smaller batches

# Instead of tracking all at once
files = list_all_files()  # placeholder for your own file discovery
batch_size = 1000

for i in range(0, len(files), batch_size):
    batch = files[i:i+batch_size]
    await tracker.track_files(batch)

Disable content storage

tracker = AsyncVersionTracker(
    storage=storage,
    store_content=False
)

Reduce worker count

# Fewer concurrent operations = less memory
await tracker.track_directory("./docs", max_workers=2)
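For the smaller-batches approach, the file list itself can become a memory cost on very large directories. A generator that yields fixed-size batches from any iterable avoids building the whole list first. This is a generic sketch; `chunked` is a hypothetical helper, not a RAGVersion function.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of up to `size` items without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded batch can be handed to the tracker in turn, so only one batch of paths is held in memory at a time.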

Issue: Database Lock Errors (SQLite)¶

Symptoms:

- "database is locked" errors
- Timeouts on concurrent access

Solutions:

Increase timeout

storage = SQLiteStorage(timeout=60)  # Default: 30

Reduce concurrent writers

await tracker.track_directory("./docs", max_workers=2)

Use Supabase for multi-user scenarios

# SQLite is single-writer
# For team collaboration, use Supabase
storage = SupabaseStorage.from_env()
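If occasional lock errors remain after tuning the timeout and worker count, application code can retry the failed write with a short backoff. The sketch below assumes the error surfaces as the standard library's `sqlite3.OperationalError`; whether RAGVersion wraps it in its own exception type is not stated here, and `run_with_retry` is a hypothetical helper.

```python
import sqlite3
import time


def run_with_retry(operation, retries: int = 5, base_delay: float = 0.05):
    """Call operation(), retrying with exponential backoff on lock errors."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock error, or the final failure
            if "locked" not in str(exc) or attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

Retries hide contention rather than remove it, so frequent lock errors are still a signal to reduce concurrent writers or move to a multi-writer backend.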

Best Practices Summary¶

✅ DO¶

Use batch operations for bulk imports (10-50x faster, depending on backend)
Reuse tracker instances across operations
Tune worker count for your hardware (4-8 typical)
Enable compression for text documents
Use specific file patterns to avoid unnecessary tracking
Monitor statistics to track growth
Use SQLite for single-user scenarios
Use Supabase for multi-user or cloud scenarios

❌ DON'T¶

Don't create new tracker per operation (connection overhead)
Don't track binary files unnecessarily (slow parsing)
Don't use network drives for SQLite database
Don't set workers >16 (diminishing returns)
Don't disable compression unless you have a reason
Don't track large files without increasing max_file_size_mb
Don't use SQLite for multi-user (single-writer limitation)

Performance Checklist¶

Before deploying to production:

[ ] Chosen appropriate storage backend (SQLite vs Supabase)
[ ] Configured optimal worker count for hardware
[ ] Enabled content compression
[ ] Set up file patterns to avoid unnecessary tracking
[ ] Tested with representative data volume
[ ] Implemented cleanup policies for old versions
[ ] Configured appropriate timeouts
[ ] Set up monitoring and alerts
[ ] Benchmarked critical operations
[ ] Documented expected performance metrics
