LlamaIndex Integration¶

RAGVersion integrates with LlamaIndex to keep your indexes in sync with document changes.

Quick Start (Recommended)¶

NEW in v0.11.0: One-line setup!

from ragversion.integrations.llamaindex import quick_start

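# That's it! 🚀
# quick_start is a coroutine: run it in an async context
# (a notebook, or inside a function passed to asyncio.run)
sync = await quick_start("./documents")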

# Ready to query
query_engine = sync.index.as_query_engine()
response = query_engine.query("What are the features?")

What this does:

- ✅ Creates RAGVersion tracker with smart defaults
- ✅ Initializes VectorStoreIndex
- ✅ Sets up OpenAI embeddings
- ✅ Configures node parser
- ✅ Indexes your documents
- ✅ Enables chunk-level tracking for 80-95% cost savings

See the Quick Start Guide for more details.

Customization¶

Custom Configuration¶

sync = await quick_start(
    directory="./documents",
    storage_backend="supabase",      # or "sqlite", "auto"
    chunk_size=2048,                 # Custom chunk size
    chunk_overlap=200,               # Custom overlap
    enable_chunk_tracking=True,      # Smart updates (default)
)
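
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping windows. It is a character-based simplification written for illustration only: the helper `split_with_overlap` is hypothetical and is not part of RAGVersion, and LlamaIndex's SentenceSplitter measures chunk size in tokens and prefers sentence boundaries rather than cutting at fixed offsets.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Hypothetical helper for illustration; not RAGVersion's splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

A larger overlap repeats more context across neighbouring chunks, which increases the number of chunks to embed for the same document.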

Custom Embeddings¶

from llama_index.embeddings.openai import OpenAIEmbedding

embeddings = OpenAIEmbedding(model="text-embedding-3-small")
sync = await quick_start("./documents", embeddings=embeddings)

File Patterns¶

sync = await quick_start(
    directory="./documents",
    file_patterns=["*.pdf", "*.docx", "*.txt", "*.md"],
)
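
As a rough illustration of what glob-style patterns like these select, the sketch below filters a list of paths with Python's standard fnmatch module. It assumes that file_patterns are matched against file names in glob style; the helper `matches_any` is hypothetical, and RAGVersion's own matching logic may differ (for example in case sensitivity or recursion).

```python
from fnmatch import fnmatchcase
from pathlib import PurePath


def matches_any(path: str, patterns: list[str]) -> bool:
    """Return True if the file name matches at least one glob pattern.

    Hypothetical helper; assumes patterns apply to the file name only.
    """
    name = PurePath(path).name
    return any(fnmatchcase(name, pattern) for pattern in patterns)


patterns = ["*.pdf", "*.docx", "*.txt", "*.md"]
candidates = ["docs/report.pdf", "docs/notes.md", "docs/image.png"]
selected = [path for path in candidates if matches_any(path, patterns)]
print(selected)
```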

Advanced Usage¶

Manual Setup (For Full Control)¶

If you need complete control over the setup:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from ragversion import AsyncVersionTracker
from ragversion.storage import SQLiteStorage
from ragversion.integrations.llamaindex import LlamaIndexSync

# Setup storage
storage = SQLiteStorage()
tracker = AsyncVersionTracker(storage=storage)
await tracker.initialize()

# Setup LlamaIndex components
embeddings = OpenAIEmbedding()
node_parser = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
index = VectorStoreIndex.from_documents([], embed_model=embeddings)

# Create sync
sync = LlamaIndexSync(
    tracker=tracker,
    index=index,
    node_parser=node_parser,
    enable_chunk_tracking=True,  # Enable smart updates
)

# Sync directory
await sync.sync_directory("./documents")

Using the Sync Integration¶

# Query the index
query_engine = sync.index.as_query_engine()
response = query_engine.query("What changed today?")

# Use as retriever
retriever = sync.index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("sample query")

# Track new files (auto-updates index)
await sync.tracker.track("./new_doc.pdf")

# Custom metadata extraction
def extract_metadata(file_path):
    return {"source": "internal", "department": "engineering"}

sync = LlamaIndexSync(
    tracker=tracker,
    index=index,
    node_parser=node_parser,
    metadata_extractor=extract_metadata,
)
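
The extractor above returns the same metadata for every file. A slightly more useful variant might derive values from the path itself. The sketch below assumes, as in the example above, that the extractor receives a file path and returns a dict; the directory-per-department layout and the chosen keys are illustrative assumptions, not RAGVersion conventions.

```python
from pathlib import Path


def extract_metadata(file_path: str) -> dict:
    """Derive metadata from the file path.

    Assumes a layout such as documents/<department>/<file>.
    """
    path = Path(file_path)
    return {
        "file_name": path.name,
        "file_type": path.suffix.lstrip(".").lower(),
        # Use the containing folder as the department, if there is one
        "department": path.parent.name or "unknown",
    }


meta = extract_metadata("documents/engineering/design.PDF")
print(meta)
```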

Smart Chunk-Level Tracking¶

When enable_chunk_tracking=True (the default), RAGVersion re-embeds only the chunks that changed when a document is updated, achieving 80-95% cost savings compared to re-embedding the full document.

sync = await quick_start(
    directory="./documents",
    enable_chunk_tracking=True,  # Default
)

# Update a document
await sync.tracker.track("./documents/updated.pdf")
# Only changed chunks are re-embedded!
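
One common way to detect changed chunks is to compare content hashes between the old and new versions of a document. The sketch below illustrates that idea under stated assumptions: it is not RAGVersion's implementation, and the helpers `chunk_hashes` and `chunks_to_embed` are hypothetical names.

```python
import hashlib


def chunk_hashes(chunks: list[str]) -> list[str]:
    """Hash each chunk so that identical text yields an identical digest."""
    return [hashlib.sha256(chunk.encode("utf-8")).hexdigest() for chunk in chunks]


def chunks_to_embed(old_chunks: list[str], new_chunks: list[str]) -> list[str]:
    """Return only the new chunks whose content was not present before."""
    known = set(chunk_hashes(old_chunks))
    return [
        chunk
        for chunk, digest in zip(new_chunks, chunk_hashes(new_chunks))
        if digest not in known
    ]


old = ["Intro text.", "Pricing: $10.", "Contact us."]
new = ["Intro text.", "Pricing: $12.", "Contact us."]

changed = chunks_to_embed(old, new)
savings = 1 - len(changed) / len(new)  # fraction of embedding calls avoided
print(changed, f"{savings:.0%} of chunks reused")
```

The saving in this toy case depends entirely on how many chunks changed; real savings vary with document size and edit patterns.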

Examples¶

API Reference¶

`quick_start()`¶

One-line setup for LlamaIndex integration.

Parameters:

- directory (str): Directory to track and index
- index_path (str, optional): Reserved for future persistence support
- embeddings (BaseEmbedding, optional): Custom embeddings model
- storage_backend (str): "auto", "sqlite", or "supabase" (default: "auto")
- chunk_size (int): Node parser chunk size (default: 1024)
- chunk_overlap (int): Node parser overlap (default: 20)
- file_patterns (List[str]): File patterns to track (default: ["*.txt", "*.md", "*.pdf"])
- enable_chunk_tracking (bool): Enable smart chunk updates (default: True)

Returns:

- LlamaIndexSync: Initialized sync instance with documents indexed

`LlamaIndexSync`¶

Main integration class for automatic synchronization.

Methods:

- sync_directory(dir_path, patterns, recursive): Sync all files in directory
- refresh_index(): Refresh the entire index from tracked documents
- _handle_creation(event): Handle document creation events
- _handle_modification(event): Handle document modification events
- _handle_deletion(event): Handle document deletion events
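
The three handler methods suggest an event-driven design in which each change type is routed to its own handler. The toy sketch below shows that dispatch pattern with a plain dict standing in for the index. Everything in it is an assumption made for illustration: `ChangeEvent`, its fields, and `ToyIndexSync` are hypothetical and do not reflect RAGVersion's actual event classes or internals.

```python
from dataclasses import dataclass


@dataclass
class ChangeEvent:
    """Hypothetical event shape; not RAGVersion's event class."""
    change_type: str  # "created", "modified", or "deleted"
    file_path: str
    content: str = ""


class ToyIndexSync:
    """Routes change events to handlers that update an in-memory index."""

    def __init__(self) -> None:
        self.index: dict[str, str] = {}
        self._handlers = {
            "created": self._handle_creation,
            "modified": self._handle_modification,
            "deleted": self._handle_deletion,
        }

    def dispatch(self, event: ChangeEvent) -> None:
        self._handlers[event.change_type](event)

    def _handle_creation(self, event: ChangeEvent) -> None:
        self.index[event.file_path] = event.content

    def _handle_modification(self, event: ChangeEvent) -> None:
        # Replace the stored content with the new version
        self.index[event.file_path] = event.content

    def _handle_deletion(self, event: ChangeEvent) -> None:
        self.index.pop(event.file_path, None)


toy = ToyIndexSync()
toy.dispatch(ChangeEvent("created", "a.md", "v1"))
toy.dispatch(ChangeEvent("modified", "a.md", "v2"))
toy.dispatch(ChangeEvent("created", "b.md", "hello"))
toy.dispatch(ChangeEvent("deleted", "b.md"))
print(toy.index)
```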

See the complete documentation for detailed API reference.