RAG Pipeline - Confluence & Bitbucket to Vector Store

June 2024

Python, RAG, Vector Store, Confluence API, Bitbucket API, LLMs, Embeddings

Overview

Built an end-to-end Retrieval-Augmented Generation (RAG) pipeline at JPMorgan Chase that ingests internal Confluence documentation and Bitbucket source code into a vector store, enabling semantic search and grounded LLM responses over internal knowledge.

Pipeline Architecture

Ingestion

Pulled documents from Confluence via the REST API (pages, spaces, and attachments), with incremental sync to handle updates
Pulled code and markdown from Bitbucket repositories, parsing files by extension and extracting meaningful chunks (functions, classes, docstrings, READMEs)
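
A minimal sketch of how the incremental sync could decide what to re-ingest, assuming every source document exposes a stable ID and a last-modified timestamp. The DocRef type, the select_changed and mark_synced helpers, and the in-memory seen dictionary are illustrative names, not taken from the original pipeline; a real deployment would persist this state and handle deletions separately.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DocRef:
    """Hypothetical handle for one source document (a page or a file)."""
    doc_id: str
    last_modified: datetime


def select_changed(docs, seen):
    """Return the docs that are new or modified since the previous run.

    `seen` maps doc_id -> last_modified as recorded on the previous sync.
    """
    changed = []
    for doc in docs:
        previous = seen.get(doc.doc_id)
        if previous is None or doc.last_modified > previous:
            changed.append(doc)
    return changed


def mark_synced(docs, seen):
    """Record the timestamps of docs that were just re-embedded."""
    for doc in docs:
        seen[doc.doc_id] = doc.last_modified


if __name__ == "__main__":
    old = datetime(2024, 6, 1, tzinfo=timezone.utc)
    new = datetime(2024, 6, 2, tzinfo=timezone.utc)
    state = {"page-1": old}
    current = [DocRef("page-1", new), DocRef("page-2", old)]
    todo = select_changed(current, state)
    print([d.doc_id for d in todo])
    mark_synced(todo, state)
```

Comparing timestamps keeps the check cheap; a content hash would be a reasonable alternative if timestamps in the source system are unreliable.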

Chunking & Embedding

Implemented document chunking strategies tailored to each source type: semantic chunking for prose (Confluence) and AST-aware chunking for code (Bitbucket)
Generated embeddings with an embedding model and stored them in a vector store alongside metadata: source, page or file path, and last-modified timestamp
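
To illustrate AST-aware chunking, here is a small sketch that splits a Python file into one chunk per top-level function or class using the standard library ast module. It assumes Python sources only; the write-up does not say which languages were parsed or which parser was used, and the chunk_python_source name and the chunk dictionary fields are hypothetical.

```python
import ast


def chunk_python_source(source, path):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,                      # metadata for later filtering
                "name": node.name,
                "kind": type(node).__name__,       # e.g. "FunctionDef", "ClassDef"
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "docstring": ast.get_docstring(node),
                "text": ast.get_source_segment(source, node),
            })
    return chunks


if __name__ == "__main__":
    sample = (
        "def add(a, b):\n"
        '    """Add two numbers."""\n'
        "    return a + b\n"
    )
    for chunk in chunk_python_source(sample, "pkg/util.py"):
        print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Chunking on syntactic boundaries keeps each function or class intact, so an embedded chunk never starts or ends mid-definition the way a fixed-size window can.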

Retrieval & Generation

Built a retrieval layer with hybrid search (dense + keyword) to surface the most relevant chunks for a given query
Wired retrieval results into an LLM prompt template to produce grounded, source-cited answers
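
The write-up does not state how the dense and keyword results were combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the method actually used. The function name, the k=60 constant, and the chunk IDs are illustrative; the inputs are ranked lists of chunk IDs as they might come back from a vector search and a keyword search.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


if __name__ == "__main__":
    dense_hits = ["c1", "c2", "c3"]      # hypothetical vector-search results
    keyword_hits = ["c2", "c4", "c1"]    # hypothetical keyword-search results
    print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

Because RRF works on ranks rather than raw scores, it avoids having to normalise cosine similarities against keyword scores, which live on different scales.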

Key Design Decisions

Incremental sync: only re-embed documents that changed since the last run, keeping the vector store fresh without full re-ingestion
Metadata filtering: users can scope queries to a specific Confluence space or Bitbucket repository
Prompt engineering: system prompts structured to cite sources and acknowledge when the context is insufficient
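
The metadata filtering and prompt structure described above might look roughly like the following. This is a sketch under assumptions: the chunk fields (source, scope, path, text), the filter_by_scope and build_prompt helpers, and the prompt wording are all hypothetical, and in practice the scope filter would more likely be pushed down into the vector store query than applied in Python.

```python
SYSTEM_PROMPT = (
    "Answer using only the numbered context passages. "
    "Cite passages as [1], [2], ... after each claim. "
    "If the context does not contain the answer, say so instead of guessing."
)


def filter_by_scope(chunks, scope):
    """Keep only chunks from one Confluence space or Bitbucket repository."""
    return [chunk for chunk in chunks if chunk["scope"] == scope]


def build_prompt(question, chunks):
    """Return (system, user) messages with numbered, source-labelled context."""
    passages = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] ({chunk['source']}: {chunk['path']})"
        passages.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(passages) if passages else "(no context retrieved)"
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return SYSTEM_PROMPT, user


if __name__ == "__main__":
    retrieved = [
        {"source": "confluence", "scope": "ENG", "path": "Deploy Guide",
         "text": "Use the blue pipeline."},
        {"source": "bitbucket", "scope": "payments", "path": "README.md",
         "text": "Run make test."},
    ]
    system, user = build_prompt("How do I deploy?", filter_by_scope(retrieved, "ENG"))
    print(user)
```

Numbering the passages gives the model a compact way to cite sources, and the explicit fallback instruction covers the case where retrieval returns nothing relevant.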

Outcome

Enabled engineers to query internal documentation and codebases in natural language, reducing time spent manually searching Confluence and navigating large repositories.