
RAG Pipeline - Confluence & Bitbucket to Vector Store

June 2024
Python, RAG, Vector Store, Confluence API, Bitbucket API, LLMs, Embeddings

Overview

Built an end-to-end Retrieval-Augmented Generation (RAG) pipeline at JPMorgan Chase that ingests internal Confluence documentation and Bitbucket source code into a vector store, enabling semantic search and grounded LLM responses over internal knowledge.

Pipeline Architecture

Ingestion

  • Pulled Confluence content (pages and attachments, across spaces) via the REST API, with incremental sync to pick up updates
  • Pulled code and markdown from Bitbucket repositories, parsing files by extension and extracting meaningful chunks (functions, classes, docstrings, READMEs)
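
As a rough illustration of the two ingestion paths above, the sketch below builds the query parameters for a "changed since last sync" Confluence search and routes Bitbucket files to a parser by extension. It makes no network calls. The function names, the PARSERS mapping, and the parser labels are hypothetical, and the choice of a CQL search with a lastmodified filter is an assumption about how the sync could be done, not a description of the original implementation.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import Dict, Optional

# Hypothetical mapping from file extension to a parser label.
PARSERS: Dict[str, str] = {
    ".py": "python_ast",
    ".md": "markdown",
    ".rst": "markdown",
    ".java": "plain_code",
}


def build_cql(space_key: str, last_sync: datetime) -> str:
    """Build a CQL filter for pages in one space modified since last_sync."""
    # CQL accepts date literals in "yyyy-MM-dd HH:mm" form.
    stamp = last_sync.strftime("%Y-%m-%d %H:%M")
    return f'space = "{space_key}" AND type = page AND lastmodified >= "{stamp}"'


def search_params(space_key: str, last_sync: datetime,
                  start: int = 0, limit: int = 50) -> Dict[str, object]:
    """Query parameters for a content search request (assumed endpoint:
    /rest/api/content/search on the Confluence REST API)."""
    return {
        "cql": build_cql(space_key, last_sync),
        "start": start,            # pagination offset
        "limit": limit,            # page size
        "expand": "body.storage,version",
    }


def pick_parser(path: str) -> Optional[str]:
    """Choose a parser label for a repository file, or None to skip it."""
    file = PurePosixPath(path)
    if file.name.lower().startswith("readme"):
        return "markdown"          # READMEs are treated as prose
    return PARSERS.get(file.suffix.lower())
```

A caller would pass search_params(...) to an HTTP client of its choice and page through results by advancing start; files for which pick_parser returns None (binaries, images) are simply not ingested.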

Chunking & Embedding

  • Implemented document chunking strategies tailored to each source type: semantic chunking for prose (Confluence) and AST-aware chunking for code (Bitbucket)
  • Generated embeddings with a text embedding model and stored them in a vector store alongside metadata: source, page or file path, and last-modified timestamp
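
A minimal sketch of source-specific chunking, assuming Python source files on the code side. It uses the standard-library ast module to emit one chunk per top-level function or class, and a simple greedy paragraph packer as a stand-in for semantic chunking of prose (real semantic chunking would typically split on embedding similarity, which is not shown here). The Chunk type, the function names, and the max_chars limit are hypothetical; the metadata keys mirror the fields listed above.

```python
import ast
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, str]


def chunk_python_source(source: str, file_path: str,
                        last_modified: str) -> List[Chunk]:
    """Emit one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks: List[Chunk] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node
            # (starting at the def/class line, so decorators are not included).
            segment = ast.get_source_segment(source, node) or ""
            chunks.append(Chunk(segment, {
                "source": "bitbucket",
                "path": file_path,
                "symbol": node.name,
                "last_modified": last_modified,
            }))
    return chunks


def chunk_prose(text: str, page_path: str, last_modified: str,
                max_chars: int = 800) -> List[Chunk]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole.
    """
    packed: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        if current and len(current) + len(para) + 2 > max_chars:
            packed.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        packed.append(current)
    return [
        Chunk(body, {"source": "confluence", "path": page_path,
                     "last_modified": last_modified})
        for body in packed
    ]
```

Splitting code on definition boundaries keeps each function or class intact in a single chunk, so a retrieved snippet is readable on its own; the metadata travels with every chunk so it can later be used for filtering and citation.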

Retrieval & Generation

  • Built a retrieval layer with hybrid search (dense + keyword) to surface the most relevant chunks for a given query
  • Wired retrieval results into an LLM prompt template to produce grounded, source-cited answers
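
The retrieval and prompting steps above can be sketched as follows. This toy version ranks chunks once by cosine similarity over precomputed vectors and once by a naive term-count keyword score, then merges the two rankings with reciprocal rank fusion (RRF). RRF, the keyword scorer, the chunk dictionary layout, and the prompt wording are all assumptions made for illustration; the original write-up does not say how dense and keyword results were combined.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query: str, text: str) -> int:
    """Naive keyword score: total occurrences of query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[term] for term in query.lower().split())


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per id."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query: str, query_vec: Sequence[float],
                  chunks: List[dict], top_k: int = 3) -> List[dict]:
    dense = sorted(chunks, key=lambda c: -cosine(query_vec, c["vector"]))
    sparse = sorted(chunks, key=lambda c: -keyword_score(query, c["text"]))
    fused = rrf_fuse([[c["id"] for c in dense], [c["id"] for c in sparse]])
    by_id = {c["id"]: c for c in chunks}
    return [by_id[i] for i in fused[:top_k]]


def build_prompt(question: str, hits: List[dict]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    context = "\n\n".join(
        f"[{n}] ({hit['path']}) {hit['text']}" for n, hit in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered context below and cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Rank-based fusion avoids having to normalize cosine similarities and keyword scores onto a common scale, which is one reason it is a common default for hybrid search. In a real deployment the vector store's own dense and keyword search would replace the in-memory sorts.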

Key Design Decisions

  • Incremental sync: only re-embed documents that changed since the last run, keeping the vector store fresh without full re-ingestion
  • Metadata filtering: users can scope queries to a specific Confluence space or Bitbucket repository
  • Prompt engineering: system prompts structured to cite sources and acknowledge when the context is insufficient
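
To make the first two decisions concrete, here is a small sketch of how an incremental sync plan and a metadata scope filter might look. It assumes the vector store can report the last-modified timestamp it holds for each document id, and that chunks carry "space" or "repo" keys in their metadata. The function names and dictionary shapes are hypothetical, and the handling of deleted documents is an addition not mentioned in the write-up.

```python
from typing import Dict, List, Optional, Tuple


def plan_sync(remote: Dict[str, str],
              indexed: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Compare source timestamps with those already in the vector store.

    remote:  {doc_id: last_modified} as reported by Confluence/Bitbucket
    indexed: {doc_id: last_modified} as stored in chunk metadata
    Returns (ids to re-embed, ids to delete).
    """
    # New documents and documents whose timestamp differs need re-embedding.
    to_embed = sorted(d for d, ts in remote.items() if indexed.get(d) != ts)
    # Assumption: documents that disappeared upstream are removed from the index.
    to_delete = sorted(d for d in indexed if d not in remote)
    return to_embed, to_delete


def filter_by_scope(chunks: List[dict], *, space: Optional[str] = None,
                    repo: Optional[str] = None) -> List[dict]:
    """Keep only chunks whose metadata matches the requested scope."""
    def in_scope(meta: dict) -> bool:
        if space is not None and meta.get("space") != space:
            return False
        if repo is not None and meta.get("repo") != repo:
            return False
        return True

    return [c for c in chunks if in_scope(c["metadata"])]
```

Unchanged documents never reach the embedding model, so the cost of a sync run scales with the number of edits rather than the size of the corpus. In practice the scope filter would usually be pushed down into the vector store query instead of being applied in application code.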

Outcome

Enabled engineers to query internal documentation and codebases in natural language, reducing time spent manually searching Confluence and navigating large repositories.