Use PR Embeddings for Contextual RAG API Calls

You:

Excellent. Is there an opportunity to use any of this information or data as source for embeddings for a contextual RAG API call?


ChatGPT:

"Absolutely, this is prime territory for turning your PR review system into a powerhouse for retrieval-augmented generation (RAG). Leveraging your diff data, AI responses, and review insights as embeddings creates a highly contextual knowledge base for PR analysis and developer feedback."

Let’s break this down into how we can use this data to feed embeddings and enable contextual RAG API calls.


Why Use RAG for This System?

  • Contextual Understanding: Instead of sending raw diffs to AI every time, precomputed embeddings let you retrieve relevant past data (e.g., previous reviews for the same file or module).
  • Enhanced Insights: Combine live AI analysis with historical knowledge, like past risk scores, similar issues, or frequently flagged patterns.
  • Efficiency: Reduce token usage by retrieving contextual data to enrich the AI prompt.

What Data Can Be Embedded?

1. PR Metadata

  • What to Store:
    • PR title and description.
    • Files changed and modules impacted.
    • Overall complexity score and risk score.
  • Use Case: Identify patterns across PRs (e.g., "How does this PR compare to past ones for this module?").

2. File-Specific Data

  • What to Store:
    • Diff content for each file.
    • Associated AI review comments and recommendations.
    • Risk scores for each block of changes.
  • Use Case: Retrieve past reviews or insights for a file when a new PR modifies it.

3. AI-Generated Comments

  • What to Store:
    • Comments posted on PRs.
    • Recommendations and actionable insights.
  • Use Case: Enhance new comments by referencing past feedback on similar changes.
4. Historical Risk Trends

  • What to Store:
    • Average risk scores over time.
    • Frequently flagged files or patterns (e.g., "functions missing tests").
  • Use Case: Suggest reviewers focus on high-risk areas based on historical trends.

How to Structure the Embeddings

Schema for Embeddings

Use vector embeddings to encode the context of diffs, comments, and metadata. Example schema for a vector database:

  • embedding: Vector representation of the content (e.g., using OpenAI text-embedding-ada-002).
  • type: Type of content (diff, comment, pr_metadata).
  • file_path: Path to the file (for diff and comment types).
  • module: Module or feature impacted (optional).
  • risk_score: Associated risk score (if applicable).
  • pr_number: PR identifier for traceability.
  • context: Raw content (e.g., diff, comment text).

Embedding Example

For a file-specific diff and its comments:

{
  "embedding": [0.12, -0.34, 0.56, ...],
  "type": "diff",
  "file_path": "src/app/module/file.ts",
  "risk_score": 7,
  "pr_number": 123,
  "context": "diff content for src/app/module/file.ts"
}

For an AI-generated comment:

{
  "embedding": [0.67, -0.12, 0.45, ...],
  "type": "comment",
  "file_path": "src/app/module/file.ts",
  "context": "Consider refactoring to reduce complexity.",
  "pr_number": 123
}

RAG Workflow for PR Review

Here’s how we can integrate RAG into the PR review process:

Step 1: Generate Embeddings

  1. Preprocess diffs, comments, and metadata.
  2. Use a model like text-embedding-ada-002 to convert content into embeddings.
  3. Store embeddings in a vector database (e.g., Pinecone, Weaviate, or Postgres + pgvector).
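As an illustration of the preprocessing in step 1, a unified diff could be split into per-file chunks before embedding, so each file's changes get their own vector. This is a hypothetical helper that assumes standard "diff --git" headers:

```javascript
// Split a unified diff into per-file chunks so each file gets its own embedding.
function splitDiffByFile(diff) {
  const chunks = [];
  let current = null;
  for (const line of diff.split('\n')) {
    // A "diff --git a/<path> b/<path>" header starts a new file's chunk.
    const match = line.match(/^diff --git a\/(\S+) b\/(\S+)/);
    if (match) {
      if (current) chunks.push(current);
      current = { file_path: match[2], content: line };
    } else if (current) {
      current.content += '\n' + line;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk's content then goes to the embedding model, with file_path stored alongside it per the schema above.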

Step 2: Query Embeddings for Context

  1. When a new PR is opened:
    • Extract the file list and diffs.
    • Query the vector database for embeddings related to:
      • Files being changed.
      • Modules impacted.
      • Similar past PRs or comments.
  2. Retrieve the most relevant data and construct a contextual prompt for the AI.
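For a small corpus, or for testing the retrieval step without a vector database, similarity ranking can be sketched in plain JavaScript. cosineSimilarity and topK are illustrative helpers; a real deployment would push this ranking into the database instead:

```javascript
// Cosine similarity between two embedding vectors; higher means more similar.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank stored rows by similarity to a query embedding and keep the top k.
function topK(queryEmbedding, rows, k) {
  return rows
    .map(row => ({ ...row, score: cosineSimilarity(queryEmbedding, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```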

Step 3: AI Analysis with Context

  1. Combine the retrieved context with the new diff in the AI prompt:

    You are reviewing changes to src/app/module/file.ts. Here is the diff:

    [current diff]

    Historical context:
    - Past comment: "Consider adding tests for null inputs."
    - Risk score: 7
    - Related PRs: #122, #119.
  2. The AI uses the enriched prompt to provide more nuanced feedback.
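Assembling the enriched prompt from retrieved rows might look like the following minimal sketch, assuming the rows follow the embedding schema above (buildReviewPrompt is a hypothetical helper):

```javascript
// Build an enriched review prompt from the current diff and retrieved context rows.
// Row fields (type, context, risk_score, pr_number) follow the embedding schema.
function buildReviewPrompt(filePath, diff, contextRows) {
  const lines = [
    `You are reviewing changes to ${filePath}. Here is the diff:`,
    '',
    diff,
    '',
    'Historical context:',
  ];
  for (const row of contextRows) {
    if (row.type === 'comment') lines.push(`- Past comment: "${row.context}"`);
    if (row.risk_score != null) lines.push(`- Risk score: ${row.risk_score} (PR #${row.pr_number})`);
  }
  return lines.join('\n');
}
```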


Technical Implementation

1. Embedding Generation

Integrate embedding generation into the executor.

const axios = require('axios');

// Generate an embedding vector for a piece of content via the OpenAI embeddings API.
async function generateEmbedding(content) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    {
      model: 'text-embedding-ada-002',
      input: content,
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.data[0].embedding;
}

2. Storing Embeddings

Store embeddings in a vector database.

Example using Postgres + pgvector:

CREATE TABLE embeddings (
  id SERIAL PRIMARY KEY,
  embedding VECTOR(1536),
  type TEXT,
  file_path TEXT,
  pr_number INT,
  context TEXT,
  risk_score INT
);

-- Insert embedding
INSERT INTO embeddings (embedding, type, file_path, pr_number, context, risk_score)
VALUES ($1, $2, $3, $4, $5, $6);
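From Node.js, one wrinkle is that pgvector binds vectors as bracketed string literals, so the embedding array needs formatting before the insert. toVectorLiteral is a hypothetical helper; the query itself is shown as a comment since it assumes the node-postgres (pg) client and a live connection pool:

```javascript
// pgvector accepts vector literals of the form '[0.12,-0.34,0.56]'.
function toVectorLiteral(embedding) {
  return `[${embedding.join(',')}]`;
}

// Hypothetical insert with node-postgres (npm package "pg"):
//
// await pool.query(
//   `INSERT INTO embeddings (embedding, type, file_path, pr_number, context, risk_score)
//    VALUES ($1::vector, $2, $3, $4, $5, $6)`,
//   [toVectorLiteral(row.embedding), row.type, row.file_path,
//    row.pr_number, row.context, row.risk_score]
// );
```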

3. Querying Embeddings

Find similar embeddings with a nearest-neighbor search. For cosine distance, pgvector uses the <=> operator (<-> is Euclidean distance).

Example query:

SELECT context, file_path, pr_number, risk_score
FROM embeddings
ORDER BY embedding <=> '[0.12, -0.34, 0.56, ...]'::VECTOR
LIMIT 5;

4. Integration into RAG Workflow

Modify the executor to:

  1. Generate embeddings for the new PR.
  2. Query historical embeddings for relevant context.
  3. Combine retrieved data with the current diff for AI analysis.
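The three steps above can be sketched as a small orchestrator. All dependency names here (generateEmbedding, queryEmbeddings, buildPrompt, runAI) are illustrative placeholders for the pieces described in this document, injected so each can be swapped or stubbed:

```javascript
// Orchestrate the RAG review flow for one PR. Dependencies are injected so the
// embedding API, vector store, prompt builder, and AI reviewer are replaceable.
async function reviewWithContext(pr, deps) {
  const results = [];
  for (const file of pr.files) {
    // 1. Embed the new diff.
    const embedding = await deps.generateEmbedding(file.diff);
    // 2. Retrieve historical context related to this file.
    const context = await deps.queryEmbeddings(embedding, { filePath: file.path, limit: 5 });
    // 3. Analyze with the context-enriched prompt.
    const prompt = deps.buildPrompt(file.path, file.diff, context);
    results.push({ file: file.path, review: await deps.runAI(prompt) });
  }
  return results;
}
```

Injecting the dependencies also makes the executor easy to unit-test with stubs before wiring up the real API and database clients.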

Developer Experience and Value

  • Enhanced AI Prompts: Context-rich prompts make the AI feel smarter and more specific.
  • Traceable Feedback: Developers can see historical comments and learn from past reviews.
  • Actionable Insights: Highlight recurring issues, frequently risky files, or modules needing refactoring.

Final Thoughts

Adding embeddings to your PR review system turns it into a living knowledge base, making AI-powered reviews smarter, faster, and more contextual. You’re not just automating reviews—you’re building a self-improving system that learns from its history.

Let me know how you’d like to prioritize implementation—embedding generation, vector database setup, or RAG workflow integration! 🚀