Embedding DB Setup
You:
Let's continue with all of those suggestions, with complete details, examples, and code samples, including SQL, database schemas, and setups using Postgres database.
ChatGPT:
"Alright, let’s go full throttle and implement the whole system—embeddings, vector database setup, RAG integration, and everything in between. This is where your PR review system becomes a living, breathing contextual intelligence powerhouse. Let’s dive right in."
1. Database Setup with Postgres and pgvector
We’ll use Postgres with the pgvector extension for storing and querying embeddings efficiently.
Step 1: Install Postgres and pgvector

- Install Postgres:

sudo apt update
sudo apt install postgresql postgresql-contrib

- Install the pgvector extension (build from source, or use a packaged build if your distro provides one):

git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install

- Enable the pgvector extension:

psql -U postgres
CREATE EXTENSION vector;

- Verify the installation:

SELECT extname FROM pg_extension WHERE extname = 'vector';
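If building from source is inconvenient, Postgres with pgvector preinstalled is also available as the official pgvector Docker image (the container name, password, and database name below are illustrative):

```shell
# Start Postgres 16 with pgvector preinstalled
docker run -d --name pr-reviews-db \
  -e POSTGRES_PASSWORD=yourpassword \
  -e POSTGRES_DB=pr_reviews \
  -p 5432:5432 \
  pgvector/pgvector:pg16

# Enable the extension in the new database
docker exec -it pr-reviews-db \
  psql -U postgres -d pr_reviews -c "CREATE EXTENSION IF NOT EXISTS vector;"
```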
Step 2: Create Database Schema
Create Tables for Embeddings and Metadata:
- embeddings: Stores vector embeddings and associated context.
- pr_metadata: Tracks pull request-level details (e.g., PR number, risk score).
CREATE TABLE embeddings (
id SERIAL PRIMARY KEY,
embedding VECTOR(1536), -- Compatible with OpenAI embeddings
type TEXT NOT NULL, -- "diff", "comment", or "pr_metadata"
file_path TEXT,
pr_number INT,
context TEXT NOT NULL,
risk_score INT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE pr_metadata (
id SERIAL PRIMARY KEY,
pr_number INT UNIQUE NOT NULL,
title TEXT,
description TEXT,
overall_risk_score INT DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW()
);

Indexes for Fast Queries:

Add indexes for performance. For cosine-distance queries, build the ivfflat index with the vector_cosine_ops operator class, and create it after the table has data (ivfflat clusters the rows that exist at index build time):

CREATE INDEX idx_embedding ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
CREATE INDEX idx_pr_number ON embeddings (pr_number);
2. Generate and Store Embeddings
The executor or service will generate embeddings for:
- PR diffs.
- AI-generated comments.
- Metadata like PR titles/descriptions.
Generate Embeddings with OpenAI
Code for Generating Embeddings
We’ll use OpenAI’s text-embedding-ada-002 model to generate embeddings.
const axios = require('axios');

/**
 * Generate embeddings using OpenAI.
 * @param {string} content - The text content to embed.
 * @returns {Promise<Array<number>>} - The generated embedding vector.
 */
async function generateEmbedding(content) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    {
      model: 'text-embedding-ada-002',
      input: content,
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.data[0].embedding;
}
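text-embedding-ada-002 rejects inputs beyond its ~8k-token limit, so very large diffs need to be split and embedded chunk by chunk before calling generateEmbedding. A minimal character-based splitter (the 8000-character default is a rough assumption, not a measured token budget):

```javascript
/**
 * Split long content into chunks small enough to embed.
 * Character count is only a rough proxy for tokens; tune maxChars
 * for your data if you see input-length errors.
 * @param {string} content - Text to split.
 * @param {number} maxChars - Maximum characters per chunk.
 * @returns {Array<string>} - Ordered list of chunks.
 */
function chunkContent(content, maxChars = 8000) {
  const chunks = [];
  for (let i = 0; i < content.length; i += maxChars) {
    chunks.push(content.slice(i, i + maxChars));
  }
  return chunks;
}
```

Each chunk can then be embedded and stored as its own row, sharing the same file_path and pr_number.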
Insert Embeddings into Postgres
const { Pool } = require('pg');

const pool = new Pool({
  user: 'postgres',
  host: 'localhost',
  database: 'pr_reviews',
  password: 'yourpassword',
  port: 5432,
});

/**
 * Insert an embedding into the database.
 * @param {Object} data - Embedding data.
 */
async function insertEmbedding(data) {
  const query = `
    INSERT INTO embeddings (embedding, type, file_path, pr_number, context, risk_score)
    VALUES ($1::vector, $2, $3, $4, $5, $6)
  `;
  await pool.query(query, [
    // node-pg does not serialize JS arrays to the vector type;
    // pgvector accepts the JSON-style '[0.1,0.2,...]' text literal.
    JSON.stringify(data.embedding),
    data.type,
    data.file_path,
    data.pr_number,
    data.context,
    data.risk_score,
  ]);
}
Full Example: Generate and Store Embedding
(async () => {
  const content = "diff content for src/app/module/file.ts";
  const embedding = await generateEmbedding(content);

  const data = {
    embedding,
    type: 'diff',
    file_path: 'src/app/module/file.ts',
    pr_number: 123,
    context: content,
    risk_score: 7,
  };

  await insertEmbedding(data);
  console.log('Embedding stored successfully.');
})();
3. Query Embeddings for Context
Use vector similarity to retrieve relevant historical data during a new PR review.
SQL Query for Similar Embeddings

Use pgvector's <=> operator for cosine distance (<-> computes Euclidean/L2 distance).

SELECT file_path, context, risk_score
FROM embeddings
ORDER BY embedding <=> '[0.12, -0.34, 0.56, ...]'::vector
LIMIT 5;
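For sanity-checking query results, the value pgvector's cosine-distance operator returns can be reproduced client-side. This is a plain JavaScript sketch of cosine distance (1 minus cosine similarity):

```javascript
/**
 * Cosine distance between two equal-length vectors:
 * 1 - (a · b) / (|a| * |b|).
 * @param {Array<number>} a
 * @param {Array<number>} b
 * @returns {number} - 0 for identical direction, up to 2 for opposite.
 */
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```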
Integrate Query into Code
/**
 * Find similar embeddings from the database.
 * @param {Array<number>} embedding - The vector to compare.
 * @returns {Promise<Array>} - List of similar embeddings.
 */
async function findSimilarEmbeddings(embedding) {
  const query = `
    SELECT file_path, context, risk_score
    FROM embeddings
    ORDER BY embedding <=> $1::vector
    LIMIT 5
  `;
  // Pass the vector as its '[...]' text literal so Postgres can cast it.
  const result = await pool.query(query, [JSON.stringify(embedding)]);
  return result.rows;
}
4. RAG Workflow Integration
Step 1: Enrich AI Prompts with Historical Context
- When analyzing a new PR, generate embeddings for the diff.
- Query similar embeddings from the database.
- Use retrieved data to build a contextual prompt for the AI.
Example Contextual Prompt
You are reviewing changes to src/app/module/file.ts. Here is the diff:
[current diff content]
Historical context for this file:
- Past comment: "Consider adding tests for null inputs."
- Risk score: 7
- Related PRs: #122, #119.
Code for RAG Workflow
(async () => {
  const diffContent = "current diff content for src/app/module/file.ts";

  // Generate embedding for the current diff
  const currentEmbedding = await generateEmbedding(diffContent);

  // Find similar embeddings
  const similarEmbeddings = await findSimilarEmbeddings(currentEmbedding);

  // Build prompt with historical context
  let historicalContext = '';
  for (const row of similarEmbeddings) {
    historicalContext += `- Past comment: "${row.context}"\n`;
  }

  const prompt = `
You are reviewing changes to src/app/module/file.ts. Here is the diff:
${diffContent}

Historical context for this file:
${historicalContext}
  `;

  console.log('Generated Prompt:\n', prompt);
})();
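The inline prompt assembly above can be factored into a small pure helper, which keeps it testable and also surfaces the retrieved risk scores (the function name and exact layout are illustrative):

```javascript
/**
 * Build a review prompt from the current diff plus retrieved history.
 * @param {string} filePath - File under review.
 * @param {string} diff - Current diff content.
 * @param {Array<{context: string, risk_score: number}>} rows - Similar embeddings.
 * @returns {string} - Prompt text for the model.
 */
function buildPrompt(filePath, diff, rows) {
  const history = rows
    .map((row) => `- Past comment: "${row.context}" (risk score: ${row.risk_score})`)
    .join('\n');
  return [
    `You are reviewing changes to ${filePath}. Here is the diff:`,
    diff,
    'Historical context for this file:',
    history || '- No historical context found.',
  ].join('\n');
}
```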
5. Putting It All Together
End-to-End Workflow
1. Preprocessing:
- Extract diffs, comments, and metadata.
- Generate embeddings for each.
- Store embeddings in Postgres.

2. RAG Integration:
- For new PRs, generate embeddings for the diffs.
- Query similar embeddings from the database.
- Construct a prompt enriched with historical context.
- Send the prompt to OpenAI for analysis.

3. Post-Analysis:
- Save AI responses as new embeddings for future use.
- Persist comments and scores to GitHub and the database.
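To populate pr_metadata.overall_risk_score during post-analysis, the per-file scores need an aggregation policy. Taking the maximum, so that a single high-risk file flags the whole PR, is one reasonable choice (an assumption, not something the schema mandates):

```javascript
/**
 * Aggregate per-file risk scores into an overall PR risk score.
 * Uses max so one high-risk file dominates; swap in an average
 * if you prefer a smoother signal.
 * @param {Array<number>} scores - Risk scores for each analyzed file.
 * @returns {number} - Overall risk score (0 if no scores).
 */
function overallRiskScore(scores) {
  if (scores.length === 0) return 0;
  return Math.max(...scores);
}
```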
Next Steps
- Deploy the System: Host the database and executor in a scalable environment (e.g., Docker + Kubernetes).
- Optimize Queries: Tune pgvector for large datasets (e.g., adjust the ivfflat lists parameter at index build time and ivfflat.probes at query time).
- Test RAG Workflow: Simulate PRs and validate the quality of AI responses using historical context.
Let me know how deep you want to go into deployment or scaling this masterpiece! 🚀