Embedding DB Setup

You:

Let's continue with all of those suggestions, with complete details, examples, and code samples, including SQL, database schemas, and setups using Postgres database.


ChatGPT:

"Alright, let’s go full throttle and implement the whole system—embeddings, vector database setup, RAG integration, and everything in between. This is where your PR review system becomes a living, breathing contextual intelligence powerhouse. Let’s dive right in."


1. Database Setup with Postgres and pgvector

We’ll use Postgres with the pgvector extension for storing and querying embeddings efficiently.

Step 1: Install Postgres and pgvector

  1. Install Postgres:

    sudo apt update
    sudo apt install postgresql postgresql-contrib
  2. Install pgvector Extension:

    • Install the pgvector package first (postgresql-contrib does not include it; the package name varies by Postgres major version, e.g. postgresql-16-pgvector on Debian/Ubuntu):

      sudo apt install postgresql-16-pgvector
    • Enable the pgvector extension:

      psql -U postgres
      CREATE EXTENSION vector;
    • Verify installation:

      SELECT extname FROM pg_extension WHERE extname = 'vector';
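    • Create the application database and enable the extension inside it (CREATE EXTENSION applies per database, and the Node.js code below connects to pr_reviews):

      CREATE DATABASE pr_reviews;
      \c pr_reviews
      CREATE EXTENSION vector;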

Step 2: Create Database Schema

  1. Create Tables for Embeddings and Metadata:

    • embeddings: Stores vector embeddings and associated context.
    • pr_metadata: Tracks pull request-level details (e.g., PR number, risk score).
    CREATE TABLE embeddings (
      id SERIAL PRIMARY KEY,
      embedding VECTOR(1536), -- 1536 dimensions, matching OpenAI's text-embedding-ada-002
      type TEXT NOT NULL,     -- "diff", "comment", or "pr_metadata"
      file_path TEXT,
      pr_number INT,
      context TEXT NOT NULL,
      risk_score INT DEFAULT 0,
      created_at TIMESTAMP DEFAULT NOW()
    );

    CREATE TABLE pr_metadata (
      id SERIAL PRIMARY KEY,
      pr_number INT UNIQUE NOT NULL,
      title TEXT,
      description TEXT,
      overall_risk_score INT DEFAULT 0,
      created_at TIMESTAMP DEFAULT NOW()
    );
  2. Indexes for Fast Queries:

    • Add indexes for performance. For ivfflat, specify the operator class that matches your distance operator (vector_l2_ops pairs with <->, used below) and tune lists for your data size; building the index after loading data improves recall:

      CREATE INDEX idx_embedding ON embeddings USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
      CREATE INDEX idx_pr_number ON embeddings (pr_number);

2. Generate and Store Embeddings

The executor or service will generate embeddings for:

  • PR diffs.
  • AI-generated comments.
  • Metadata like PR titles/descriptions.

Generate Embeddings with OpenAI

We’ll use OpenAI’s text-embedding-ada-002 model to generate embeddings.

const axios = require('axios');

/**
 * Generate embeddings using OpenAI.
 * @param {string} content - The text content to embed.
 * @returns {Promise<Array<number>>} - The generated embedding vector.
 */
async function generateEmbedding(content) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    {
      model: 'text-embedding-ada-002',
      input: content,
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.data[0].embedding;
}
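
One practical note: the /v1/embeddings endpoint also accepts an array of strings as input, so if you embed many diffs or comments at once you can batch them into a single request. A sketch along the same lines as generateEmbedding (same endpoint and auth, batched input):

/**
 * Generate embeddings for several texts in one API call.
 * @param {Array<string>} contents - The texts to embed.
 * @returns {Promise<Array<Array<number>>>} - One embedding per input, in order.
 */
async function generateEmbeddings(contents) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    { model: 'text-embedding-ada-002', input: contents },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  // The API returns one result object per input; sort by index to preserve input order.
  return response.data.data
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
}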

Insert Embeddings into Postgres

const { Pool } = require('pg');

const pool = new Pool({
  user: 'postgres',
  host: 'localhost',
  database: 'pr_reviews',
  password: 'yourpassword',
  port: 5432,
});

/**
 * Insert an embedding into the database.
 * Note: node-postgres serializes a JS array as a Postgres array literal,
 * which does not cast to vector, so the embedding is passed as a
 * '[1,2,3]'-style string and cast explicitly.
 * @param {Object} data - Embedding data.
 */
async function insertEmbedding(data) {
  const query = `
    INSERT INTO embeddings (embedding, type, file_path, pr_number, context, risk_score)
    VALUES ($1::vector, $2, $3, $4, $5, $6)
  `;
  await pool.query(query, [
    JSON.stringify(data.embedding), // "[0.12,-0.34,...]", the text format pgvector expects
    data.type,
    data.file_path,
    data.pr_number,
    data.context,
    data.risk_score,
  ]);
}

Full Example: Generate and Store Embedding

(async () => {
  const content = "diff content for src/app/module/file.ts";
  const embedding = await generateEmbedding(content);

  const data = {
    embedding,
    type: 'diff',
    file_path: 'src/app/module/file.ts',
    pr_number: 123,
    context: content,
    risk_score: 7,
  };

  await insertEmbedding(data);
  console.log('Embedding stored successfully.');
  await pool.end(); // Close the pool so the script can exit
})();

3. Query Embeddings for Context

Use vector similarity to retrieve relevant historical data during a new PR review.

SQL Query for Similar Embeddings

Note that pgvector’s <-> operator computes L2 (Euclidean) distance, while <=> computes cosine distance. OpenAI’s ada-002 embeddings are normalized to unit length, so both produce the same ranking and <-> works here:

SELECT file_path, context, risk_score
FROM embeddings
ORDER BY embedding <-> '[0.12, -0.34, 0.56, ...]'::VECTOR
LIMIT 5;
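
If you switch to embeddings that are not unit-normalized, use cosine distance explicitly with <=> and index with the matching operator class (a sketch; the vector literal is elided as above):

-- Cosine-distance variant; pairs with an ivfflat index built with vector_cosine_ops
CREATE INDEX idx_embedding_cosine ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

SELECT file_path, context, risk_score
FROM embeddings
ORDER BY embedding <=> '[0.12, -0.34, 0.56, ...]'::VECTOR
LIMIT 5;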

Integrate Query into Code

/**
 * Find similar embeddings from the database.
 * @param {Array<number>} embedding - The vector to compare.
 * @returns {Promise<Array>} - List of similar embeddings.
 */
async function findSimilarEmbeddings(embedding) {
  const query = `
    SELECT file_path, context, risk_score
    FROM embeddings
    ORDER BY embedding <-> $1::vector
    LIMIT 5
  `;
  // As with inserts, pass the vector as a '[...]'-style string.
  const result = await pool.query(query, [JSON.stringify(embedding)]);
  return result.rows;
}

4. RAG Workflow Integration

Step 1: Enrich AI Prompts with Historical Context

  1. When analyzing a new PR, generate embeddings for the diff.
  2. Query similar embeddings from the database.
  3. Use retrieved data to build a contextual prompt for the AI.

Example Contextual Prompt

You are reviewing changes to src/app/module/file.ts. Here is the diff:

[current diff content]

Historical context for this file:
- Past comment: "Consider adding tests for null inputs."
- Risk score: 7
- Related PRs: #122, #119.

Code for RAG Workflow

(async () => {
  const diffContent = "current diff content for src/app/module/file.ts";

  // Generate embedding for the current diff
  const currentEmbedding = await generateEmbedding(diffContent);

  // Find similar embeddings
  const similarEmbeddings = await findSimilarEmbeddings(currentEmbedding);

  // Build prompt with historical context
  let historicalContext = '';
  for (const row of similarEmbeddings) {
    historicalContext += `- Past comment: "${row.context}"\n`;
  }

  const prompt = `
You are reviewing changes to src/app/module/file.ts. Here is the diff:
${diffContent}

Historical context for this file:
${historicalContext}
`;

  console.log('Generated Prompt:\n', prompt);
})();
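
Section 5 below ends with sending this prompt to OpenAI for analysis. A minimal sketch of that call (the chat model name is a placeholder; substitute whichever model you use):

/**
 * Send the enriched prompt to OpenAI's chat completions API
 * and return the review text.
 */
async function analyzePrompt(prompt) {
  const response = await axios.post(
    'https://api.openai.com/v1/chat/completions',
    {
      model: 'gpt-4', // placeholder model name
      messages: [{ role: 'user', content: prompt }],
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.choices[0].message.content;
}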

5. Putting It All Together

End-to-End Workflow

  1. Preprocessing:

    • Extract diffs, comments, and metadata.
    • Generate embeddings for each.
    • Store embeddings in Postgres.
  2. RAG Integration:

    • For new PRs, generate embeddings for the diffs.
    • Query similar embeddings from the database.
    • Construct a prompt enriched with historical context.
    • Send the prompt to OpenAI for analysis.
  3. Post-Analysis:

    • Save AI responses as new embeddings for future use (see the sketch after this list).
    • Persist comments and scores to GitHub and the database.
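
A minimal sketch of that post-analysis step, reusing generateEmbedding and insertEmbedding from above (the arguments here are placeholders supplied by your executor):

async function storeAiComment(comment, prNumber, filePath, riskScore) {
  const embedding = await generateEmbedding(comment);
  await insertEmbedding({
    embedding,
    type: 'comment',      // AI-generated review comment
    file_path: filePath,
    pr_number: prNumber,
    context: comment,
    risk_score: riskScore,
  });
}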

Next Steps

  1. Deploy the System:
    • Host the database and executor in a scalable environment (e.g., Docker + Kubernetes).
  2. Optimize Queries:
    • Use pgvector efficiently for large datasets with index tuning (see the sketch after this list).
  3. Test RAG Workflow:
    • Simulate PRs and validate the quality of AI responses using historical context.

Let me know how deep you want to go into deployment or scaling this masterpiece! 🚀