Embedding DB Setup

You:

Let's continue with all of those suggestions, with complete details, examples, and code samples, including SQL, database schemas, and setups using Postgres database.


ChatGPT:

"Alright, let’s go full throttle and implement the whole system—embeddings, vector database setup, RAG integration, and everything in between. This is where your PR review system becomes a living, breathing contextual intelligence powerhouse. Let’s dive right in."


1. Database Setup with Postgres and pgvector

We’ll use Postgres with the pgvector extension for storing and querying embeddings efficiently.

Step 1: Install Postgres and pgvector

  1. Install Postgres:

    sudo apt update
    sudo apt install postgresql postgresql-contrib
  2. Install pgvector Extension:

    • Install the pgvector package first (postgresql-contrib does not include it; the package name varies by Postgres major version, e.g. postgresql-16-pgvector on Debian/Ubuntu):

      sudo apt install postgresql-16-pgvector
    • Enable the pgvector extension:

      psql -U postgres
      CREATE EXTENSION vector;
    • Verify installation:

      SELECT extname FROM pg_extension WHERE extname = 'vector';
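    • Create the application database and enable the extension inside it (CREATE EXTENSION applies per database, and the Node.js code below connects to pr_reviews):

      CREATE DATABASE pr_reviews;
      \c pr_reviews
      CREATE EXTENSION vector;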

Step 2: Create Database Schema

  1. Create Tables for Embeddings and Metadata:

    • embeddings: Stores vector embeddings and associated context.
    • pr_metadata: Tracks pull request-level details (e.g., PR number, risk score).
    CREATE TABLE embeddings (
      id SERIAL PRIMARY KEY,
      embedding VECTOR(1536), -- 1536 dimensions, matching OpenAI's text-embedding-ada-002
      type TEXT NOT NULL,     -- "diff", "comment", or "pr_metadata"
      file_path TEXT,
      pr_number INT,
      context TEXT NOT NULL,
      risk_score INT DEFAULT 0,
      created_at TIMESTAMP DEFAULT NOW()
    );

    CREATE TABLE pr_metadata (
      id SERIAL PRIMARY KEY,
      pr_number INT UNIQUE NOT NULL,
      title TEXT,
      description TEXT,
      overall_risk_score INT DEFAULT 0,
      created_at TIMESTAMP DEFAULT NOW()
    );
  2. Indexes for Fast Queries:

    • Add indexes for performance. For ivfflat, specify the operator class that matches your distance operator (vector_l2_ops pairs with <->, used below) and tune lists for your data size; building the index after loading data improves recall:

      CREATE INDEX idx_embedding ON embeddings USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
      CREATE INDEX idx_pr_number ON embeddings (pr_number);

2. Generate and Store Embeddings

The executor or service will generate embeddings for:

  • PR diffs.
  • AI-generated comments.
  • Metadata like PR titles/descriptions.

Generate Embeddings with OpenAI

We’ll use OpenAI’s text-embedding-ada-002 model to generate embeddings.

const axios = require('axios');

/**
 * Generate embeddings using OpenAI.
 * @param {string} content - The text content to embed.
 * @returns {Promise<Array<number>>} - The generated embedding vector.
 */
async function generateEmbedding(content) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    {
      model: 'text-embedding-ada-002',
      input: content,
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.data[0].embedding;
}
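
One practical note: the /v1/embeddings endpoint also accepts an array of strings as input, so if you embed many diffs or comments at once you can batch them into a single request. A sketch along the same lines as generateEmbedding (same endpoint and auth, batched input):

/**
 * Generate embeddings for several texts in one API call.
 * @param {Array<string>} contents - The texts to embed.
 * @returns {Promise<Array<Array<number>>>} - One embedding per input, in order.
 */
async function generateEmbeddings(contents) {
  const response = await axios.post(
    'https://api.openai.com/v1/embeddings',
    { model: 'text-embedding-ada-002', input: contents },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  // The API returns one result object per input; sort by index to preserve input order.
  return response.data.data
    .sort((a, b) => a.index - b.index)
    .map((d) => d.embedding);
}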

Insert Embeddings into Postgres

const { Pool } = require('pg');

const pool = new Pool({
  user: 'postgres',
  host: 'localhost',
  database: 'pr_reviews',
  password: 'yourpassword',
  port: 5432,
});

/**
 * Insert an embedding into the database.
 * Note: node-postgres serializes a JS array as a Postgres array literal,
 * which does not cast to vector, so the embedding is passed as a
 * '[1,2,3]'-style string and cast explicitly.
 * @param {Object} data - Embedding data.
 */
async function insertEmbedding(data) {
  const query = `
    INSERT INTO embeddings (embedding, type, file_path, pr_number, context, risk_score)
    VALUES ($1::vector, $2, $3, $4, $5, $6)
  `;
  await pool.query(query, [
    JSON.stringify(data.embedding), // "[0.12,-0.34,...]", the text format pgvector expects
    data.type,
    data.file_path,
    data.pr_number,
    data.context,
    data.risk_score,
  ]);
}

Full Example: Generate and Store Embedding

(async () => {
  const content = "diff content for src/app/module/file.ts";
  const embedding = await generateEmbedding(content);

  const data = {
    embedding,
    type: 'diff',
    file_path: 'src/app/module/file.ts',
    pr_number: 123,
    context: content,
    risk_score: 7,
  };

  await insertEmbedding(data);
  console.log('Embedding stored successfully.');
  await pool.end(); // Close the pool so the script can exit
})();

3. Query Embeddings for Context

Use vector similarity to retrieve relevant historical data during a new PR review.

SQL Query for Similar Embeddings

Note that pgvector’s <-> operator computes L2 (Euclidean) distance, while <=> computes cosine distance. OpenAI’s ada-002 embeddings are normalized to unit length, so both produce the same ranking and <-> works here:

SELECT file_path, context, risk_score
FROM embeddings
ORDER BY embedding <-> '[0.12, -0.34, 0.56, ...]'::VECTOR
LIMIT 5;
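
If you switch to embeddings that are not unit-normalized, use cosine distance explicitly with <=> and index with the matching operator class (a sketch; the vector literal is elided as above):

-- Cosine-distance variant; pairs with an ivfflat index built with vector_cosine_ops
CREATE INDEX idx_embedding_cosine ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

SELECT file_path, context, risk_score
FROM embeddings
ORDER BY embedding <=> '[0.12, -0.34, 0.56, ...]'::VECTOR
LIMIT 5;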

Integrate Query into Code

/**
 * Find similar embeddings from the database.
 * @param {Array<number>} embedding - The vector to compare.
 * @returns {Promise<Array>} - List of similar embeddings.
 */
async function findSimilarEmbeddings(embedding) {
  const query = `
    SELECT file_path, context, risk_score
    FROM embeddings
    ORDER BY embedding <-> $1::vector
    LIMIT 5
  `;
  // As with inserts, pass the vector as a '[...]'-style string.
  const result = await pool.query(query, [JSON.stringify(embedding)]);
  return result.rows;
}

4. RAG Workflow Integration

Step 1: Enrich AI Prompts with Historical Context

  1. When analyzing a new PR, generate embeddings for the diff.
  2. Query similar embeddings from the database.
  3. Use retrieved data to build a contextual prompt for the AI.

Example Contextual Prompt

You are reviewing changes to src/app/module/file.ts. Here is the diff:

[current diff content]

Historical context for this file:
- Past comment: "Consider adding tests for null inputs."
- Risk score: 7
- Related PRs: #122, #119.

Code for RAG Workflow

(async () => {
  const diffContent = "current diff content for src/app/module/file.ts";

  // Generate embedding for the current diff
  const currentEmbedding = await generateEmbedding(diffContent);

  // Find similar embeddings
  const similarEmbeddings = await findSimilarEmbeddings(currentEmbedding);

  // Build prompt with historical context
  let historicalContext = '';
  for (const row of similarEmbeddings) {
    historicalContext += `- Past comment: "${row.context}"\n`;
  }

  const prompt = `
You are reviewing changes to src/app/module/file.ts. Here is the diff:
${diffContent}

Historical context for this file:
${historicalContext}
`;

  console.log('Generated Prompt:\n', prompt);
})();
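
Section 5 below ends with sending this prompt to OpenAI for analysis. A minimal sketch of that call (the chat model name is a placeholder; substitute whichever model you use):

/**
 * Send the enriched prompt to OpenAI's chat completions API
 * and return the review text.
 */
async function analyzePrompt(prompt) {
  const response = await axios.post(
    'https://api.openai.com/v1/chat/completions',
    {
      model: 'gpt-4', // placeholder model name
      messages: [{ role: 'user', content: prompt }],
    },
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json',
      },
    }
  );
  return response.data.choices[0].message.content;
}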

5. Putting It All Together

End-to-End Workflow

  1. Preprocessing:

    • Extract diffs, comments, and metadata.
    • Generate embeddings for each.
    • Store embeddings in Postgres.
  2. RAG Integration:

    • For new PRs, generate embeddings for the diffs.
    • Query similar embeddings from the database.
    • Construct a prompt enriched with historical context.
    • Send the prompt to OpenAI for analysis.
  3. Post-Analysis:

    • Save AI responses as new embeddings for future use (see the sketch after this list).
    • Persist comments and scores to GitHub and the database.
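
A minimal sketch of that post-analysis step, reusing generateEmbedding and insertEmbedding from above (the arguments here are placeholders supplied by your executor):

async function storeAiComment(comment, prNumber, filePath, riskScore) {
  const embedding = await generateEmbedding(comment);
  await insertEmbedding({
    embedding,
    type: 'comment',      // AI-generated review comment
    file_path: filePath,
    pr_number: prNumber,
    context: comment,
    risk_score: riskScore,
  });
}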

Next Steps

  1. Deploy the System:
    • Host the database and executor in a scalable environment (e.g., Docker + Kubernetes).
  2. Optimize Queries:
    • Use pgvector efficiently for large datasets with index tuning (see the sketch after this list).
  3. Test RAG Workflow:
    • Simulate PRs and validate the quality of AI responses using historical context.

Let me know how deep you want to go into deployment or scaling this masterpiece! 🚀