How to Build a YouTube Knowledge Agent With Local AI

The amount of technical content on YouTube is overwhelming — conference talks, tool demos, tutorial series, paper walkthroughs. The problem: it's ephemeral. You watch it, you forget it, you can't search it later.

This article shows how to build an agent that automatically processes YouTube videos, extracts structured knowledge, vectorizes it, and lets you query it in plain language whenever you need it.

The Architecture

Four steps: extract, structure, store, query. No paid video APIs, no computer vision. Just clean text and local embeddings.

YouTube URL → [yt-dlp] → transcript .vtt → [LLM] → structured analysis → [ChromaDB + embeddings] → queryable knowledge base

Step 1 — Extract the Transcript With yt-dlp

yt-dlp (https://github.com/yt-dlp/yt-dlp) downloads YouTube's auto-generated subtitles without touching the video file itself. It's fast, free, and works for any video that has auto-generated English captions.

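import os
import re
import subprocess

SUBS_DIR = "/tmp/yt_subs"


def get_transcript(url: str) -> str | None:
    # Only standard watch URLs (watch?v=<id>) are recognized here.
    video_id_match = re.search(r"[?&]v=([a-zA-Z0-9_-]+)", url)
    if not video_id_match:
        return None
    video_id = video_id_match.group(1)

    os.makedirs(SUBS_DIR, exist_ok=True)

    # Pass the arguments as a list (no shell=True), so the URL never
    # goes through a shell and needs no quoting.
    cmd = [
        "yt-dlp", "--write-auto-sub", "--sub-lang", "en",
        "--skip-download", "--output", f"{SUBS_DIR}/%(id)s", url,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # The download took too long, or yt-dlp is not installed.
        return None

    if result.returncode != 0:
        return None

    # yt-dlp appends the language and subtitle format to the output template.
    vtt_path = f"{SUBS_DIR}/{video_id}.en.vtt"
    try:
        with open(vtt_path, encoding="utf-8") as f:
            return clean_vtt(f.read())
    except FileNotFoundError:
        return None


def clean_vtt(vtt: str) -> str:
    # Drop the WEBVTT header block (everything up to the first blank line).
    text = re.sub(r"WEBVTT(?:\n.*?)*?\n\n", "", vtt, flags=re.DOTALL)
    # Drop cue timing lines, including any trailing cue settings.
    text = re.sub(r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}.*?\n", "", text)
    # Drop inline tags such as <c> and word-level timestamps.
    text = re.sub(r"<[^>]*>", "", text)
    # Collapse blank lines.
    text = re.sub(r"\n\s*\n", "\n", text).strip()
    return text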

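Tip: If yt-dlp shows a "No supported JavaScript runtime" warning, install Deno by following the instructions at https://deno.com. It may not be available through apt on your distribution, so sudo apt install deno is not guaranteed to work.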
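
Auto-generated captions often repeat each line across consecutive cues, so the output of clean_vtt can contain the same sentence two or three times in a row. The sketch below collapses consecutive duplicates. The helper name dedupe_caption_lines is hypothetical, and it assumes the repeats are exact and adjacent.

```python
def dedupe_caption_lines(text: str) -> str:
    """Drop blank lines and lines identical to the line just before them."""
    kept: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        # Keep a line only if it is non-empty and differs from the last kept line.
        if line and (not kept or line != kept[-1]):
            kept.append(line)
    return "\n".join(kept)
```

One way to use it is to wrap the result of the cleaning step, for example dedupe_caption_lines(clean_vtt(raw_vtt)), before the transcript is sent to the LLM. This also leaves more room under the transcript length limit used in the next step.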

Step 2 — Let the LLM Structure the Knowledge

Raw transcripts are noisy: filler words, repetition, no structure. The LLM turns them into clean, actionable JSON. Using litellm (https://github.com/BerriAI/litellm) keeps the code provider-agnostic: you can swap between Gemini, GPT-4o, Claude, or a local model by changing only the model string.

import litellm
import json

SYSTEM_PROMPT = """
You are a technical analyst specializing in AI and software development.
Given a video transcript, extract ONLY what is relevant and actionable.
Respond in JSON with this exact format:

{
  "title": "...",
  "channel": "...",
  "summary": "2-3 sentences on the main content",
  "key_points": ["point 1", "point 2", "point 3", "point 4", "point 5"],
  "tools_mentioned": [{"name": "...", "purpose": "..."}],
  "difficulty": "beginner|intermediate|advanced",
  "topics": ["topic1", "topic2"],
  "tags": ["tag1", "tag2"]
}
"""

def analyze_video(url: str, transcript: str, model: str = "gemini/gemini-2.0-flash") -> dict:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the first 12,000 characters are sent; longer transcripts are truncated.
            {"role": "user", "content": f"URL: {url}\n\nTRANSCRIPT:\n{transcript[:12000]}"}
        ],
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)
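
Even with a JSON response format, a model can omit keys or return values of the wrong type, and the storage step later reads fields such as summary and key_points directly. A minimal sketch of a defensive normalization pass, assuming the schema from SYSTEM_PROMPT, might look like this. The name normalize_analysis and the choice of empty defaults are illustrative assumptions.

```python
ALLOWED_DIFFICULTY = {"beginner", "intermediate", "advanced"}


def _as_str(value) -> str:
    # Anything that is not a string becomes an empty string.
    return value.strip() if isinstance(value, str) else ""


def _as_str_list(value) -> list[str]:
    # Keep only non-empty string items; anything else is dropped.
    if not isinstance(value, list):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]


def normalize_analysis(raw: dict) -> dict:
    """Return a dict that always has every key of the prompt's schema."""
    tools_raw = raw.get("tools_mentioned")
    if not isinstance(tools_raw, list):
        tools_raw = []
    tools = [
        {"name": _as_str(t.get("name")), "purpose": _as_str(t.get("purpose"))}
        for t in tools_raw
        if isinstance(t, dict) and _as_str(t.get("name"))
    ]

    difficulty = _as_str(raw.get("difficulty")).lower()
    return {
        "title": _as_str(raw.get("title")),
        "channel": _as_str(raw.get("channel")),
        "summary": _as_str(raw.get("summary")),
        "key_points": _as_str_list(raw.get("key_points")),
        "tools_mentioned": tools,
        # Values outside the three allowed levels are blanked.
        "difficulty": difficulty if difficulty in ALLOWED_DIFFICULTY else "",
        "topics": _as_str_list(raw.get("topics")),
        "tags": _as_str_list(raw.get("tags")),
    }
```

Calling normalize_analysis on the result of analyze_video before storing it means a missing field degrades to an empty value instead of raising a KeyError.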

Step 3 — Store It in ChromaDB With Local Embeddings

ChromaDB (https://docs.trychroma.com/) is a vector database that runs entirely locally. Embeddings are generated by nomic-embed-text via Ollama (https://ollama.com), so the embedding step costs nothing and sends no data to third parties.

Install dependencies:

pip install chromadb litellm yt-dlp
ollama pull nomic-embed-text

import chromadb
from chromadb.utils import embedding_functions
import hashlib
from datetime import datetime

ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text"
)

client = chromadb.PersistentClient(path="./knowledge_base")
collection = client.get_or_create_collection("videos", embedding_function=ef)


def store_video(url: str, analysis: dict):
    doc_id = hashlib.md5(url.encode()).hexdigest()

    content = f"{analysis['summary']}\n\n" + "\n".join(analysis['key_points'])
    if analysis.get('tools_mentioned'):
        tools_text = "\n".join(
            f"{t['name']}: {t['purpose']}" for t in analysis['tools_mentioned']
        )
        content += f"\n\nTools:\n{tools_text}"

    collection.upsert(
        ids=[doc_id],
        documents=[content],
        metadatas=[{
            "url": url,
            "title": analysis.get("title", ""),
            "channel": analysis.get("channel", ""),
            "difficulty": analysis.get("difficulty", ""),
            "tags": ",".join(analysis.get("tags", [])),
            "processed_at": datetime.now().isoformat()
        }]
    )

Step 4 — Query Your Knowledge in Plain Language

This is where it pays off. Instead of searching by title or tag, you describe what you're looking for and the vector search finds semantically similar content.

def query(question: str, n: int = 5) -> list[dict]:
    results = collection.query(query_texts=[question], n_results=n)

    output = []
    for i, doc in enumerate(results['documents'][0]):
        meta = results['metadatas'][0][i]
        output.append({
            "title": meta['title'],
            "url": meta['url'],
            "channel": meta['channel'],
            "excerpt": doc[:300]
        })
    return output


# Examples:
# query("how to implement RAG with local embeddings")
# query("fine-tuning vs prompt engineering tradeoffs")
# query("agent memory architectures long term")
# query("multimodal models vision language 2024")

The Full Pipeline

def process_video(url: str):
    print(f"Processing: {url}")

    transcript = get_transcript(url)
    if not transcript:
        print("  No transcript available")
        return None

    analysis = analyze_video(url, transcript)
    store_video(url, analysis)

    print(f"  Saved: {analysis.get('title', '')}")
    print(f"  Tags: {', '.join(analysis.get('tags', []))}")
    return analysis


urls = [
    "https://www.youtube.com/watch?v=...",
    "https://www.youtube.com/watch?v=...",
]

for url in urls:
    process_video(url)
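
The loop above stops at the first exception, for example when the LLM returns invalid JSON or the embedding server is not running. A minimal sketch of a batch runner that isolates failures per video is shown below. The name process_all and the shape of the report are assumptions; the processing function is passed in as a parameter, so process_video from above could be used as is.

```python
from typing import Callable, Iterable


def process_all(urls: Iterable[str],
                process: Callable[[str], object]) -> dict[str, list[str]]:
    """Run `process` on every URL and sort the URLs by outcome."""
    report: dict[str, list[str]] = {"ok": [], "skipped": [], "failed": []}
    for url in urls:
        try:
            result = process(url)
        except Exception as exc:
            # One bad video should not abort the rest of the batch.
            print(f"  Failed: {url} ({exc})")
            report["failed"].append(url)
            continue
        # A falsy result (such as None) means nothing was stored for this URL.
        report["ok" if result else "skipped"].append(url)
    return report
```

Calling process_all(urls, process_video) would then return the three lists, which is handy for retrying only the failed URLs.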

Automating It With a Nightly Cron

Add a simple cron job to feed the system automatically every night with new videos from your favourite channels:

0 2 * * * python /home/user/agent/process_feed.py >> /var/log/yt-agent.log 2>&1

process_feed.py reads a list of channels, fetches the latest videos using yt-dlp --playlist-end 3, and passes them through the pipeline. Zero manual effort after setup.
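
The script itself is not shown here, so the following is only a sketch of how its channel-listing step might work. It assumes each channel is given as a URL that yt-dlp treats as a playlist (for example the channel's videos tab), and it uses the yt-dlp options --flat-playlist, --playlist-end and --print to list IDs without downloading anything. The function name latest_video_urls and the run parameter are hypothetical; run exists only so the function can be exercised without network access.

```python
import subprocess


def latest_video_urls(channel_url: str, limit: int = 3,
                      run=subprocess.run) -> list[str]:
    """Return watch URLs for the first `limit` entries of a channel listing."""
    cmd = [
        "yt-dlp", "--flat-playlist",      # list entries only, no downloads
        "--playlist-end", str(limit),     # stop after the first `limit` entries
        "--print", "id",                  # print one video ID per line
        channel_url,
    ]
    result = run(cmd, capture_output=True, text=True, timeout=120)
    if result.returncode != 0:
        return []
    ids = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return [f"https://www.youtube.com/watch?v={video_id}" for video_id in ids]
```

A process_feed.py built on this could loop over its channel list, call latest_video_urls for each channel, and hand every URL to process_video. Since storage uses upsert with an ID derived from the URL, a video seen on an earlier night is overwritten rather than duplicated, although it is still analyzed again unless the script tracks what it has already processed.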

Full Stack — All Open Source

yt-dlp: transcript download (https://github.com/yt-dlp/yt-dlp)
litellm: LLM abstraction layer (https://github.com/BerriAI/litellm)
nomic-embed-text: local embeddings (https://ollama.com/library/nomic-embed-text)
Ollama: local model runtime (https://ollama.com)
ChromaDB: vector database (https://docs.trychroma.com)

Conclusion

After indexing 50–100 technical videos, semantic search becomes genuinely useful — you find connections between content you didn't even remember watching. The knowledge base compounds over time with zero extra effort.

The natural next step is connecting this base to a chat interface: ask the agent a question and have it answer by citing the source videos. That's a topic for another post.
