How to Build a YouTube Knowledge Agent With Local AI
- CuriousAI.net

- May 25
The amount of technical content on YouTube is overwhelming — conference talks, tool demos, tutorial series, paper walkthroughs. The problem: it's ephemeral. You watch it, you forget it, you can't search it later.
This article shows how to build an agent that automatically processes YouTube videos, extracts structured knowledge, vectorizes it, and lets you query it in plain language whenever you need it.
The Architecture
Three processing stages, with a query step on top. No paid video APIs, no computer vision. Just clean text and local embeddings.
YouTube URL → [yt-dlp] → transcript .vtt → [LLM] → structured analysis → [ChromaDB + embeddings] → queryable knowledge base
Step 1 — Extract the Transcript With yt-dlp
yt-dlp (https://github.com/yt-dlp/yt-dlp) downloads YouTube's auto-generated subtitles without touching the video file itself. It's fast, free, and works with virtually every technical channel.
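import os
import re
import subprocess

SUBS_DIR = "/tmp/yt_subs"

def get_transcript(url: str) -> str | None:
    # Only watch URLs of the form ...?v=<id> are recognized here.
    video_id_match = re.search(r"v=([a-zA-Z0-9_-]+)", url)
    if not video_id_match:
        return None
    video_id = video_id_match.group(1)
    os.makedirs(SUBS_DIR, exist_ok=True)
    # Pass the arguments as a list so no shell quoting is needed.
    cmd = [
        "yt-dlp", "--write-auto-sub", "--sub-lang", "en",
        "--skip-download", "--output", f"{SUBS_DIR}/%(id)s", url,
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    vtt_path = f"{SUBS_DIR}/{video_id}.en.vtt"
    try:
        with open(vtt_path, encoding="utf-8") as f:
            return clean_vtt(f.read())
    except FileNotFoundError:
        return None

def clean_vtt(vtt: str) -> str:
    # Drop the WEBVTT header block (everything up to the first blank line).
    text = re.sub(r"WEBVTT.*?\n\n", "", vtt, count=1, flags=re.DOTALL)
    # Drop cue timing lines such as "00:00:01.000 --> 00:00:03.000 align:start".
    text = re.sub(r"\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}.*?\n", "", text)
    # Drop inline tags such as <c> and word-level timestamps.
    text = re.sub(r"<[^>]*>", "", text)
    # Collapse blank lines.
    text = re.sub(r"\n\s*\n", "\n", text).strip()
    return text
Tip: If yt-dlp shows a "No supported JavaScript runtime" warning, install Deno: sudo apt install deno where your distribution packages it, otherwise the official installer (curl -fsSL https://deno.land/install.sh | sh).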
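YouTube's auto-generated captions often repeat each line across consecutive cues, so the text returned by clean_vtt can contain the same sentence two or three times. The following is a minimal sketch of a de-duplication pass. dedupe_lines is a hypothetical helper that is not part of the original code, and it assumes that dropping consecutive identical lines is acceptable for your transcripts.

```python
def dedupe_lines(text: str) -> str:
    """Drop blank lines and any line identical to the previously kept line.

    Hypothetical helper: it only removes *consecutive* repeats, so a phrase
    that legitimately comes back later in the talk is preserved.
    """
    kept: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line and (not kept or line != kept[-1]):
            kept.append(line)
    return "\n".join(kept)


# Made-up sample that mimics rolling auto-captions.
sample = "welcome to the talk\nwelcome to the talk\ntoday we cover RAG\ntoday we cover RAG"
cleaned = dedupe_lines(sample)
```

Calling dedupe_lines(clean_vtt(raw_vtt)) before the analysis step would leave more of the 12,000-character budget used in Step 2 for actual content.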
Step 2 — Let the LLM Structure the Knowledge
Raw transcripts are noisy: filler words, repetition, no structure. The LLM turns them into clean, actionable JSON. Using litellm (https://github.com/BerriAI/litellm) keeps the code provider-agnostic: you can swap between Gemini, GPT-4o, Claude, or a local model by changing only the model string.
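Because the provider is selected by the model string alone, one way to switch providers without editing code is to read that string from the environment. This is only a sketch: the YT_AGENT_MODEL variable and the pick_model helper are assumptions made for illustration, and the default is the model string used in this article.

```python
import os

# Default taken from the article's analyze_video signature.
DEFAULT_MODEL = "gemini/gemini-2.0-flash"


def pick_model() -> str:
    """Return the litellm model string to use.

    Reads the hypothetical YT_AGENT_MODEL environment variable and falls
    back to DEFAULT_MODEL when it is unset or empty.
    """
    return os.environ.get("YT_AGENT_MODEL", "").strip() or DEFAULT_MODEL
```

The result can be passed as the model argument of analyze_video, for example analyze_video(url, transcript, model=pick_model()). Setting the variable to a string in litellm's ollama/<model> form would keep the analysis step local as well, assuming that model has been pulled in Ollama.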
import json

import litellm

SYSTEM_PROMPT = """
You are a technical analyst specializing in AI and software development.
Given a video transcript, extract ONLY what is relevant and actionable.
Respond in JSON with this exact format:
{
  "title": "...",
  "channel": "...",
  "summary": "2-3 sentences on the main content",
  "key_points": ["point 1", "point 2", "point 3", "point 4", "point 5"],
  "tools_mentioned": [{"name": "...", "purpose": "..."}],
  "difficulty": "beginner|intermediate|advanced",
  "topics": ["topic1", "topic2"],
  "tags": ["tag1", "tag2"]
}
"""

def analyze_video(url: str, transcript: str, model: str = "gemini/gemini-2.0-flash") -> dict:
    response = litellm.completion(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            # Only the first 12,000 characters of the transcript are sent.
            {"role": "user", "content": f"URL: {url}\n\nTRANSCRIPT:\n{transcript[:12000]}"},
        ],
        # Ask for a JSON object so the reply can be parsed directly.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
Step 3 — Store It in ChromaDB With Local Embeddings
ChromaDB (https://docs.trychroma.com/) is a vector database that runs entirely locally. Embeddings are generated by nomic-embed-text via Ollama (https://ollama.com) — no cost, no data sent to third parties.
Install dependencies:
pip install chromadb litellm yt-dlp
ollama pull nomic-embed-text
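The storage code in this step reads analysis['summary'] and analysis['key_points'] directly, so a reply from the LLM that omits either key would raise a KeyError. A small normalizer, run between analyze_video and store_video, is one way to guard against that. This is a sketch under the assumption that empty defaults are acceptable; normalize_analysis is a hypothetical helper, and the field names are the ones from the system prompt in Step 2.

```python
ALLOWED_DIFFICULTY = {"beginner", "intermediate", "advanced"}


def _as_str(value) -> str:
    """Return a stripped string, or "" for anything that is not a string."""
    return value.strip() if isinstance(value, str) else ""


def _as_str_list(value) -> list[str]:
    """Return the non-empty strings of a list, or [] for anything else."""
    if not isinstance(value, list):
        return []
    return [item.strip() for item in value if isinstance(item, str) and item.strip()]


def normalize_analysis(raw: dict) -> dict:
    """Coerce an LLM reply into the shape store_video expects (hypothetical helper)."""
    tools_raw = raw.get("tools_mentioned")
    tools = []
    for tool in tools_raw if isinstance(tools_raw, list) else []:
        # Keep only tool entries that are dicts with a usable name.
        if isinstance(tool, dict) and _as_str(tool.get("name")):
            tools.append({"name": _as_str(tool.get("name")),
                          "purpose": _as_str(tool.get("purpose"))})
    difficulty = _as_str(raw.get("difficulty")).lower()
    return {
        "title": _as_str(raw.get("title")),
        "channel": _as_str(raw.get("channel")),
        "summary": _as_str(raw.get("summary")),
        "key_points": _as_str_list(raw.get("key_points")),
        "tools_mentioned": tools,
        "difficulty": difficulty if difficulty in ALLOWED_DIFFICULTY else "",
        "topics": _as_str_list(raw.get("topics")),
        "tags": _as_str_list(raw.get("tags")),
    }
```

With this in place, the pipeline call would become store_video(url, normalize_analysis(analysis)), and every key the storage code touches is guaranteed to exist.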
import hashlib
from datetime import datetime

import chromadb
from chromadb.utils import embedding_functions

# Embeddings are requested from the local Ollama server.
ef = embedding_functions.OllamaEmbeddingFunction(
    url="http://localhost:11434/api/embeddings",
    model_name="nomic-embed-text",
)
client = chromadb.PersistentClient(path="./knowledge_base")
collection = client.get_or_create_collection("videos", embedding_function=ef)

def store_video(url: str, analysis: dict):
    # Hashing the URL gives a stable ID, so re-processing a video overwrites its entry.
    doc_id = hashlib.md5(url.encode()).hexdigest()
    # The embedded document is the summary, the key points, and the tools list.
    content = f"{analysis['summary']}\n\n" + "\n".join(analysis['key_points'])
    if analysis.get('tools_mentioned'):
        tools_text = "\n".join(
            f"{t['name']}: {t['purpose']}" for t in analysis['tools_mentioned']
        )
        content += f"\n\nTools:\n{tools_text}"
    collection.upsert(
        ids=[doc_id],
        documents=[content],
        metadatas=[{
            "url": url,
            "title": analysis.get("title", ""),
            "channel": analysis.get("channel", ""),
            "difficulty": analysis.get("difficulty", ""),
            # Tags are stored as one comma-separated string.
            "tags": ",".join(analysis.get("tags", [])),
            "processed_at": datetime.now().isoformat(),
        }],
    )
Step 4 — Query Your Knowledge in Plain Language
This is where it pays off. Instead of searching by title or tag, you describe what you're looking for and the vector search finds semantically similar content.
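To make "semantically similar" concrete: each document and each question is mapped to a vector, and the search returns the documents whose vectors lie closest to the question's vector. The toy example below uses cosine similarity, one common closeness measure. The three-dimensional vectors and titles are made up for illustration; real nomic-embed-text vectors have far more dimensions, and the distance metric ChromaDB applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up "embeddings" for two stored videos and one question.
docs = {
    "RAG with local embeddings": [0.9, 0.1, 0.0],
    "Sourdough baking basics": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.1]

# Rank stored titles by similarity to the question, best match first.
ranked = sorted(docs, key=lambda title: cosine_similarity(question, docs[title]), reverse=True)
print(ranked[0])  # prints "RAG with local embeddings"
```

The question never has to share a keyword with the stored text; it only has to land near it in the embedding space.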
def query(question: str, n: int = 5) -> list[dict]:
    results = collection.query(query_texts=[question], n_results=n)
    output = []
    for i, doc in enumerate(results['documents'][0]):
        meta = results['metadatas'][0][i]
        output.append({
            "title": meta['title'],
            "url": meta['url'],
            "channel": meta['channel'],
            "excerpt": doc[:300]
        })
    return output

# Examples:
# query("how to implement RAG with local embeddings")
# query("fine-tuning vs prompt engineering tradeoffs")
# query("agent memory architectures long term")
# query("multimodal models vision language 2024")
The Full Pipeline
def process_video(url: str):
    print(f"Processing: {url}")
    transcript = get_transcript(url)
    if not transcript:
        print(" No transcript available")
        return None
    analysis = analyze_video(url, transcript)
    store_video(url, analysis)
    print(f" Saved: {analysis.get('title', '')}")
    print(f" Tags: {', '.join(analysis.get('tags', []))}")
    return analysis

urls = [
    "https://www.youtube.com/watch?v=...",
    "https://www.youtube.com/watch?v=...",
]
for url in urls:
    process_video(url)
Automating It With a Nightly Cron
Add a simple cron job to feed the system automatically every night with new videos from your favourite channels:
0 2 * * * python /home/user/agent/process_feed.py >> /var/log/yt-agent.log 2>&1
process_feed.py reads a list of channels, fetches the latest videos using yt-dlp --playlist-end 3, and passes them through the pipeline. Zero manual effort after setup.
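The article does not show process_feed.py itself, so the following is only a sketch of how its listing step might look. The function names are hypothetical. The yt-dlp options used (--flat-playlist, --playlist-end, --print) list video IDs without downloading anything, and the channel URL in the usage note is a placeholder.

```python
import subprocess


def build_listing_command(channel_url: str, limit: int = 3) -> list[str]:
    """Build a yt-dlp command that prints the IDs of the newest `limit` videos."""
    return [
        "yt-dlp", "--flat-playlist",
        "--playlist-end", str(limit),
        "--print", "id",
        channel_url,
    ]


def ids_to_urls(stdout: str) -> list[str]:
    """Turn yt-dlp's one-ID-per-line output into watch URLs."""
    return [
        f"https://www.youtube.com/watch?v={line.strip()}"
        for line in stdout.splitlines()
        if line.strip()
    ]


def latest_video_urls(channel_url: str, limit: int = 3) -> list[str]:
    """Return watch URLs for a channel's newest videos, or [] if yt-dlp fails."""
    try:
        result = subprocess.run(
            build_listing_command(channel_url, limit),
            capture_output=True, text=True, timeout=120,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return []
    if result.returncode != 0:
        return []
    return ids_to_urls(result.stdout)
```

A script body could then loop over a list of channel pages, such as a placeholder https://www.youtube.com/@example/videos, and call process_video on each URL that latest_video_urls returns. Because store_video upserts by URL hash, re-running the job on a video that is already indexed overwrites its entry rather than duplicating it.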
Full Stack — All Open Source
- yt-dlp (transcript download): https://github.com/yt-dlp/yt-dlp
- litellm (LLM abstraction layer): https://github.com/BerriAI/litellm
- nomic-embed-text (local embeddings): https://ollama.com/library/nomic-embed-text
- Ollama (local model runtime): https://ollama.com
- ChromaDB (vector database): https://docs.trychroma.com
Conclusion
After indexing 50–100 technical videos, semantic search becomes genuinely useful — you find connections between content you didn't even remember watching. The knowledge base compounds over time with zero extra effort.
The natural next step is connecting this base to a chat interface: ask the agent a question and have it answer by citing the source videos. That's a topic for another post.
