YouTube AI Pipeline

podtube-pipeline · 2024 — present

A pipeline that automatically crawls YouTube, transcribes the audio, and analyses the transcripts with an LLM. It switches processing mode by video length (full text in one pass vs. time windows) and pairs parallel downloads with serial GPU processing to maximise throughput.

Type: Ingestion + analysis pipeline

Role: Architecture · Implementation

Stack: Python · faster-whisper · yt-dlp · Ollama · Gemini · Supabase

Throughput: ~60 videos / hour (local RTX environment)

Status: Active development


Background

To follow a niche on YouTube continuously, the content had to be accumulated as structured data rather than watched. Because video lengths and languages are mixed, a fixed pipeline struggles to balance cost and latency.

Pipeline stages

N° 01  YouTube crawl: crawl.py · yt-dlp · parallel download

N° 02  Whisper transcription: faster-whisper · GPU serial queue

N° 03  LLM analysis (adaptive): Ollama (local) · Gemini (long form)

N° 04  Structured store: Supabase · pgvector embeddings

Outcomes & decisions

Short videos are processed as full text in one pass; long videos are processed in time windows. The mode switches automatically by video length.
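The length-based switch could look roughly like the sketch below. It is an illustration only: the 20-minute threshold, the 5-minute window, and the names `Segment`, `choose_mode`, and `build_chunks` are assumptions, since the page gives neither the cut-off nor the window size.

```python
from dataclasses import dataclass

# Hypothetical values; the actual threshold and window size are not stated.
LONG_VIDEO_SECONDS = 20 * 60
WINDOW_SECONDS = 5 * 60


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


def choose_mode(duration_seconds: float) -> str:
    """Return "full" for short videos and "windowed" for long ones."""
    return "windowed" if duration_seconds > LONG_VIDEO_SECONDS else "full"


def build_chunks(segments: list[Segment], duration_seconds: float) -> list[str]:
    """Group transcript segments into the text chunks sent to the LLM."""
    if choose_mode(duration_seconds) == "full":
        # Short video: one chunk containing the whole transcript.
        return [" ".join(s.text for s in segments)]
    windows: dict[int, list[str]] = {}
    for s in segments:
        # A segment belongs to the window that contains its start time.
        windows.setdefault(int(s.start // WINDOW_SECONDS), []).append(s.text)
    return [" ".join(windows[i]) for i in sorted(windows)]
```

Assigning each segment by its start time keeps windows non-overlapping; a real implementation might prefer overlapping windows so that context is not cut at the boundary.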

Parallel downloads combined with serial GPU processing removed the throughput bottleneck.
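One way to arrange this is a thread pool for downloads feeding a single consumer thread that owns the GPU. The sketch below uses only the standard library; `run_pipeline`, `download`, and `transcribe` are hypothetical names, and the two callables stand in for the yt-dlp and faster-whisper steps rather than reproducing them.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

_DONE = object()  # sentinel that tells the GPU worker to stop


def run_pipeline(
    video_ids: list[str],
    download: Callable[[str], str],    # hypothetical: video id -> audio path
    transcribe: Callable[[str], str],  # hypothetical: audio path -> transcript
    max_downloads: int = 4,
) -> dict[str, str]:
    """Download in parallel, transcribe one file at a time."""
    ready: queue.Queue = queue.Queue()
    transcripts: dict[str, str] = {}

    def gpu_worker() -> None:
        # Single consumer: only one transcription runs at any moment.
        while True:
            item = ready.get()
            if item is _DONE:
                break
            video_id, path = item
            transcripts[video_id] = transcribe(path)

    def fetch(video_id: str) -> None:
        # Hand each finished download to the GPU queue as soon as it is ready.
        ready.put((video_id, download(video_id)))

    worker = threading.Thread(target=gpu_worker)
    worker.start()
    try:
        with ThreadPoolExecutor(max_workers=max_downloads) as pool:
            # list() waits for all downloads and re-raises any download error.
            list(pool.map(fetch, video_ids))
    finally:
        # Always stop the worker, even if a download failed.
        ready.put(_DONE)
        worker.join()
    return transcripts
```

Downloads are network-bound, so they overlap well in threads, while the single consumer keeps the GPU from being shared between concurrent transcription jobs.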

Sensitive transcripts stay local on Ollama; general analysis goes to Gemini. This balances cost and quality.
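The routing rule can be expressed as a small dispatch function. This is a sketch under assumptions: the `sensitive` flag, the `Transcript` type, and the backend callables are hypothetical, and the page does not say how sensitivity is actually determined or how the models are invoked.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transcript:
    video_id: str
    text: str
    sensitive: bool  # hypothetical flag; how it is set is not described


def pick_backend(t: Transcript) -> str:
    """Sensitive transcripts stay on the local model; the rest go to the hosted one."""
    return "ollama" if t.sensitive else "gemini"


def analyse(t: Transcript, backends: dict[str, Callable[[str], str]]) -> str:
    """Send the transcript text to whichever backend the routing rule selects."""
    return backends[pick_backend(t)](t.text)
```

Keeping the decision in one function makes the privacy rule easy to audit: no other code path chooses where a transcript is sent.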

Results are stored in Supabase with embeddings, so they feed search and downstream analytics directly.
