# Copilot Instructions for Schloter Project

## Project Overview

This project automates the extraction of transcripts from a specific YouTube channel, analyzes them to extract quotes, stores those quotes in a SQLite database, and serves them via an API. A separate script can call this API daily to post a quote to a Microsoft Teams channel.

## Architecture & Key Components

- **YouTube Transcript Extractor** (`YouTube Transcript Extractor with line correction.py`):
  - Downloads German subtitles (manual preferred, auto-generated as fallback) for all videos in a channel using `yt-dlp`; see the sketch after this list.
  - Parses the `.vtt` files, splits the text into sentences, and saves them as `.txt` files.
- **Quote API** (`quotes_api.py`):
  - FastAPI app serving a `/quotes` endpoint.
  - Returns a random quote from the SQLite database, avoiding the 20 most recently served quotes.
- **Database**:
  - SQLite database (`quotes.db`) with a `quotes` table (`id`, `quote`).
- **Teams Integration** (planned):
  - A script will call the API and post the quote to a Teams channel (not yet implemented).
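
The subtitle download boils down to a single `yt-dlp` call. The sketch below is a minimal illustration, not the script itself: the channel URL is a placeholder, and the real extractor additionally contains the `.vtt` parsing and line-correction logic.

```python
import subprocess

# Placeholder; the real script targets one specific channel.
CHANNEL_URL = "https://www.youtube.com/@example-channel"

# --write-subs fetches manual subtitles, --write-auto-subs adds auto-generated
# ones as a fallback, and --skip-download leaves the videos themselves alone.
subprocess.run(
    [
        "yt-dlp",
        "--skip-download",
        "--write-subs",
        "--write-auto-subs",
        "--sub-langs", "de",
        "--output", "transcripts/%(title)s.%(ext)s",
        CHANNEL_URL,
    ],
    check=True,
)
```

When both a manual and an auto-generated `.vtt` file exist for a video, the parsing step should keep the manual one, in line with the convention below.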

## Developer Workflows

- **Extracting Transcripts**: Run the transcript extractor script. It creates `.vtt` and `.txt` files in the `transcripts/` directory.
- **Populating Quotes Database**: Parse the `.txt` files and insert quotes into the `quotes` table in `quotes.db` (currently a manual or ad-hoc scripted step); a sketch follows this list.
- **Running the API**: Start with `uvicorn quotes_api:app --reload`.
- **Testing the API**: Call `GET /quotes` to receive a random quote.
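
Since population is not automated yet, here is a minimal sketch of that step, assuming the `quotes` schema described above and one sentence per line in the `.txt` files:

```python
import sqlite3
from pathlib import Path

conn = sqlite3.connect("quotes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS quotes (id INTEGER PRIMARY KEY, quote TEXT)"
)

# Every .txt file in transcripts/ holds one sentence per line.
for txt_file in Path("transcripts").glob("*.txt"):
    for line in txt_file.read_text(encoding="utf-8").splitlines():
        sentence = line.strip()
        if sentence:
            conn.execute("INSERT INTO quotes (quote) VALUES (?)", (sentence,))

conn.commit()
conn.close()
```

In practice an analysis/filtering step belongs between reading and inserting, since not every transcript sentence is a quotable quote.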

## Conventions & Patterns

- Always prefer manual subtitles over auto-generated ones for accuracy.
- Quotes are stored as one sentence per line in `.txt` files, then inserted into the database.
- The API avoids repeating the last 20 quotes by tracking served IDs in memory; see the sketch after this list.
- All scripts assume the working directory is the project root.
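
One way the non-repetition convention can look in FastAPI, assuming a `deque` capped at 20 IDs; `quotes_api.py` may implement it differently:

```python
import sqlite3
from collections import deque

from fastapi import FastAPI, HTTPException

app = FastAPI()
recent_ids = deque(maxlen=20)  # IDs of the last 20 quotes served; in memory only

@app.get("/quotes")
def get_quote():
    conn = sqlite3.connect("quotes.db")
    try:
        if recent_ids:
            # Exclude recently served quotes, then pick one of the rest at random.
            placeholders = ",".join("?" * len(recent_ids))
            row = conn.execute(
                f"SELECT id, quote FROM quotes WHERE id NOT IN ({placeholders}) "
                "ORDER BY RANDOM() LIMIT 1",
                tuple(recent_ids),
            ).fetchone()
        else:
            row = conn.execute(
                "SELECT id, quote FROM quotes ORDER BY RANDOM() LIMIT 1"
            ).fetchone()
    finally:
        conn.close()
    if row is None:
        raise HTTPException(status_code=404, detail="No quotes available")
    recent_ids.append(row[0])
    return {"id": row[0], "quote": row[1]}
```

Because the deque lives in process memory, the history resets on every restart and is not shared across workers, which matches the in-memory convention above.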

## External Dependencies

- `yt-dlp` (for subtitle download, called via a Python subprocess)
- `ffmpeg` (binary required for best `yt-dlp` results)
- `fastapi`, `uvicorn` (for the API)
- `sqlite3` (Python standard library, for database access)

## Example Data Flow

1. Extractor downloads and parses subtitles → `.txt` files.
2. Quotes are loaded into `quotes.db`.
3. API serves random quotes.
4. (Planned) Teams bot posts a quote daily.

## Key Files

- `YouTube Transcript Extractor with line correction.py`: Transcript download and parsing logic.
- `quotes_api.py`: FastAPI app for serving quotes.
- `transcripts/`: Stores all subtitle and parsed text files.
- `quotes.db`: SQLite database of quotes.

## Tips for AI Agents

- When adding new extraction or analysis logic, follow the pattern of sentence splitting and file naming in the extractor script.
- When extending the API, maintain the non-repetition logic for quotes.
- If adding Teams integration, use the API endpoint for quote retrieval; a possible starting point is sketched below.
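
Since the Teams integration is only planned, the following is purely a sketch of one option: a daily job (e.g., via cron or a scheduled task) that reads the API and posts to a Teams incoming webhook. The `requests` dependency, the localhost URL, and the `TEAMS_WEBHOOK_URL` environment variable are assumptions, not existing project pieces.

```python
import os

import requests

API_URL = "http://localhost:8000/quotes"  # assumes quotes_api.py runs locally on the default port
WEBHOOK_URL = os.environ["TEAMS_WEBHOOK_URL"]  # hypothetical Teams incoming-webhook URL

# Fetch one quote from the API and relay it to the Teams channel.
quote = requests.get(API_URL, timeout=10).json()["quote"]
requests.post(WEBHOOK_URL, json={"text": quote}, timeout=10)
```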