Turns Codebase into Easy Tutorial with AI

Ever stared at a new codebase written by others, feeling completely lost? This tutorial shows you how to build an AI agent that analyzes GitHub repositories and creates beginner-friendly tutorials explaining exactly how the code works.

This is a tutorial project of Pocket Flow, a 100-line LLM framework. It crawls GitHub repositories and builds a knowledge base from the code. It analyzes entire codebases to identify core abstractions and how they interact, and transforms complex code into beginner-friendly tutorials with clear visualizations.

Check out the book "Crack Any Codebase with AI" for more!
Check out the YouTube Development Tutorial for more!
Check out the Substack Post Tutorial for more!

🔸 🎉 Reached Hacker News Front Page (April 2025) with >900 up‑votes: Discussion »

⭐ Example Results for Popular GitHub Repositories!

🤯 All these tutorials are generated entirely by AI by crawling the GitHub repo!

AutoGen Core - Build AI teams that talk, think, and solve problems together like coworkers!
Browser Use - Let AI surf the web for you, clicking buttons and filling forms like a digital assistant!
Celery - Supercharge your app with background tasks that run while you sleep!
Click - Turn Python functions into slick command-line tools with just a decorator!
Codex - Turn plain English into working code with this AI terminal wizard!
Crawl4AI - Train your AI to extract exactly what matters from any website!
CrewAI - Assemble a dream team of AI specialists to tackle impossible problems!
DSPy - Build LLM apps like Lego blocks that optimize themselves!
FastAPI - Create APIs at lightning speed with automatic docs that clients will love!
Flask - Craft web apps with minimal code that scales from prototype to production!
Google A2A - The universal language that lets AI agents collaborate across borders!
LangGraph - Design AI agents as flowcharts where each step remembers what happened before!
LevelDB - Store data at warp speed with Google's engine that powers blockchains!
MCP Python SDK - Build powerful apps that communicate through an elegant protocol without sweating the details!
NumPy Core - Master the engine behind data science that makes Python as fast as C!
OpenManus - Build AI agents with digital brains that think, learn, and use tools just like humans do!
PocketFlow - 100-line LLM framework. Let Agents build Agents!
Pydantic Core - Validate data at rocket speed with just Python type hints!
Requests - Talk to the internet in Python with code so simple it feels like cheating!
SmolaAgents - Build tiny AI agents that punch way above their weight class!
Showcase Your AI-Generated Tutorials in Discussions!

🚀 Getting Started

Clone this repository:
```
git clone https://github.com/The-Pocket/PocketFlow-Tutorial-Codebase-Knowledge
```

Install dependencies:
```
pip install -r requirements.txt
```
Set up the LLM in utils/call_llm.py by providing credentials, for example in a .env file. By default, the client uses an AI Studio key for Gemini 2.5 Pro, read from the GEMINI_API_KEY environment variable. To use another LLM, set the LLM_PROVIDER environment variable (e.g. XAI), then set the matching model, URL, and API key variables (e.g. XAI_MODEL, XAI_URL, XAI_API_KEY). For Ollama, the URL is http://localhost:11434/ and the API key can be omitted. You can also use your own models. We highly recommend the latest models with thinking capabilities (Claude 3.7 with thinking, o1). To verify the setup, run:
```
python utils/call_llm.py
```
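The provider lookup described above can be pictured as a small environment-variable resolver. This is only a minimal sketch, not the code in utils/call_llm.py: the function name `resolve_llm_config`, the returned dictionary keys, and the fallback behavior are assumptions made for illustration. Only the variable names (GEMINI_API_KEY, LLM_PROVIDER, and the provider-prefixed model, URL, and API key variables) come from the text.

```python
import os


def resolve_llm_config(env=None):
    """Hypothetical helper: collect LLM settings from environment variables."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER")
    if not provider:
        # Default path described above: Gemini with an AI Studio key.
        return {"provider": "GEMINI", "api_key": env.get("GEMINI_API_KEY")}
    prefix = provider.upper()
    return {
        "provider": prefix,
        "model": env.get(f"{prefix}_MODEL"),
        "url": env.get(f"{prefix}_URL"),
        # May be None, e.g. for a local Ollama server that needs no key.
        "api_key": env.get(f"{prefix}_API_KEY"),
    }
```

Calling `resolve_llm_config({"LLM_PROVIDER": "OLLAMA", "OLLAMA_URL": "http://localhost:11434/"})` yields a configuration whose `api_key` is `None`, which matches the note that the key can be omitted for Ollama.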
Generate a complete codebase tutorial by running the main script:
```
# Analyze a GitHub repository
python main.py --repo https://github.com/username/repo --include "*.py" "*.js" --exclude "tests/*" --max-size 50000

# Or, analyze a local directory
python main.py --dir /path/to/your/codebase --include "*.py" --exclude "*test*"

# Or, turn a single large Markdown document (book/report) into a tutorial
python main.py --file /path/to/book.md --language "Chinese"

# Or, generate a tutorial in Chinese
python main.py --repo https://github.com/username/repo --language "Chinese"
```
- --repo, --dir, or --file - Specify a GitHub repo URL, a local directory, or a single large Markdown document (required, mutually exclusive)
- -n, --name - Project name (optional, derived from URL/directory/file if omitted)
- -t, --token - GitHub token (or set GITHUB_TOKEN environment variable)
- -o, --output - Output directory (default: ./output)
- -i, --include - Files to include (e.g., "*.py" "*.js")
- -e, --exclude - Files to exclude (e.g., "tests/*" "docs/*")
- -s, --max-size - Maximum file size in bytes (default: 100KB)
- --language - Language for the generated tutorial (default: "english")
- --max-abstractions - Maximum number of core concepts (chapters) for the whole input. Default: 10 for codebases; auto-scaled with document size in --file mode
- --min-chunk-tokens - Minimum approx. tokens per section in document mode (default: 1500). Tiny sections below this size may be merged with a neighbor.
- --max-chunk-tokens - Maximum approx. tokens per section in document mode (default: 10000). Oversized sections above this size are split into smaller pieces.
- --no-cache - Disable LLM response caching (default: caching enabled)
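To make the "required, mutually exclusive" source options concrete, here is a rough sketch of how a subset of these flags could be declared with Python's standard argparse module. It is not taken from main.py: the function name `build_parser` is hypothetical, several options are left out for brevity, and the numeric default of 100000 bytes is an assumed reading of "100KB".

```python
import argparse


def build_parser():
    """Sketch of a parser mirroring a subset of the options listed above."""
    parser = argparse.ArgumentParser(
        description="Generate a tutorial from a codebase or a Markdown document."
    )
    # Exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--repo", help="GitHub repo URL")
    source.add_argument("--dir", help="Local directory")
    source.add_argument("--file", help="Single large Markdown document")
    parser.add_argument("-n", "--name", help="Project name")
    parser.add_argument("-o", "--output", default="./output")
    parser.add_argument("-i", "--include", nargs="+", help="Patterns to include")
    parser.add_argument("-e", "--exclude", nargs="+", help="Patterns to exclude")
    # Assumption: 100KB is interpreted as 100000 bytes.
    parser.add_argument("-s", "--max-size", type=int, default=100000)
    parser.add_argument("--language", default="english")
    parser.add_argument("--no-cache", action="store_true")
    return parser
```

With this layout, passing both `--repo` and `--dir` makes argparse exit with a usage error, and `--include "*.py" "*.js"` arrives as a list of two patterns.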

The application will crawl the repository, analyze the codebase structure, generate tutorial content in the specified language, and save the output in the specified directory (default: ./output).
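The include, exclude, and size options together decide which files reach the analysis step. The following is a minimal sketch of such a filter using the standard fnmatch module. The function name `should_include`, the order of the checks, and the choice of fnmatch are assumptions for illustration, so the project's crawler may behave differently.

```python
import fnmatch


def should_include(path, size_bytes, include=None, exclude=None, max_size=100000):
    """Hypothetical filter: glob-style include/exclude plus a size cap."""
    if size_bytes > max_size:
        return False  # Larger than --max-size.
    if exclude and any(fnmatch.fnmatch(path, pat) for pat in exclude):
        return False  # Matches an --exclude pattern.
    if include:
        # When include patterns are given, at least one must match.
        return any(fnmatch.fnmatch(path, pat) for pat in include)
    return True
```

Note that in fnmatch the `*` wildcard also matches path separators, so `"*.py"` matches `src/app.py` and `"tests/*"` matches everything below `tests/`.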

📖 Document mode (books & reports)

Besides codebases, you can point the tool at a single large Markdown document such as a textbook, technical report, or whitepaper using --file. Instead of crawling source files, it:

- Splits the document into structure-aware "virtual files" based on its heading hierarchy (#, ##, ###). Oversized chapters are recursively broken down and tiny sections are merged, so each piece stays within a target token range. Each piece keeps a breadcrumb (e.g. Chapter 2 > 2.1 Entropy) so it never loses its place in the document.
- Summarizes each section (~300 words) so the whole document fits in the LLM context during topic identification and relationship analysis, while the original text is preserved for writing detailed chapters.
- Reuses the same pipeline to identify core concepts, order them, and write beginner-friendly tutorial chapters.
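The heading-based split with breadcrumbs can be sketched in a few lines. This is an illustrative simplification, not the project's splitter: the function name `split_by_headings` and the output shape are hypothetical, text before the first heading is dropped, `#` lines inside fenced code blocks are not treated specially, and no size-based splitting or merging is done here.

```python
import re

# Headings of level 1 to 3, matching the hierarchy mentioned above.
HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown):
    """Split Markdown into sections, each tagged with a heading breadcrumb."""
    sections = []
    trail = []  # (level, title) pairs forming the current heading path
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # Leave headings at the same or a deeper level before descending.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, title))
            current = {"breadcrumb": " > ".join(t for _, t in trail), "lines": []}
            sections.append(current)
        elif current is not None:
            current["lines"].append(line)
    return [
        {"breadcrumb": s["breadcrumb"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
    ]
```

For a document containing `## Chapter 2` followed by `### 2.1 Entropy`, the second section carries the breadcrumb `Chapter 2 > 2.1 Entropy`.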

```
# Generate a tutorial from a textbook, with custom section sizing
python main.py --file /path/to/textbook.md --language "Chinese" \
  --min-chunk-tokens 2000 --max-chunk-tokens 8000
```

Normalize converted Markdown headings

Some converted Markdown files, especially files produced from PDF/Word sources, mark every detected heading as ##. This makes chapter boundaries hard to detect. Use the standalone normalizer before running document mode:

```
python scripts/normalize_markdown_headings.py /path/to/book.md \
  -o /path/to/book.normalized.md \
  --drop-leading-toc \
  --title "Book Title"

python main.py --file /path/to/book.normalized.md --language "Chinese"
```

The script learns the chapter pattern from the first valid ## heading, then restores a standard hierarchy:

```
# Book Title
## Chapter 1 ...
### 1.1 ...
#### 1.1.1 ...
```

It supports common chapter forms such as 第 1 章, Chapter 1, Chap. 1, Ch. 1, and numbered headings like 1 Introduction. Use --dry-run to preview the normalization summary without writing a file, and omit --title if the document should not get a top-level # title.
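As a rough picture of how such chapter forms might be recognized, the sketch below classifies a heading title with a few regular expressions. The patterns, the pattern names, and the function `classify_heading` are assumptions written for this example. They are not the regexes used by scripts/normalize_markdown_headings.py.

```python
import re

# Hypothetical patterns for the chapter forms listed above.
CHAPTER_PATTERNS = {
    "cjk": re.compile(r"^第\s*\d+\s*章"),
    "chapter": re.compile(r"^(?:Chapter|Chap\.|Ch\.)\s*\d+", re.IGNORECASE),
    "numbered": re.compile(r"^\d+\s+\S"),
}


def classify_heading(title):
    """Return the name of the first chapter pattern the title matches, else None."""
    title = title.strip()
    for name, pattern in CHAPTER_PATTERNS.items():
        if pattern.match(title):
            return name
    return None
```

In this sketch a title such as `1.1 Scope` matches none of the patterns, so it would be left for a lower heading level rather than treated as a chapter.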

Chunk size options

Document mode cuts one large Markdown file into smaller chunks before sending them to the LLM. The chunk-size options control how big those pieces should be:

- --max-chunk-tokens: "too big, split it." If a chapter or section is larger than this value, the splitter breaks it into smaller chunks so it fits in the LLM context.
- --min-chunk-tokens: "too small, merge it." If a section is smaller than this value, the splitter may merge it with a nearby section to avoid wasting an LLM call on a tiny piece.

For example, if a section has only 600 tokens and --min-chunk-tokens is 1500, it may be merged with a neighbor. If a chapter has 30000 tokens and --max-chunk-tokens is 10000, it will be split into multiple chunks.
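The arithmetic in that example can be reproduced with a small planning function. This is a simplified sketch under stated assumptions: the name `plan_chunks` is hypothetical, sections are represented only by their approximate token counts, tiny sections are merged backward into the previous chunk, and oversized sections are cut into near-equal pieces rather than along headings as the real splitter does.

```python
import math


def plan_chunks(section_tokens, min_tokens=1500, max_tokens=10000):
    """Return chunk sizes after splitting oversized and merging tiny sections."""
    chunks = []
    for tokens in section_tokens:
        if tokens > max_tokens:
            # Too big: split into the fewest near-equal pieces that fit.
            parts = math.ceil(tokens / max_tokens)
            base, extra = divmod(tokens, parts)
            chunks.extend(base + (1 if i < extra else 0) for i in range(parts))
        elif chunks and tokens < min_tokens and chunks[-1] + tokens <= max_tokens:
            # Too small: merge into the previous chunk if the result still fits.
            chunks[-1] += tokens
        else:
            chunks.append(tokens)
    return chunks


print(plan_chunks([4000, 600, 30000]))  # [4600, 10000, 10000, 10000]
```

Here the 600-token section is absorbed by its 4000-token neighbor, and the 30000-token chapter becomes three chunks of 10000 tokens each.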

In normalized book Markdown, ## chapter headings are kept as visible chunk starts even when their intro text is short; smaller headings such as ### and #### may still be merged.

Tip: for very large books, increasing --max-chunk-tokens produces fewer, larger sections (fewer LLM calls); decreasing it produces finer-grained sections.

🐳 Running with Docker

To run this project in a Docker container, you'll need to pass your API keys as environment variables.

Build the Docker image
```
docker build -t pocketflow-app .
```

Run the container

You'll need to provide your GEMINI_API_KEY for the LLM to function. If you're analyzing private GitHub repositories or want to avoid rate limits, also provide your GITHUB_TOKEN.

Mount a local directory to /app/output inside the container to access the generated tutorials on your host machine.

Example for analyzing a public GitHub repository:

```
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "$(pwd)/output_tutorials":/app/output \
  pocketflow-app --repo https://github.com/username/repo
```

Example for analyzing a local directory:

```
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "/path/to/your/local_codebase":/app/code_to_analyze \
  -v "$(pwd)/output_tutorials":/app/output \
  pocketflow-app --dir /app/code_to_analyze
```

Example for turning a single Markdown document (book/report) into a tutorial:

```
docker run -it --rm \
  -e GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE" \
  -v "/path/to/your/docs":/app/docs \
  -v "$(pwd)/output_tutorials":/app/output \
  pocketflow-app --file /app/docs/book.md
```

💡 Development Tutorial

I built this using Agentic Coding, the fastest development paradigm, where humans simply design and agents code.
The secret weapon is Pocket Flow, a 100-line LLM framework that lets Agents (e.g., Cursor AI) build for you.
Check out the step-by-step YouTube development tutorial.
