CUA Documentation Scripts

This directory contains scripts for crawling, indexing, and serving CUA documentation through a Model Context Protocol (MCP) server.

Scripts

Local Scripts

crawl_docs.py: Crawls cua.ai/docs using crawl4ai
generate_db.py: Creates LanceDB vector database for semantic search
generate_sqlite.py: Creates SQLite FTS5 database for full-text search

Modal Deployment

modal_app.py: Complete Modal app with scheduled crawling and MCP server deployment

Installation

Install dependencies using uv:

# From the repository root
uv sync --group docs-scripts

Usage

Option 1: Local Development

1. Crawl Documentation

uv run docs/scripts/crawl_docs.py

2. Generate Databases

# Generate vector database for semantic search
uv run docs/scripts/generate_db.py

# Generate SQLite FTS5 database for full-text search
uv run docs/scripts/generate_sqlite.py

Option 2: Modal Deployment (Production)

The Modal app provides a production-ready deployment with:

Scheduled daily crawling at 6 AM UTC
Persistent storage using Modal volumes
Scalable MCP server with automatic database regeneration

Initial Setup

Install Modal CLI:

pip install modal

Authenticate with Modal:

modal setup

Deploy to Modal

# One-off run that generates the initial data
modal run docs/scripts/modal_app.py

# Deploy the app (includes scheduled crawling + MCP server)
modal deploy docs/scripts/modal_app.py

Access the MCP Server

After deployment, Modal will provide a public URL for the MCP server:

https://your-username--cua-docs-mcp-web.modal.run/mcp/

Use this URL with the MCP Inspector or any MCP client:

npx @modelcontextprotocol/inspector
# Enter URL: https://your-username--cua-docs-mcp-web.modal.run/mcp/
# Transport: Streamable HTTP
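
For a scripted client, the first message sent over the Streamable HTTP transport is a JSON-RPC `initialize` request. The sketch below only builds that request with the standard library and does not send it. The client name, client version, and protocol revision string are assumptions for illustration; use whatever your client and the deployed server actually support.

```python
import json
import urllib.request


def build_initialize_request(url):
    """Build (but do not send) an MCP initialize request for a Streamable HTTP endpoint."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumed revision
            "capabilities": {},
            # Hypothetical client identity, not taken from this repository.
            "clientInfo": {"name": "example-client", "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The server may answer with plain JSON or with an event stream.
            "Accept": "application/json, text/event-stream",
        },
        method="POST",
    )


request = build_initialize_request(
    "https://your-username--cua-docs-mcp-web.modal.run/mcp/"
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```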

Monitor Scheduled Crawls

View scheduled crawl runs in the Modal dashboard, or inspect the app from the CLI:

modal app show cua-docs-mcp

The crawler runs daily at 6 AM UTC and automatically updates the databases.

Code Indexing

The Modal app also indexes the CUA source code across all git tags, enabling semantic and full-text search over versioned code.

Architecture

Code indexing is sharded by component, and the shards are processed in parallel:

┌─────────────────────────────────────────────────────────────────┐
│                  generate_code_index_parallel()                  │
│                                                                 │
│  1. Clone/fetch git repository                                  │
│  2. Get all tags (e.g., agent-v0.7.3, computer-v0.5.0)         │
│  3. Group tags by component                                     │
│  4. Dispatch parallel workers via Modal starmap                 │
└─────────────────────────────────────────────────────────────────┘
                              │
           ┌──────────────────┼──────────────────┐
           ▼                  ▼                  ▼
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │ index_      │    │ index_      │    │ index_      │
    │ component   │    │ component   │    │ component   │
    │ (agent)     │    │ (computer)  │    │ (lume)      │
    │             │    │             │    │             │
    │ 112 tags    │    │ 58 tags     │    │ 49 tags     │
    └─────────────┘    └─────────────┘    └─────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │ SQLite +    │    │ SQLite +    │    │ SQLite +    │
    │ LanceDB     │    │ LanceDB     │    │ LanceDB     │
    │ (agent)     │    │ (computer)  │    │ (lume)      │
    └─────────────┘    └─────────────┘    └─────────────┘

Each component gets its own databases:

code_index_{component}.sqlite - FTS5 full-text search
code_index_{component}.lancedb/ - Vector embeddings for semantic search
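
Steps 2 and 3 of the diagram, plus the per-component file naming, can be sketched as follows. This is a minimal illustration, not the code in modal_app.py: the tag pattern is inferred from the example tags above, and the function names `group_tags_by_component` and `shard_paths` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed tag shape, inferred from the examples above: "<component>-v<version>".
TAG_PATTERN = re.compile(r"^(?P<component>.+)-v(?P<version>\d+(?:\.\d+)*)$")


def group_tags_by_component(tags):
    """Map each component name to the versions found in its tags."""
    groups = defaultdict(list)
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match is None:
            continue  # Skip tags that do not follow the assumed naming scheme.
        groups[match["component"]].append(match["version"])
    return dict(groups)


def shard_paths(component):
    """Return the SQLite and LanceDB names used for one component shard."""
    return (f"code_index_{component}.sqlite", f"code_index_{component}.lancedb")


tags = ["agent-v0.7.3", "agent-v0.7.2", "computer-v0.5.0", "not-a-release"]
groups = group_tags_by_component(tags)
```

Each key of `groups` would then become one unit of work for a parallel worker, which writes to the two paths returned by `shard_paths`.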

Running Code Indexing

# Run parallel code indexing (default)
modal run docs/scripts/modal_app.py --code-only

# Run in detached mode to monitor via dashboard
modal run --detach docs/scripts/modal_app.py --code-only

# Run sequential (legacy) mode
modal run docs/scripts/modal_app.py --code-only --no-parallel

# Skip code indexing, only crawl docs
modal run docs/scripts/modal_app.py --skip-code

MCP Server: Querying Sharded Databases

The MCP server automatically discovers and queries across all component databases:

SQLite Queries - Uses ATTACH DATABASE to create a unified view:

-- Runs against the unified view over ALL component databases;
-- the WHERE clause then narrows the result to one component and version
SELECT component, version, file_path
FROM code_files
WHERE component = 'agent' AND version = '0.7.3';

-- Full-text search across all components
SELECT * FROM code_files_fts
WHERE code_files_fts MATCH 'ComputerAgent';
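
One way to build such a unified view with ATTACH DATABASE is sketched below using Python's sqlite3 module. It is an assumption about the approach, not the server's actual code: the helper names, the shard aliases, and the sample file paths are hypothetical, and the stand-in shards carry only the three columns the query needs. Note that SQLite limits the number of attached databases (10 by default).

```python
import sqlite3
import tempfile
from pathlib import Path


def make_shard(path, component, rows):
    """Create a tiny stand-in shard with the code_files columns used here."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE code_files (component TEXT, version TEXT, file_path TEXT)"
    )
    conn.executemany(
        "INSERT INTO code_files VALUES (?, ?, ?)",
        [(component, version, file_path) for version, file_path in rows],
    )
    conn.commit()
    conn.close()


def open_unified(shard_files):
    """Attach every shard and expose them through one TEMP view."""
    conn = sqlite3.connect(":memory:")
    selects = []
    for i, path in enumerate(shard_files):
        alias = f"shard_{i}"
        conn.execute(f"ATTACH DATABASE ? AS {alias}", (str(path),))
        selects.append(
            f"SELECT component, version, file_path FROM {alias}.code_files"
        )
    # A TEMP view is used because only temporary views may span attached databases.
    conn.execute("CREATE TEMP VIEW code_files AS " + " UNION ALL ".join(selects))
    return conn


with tempfile.TemporaryDirectory() as tmp:
    agent_db = Path(tmp) / "code_index_agent.sqlite"
    computer_db = Path(tmp) / "code_index_computer.sqlite"
    make_shard(agent_db, "agent", [("0.7.3", "agent/loop.py")])
    make_shard(computer_db, "computer", [("0.5.0", "computer/screen.py")])

    conn = open_unified([agent_db, computer_db])
    rows = conn.execute(
        "SELECT component, version, file_path FROM code_files "
        "WHERE component = 'agent' AND version = '0.7.3'"
    ).fetchall()
    total = conn.execute("SELECT COUNT(*) FROM code_files").fetchone()[0]
    conn.close()
```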

Vector Search - Queries all LanceDBs and merges results by similarity:

# Searches all component databases, returns top results
query_code_vectors("screenshot capture implementation", limit=10)

# Search specific component only
query_code_vectors("agent loop", component="agent", limit=10)
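
The merge step can be illustrated without a vector database. The sketch below assumes each shard search returns hits carrying a `distance` field where smaller means more similar; that field name, the `merge_by_distance` helper, and the sample hits are hypothetical, and the real `query_code_vectors` may work differently.

```python
import heapq


def merge_by_distance(per_shard_hits, limit=10):
    """Merge per-shard hit lists into one list ordered by ascending distance."""
    all_hits = (hit for hits in per_shard_hits.values() for hit in hits)
    return heapq.nsmallest(limit, all_hits, key=lambda hit: hit["distance"])


# Made-up results standing in for one search per component database.
per_shard_hits = {
    "agent": [
        {"file_path": "agent/loop.py", "distance": 0.12},
        {"file_path": "agent/tools.py", "distance": 0.40},
    ],
    "computer": [
        {"file_path": "computer/screen.py", "distance": 0.08},
    ],
}
top = merge_by_distance(per_shard_hits, limit=2)
```

Because every shard is embedded with the same model, distances from different shards are comparable, which is what makes a global sort meaningful.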

Database Schema

SQLite Table: code_files

Column	Type	Description
id	INTEGER	Primary key
component	TEXT	Component name (agent, computer, etc.)
version	TEXT	Version string (e.g., 0.7.3)
file_path	TEXT	Path within repository
content	TEXT	Full source code
language	TEXT	Source language (python, typescript, or javascript)
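
A minimal sketch of this table in SQLite, paired with an FTS5 index, is shown below. The column names come from the table above; the NOT NULL constraints and the external-content layout of `code_files_fts` are assumptions, and the sample row is made up. It requires a Python build whose SQLite includes FTS5.

```python
import sqlite3

SCHEMA = """
CREATE TABLE code_files (
    id        INTEGER PRIMARY KEY,
    component TEXT NOT NULL,
    version   TEXT NOT NULL,
    file_path TEXT NOT NULL,
    content   TEXT NOT NULL,
    language  TEXT NOT NULL
);
-- Assumed layout: an external-content FTS5 table indexing the content column.
CREATE VIRTUAL TABLE code_files_fts USING fts5(
    content, content='code_files', content_rowid='id'
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

source = "class ComputerAgent: pass"
cur = conn.execute(
    "INSERT INTO code_files (component, version, file_path, content, language) "
    "VALUES (?, ?, ?, ?, ?)",
    ("agent", "0.7.3", "agent/example.py", source, "python"),
)
# External-content tables are not updated automatically; mirror the row by hand.
conn.execute(
    "INSERT INTO code_files_fts (rowid, content) VALUES (?, ?)",
    (cur.lastrowid, source),
)

matches = conn.execute(
    "SELECT f.file_path FROM code_files_fts "
    "JOIN code_files AS f ON f.id = code_files_fts.rowid "
    "WHERE code_files_fts MATCH 'ComputerAgent'"
).fetchall()
```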

LanceDB Schema: code

Column	Type	Description
text	TEXT	Source code (embedded)
vector	VECTOR(384)	all-MiniLM-L6-v2 embedding
component	TEXT	Component name
version	TEXT	Version string
file_path	TEXT	Path within repository
language	TEXT	Programming language
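
The shape of one row can be expressed as a plain dataclass, which is handy for validating records before they are written. This is only an illustration of the schema above: the `CodeRecord` class is hypothetical, and the zero vector stands in for a real all-MiniLM-L6-v2 embedding.

```python
from dataclasses import asdict, dataclass

EMBEDDING_DIM = 384  # Vector width listed in the schema above.


@dataclass
class CodeRecord:
    text: str
    vector: list
    component: str
    version: str
    file_path: str
    language: str

    def __post_init__(self):
        # Reject vectors whose width does not match the schema.
        if len(self.vector) != EMBEDDING_DIM:
            raise ValueError(
                f"expected {EMBEDDING_DIM} dimensions, got {len(self.vector)}"
            )


record = CodeRecord(
    text="def capture(): ...",
    vector=[0.0] * EMBEDDING_DIM,  # placeholder, not a real embedding
    component="computer",
    version="0.5.0",
    file_path="computer/screen.py",
    language="python",
)
row = asdict(record)  # plain dict, one key per column
```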

Size Limits

SQLite: Files up to 1 MB are indexed
LanceDB: Files up to 100 KB are embedded; larger files are not embedded
File types: .py, .ts, .tsx, .js
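
These rules amount to a small decision function. The sketch below is an illustration under stated assumptions: the byte thresholds treat MB and KB as decimal units, which the README does not specify, and `indexing_plan` is a hypothetical name.

```python
from pathlib import PurePosixPath

SQLITE_MAX_BYTES = 1_000_000   # 1 MB limit; decimal units are an assumption
LANCEDB_MAX_BYTES = 100_000    # 100 KB limit; same caveat
INDEXED_SUFFIXES = {".py", ".ts", ".tsx", ".js"}


def indexing_plan(file_path, size_bytes):
    """Decide which stores a file goes into, based on its type and size."""
    suffix = PurePosixPath(file_path).suffix
    if suffix not in INDEXED_SUFFIXES or size_bytes > SQLITE_MAX_BYTES:
        return {"sqlite": False, "lancedb": False}
    # Within the SQLite limit: always full-text indexed, embedded only if small.
    return {"sqlite": True, "lancedb": size_bytes <= LANCEDB_MAX_BYTES}


small = indexing_plan("agent/loop.py", 50_000)
medium = indexing_plan("web/app.tsx", 500_000)
other = indexing_plan("lume/main.swift", 10)
```

Here `medium` is searchable by full text but has no embedding, so it would only surface through SQLite queries.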

Scheduled Indexing

Code indexing runs daily at 5 AM UTC, one hour before the docs crawl at 6 AM UTC:

# View scheduled runs
modal app show cua-docs-mcp
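
To reason about when the two daily jobs next fire relative to each other, the schedule can be reproduced with plain datetime arithmetic. This is a local sketch only; it does not use Modal's scheduler, and `next_daily_run` is a hypothetical helper.

```python
from datetime import datetime, time, timedelta, timezone


def next_daily_run(now, hour_utc):
    """Return the next time a daily job scheduled at hour_utc:00 UTC fires."""
    today_utc = now.astimezone(timezone.utc).date()
    candidate = datetime.combine(today_utc, time(hour_utc), tzinfo=timezone.utc)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has already passed
    return candidate


# Example moment between the two jobs: 05:30 UTC.
now = datetime(2025, 1, 1, 5, 30, tzinfo=timezone.utc)
next_code_index = next_daily_run(now, 5)  # code indexing slot
next_docs_crawl = next_daily_run(now, 6)  # docs crawl slot
```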
