Scripts¶
The scripts/ directory contains utilities for processing Lean code data, generating embeddings, and building search indices.
Overview¶
See the Scripts README for detailed documentation on all scripts.
Key Scripts¶
Data Processing¶
populate_db.py: Populates the database from Lean declaration filesupdate_primary_declarations.py: Updates primary declaration assignmentspagerank.py: Calculates PageRank scores for declarations
Embeddings and Search¶
prepare_embedding_input.py: Prepares text for embedding generationgenerate_embeddings.py: Generates vector embeddings from textbuild_faiss_index.py: Builds FAISS search index from embeddings
LLM Processing¶
lean_to_english.py: Generates English descriptions using LLMsget_summaries.py: Generates summaries for statement groups
Documentation¶
generate_docs_data.py: Generates documentation datagenerate_manifest.py: Creates manifest files for data toolchains
Pipeline¶
The typical data processing pipeline:
- Extract Lean declarations (using Lean scripts in
extractor/) - Populate database:
populate_db.py - Update primary declarations:
update_primary_declarations.py - Calculate PageRank:
pagerank.py - Generate English descriptions:
lean_to_english.py - Generate summaries:
get_summaries.py - Prepare embedding input:
prepare_embedding_input.py - Generate embeddings:
generate_embeddings.py - Build FAISS index:
build_faiss_index.py - Generate manifest:
generate_manifest.py
Next Steps¶
- Data Pipeline - Detailed pipeline documentation
- Contributing - How to contribute