A modular text-to-SQL project with a full pipeline for:
- Dataset preprocessing
- SQL data augmentation
- Model training
- Multi-stage inference
This repository is organized by pipeline stage. This root README serves as a navigation guide; for detailed usage and arguments, see the README in each subdirectory.
🚧 Note: This repository is under construction. 🚧
- Preprocessing: preprocess/README.md
- Data augmentation: data_augment/README.md
- Training: training/README.md
- Inference: infer/README.md
- Artifacts layout: artifacts/README.md
├── preprocess/ # dataset prep, IR generation, schema input construction
├── data_augment/ # schema linking / SQL augmentation / pairwise CoT annotation
├── training/ # stage-wise training entrypoints and launch scripts
├── infer/ # unified inference pipeline
├── schema_utils/ # IR -> schema rendering utilities
├── value_index/ # value embedding and vector index utilities
├── artifacts/ # intermediate files and generated artifacts
└── dataset/ # benchmark datasets (can be a symlink)
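As a quick sanity check on a fresh clone, the layout above can be verified with a few lines of Python. This is a minimal sketch and not part of the repository: `missing_stage_dirs` is a hypothetical helper, and the directory names are copied from the tree above.

```python
from pathlib import Path

# Top-level directories listed in the layout above.
EXPECTED_DIRS = (
    "preprocess",
    "data_augment",
    "training",
    "infer",
    "schema_utils",
    "value_index",
    "artifacts",
    "dataset",
)


def missing_stage_dirs(repo_root="."):
    """Return the expected directories that are absent under repo_root.

    Path.is_dir() follows symlinks, so a symlinked dataset/ counts as present.
    """
    root = Path(repo_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_stage_dirs()
    print("missing:", ", ".join(missing) if missing else "none")
```

Run from the repository root, this prints the directories that still need to be created or linked, for example `dataset/` before the benchmarks are in place.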
Because vLLM and training stacks have different dependency constraints, we recommend separate environments.
conda create -n train_env python=3.12
conda activate train_env
pip install -r requirements-train.txt
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation

conda create -n eval_env python=3.12
conda activate eval_env
pip install -r requirements-eval.txt

- Prepare datasets
python preprocess/prepare_datasets.py --all
- Build IR / index / schema input
python preprocess/schema_to_ir.py --all
python preprocess/build_value_index.py --all --device cuda:0
python preprocess/schema_input.py --bench Spider_dev --device cuda:0
- Generate training data
python data_augment/schema_linking_augment.py
- Train stage models
bash training/train.sh global-sft <lr> <epoch> <model_name_or_path>
bash training/train.sh local-linker <lr> <epoch> <model_name_or_path>
bash training/train.sh generator <lr> <epoch> <model_name_or_path>
bash training/train.sh selector <lr> <epoch> <model_name_or_path>
- Run unified inference
python infer/inference.py --help
# or
bash infer/start_pipeline.sh <BENCHMARK> <SETTING>
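
The steps above can also be chained from a single script. The following is a minimal sketch under stated assumptions, not a script shipped with this repository: `quickstart_commands` and `run_all` are hypothetical helpers, the commands and flags are copied from the steps above, and the step order simply follows the list. The learning rate, epoch count, model path, benchmark, and setting are left as parameters because this README does not fix their values. The sketch does not switch between `train_env` and `eval_env`, so in practice the steps would likely be run separately inside each environment.

```python
import shlex
import subprocess

# Stage names accepted by training/train.sh, in the order listed above.
TRAIN_STAGES = ("global-sft", "local-linker", "generator", "selector")


def quickstart_commands(lr, epoch, model, benchmark, setting,
                        bench="Spider_dev", device="cuda:0"):
    """Return the quickstart steps as argv lists, in README order."""
    cmds = [
        ["python", "preprocess/prepare_datasets.py", "--all"],
        ["python", "preprocess/schema_to_ir.py", "--all"],
        ["python", "preprocess/build_value_index.py", "--all", "--device", device],
        ["python", "preprocess/schema_input.py", "--bench", bench, "--device", device],
        ["python", "data_augment/schema_linking_augment.py"],
    ]
    # One training run per stage, all sharing the same hyperparameters here
    # (an assumption; each stage may well need its own values).
    cmds += [
        ["bash", "training/train.sh", stage, str(lr), str(epoch), model]
        for stage in TRAIN_STAGES
    ]
    cmds.append(["bash", "infer/start_pipeline.sh", benchmark, setting])
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in cmds:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)
```

Calling `run_all(quickstart_commands(...))` with the default `dry_run=True` only prints the commands, which makes it easy to review them before launching long-running jobs.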