GitHub - TsinghuaDatabaseGroup/OpenSQL: Data-Efficient Text-to-SQL for Open-Source LLMs via Synthesized Intermediate Supervision

A modular text-to-SQL project with a full pipeline for:

Dataset preprocessing
SQL data augmentation
Model training
Multi-stage inference

This repository is organized by pipeline stage, and this root README serves as a navigation guide. For detailed usage and arguments, see the README in each subdirectory.

🚧 Note: This repository is under construction. 🚧

Subdirectories

Preprocessing: preprocess/README.md
Data augmentation: data_augment/README.md
Training: training/README.md
Inference: infer/README.md
Artifacts layout: artifacts/README.md

Repository Layout

├── preprocess/      # dataset prep, IR generation, schema input construction
├── data_augment/    # schema linking / SQL augmentation / pairwise CoT annotation
├── training/        # stage-wise training entrypoints and launch scripts
├── infer/           # unified inference pipeline
├── schema_utils/    # IR -> schema rendering utilities
├── value_index/     # value embedding and vector index utilities
├── artifacts/       # intermediate files and generated artifacts
└── dataset/         # benchmark datasets (can be a symlink)
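Since several stages read from or write to these directories, it can help to confirm that a checkout matches the layout before running anything. The following is a minimal sketch, not part of the repository: the helper name `missing_dirs` is hypothetical, and the directory names are copied from the listing above.

```python
from pathlib import Path

# Directory names copied from the layout listing above.
EXPECTED_DIRS = [
    "preprocess", "data_augment", "training", "infer",
    "schema_utils", "value_index", "artifacts", "dataset",
]


def missing_dirs(repo_root):
    """Return the expected top-level directories that are absent under repo_root."""
    root = Path(repo_root)
    # Path.is_dir() follows symlinks, so a symlinked dataset/ counts as present.
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs(".")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("All expected directories are present.")
```

Run it from the repository root. Directories such as `artifacts/` and `dataset/` may legitimately be absent in a fresh clone if they are created or linked locally.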

Environment Setup

Because vLLM and the training stack have different dependency constraints, we recommend a separate environment for each.

Training Environment

conda create -n train_env python=3.12
conda activate train_env
pip install -r requirements-train.txt
conda install ninja
MAX_JOBS=64 pip install flash-attn --no-build-isolation

Inference Environment

conda create -n eval_env python=3.12
conda activate eval_env
pip install -r requirements-eval.txt

Workflow

Prepare datasets

python preprocess/prepare_datasets.py --all

Build IR / index / schema input

python preprocess/schema_to_ir.py --all
python preprocess/build_value_index.py --all --device cuda:0
python preprocess/schema_input.py --bench Spider_dev --device cuda:0
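The three commands above can also be chained from a small driver that stops at the first failure. This is a sketch under assumptions: the function names `preprocess_commands` and `run_all` are hypothetical, the script paths and flags are copied verbatim from the commands above, and running them in the listed order is assumed rather than documented here.

```python
import subprocess
import sys


def preprocess_commands(bench="Spider_dev", device="cuda:0"):
    """Return the three preprocessing commands, in the order listed above."""
    py = sys.executable  # use the interpreter of the active environment
    return [
        [py, "preprocess/schema_to_ir.py", "--all"],
        [py, "preprocess/build_value_index.py", "--all", "--device", device],
        [py, "preprocess/schema_input.py", "--bench", bench, "--device", device],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it unless dry_run is set."""
    for cmd in commands:
        print("+", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps do not run after a failed one.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands unless --run is passed explicitly.
    run_all(preprocess_commands(), dry_run="--run" not in sys.argv)
```

Defaulting to a dry run keeps the sketch safe to try: it only prints the commands unless `--run` is given. See preprocess/README.md for the authoritative arguments.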

Generate training data

python data_augment/schema_linking_augment.py

Train stage models

bash training/train.sh global-sft <lr> <epoch> <model_name_or_path>
bash training/train.sh local-linker <lr> <epoch> <model_name_or_path>
bash training/train.sh generator <lr> <epoch> <model_name_or_path>
bash training/train.sh selector <lr> <epoch> <model_name_or_path>
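All four invocations share the same argument shape, so they can be generated rather than typed by hand. The sketch below only builds and prints the command lines. The helper `train_commands` is hypothetical, the learning rate, epoch count, and model path in the example are placeholder values, and whether every stage should use the same settings is not stated here, so check training/README.md before relying on it.

```python
import shlex

# Stage names copied from the train.sh invocations above.
STAGES = ["global-sft", "local-linker", "generator", "selector"]


def train_commands(lr, epoch, model_name_or_path, stages=STAGES):
    """Build one `bash training/train.sh ...` argument list per stage."""
    return [
        ["bash", "training/train.sh", stage, str(lr), str(epoch), model_name_or_path]
        for stage in stages
    ]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    for cmd in train_commands("1e-5", 3, "path/to/base-model"):
        print(shlex.join(cmd))
```

Passing the learning rate as a string keeps it exactly as written on the command line. To use different settings per stage, call `train_commands` once per stage with a single-element `stages` list.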

Run unified inference

python infer/inference.py --help
# or
bash infer/start_pipeline.sh <BENCHMARK> <SETTING>

