Unified Python library for Chinese LLMs, with flexible batch capacity, feedback on vendor-native parameter validation, and structured overview and automated accumulation for streaming.
Updated May 29, 2026 - Python
Learn GenAI and Agentic AI from Zero to Production
Advanced RAG pipelines for medical (HealthBench, MedCaseReasoning, MetaMedQA, PubMedQA) and financial (FinanceBench, Earnings Calls) QA. LangGraph orchestration + BAML structured generation, Milvus hybrid search (Dense + BM25 + RRF), three-layer metadata enrichment, Contextual AI instruction-following reranker, and DeepEval evaluation.
Advanced RAG pipeline optimization framework using DSPy. Implements modular RAG pipelines with Query-Rewriting, Sub-Query Decomposition, and Hybrid Search via Weaviate. Automates prompt tuning and few-shot selection using GEPA, SIMBA, MIPRO, COPRO, and BootstrapFewShot optimizers on datasets like FreshQA, HotpotQA, TriviaQA, Wikipedia and PubMedQA.
Self-Reflective Question Answering for Biomedical Reasoning. GRPO fine-tuning via QLoRA & Unsloth with rewards for correctness, relevance, groundedness, utility & XML structure. Structured think → answer → self-reflection with context grading, relevance assessment & groundedness evaluation. DeepEval LLM-as-a-Judge (GEval, Faithfulness, Relevancy).
🚀 Production-ready modular RAG monorepo: Local LLM inference (vLLM) • Hybrid retrieval with Qdrant • Semantic caching • Docling document parsing • Cross-encoder reranking • DeepEval evaluation • Full observability with Langfuse • Open WebUI chat interface • OpenAI-compatible API • Fully Dockerized
Multi-agent LangGraph app turning raw requirements into backlog-ready user stories, ACs, and test cases. Live Gemini-Flash CI eval gate + nightly N=10 adversarial regression.
BNS-LexAI is an AI-powered legal information and case understanding assistant.
A hands-on exploration of DeepEval, an open-source framework for evaluating and red-teaming large language models (LLMs). This repository documents my journey of testing, benchmarking, and improving LLM reliability using custom prompts, metrics, and pipelines.
pytest lab for testing LLMs: RAG eval, red teaming, guardrails, drift monitoring — 14 modules, 382 tests, zero API calls needed
Production RAG system in Python: Haystack pipelines, FastAPI SSE streaming, Qdrant hybrid retrieval, OpenAI embeddings, DeepEval golden-set evaluation, and Langfuse tracing. Includes latency benchmarks (P50/P95 TTFT), retrieval failure-mode analysis, and chunking-strategy decision logs.
Drop in deal documents → get an onboarding plan, draft invoice, and stakeholder summary. Multi-agent LangGraph pipeline with RAG, human approval, and self-correcting retries.
Production-grade LLM evaluation pipeline for a RAG chatbot: DeepEval + RAGAS + Garak + CI/CD | Financial domain | 7 metrics | Adversarial testing
Automatically discover where and why your LLM is failing — embedding-space clustering + statistical hypothesis testing to surface input slices with elevated failure rates and audit test suite coverage gaps.
Automated RAG pipeline evaluation framework that scores faithfulness, hallucination rate, and retrieval quality, with GitHub Actions CI/CD and a React dashboard
A research project to measure AI agent robustness. Contains automated testing pipelines and a benchmarking methodology developed to audit Agentic AI architectures for complex reasoning flaws.
[UNDER DEVELOPMENT] Clinical-RAG is a production-grade, citation-backed AI system designed to bridge the "Trust Gap" in medical information retrieval.