kanchengw / cnllm

Unified Python library for Chinese LLMs, with flexible batch capacity, feedback on vendor-native parameter validation, and structured overview and automated accumulation for streaming.

Updated May 29, 2026
Python

bansalkanav / GenAI-AgenticAI-From-Zero-to-Production

Learn GenAI and Agentic AI from Zero to Production

Updated May 10, 2026
Jupyter Notebook

avnlp / rag-pipelines

Advanced RAG pipelines for medical (HealthBench, MedCaseReasoning, MetaMedQA, PubMedQA) and financial (FinanceBench, Earnings Calls) QA. LangGraph orchestration + BAML structured generation, Milvus Hybrid search (Dense + BM25 + RRF), three-layer Metadata Enrichment, Contextual AI instruction-following reranker, and DeepEval evaluation.

pubmed unstructured rag baml milvus earnings-calls contextual-ai llm langgraph rag-pipeline agentic-rag deepeval financebench healthbench

Updated Jun 3, 2026
Python

avnlp / dspy-opt

Advanced RAG pipeline optimization framework using DSPy. Implements modular RAG pipelines with Query-Rewriting, Sub-Query Decomposition, and Hybrid Search via Weaviate. Automates prompt tuning and few-shot selection using GEPA, SIMBA, MIPRO, COPRO, and BootstrapFewShot optimizers on datasets like FreshQA, HotpotQA, TriviaQA, Wikipedia and PubMedQA.

metadata-extraction query-rewriting rag weaviate dspy rag-pipeline deepeval sub-query-generation

Updated Jun 8, 2026
Python

avnlp / biothink

Self-Reflective Question Answering for Biomedical Reasoning. GRPO fine-tuning via QLoRA & Unsloth with rewards for correctness, relevance, groundedness, utility & XML structure. Structured think → answer → self-reflection with context grading, relevance assessment & groundedness evaluation. DeepEval LLM-as-a-Judge (GEval, Faithfulness, Relevancy).

self-reflection rag biomedical-question-answering self-rag grpo deepeval

Updated Jun 3, 2026
Python

MERakram / Advanced-RAG-monorepo

🚀 Production-ready modular RAG monorepo: Local LLM inference (vLLM) • Hybrid retrieval with Qdrant • Semantic caching • Docling document parsing • Cross-encoder reranking • DeepEval evaluation • Full observability with Langfuse • Open WebUI chat interface • OpenAI-compatible API • Fully Dockerized

python nlp ai self-hosted reranking rag fastapi vector-database cross-encoder qdrant vllm langfuse open-webui deepeval

Updated Jan 28, 2026
Python

JohnRitchie / qa-llm-guard

python pytest allure testing-framework qa-automation llm-testing deepeval

Updated May 20, 2025
Python

augustineuzokwe / rtia

Multi-agent LangGraph app turning raw requirements into backlog-ready user stories, ACs, and test cases. Live Gemini-Flash CI eval gate + nightly N=10 adversarial regression.

python multi-agent gemini gradio requirements-engineering fastapi ai-qa langchain llm-evaluation langgraph agentic-ai deepeval

Updated Jun 10, 2026
Python

adityapradhan202 / BNS-LexAI

BNS-LexAI is an AI-powered legal information and case understanding assistant.

docker python3 fastapi streamlit generative-ai pineconedb google-ai-studio deepeval

Updated Feb 1, 2026
Jupyter Notebook

dts26 / rag-uigreenmetric

A hybrid RAG system that answers complex questions about the UI GreenMetric Sustainable University Rankings by combining unstructured narrative guidelines with structured tabular data.

nlp rag chromadb retrieval-augmented-generation deepseek deepeval

Updated Jun 3, 2026
Python

avi350751 / test-llm-with-deepeval

A hands-on exploration of DeepEval, an open-source framework for evaluating and red-teaming large language models (LLMs). This repository documents my journey of testing, benchmarking, and improving LLM reliability using custom prompts, metrics, and pipelines.

evals deepeval llmtesting

Updated Nov 2, 2025
Jupyter Notebook

hellolets / letsrag

Step-by-step guide to building a local RAG system from scratch. Learn hybrid search, reranking, HyDE, and evaluation... 100% free, no cloud required.

python semantic-search bm25 reranker rag fastapi hybrid-search llm ollama chonkie deepeval

Updated Mar 1, 2026
Python

gonzaloMorenoc / ai-testing-lab

pytest lab for testing LLMs: RAG eval, red teaming, guardrails, drift monitoring — 14 modules, 382 tests, zero API calls needed

Updated May 13, 2026
Python

mohsinsheikhani / advanced-rag-engineering

Production RAG system in Python: Haystack pipelines, FastAPI SSE streaming, Qdrant hybrid retrieval, OpenAI embeddings, DeepEval golden-set evaluation, and Langfuse tracing. Includes latency benchmarks (P50/P95 TTFT), retrieval failure-mode analysis, and chunking-strategy decision logs.

Updated May 26, 2026
Python

ahmedbutt2015 / deal-agent

Drop in deal documents → get an onboarding plan, draft invoice, and stakeholder summary. Multi-agent LangGraph pipeline with RAG, human approval, and self-correcting retries.

multi-agent openai ai-agents fastapi streamlit document-intelligence langchain llm-agent retrieval-augmented-generation langgraph deepeval

Updated Apr 16, 2026
Python

kothakota-bindu / finsight-ai-testing

Production-grade LLM evaluation pipeline for RAG chatbot — DeepEval + RAGAS + Garak + CI/CD | Financial domain | 7 metrics | Adversarial testing

python pytest fintech llama rag github-actions groq langchain ai-quality llm-evaluation ragas llm-testing deepeval garak

Updated May 6, 2026
Python

gabonavarroo / faultmap

Automatically discover where and why your LLM is failing — embedding-space clustering + statistical hypothesis testing to surface input slices with elevated failure rates and audit test suite coverage gaps.

python testing clustering evaluation embeddings hypothesis-testing observability hdbscan llm litellm ragas deepeval

Updated Apr 15, 2026
Python

rojitharepalle / LLM-Evaluation-Framework

Automated RAG pipeline evaluation framework that scores faithfulness, hallucination rate, and retrieval quality with GitHub Actions CI/CD and React dashboard

react python ai evaluation hallucination rag github-actions fastapi llm langchain ragas deepeval

Updated May 31, 2026
Python

Michelin-Ensimag / AI-Agent-Testing

A research project to measure AI agent robustness. Contains automated testing pipelines and a benchmarking methodology developed to audit Agentic AI architectures for complex reasoning flaws.

python benchmarking mcp ai-agents generative-ai langchain llm-evaluation llm-as-a-judge deepeval

Updated May 21, 2026
Python

SchadenKai / Clinical-RAG

[UNDER DEVELOPMENT] Clinical-RAG is a production-grade, citation-backed AI system designed to bridge the "Trust Gap" in medical information retrieval.

milvus healthcare-ai langchain-python rag-pipeline rag-chatbot langgraph-python deepeval

Updated Mar 14, 2026
Python

deepeval

Here are 76 public repositories matching this topic...
