rag-evaluation

Here are 153 public repositories matching this topic...

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Jun 11, 2026
Python

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated Jun 5, 2026
Python

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

evaluation agents observability prompt-engineering llmops prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability

Updated Jun 10, 2026
TypeScript

onyx-dot-app / EnterpriseRAG-Bench

Dataset and benchmark for RAG on company internal documents.

python enterprise benchmark information-retrieval evaluation dataset question-answering knowledge-base semantic-search enterprise-search synthetic-data rag synthetic-data-generation large-language-models llm generative-ai retrieval-augmented-generation llm-evaluation rag-evaluation

Updated May 8, 2026

frutik / Awesome-RAG

rag rag-implementation rag-evaluation

Updated Sep 7, 2025

vectara / open-rag-eval

RAG evaluation without the need for "golden answers"

metrics evaluation-metrics rag vectara retrieval-augmented-generation rag-evaluation

Updated Jun 2, 2026
Python

LLAMATOR-Core / llamator

Red-teaming Python framework for testing chatbots and GenAI systems.

Updated May 20, 2026
Python

GiovanniPasq / chunky

Open-source toolkit for reliable RAG pipelines: convert PDFs to Markdown, clean documents, inspect chunks, compare chunking strategies, and enrich metadata for LLM applications.

Updated Jun 6, 2026
Python

mburaksayici / RAG-Boilerplate

RAG boilerplate with semantic/propositional chunking, hybrid search (BM25 + dense), LLM reranking, query enhancement agents, CrewAI orchestration, Qdrant vector search, Redis/Mongo sessioning, Celery ingestion pipeline, Gradio UI, and an evaluation suite (Hit-Rate, MRR, hybrid configs).

ai-agents reranking rag vector-database hybrid-search qdrant llm retrieval-augmented-generation rag-evaluation semantic-chunking crewai rag-pipeline propositional-models query-enhancement

Updated Nov 18, 2025
Python

dokimos-dev / dokimos

LLM and agent evaluation for Java & Kotlin. Runs in JUnit and CI. Spring AI, LangChain4j, Koog, Embabel, and any LLM client.

Updated Jun 10, 2026
Java

Vbj1808 / Dokis

Lightweight RAG provenance middleware. Verifies every claim in an LLM response is grounded in a retrieved source - without an LLM call.

python middleware provenance citations developer-tools ai-safety rag guardrails trustworthy-ai llm langchain retrieval-augmented-generation llm-evaluation rag-evaluation hallucination-detection

Updated Apr 28, 2026
Python

mts-ai / rurage

information-retrieval question-answering rag llm-evaluation rag-evaluation

Updated Apr 14, 2025
Python

vero-labs-ai / vero-eval

Open source framework for evaluating AI Agents

python testing evaluation datasets dataset-generation evaluation-metrics evaluation-framework testing-framework testing-library synthetic-dataset-generation user-persona evals llm-evaluation rag-evaluation llm-evaluation-framework langgraph rag-testing

Updated Feb 24, 2026
Python

HZYAI / RagScore

⚡️ The "1-Minute RAG Audit" — Generate QA datasets & evaluate RAG systems in Colab, Jupyter, or CLI. Privacy-first, async, visual reports.

privacy jupyter mcp evaluation colab dataset-generation synthetic-data fine-tuning rag qa-generation ai-evaluation llm llmops local-llm ollama rag-evaluation llm-as-a-judge

Updated May 29, 2026
Python

mburaksayici / smallevals

smallevals — CPU-fast, GPU-blazing fast offline retrieval evaluation for RAG systems with tiny QA models.

qa chroma question-generation weaviate qa-generation milvus vector-database qdrant chromadb rag-evaluation tiny-llm retrieval-evaluation offline-evaluation retrieval-metrics

Updated Dec 4, 2025
Python

Evaliphy / evaliphy

The E2E AI testing tool | No ML Overhead

ai test-automation testing-tools end-to-end-testing test-automation-framework rag ai-testing llm-evaluation rag-evaluation llm-evaluation-toolkit llm-evaluation-framework rag-pipeline llm-testing ai-testing-tool ai-test-automation

Updated May 7, 2026
TypeScript

oztrkoguz / RAG-Framework-Evaluation

This project aims to compare different Retrieval-Augmented Generation (RAG) frameworks in terms of speed and performance.

swarms autogen rag langchain llamaindex rag-evaluation crewai langchain-rag autogen-rag crewai-rag llamaindex-rag swarms-rag

Updated Jul 28, 2024
Python

simranjeet97 / Learn_RAG_from_Scratch_LLM

Learn Retrieval-Augmented Generation (RAG) from scratch using LLMs from Hugging Face with LangChain or plain Python

artificial-intelligence rag datascience-machinelearning generative-ai llm-training retrieval-augmented-generation rag-model llm-framework llm-apps llm-evaluation genai-usecase rag-implementation rag-evaluation rag-embeddings rag-pipeline rag-llm rag-chatbot rag-application genai-domain

Updated Jan 20, 2025
Jupyter Notebook

ioannis-papadimitriou / rag-playground

A framework for systematic evaluation of retrieval strategies and prompt engineering in RAG systems, featuring an interactive chat interface for document analysis.

chatbot qa-generation llm-inference retrieval-augmented-generation rag-evaluation

Updated Dec 18, 2024
Python

lizhiyao / oh-my-knowledge

Evaluation framework for LLM knowledge inputs — prompts, RAG corpora, skills, agent workflows. Fix the model, vary the artifact. Built-in statistical rigor: bootstrap CI, Krippendorff α, length-debias, saturation curves.

benchmark ai evaluation-framework claude knowledge-engineering skill-evaluation llm prompt-engineering prompt-testing llm-evaluation rag-evaluation llm-judge claude-code agent-evaluation bootstrap-ci krippendorff-alpha evaluation-as-code multi-judge-ensemble

Updated Jun 10, 2026
TypeScript

