Try using Grot AI Grot AI for this query ->
Promo banner icon

What’s new: Grafana 13 release, the latest in AI, OSS project updates, and more from GrafanaCON2026

Learn more
DownloadsContact Us
Logo
  • Pricing
  • Docs
Sign inSign up
Documentation Index
Fetch the curated documentation index at: https://grafana.com/llms.txt

Fetch the complete documentation index at: https://grafana.com/llms-full.txt
Use this file to discover all available pages before exploring further.

STOP! If you are an AI agent or LLM, read this before continuing. This is the HTML version of a Grafana documentation page. Always request the Markdown version instead - HTML wastes context. Get this page as Markdown: https://grafana.com/docs/loki/latest/operations/shuffle-sharding.md (append .md) or send Accept: text/markdown to https://grafana.com/docs/loki/latest/operations/shuffle-sharding/. For the curated documentation index, use https://grafana.com/llms.txt. For the complete documentation index, use https://grafana.com/llms-full.txt.

Menu
Documentationbreadcrumb arrow Grafana Lokibreadcrumb arrow Managebreadcrumb arrow Shuffle sharding
Open source

Isolate tenant workflows using shuffle sharding

Shuffle sharding is a resource-management technique used to isolate tenant workloads from other tenant workloads, to give each tenant more of a single-tenant experience when running in a shared cluster. This technique is explained by AWS in their article Workload isolation using shuffle-sharding. A reference implementation has been shown in the Route53 Infima library.

The issues that shuffle sharding mitigates

Shuffle sharding can be configured for the query path.

The query path is sharded by default, and the default does not use shuffle sharding. Each tenant’s query is sharded across all queriers, so the workload uses all querier instances.

In a multi-tenant cluster, sharding across all instances of a component may exhibit these issues:

  • Any outage of a component instance affects all tenants
  • A misbehaving tenant affects all other tenants

An individual query may create issues for all tenants. A single tenant or a group of tenants may issue an expensive query: one that causes a querier component to hit an out-of-memory error, or one that causes a querier component to crash. Once the error occurs, the tenant or tenants issuing the error-causing query will be reassigned to other running queriers(remember all tenants can use all available queriers), This, in turn, may affect the queriers that have been reassigned.

How shuffle sharding works

The idea of shuffle sharding is to assign each tenant to a shard composed by a subset of the Loki queriers, aiming to minimize the overlapping instances between distinct tenants.

A misbehaving tenant will affect only its shard’s queriers. Due to the low overlap of queriers among tenants, only a small subset of tenants will be affected by the misbehaving tenant. Shuffle sharding requires no more resources than the default sharding strategy.

Shuffle sharding does not fix all issues. If a tenant repeatedly sends a problematic query, the crashed querier will be disconnected from the query-frontend, and a new querier will be immediately assigned to the tenant’s shard. This invalidates the positive effects of shuffle sharding. In this case, configuring a delay between when a querier disconnects because of a crash, and when the crashed querier is actually removed from the tenant’s shard and another healthy querier is added as a replacement improves the situation. A delay of 1 minute may be a reasonable value in the query-frontend with configuration parameter -query-frontend.querier-forget-delay=1m, and in the query-scheduler with configuration parameter -query-scheduler.querier-forget-delay=1m.

Low probability of overlapping instances

If an example Loki cluster runs 50 queriers and assigns each tenant 4 out of 50 queriers, shuffling instances between each tenant, there are 230K possible combinations.

Statistically, randomly picking two distinct tenants, there is:

  • a 71% chance that they will not share any instance
  • a 26% chance that they will share only 1 instance
  • a 2.7% chance that they will share 2 instances
  • a 0.08% chance that they will share 3 instances
  • only a 0.0004% chance that their instances will fully overlap

overlapping instances probability

Configuration

Enable shuffle sharding by setting -frontend.max-queriers-per-tenant to a value higher than 0 and lower than the number of available queriers. The value of the per-tenant configuration max_queriers_per_tenant sets the quantity of allocated queriers. This option is only available when using the query-frontend, with or without a scheduler.

The per-tenant configuration parameter max_query_parallelism describes how many sub queries, after query splitting and query sharding, can be scheduled to run at the same time for each request of any tenant.

Configuration parameter querier.concurrency controls the quantity of worker threads (goroutines) per single querier.

The maximum number of queriers can be overridden on a per-tenant basis in the limits overrides configuration by max_queriers_per_tenant.

Shuffle sharding metrics

These metrics reveal information relevant to shuffle sharding:

  • the overall query-scheduler queue duration, loki_query_scheduler_queue_duration_seconds_*

  • the query-scheduler queue length per tenant, loki_query_scheduler_queue_length

  • the query-scheduler queue duration per tenant can be found with this query:

    max_over_time({cluster="$cluster",container="query-frontend", namespace="$namespace"} |= "metrics.go" |logfmt | unwrap duration(queue_time) | __error__="" [5m]) by (org_id)

Too many spikes in any of these metrics may imply:

  • A particular tenant is trying to use more query resources than they were allocated.
  • That tenant may need an increase in the value of max_queriers_per_tenant.
  • Loki instances may be under provisioned.

A useful query checks how many queriers are being used by each tenant:

count by (org_id) (sum by (org_id, pod) (count_over_time({job="$namespace/querier", cluster="$cluster"} |= "metrics.go" | logfmt [$__interval])))

Was this page helpful?

Suggest an edit in GitHub
Create a GitHub issue
Email docs@grafana.com
Help and support
Community

Related resources from Grafana Labs

Additional helpful documentation, links, and articles:
webinar icon
Video
Getting started with logging and Grafana Loki
Getting started with logging and Grafana Loki
See a demo of the updated features in Loki, and how to create metrics from logs and alert on your logs with powerful Prometheus-style alerting rules.
video icon
Video
Essential Grafana Loki configuration settings
Essential Grafana Loki configuration settings
This webinar focuses on Grafana Loki configuration including agents Promtail and Docker; the Loki server; and Loki storage for popular backends.
video icon
Video
Scaling and securing your logs with Grafana Loki
Scaling and securing your logs with Grafana Loki
This webinar covers the challenges of scaling and securing logs, and how Grafana Cloud Logs powered by Grafana Loki can help, cost-effectively.
Technical documentation Plugin catalog
Choose a product
Viewing: v3.7.x (latest) Find another version
  • Grafana Loki
  • Release notes
    • Release cadence
    • v3.7
    • v3.6
    • v3.5
    • v3.4
    • v3.3
    • v3.2
    • v3.1
    • v3.0
    • V2.9
    • V2.8
    • V2.7
    • V2.6
    • V2.5
    • V2.4
    • V2.3
  • Get started
    • Loki overview
    • Quick Start
      • Loki quickstart
      • Loki Tutorial
    • Architecture
    • Components
    • Deployment modes
    • Labels
      • Label best practices
      • Cardinality
      • Structured metadata
      • Modify default labels
    • Hash rings
  • Set up
    • Size the cluster
    • Install
      • Install using Helm
        • Helm chart components
        • Install monolithic Loki
        • Install microservice Loki
        • Install scalable Loki
        • Cloud Deployment Guides
          • Deploy on AWS
          • Deploy on Azure
          • Deploy on GCP
        • Configure storage
        • Helm chart values
        • Monitoring
      • Install using Tanka
      • Install using Docker
      • Install locally
      • Install on Istio
      • Install from source
    • Migrate
      • Migrate to Alloy
      • Migrate from SSD to distributed
      • Migrate to TSDB
      • Migrate from `loki-distributed`
      • Migrate to three targets
      • Migrate to Thanos storage clients
    • Upgrade
      • Upgrade the Helm chart to 3.0
      • Upgrade the Helm chart to 6.0
      • Upgrade to Community Helm chart
  • Configure
    • Best practices
    • Storage
    • Usage statistics
    • Examples
      • Configuration
      • Thanos storage examples
      • Query frontend example
  • Send data
    • Grafana Alloy
      • Sending Logs to Loki via Kafka using Alloy
      • Sending OpenTelemetry logs to Loki using Alloy
    • OpenTelemetry
      • Native OTLP endpoint vs Loki Exporter
      • OTel Collector tutorial
    • Kubernetes Monitoring Helm
    • Promtail
    • Docker driver
      • Configure Docker driver
    • Fluent Bit
      • Fluent Bit tutorial
      • Fluent Bit Community Plugin
      • Fluent Bit
    • Fluentd
    • Lambda Promtail
    • Logstash plugin
    • k6 load testing
      • Log generation
      • Write path testing
      • Query testing
  • Query
    • Query best practices
    • LogQL simulator
    • Log queries
    • Metric queries
    • Template functions
    • LogCLI
      • Getting started
      • LogCLI tutorial
    • Matching IP addresses
    • Query examples
    • Query acceleration
    • Query Reference
    • Troubleshoot queries
  • Visualize
  • Alert
  • Manage
    • Loki Canary
    • Block unwanted queries
    • Caching
    • Rate limits
    • Query fairness
    • Shuffle sharding
    • Monitor Loki
      • Deploy Loki Meta Monitoring
      • Install Mixins
      • Single Binary Meta Monitoring
      • Key Metrics
    • Troubleshooting
      • Troubleshoot operations
      • Troubleshoot ingestion
      • Troubleshoot drilldown
      • Troubleshoot queries
    • Authentication
    • Bloom filters
    • Automatic stream sharding
    • Scale Loki
    • Recording rules
    • Storage
      • TSDB
      • BoltDB Shipper
      • Filesystem object store
      • Storage schema
      • Write Ahead Log
      • Log retention
      • Log entry deletion
      • Horizontal scaling of Compactor
      • Legacy storage
      • Table manager
    • Multi-tenancy
    • Autoscaling queriers
    • Upgrade
    • Overrides Exporter
    • Zone aware ingesters
  • Reference
    • Loki configuration reference
    • Loki HTTP API
    • Python examples
  • Community
    • Contacting the Loki Team
    • Contributing to Loki
    • Governance
    • Maintaining
      • Releasing Grafana Loki
        • Backport commits
        • Create Release Branch
        • Document metrics and configurations changes
        • Merge Release PR
        • Patch Go version
        • Patch vulnerabilities
        • Prepare Major Release
        • Prepare Release
        • Prepare Upgrade guide
        • Update version numbers
        • Version
      • Releasing Loki Build Image
    • Loki Improvement Documents (LIDs)
      • 0001: Introducing LIDs
      • 0002: Remote Rule Evaluation
      • 0003: Query fairness across users within tenants
      • 0004: Index Gateway Sharding
      • 0005: Loki mixin configuration improvements
      • 0006: Expose Split Logic in API
    • Design documents
      • Labels
      • Promtail Push API
      • Write-Ahead Logs
      • Ordering Constraint Removal
  • Copyright notice
Scroll for more
Read this in Grafana Cloud

Is this page helpful?

On this page
  • The issues that shuffle sharding mitigates
  • How shuffle sharding works
  • Low probability of overlapping instances
  • Configuration
  • Shuffle sharding metrics
Scroll for more

Still have questions?

Ask your questions. Let AI do the heavy lifting.

Ask AI icon
Newsletter icon

Get every update

Subscribe to our newsletter

By submitting, you agree to our Privacy policy

Grafana Cloud

  • Overview
  • Pricing
  • What's in the free tier?
  • AI Assistant
  • Application Observability
  • Kubernetes Monitoring
  • Dashboards & Visualization
  • Database Observability
  • Frontend Observability
  • Synthetic Monitoring
  • Performance & Load Testing
  • Incident Response & Management
  • What’s New
  • Grafana Cloud Status

Solutions

  • AI Observability
  • Full-Stack Observability
  • Infrastructure & Cloud Observability
  • Digital Experience Monitoring
  • Scaled Prometheus
  • Cost Management & Optimization
  • Site Reliability
  • Log Management
  • Migrate to OpenTelemetry

Integrations

  • All Integrations
  • All Plugins
  • AWS
  • Google Cloud
  • Microsoft Azure
  • Kubernetes
  • Datadog
  • New Relic

Open Source

  • Our Projects
  • GitHub
  • Downloads
  • Dashboard Templates

Learn

  • Documentation
  • Blog
  • Community
  • Events
  • Observability Survey & Reports

Company

  • About Grafana Labs
  • Careers
  • Partnerships
  • Newsroom
  • Success Stories
  • Contact Us
  • Getting Help
  • Professional Services
  • Hey AI

Compare

  • Datadog vs. Grafana Cloud
  • Dynatrace vs. Grafana Cloud
  • Elasticsearch vs. Grafana Cloud
  • New Relic vs. Grafana Cloud
  • PagerDuty vs. Grafana Cloud
  • Splunk vs. Grafana Cloud
Grafana Labs x unique logomark

Donut take our word for it. Try Grafana Cloud today.

Grafana Cloud StatusLegal & SecurityTerms of ServicePrivacy PolicyTrademark Policy

Copyright 2026 © Grafana Labs

FacebookXLinkedinGithubYoutubeReddit
Grafana Labs uses cookies for the normal operation of this website. Learn more.