Metrics

DRL exposes a Prometheus-compatible metrics endpoint at /metrics on a configurable port (default 9090). A /health liveness endpoint is available on the same port.

The endpoint uses a dedicated private registry, so by default it exports only DRL-specific metrics and no Go runtime or process metrics.

Endpoint configuration

metrics {
    port 9090
}

Scrape target for Prometheus:

- job_name: drl
  static_configs:
    - targets: ["<drl-host>:9090"]

Metric reference

gRPC

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_grpc_check_total | Counter | - | Total ShouldRateLimit requests received from Envoy. This is the primary throughput signal. |
| drl_grpc_response_code_total | Counter | code | Responses broken down by decision. code values: OK, OVER_LIMIT. Use the ratio OVER_LIMIT / (OK + OVER_LIMIT) as the block rate. |

Key alerts:

  • Sudden drop in rate(drl_grpc_check_total) → DRL is unreachable from Envoy. (The counter itself only increases, so alert on its rate.)
  • Sustained spike in OVER_LIMIT share → active attack or misconfigured rate limit.
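
The two alerts above could be expressed as Prometheus alerting rules along the following lines. This is a sketch rather than a shipped rule file: the group and alert names, the 50% drop factor, the 20% block-rate threshold, the `for` durations, and the severity labels are all assumptions to tune for your own traffic.

```yaml
groups:
  - name: drl-grpc  # hypothetical group name
    rules:
      # Throughput is below half of what it was one hour earlier (assumed factor).
      - alert: DRLCheckRateDropped
        expr: |
          sum(rate(drl_grpc_check_total[5m]))
            < 0.5 * sum(rate(drl_grpc_check_total[5m] offset 1h))
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DRL request rate dropped sharply; check Envoy-to-DRL connectivity"

      # Share of OVER_LIMIT decisions stays above the assumed 20% threshold.
      - alert: DRLHighBlockRate
        expr: |
          sum(rate(drl_grpc_response_code_total{code="OVER_LIMIT"}[5m]))
            / sum(rate(drl_grpc_response_code_total[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sustained high block rate; possible attack or misconfigured limit"
```

A file like this is referenced from the `rule_files` section of prometheus.yml and can be validated with `promtool check rules <file>` before loading.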

Rate limiting

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_ratelimit_blocks_total | Counter | rule_name, reason | Entities newly added to the blocklist. reason values: rate_exceeded, manual. |
| drl_ratelimit_propagation_latency_ms | Histogram | - | Time in milliseconds from the moment a block is decided on the owner node to the moment the gossip event is received cluster-wide. Buckets: 1 ms → ~2 s (exponential ×2, 12 steps). |

Key alerts:

  • drl_ratelimit_propagation_latency_ms p99 > 500 ms → gossip is lagging; check cluster network and drl_membership_cluster_size.
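
A minimal sketch of the propagation-latency alert as a Prometheus rule. The 500 ms threshold comes from the bullet above; the alert name, the 5-minute window, the `for` duration, and the severity label are assumptions.

```yaml
groups:
  - name: drl-ratelimit  # hypothetical group name
    rules:
      # The histogram is recorded in milliseconds, so the threshold is 500, not 0.5.
      - alert: DRLGossipPropagationSlow
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(drl_ratelimit_propagation_latency_ms_bucket[5m]))
          ) > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block propagation p99 above 500 ms; check cluster network and drl_membership_cluster_size"
```

Aggregating with `sum by (le)` yields one cluster-wide quantile; drop the aggregation if you prefer one alert per node.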

Cache

The blocklist cache and the accounting cache both report under the same metric names, distinguished by the cache_type label.

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_cache_hits_total | Counter | cache_type | Successful lookups. |
| drl_cache_misses_total | Counter | cache_type | Failed lookups (entry absent or expired). |
| drl_cache_evictions_total | Counter | cache_type | Entries evicted due to memory pressure (MaximumWeight reached). Non-zero evictions on the blocklist mean blocked entities are being silently dropped; increase blocklist_max_size_mb. |
| drl_cache_memory_bytes | Gauge | cache_type | Estimated current memory consumption reported by the cache (based on the weigher; see Sizing Guide for the relationship to actual RSS). |
| drl_sync_duration_seconds | Histogram | - | Time taken for the initial Push/Pull state sync on node startup. Buckets: 1 ms → ~16 s. High values indicate a large blocklist is being transferred or network latency between nodes is elevated. |

cache_type label values: blocklist, accounting.

Key alerts:

  • drl_cache_evictions_total{cache_type="blocklist"} > 0 → blocklist is full; blocked entities are being silently dropped.
  • drl_cache_memory_bytes approaching max_size_mb × 1,048,576 → consider increasing the configured limit.
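
The cache alerts above might look like this as Prometheus rules. Treat it as a sketch under stated assumptions: the alert names, the 10-minute eviction window, the 90% memory threshold, and the 512 MB limit are illustrative placeholders, and 512 must be replaced with the blocklist size actually configured for your deployment.

```yaml
groups:
  - name: drl-cache  # hypothetical group name
    rules:
      # Any blocklist eviction in the last 10 minutes means blocks were dropped.
      - alert: DRLBlocklistEvictions
        expr: increase(drl_cache_evictions_total{cache_type="blocklist"}[10m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Blocklist is full; blocked entities are being silently dropped"

      # Memory above 90% of the limit: 512 (assumed MB limit) x 1,048,576 bytes.
      - alert: DRLBlocklistNearCapacity
        expr: drl_cache_memory_bytes{cache_type="blocklist"} > 0.9 * 512 * 1048576
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Blocklist memory is approaching its configured limit"
```

Using `increase(...)` over a window rather than the raw counter lets the eviction alert resolve once evictions stop.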

Accounting

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_accounting_local_increments_total | Counter | - | Counter increments processed directly by this node (this node is the consistent-hash owner for the entity). |
| drl_accounting_remote_increments_total | Counter | - | Counter increments forwarded to a remote owner node via UDP batch. |
| drl_accounting_flush_total | Counter | - | Number of UDP CounterBatch flushes sent to peer nodes. Each flush bundles multiple increments. |
| drl_accounting_msg_recv_total | Counter | - | UDP CounterBatch messages received and processed. |
| drl_accounting_bulk_load_total | Counter | result | Entries processed via the private bulk-load API. result values: no_match, accepted_local, accepted_remote, dropped, invalid. |

The ratio remote / (local + remote) reflects how evenly traffic is distributed across the hash ring. In a balanced N-node cluster each node owns about 1/N of the entities, so this ratio should be approximately (N−1) / N, for example about 0.67 with three nodes.

Key alerts:

  • Rate of drl_accounting_remote_increments_total near zero on a multi-node cluster → the hash ring may not have converged; check drl_membership_cluster_size.
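
One way to sketch this alert is to compare each node's remote share against a low floor, but only while the node reports more than one cluster member. The 0.1 floor, the alert name, and the `for` duration are assumptions, and the rule assumes both metrics carry the usual Prometheus `job` and `instance` target labels.

```yaml
groups:
  - name: drl-accounting  # hypothetical group name
    rules:
      # In a balanced cluster the remote share is about (N-1)/N, so a value
      # below the assumed 0.1 floor on a multi-node cluster is suspicious.
      - alert: DRLHashRingNotConverged
        expr: |
          (
            rate(drl_accounting_remote_increments_total[5m])
              / (
                  rate(drl_accounting_local_increments_total[5m])
                + rate(drl_accounting_remote_increments_total[5m])
              )
          ) < 0.1
          and on (job, instance) drl_membership_cluster_size > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Almost no increments forwarded to peers; hash ring may not have converged"
```

A node with no traffic at all produces a 0/0 ratio (NaN), which does not satisfy the comparison, so idle nodes do not trigger the alert.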

Membership

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_membership_cluster_size | Gauge | - | Current number of live members as seen by this node’s memberlist. Expected value equals the number of running DRL instances. |
| drl_membership_events_total | Counter | event_type | Membership change events. event_type values: join, leave, update, reap. |
| drl_membership_reliable_msgs_total | Counter | - | Messages sent via memberlist’s reliable (TCP) transport (used for large payloads such as full-state Push/Pull). |
| drl_membership_best_effort_msgs_total | Counter | - | Messages sent via memberlist’s best-effort (UDP) transport (used for gossip and block broadcast events). |

Key alerts:

  • drl_membership_cluster_size < expected node count → a node has left or become unreachable; block propagation will be incomplete until it recovers or is replaced.
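
A sketch of the cluster-size alert as a Prometheus rule. The expected node count of 3, the alert name, and the `for` duration are placeholders; set the count to the number of DRL instances you actually run.

```yaml
groups:
  - name: drl-membership  # hypothetical group name
    rules:
      # Fires for every node whose memberlist view is smaller than expected.
      # 3 is an assumed expected node count.
      - alert: DRLClusterDegraded
        expr: drl_membership_cluster_size < 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DRL node sees fewer members than expected; block propagation is incomplete"
```

The short `for` window is meant to ride out the brief dip that a rolling restart causes; lengthen it if your restarts take longer.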

Handover

Handover metrics track the graceful counter-migration that happens when a node leaves the cluster, ensuring the departing node’s owned accounting counters are transferred to a new owner before it shuts down.

| Metric | Type | Labels | Description |
|---|---|---|---|
| drl_accounting_handover_out_entities | Counter | - | Entities exported by this node during a handover (leaving-node perspective). |
| drl_accounting_handover_in_entities | Counter | - | Entities received and imported by this node during a handover (adopter-node perspective). |
| drl_accounting_handover_duration_ms | Histogram | - | End-to-end handover time in milliseconds. Buckets: 10 ms → ~20 s. |
| drl_accounting_handover_failed_total | Counter | - | Failed handover attempts. A non-zero value means some counter state was lost during a rolling update. |

Key alerts:

  • drl_accounting_handover_failed_total > 0 → investigate network connectivity between nodes at shutdown time.
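
The handover alert could be sketched as the rule below. The alert name and the 15-minute window are assumptions. Because the departing node shuts down shortly after a failed handover, its final counter value may never be scraped, so this rule is a best-effort signal rather than a guarantee.

```yaml
groups:
  - name: drl-handover  # hypothetical group name
    rules:
      # Any failed handover within the assumed 15-minute window.
      - alert: DRLHandoverFailed
        expr: increase(drl_accounting_handover_failed_total[15m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Counter handover failed; check inter-node connectivity at shutdown time"
```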

Recommended Grafana dashboard panels

| Panel | Query | Purpose |
|---|---|---|
| Request rate | rate(drl_grpc_check_total[1m]) | Overall throughput |
| Block rate | rate(drl_grpc_response_code_total{code="OVER_LIMIT"}[1m]) / ignoring(code) rate(drl_grpc_check_total[1m]) | Fraction of requests blocked |
| New blocks/s | rate(drl_ratelimit_blocks_total[1m]) | Rate of new entity blocks |
| Propagation p99 | histogram_quantile(0.99, rate(drl_ratelimit_propagation_latency_ms_bucket[5m])) | Tail (p99) gossip propagation latency |
| Cluster size | drl_membership_cluster_size | Live node count |
| Blocklist evictions | rate(drl_cache_evictions_total{cache_type="blocklist"}[5m]) | Memory pressure on blocklist |
| Blocklist memory | drl_cache_memory_bytes{cache_type="blocklist"} | Current blocklist footprint |
| Accounting balance | rate(drl_accounting_remote_increments_total[1m]) / (rate(drl_accounting_local_increments_total[1m]) + rate(drl_accounting_remote_increments_total[1m])) | Hash ring load balance (target ≈ (N−1)/N) |

The Block rate query uses ignoring(code) because the numerator carries a code label that drl_grpc_check_total does not; without it the two sides have different label sets and the division returns no data.