Inference Monitoring for Production ML: Metrics That Matter
Most production ML incidents are not discovered by alerts — they are discovered by users who notice that the application has started behaving badly. Response times have climbed. Outputs have become inconsistent. Requests are failing with cryptic errors. By the time users notice and report these problems, the underlying issue has often been developing for hours or days, gradually degrading quality or reliability until it crosses the threshold of user-visible impact. Good inference monitoring exists to catch these problems before users do.
The challenge is that ML inference has a broader set of failure modes than traditional web services. In addition to the standard operational concerns — latency, error rate, and availability — inference systems can fail in uniquely ML ways: models can produce outputs that are technically successful (status 200) but qualitatively degraded due to distribution shift or configuration changes. Monitoring for ML inference therefore requires both operational metrics (is the service up and fast?) and quality signals (is the service producing good outputs?). This article focuses on the operational side — the infrastructure metrics that every production inference deployment should be tracking.
The Core Latency Metrics: TTFT and TPOT
LLM inference has two distinct latency components that users experience differently and that require different optimization approaches. Time to First Token (TTFT) is the latency from request submission to the generation of the first output token. It encompasses request queuing time, prefill computation (processing the input prompt), and first token sampling. TTFT is what users experience as "how long until the response starts appearing" in streaming UIs, and it is dominated by the prefill computation time for long prompts.
Time Per Output Token (TPOT), also called inter-token latency, is the average time between successive generated tokens once generation has begun. It is dominated by the decoding computation time and is largely independent of input prompt length (though it varies with batch size and KV cache pressure). Users in streaming interfaces experience TPOT as the "speed of the text appearing" — TPOT under 50ms (20+ tokens per second) typically feels responsive; above 100ms (10 tokens per second) it starts to feel slow.
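Both metrics fall out of a single pass over a streaming response. The sketch below, a minimal illustration rather than any specific client API, assumes `token_iter` yields generated tokens one at a time and records an arrival timestamp per token:

```python
import time

def measure_latency(token_iter):
    """Compute TTFT and TPOT from a streaming token iterator.

    `token_iter` is an illustrative stand-in for whatever streaming
    interface your client exposes; it must yield at least one token.
    """
    start = time.monotonic()
    arrivals = []
    for _ in token_iter:
        arrivals.append(time.monotonic())
    ttft = arrivals[0] - start                # time to first token
    if len(arrivals) > 1:
        # Inter-token latency: average gap between successive tokens.
        tpot = (arrivals[-1] - arrivals[0]) / (len(arrivals) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

Note that TTFT as measured here includes queuing and prefill together; separating those components requires server-side instrumentation.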
Both metrics should be tracked as histograms, not averages. P50 latency tells you the median experience; P95 tells you what 1 in 20 users experiences; P99 tells you what 1 in 100 users experiences. In practice, tail latency in LLM serving is often much worse than median latency because requests that happen to arrive when the server is under high KV cache pressure or handling a very long preceding request can queue for significantly longer. Your SLA commitments should reference P95 or P99, not averages, and your monitoring should track accordingly.
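In a real deployment these percentiles come from your metrics backend, but the computation itself is simple. A minimal sketch using the nearest-rank method over a window of samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of observations are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def latency_summary(samples):
    """Summarize a window of latency samples at the tails that matter."""
    return {f"p{q}": percentile(samples, q) for q in (50, 95, 99)}
```

With 100 uniformly spread samples, `latency_summary` returns exactly the 50th, 95th, and 99th values, which makes the difference between median and tail behavior directly visible on a dashboard.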
Throughput: Tokens Per Second and Requests Per Second
Throughput metrics measure the aggregate output capacity of your inference infrastructure. Tokens per second (TPS) — specifically output tokens per second summed across all concurrent requests — is the primary throughput metric for LLM serving. It reflects how efficiently the GPU is being utilized for generation work. A GPU running at near-maximum throughput will generate tokens at close to its theoretical peak TPS; a GPU that is being underutilized (sparse batches, excessive idle time) will generate significantly below peak.
Requests per second (RPS) is a coarser throughput metric that counts completed requests regardless of their length. It is useful for capacity planning at the API level but can be misleading as a server efficiency metric: short requests have high RPS but low GPU utilization per request; long requests have lower RPS but better GPU utilization. TPS is the more accurate efficiency metric; RPS is the more useful capacity planning metric.
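The two metrics come from the same completion records, which is why it costs little to track both. A sketch, assuming you log the output token count of each request completed in a fixed measurement window:

```python
def throughput(completed_token_counts, window_s):
    """completed_token_counts: output token count for each request that
    finished inside the window; window_s: window length in seconds.
    Both names are illustrative."""
    tps = sum(completed_token_counts) / window_s   # aggregate output tokens/sec
    rps = len(completed_token_counts) / window_s   # completed requests/sec
    return tps, rps
```

Divergence between the two over time is itself a signal: rising RPS with flat TPS means request lengths are shrinking, not that capacity has grown.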
Track both metrics over time and during load tests. Knowing your peak sustainable TPS (the throughput at which P99 latency stays below your SLA target) tells you exactly how many requests you can serve before you need additional GPU capacity. This number should be measured empirically on production traffic patterns, not extrapolated from synthetic benchmarks.
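Given load-test results, extracting the peak sustainable TPS is a simple filter. This sketch assumes you have run the test at increasing concurrency and recorded a `(measured_tps, p99_latency_s)` pair per level; the function and parameter names are illustrative:

```python
def peak_sustainable_tps(load_points, sla_p99_s):
    """load_points: (measured_tps, p99_latency_s) pairs from a load test
    run at increasing concurrency levels. Returns the highest measured
    throughput whose P99 latency stays within the SLA, or None if no
    tested level qualifies."""
    within_sla = [tps for tps, p99 in load_points if p99 <= sla_p99_s]
    return max(within_sla) if within_sla else None
```

For example, with points at 100, 200, and 300 TPS whose P99 latencies are 0.4s, 0.8s, and 1.6s against a 1-second SLA, the peak sustainable throughput is 200 TPS.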
GPU Resource Metrics: Utilization, Memory, and Thermal State
GPU compute utilization measures what fraction of GPU compute cycles are being used for actual computation versus sitting idle. For a well-configured LLM serving deployment with adequate demand, GPU utilization should be 70 to 90 percent. Below 70 percent suggests the server is waiting for requests (underloaded) or spending disproportionate time on memory management, communication overhead, or other non-compute operations. Above 90 percent sustained utilization means the server is near capacity and latency will increase as requests queue.
GPU memory utilization — specifically VRAM utilization — is a critical safety metric. As discussed in detail in our KV cache article, VRAM is shared between model weights and the KV cache, and exhaustion causes requests to be preempted or rejected. Track VRAM utilization continuously; a sustained increase toward 95 to 100 percent is a leading indicator of imminent preemption pressure and should trigger autoscaling before service degradation occurs. Separate tracking of weight memory (stable) versus KV cache memory (variable with load) gives more actionable insight than aggregate VRAM utilization alone.
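The split between stable weight memory and variable KV cache memory can be derived from three numbers you likely already collect. A sketch of that breakdown and the associated alert rule; the 0.90 alert threshold is an illustrative default chosen to leave autoscaling time before the 95 to 100 percent danger zone, not a standard:

```python
def vram_breakdown(total_gb, weight_gb, kv_used_gb, alert_frac=0.90):
    """Split VRAM into stable weight memory and variable KV cache memory.

    alert_frac is a hypothetical threshold; tune it to your autoscaling
    reaction time.
    """
    util = (weight_gb + kv_used_gb) / total_gb
    kv_budget_gb = total_gb - weight_gb        # VRAM available to the KV cache
    return {
        "vram_util": util,                     # aggregate utilization
        "kv_cache_frac": kv_used_gb / kv_budget_gb,   # load-sensitive part
        "alert": util >= alert_frac,
    }
```

`kv_cache_frac` is the more actionable number: it rises and falls with load, while `vram_util` has a large fixed floor from the weights.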
GPU temperature and power draw are operational metrics that reveal hardware stress. GPUs throttle clock speeds when temperatures exceed target thresholds (typically around 83°C for data-center parts such as the A100 and H100), which reduces inference performance without obvious error signals. A GPU running consistently at 85 to 90°C is likely thermal throttling, potentially reducing throughput by 10 to 30 percent. Power draw provides a complementary signal: lower-than-expected power draw despite high reported utilization can indicate throttling.
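Because throttling produces no error, it has to be inferred from the combination of signals. A minimal heuristic sketch; the 83°C limit echoes the threshold above, while the 80 percent utilization and 0.6x power-limit cutoffs are illustrative assumptions, not vendor specifications:

```python
def throttling_suspected(util_pct, temp_c, power_w, board_power_limit_w,
                         temp_limit_c=83):
    """Heuristic throttle detector. Either condition alone is worth
    investigating; thresholds are assumptions to tune per fleet."""
    running_hot = temp_c >= temp_limit_c
    # High utilization but low power draw suggests clocks have been reduced.
    power_gap = util_pct >= 80 and power_w < 0.6 * board_power_limit_w
    return running_hot or power_gap
```

In practice these inputs would come from NVML or `nvidia-smi` polling; the heuristic is deliberately conservative, flagging for human review rather than automated action.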
Queue Depth and Batch Composition
The serving request queue is where latency spikes incubate. A queue depth of zero means requests are being served as fast as they arrive; a growing queue means demand exceeds capacity. Monitor queue depth as a time-series metric and alert on sustained queue depth above a threshold (e.g., more than 10 pending requests for more than 30 seconds). This provides earlier warning of capacity problems than latency metrics alone, since latency increases only after requests have already been queuing.
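The "sustained above threshold" rule needs a small amount of state to avoid firing on momentary spikes. A sketch of that alert logic, using the example thresholds from the text (more than 10 pending requests for more than 30 seconds):

```python
class QueueDepthAlert:
    """Fire only when queue depth stays above `threshold` for at least
    `sustain_s` seconds. Defaults mirror the example thresholds above;
    tune them to your traffic."""

    def __init__(self, threshold=10, sustain_s=30):
        self.threshold = threshold
        self.sustain_s = sustain_s
        self.breach_start = None   # timestamp when the current breach began

    def observe(self, t, depth):
        """t: sample timestamp in seconds; depth: current queue depth.
        Returns True when the alert should fire."""
        if depth > self.threshold:
            if self.breach_start is None:
                self.breach_start = t
            return (t - self.breach_start) >= self.sustain_s
        self.breach_start = None   # breach ended; reset the timer
        return False
```

Any single dip below the threshold resets the timer, so brief bursts during normal operation stay quiet while genuine capacity shortfalls alert within one sustain window.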
Batch size distribution reveals whether the serving engine is effectively utilizing GPU compute. Modern continuous batching systems dynamically adjust batch size based on request arrival rate and KV cache availability. Tracking the distribution of batch sizes — what fraction of compute time is spent at batch size 1 vs 8 vs 32 vs 64 — reveals whether the system is achieving the batching efficiency needed to justify the GPU investment. Persistent low batch sizes on a GPU that should be running at higher utilization suggest either low demand, request routing problems, or configuration issues with batching parameters.
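One way to sketch that distribution, assuming you sample the batch size at each scheduler step (the bucket boundaries below follow the 1/8/32/64 example in the text):

```python
def batch_size_distribution(step_samples, buckets=(1, 8, 32, 64)):
    """step_samples: batch size observed at each scheduler step.

    Returns the fraction of steps in each bucket; a sample is assigned
    to the smallest boundary that covers it, and anything above the last
    boundary is counted in the last bucket.
    """
    counts = {b: 0 for b in buckets}
    for size in step_samples:
        for b in buckets:
            if size <= b:
                counts[b] += 1
                break
        else:
            counts[buckets[-1]] += 1
    n = len(step_samples)
    return {b: c / n for b, c in counts.items()}
```

A healthy continuous-batching deployment should show most mass in the upper buckets; a distribution dominated by the batch-size-1 bucket on a busy GPU points at routing or configuration problems.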
Error Rates and Error Classification
Not all errors in LLM serving are equal, and distinguishing between them is important for diagnosis. Context length exceeded errors occur when a request exceeds the configured maximum sequence length and should be monitored separately from infrastructure errors. A spike in context length errors may indicate that user behavior has changed (longer prompts than expected) or that a system prompt update has increased per-request context consumption. Rate limit errors (429s) indicate that request volume exceeds concurrency limits and suggest either autoscaling failures or load shedding policy triggers.
Infrastructure errors — OOM failures, CUDA errors, timeout errors — indicate genuine system problems. Track error rates by type and set alert thresholds appropriate to each type. A 0.1 percent OOM rate might be acceptable during load spikes; a 0.1 percent CUDA error rate is never acceptable and warrants immediate investigation. OOM errors that persist at low load levels indicate memory leaks or configuration problems that will worsen over time. Time-to-error from service restart is a useful metric for catching slow memory leaks before they cause service degradation.
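Per-type thresholds are naturally expressed as a small policy table. A sketch; the error-type names are illustrative, and the example budgets mirror the text (an OOM rate up to 0.1 percent may be tolerable during spikes, any CUDA error is not):

```python
# Illustrative severity table; rates are fractions of total requests.
ERROR_POLICY = {
    "context_length_exceeded": {"class": "client",   "alert_rate": 0.05},
    "rate_limited":            {"class": "capacity", "alert_rate": 0.01},
    "oom":                     {"class": "infra",    "alert_rate": 0.001},
    "cuda_error":              {"class": "infra",    "alert_rate": 0.0},
}

def should_alert(error_type, observed_rate):
    """Alert when the observed rate exceeds the per-type budget.
    Unknown error types always warrant a look."""
    policy = ERROR_POLICY.get(error_type)
    if policy is None:
        return True
    return observed_rate > policy["alert_rate"]
```

The point of the table is that it makes the different tolerances explicit and reviewable, rather than buried in ad hoc alert rules.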
Key Takeaways
- Track TTFT and TPOT as P50/P95/P99 histograms — averages hide the tail latency that defines user experience
- Tokens per second (TPS) measures GPU efficiency; requests per second (RPS) measures API capacity — both are needed
- VRAM utilization is a leading indicator of service degradation; alert before it approaches 95% to allow autoscaling time to respond
- GPU temperature above 83°C sustained indicates thermal throttling that silently reduces throughput
- Queue depth monitoring provides earlier warning of capacity problems than latency monitoring alone
- Classify errors by type — context length errors, rate limits, OOM, CUDA errors each have different causes and responses
Conclusion
Effective inference monitoring is not about collecting every possible metric — it is about identifying the small set of metrics that give early warning of every important failure mode and surfacing them in ways that enable fast diagnosis and response. The metrics described in this article — TTFT, TPOT, TPS, GPU utilization, VRAM utilization, queue depth, and error rates by type — form a minimal viable observability stack for production LLM inference. Teams that have these metrics instrumented, alerting appropriately, and visible in dashboards will resolve production incidents in minutes rather than hours.
The Latentforce platform provides all of these metrics as first-class outputs with pre-built dashboards and configurable alerting. We also expose per-request latency attribution — showing how much of each request's latency came from queuing, prefill, and decoding — which makes diagnosis of latency regressions significantly faster than aggregate metrics alone. If you are running inference without this level of visibility, you are operating blind in a system that fails in complex ways.