TL;DR

Tuning LLM serving on IaaS requires balancing time to first token (TTFT), inter-token latency (ITL), and tokens-per-second throughput (TPS) to keep performance stable under mixed traffic. Guardrails such as admission control are essential for holding SLAs without letting tail latencies degrade.

What happened

Discussed the nuances of serving large language models (LLMs) on Infrastructure-as-a-Service (IaaS), focusing on how tuning throughput versus latency affects overall system performance, especially under mixed traffic conditions. Emphasized the importance of guardrails such as admission control to maintain service level agreements without compromising tail latencies.

Why it matters for ops

Understanding the differences between TTFT (time to first token), ITL (inter-token latency), and TPS (tokens per second, i.e., throughput) is crucial when tuning LLM serving on IaaS: TTFT and ITL capture interactive responsiveness, while TPS captures aggregate capacity, and optimizing one often trades off against the others. Proper configuration of vLLM's batching parameters, combined with guardrails like admission control, is necessary to achieve stable performance under varying workloads.
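These three metrics all fall out of per-token emission timestamps. A minimal sketch of how they relate (the function name and the sample timestamps are illustrative, not from the article):

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive per-request serving metrics from token emission timestamps.

    TTFT: delay from request arrival to the first emitted token.
    ITL:  gaps between consecutive tokens (its distribution drives p95/p99 feel).
    TPS:  tokens emitted per second over the whole generation.
    """
    ttft = token_times[0] - request_start
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    duration = token_times[-1] - request_start
    return {"ttft": ttft, "itl": itl, "tps": len(token_times) / duration}

# Hypothetical trace (seconds): request arrives at t=0, five tokens emitted.
m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.45, 0.50])
# TTFT is 0.25 s, and TPS is 5 tokens / 0.5 s = 10 tokens/s.
```

Note that larger batches tend to improve TPS while stretching TTFT and the ITL tail, which is exactly the throughput-versus-latency trade-off the article tunes around.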

Action items

  • Implement strict admission controls to prevent overloading the system with long requests from batch clients.
  • Configure vLLM parameters such as --max-num-seqs, --max-num-batched-tokens, and --gpu-memory-utilization carefully.
  • Monitor key metrics like TTFT p50/p95/p99, ITL distribution, queue depth, and reject rate to predict potential performance issues.

Source link

https://dev.to/daya-shankar/serving-llms-on-iaas-throughput-vs-latency-tuning-with-practical-guardrails-1boh