Benchmarking LLM Inference with vLLM

Title: Benchmarking LLM Inference with vLLM
Date: February 3, 2026
Author: Badal Satyarthi
Tags: [LLMs] [vLLM] [Infrastructure]

TTFT, TPOT, throughput, and what actually matters when you're serving a fine-tuned model

TL;DR

Benchmark your vLLM deployment with four methods: latency (offline), throughput (offline), serving (online with percentiles), and custom async. In our production benchmarks, a fine-tuned 0.6B model on an A40 delivers 30-40 req/s. Key optimizations: prefix caching (40% TTFT improvement in our tests), chunked prefill, and 0.90 GPU memory utilization. Capacity planning: GPUs needed = (daily volume / 86,400) / throughput per GPU.


You fine-tuned a model. It works. Now you need to know: how fast can you serve it?

Not a theoretical number from a blog post. Your model, your hardware, your input distribution. Benchmarks on someone else's setup won't tell you.

I spent a while benchmarking inference for a fine-tuned Qwen3 model served through vLLM. Here's what I learned.


The metrics that matter

There are too many metrics floating around. These are the ones that actually affect production decisions:

TTFT (Time to First Token) is how long until the user sees the first token. This is perceived latency. For interactive applications, this is the metric that determines whether your service feels fast.

TPOT (Time Per Output Token) is how long between each subsequent token. Determines streaming speed. Less important if you're not streaming.

Throughput (requests/sec) is how many complete requests you can serve per second. This is what determines your infrastructure cost. If you need to process 1M items per day, this is the number you divide by.

Total token throughput (tok/sec) is how many tokens your system processes per second, input and output combined. Useful for comparing across different input/output length distributions.

| Metric | What it tells you | Who cares |
| --- | --- | --- |
| TTFT | Perceived latency | User-facing apps |
| TPOT | Streaming speed | Chat interfaces |
| Throughput (req/s) | Capacity planning | Batch processing |
| P99 latency | Worst-case experience | SLA commitments |
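
To make the definitions concrete, here is a minimal sketch of how these metrics fall out of raw timestamps. The `RequestTrace` record and the function names are my own illustration, not part of vLLM; they assume you log when each request was sent and when each output token arrived.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record: timestamps in seconds."""
    sent_at: float            # when the request left the client
    token_times: list[float]  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: average gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def request_throughput(traces: list[RequestTrace]) -> float:
    """Completed requests per second over the whole run."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return len(traces) / (end - start)
```

Note that TPOT deliberately excludes the first token, so a slow prefill shows up in TTFT rather than in streaming speed.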

Four ways to benchmark

vLLM has three built-in benchmark tools, and you can write a fourth for custom metrics. Each answers a different question.

1. Latency benchmark (offline)

"How fast is a single request with no contention?"

vllm bench latency \
  --model your-model-name \
  --input-len 600 \
  --output-len 30 \
  --batch-size 1 \
  --num-iters 100 \
  --enable-prefix-caching

This loads the model directly, no server needed. Batch size 1 gives the purest latency measurement. Increase batch size to see how latency degrades under load.

# Sweep across batch sizes
for bs in 1 4 8 16 32; do
  echo "Batch size: $bs"
  vllm bench latency \
    --model your-model-name \
    --input-len 600 \
    --output-len 30 \
    --batch-size $bs \
    --num-iters 50
done

2. Throughput benchmark (offline)

"What's the maximum throughput on this hardware?"

vllm bench throughput \
  --model your-model-name \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --enable-prefix-caching

This pushes the model as hard as possible. No network overhead, no API layer. The number you get here is the practical ceiling for this model on this hardware; a served deployment will land somewhere below it.

Use your actual data distribution, not synthetic prompts. Input length variation matters a lot for throughput because vLLM's continuous batching behaves differently with mixed-length sequences.
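
One way to feed your own data to the built-in tools is to write it out in the ShareGPT layout. This is a rough sketch under a few assumptions: samples carry the `instruction` and `input` keys used in the custom benchmark later in this post, the `output` key is hypothetical, and the loader reads the first two turns of each `conversations` list (worth verifying against your vLLM version).

```python
import json


def to_sharegpt(samples):
    """Convert instruction/input/output dicts into ShareGPT-style records."""
    records = []
    for i, s in enumerate(samples):
        # Fold the system instruction into the prompt turn.
        prompt = f'{s["instruction"]}\n\n{s["input"]}'
        records.append({
            "id": str(i),
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": s["output"]},  # "output" is an assumed key
            ],
        })
    return records


def write_benchmark_file(samples, path="benchmark_data.json"):
    """Write the converted records to the file passed as --dataset-path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(to_sharegpt(samples), f, ensure_ascii=False)
```

Sampling the records from real production traffic keeps the input length distribution honest.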

3. Serving benchmark (online)

"What's the performance of my actual deployed server?"

# Terminal 1: Start the server
vllm serve your-model-name \
  --port 8000 \
  --enable-prefix-caching \
  --enable-chunked-prefill

# Terminal 2: Benchmark it
# The chat endpoint needs the chat backend (flag name may vary by vLLM version)
vllm bench serve \
  --backend openai-chat \
  --model your-model-name \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --dataset-name sharegpt \
  --dataset-path benchmark_data.json \
  --num-prompts 1000 \
  --percentile-metrics ttft,tpot,itl,e2el \
  --metric-percentiles 50,90,99 \
  --save-result

This is the most realistic test. It includes network overhead, API parsing, scheduling, everything your production requests hit.

The --request-rate flag matters more than you'd think. Set it to inf for max throughput testing, or to a specific number (like 50) to simulate steady-state production traffic. Latency numbers look very different at different request rates.
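
To see why the rate changes the picture, it helps to look at what a request schedule looks like. As far as I can tell, the built-in serving benchmark spaces requests with a Poisson process when the rate is finite. The sketch below is my own illustration of that idea, not vLLM's code; `poisson_send_times` is a hypothetical helper.

```python
import random


def poisson_send_times(num_requests, request_rate, seed=0):
    """Return send offsets in seconds for each request.

    request_rate is in requests/sec; float("inf") sends everything at t=0.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_requests):
        times.append(t)
        if request_rate != float("inf"):
            # Exponential gaps between arrivals give a Poisson process.
            t += rng.expovariate(request_rate)
    return times
```

At `inf` every request queues immediately, so latency mostly measures queueing. At a finite rate, latency reflects how the server behaves under that steady load.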

4. Custom benchmark (parallel API calls)

When you need metrics that vLLM's built-in tools don't track (per-request token counts, success rates, custom output validation), write your own:

from concurrent.futures import ThreadPoolExecutor, as_completed
import time

import requests

# Placeholders: point these at your own server, model, and data.
url = "http://localhost:8000/v1/chat/completions"
model = "your-model-name"
samples = [{"instruction": "Classify the sentiment.", "input": "Great product!"}]

def run_single_request(sample, url, model, max_tokens):
    response = requests.post(url, json={
        "model": model,
        "messages": [
            {"role": "system", "content": sample["instruction"]},
            {"role": "user", "content": sample["input"]}
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0
    }, timeout=60)
    ok = response.ok  # record failures so you can report a success rate
    usage = response.json().get("usage", {}) if ok else {}
    return {
        "ok": ok,
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
    }

# Run with 50 parallel workers
results = [None] * len(samples)
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=50) as executor:
    futures = {executor.submit(run_single_request, s, url, model, 100): i
               for i, s in enumerate(samples)}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

elapsed = time.perf_counter() - start
throughput = len(samples) / elapsed  # completed requests per second

For better performance, use aiohttp instead of threaded requests:

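import asyncio
import aiohttp

async def call_api(session, sample, url, model, semaphore, max_tokens=100):
    # Sketch of the per-request helper: one chat completion, returns token usage.
    messages = [{"role": "system", "content": sample["instruction"]},
                {"role": "user", "content": sample["input"]}]
    payload = {"model": model, "messages": messages,
               "max_tokens": max_tokens, "temperature": 0.0}
    async with semaphore:  # cap the number of in-flight requests
        async with session.post(url, json=payload) as resp:
            return (await resp.json()).get("usage", {})

async def run_async_benchmark(samples, url, model, max_concurrency=50):
    semaphore = asyncio.Semaphore(max_concurrency)
    connector = aiohttp.TCPConnector(limit=max_concurrency, keepalive_timeout=30)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [call_api(session, s, url, model, semaphore) for s in samples]
        return await asyncio.gather(*tasks)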

In our A/B tests (same 1,000-request workload, same server config on an A40), the async aiohttp version outperformed ThreadPoolExecutor by 10-30%. Thread context switching adds up, especially at high concurrency.


What the numbers look like

For a fine-tuned 0.6B model on an A40 (48GB) with ~600 token inputs and ~30 token outputs, here's what we measured in our internal benchmarks (vLLM with PagedAttention, prefix caching enabled):

| Metric | Value |
| --- | --- |
| Request throughput | 30-40 req/s |
| Output tokens/sec | 1,000-1,500 tok/s |
| Total tokens/sec | 20,000-25,000 tok/s |
| TTFT (median) | 15-30 ms |
| P99 latency | 50-100 ms |

In production terms:

Items/minute:  ~2,000
Items/hour:    ~130,000
Items/day:     ~3 million

On a single GPU. For a 0.6B model with ~600 token inputs and ~30 token outputs.

Your numbers will be different. Run the benchmarks yourself.


Optimization levers

Server-side

# --gpu-memory-utilization 0.90   use more VRAM
# --enable-prefix-caching         cache shared prefixes (system prompts)
# --enable-chunked-prefill        better scheduling for mixed batches
# --max-num-seqs 128              more concurrent sequences
# --disable-log-requests          reduce logging overhead
vllm serve your-model \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 128 \
  --disable-log-requests

Prefix caching was our biggest win. Building on the block-based KV cache from the vLLM PagedAttention paper (Kwon et al., 2023), vLLM reuses the cached KV blocks for shared prefixes, so each new request only runs prefill on its unique part. If your requests share a common system prompt, this optimization applies directly. In our benchmarks (1,000 requests with a shared system prompt on an A40), this cut our median TTFT by about 40%.
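
A toy model makes the saving easier to reason about. The sketch below is not vLLM's implementation; it only mimics the idea of block-level prefix matching, with a hypothetical `prefill_tokens_needed` helper and an assumed block size of 16 tokens.

```python
def prefill_tokens_needed(prompts, block_size=16):
    """For each prompt (a list of token ids), count tokens that still need prefill.

    Full blocks are keyed by a hash chained on the previous block, so a block
    only counts as cached when everything before it matched as well.
    """
    seen = set()
    needed = []
    for tokens in prompts:
        cached = 0
        matching = True
        parent = None
        full = len(tokens) - len(tokens) % block_size  # ignore the partial tail
        for start in range(0, full, block_size):
            block = tuple(tokens[start:start + block_size])
            key = hash((parent, block))
            if matching and key in seen:
                cached += block_size
            else:
                matching = False
                seen.add(key)
            parent = key
        needed.append(len(tokens) - cached)
    return needed
```

With a 512-token system prompt and 88 unique tokens per request, the first request prefills all 600 tokens and later ones only 88, which is where the TTFT drop comes from.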

Chunked prefill helps when you have a mix of long and short inputs. Instead of blocking on a long prefill before scheduling new requests, vLLM splits the prefill into chunks and interleaves them with decoding steps.
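
Here is a toy scheduler that illustrates the interleaving. It is a simplification under my own assumptions (a fixed per-step token budget, one token per running decode, decodes scheduled first), not a reproduction of vLLM's scheduler; `schedule_steps` is a hypothetical name.

```python
def schedule_steps(prefill_len, num_decoding, token_budget):
    """Split one long prefill into per-step chunks alongside running decodes.

    Returns a list of (prefill_chunk_tokens, decode_tokens) tuples, one per step.
    """
    steps = []
    remaining = prefill_len
    while remaining > 0:
        # Decodes go first: each running sequence needs one token per step.
        decode_tokens = min(num_decoding, token_budget)
        chunk = min(remaining, token_budget - decode_tokens)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        steps.append((chunk, decode_tokens))
        remaining -= chunk
    return steps
```

A 4,096-token prompt with 32 running decodes and a 2,048-token budget finishes in three steps, and the decodes keep producing a token at every one of them instead of stalling behind the prefill.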

Hardware-specific

| GPU | Memory util | Max seqs | Max model len |
| --- | --- | --- | --- |
| A40 (48GB) | 0.90 | 128 | 4096 |
| A100 (80GB) | 0.95 | 256 | 8192 |
| H100 (80GB) | 0.95 | 512 | 16384 |

More VRAM means more concurrent sequences, which means higher throughput. The relationship is roughly linear, but past ~256 concurrent sequences, scheduling overhead starts eating into gains.

Client-side

The number of parallel workers matters. Too few and you're underutilizing the server. Too many and you overwhelm it.

Start at 50 workers for a single-GPU vLLM server and adjust based on P99 latency. If P99 stays under your SLA, increase workers. If it spikes, back off.
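
That adjustment loop can be automated. The sketch below assumes a `measure_p99` callable of your own that runs a benchmark at a given worker count and returns P99 latency in seconds; the function name, step size, and upper bound are all hypothetical.

```python
def tune_workers(measure_p99, sla_s, start=50, step=10, max_workers=512):
    """Find a worker count whose P99 latency stays within the SLA.

    Returns the chosen worker count, or None if even the smallest count
    tried exceeds the SLA.
    """
    workers = start
    if measure_p99(workers) > sla_s:
        # P99 spiked: back off until it fits the SLA again.
        while workers > step and measure_p99(workers) > sla_s:
            workers -= step
        return workers if measure_p99(workers) <= sla_s else None
    # P99 is fine: keep adding workers while the next step still fits.
    while workers + step <= max_workers and measure_p99(workers + step) <= sla_s:
        workers += step
    return workers
```

Each call to `measure_p99` is a full benchmark run, so a coarse step is usually enough for a first pass.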


Common mistakes

Benchmarking with synthetic data is the big one. If your benchmark uses fixed-length random inputs but your production data has variable-length structured text, your numbers are wrong. Always benchmark with representative data.

Ignoring warmup throws off results too. The first few requests after server startup tend to be slow while one-time costs are paid (lazily compiled kernels, cold caches, an empty prefix cache). Exclude the first 10-20 requests.
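
In a custom benchmark, dropping warmup is a one-line slice before computing percentiles. A minimal sketch, assuming you collected end-to-end latencies in request order; `summarize` and its default of 20 warmup requests are my own choices.

```python
import statistics


def summarize(latencies_s, warmup=20):
    """Return median and P99 latency after discarding warmup requests."""
    steady = latencies_s[warmup:]
    if len(steady) < 2:
        raise ValueError("not enough requests left after warmup")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    cuts = statistics.quantiles(steady, n=100)
    return {"p50": statistics.median(steady), "p99": cuts[98]}
```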

Measuring throughput at low concurrency is misleading. Throughput of 5 req/s with 1 worker doesn't mean your server tops out at 5 req/s. It means your latency is 200ms. With 50 concurrent workers, you might hit 40 req/s.
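
The relationship behind this is Little's law: throughput is roughly the number of in-flight requests divided by mean latency. A tiny sketch with a hypothetical helper name; the 1.25 s figure is only the latency implied by 50 workers at 40 req/s, not a measurement.

```python
def expected_throughput(concurrency, mean_latency_s):
    """Little's law: requests/sec sustained by `concurrency` in-flight requests."""
    return concurrency / mean_latency_s


# One worker at 200 ms per request
single = expected_throughput(1, 0.2)

# Fifty workers, each request now taking 1.25 s under load
loaded = expected_throughput(50, 1.25)
```

Per-request latency gets worse under load, yet throughput still climbs because the server batches the concurrent requests.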

And if all your requests share a system prompt and you haven't enabled --enable-prefix-caching, you're leaving free performance on the table.


Capacity planning

Once you have your throughput number, the math is simple:

Required throughput = daily_volume / seconds_per_day
GPUs needed = required_throughput / throughput_per_gpu

Example: You need to process 5 million items per day.

5,000,000 / 86,400 = ~58 req/s
With 35 req/s per A40: ceil(58 / 35) = 2 GPUs

Add headroom for traffic spikes. 2x is standard, so 4 GPUs for this workload.
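
The same arithmetic as a small function, in case you want to rerun it as your numbers change. `gpus_needed` is a hypothetical helper; it applies the headroom to the GPU count, as in the example above.

```python
import math

SECONDS_PER_DAY = 86_400


def gpus_needed(daily_volume, throughput_per_gpu, headroom=2.0):
    """Return (baseline GPUs, GPUs with headroom) for a daily request volume."""
    required_rps = daily_volume / SECONDS_PER_DAY
    base = math.ceil(required_rps / throughput_per_gpu)
    return base, math.ceil(base * headroom)


# 5 million items/day at 35 req/s per A40
base, with_headroom = gpus_needed(5_000_000, 35)
```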


The difference between a benchmarked deployment and a guess is about 3x in infrastructure cost. Measure first.

Badal Satyarthi

AI Consultant. 9+ years building production AI. Previously Chief Data Scientist at recruitRyte. IIT Dhanbad.