When should you fine-tune a small LLM instead of using GPT-4?

Fine-tune small LLMs when you're doing structured output tasks (scoring, classification, field extraction) where the output format is fixed and you have 50K+ training examples. A fine-tuned Qwen3-0.6B matches GPT-4 quality at 900x lower cost ($30/day vs $27,000/day).

What accuracy can a fine-tuned small model achieve on scoring tasks?

A 0.6B–1.7B model can hit 90%+ off-by-1 accuracy on structured scoring tasks. Using SFT alone, the model achieves around 70% exact match accuracy; adding GRPO with a well-designed reward function pushes accuracy above 80%.

What is QLoRA and why use it for fine-tuning small LLMs?

QLoRA loads the base model in 4-bit quantization and trains small LoRA adapters on top. For a 1.7B model this drops GPU memory from ~14GB to under 8GB, making single-GPU fine-tuning practical without sacrificing quality on structured tasks.

How much training data do you need to fine-tune a small LLM for scoring?

Around 50K–100K labeled examples works well for structured scoring. Below 10K the model often fails to learn the output format; above 100K returns diminish per dollar of compute. Quality of label distribution matters more than raw count above 50K.

What hardware setup works for fine-tuning a 1.7B model?

A single A40 or A100 handles QLoRA on a 1.7B model. A 4x A40 multi-GPU setup with data-parallel training cuts training time roughly 3.5x at 4x the rental cost. That's worth it for iteration speed during development, less so once your training loop is stable.

Fine-Tuning Small LLMs for Structured Scoring Tasks

How a 1.7B parameter model can replace GPT-4 for specific tasks at a fraction of the cost


TL;DR

A fine-tuned Qwen3-0.6B matches GPT-4 for structured scoring tasks at 900x lower cost ($30/day vs $27,000/day). The pipeline: QLoRA fine-tuning with Axolotl, deployed on vLLM with prefix caching. You need 50K+ training examples. Adding GRPO after SFT pushes exact-match accuracy from 70% to 80%+.

Why small models work for scoring
The pipeline
Deployment
When this doesn't work
The two-stage approach

A lot of production LLM usage isn't about general intelligence. It's structured output from unstructured input. Score this text from 0 to 5. Classify this document. Extract fields from a paragraph.

For these tasks, a fine-tuned small model (0.6B-1.7B parameters) can match frontier model quality at a fraction of the inference cost. I've done this with QLoRA, Axolotl, and vLLM. Here's the whole pipeline.

Why small models work for scoring

Frontier models can write poetry, debug code, and explain quantum mechanics. You're paying for all of that when you only need one thing: look at this text, output some numbers.

Fine-tuning narrows the problem. You're not asking the model to reason from scratch. You're training it on thousands of examples until the pattern is in the weights. The model doesn't need to be general. It needs to be accurate on your distribution.

For structured scoring tasks where the output is a fixed format like space-separated integers, a 0.6B model with good training data can hit 90%+ off-by-1 accuracy (evaluated on held-out test sets of 5K+ samples). That's often better than prompting a frontier model, because prompted outputs vary with wording and sampling, while a fine-tuned model has the task and the output format baked into its weights.

The pipeline

Data Collection → Format Conversion → QLoRA Fine-Tuning → Evaluation → Deployment

Step 1: Data format

Axolotl supports multiple formats. Alpaca is the simplest for instruction-following tasks:

{
  "instruction": "Your system prompt here - scoring criteria, scale definition, output format",
  "input": "The text to evaluate...",
  "output": "3 4 2 5"
}

The system prompt defines the task. The input is the content to score. The output is the expected scores, space-separated.

SYSTEM_PROMPT = """You will be provided with a text and a list of questions.
For each question, provide a score from 0 to 5 where:
- 0: Not applicable / no information
- 1: Very poor alignment
- 2: Below average
- 3: Average / neutral
- 4: Good alignment
- 5: Excellent alignment

Return only the numerical scores separated by spaces."""

Keep the output format dead simple. Space-separated integers. No JSON, no explanations, no markdown. The simpler the format, the fewer parse errors during inference.
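Here's a minimal sketch of both ends of that format using only the standard library. The helper names `to_alpaca_record` and `parse_scores`, and the `num_questions` parameter, are my own for illustration and not part of Axolotl.

```python
import json
from typing import List, Optional

def to_alpaca_record(system_prompt: str, text: str, scores: List[int]) -> str:
    """Serialize one training example as a single JSONL line in Alpaca format."""
    record = {
        "instruction": system_prompt,
        "input": text,
        "output": " ".join(str(s) for s in scores),
    }
    return json.dumps(record, ensure_ascii=False)

def parse_scores(response: str, num_questions: int,
                 lo: int = 0, hi: int = 5) -> Optional[List[int]]:
    """Return the parsed scores, or None if the response breaks the format."""
    parts = response.strip().split()
    if len(parts) != num_questions:
        return None
    # Reject anything that is not a plain ASCII integer (commas, words, JSON).
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    scores = [int(p) for p in parts]
    if any(s < lo or s > hi for s in scores):
        return None
    return scores
```

Returning `None` on any deviation makes format failures easy to count, which is a useful metric in its own right.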

Step 2: Data preprocessing

Token length matters. If an input exceeds the configured sequence length, it either gets truncated or the model's output quality degrades. Filter during preprocessing:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-1.7B")

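def is_within_limit(sample, max_tokens=4000):
    # Approximate length: this concatenates the raw fields without the prompt
    # template, so keep max_tokens below sequence_len (4096) to leave headroom.
    full_text = sample["instruction"] + sample["input"] + sample["output"]
    tokens = tokenizer.encode(full_text)
    return len(tokens) <= max_tokens

# dataset: a list of Alpaca-format dicts loaded from your JSONL file
filtered = [s for s in dataset if is_within_limit(s)]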

Deduplicate too. Hash-based deduplication on normalized JSON catches exact duplicates:

import hashlib, json

seen = set()
unique = []
for sample in filtered:  # deduplicate the length-filtered samples
    h = hashlib.md5(json.dumps(sample, sort_keys=True).encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        unique.append(sample)
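Step 5 evaluates against held-out test data, so it makes sense to split after deduplication; otherwise identical samples can land on both sides. One possible sketch is a deterministic hash-based split. The function name `assign_split` and the 5% test fraction are assumptions, not something from my original pipeline.

```python
import hashlib
import json

def assign_split(sample: dict, test_fraction: float = 0.05) -> str:
    """Deterministically map a sample to 'train' or 'test' from its content hash."""
    key = json.dumps(sample, sort_keys=True).encode()
    # Use the hash as a stable pseudo-random bucket in [0, 10000).
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000
    return "test" if bucket < test_fraction * 10_000 else "train"
```

Because the assignment depends only on the sample's content, rerunning preprocessing after adding new data never moves an existing sample from train to test.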

Step 3: QLoRA configuration


QLoRA means 4-bit quantized base model with low-rank adapter layers on top. You get the quality of a 1.7B model while training with the memory footprint of something much smaller.

Here's a working Axolotl config for multi-GPU training:

base_model: Qwen/Qwen3-1.7B
chat_template: qwen3

datasets:
  - path: data/train.jsonl
    ds_type: json
    type: alpaca

# QLoRA
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

# Training
sequence_len: 4096
sample_packing: true
micro_batch_size: 4
gradient_accumulation_steps: 4
num_epochs: 3
optimizer: adamw_torch_4bit
learning_rate: 0.0002
lr_scheduler: cosine
warmup_ratio: 0.1

# Memory optimization
gradient_checkpointing: true
flash_attention: true
bf16: auto
tf32: true

# Multi-GPU
deepspeed: deepspeed_configs/zero2.json

A few things worth noting in this config.

LoRA rank 16, alpha 32. Setting alpha to twice the rank is a common default. Rank 16 is enough for scoring tasks since you're not teaching the model new knowledge, just a specific input-output mapping.
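To get a feel for how small a rank-16 adapter is, here's a rough back-of-the-envelope sketch. The 2048 x 2048 layer shape is illustrative only and not read from the Qwen3 config; the formulas are the standard LoRA ones.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter on a d_in x d_out linear layer adds A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)

def lora_scaling(alpha: int, rank: int) -> float:
    """Standard LoRA scales the adapter update by alpha / r."""
    return alpha / rank

# Illustrative layer shape (hypothetical, not taken from the actual model config).
d_in, d_out = 2048, 2048
full = d_in * d_out
adapter = lora_params(d_in, d_out, rank=16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
print("scaling:", lora_scaling(alpha=32, rank=16))
```

Doubling the rank doubles the adapter's parameter count, so there's little reason to go higher unless evaluation says the adapter is underfitting.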

Sample packing is why Axolotl is worth using. It packs multiple short examples into a single sequence so you don't waste compute on padding. For scoring tasks with short outputs (5-10 tokens), this roughly doubles effective throughput.
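A toy sketch of the idea, assuming a simple first-fit-decreasing strategy. This is not Axolotl's actual packing algorithm, and the function names are hypothetical; it only shows why packing cuts padding.

```python
from typing import List

def pack_first_fit(lengths: List[int], sequence_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of example lengths into sequences."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each packed sequence
    for n in sorted(lengths, reverse=True):
        if n > sequence_len:
            raise ValueError(f"example of {n} tokens exceeds sequence_len")
        for i, room in enumerate(free):
            if n <= room:
                bins[i].append(n)
                free[i] -= n
                break
        else:
            # No existing sequence has room, so start a new one.
            bins.append([n])
            free.append(sequence_len - n)
    return bins

def padding_fraction(num_sequences: int, total_tokens: int, sequence_len: int) -> float:
    """Share of token slots that hold padding rather than real tokens."""
    return 1 - total_tokens / (num_sequences * sequence_len)
```

Comparing `padding_fraction` for the packed sequences against one example per sequence gives a rough upper bound on the savings; real batches pad to the longest example rather than to `sequence_len`, so the actual gain is smaller.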

DeepSpeed ZeRO-2 partitions gradients and optimizer states across GPUs. ZeRO-3 partitions model weights too but adds communication overhead. For a 1.7B model, ZeRO-2 is enough.

Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. On an A40 (48GB), this lets you fit larger batch sizes.

Step 4: Training

# Single GPU
axolotl train config.yml

# Multi-GPU (4x A40)
accelerate launch -m axolotl.cli.train config.yml

With 4x A40s and the config above, effective batch size is 64 (4 micro-batch x 4 gradient accumulation x 4 GPUs). 3 epochs on 100K samples takes roughly 2-3 hours.
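The batch-size arithmetic as a quick sketch. The step count is an upper bound I'm adding for illustration: with sample packing, the unit being batched is a packed sequence, so the real number of steps will be lower.

```python
import math

def effective_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Sequences contributing to each optimizer step across all GPUs."""
    return micro_batch_size * grad_accum_steps * num_gpus

def optimizer_steps(num_sequences: int, batch_size: int, num_epochs: int) -> int:
    """Upper bound on optimizer steps; sample packing lowers num_sequences."""
    return math.ceil(num_sequences / batch_size) * num_epochs

batch = effective_batch_size(micro_batch_size=4, grad_accum_steps=4, num_gpus=4)
steps = optimizer_steps(num_sequences=100_000, batch_size=batch, num_epochs=3)
print(batch, steps)
```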

After training, merge the LoRA adapter back into the base model:

axolotl merge-lora config.yml --lora-model-dir ./outputs/checkpoint-final

Step 5: Evaluation

Serve the merged model with vLLM and evaluate against held-out test data:

vllm serve ./merged_model --port 8000 --enable-prefix-caching
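A minimal client sketch for that server, using vLLM's OpenAI-compatible `/v1/completions` route and only the standard library. Several things here are assumptions to check against your own setup: the Alpaca prompt template (it must match whatever your training run used), the model name `./merged_model` (by default I'd expect it to be the path passed to `vllm serve`), and the `max_tokens` value.

```python
import json
import urllib.request

# Assumption: Alpaca-style template. Verify it matches your training prompts.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def build_payload(instruction: str, text: str, model: str = "./merged_model") -> dict:
    """Build a request body for the OpenAI-compatible completions route."""
    return {
        "model": model,
        "prompt": ALPACA_TEMPLATE.format(instruction=instruction, input=text),
        "max_tokens": 16,    # scores are only a handful of tokens
        "temperature": 0.0,  # greedy decoding for repeatable evaluation
    }

def score(instruction: str, text: str, base_url: str = "http://localhost:8000") -> str:
    """POST one request and return the raw completion text."""
    body = json.dumps(build_payload(instruction, text)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Putting the shared instruction first and the variable input last is what lets prefix caching reuse work across requests.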

The metrics that matter for scoring:

Metric	What it tells you
Exact match accuracy	% of responses where every score matches ground truth (all-or-nothing)
Per-answer exact accuracy	% of individual scores predicted exactly right
Off-by-1 accuracy	% of individual scores within +/-1 of ground truth
MAE	Mean absolute error between predicted and true scores
Pearson correlation	Whether the model ranks items correctly even if absolute values drift
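The metrics in the table can be computed roughly like this. It's a sketch under two assumptions: every prediction has already been parsed into a list of integers the same length as its ground truth, and exact match is counted per response (all scores right) rather than per score. The function name `scoring_metrics` is hypothetical.

```python
import math
from typing import Dict, List

def scoring_metrics(predictions: List[List[int]],
                    ground_truths: List[List[int]]) -> Dict[str, float]:
    """Compute the table's metrics over equal-length lists of score vectors."""
    flat_p = [s for row in predictions for s in row]
    flat_g = [s for row in ground_truths for s in row]
    n = len(flat_p)
    exact_rows = sum(p == g for p, g in zip(predictions, ground_truths))
    per_answer = sum(p == g for p, g in zip(flat_p, flat_g))
    off_by_1 = sum(abs(p - g) <= 1 for p, g in zip(flat_p, flat_g))
    mae = sum(abs(p - g) for p, g in zip(flat_p, flat_g)) / n
    # Pearson correlation over all individual scores.
    mean_p, mean_g = sum(flat_p) / n, sum(flat_g) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(flat_p, flat_g))
    var_p = sum((p - mean_p) ** 2 for p in flat_p)
    var_g = sum((g - mean_g) ** 2 for g in flat_g)
    denom = math.sqrt(var_p * var_g)
    return {
        "exact_match": exact_rows / len(predictions),
        "per_answer_exact": per_answer / n,
        "off_by_1": off_by_1 / n,
        "mae": mae,
        "pearson": cov / denom if denom else float("nan"),
    }
```

Responses that fail to parse need a policy of their own. Counting them as wrong on every metric is the conservative choice.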

# Per-position accuracy breakdown
for position in range(num_questions):
    position_preds = [p[position] for p in predictions]
    position_truth = [g[position] for g in ground_truths]
    accuracy = sum(p == g for p, g in zip(position_preds, position_truth)) / len(position_preds)
    print(f"Position {position + 1}: {accuracy:.2%}")

Check per-position accuracy. If position 1 is 85% but position 5 is 60%, the model is degrading on later questions in the sequence. You might need to increase context length or look at how your training labels are distributed across positions.

Deployment

vLLM with prefix caching. If all your requests share the same system prompt (they do for scoring tasks), the KV-cache for that prefix is computed once and reused.

vllm serve ./merged_model \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 128

A fine-tuned Qwen3-0.6B on a single A40 does 30-40 requests/second. That's around 3 million evaluations per day from one GPU.

Compare that to GPT-4 at ~$15 per million input tokens. At 600 tokens per request, 3 million requests is 1.8 billion input tokens, or about $27,000/day before counting output tokens. A single A40 costs about $30/day. That's a 900x difference.
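The same arithmetic as a sketch you can rerun with your own numbers. The prices and throughput are the figures quoted above, not live pricing, and the function names are mine.

```python
SECONDS_PER_DAY = 86_400

def daily_requests(requests_per_second: float) -> int:
    """Requests one server handles in a day at a sustained rate."""
    return int(requests_per_second * SECONDS_PER_DAY)

def api_cost_per_day(requests: int, tokens_per_request: int,
                     usd_per_million_tokens: float) -> float:
    """Input-token cost only; output tokens are ignored, as in the text."""
    return requests * tokens_per_request * usd_per_million_tokens / 1_000_000

# 30-40 requests/second brackets the "around 3 million per day" figure.
print(daily_requests(30), daily_requests(40))

api = api_cost_per_day(3_000_000, 600, 15.0)
gpu = 30.0  # quoted daily rental for one A40
print(api, api / gpu)
```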

When this doesn't work

If your production data looks nothing like your training data, the fine-tuned model will be confidently wrong. Fine-tuning gives you pattern memorization, not generalization. Monitor distribution drift.
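One cheap proxy for drift is to compare the distribution of predicted scores in production against the distribution on your validation set. A sketch using total variation distance follows; the function names are hypothetical, and this watches outputs rather than inputs, so treat it as a first alarm rather than a complete monitor.

```python
from collections import Counter
from typing import Iterable, List

def score_distribution(scores: Iterable[int], lo: int = 0, hi: int = 5) -> List[float]:
    """Normalized histogram over the score range."""
    counts = Counter(scores)
    total = sum(counts[s] for s in range(lo, hi + 1))
    if total == 0:
        raise ValueError("no scores in range")
    return [counts[s] / total for s in range(lo, hi + 1)]

def total_variation(p: List[float], q: List[float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))
```

In practice you'd compute the reference distribution once, recompute the production one over a rolling window, and alert when the distance crosses a threshold you calibrate on historical data.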

If the scoring requires multi-step reasoning ("this person has skill X which implies they could do Y which means they'd score Z"), small models struggle. Frontier models with chain-of-thought are still better for that.

If your scoring rubric changes frequently, retraining every time gets expensive. A well-prompted frontier model with few-shot examples might be more practical even though it costs more per-call.

Below 5K training examples, fine-tuning tends to overfit. You need enough data to cover the variation in your inputs. 50K-100K is where we saw good results.

The two-stage approach

For the best quality, combine SFT with GRPO:

Stage 1: SFT → Model learns format + approximate answers
Stage 2: GRPO → Model refines accuracy using reward functions

SFT alone gets around 70% exact match accuracy in our experience (evaluated across multiple held-out test splits). Adding GRPO with a well-designed reward function (exact match + off-by-1 + MAE + format) pushed that above 80%, using Hugging Face TRL. The improvement was consistent across different test splits.
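As a rough sketch of what such a reward function can look like: the weights below are illustrative, not the ones I used, and the `ground_truth` column name is an assumption about how your dataset is laid out. The batch wrapper follows the shape TRL expects for custom reward functions with plain-text completions (a list of completions plus dataset columns as keyword arguments, returning one float per completion).

```python
from typing import List, Optional

def parse(completion: str, n: int) -> Optional[List[int]]:
    """Parse exactly n space-separated integers, or return None."""
    parts = completion.strip().split()
    if len(parts) != n or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    return [int(p) for p in parts]

def score_reward(completion: str, target: List[int], max_score: int = 5) -> float:
    """Mix of format, exact match, off-by-1, and MAE terms (illustrative weights)."""
    pred = parse(completion, len(target))
    if pred is None or any(s > max_score for s in pred):
        return 0.0  # a format failure earns nothing
    n = len(target)
    exact = sum(p == t for p, t in zip(pred, target)) / n
    near = sum(abs(p - t) <= 1 for p, t in zip(pred, target)) / n
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n
    closeness = 1 - mae / max_score
    # 0.1 for valid format, the rest split across the accuracy terms.
    return 0.1 + 0.5 * exact + 0.2 * near + 0.2 * closeness

def reward_fn(completions: List[str], ground_truth: List[str], **kwargs) -> List[float]:
    """Batch wrapper: one reward per completion."""
    return [
        score_reward(c, [int(x) for x in gt.split()])
        for c, gt in zip(completions, ground_truth)
    ]
```

Giving partial credit for near misses keeps the reward signal dense, which matters because GRPO learns from differences between completions sampled for the same prompt.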

The full pipeline (data processing, SFT, GRPO, evaluation, deployment) is more engineering work than calling an API. But for anything processing millions of inputs, the cost savings pay for the engineering effort within days.


The API is the prototype. The fine-tuned model is the production system. Same accuracy, 900x cheaper. The tradeoff is engineering effort upfront.