
Unsloth + Training Hub: Lightning-Fast LoRA in Production

Kodetra Technologies · 7 min read · Intermediate

Summary

Fine-tune 7B LLMs on one 24GB GPU with 70% less VRAM

Why Unsloth + Training Hub Is the New LoRA Standard

Fine-tuning a 7B parameter language model used to mean renting a multi-GPU node, wrestling with FSDP configs, and burning a weekend on a job that might OOM at hour eleven. As of April 2026, that workflow is gone. Red Hat's Training Hub v0.4.0 ships with Unsloth as a first-class backend, and the combination lets you fine-tune a 7B instruction model on a single 24GB consumer GPU in under three hours, with roughly 70% less VRAM than full fine-tuning and about 2x the throughput of vanilla LoRA pipelines.

This guide walks through the full pipeline end to end: installing Training Hub, preparing a ChatML dataset, configuring the Unsloth-backed QLoRA recipe, kicking off training, merging the adapter, and exporting a GGUF model you can serve with vLLM or llama.cpp. By the end you will have a reproducible recipe that you can drop into a CI pipeline or hand to a colleague.

Prerequisites

  • A Linux box with one NVIDIA GPU that has at least 24GB of VRAM (RTX 3090, 4090, A10G, A100, L4 all work).
  • CUDA 12.4 or newer drivers installed and visible to Python (nvidia-smi shows your card).
  • Python 3.11 or 3.12 in a clean virtualenv. Mixing Unsloth into an existing PyTorch env almost always breaks something.
  • A Hugging Face token with read access to the base model you plan to fine-tune.
  • About 60GB of free disk for the base weights, adapter checkpoints, and merged GGUF output.
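
If you want to check the hardware and disk items in one pass before installing anything, a small pre-flight script like the sketch below does it with the standard library plus nvidia-smi; the file name and thresholds are just illustrations of the list above.

# preflight.py (hypothetical helper) - stdlib-only sanity check before installing
import os, shutil, subprocess

# GPU name and total VRAM as reported by the driver
smi = subprocess.run(
    ['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'],
    capture_output=True, text=True, check=True,
)
print('GPU:', smi.stdout.strip())                 # e.g. NVIDIA GeForce RTX 4090, 24564 MiB

# Roughly 60GB free disk for base weights, adapter checkpoints, and the GGUF export
free_gb = shutil.disk_usage('.').free / 1e9
print(f'free disk: {free_gb:.0f} GB', '(ok)' if free_gb >= 60 else '(low)')

# Hugging Face token for gated base models such as Llama 3.1
print('HF token set:', bool(os.environ.get('HF_TOKEN')))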

Step 1: Install Training Hub With the Unsloth Extra

Training Hub treats backends as optional extras. Installing the unsloth extra pulls a pinned PyTorch build, the matching xFormers wheel, and the Unsloth kernels in one shot. Do it in a fresh venv to avoid the torch / triton version dance:

python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install "training-hub[unsloth]==0.4.0"

# Verify the stack
python -c "import torch, unsloth, training_hub; print(torch.__version__, unsloth.__version__, training_hub.__version__)"
# Expected output (versions may shift slightly):
# 2.5.1+cu124 2026.4.3 0.4.0

If the import line prints without errors and torch reports a CUDA build, you are ready. If you see an 'undefined symbol' error from xFormers, your PyTorch and CUDA toolkit are mismatched; wipe the venv and reinstall rather than upgrading in place.
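
For a slightly more explicit check than the one-liner above, the short snippet below (my own diagnostic, not part of Training Hub) separates the two most common failure modes: a CPU-only torch wheel versus a driver the process cannot see.

# Distinguish 'CPU-only torch build' from 'CUDA driver not visible to the process'
import torch

print('torch:', torch.__version__)
print('built against CUDA:', torch.version.cuda)   # None means a CPU-only wheel got installed
print('driver visible:', torch.cuda.is_available())

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'device: {torch.cuda.get_device_name(0)}  VRAM: {props.total_memory / 1e9:.1f} GB')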

Step 2: Prepare a ChatML Dataset

Training Hub expects JSONL where each line is a conversation in ChatML format. The minimum useful dataset is around 1,000 examples; below that you tend to get a stylistic shift but no meaningful capability gain. Keep system prompts short and consistent across rows so the model does not learn to imitate prompt drift.

# data/sql_assistant.jsonl  (one conversation per line, formatted on multiple lines here for clarity)
{
  "messages": [
    {"role": "system", "content": "You are a senior data engineer. Reply with one Postgres query."},
    {"role": "user",   "content": "Top 5 customers by revenue in Q1 2026."},
    {"role": "assistant", "content": "SELECT customer_id, SUM(amount) AS revenue FROM orders WHERE created_at >= '2026-01-01' AND created_at < '2026-04-01' GROUP BY 1 ORDER BY revenue DESC LIMIT 5;"}
  ]
}

A few dataset rules will save you a failed run later. Strip examples whose token count exceeds your max_seq_length, because Training Hub will silently truncate them and you will lose the answer. Deduplicate on the user turn so the model does not memorize a single prompt. And carve out a 5% validation split with a seeded shuffle at data-prep time, not during training, so the same rows land in eval on every run and never leak into the train set.

import json, random
from pathlib import Path

# Seeded shuffle, then a 95/5 train/eval split so the holdout is reproducible
rows = [json.loads(l) for l in Path('data/sql_assistant.jsonl').read_text().splitlines() if l.strip()]
random.Random(42).shuffle(rows)
split = int(len(rows) * 0.95)
Path('data/train.jsonl').write_text('\n'.join(json.dumps(r) for r in rows[:split]))
Path('data/eval.jsonl').write_text('\n'.join(json.dumps(r) for r in rows[split:]))
print(f'train={split}  eval={len(rows)-split}')
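
The split above assumes the rows are already clean. To enforce the first two rules, the length cap and the user-turn dedup, something like the sketch below can run first and the split script can then read the cleaned file instead; it assumes the base model's tokenizer ships a chat template (Llama 3.1 Instruct does), and the 4096 cap matches max_seq_length in the recipe in the next step.

# clean.py (hypothetical helper) - run before the split: drop over-length rows, dedupe on the user turn
import json
from pathlib import Path
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
seen, kept = set(), []

for line in Path('data/sql_assistant.jsonl').read_text().splitlines():
    if not line.strip():
        continue
    row = json.loads(line)
    user = next(m['content'] for m in row['messages'] if m['role'] == 'user')
    if user in seen:                                   # dedupe on the user turn
        continue
    # Token count of the fully rendered conversation, assistant answer included
    n_tokens = len(tok.apply_chat_template(row['messages'], tokenize=True))
    if n_tokens > 4096:                                # matches max_seq_length in the recipe
        continue
    seen.add(user)
    kept.append(line)

Path('data/sql_assistant.clean.jsonl').write_text('\n'.join(kept))
print(f'kept {len(kept)} of the original rows')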

Step 3: Write the Training Hub Recipe

Training Hub uses a single YAML recipe file as the source of truth for a run. The Unsloth backend is selected with backend: unsloth, and everything below it is forwarded to Unsloth's FastLanguageModel. The defaults below are the 2026 community baseline for instruction-style QLoRA: rank 16, alpha 16, all linear modules targeted, DoRA enabled, two epochs at 2e-4.

# recipes/sql_assistant_qlora.yaml
name: sql-assistant-qlora
backend: unsloth

model:
  base: meta-llama/Llama-3.1-8B-Instruct
  load_in_4bit: true        # NF4 quantization, the 'Q' in QLoRA
  max_seq_length: 4096
  dtype: bfloat16

data:
  train: data/train.jsonl
  eval:  data/eval.jsonl
  format: chatml

lora:
  r: 16
  alpha: 16
  dropout: 0.0
  target_modules: all-linear
  use_dora: true
  use_rslora: false

training:
  epochs: 2
  per_device_batch_size: 2
  gradient_accumulation_steps: 4   # effective batch = 8
  learning_rate: 2.0e-4
  warmup_ratio: 0.03
  weight_decay: 0.01
  lr_scheduler: cosine
  optimizer: adamw_8bit
  packing: true
  gradient_checkpointing: unsloth   # 30% slower, saves ~5GB
  bf16: true
  seed: 42

output:
  dir: runs/sql-assistant-qlora
  save_strategy: epoch
  save_total_limit: 2

Two settings deserve a sentence each. target_modules: all-linear tells Unsloth to attach LoRA adapters to every nn.Linear in the transformer block, which empirically beats targeting only q_proj and v_proj for instruction following. use_dora: true switches to DoRA (Weight-Decomposed Low-Rank Adaptation), which adds about 5% wall-clock overhead but consistently gains a few points on benchmarks at the same rank.
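
For readers who want to see what the lora: stanza corresponds to one layer down, it maps roughly onto a standard PEFT LoraConfig; the snippet below is an illustrative equivalent, not Training Hub's actual internals.

# Roughly the PEFT-level equivalent of the lora: stanza above (illustrative only)
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                         # lora.r
    lora_alpha=16,                # lora.alpha; alpha/r = 1 keeps the LoRA scaling at 1
    lora_dropout=0.0,             # lora.dropout
    target_modules='all-linear',  # adapters on every nn.Linear in the transformer block
    use_dora=True,                # DoRA: weight-decomposed low-rank adaptation
    use_rslora=False,             # rank-stabilized scaling off at r=16
    task_type='CAUSAL_LM',
)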

Step 4: Launch Training

With the recipe in place, the actual command is one line. Training Hub handles tokenizer setup, packing, checkpointing, and Weights and Biases or MLflow logging if you have either configured.

training-hub run recipes/sql_assistant_qlora.yaml \
  --logger wandb \
  --tags lora,unsloth,llama-3.1-8b

# First few lines you should see:
# [training-hub] backend=unsloth gpu=NVIDIA RTX 4090 (24GB)
# [unsloth]      Patching FastLlamaModel: 32 layers, 4096 hidden
# [unsloth]      Trainable params: 41,943,040 / 8.03B  (0.52%)
# [unsloth]      Estimated step time: 1.4s   peak VRAM: 18.2 GB
# {'loss': 1.41, 'grad_norm': 0.78, 'learning_rate': 1.94e-04, 'epoch': 0.02}

On a single RTX 4090, the example above completes in roughly 2 hours 40 minutes for 1,200 training rows. Peak VRAM stays around 18GB, leaving headroom for a longer max_seq_length or a larger effective batch. If your card has 16GB, drop max_seq_length to 2048 and set per_device_batch_size to 1, then bump gradient_accumulation_steps to 8 to keep the same effective batch.
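
The arithmetic behind that advice is worth spelling out, since most OOM tuning amounts to holding the effective batch constant while trading sequence length against per-device batch; the numbers below are the 24GB defaults and the 16GB adjustment from the paragraph above.

# effective batch = per_device_batch_size x gradient_accumulation_steps (single GPU)
# 24GB card, recipe defaults: 2 x 4 = 8 at max_seq_length 4096
# 16GB card, adjusted:        1 x 8 = 8 at max_seq_length 2048
per_device, grad_accum, seq_len = 1, 8, 2048
effective_batch = per_device * grad_accum
tokens_per_step = effective_batch * seq_len   # upper bound per optimizer step with packing on
print(effective_batch, tokens_per_step)       # 8 16384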

Step 5: Evaluate, Merge, and Export

When the run finishes you have an adapter directory under runs/sql-assistant-qlora/checkpoint-final. Two things to do before you ship it. First, eyeball generations against the eval split with the built-in training-hub eval command. Second, merge the adapter back into the base weights so downstream serving frameworks do not need to know about LoRA.

# 1. Quick sanity check on the eval split
training-hub eval runs/sql-assistant-qlora/checkpoint-final \
  --data data/eval.jsonl \
  --max-new-tokens 256 \
  --num-samples 20

# 2. Merge LoRA into the base weights and save as 16-bit safetensors
training-hub merge runs/sql-assistant-qlora/checkpoint-final \
  --output runs/sql-assistant-merged \
  --dtype bfloat16

# 3. Export a 4-bit GGUF for llama.cpp / Ollama serving
training-hub export runs/sql-assistant-merged \
  --format gguf \
  --quant Q4_K_M \
  --output runs/sql-assistant.Q4_K_M.gguf

The Q4_K_M GGUF lands at roughly 4.6GB for an 8B model and runs comfortably on CPU-only laptops at 8 to 12 tokens per second. If you plan to serve with vLLM instead, skip the GGUF export and point vLLM at the merged directory directly. vLLM picks up the safetensors and tokenizer.json without further configuration.
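
For the vLLM path, a minimal offline-inference sketch looks like the following; it assumes vLLM is installed in a separate serving environment and that runs/sql-assistant-merged holds the merged safetensors and tokenizer from the merge step.

# serve_check.py (hypothetical helper) - load the merged checkpoint with vLLM and generate once
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tok = AutoTokenizer.from_pretrained('runs/sql-assistant-merged')
llm = LLM(model='runs/sql-assistant-merged', dtype='bfloat16', max_model_len=4096)

messages = [
    {'role': 'system', 'content': 'You are a senior data engineer. Reply with one Postgres query.'},
    {'role': 'user', 'content': 'Top 5 customers by revenue in Q1 2026.'},
]
# Render the chat template the same way training did, then sample greedily
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=256))
print(out[0].outputs[0].text)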

Common Pitfalls and How to Avoid Them

Loss going to NaN in the first 50 steps. Almost always a learning rate issue when target_modules is all-linear. Drop the LR to 1e-4 and re-enable warmup_ratio: 0.05. If it still blows up, your dataset has rows with unusually long assistant turns; cap max_seq_length explicitly.

OOM at the start of training, not during. The culprit is usually packing: true combined with a high max_seq_length. Packing concatenates short examples into a single sequence to avoid wasted padding, which raises peak memory. Either lower max_seq_length to 2048 or set packing: false and accept slower throughput.

Adapter looks fine in eval but the merged model is worse. This is almost always a tokenizer mismatch. The merge step copies the tokenizer from the base model; if your training data used special tokens you added at fine-tune time, you must save the modified tokenizer alongside the adapter and pass --tokenizer to the merge command.
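
If you did add special tokens at fine-tune time, the plain-transformers pattern below (a sketch, and the token names are made up) keeps the modified tokenizer next to the adapter so the merge step can be pointed at it with --tokenizer.

# Persist a tokenizer with added special tokens alongside the adapter checkpoint
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct')
added = tok.add_special_tokens({'additional_special_tokens': ['<|sql|>', '<|/sql|>']})  # hypothetical tokens
print(f'added {added} special tokens, vocab is now {len(tok)}')

# If you do this, also resize the embedding matrix before training, e.g.
# model.resize_token_embeddings(len(tok)) on the underlying transformers model.
tok.save_pretrained('runs/sql-assistant-qlora/checkpoint-final')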

GGUF generations contain literal '<|begin_of_text|>' tokens. The Llama 3 chat template needs to be embedded in the GGUF metadata. Add --chat-template llama3 to the export command and the special tokens get rendered, not printed.
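
One quick way to confirm the template fix took is to load the exported GGUF with llama-cpp-python and run a single chat completion; the sketch below assumes llama-cpp-python is installed and passes its built-in llama-3 chat handler explicitly, so the check does not depend on the embedded metadata.

# Sanity check: special tokens should be consumed by the template, not echoed in the output
from llama_cpp import Llama

llm = Llama(model_path='runs/sql-assistant.Q4_K_M.gguf', n_ctx=4096, chat_format='llama-3')
resp = llm.create_chat_completion(
    messages=[
        {'role': 'system', 'content': 'You are a senior data engineer. Reply with one Postgres query.'},
        {'role': 'user', 'content': 'Top 5 customers by revenue in Q1 2026.'},
    ],
    max_tokens=256,
)
print(resp['choices'][0]['message']['content'])    # should be SQL, with no literal '<|begin_of_text|>'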

Quick Reference: 2026 QLoRA Defaults

Setting                 | Value               | Why
LoRA rank (r)           | 16                  | Best ROI; raise to 32 only if you have 50k+ examples
LoRA alpha              | 16                  | Match r so scaling factor stays near 1
target_modules          | all-linear          | Beats q_proj/v_proj only by ~3 pts on instruct evals
DoRA                    | true                | +5% wall time, ~2 pts on benchmarks at same r
Quantization            | NF4 (load_in_4bit)  | 75% memory savings, negligible quality loss
Optimizer               | adamw_8bit          | Saves another ~3GB optimizer state vs fp32
Learning rate           | 2e-4                | QLoRA tolerates higher LR than full FT
Epochs                  | 2-3                 | More than 3 usually overfits instruction data
Effective batch         | 8-16                | Use grad accumulation to reach this
Gradient checkpointing  | unsloth             | Custom kernel saves ~5GB at 30% speed cost

Where to Go Next

Once the single-GPU recipe works for your task, scaling up is mostly a configuration change. Training Hub supports both data parallelism (more samples per step) and model parallelism (splitting a single forward pass across GPUs) by setting accelerator.num_gpus and accelerator.strategy in the recipe. For multi-node runs, point at a SLURM or Ray cluster the same way.

If your task involves reasoning rather than style or knowledge, swap the SFT recipe for a GRPO recipe; Training Hub uses the same backend abstraction, so you keep your dataset format and only change the training stanza. And if you need to chain SFT, reward modeling, and preference tuning into one pipeline, the recipe DAG support landed in v0.4.0 specifically for that.


The headline number is real: a single 24GB GPU, an afternoon, and a 1k-row dataset is now enough to ship a domain-tuned 8B model into production. The hard part has shifted from infrastructure to dataset quality, which is where it belonged all along.
