How I Built a Scientific LLM, Trained It for ~$200, and Hit ~878 tok/s on Consumer GPUs

10 Nov, 2025

End-to-end distillation, deployment, and inference R&D on dual RTX 4090s

TL;DR: In about two weeks I stood up my training/inference platform (Nexa_Compute), distilled a 10B scientific model from ~100k GPT-4 Q&A pairs for about $200, and then benchmarked it on a dual-RTX 4090 box using vLLM. The best bf16 config (no quantization) reached ~878 tokens/sec, short of my 1,000 tok/s goal, with ~0.29 s average latency. The results line up with what practitioners report online: batching is king, context length kills throughput, and TP=1 beats TP=2 on PCIe systems when the model fits on one GPU.

Why I Built This

I wanted a scientific assistant that thinks like a researcher: clear SOPs, real instruments, reproducible steps, no hallucinated science. After distilling Allanatrix/Nexa_Sci_distilled_Falcon-10B, the next question was simple:

How fast can I run it on commodity hardware?

What I Stood Up

Nexa_Compute — my training + deployment backbone (distillation, evals, artifact tracking)
Model — Allanatrix/Nexa_Sci_distilled_Falcon-10B (10B, distilled from GPT-4 Q&A)
Data — ~100k scientific Q&A pairs (chemistry, bio, materials, quantum, methods)
Cost — end-to-end around $200 (Q&A gen → train/distill → evals)
Timeline — ~2 weeks (and v2 is already shipping)

Hardware & Tools

Component	Spec
GPUs	2 × RTX 4090 (24 GB each)
Inference Engine	vLLM
Precision	bf16 (no quantization)
Workload	Long-form scientific Q&A generation
Goal	≥ 1,000 tokens/second

What Others Report (R&D From the Wild)

Patterns that kept showing up across docs, papers, and community posts:

Batch size > FLOPs — memory and parallelism dominate.
KV cache dominates VRAM — sequence length is the silent tax.
TP=1 > TP=2 on PCIe — cross-GPU attention eats any gains without NVLink.
Quantization unlocks concurrency — smaller weights → more KV → bigger batches.
OOM is normal when exploring the frontier — pushing VRAM is how you find the ceiling.

A few representative sources:

vLLM performance guidance on gpu_memory_utilization and batching
https://docs.vllm.ai/en/latest/performance/optimization.html
“Throughput landscapes are irregular” (small hyperparam changes → large perf jumps)
https://arxiv.org/abs/2408.01050
Community inference notes on minimal multi-GPU gains without NVLink
https://www.reddit.com/r/LocalLLaMA/

Shortish context + big batch + TP=1 + high memory utilization looked like the winning recipe — and that’s exactly what my results showed.

Benchmarks (bf16)

Config	TP	Max Seq	Max New	Batch	GPU Mem Util	Tokens/s	Avg Latency
`tp2_len1536_batch4`	2	1536	192	4	0.88	237.9	0.807 s
`tp1_len1536_batch6`	1	1536	256	6	0.95	244.6	1.047 s
`tp1_len1024_batch24`	1	1024	256	24	0.93	878.0	0.292 s
`tp1_len768_batch32`	1	768	192	12	0.88	478.5	0.401 s
`tp1_len2048_batch4_long`	1	2048	1024	4	0.95	148.0	6.149 s

No quantization in these runs; all bf16.
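
A quick sanity check on the table: for the four short-output rows, the reported average latency is very close to max new tokens divided by aggregate tokens/s. That pattern suggests the latency column is batch wall time amortized per request, but this is an inference from the numbers, not a documented definition of the metric. A minimal sketch of the check, with rows copied from the table:

```python
# Rows copied from the benchmark table:
# (config, max_new_tokens, tokens_per_s, reported_avg_latency_s)
ROWS = [
    ("tp2_len1536_batch4", 192, 237.9, 0.807),
    ("tp1_len1536_batch6", 256, 244.6, 1.047),
    ("tp1_len1024_batch24", 256, 878.0, 0.292),
    ("tp1_len768_batch32", 192, 478.5, 0.401),
    ("tp1_len2048_batch4_long", 1024, 148.0, 6.149),
]


def amortized_latency(max_new_tokens: int, tokens_per_s: float) -> float:
    """Seconds per request, assuming every request emits max_new_tokens.

    batch * max_new_tokens / tokens_per_s is the batch wall time; dividing
    by batch again gives the per-request share, so batch cancels out.
    """
    return max_new_tokens / tokens_per_s


for name, max_new, tps, reported in ROWS:
    est = amortized_latency(max_new, tps)
    print(f"{name:26s} estimated {est:6.3f} s   reported {reported:6.3f} s")
```

The long-generation row estimates higher than reported (about 6.9 s against 6.149 s), which would be consistent with some sequences stopping before 1,024 tokens.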

What Controlled Performance

1) Tensor Parallelism (TP)

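TP=1 outperformed TP=2 in my runs because the 10B model fits in 24 GB at bf16 (about 20 GB of weights), and tensor-parallel sharding adds an all-reduce between the GPUs in every layer, which on this box travels over PCIe rather than NVLink. If it fits, keep it on one GPU.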

2) Context Length

Longer max_model_len crushed throughput by expanding the KV cache footprint and shrinking effective batch.
Example: 2,048-token context + 1,024 new tokens with batch=4 → 148 tok/s and ~6.1 s latency.
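
To make the KV cost concrete, here is a rough sizing sketch. The layer count, KV head count, and head dimension below are illustrative assumptions, not values read from this model; take the real ones from the checkpoint's config.json.

```python
from dataclasses import dataclass


@dataclass
class AttnShape:
    # Illustrative values only; replace with the model's real config.
    num_layers: int = 40
    num_kv_heads: int = 4
    head_dim: int = 256
    bytes_per_elem: int = 2  # bf16


def kv_bytes_per_token(s: AttnShape) -> int:
    # One key vector and one value vector per KV head, per layer.
    return 2 * s.num_layers * s.num_kv_heads * s.head_dim * s.bytes_per_elem


def kv_gib_per_sequence(s: AttnShape, seq_len: int) -> float:
    # Upper bound: assumes the sequence fills its whole context window.
    return kv_bytes_per_token(s) * seq_len / 2**30


shape = AttnShape()
for seq_len in (768, 1024, 1536, 2048):
    gib = kv_gib_per_sequence(shape, seq_len)
    print(f"{seq_len:5d} tokens -> {gib:.3f} GiB per sequence")
```

Per-sequence KV grows linearly with context, so at a fixed memory budget, doubling the context roughly halves the number of sequences that fit. vLLM allocates KV in fixed-size blocks, so real usage rounds up a little.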

3) Batch Size

Batch is king. Once the KV cache had headroom for ~24 concurrent sequences, throughput jumped to ~878 tok/s with ~0.29 s latency. This mirrors public reports where strong results only appear with 16–64+ way batching.
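
A batch-size sweep needs only a small timing harness. The sketch below is one possible shape for it, not the harness used for the table: `generate` is a hypothetical adapter that takes a batch of prompts and returns how many tokens each one produced, and the commented block shows how it might be wired to vLLM's offline `LLM` API.

```python
import time
from typing import Callable, Sequence


def measure_throughput(
    generate: Callable[[Sequence[str]], Sequence[int]],
    prompts: Sequence[str],
) -> dict:
    """Time one batched call and report aggregate tokens/s."""
    start = time.perf_counter()
    token_counts = generate(prompts)
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    return {
        "batch": len(prompts),
        "total_tokens": total,
        "elapsed_s": elapsed,
        "tokens_per_s": total / elapsed if elapsed > 0 else float("inf"),
    }


# Possible vLLM wiring (needs a GPU, so it is left as a comment):
#
#   from vllm import LLM, SamplingParams
#   llm = LLM(model="Allanatrix/Nexa_Sci_distilled_Falcon-10B",
#             dtype="bfloat16", tensor_parallel_size=1,
#             gpu_memory_utilization=0.93, max_model_len=1024,
#             max_num_seqs=24)
#   params = SamplingParams(max_tokens=256)
#
#   def generate(prompts):
#       outs = llm.generate(list(prompts), params)
#       return [len(o.outputs[0].token_ids) for o in outs]


def fake_generate(prompts: Sequence[str]) -> Sequence[int]:
    # Stand-in so the sketch runs without a GPU.
    time.sleep(0.01)
    return [256] * len(prompts)


print(measure_throughput(fake_generate, ["q"] * 24))
```

Counting generated tokens from the outputs, rather than assuming every request hits the cap, keeps the number honest when sequences stop early.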

4) vLLM’s Memory Posture

vLLM aggressively reserves VRAM (paged KV, continuous batching, CUDA graphs). That’s by design. During sweeps I occasionally cleaned up between runs:

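import gc
import torch
gc.collect()  # drop unreachable Python objects that still hold tensors
torch.cuda.empty_cache()  # return cached, unused blocks to the driver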

A reboot after heavy KV scenarios also helped — expected when you sit on the frontier.
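
A lighter alternative to rebooting is to run each config in its own process, since the driver releases a process's CUDA context when it exits. A minimal sketch of that idea follows; `bench.py` and its `--config` flag are placeholders, not part of Nexa_Compute.

```python
import subprocess
import sys
from typing import Sequence


def run_isolated(cmd: Sequence[str], timeout_s: float = 3600.0) -> int:
    """Run one benchmark config in a child process and return its exit code.

    When the child exits, its GPU allocations go with it, so nothing from
    this run lingers in VRAM for the next one.
    """
    try:
        return subprocess.run(list(cmd), timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return -1  # treat a hung run as a failure and move on


# Hypothetical sweep: script name and flag are placeholders.
configs = ["tp1_len1024_batch24", "tp1_len768_batch32"]
commands = [[sys.executable, "bench.py", "--config", name] for name in configs]
# for cmd in commands:
#     print(cmd[-1], "exit code:", run_isolated(cmd))
```

This does not help if the GPU itself is wedged, which is when a driver reset or reboot is still needed.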

Dashboard & Artifacts

Every run emits *_metrics.json and *_responses.parquet
A Streamlit dashboard loads both for:
- throughput/latency charts
- Q&A side-by-side
- truncation visibility
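
Loading the metrics side can be as small as the sketch below. It assumes each *_metrics.json holds a single JSON object, and `tokens_per_s` is a hypothetical field name; match both to the real schema.

```python
import json
from pathlib import Path
from typing import List, Optional

SUFFIX = "_metrics.json"


def load_metrics(run_dir: str) -> List[dict]:
    """Collect every *_metrics.json in run_dir, tagged with its run name."""
    rows = []
    for path in sorted(Path(run_dir).glob("*" + SUFFIX)):
        with path.open() as f:
            record = json.load(f)
        record["run"] = path.name[: -len(SUFFIX)]
        rows.append(record)
    return rows


def best_run(rows: List[dict], key: str = "tokens_per_s") -> Optional[dict]:
    # Skip records that lack the field instead of failing on them.
    scored = [r for r in rows if key in r]
    return max(scored, key=lambda r: r[key]) if scored else None
```

The returned list of dicts can be handed straight to a dataframe for charting.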

If you tunnel the dashboard, bind to localhost: streamlit run app.py --server.address 127.0.0.1

Cost & Timeline (Reality Check)

Infra: stood up Nexa_Compute for training + serving
Total spend: ~$200
Time: ~2 weeks
Pipeline: data gen → distill → eval → deploy → benchmark → document
Status: v2 already out

What I’ll Test Next

INT4/AWQ quantization should free weight memory and widen KV headroom; I expect it to push batch 24 → 64+ and throughput past 1k tok/s, but that is untested so far.
Automated GPU resets / isolation — cleaner TP=2 stress tests for long contexts.
Rubric-based quality scoring — correlate scientific correctness with speed inside the dashboard.
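
The case for quantization is mostly arithmetic. The sketch below estimates weight footprint and the memory left for KV cache on one card. It assumes a round 10e9 parameters and ignores activations, CUDA graphs, and quantization metadata, so treat the outputs as rough bounds rather than measurements.

```python
GIB = 2**30


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Weight footprint in GiB, ignoring quantization scales and zero points."""
    return num_params * bits_per_param / 8 / GIB


def kv_budget_gib(vram_gib: float, mem_util: float, weights_gib: float) -> float:
    """Memory left for KV cache after weights, clamped at zero."""
    return max(vram_gib * mem_util - weights_gib, 0.0)


NUM_PARAMS = 10e9  # assumed round figure for a "10B" model
VRAM_GIB = 24.0    # one RTX 4090
MEM_UTIL = 0.93    # as in the fastest bf16 run

for label, bits in (("bf16", 16), ("int4", 4)):
    w = weight_gib(NUM_PARAMS, bits)
    budget = kv_budget_gib(VRAM_GIB, MEM_UTIL, w)
    print(f"{label}: weights {w:5.2f} GiB, KV budget {budget:5.2f} GiB")
```

Real AWQ checkpoints carry per-group scales and zero points, so the 4-bit weight figure is a lower bound.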

Key Takeaways

VRAM matters more than FLOPs for LLM inference.
Batch beats brute force.
TP=1 > TP=2 on PCIe consumer cards if the model fits.
Context length is the silent throughput killer.
With sane engineering, a 10B scientific model can be both accurate and fast on gaming GPUs.

Appendix: Example vLLM Flags (bf16, decode-heavy)

vllm serve Allanatrix/Nexa_Sci_distilled_Falcon-10B \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 1024 \
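  --max-num-seqs 24

The 256-token output cap from the benchmark is not a server flag: vllm serve has no --max-new-tokens option, so set max_tokens on each request instead. For long contexts, expect lower throughput; for INT4, lift max_num_seqs and consider --max-num-batched-tokens 6144–8192.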
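
In vLLM's OpenAI-compatible API the cap on generated tokens is a request-level field (max_tokens), so the client carries it. A minimal sketch, assuming a server started with the command above is listening on the default port 8000; the prompt is only an example.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    max_tokens: int = 256,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST to the server's OpenAI-compatible /v1/completions route."""
    payload = {
        "model": "Allanatrix/Nexa_Sci_distilled_Falcon-10B",
        "prompt": prompt,
        "max_tokens": max_tokens,  # per-request cap on generated tokens
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request("Outline an SOP for calibrating a pH meter.")
# With the server running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp)["choices"][0]["text"])
```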

The full dashboard, metrics, and evals are coming soon: Streamlit Dashboard