How I Built a Scientific LLM, Trained It for ~$200, and Hit ~878 tok/s on Consumer GPUs
End-to-end distillation, deployment, and inference R&D on dual RTX 4090s
10 Nov, 2025
TL;DR: In ~two weeks I stood up my training/inference platform (Nexa_Compute), distilled a 10B scientific model from ~100k GPT-4 Q&A pairs for about $200, and then benchmarked it on a dual-RTX 4090 box using vLLM. The best bf16 config (no quant) reached ~878 tokens/sec with ~0.29 s average latency. The results line up with what practitioners report online: batching is king, context length kills throughput, and TP=1 beats TP=2 on PCIe systems.
Why I Built This
I wanted a scientific assistant that thinks like a researcher: clear SOPs, real instruments, reproducible steps, no hallucinated science. After distilling Allanatrix/Nexa_Sci_distilled_Falcon-10B, the next question was simple:
How fast can I run it on commodity hardware?
What I Stood Up
- Nexa_Compute — my training + deployment backbone (distillation, evals, artifact tracking)
- Model — Allanatrix/Nexa_Sci_distilled_Falcon-10B (10B, distilled from GPT-4 Q&A)
- Data — ~100k scientific Q&A pairs (chemistry, bio, materials, quantum, methods)
- Cost — end-to-end around $200 (Q&A gen → train/distill → evals)
- Timeline — ~2 weeks (and v2 is already shipping)
Hardware & Tools
| Component | Spec |
|---|---|
| GPUs | 2 × RTX 4090 (24 GB each) |
| Inference Engine | vLLM |
| Precision | bf16 (no quantization) |
| Workload | Long-form scientific Q&A generation |
| Goal | ≥ 1,000 tokens/second |
What Others Report (R&D From the Wild)
Patterns that kept showing up across docs, papers, and community posts:
- Batch size > FLOPs — memory and parallelism dominate.
- KV cache dominates VRAM — sequence length is the silent tax.
- TP=1 > TP=2 on PCIe — cross-GPU attention eats any gains without NVLink.
- Quantization unlocks concurrency — smaller weights → more KV → bigger batches.
- OOM is normal when exploring the frontier — pushing VRAM is how you find the ceiling.
A few representative sources:
- vLLM performance guidance on gpu_memory_utilization and batching
  https://docs.vllm.ai/en/latest/performance/optimization.html
- “Throughput landscapes are irregular” (small hyperparam changes → large perf jumps)
  https://arxiv.org/abs/2408.01050
- Community inference notes on minimal multi-GPU gains without NVLink
  https://www.reddit.com/r/LocalLLaMA/
Shortish context + big batch + TP=1 + high memory utilization looked like the winning recipe — and that’s exactly what my results showed.
Benchmarks (bf16)
| Config | TP | Max Seq | Max New | Batch | GPU Util | Tokens/s | Avg Latency |
|---|---|---|---|---|---|---|---|
| tp2_len1536_batch4 | 2 | 1536 | 192 | 4 | 0.88 | 237.9 | 0.807 s |
| tp1_len1536_batch6 | 1 | 1536 | 256 | 6 | 0.95 | 244.6 | 1.047 s |
| tp1_len1024_batch24 | 1 | 1024 | 256 | 24 | 0.93 | 878.0 | 0.292 s |
| tp1_len768_batch32 | 1 | 768 | 192 | 12 | 0.88 | 478.5 | 0.401 s |
| tp1_len2048_batch4_long | 1 | 2048 | 1024 | 4 | 0.95 | 148.0 | 6.149 s |
No quantization in these runs; all bf16.
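One quick way to read the table is to divide aggregate tokens/s by the batch size. The sketch below does that with the rows above, assuming the Batch column is the number of concurrent sequences and Tokens/s is the aggregate across them (the tp1_len768_batch32 row uses the table's Batch value of 12). Per-sequence rates land in a fairly narrow band, roughly 37 to 59 tok/s, so most of the spread in the totals comes from how many sequences were in flight.

```python
# Rows copied from the benchmark table: (config, tp, batch, tokens_per_s)
RUNS = [
    ("tp2_len1536_batch4", 2, 4, 237.9),
    ("tp1_len1536_batch6", 1, 6, 244.6),
    ("tp1_len1024_batch24", 1, 24, 878.0),
    ("tp1_len768_batch32", 1, 12, 478.5),
    ("tp1_len2048_batch4_long", 1, 4, 148.0),
]


def per_sequence_rate(tokens_per_s, batch):
    """Aggregate throughput divided by the number of concurrent sequences."""
    return tokens_per_s / batch


def summarize(runs):
    """Return (config, total tok/s, per-sequence tok/s), fastest total first."""
    rows = [
        (name, tps, per_sequence_rate(tps, batch))
        for name, _tp, batch, tps in runs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, total, per_seq in summarize(RUNS):
    print(f"{name:26s} {total:7.1f} tok/s total  {per_seq:5.1f} tok/s per sequence")
```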
What Controlled Performance
1) Tensor Parallelism (TP)
TP=1 outperformed TP=2 consistently because Falcon-10B fits in 24 GB, and sharding a model that already fits only adds cross-GPU synchronization (an all-reduce in every layer) over PCIe, since the RTX 4090 has no NVLink. If it fits, keep it on one GPU.
2) Context Length
Longer max_model_len crushed throughput by expanding the KV cache footprint and shrinking effective batch.
Example: 2,048-token context + 1,024 new tokens with batch=4 → 148 tok/s and ~6.1 s latency.
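To see why context length bites so hard, here is a back-of-the-envelope KV cache estimate. This is a sketch, not a measurement: the layer count, KV head count, head dimension, and the 4 GiB KV budget are hypothetical placeholders (read the real values from the model's config), and vLLM allocates KV in fixed-size blocks, so real capacity will differ somewhat.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys + values, for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(kv_budget_bytes, tokens_per_sequence, bytes_per_token):
    """How many full-length sequences fit in a given KV budget."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)


# Hypothetical architecture values, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
# Hypothetical KV budget left after weights and runtime overhead.
KV_BUDGET = 4 * 1024**3

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)  # bf16 = 2 bytes
for context_len in (768, 1024, 1536, 2048):
    fits = max_concurrent_sequences(KV_BUDGET, context_len, per_token)
    print(f"context {context_len:5d}: ~{fits} full-length sequences")
```

Doubling the context halves the number of sequences that fit in the same budget, which is the "shrinking effective batch" effect described above.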
3) Batch Size
Batch is king. Once the KV cache had headroom for ~24 concurrent sequences, throughput jumped to ~878 tok/s with ~0.29 s average latency. This mirrors public reports where strong results only appear with 16–64+ way batching.
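For anyone reproducing numbers like these, a minimal timing harness might look like the following. It is a sketch under assumptions: `generate` is a hypothetical callable that takes a list of prompts and returns the generated token count per prompt, and average latency is defined as batch wall time divided by the number of sequences. That definition is an inference on my part, but it is consistent with the table: 24 × 256 tokens at 878 tok/s is about 7.0 s per batch, and 7.0 / 24 ≈ 0.292 s.

```python
import time


def measure_throughput(generate, prompts, clock=time.perf_counter):
    """Time one batched generate() call and derive aggregate stats.

    `generate` is a hypothetical interface: it takes a list of prompts and
    returns the number of generated tokens for each prompt.
    """
    start = clock()
    token_counts = list(generate(prompts))
    elapsed = clock() - start
    total_tokens = sum(token_counts)
    return {
        "tokens": total_tokens,
        "seconds": elapsed,
        "tokens_per_s": total_tokens / elapsed,
        # Assumed definition: wall time amortized over the batch.
        "avg_latency_s": elapsed / len(prompts),
    }


def fake_generate(prompts, new_tokens=256):
    """Stand-in for a real engine so the harness runs without a GPU."""
    time.sleep(0.05)
    return [new_tokens for _ in prompts]


stats = measure_throughput(fake_generate, ["placeholder prompt"] * 24)
print(stats)
```

With vLLM's offline API, `generate` could wrap `LLM.generate(prompts, SamplingParams(max_tokens=256))` and return `len(out.outputs[0].token_ids)` for each result.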
4) vLLM’s Memory Posture
vLLM aggressively reserves VRAM (paged KV, continuous batching, CUDA graphs). That’s by design. During sweeps I occasionally cleaned up between runs:
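import gc
import torch
gc.collect()  # drop unreachable Python objects that still hold tensors
torch.cuda.empty_cache()  # return PyTorch's cached, unused CUDA blocks to the driver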
A reboot after heavy KV scenarios also helped — expected when you sit on the frontier.
Dashboard & Artifacts
Every run emits *_metrics.json and *_responses.parquet. A Streamlit dashboard loads both for:
- throughput/latency charts
- Q&A side-by-side
- truncation visibility
If you tunnel the dashboard, bind to localhost:
streamlit run app.py --server.address 127.0.0.1
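Outside the dashboard, the metrics files are easy to scan with the standard library. The sketch below is illustrative only: the `runs` directory and the `tokens_per_s` key are hypothetical, since the actual schema of *_metrics.json is not shown here, so adjust both to whatever your runs write.

```python
import json
from pathlib import Path

SUFFIX = "_metrics.json"


def load_metrics(run_dir):
    """Read every *_metrics.json in run_dir into a list of dicts."""
    rows = []
    for path in sorted(Path(run_dir).glob(f"*{SUFFIX}")):
        with path.open() as handle:
            data = json.load(handle)
        # Tag each record with the run name taken from the file name.
        data["run"] = path.name[: -len(SUFFIX)]
        rows.append(data)
    return rows


def best_run(rows, key="tokens_per_s"):
    """Return the record with the highest `key` (hypothetical field name)."""
    candidates = [row for row in rows if key in row]
    return max(candidates, key=lambda row: row[key]) if candidates else None
```

Calling `best_run(load_metrics("runs"))` returns the fastest run's record, or `None` if no file carries that key.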
Cost & Timeline (Reality Check)
- Infra: stood up Nexa_Compute for training + serving
- Total spend: ~$200
- Time: ~2 weeks
- Pipeline: data gen → distill → eval → deploy → benchmark → document
- Status: v2 already out
What I’ll Test Next
- INT4/AWQ quantization — frees weight memory, increases KV headroom, and should push batch 24 → 64+ (and throughput past 1k tok/s).
- Automated GPU resets / isolation — cleaner TP=2 stress tests for long contexts.
- Rubric-based quality scoring — correlate scientific correctness with speed inside the dashboard.
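The case for quantization is mostly arithmetic. A rough sketch, assuming exactly 10B parameters, a 24 GB card, and the 0.93 memory utilization from the best run; it ignores quantization scales and zero-points, activations, and CUDA graph overhead, so treat the output as an upper bound on KV headroom.

```python
def weight_gb(num_params, bits_per_weight):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def kv_headroom_gb(vram_gb, mem_utilization, weights_gb, overhead_gb=0.0):
    """Memory left for KV cache once weights and other overhead are placed."""
    return vram_gb * mem_utilization - weights_gb - overhead_gb


PARAMS = 10e9  # assumption: a nominal 10B parameter count
for label, bits in (("bf16", 16), ("int4", 4)):
    weights = weight_gb(PARAMS, bits)
    headroom = kv_headroom_gb(24, 0.93, weights)
    print(f"{label}: weights ~{weights:.1f} GB, KV headroom ~{headroom:.1f} GB")
```

Under these assumptions bf16 leaves a little over 2 GB for KV while INT4 leaves about 17 GB, which is where the extra concurrency would come from.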
Key Takeaways
- VRAM matters more than FLOPs for LLM inference.
- Batch beats brute force.
- TP=1 > TP=2 on PCIe consumer cards if the model fits.
- Context length is the silent throughput killer.
- With sane engineering, a 10B scientific model can be both accurate and fast on gaming GPUs.
Appendix: Example vLLM Flags (bf16, decode-heavy)
vllm serve Allanatrix/Nexa_Sci_distilled_Falcon-10B \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--gpu-memory-utilization 0.93 \
--max-model-len 1024 \
--max-num-seqs 24
Output length is set per request (max_tokens, 256 in the best run), not on the server. For long contexts, expect lower throughput; for INT4, raise --max-num-seqs and consider --max-num-batched-tokens in the 6144–8192 range.
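On the client side, vLLM's server speaks an OpenAI-compatible API, and output length is a per-request field (`max_tokens`). Here is a minimal standard-library sketch; the base URL assumes the server's default port 8000 on localhost, and the helper names are my own for illustration.

```python
import json
import urllib.request

MODEL = "Allanatrix/Nexa_Sci_distilled_Falcon-10B"


def build_completion_request(prompt, max_tokens=256, model=MODEL,
                             base_url="http://127.0.0.1:8000"):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # generation cap lives in the request
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from the flags above running, `complete("Outline an SOP for calibrating a pH meter.")` would return the model's text; nothing is sent until `complete` is called.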
Full dashboard, metrics, and evals coming soon: Streamlit Dashboard