How I Built a Scientific LLM, Trained It for ~$200, and Hit ~878 tok/s on Consumer GPUs

End-to-end distillation, deployment, and inference R&D on dual RTX 4090s

10 Nov, 2025

TL;DR: In ~two weeks I stood up my training/inference platform (Nexa_Compute), distilled a 10B scientific model from ~100k GPT-4 Q&A pairs for about $200, and then benchmarked it on a dual-RTX 4090 box using vLLM. The best bf16 config (no quant) reached ~878 tokens/sec with ~0.29 s average latency. The results line up with what practitioners report online: batching is king, context length kills throughput, and TP=1 beats TP=2 on PCIe systems.


Why I Built This

I wanted a scientific assistant that thinks like a researcher: clear SOPs, real instruments, reproducible steps, no hallucinated science. After distilling Allanatrix/Nexa_Sci_distilled_Falcon-10B, the next question was simple:

How fast can I run it on commodity hardware?


What I Stood Up


Hardware & Tools

Component          Spec
GPUs               2 × RTX 4090 (24 GB each)
Inference Engine   vLLM
Precision          bf16 (no quantization)
Workload           Long-form scientific Q&A generation
Goal               1,000 tokens/second

What Others Report (R&D From the Wild)

Patterns that kept showing up across docs, papers, and community posts:

A few representative sources:

Shortish context + big batch + TP=1 + high memory utilization looked like the winning recipe — and that’s exactly what my results showed.


Benchmarks (bf16)

Config                    TP  Max Seq  Max New  Batch  GPU Mem Util  Tokens/s  Avg Latency (s)
tp2_len1536_batch4        2   1536     192      4      0.88          237.9     0.807
tp1_len1536_batch6        1   1536     256      6      0.95          244.6     1.047
tp1_len1024_batch24       1   1024     256      24     0.93          878.0     0.292
tp1_len768_batch32        1   768      192      12     0.88          478.5     0.401
tp1_len2048_batch4_long   1   2048     1024     4      0.95          148.0     6.149

No quantization in these runs; all bf16.
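To make the comparison easier to read, here is a small sketch that picks the fastest run and expresses it relative to the TP=2 baseline and the 1,000 tok/s goal. The numbers are copied from the table above; the helper names (`best_run`, `speedup_over`) are made up for this example.

```python
# Rows copied from the benchmark table: (config, tokens_per_s, avg_latency_s)
RUNS = [
    ("tp2_len1536_batch4", 237.9, 0.807),
    ("tp1_len1536_batch6", 244.6, 1.047),
    ("tp1_len1024_batch24", 878.0, 0.292),
    ("tp1_len768_batch32", 478.5, 0.401),
    ("tp1_len2048_batch4_long", 148.0, 6.149),
]


def best_run(runs):
    """Return the row with the highest tokens/s."""
    return max(runs, key=lambda row: row[1])


def speedup_over(runs, baseline_name):
    """Ratio of the best run's tokens/s to the named baseline's tokens/s."""
    baseline = next(row for row in runs if row[0] == baseline_name)
    return best_run(runs)[1] / baseline[1]


if __name__ == "__main__":
    name, tps, latency = best_run(RUNS)
    print(f"best: {name} at {tps} tok/s ({latency} s avg latency)")
    print(f"vs TP=2 baseline: {speedup_over(RUNS, 'tp2_len1536_batch4'):.2f}x")
    print(f"share of the 1,000 tok/s goal: {tps / 1000:.0%}")
```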


What Controlled Performance

1) Tensor Parallelism (TP)

TP=1 consistently outperformed TP=2: the 10B model fits on a single 24 GB card in bf16, and sharding it across two GPUs adds all-reduce traffic over PCIe in every layer. If it fits, keep it on one GPU.
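The "does it fit" check is simple arithmetic, sketched below. It assumes a nominal 10B parameter count (the exact count is not stated here) and 2 bytes per parameter for bf16, and it only covers weights; the KV cache and activations have to live in whatever headroom is left.

```python
# Bytes per parameter for common precisions.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4, "int8": 1}


def weight_gb(n_params, dtype="bf16"):
    """Weight footprint in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_one_gpu(n_params, vram_gb, dtype="bf16"):
    """True if the weights alone are smaller than the card's nominal VRAM."""
    return weight_gb(n_params, dtype) < vram_gb


if __name__ == "__main__":
    weights = weight_gb(10e9)  # nominal 10B parameters (assumption)
    print(f"weights: {weights:.1f} GB, headroom on a 24 GB card: {24 - weights:.1f} GB")
    print("fits on one GPU:", fits_on_one_gpu(10e9, 24))
```

Roughly 20 GB of weights on a 24 GB card leaves only a few GB for everything else, which is why the KV cache budget dominates the rest of the tuning.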

2) Context Length

A longer max_model_len crushed throughput: each sequence can claim more KV cache, so fewer sequences fit in a batch.
Example: 2,048-token context + 1,024 new tokens with batch=4 → 148 tok/s and ~6.1 s latency.
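The trade-off between context length and batch size can be estimated with the standard KV cache formula. This is a rough sketch: the layer count, KV head count, head dimension, and the 2 GiB cache budget below are hypothetical placeholders, not the real values for this model or box, and vLLM's paged allocator only hands out blocks as sequences grow, so this is a worst-case bound.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token: keys and values (the factor of 2) in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_seqs(kv_budget_bytes, seq_len, per_token_bytes):
    """How many full-length sequences fit in the KV budget (worst case)."""
    return kv_budget_bytes // (seq_len * per_token_bytes)


if __name__ == "__main__":
    # Hypothetical architecture and budget, for illustration only.
    per_token = kv_bytes_per_token(n_layers=40, n_kv_heads=8, head_dim=128)
    budget = 2 * 1024**3  # 2 GiB left over for the KV cache (assumption)
    for seq_len in (768, 1024, 1536, 2048):
        print(seq_len, max_concurrent_seqs(budget, seq_len, per_token))
```

Whatever the real numbers are, the shape is the same: doubling the sequence length roughly halves the number of sequences that can be in flight.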

3) Batch Size

Batch is king. Once the KV cache had headroom for ~24 concurrent sequences, throughput jumped to ~878 tok/s with ~0.29 s latency. This mirrors public reports where strong results only appear with 16–64+ way batching.
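Here is a minimal sketch of how aggregate tokens/s can be measured for a batch. `measure_throughput` and `fake_generate` are hypothetical names for this example; `fake_generate` is a toy stand-in for a real engine that assumes each decode step costs the same regardless of batch size, which is an idealization of why batching helps.

```python
import time


def measure_throughput(generate_fn, prompts):
    """Run one batch and return (tokens_per_second, elapsed_seconds).

    generate_fn is any callable that takes a list of prompts and returns
    one list of generated token ids per prompt.
    """
    start = time.perf_counter()
    completions = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in completions)
    return total_tokens / elapsed, elapsed


def fake_generate(prompts, new_tokens=256, step_seconds=0.0001):
    """Toy engine: every decode step advances all sequences at a fixed cost."""
    for _ in range(new_tokens):
        time.sleep(step_seconds)
    return [[0] * new_tokens for _ in prompts]


if __name__ == "__main__":
    for batch in (4, 24):
        tps, elapsed = measure_throughput(fake_generate, ["question"] * batch)
        print(f"batch={batch}: {tps:.0f} tok/s in {elapsed:.3f} s")
```

Swapping `fake_generate` for a wrapper around a real engine keeps the measurement code unchanged.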

4) vLLM’s Memory Posture

vLLM aggressively reserves VRAM (paged KV, continuous batching, CUDA graphs). That’s by design. During sweeps I occasionally cleaned up between runs:

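import gc
import torch
gc.collect()              # drop unreachable Python objects first
torch.cuda.empty_cache()  # then release PyTorch's cached, unused GPU blocks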

A reboot after heavy KV scenarios also helped, which is not unusual when VRAM is pushed this close to its limit.


Dashboard & Artifacts

If you tunnel the dashboard, bind to localhost: streamlit run app.py --server.address 127.0.0.1


Cost & Timeline (Reality Check)


What I’ll Test Next


Key Takeaways


Appendix: Example vLLM Flags (bf16, decode-heavy)

# Max new tokens (256 in these runs) is set per request via max_tokens, not as a server flag.
vllm serve Allanatrix/Nexa_Sci_distilled_Falcon-10B \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.93 \
  --max-model-len 1024 \
  --max-num-seqs 24

For long contexts, expect lower throughput; for INT4, lift max_num_seqs and consider --max-num-batched-tokens 6144–8192.
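On the client side, the generation cap travels with each request. The sketch below posts to vLLM's OpenAI-compatible completions endpoint using only the standard library. It assumes the server is running locally on the default port 8000; `build_payload` and `complete` are hypothetical helper names, and the prompt is just an example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # assumes the default vLLM port
MODEL = "Allanatrix/Nexa_Sci_distilled_Falcon-10B"


def build_payload(prompt, max_tokens=256, temperature=0.0):
    """Request body for the OpenAI-compatible completions endpoint."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,  # per-request cap on newly generated tokens
        "temperature": temperature,
    }


def complete(prompt, **kwargs):
    """Send one completion request and return the generated text."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        BASE_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("Outline an SOP for calibrating a pH meter."))
```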

Link to the full dashboard and metrics (full evals coming soon): Streamlit Dashboard