Evaluating NexaSci-Falcon-10B: A 3-Week Scientific Model That Punches Near Frontier
Technical Evaluation of NexaSci-Falcon-10B
A Lightweight Scientific Reasoning Model
This project began as a controlled experiment: take a curated 100k scientific dataset, distill it into a 10B parameter model, add a small agentic post-training layer, and evaluate it directly against frontier systems under identical conditions. The goal was not scale. The goal was to understand what a carefully curated dataset plus a disciplined engineering pipeline can produce on modest compute.
Dataset and Distillation Process
The core dataset contains one hundred thousand scientific Q&A samples generated with GPT-4-mini. The distribution spans physics, chemistry, materials science, foundational ML, and applied computational tasks. After filtering and ranking for quality, almost no noise remained: each sample contains explicit reasoning, structured scientific format, and domain-grounded methodology.
Distillation was performed on a Falcon-10B base using QLoRA with double-quantized adapters. Training converged smoothly, dropping from roughly 0.8 → 0.4 loss before merging. After merging, a second post-training pass was performed using a much smaller dataset focused on tool reasoning, simulation-writing, evidence checking, and context-retrieval behavior. This post-training step does not expand coverage; it refines the model’s decision-making and its ability to select tools appropriately.
Evaluation Protocol
The evaluation suite is the same process I used in production settings:
a standardized prompt set, a rubric-based scoring pass, and a model-agnostic output comparison.
The evaluation pipeline consists of:
- A fixed base parquet of scientific prompts (hypothesis + methodology oriented).
- A
generate.pyscript producing outputs from each model via vLLM (local) or OpenRouter (frontier models). - A
judge.pyscript that scores each output on correctness, grounding, clarity, specificity, and methodological soundness. - A consolidated dashboard showing model-to-model comparisons.
The comparator group included:
- GPT-4, GPT-4o-mini, GPT-5 variant
- Claude 3.5 Sonnet
- Gemini 1.5 Pro
- High-parameter open-source models (≥30B)
Each model received the same prompts, temperature, and max token settings.
Results
Despite being three to seven times smaller, the 10B model consistently scored within ~10% of frontier systems. It never dropped below a score of 4 on the rubric, and its structured outputs (problem framing, scientific rationale, step-wise methodology) were consistently stable. Larger open-source models showed more variability and occasionally degraded specificity under long prompts.
The most notable result is cost-efficiency:
the model produces near-frontier scientific outputs while being deployable on a single GPU.
This is largely attributable to:
- the cleanliness of the dataset
- the distilled alignment signal
- the structured nature of the target domain
- the stability of Falcon-10B under QLoRA merging
In several cases, frontier models were truncated due to token constraints, making the 10B model’s stable generation even more interesting.
Behavior Under Agentic Conditions
During isolated testing with a FastAPI tool server, the post-trained model demonstrated:
- correct sequencing of tool use (retrieve → analyze → simulate → answer)
- the ability to request literature when information was missing
- stable Python simulation generation
- low hallucination under uncertainty (defaulting to retrieval)
- reproducible reasoning traces across runs
The post-training dataset only contained a few hundred samples, but because they were dense with reasoning patterns and tool-choice demonstrations, the learned behavior generalizes surprisingly well.
The model is not an autonomous agent by itself; it is a scientific reasoning core that can operate inside a tool-augmented environment.
Embedding + Literature Mapping
In parallel, the Specter-2 model was used to build a compact ML/CS literature explorer:
- Papers pulled from arXiv via API
- Embeddings computed and cached
- Clustering performed via k-means
- 2D and 3D projection rendered through Gradio
- Sub-100ms render time for 1k papers
This mapping system is intended to be the retrieval substrate under the scientific agent. It can scale by replacing the .npy store with a vector database and incrementally embedding new papers.
Why These Results Matter
This project demonstrates a consistent pattern:
well-curated data plus tight engineering often beats parameter count.
A 10B model should not be able to compete with 30B–70B models in open-ended reasoning tasks. Yet in scientific hypothesis generation and methodology planning — domains where structure dominates — the distilled model holds its own.
For practitioners, this means:
- building domain-specific scientific assistants does not require massive compute
- clean SFT data still yields enormous gains
- agentic capabilities can be taught with small, highly-targeted datasets
- retrieval + simulation tools amplify the capabilities of small models dramatically
The takeaway is that specialized models can be both small and effective when the pipeline is engineered from end to end.
Next Steps
The next deliverable is NexaSci AgentKit:
- the distilled 10B scientific model
- the post-trained agent weights
- the FastAPI tool server
- the Specter-2 retrieval and visualization module
- a reproducible Docker environment
- a local UI for running scientific tasks
Everything will run on a single GPU, with no hosted inference required.
Closing
This project began as a temporary side build while waiting for computing for a much larger molecular modeling initiative. It ultimately produced a strong, reproducible scientific reasoning model with near-frontier behavior at a fraction of the scale. With AgentKit, the model becomes usable for researchers, engineers, and students who need a grounded scientific assistant that doesn’t depend on cloud APIs.