Evaluating NexaSci-Falcon-10B: A 3-Week Scientific Model That Punches Near Frontier

18 Nov, 2025

Technical Evaluation of NexaSci-Falcon-10B

A Lightweight Scientific Reasoning Model

This project began as a controlled experiment: take a curated 100k scientific dataset, distill it into a 10B parameter model, add a small agentic post-training layer, and evaluate it directly against frontier systems under identical conditions. The goal was not scale. The goal was to understand what a carefully curated dataset plus a disciplined engineering pipeline can produce on modest compute.

Dataset and Distillation Process

The core dataset contains one hundred thousand scientific Q&A samples generated with GPT-4-mini. The distribution spans physics, chemistry, materials science, foundational ML, and applied computational tasks. After filtering and ranking for quality, almost no noise remained: each sample contains explicit reasoning, structured scientific format, and domain-grounded methodology.

Distillation was performed on a Falcon-10B base using QLoRA with double-quantized adapters. Training converged smoothly, dropping from roughly 0.8 → 0.4 loss before merging. After merging, a second post-training pass was performed using a much smaller dataset focused on tool reasoning, simulation-writing, evidence checking, and context-retrieval behavior. This post-training step does not expand coverage; it refines the model’s decision-making and its ability to select tools appropriately.

Evaluation Protocol

The evaluation suite is the same process I used in production settings:
a standardized prompt set, a rubric-based scoring pass, and a model-agnostic output comparison.

The evaluation pipeline consists of:

A fixed base parquet of scientific prompts (hypothesis + methodology oriented).
A generate.py script producing outputs from each model via vLLM (local) or OpenRouter (frontier models).
A judge.py script that scores each output on correctness, grounding, clarity, specificity, and methodological soundness.
A consolidated dashboard showing model-to-model comparisons.

The comparator group included:

GPT-4, GPT-4o-mini, GPT-5 variant
Claude 3.5 Sonnet
Gemini 1.5 Pro
High-parameter open-source models (≥30B)

Each model received the same prompts, temperature, and max token settings.

Results

Despite being three to seven times smaller, the 10B model consistently scored within ~10% of frontier systems. It never dropped below a score of 4 on the rubric, and its structured outputs (problem framing, scientific rationale, step-wise methodology) were consistently stable. Larger open-source models showed more variability and occasionally degraded specificity under long prompts.

The most notable result is cost-efficiency:
the model produces near-frontier scientific outputs while being deployable on a single GPU.

This is largely attributable to:

the cleanliness of the dataset
the distilled alignment signal
the structured nature of the target domain
the stability of Falcon-10B under QLoRA merging

In several cases, frontier models were truncated due to token constraints, making the 10B model’s stable generation even more interesting.

Behavior Under Agentic Conditions

During isolated testing with a FastAPI tool server, the post-trained model demonstrated:

correct sequencing of tool use (retrieve → analyze → simulate → answer)
the ability to request literature when information was missing
stable Python simulation generation
low hallucination under uncertainty (defaulting to retrieval)
reproducible reasoning traces across runs

The post-training dataset only contained a few hundred samples, but because they were dense with reasoning patterns and tool-choice demonstrations, the learned behavior generalizes surprisingly well.

The model is not an autonomous agent by itself; it is a scientific reasoning core that can operate inside a tool-augmented environment.

Embedding + Literature Mapping

In parallel, the Specter-2 model was used to build a compact ML/CS literature explorer:

Papers pulled from arXiv via API
Embeddings computed and cached
Clustering performed via k-means
2D and 3D projection rendered through Gradio
Sub-100ms render time for 1k papers

This mapping system is intended to be the retrieval substrate under the scientific agent. It can scale by replacing the .npy store with a vector database and incrementally embedding new papers.

Why These Results Matter

This project demonstrates a consistent pattern:
well-curated data plus tight engineering often beats parameter count.

A 10B model should not be able to compete with 30B–70B models in open-ended reasoning tasks. Yet in scientific hypothesis generation and methodology planning — domains where structure dominates — the distilled model holds its own.

For practitioners, this means:

building domain-specific scientific assistants does not require massive compute
clean SFT data still yields enormous gains
agentic capabilities can be taught with small, highly-targeted datasets
retrieval + simulation tools amplify the capabilities of small models dramatically

The takeaway is that specialized models can be both small and effective when the pipeline is engineered from end to end.

Next Steps

The next deliverable is NexaSci AgentKit:

the distilled 10B scientific model
the post-trained agent weights
the FastAPI tool server
the Specter-2 retrieval and visualization module
a reproducible Docker environment
a local UI for running scientific tasks

Everything will run on a single GPU, with no hosted inference required.

Closing

This project began as a temporary side build while waiting for computing for a much larger molecular modeling initiative. It ultimately produced a strong, reproducible scientific reasoning model with near-frontier behavior at a fraction of the scale. With AgentKit, the model becomes usable for researchers, engineers, and students who need a grounded scientific assistant that doesn’t depend on cloud APIs.