About me

Evaluating NexaSci-Falcon-10B: A 3-Week Scientific Model That Punches Near Frontier

Technical Evaluation of NexaSci-Falcon-10B

A Lightweight Scientific Reasoning Model

This project began as a controlled experiment: take a curated 100k scientific dataset, distill it into a 10B parameter model, add a small agentic post-training layer, and evaluate it directly against frontier systems under identical conditions. The goal was not scale. The goal was to understand what a carefully curated dataset plus a disciplined engineering pipeline can produce on modest compute.

Dataset and Distillation Process

The core dataset contains one hundred thousand scientific Q&A samples generated with GPT-4-mini. The distribution spans physics, chemistry, materials science, foundational ML, and applied computational tasks. After filtering and ranking for quality, almost no noise remained: each sample contains explicit reasoning, structured scientific format, and domain-grounded methodology.

Distillation was performed on a Falcon-10B base using QLoRA with double-quantized adapters. Training converged smoothly, dropping from roughly 0.8 → 0.4 loss before merging. After merging, a second post-training pass was performed using a much smaller dataset focused on tool reasoning, simulation-writing, evidence checking, and context-retrieval behavior. This post-training step does not expand coverage; it refines the model’s decision-making and its ability to select tools appropriately.

Evaluation Protocol

The evaluation suite is the same process I used in production settings:
a standardized prompt set, a rubric-based scoring pass, and a model-agnostic output comparison.

The evaluation pipeline consists of:

  1. A fixed base parquet of scientific prompts (hypothesis + methodology oriented).
  2. A generate.py script producing outputs from each model via vLLM (local) or OpenRouter (frontier models).
  3. A judge.py script that scores each output on correctness, grounding, clarity, specificity, and methodological soundness.
  4. A consolidated dashboard showing model-to-model comparisons.

The comparator group included:

Each model received the same prompts, temperature, and max token settings.

Results

Despite being three to seven times smaller, the 10B model consistently scored within ~10% of frontier systems. It never dropped below a score of 4 on the rubric, and its structured outputs (problem framing, scientific rationale, step-wise methodology) were consistently stable. Larger open-source models showed more variability and occasionally degraded specificity under long prompts.

The most notable result is cost-efficiency:
the model produces near-frontier scientific outputs while being deployable on a single GPU.

This is largely attributable to:

In several cases, frontier models were truncated due to token constraints, making the 10B model’s stable generation even more interesting.

Behavior Under Agentic Conditions

During isolated testing with a FastAPI tool server, the post-trained model demonstrated:

The post-training dataset only contained a few hundred samples, but because they were dense with reasoning patterns and tool-choice demonstrations, the learned behavior generalizes surprisingly well.

The model is not an autonomous agent by itself; it is a scientific reasoning core that can operate inside a tool-augmented environment.

Embedding + Literature Mapping

In parallel, the Specter-2 model was used to build a compact ML/CS literature explorer:

This mapping system is intended to be the retrieval substrate under the scientific agent. It can scale by replacing the .npy store with a vector database and incrementally embedding new papers.

Why These Results Matter

This project demonstrates a consistent pattern:
well-curated data plus tight engineering often beats parameter count.

A 10B model should not be able to compete with 30B–70B models in open-ended reasoning tasks. Yet in scientific hypothesis generation and methodology planning — domains where structure dominates — the distilled model holds its own.

For practitioners, this means:

The takeaway is that specialized models can be both small and effective when the pipeline is engineered from end to end.

Next Steps

The next deliverable is NexaSci AgentKit:

Everything will run on a single GPU, with no hosted inference required.

Closing

This project began as a temporary side build while waiting for computing for a much larger molecular modeling initiative. It ultimately produced a strong, reproducible scientific reasoning model with near-frontier behavior at a fraction of the scale. With AgentKit, the model becomes usable for researchers, engineers, and students who need a grounded scientific assistant that doesn’t depend on cloud APIs.