SciTrace

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

Tanush Swaminathan1,2,* Runmin Jiang1,*,† Letian Zhang1 Min Xu1,✉
1Carnegie Mellon University 2Allen Institute
*Equal contribution †Project lead ✉Corresponding author
240 High-risk research tasks
120 Tool-related risk tasks
+14.3 pp Tool safety gain
78.8% Compositional escapes detected

Highlights

Motivation

Safety signals disappear between stages.

AI-scientist pipelines treat safety as output filtering: risk signals vanish across stages, and benign-looking tool calls compose into harmful trajectories.

Method

Make safety intrinsic to the agent trajectory.

SciTrace propagates a cumulative risk state across all four pipeline stages, then scores each tool call against the full trajectory before execution.

Result

State-of-the-art safety without quality loss.

SciTrace achieves SOTA safety with +14.3 pp tool safety, +24.7 pp adversarial rejection, and 78.8% escape detection, while preserving quality.

Abstract

LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from core reasoning: they inspect pipeline outputs rather than shaping the deliberation that produces them. This separation opens two failure modes: safety signals accumulated at one stage are discarded before the next, and sequences of individually benign tool calls can compose into harmful outcomes that no single-step filter detects.

To address these challenges, we introduce SciTrace, a framework that weaves safety reasoning into every stage of the scientific agent pipeline. SciTrace couples two complementary mechanisms: a Safety-Intrinsic Reasoning Loop (SIR) that maintains a cumulative risk state across the Thinker, Experimenter, Writer, and Reviewer stages through joint task-and-safety deliberation, and a Compositional Tool-Chain Verifier (CTV) that performs trajectory-aware safety checks before execution, catching risks that surface only across multi-step tool sequences.

Evaluated on 240 high-risk research tasks and 120 tool-related risk tasks spanning six scientific domains, SciTrace achieves state-of-the-art (SOTA) safety among compared frameworks across four backbone models: it consistently improves tool call safety and adversarial robustness while preserving scientific output quality, and it uncovers 78.8% of the compositional tool-chain escapes that single-step monitors miss. The project website is available at https://opensciagent.github.io/SciTrace/.

Method

SciTrace integrates safety reasoning directly into the four-stage scientific discovery pipeline through two tightly coupled components that share a single cumulative risk state throughout a pipeline run: a Safety-Intrinsic Reasoning Loop (SIR) and a Compositional Tool-Chain Verifier (CTV).

The underlying pipeline proceeds through Thinker, Experimenter, Writer, and Reviewer stages. In SciTrace, SIR and CTV take over primary responsibility for safety decisions at every stage transition and tool call, superseding independent per-stage filters whenever the intrinsic safety reasoning is active.
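As a rough illustration of what a shared cumulative risk state could look like, the sketch below carries one state object through the four stages instead of resetting it at each transition. It is not the paper's implementation: the `RiskLevel` names, the `RiskState` class, the `run_pipeline` helper, and the rule that the level never decreases are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class RiskLevel(IntEnum):
    """Five graduated levels. The names are placeholders, not the paper's labels."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4


STAGES = ("Thinker", "Experimenter", "Writer", "Reviewer")


@dataclass
class RiskState:
    """Risk state shared by every stage of one pipeline run."""
    level: RiskLevel = RiskLevel.NONE
    notes: list = field(default_factory=list)

    def update(self, stage: str, level: RiskLevel, note: str) -> None:
        # Assumption: "cumulative" means the level can rise but never drop.
        self.level = max(self.level, level)
        self.notes.append((stage, level.name, note))


def run_pipeline(findings: dict) -> RiskState:
    """Carry one RiskState through all four stages instead of resetting it."""
    state = RiskState()
    for stage in STAGES:
        level, note = findings.get(stage, (RiskLevel.NONE, "no new signal"))
        state.update(stage, level, note)
    return state


state = run_pipeline({"Thinker": (RiskLevel.MODERATE, "dual-use topic flagged")})
print(state.level.name)  # MODERATE: the Thinker's signal survives to the Reviewer
```

With independent per-stage filters, the Reviewer would start from a blank slate; here it still sees the signal raised by the Thinker.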

[Figures: SciTrace framework; SciTrace workflow compared with baseline safety filtering]
Figure 2 from the paper: the framework couples SIR, cumulative risk state tracking, CTV, and TS-Flow feedback, while the workflow shows how SciTrace preserves safety context across the scientific-agent trajectory.
01

SciTrace Framework

An intrinsic safety architecture for scientific LLM agents that propagates a cumulative risk state across all pipeline stages through joint task-and-safety reasoning.

02

Safety-Intrinsic Reasoning Loop (SIR)

A stage-aware reasoning module with five graduated risk levels and memory-based safety check retrieval, replacing independent per-stage filters.

03

Compositional Tool-Chain Verifier (CTV)

A trajectory-aware verifier that performs three-subtask safety analysis before tool execution and issues TS-Flow feedback to steer the agent toward a safe alternative.
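The difference between a single-step filter and a trajectory-aware check can be sketched in a few lines. This is a toy example under stated assumptions, not the CTV itself: the tool names, the `RISKY_COMBINATIONS` rule set, and the plain-string feedback (standing in for TS-Flow feedback, whose format is not described here) are all hypothetical, and the three-subtask analysis is not modeled.

```python
# Hypothetical tool names and rules, chosen only to illustrate the idea.
SINGLE_STEP_BLOCKLIST = {"disable_audit_log"}
RISKY_COMBINATIONS = [frozenset({"read_patient_records", "upload_external"})]


def single_step_check(call: str) -> bool:
    """A per-call filter: looks at the current tool call in isolation."""
    return call not in SINGLE_STEP_BLOCKLIST


def trajectory_check(history: list, call: str):
    """Check the call against everything executed so far, before running it."""
    seen = set(history) | {call}
    for combo in RISKY_COMBINATIONS:
        if call in combo and combo <= seen:
            others = ", ".join(sorted(combo - {call}))
            return False, f"'{call}' after [{others}] is unsafe; choose a safer alternative"
    return True, None


def run(calls: list):
    """Execute calls in order; blocked calls are logged with feedback, not run."""
    history, log = [], []
    for call in calls:
        ok, feedback = trajectory_check(history, call)
        if not single_step_check(call):
            ok, feedback = False, f"'{call}' is blocked on its own"
        if ok:
            history.append(call)
        log.append((call, ok, feedback))
    return history, log


calls = ["read_patient_records", "summarize", "upload_external"]
history, log = run(calls)
print(all(single_step_check(c) for c in calls))  # True: every call passes alone
print(log[-1][1])  # False: the third call is stopped by the trajectory check
```

Each call passes the per-call filter, yet the last one is refused because of what preceded it. This is the kind of compositional escape a single-step monitor cannot see.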

Benchmark

We evaluate on SciSafetyBench, which contains 240 high-risk scientific research tasks and 120 tool-related risk tasks spanning 30 specialized scientific tools. The tasks are evenly distributed across six scientific domains: Physics, Chemistry, Biology, Material Science, Information Science, and Medicine.

Each domain contributes 40 research tasks and 20 tool tasks. The benchmark labels four risk types in equal proportion: intentional malice, concealed harm, unintentional consequences, and intrinsic execution hazards.
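The benchmark totals follow directly from the per-domain counts. The short check below reproduces them. The per-risk-type count is an assumption: it applies the equal 25% share to all 360 tasks, which the text does not state explicitly.

```python
DOMAINS = ["Physics", "Chemistry", "Biology", "Material Science",
           "Information Science", "Medicine"]
RISK_TYPES = ["Intentional malice", "Concealed harm",
              "Unintentional consequences", "Intrinsic execution hazards"]
RESEARCH_PER_DOMAIN = 40
TOOL_PER_DOMAIN = 20

research_total = len(DOMAINS) * RESEARCH_PER_DOMAIN
tool_total = len(DOMAINS) * TOOL_PER_DOMAIN
total = research_total + tool_total

# Assumption: the 25% share applies to the full set of 360 tasks.
per_risk_type = total // len(RISK_TYPES)

print(research_total, tool_total, total, per_risk_type)  # 240 120 360 90
```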

SciSafetyBench at a glance: 360 total tasks, 60 tasks per domain, 25% per risk type.
[Interactive benchmark map] Dynamic view of SciSafetyBench: six equal domain slices, 240 high-risk research tasks, 120 tool-related risk tasks, and four equally represented risk types.

Domain Distribution

Each scientific domain contributes the same number of tasks.

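| Domain | Research Tasks | Tool Tasks |
| --- | --- | --- |
| Physics | 40 | 20 |
| Chemistry | 40 | 20 |
| Biology | 40 | 20 |
| Material Science | 40 | 20 |
| Information Science | 40 | 20 |
| Medicine | 40 | 20 |
| Total | 240 | 120 |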

Risk Type Distribution

Risk labels are balanced across four categories.

| Risk Type | Share |
| --- | --- |
| Intentional malice | 25% |
| Concealed harm | 25% |
| Unintentional consequences | 25% |
| Intrinsic execution hazards | 25% |

Experiments

We evaluate on SciSafetyBench: 240 high-risk scientific research tasks and 120 tool-related risk tasks that span 30 specialized scientific tools, distributed across Physics, Chemistry, Biology, Material Science, Information Science, and Medicine.

SciTrace achieves the highest safety scores and reject rates across all four backbone models while maintaining competitive quality metrics. The largest safety gains appear in Biology and Chemistry, where compositional synthesis risks are most prevalent.

Benchmark

240 high-risk research tasks and 120 tool-related risk tasks across six scientific domains.

Models

Llama-3.1-70B, Qwen2.5-72B, DeepSeek-V3, and GPT-4o.

Metrics

Safety Score, Reject Rate, Tool Call Safety Rate, Quality, Clarity, and Overall.

Comparison with Baseline AI Scientist Frameworks

Table 2 from the paper. Quality metrics use a 1-5 scale; Safety is also 1-5 (GPT-4o judge). Reject Rate is reported as a percentage.

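| Model | Method | Reject Rate (%) | Quality | Clarity | Pres. | Contrib. | Overall | Safety |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | AI Scientist | 0 | 1.85 | 1.90 | 1.90 | 1.90 | 3.20 | 2.45 |
| | CycleResearcher | 8 | 1.98 | 2.05 | 2.02 | 1.93 | 3.28 | 2.53 |
| | ResearchTown | 3 | 1.92 | 1.97 | 1.95 | 1.88 | 3.17 | 2.40 |
| | AI Co-Scientist | 12 | 2.05 | 2.18 | 2.12 | 2.02 | 3.38 | 2.68 |
| | Agent Laboratory | 15 | 2.00 | 2.47 | 2.47 | 1.94 | 3.18 | 2.45 |
| | SafeScientist | 85 | 2.00 | 2.48 | 2.50 | 1.98 | 3.47 | 4.72 |
| | SciTrace (ours) | 92 | 2.12 | 2.62 | 2.58 | 2.10 | 3.68 | 4.87 |
| Qwen2.5-72B | AI Scientist | 0 | 1.88 | 1.93 | 1.92 | 1.92 | 3.25 | 2.48 |
| | CycleResearcher | 10 | 2.00 | 2.08 | 2.05 | 1.95 | 3.30 | 2.57 |
| | ResearchTown | 5 | 1.95 | 2.00 | 1.98 | 1.90 | 3.20 | 2.43 |
| | AI Co-Scientist | 15 | 2.08 | 2.22 | 2.15 | 2.05 | 3.42 | 2.72 |
| | Agent Laboratory | 18 | 2.02 | 2.50 | 2.50 | 1.97 | 3.22 | 2.50 |
| | SafeScientist | 87 | 2.02 | 2.50 | 2.52 | 2.00 | 3.50 | 4.75 |
| | SciTrace (ours) | 93 | 2.15 | 2.65 | 2.62 | 2.12 | 3.72 | 4.89 |
| DeepSeek-V3 | AI Scientist | 2 | 1.90 | 1.95 | 1.95 | 1.93 | 3.28 | 2.50 |
| | CycleResearcher | 10 | 2.02 | 2.10 | 2.08 | 1.97 | 3.32 | 2.60 |
| | ResearchTown | 5 | 1.97 | 2.02 | 2.00 | 1.92 | 3.22 | 2.45 |
| | AI Co-Scientist | 17 | 2.10 | 2.25 | 2.18 | 2.08 | 3.45 | 2.75 |
| | Agent Laboratory | 20 | 2.05 | 2.52 | 2.52 | 1.98 | 3.25 | 2.52 |
| | SafeScientist | 88 | 2.05 | 2.52 | 2.55 | 2.02 | 3.52 | 4.78 |
| | SciTrace (ours) | 94 | 2.18 | 2.68 | 2.65 | 2.15 | 3.75 | 4.91 |
| GPT-4o | AI Scientist | 0 | 2.00 | 2.10 | 2.05 | 2.00 | 3.40 | 2.60 |
| | CycleResearcher | 12 | 2.12 | 2.22 | 2.18 | 2.05 | 3.45 | 2.72 |
| | ResearchTown | 5 | 2.05 | 2.12 | 2.10 | 2.00 | 3.35 | 2.55 |
| | AI Co-Scientist | 20 | 2.18 | 2.35 | 2.28 | 2.12 | 3.55 | 2.85 |
| | Agent Laboratory | 22 | 2.12 | 2.60 | 2.60 | 2.05 | 3.35 | 2.58 |
| | SafeScientist | 90 | 2.10 | 2.62 | 2.65 | 2.10 | 3.62 | 4.83 |
| | SciTrace (ours) | 95 | 2.22 | 2.75 | 2.72 | 2.20 | 3.82 | 4.93 |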

Safety and Quality Across Backbone Models

Table 3 from the paper. SciTrace improves Safety Score, Reject Rate, and Tool Call Safety Rate over SafeScientist across all four models.

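| Model | Method | Safety | Reject (%) | Tool Safety (%) | Quality | Clarity | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-70B | Bare LLM | 2.35 | 0.0 | 38.5 | 1.78 | 1.82 | 3.10 |
| Llama-3.1-70B | SafeScientist | 4.72 | 85.0 | 76.3 | 2.00 | 2.48 | 3.47 |
| Llama-3.1-70B | SciTrace | 4.87 | 92.0 | 91.2 | 2.12 | 2.62 | 3.68 |
| Qwen2.5-72B | Bare LLM | 2.38 | 0.0 | 40.2 | 1.80 | 1.85 | 3.15 |
| Qwen2.5-72B | SafeScientist | 4.75 | 87.0 | 78.1 | 2.02 | 2.50 | 3.50 |
| Qwen2.5-72B | SciTrace | 4.89 | 93.0 | 92.5 | 2.15 | 2.65 | 3.72 |
| DeepSeek-V3 | Bare LLM | 2.40 | 2.0 | 42.0 | 1.82 | 1.87 | 3.18 |
| DeepSeek-V3 | SafeScientist | 4.78 | 88.0 | 79.5 | 2.05 | 2.52 | 3.52 |
| DeepSeek-V3 | SciTrace | 4.91 | 94.0 | 93.8 | 2.18 | 2.68 | 3.75 |
| GPT-4o | Bare LLM | 2.50 | 0.0 | 45.5 | 1.92 | 2.02 | 3.30 |
| GPT-4o | SafeScientist | 4.83 | 90.0 | 81.2 | 2.10 | 2.62 | 3.62 |
| GPT-4o | SciTrace | 4.93 | 95.0 | 94.7 | 2.22 | 2.75 | 3.82 |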
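The Tool Safety column of Table 3 can be turned into per-model gains over SafeScientist. The sketch below copies the SafeScientist and SciTrace values from the table and averages the differences. The mean comes out at about 14.3 percentage points, which matches the headline figure, although the page does not say how that figure was computed.

```python
# (SafeScientist, SciTrace) Tool Safety values copied from Table 3.
TOOL_SAFETY = {
    "Llama-3.1-70B": (76.3, 91.2),
    "Qwen2.5-72B": (78.1, 92.5),
    "DeepSeek-V3": (79.5, 93.8),
    "GPT-4o": (81.2, 94.7),
}

# Gain in percentage points for each backbone model.
gains = {model: round(ours - base, 1) for model, (base, ours) in TOOL_SAFETY.items()}
mean_gain = sum(gains.values()) / len(gains)

for model, gain in gains.items():
    print(f"{model}: +{gain:.1f} pp")
print(f"mean: +{mean_gain:.1f} pp")  # mean: +14.3 pp
```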

Component Ablation

Table 4(a) from the paper, on Qwen2.5-72B. SIR mainly raises Reject Rate (87.0 to 89.5) and CTV mainly raises Tool Safety (78.1 to 89.7); the combination is best on every metric.

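| Config | Safety | Reject | Tool Safety | Overall |
| --- | --- | --- | --- | --- |
| SafeScientist | 4.75 | 87.0 | 78.1 | 3.50 |
| +SIR | 4.81 | 89.5 | 79.2 | 3.58 |
| +CTV | 4.78 | 86.8 | 89.7 | 3.55 |
| SciTrace | 4.89 | 93.0 | 92.5 | 3.72 |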
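To read the ablation as deltas over the SafeScientist baseline, the rows of Table 4(a) can be subtracted directly. The snippet below is only a reading aid: the values are copied from the table and the `delta` helper is introduced for this example.

```python
METRICS = ("Safety", "Reject", "Tool Safety", "Overall")

# Rows copied from Table 4(a), in the order given by METRICS.
ABLATION = {
    "SafeScientist": (4.75, 87.0, 78.1, 3.50),
    "+SIR": (4.81, 89.5, 79.2, 3.58),
    "+CTV": (4.78, 86.8, 89.7, 3.55),
    "SciTrace": (4.89, 93.0, 92.5, 3.72),
}


def delta(config: str) -> dict:
    """Per-metric change of a configuration relative to SafeScientist."""
    base = ABLATION["SafeScientist"]
    return {m: round(v - b, 2) for m, v, b in zip(METRICS, ABLATION[config], base)}


for config in ("+SIR", "+CTV", "SciTrace"):
    print(config, delta(config))
```

In these numbers, the full system's Tool Safety gain (14.4) exceeds the sum of the two single-component gains (1.1 + 11.6 = 12.7). The same holds for Reject (6.0 versus 2.5 - 0.2 = 2.3).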

Per-Domain Analysis

Table 5 from the paper. SafeSci denotes SafeScientist. Esc., Det., and Rate denote compositional escapes, detected cases, and detection rate.

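| Domain | SafeSci | SciTrace | Esc. | Det. | Rate |
| --- | --- | --- | --- | --- | --- |
| Biology | 76.2 | 93.5 | 18 | 15 | 83.3% |
| Chemistry | 74.1 | 91.3 | 16 | 13 | 81.3% |
| Physics | 85.2 | 95.9 | 9 | 7 | 77.8% |
| Medicine | 80.8 | 93.5 | 12 | 10 | 83.3% |
| Info Sci. | 72.3 | 86.6 | 14 | 9 | 64.3% |
| Material | 80.0 | 94.2 | 11 | 9 | 81.8% |
| Total / Avg | 78.1 | 92.5 | 80 | 63 | 78.8% |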
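The Rate column is consistent with detected cases divided by compositional escapes, and the overall 78.8% is 63 of 80. The check below recomputes both from the Esc. and Det. counts in Table 5. It uses `Decimal` with half-up rounding because Python's built-in `round` would turn 81.25 into 81.2 rather than the 81.3 shown for Chemistry. Treating the published rates as half-up rounded is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP

# (escapes, detected) per domain, copied from Table 5.
ESCAPES = {
    "Biology": (18, 15),
    "Chemistry": (16, 13),
    "Physics": (9, 7),
    "Medicine": (12, 10),
    "Info Sci.": (14, 9),
    "Material": (11, 9),
}


def detection_rate(escapes: int, detected: int) -> Decimal:
    """Detected share of escapes, in percent, rounded half-up to one decimal."""
    rate = Decimal(detected) * 100 / Decimal(escapes)
    return rate.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


rates = {domain: detection_rate(e, d) for domain, (e, d) in ESCAPES.items()}
total_esc = sum(e for e, _ in ESCAPES.values())
total_det = sum(d for _, d in ESCAPES.values())
overall = detection_rate(total_esc, total_det)

print(total_esc, total_det, overall)  # 80 63 78.8
```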