The Research Behind Mouse
A controlled study of tool architecture effects on AI coding agent performance. 67 paired trials. 3 preregistered studies. Results that speak for themselves.
Download Full Paper (PDF) · Working Draft — January 2026 · Simon W. Reiff
Paper Overview
Mouse: Precision File-Editing Tools for AI Coding Agents
A Controlled Study of Tool Architecture Effects on Agent Performance
This paper investigates whether purpose-built editing tools can measurably improve AI coding agent performance compared to baseline tools (GitHub Copilot default editing). Through three preregistered confirmatory studies with 67 paired trials, we demonstrate that Mouse tools produce statistically significant improvements in efficiency, precision, and capability—with effect sizes 2-3× the "large" threshold by Cohen's benchmarks.
67 Paired Trials
Same task, same model, different tools
3 Preregistered Studies
Hypotheses locked before data collection
p < 10⁻⁶
Less than a 1-in-a-million chance for the headline result
Key Findings
As task difficulty increases, Mouse's advantage shifts from efficiency to precision to capability.
Efficiency Gains
Faster completion and lower cost per task
N=23 · p < 10⁻⁶ · Mouse faster in all 23 runs
Precision Gains
Perfect-first-try rate: Mouse vs. Baseline
N=25 · p = 1.22 × 10⁻⁴ · +56 percentage-point risk difference favoring Mouse
Capability Unlock
Task completion rate: Mouse vs. Baseline
N=19 · p = 7.63 × 10⁻⁶ · Baseline never succeeded
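The headline efficiency p-value is consistent with a simple distribution-free check: an exact two-sided sign test on 23 Mouse wins out of 23 paired runs. The snippet below illustrates that arithmetic only; it is not the paper's analysis code, and the study's actual test may differ.

```python
from math import comb

def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided sign test: probability of a win/loss split at least
    this lopsided if either tool were equally likely to win each paired trial."""
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Efficiency study: Mouse was faster in all 23 of 23 paired runs.
print(sign_test_p_value(23, 23))  # ~2.4e-07, comfortably below 10^-6
```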
Methodology Highlights
🔬 Controlled Design
- ✓ Paired comparisons: same task, same AI model
- ✓ Only variable: tool architecture (Mouse vs. Baseline)
- ✓ Randomized task order to prevent learning effects (one possible scheme is sketched below)
- ✓ Blinded evaluation of outputs
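Randomized ordering can be implemented in several ways; the sketch below shows one common approach, shuffling the task sequence independently for each tool condition. The task identifiers and the exact procedure are illustrative assumptions, not taken from the paper.

```python
import random

def randomized_task_orders(task_ids, seed=2026):
    """Shuffle the task sequence independently for each tool condition so that
    neither arm benefits from a fixed, learnable ordering of tasks."""
    rng = random.Random(seed)
    orders = {}
    for condition in ("mouse", "baseline"):
        sequence = list(task_ids)
        rng.shuffle(sequence)
        orders[condition] = sequence
    return orders

# Hypothetical task identifiers, for illustration only.
print(randomized_task_orders(["easy-01", "medium-04", "hard-02"]))
```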
📋 Preregistration
- ✓ Hypotheses specified before data collection
- ✓ Sample sizes determined a priori
- ✓ Analysis plan locked in advance
- ✓ No p-hacking or HARKing
📊 Statistical Rigor
- ✓ Non-parametric tests (no distributional assumptions)
- ✓ Effect sizes with confidence intervals (see the sketch after this list)
- ✓ Multiple comparison corrections
- ✓ Distribution-free lower bounds
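For the effect-size intervals, one distribution-free option is a percentile bootstrap on the median per-task difference. The sketch below is a rough illustration under that assumption; the per-task differences are invented and the paper's actual estimators may differ.

```python
import random
import statistics

def bootstrap_median_ci(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median paired
    difference, with no assumption about the shape of the distribution."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_boot)
    )
    lo = medians[int(n_boot * alpha / 2)]
    hi = medians[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Invented per-task time savings in seconds (baseline minus Mouse),
# purely to show the shape of the computation.
example_diffs = [31, 44, 27, 58, 36, 49, 33, 41, 52, 29]
print(bootstrap_median_ci(example_diffs))
```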
🎯 Task Selection
- ✓ Real-world editing scenarios
- ✓ Varying difficulty levels (Easy/Medium/Hard)
- ✓ Objective success criteria
- ✓ Reproducible task specifications
Why This Matters
Tool Architecture as a Performance Lever
The conventional wisdom is that AI agent performance is determined by the underlying language model. Our research demonstrates that tool architecture is an independent performance lever—you can dramatically improve agent outcomes without changing the model, simply by giving agents better tools.
The Verbosity Tax
Baseline tools force agents to echo file content back in their tool calls—a "verbosity tax" that wastes tokens and introduces transcription errors. Mouse eliminates this tax through coordinate-based addressing, reducing output tokens by 74% (172 vs 708 tokens per call).
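To make the mechanism concrete, the hypothetical payloads below contrast a content-echo edit with a coordinate-based edit. The field names and file contents are illustrative assumptions, not Mouse's or Copilot's actual tool schemas, which are described in the paper.

```python
# Baseline-style edit: the agent re-emits the old block and the new block,
# paying output tokens for content it did not change.
baseline_edit = {
    "tool": "edit_file",
    "path": "src/config.py",
    "old_text": "TIMEOUT = 30\nRETRIES = 3\nLOG_LEVEL = 'info'\n",
    "new_text": "TIMEOUT = 60\nRETRIES = 3\nLOG_LEVEL = 'info'\n",
}

# Coordinate-style edit: the agent names the location and supplies only the
# replacement line, so unchanged content never round-trips through the model.
coordinate_edit = {
    "tool": "replace_lines",
    "path": "src/config.py",
    "start_line": 1,
    "end_line": 1,
    "replacement": "TIMEOUT = 60",
}

# Rough proxy for the verbosity tax: characters the agent must emit per call.
print(len(str(baseline_edit)), "vs", len(str(coordinate_edit)))
```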
Predictable Execution
Beyond average performance, Mouse produces remarkably consistent results. While baseline tools show high variance (SD = 26s), Mouse operations are predictable (SD = 2.6s). This consistency matters for production workflows where reliability is as important as speed.
Read the Full Paper
Get all the details: methodology, statistical analysis, additional findings, and discussion of implications.
Working Draft — January 2026 · Comments welcome at research@hic-ai.com