Peer-Review Ready

The Research Behind Mouse

A controlled study of tool architecture effects on AI coding agent performance. 67 paired trials. 3 preregistered studies. Results that speak for themselves.

Download Full Paper (PDF)

Working Draft — January 2026 · Simon W. Reiff

Paper Overview

Mouse: Precision File-Editing Tools for AI Coding Agents

A Controlled Study of Tool Architecture Effects on Agent Performance

This paper investigates whether purpose-built editing tools can measurably improve AI coding agent performance compared to baseline tools (GitHub Copilot default editing). Through three preregistered confirmatory studies with 67 paired trials, we demonstrate that Mouse tools produce statistically significant improvements in efficiency, precision, and capability—with effect sizes 2-3× the "large" threshold by Cohen's benchmarks.

67

Paired Trials

Same task, same model, different tools

3

Preregistered Studies

Hypotheses locked before data collection

<10⁻⁶

p-value

Less than a 1-in-a-million chance

Key Findings

As task difficulty increases, Mouse's advantage shifts from efficiency to precision to capability.

Easy Tasks · BX-504D

Efficiency Gains

3.6×

Faster completion

37%

Cheaper per task

N=23 · p < 10⁻⁶ · Mouse faster in all 23 runs

Medium Tasks · BX-504B

Precision Gains

56%

Perfect First Try (Mouse)

0%

Perfect First Try (Baseline)

N=25 · p = 1.22 × 10⁻⁴ · +56pp risk difference

Hard Tasks · BX-701R

Capability Unlock

89.5%

Task Completion (Mouse)

0%

Task Completion (Baseline)

N=19 · p = 7.63 × 10⁻⁶ · Baseline never succeeded
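
Where do p-values this small come from? The minimal sketch below assumes the reported figures are exact binomial (sign-test) tail probabilities over the informative (discordant) pairs, one-sided for the easy and hard studies and two-sided for the medium study; the paper specifies the actual procedures, so treat the pair counts and sidedness here as illustrative assumptions.

```python
from math import comb

def exact_sign_test(wins: int, n: int, two_sided: bool = False) -> float:
    """Exact binomial (sign-test) p-value for `wins` successes out of
    n informative paired trials under H0: each side wins with p = 0.5."""
    tail = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return min(1.0, 2 * tail) if two_sided else tail

# Easy tasks: Mouse faster in all 23 paired runs.
print(exact_sign_test(23, 23))                   # ~1.19e-07, i.e. p < 10^-6

# Hard tasks: 17 of 19 tasks completed only with Mouse, none only with baseline.
print(exact_sign_test(17, 17))                   # ~7.63e-06

# Medium tasks: 14 of 25 tasks perfect-first-try only with Mouse (two-sided).
print(exact_sign_test(14, 14, two_sided=True))   # ~1.22e-04
```

Under those assumptions the three reported p-values are reproduced exactly; with 23 wins out of 23 pairs, even the most conservative distribution-free reading stays below one in a million.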

Methodology Highlights

🔬 Controlled Design

  • Paired comparisons: same task, same AI model
  • Only variable: tool architecture (Mouse vs Baseline)
  • Randomized task order to prevent learning effects (sketched below)
  • Blinded evaluation of outputs
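
To make the design concrete, here is a minimal sketch of how a paired, order-randomized trial schedule could be generated. The task IDs and the build_trial_schedule helper are hypothetical illustrations, not the study's actual harness.

```python
import random

# Hypothetical task IDs; the real task specifications are given in the paper.
TASKS = [f"task-{i:02d}" for i in range(1, 24)]
TOOLS = ("mouse", "baseline")

def build_trial_schedule(tasks, seed=0):
    """Each task is run once per tool (a matched pair); task order and
    within-pair tool order are shuffled to prevent learning effects."""
    rng = random.Random(seed)
    schedule = []
    for task in rng.sample(tasks, k=len(tasks)):     # randomized task order
        first, second = rng.sample(TOOLS, k=2)       # randomized tool order
        schedule.append((task, first))
        schedule.append((task, second))
    return schedule

for task, tool in build_trial_schedule(TASKS)[:4]:
    print(task, tool)
```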

📋 Preregistration

  • Hypotheses specified before data collection
  • Sample sizes determined a priori
  • Analysis plan locked in advance
  • No p-hacking or HARKing (hypothesizing after results are known)

📊 Statistical Rigor

  • Non-parametric tests (no distributional assumptions)
  • Effect sizes with confidence intervals (sketched below)
  • Multiple comparison corrections
  • Distribution-free lower bounds
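
As an illustration of the distribution-free flavor of the analysis, the sketch below computes a percentile-bootstrap confidence interval for a paired effect size. The timing numbers are invented for the example, and the bootstrap is one reasonable choice rather than necessarily the interval construction used in the paper.

```python
import random
import statistics

# Invented paired wall-clock times in seconds, for illustration only
# (baseline_seconds, mouse_seconds); these are NOT the study's data.
PAIRS = [(95, 27), (88, 24), (132, 31), (71, 22), (104, 29),
         (83, 25), (119, 33), (99, 26), (77, 23), (110, 30)]

def median_speedup(pairs):
    """Median per-task speedup, computed pairwise so each task is its own control."""
    return statistics.median(b / m for b, m in pairs)

def bootstrap_ci(pairs, stat, n_boot=10_000, alpha=0.05, seed=1):
    """Percentile bootstrap CI that resamples whole pairs, preserving the
    same-task, same-model pairing; no distributional assumptions needed."""
    rng = random.Random(seed)
    boots = sorted(stat([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot))
    return boots[int(alpha / 2 * n_boot)], boots[int((1 - alpha / 2) * n_boot) - 1]

print("median speedup:", round(median_speedup(PAIRS), 2))
print("95% bootstrap CI:", bootstrap_ci(PAIRS, median_speedup))
```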

🎯 Task Selection

  • Real-world editing scenarios
  • Varying difficulty levels (Easy/Medium/Hard)
  • Objective success criteria
  • Reproducible task specifications

Why This Matters

Tool Architecture as a Performance Lever

The conventional wisdom is that AI agent performance is determined by the underlying language model. Our research demonstrates that tool architecture is an independent performance lever—you can dramatically improve agent outcomes without changing the model, simply by giving agents better tools.

The Verbosity Tax

Baseline tools force agents to echo file content back in their tool calls—a "verbosity tax" that wastes tokens and introduces transcription errors. Mouse eliminates this tax through coordinate-based addressing, reducing output tokens by 74% (172 vs 708 tokens per call).
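
To make the contrast concrete, the sketch below shows what the two styles of edit call might look like. The field names and payloads are hypothetical illustrations; Mouse's actual tool schema is documented in the paper.

```python
# Hypothetical payloads for illustration; Mouse's real schema is in the paper.

# Baseline-style edit: the old text must be echoed back verbatim so the tool
# can locate it, so every edited line is paid for twice in output tokens and
# any transcription slip can make the match (and the edit) fail.
baseline_call = {
    "tool": "edit_file",
    "path": "src/server/config.py",
    "old_text": 'TIMEOUT_SECONDS = 30\nMAX_RETRIES = 3\nLOG_LEVEL = "info"',
    "new_text": 'TIMEOUT_SECONDS = 60\nMAX_RETRIES = 5\nLOG_LEVEL = "info"',
}

# Coordinate-addressed edit: the location is named by line/column coordinates,
# so only the replacement text itself is emitted.
mouse_call = {
    "tool": "replace_range",
    "path": "src/server/config.py",
    "start": {"line": 12, "col": 1},
    "end": {"line": 13, "col": 16},
    "text": "TIMEOUT_SECONDS = 60\nMAX_RETRIES = 5",
}

# Rough character-count proxy for the verbosity tax on this toy edit.
print(len(str(baseline_call)), "vs", len(str(mouse_call)), "characters of tool output")
```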

Predictable Execution

Beyond average performance, Mouse produces remarkably consistent results. While baseline tools show high variance (SD = 26s), Mouse operations are predictable (SD = 2.6s). This consistency matters for production workflows where reliability is as important as speed.

Read the Full Paper

Get all the details: methodology, statistical analysis, additional findings, and discussion of implications.

Working Draft — January 2026 · Comments welcome at research@hic-ai.com