Research is not one path
AutoResearch-style hill climbing keeps only the current best, discarding useful partial ideas and complementary local optima.
GEAR replaces single-incumbent hill climbing in AutoResearch-style agents with a bounded frontier of elite research states, expanded through mutation, crossover, and—in its strongest variant—an evolvable search controller.
Main result
Keep multiple elite research states alive instead of collapsing to one incumbent.
Explore local variants and recombine complementary ideas across branches.
Let the agent repair the search policy when its invariants become too weak.
Overview
Each node preserves code, parentage, metrics, reflections, and productivity statistics for future expansion.
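As a minimal sketch of what such a node could hold (the field names and the productivity definition below are illustrative assumptions, not the paper's exact schema):

from dataclasses import dataclass

@dataclass
class EliteNode:
    node_id: int
    parent_ids: list          # one parent for mutation, two for crossover
    code: str                 # the full experiment artifact
    metrics: dict             # e.g. {"val_bpb": 0.98}
    reflections: list         # agent notes on what worked and why
    created_step: int = 0     # step at which the node entered the frontier
    expansions: int = 0       # how often this node was selected as a parent
    improvements: int = 0     # how many of its children beat it

    @property
    def productivity(self) -> float:
        # Fraction of expansions that produced a better child.
        return self.improvements / max(self.expansions, 1)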
GEAR keeps the same training harness and objective, but swaps greedy keep-or-discard search for a structured frontier.
Autonomous ML research agents can now edit code, run experiments, inspect results, and decide what to try next. But a single-incumbent loop is a brittle representation of research: once a direction is worse than the current best, the concrete artifact behind that idea is usually gone.
GEAR treats agentic research as a population-based search process. It maintains a bounded frontier of elite nodes, selects parents by balancing productivity, novelty, coverage, and recency, then expands the frontier through mutation or semantic crossover.
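A sketch of that selection step, assuming a simple weighted score over per-node signals; the weights and signal definitions are illustrative guesses, not the paper's formula. It reuses the EliteNode sketch above.

import math
import random

def select_parent(frontier, step, weights=(1.0, 0.5, 0.5, 0.25)):
    """Sample one parent from the frontier by a weighted score.

    Signals (all illustrative): productivity, novelty, coverage, recency.
    """
    w_prod, w_nov, w_cov, w_rec = weights
    mean_bpb = sum(n.metrics["val_bpb"] for n in frontier) / len(frontier)

    def score(n):
        novelty = abs(n.metrics["val_bpb"] - mean_bpb)  # distance from the pack
        coverage = 1.0 / (1 + n.expansions)             # favor rarely expanded nodes
        recency = math.exp(-(step - n.created_step) / 25.0)
        return (w_prod * n.productivity + w_nov * novelty
                + w_cov * coverage + w_rec * recency)

    # Sample proportionally to score rather than taking the argmax,
    # so weaker anchors still get occasional expansions.
    scores = [max(score(n), 1e-6) for n in frontier]
    return random.choices(frontier, weights=scores, k=1)[0]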
The result is a drop-in genetic search policy for AutoResearch-style systems. Across the same language-modeling setup and a budget of 100 experiment steps, all GEAR variants beat the baseline, and the self-evolving controller achieves the lowest validation bits-per-byte.
Method
GEAR replaces the baseline loop with a frontier of elite nodes. The agent consults the frontier, selects one or two parents, creates a child, runs the fixed training job, evaluates fitness, and either promotes the child into the frontier or discards it.
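Put together with the sketches above, one plausible shape for that loop; agent.mutate, agent.crossover, and run_training are hypothetical hooks standing in for the LLM edit calls and the fixed training harness, and the promotion rule is an assumption.

import random

def gear_step(frontier, agent, step, max_frontier=8, p_crossover=0.3):
    """One GEAR expansion: select, expand, train, evaluate, promote or drop."""
    if len(frontier) >= 2 and random.random() < p_crossover:
        a = select_parent(frontier, step)
        b = select_parent([n for n in frontier if n is not a], step)
        child_code = agent.crossover(a, b)   # transplant ideas across branches
        parent_ids = [a.node_id, b.node_id]
    else:
        parent = select_parent(frontier, step)
        child_code = agent.mutate(parent)    # local edit of one elite branch
        parent_ids = [parent.node_id]

    metrics = run_training(child_code)       # same fixed job and objective
    child = EliteNode(step, parent_ids, child_code, metrics,
                      reflections=[], created_step=step)

    # One plausible promotion rule: keep the best max_frontier nodes seen.
    frontier.append(child)
    frontier.sort(key=lambda n: n.metrics["val_bpb"])
    del frontier[max_frontier:]              # bounded frontier of elites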
GEAR's variants form a ladder: search policy written in natural language, policy externalized as deterministic code, and, in GEAR-Evolve, policy treated as a mutable search target. This separates what the agent changes in the experiment from how the agent searches.
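One way to picture the top rung of that ladder, assuming the controller is just source code the agent can rewrite mid-run; everything below is a sketch, not GEAR-Evolve's actual mechanism.

import types

# Rung 1: policy as natural language in the agent's prompt.
PROMPT_POLICY = "Prefer productive parents; cross complementary branches."

# Rung 2: policy as fixed, deterministic code (e.g. select_parent above).

# Rung 3: policy as source code the agent itself may patch.
controller_source = '''
def pick(frontier, step):
    # placeholder policy: always expand the current best node
    return min(frontier, key=lambda n: n.metrics["val_bpb"])
'''

def load_controller(source):
    """Compile controller source into a callable module."""
    mod = types.ModuleType("controller")
    exec(compile(source, "<controller>", "exec"), mod.__dict__)
    return mod

controller = load_controller(controller_source)
# When the agent spots degenerate search behavior, it edits
# controller_source and reloads it, changing how it searches
# without touching the experiment code.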
Results
Under identical environments and a budget of 100 experiment steps, all GEAR variants achieve lower validation bpb than the AutoResearch baseline. GEAR-Evolve is strongest overall, reaching 0.97658 bpb and crossing below the baseline's plateau earliest.
The baseline concentrates almost all progress early and then stalls. GEAR variants keep improving across the full budget, showing the value of preserving diverse anchors and recombining ideas instead of committing to a single path.
Analysis
Single-parent edits let each elite branch test architecture, optimizer, schedule, and regularization changes without erasing other branches.
Complementary parents let the agent transplant useful changes across branches, turning separate partial wins into stronger children.
GEAR-Evolve identifies degenerate crossover behavior and patches the controller, improving the search loop itself mid-run.
The key empirical difference is not only final bpb. The baseline quickly converges to a local optimum, while GEAR maintains multiple lines of inquiry and continues discovering improvements over longer horizons.
Mechanized crossover is especially important: the fixed and evolved controllers enforce complementarity and avoid repeatedly using the same parent pair, producing substantially higher-quality crossover attempts than prompt-only execution.
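A minimal sketch of that mechanization, assuming a seen-pair set and a metric-gap test for complementarity; both are guesses at the controller's internals, not the paper's implementation.

import random

def pick_crossover_pair(frontier, seen_pairs, min_gap=1e-3):
    """Pick a complementary, not-yet-tried parent pair, or None.

    seen_pairs: set of frozensets of node ids already crossed.
    min_gap:    assumed threshold on how different the parents must be.
    """
    candidates = []
    for i, a in enumerate(frontier):
        for b in frontier[i + 1:]:
            pair = frozenset((a.node_id, b.node_id))
            gap = abs(a.metrics["val_bpb"] - b.metrics["val_bpb"])
            if pair not in seen_pairs and gap >= min_gap:
                candidates.append((a, b, pair))
    if not candidates:
        return None                  # controller falls back to mutation
    a, b, pair = random.choice(candidates)
    seen_pairs.add(pair)             # never reuse the same parent pair
    return a, b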
Citation
@misc{jeddi2026geargeneticautoresearchagentic,
  title={GEAR: Genetic AutoResearch for Agentic Code Evolution},
  author={Ahmadreza Jeddi and Minh Ngoc Le and Hakki C. Karaimer and Konstantinos G. Derpanis and Babak Taati},
  year={2026},
  eprint={2605.13874},
  archivePrefix={arXiv},
  primaryClass={cs.NE},
  url={https://arxiv.org/abs/2605.13874},
}