autoresearch
by github
autoresearch is an autonomous experimentation loop for coding tasks with measurable outcomes. It helps developers define a goal, baseline, metric, and scope, then iterate through code changes, tests, and keep-or-revert decisions using git-backed checkpoints.
This skill scores 82/100, which means it is a solid directory listing candidate: users can quickly understand when to invoke it, what prerequisites it has, and what workflow it will drive, though they should expect a documentation-only skill rather than a packaged tool with installable helpers.
- Highly triggerable: the description clearly defines fit as autonomous iterative experimentation for programming tasks with a measurable metric, and explicitly rules out one-shot tasks and simple bug fixes.
- Operationally clear: it states concrete prerequisites and constraints, including requiring git, a git repository, terminal access, an interactive setup phase, baseline measurement, and commit-before-run experiment discipline.
- Real agent leverage: the body is substantial and workflow-heavy, with multiple sections and code fences describing an autonomous loop of code changes, testing, measuring, and keeping or discarding results.
- Adoption is documentation-led only: there are no scripts, resources, references, or install command, so execution depends on the agent correctly following prose instructions.
- Usefulness depends on having a measurable outcome and a repo-ready environment; tasks without clear metrics or without git/terminal access are explicitly out of scope.
Overview of autoresearch skill
What autoresearch is for
The autoresearch skill is an autonomous experimentation loop for coding tasks where success can be measured. Instead of asking an agent for one big fix, you define a target, a metric, and boundaries; the agent then iterates through changes, tests, measurements, and keep-or-revert decisions.
Who should install autoresearch
The best fit for the autoresearch skill is a developer who wants repeatable improvement, not a one-shot answer. It is especially useful for:
- performance tuning
- measurable benchmark improvement
- reliability or test-pass-rate improvement
- reducing build time or runtime cost
- trying multiple implementation variants safely
If your task is a simple bug fix, a code review, or anything without a measurable outcome, autoresearch is usually the wrong tool.
The real job-to-be-done
Users adopt autoresearch when they want the agent to behave more like an experiment operator than a code generator. The job is not “write code”; it is “run disciplined iterations against a defined metric and stop when gains flatten or constraints are hit.”
What makes autoresearch different from a normal prompt
A normal prompt often produces one proposed solution. autoresearch is different because it structures the work around:
- an explicit goal
- a baseline measurement
- a repeatable experiment loop
- git-backed checkpoints
- a decision process for keeping or discarding results
That difference matters most when several plausible changes might help, but only measurement can tell.
Main adoption constraints to know first
Before you install autoresearch, check the hard requirements:
- your project must already be a git repository
- the agent needs terminal access
- the task needs a measurable metric
- the metric must be runnable often enough to support iteration
The skill is light on support files and centers almost entirely on SKILL.md, so your decision depends on whether that workflow matches your environment.
How to Use autoresearch skill
Install autoresearch in your skill environment
Install it from the GitHub skill repository with:
npx skills add github/awesome-copilot --skill autoresearch
After installation, open skills/autoresearch/SKILL.md first. This skill has no extra scripts or helper references, so most operational detail lives there.
Read this file before anything else
Start with SKILL.md.
Because the repository does not include separate automation assets, the quality of your autoresearch usage depends on understanding the workflow described in that file rather than hunting for hidden tooling.
Confirm your project is a good fit
Use autoresearch when you can answer all three:
- What exact outcome should improve?
- How will you measure it?
- What constraints must not be violated?
Good examples:
- “Reduce endpoint latency by 20% while keeping all tests green.”
- “Increase benchmark throughput on bench/search.js without increasing memory beyond 10%.”
- “Improve flaky test pass rate from 82% to 95%.”
Weak examples:
- “Make the code cleaner.”
- “Refactor this area.”
- “Fix whatever seems wrong.”
- “Improve architecture.”
Define the metric before the loop starts
The most important setup step is choosing a metric the agent can actually run. Strong metrics are:
- objective
- fast enough to rerun
- stable enough to compare
- tied to the real goal
Examples:
- npm test -- --runInBand
- a benchmark script with median runtime
- build duration
- request latency from a local harness
- binary size
- failure count across repeated runs
If the metric is noisy, require multiple runs or a threshold for meaningful improvement.
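A threshold check like the following is one simple way to require meaningful improvement; the function name and 5% default are illustrative assumptions, not part of the skill.

```python
def is_meaningful_improvement(baseline: float, candidate: float,
                              threshold: float = 0.05,
                              lower_is_better: bool = True) -> bool:
    """Return True only if candidate beats baseline by more than
    the relative threshold, so small noisy wins are not kept."""
    if lower_is_better:
        return candidate < baseline * (1 - threshold)
    return candidate > baseline * (1 + threshold)
```

For example, with a 5% threshold a latency drop from 100 ms to 97 ms is treated as noise, while a drop to 94 ms counts as a keepable result.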
Turn a rough goal into a strong prompt
A weak request leaves the loop guessing. A strong request gives the agent a target, metric, scope, and stopping rule.
Weak:
Use autoresearch to improve this service.
Stronger:
Use autoresearch on this repository to reduce npm run bench:api median latency by at least 15%. Keep npm test passing, do not change external API behavior, and limit work to src/cache and src/http. Establish a baseline first, commit each experiment, and stop after 8 iterations or when improvements plateau.
That prompt works better because it removes ambiguity the loop cannot safely infer.
Provide explicit scope constraints
The skill is designed to ask for setup details interactively. Help it by pre-specifying:
- allowed directories
- forbidden files
- whether dependency changes are allowed
- runtime or memory ceilings
- acceptable tradeoffs
- max number of iterations
Without this, the agent may spend iterations exploring areas you would have ruled out immediately.
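One way to pre-specify these constraints is to write them down as structured data before invoking the skill. This is a hypothetical sketch: the field names are illustrative and not defined by the skill itself, and the directory values are borrowed from the example prompt earlier in this guide.

```python
# Illustrative scope specification handed to the agent up front.
# None of these keys are mandated by autoresearch; they simply make
# the constraints explicit instead of leaving them to interactive Q&A.
scope = {
    "allowed_dirs": ["src/cache", "src/http"],   # where edits may happen
    "forbidden_files": [],                       # files that must not change
    "dependency_changes": False,                 # no new packages
    "memory_ceiling_mb": 512,                    # hard resource limit
    "max_iterations": 8,                         # stop condition
}
```

Even as plain text pasted into the prompt, a block like this removes most of the back-and-forth during the interactive setup phase.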
Follow the intended autoresearch loop
In practice, the autoresearch skill works best as:
- define goal
- define metric
- record baseline
- propose one experiment
- make code changes
- run measurement
- compare with baseline
- keep or discard
- commit the attempt
- repeat until stop criteria are met
The key operational idea is controlled iteration, not broad autonomous refactoring.
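The steps above can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `run_metric`, `apply_experiment`, and `revert` are placeholders you would wire to your own benchmark command, code-editing step, and git reset, and the metric is assumed to be lower-is-better.

```python
def autoresearch_loop(run_metric, apply_experiment, revert,
                      max_iterations: int = 8, min_gain: float = 0.02):
    """Controlled iteration: one experiment per pass, keep only
    candidates that beat the current best by a real margin."""
    baseline = run_metric()          # record baseline once, up front
    best = baseline
    for i in range(max_iterations):
        apply_experiment(i)          # one small change per iteration
        candidate = run_metric()     # measure the attempt
        if candidate < best * (1 - min_gain):
            best = candidate         # keep: new checkpoint to build on
        else:
            revert()                 # discard: restore last good state
    return baseline, best
```

Note what the sketch does not do: it never applies several speculative changes at once, which is exactly the "controlled iteration, not broad refactoring" discipline the skill describes.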
Use git the way the skill expects
git is not optional here. The workflow explicitly depends on checkpointing each experiment attempt. That gives you:
- reversible trials
- cleaner comparison between ideas
- a clearer audit trail
- safer autonomous exploration
If your working tree is messy before you start, clean it first. Autoresearch is much easier to trust when every trial is isolated.
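The checkpoint discipline boils down to a couple of git commands per trial. As a hedged sketch, the helpers below just build the command strings an agent might run; the commit-message format is an assumption, and `git revert` (which records the discard) is only one option next to `git reset --hard`, which erases it.

```python
def checkpoint_commands(iteration: int) -> list[str]:
    """Git commands to snapshot one experiment attempt."""
    return [
        "git add -A",
        f'git commit -m "autoresearch: experiment {iteration}"',
    ]

def revert_commands() -> list[str]:
    """Git commands to discard the last experiment while keeping
    an audit trail (a revert commit, rather than a hard reset)."""
    return ["git revert --no-edit HEAD"]
```

Choosing revert over reset trades a noisier history for the "clearer audit trail" benefit listed above: every discarded idea remains inspectable.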
Suggested workflow inside a real repository
A practical way to run autoresearch is:
- clean working tree
- verify metric command runs locally
- verify baseline once manually
- invoke the skill with goal, metric, and scope
- let it iterate in small batches
- review kept commits, not every discarded idea
- rerun the winning result independently before merging
This keeps the experiment loop useful without surrendering review discipline.
Tips that improve output quality fast
High-impact habits:
- choose one primary metric, not five competing goals
- keep the experiment surface small at first
- define what “no regression” means
- set a max iteration count
- ask for a short log of attempts and outcomes
- prefer measurable local commands over subjective evaluation
These choices matter more than fancy wording.
autoresearch skill FAQ
Is autoresearch better than an ordinary coding prompt?
For measurable optimization tasks, yes. For one-off implementation requests, usually no. The value of autoresearch comes from repeated measured trials, not from initial code generation quality alone.
Is autoresearch beginner-friendly?
It is usable by beginners, but only if they can define a runnable metric and understand the repository enough to set scope. The skill reduces experimentation guesswork; it does not remove the need for clear success criteria.
When should I not use autoresearch?
Skip the autoresearch skill when:
- there is no trustworthy metric
- the task is mostly design judgment
- the codebase is too risky for autonomous edits
- experiment runs are too slow or expensive
- you only need a simple fix
Does autoresearch require special project structure?
No special framework is required, but it does require:
- a git repository
- terminal access
- commands the agent can run to measure progress
That makes it broadly applicable across languages, provided your measurement loop is real.
How is it different from CI-driven optimization?
CI can verify results, but autoresearch is about generating and evaluating candidate changes in a loop. Think of CI as the safety net and autoresearch as the experiment operator.
Is autoresearch useful outside performance tuning?
Yes, if the outcome is measurable. It can also fit reliability, pass-rate, cost, build-speed, or other programming tasks with a clear metric. It is much less useful for ambiguous “improve this” requests.
How to Improve autoresearch skill
Start with a sharper problem statement
The fastest way to improve autoresearch results is to replace vague objectives with operational ones. Include:
- target metric
- baseline command
- acceptable regressions
- scope boundaries
- stop condition
A precise setup usually outperforms giving the agent more freedom.
Reduce metric noise before blaming the skill
A common failure mode is chasing random variance. If results fluctuate, improve the benchmark setup:
- run multiple trials
- use medians
- isolate background processes
- warm caches consistently
- fix input datasets
Better measurement often improves the skill more than changing prompts.
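The "multiple trials, use medians" advice can be wrapped into the measurement step itself. A minimal sketch, assuming `run_once` is whatever callable executes your benchmark and returns a number:

```python
import statistics

def measure(run_once, trials: int = 7) -> float:
    """Run the benchmark several times and report the median,
    which is far less sensitive to one-off outliers than a
    single run or the mean."""
    return statistics.median(run_once() for _ in range(trials))
```

With a single slow outlier among seven runs, the median is unaffected, whereas a lone measurement or an average would have been dragged off target.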
Narrow the search space early
If autoresearch roams too widely, constrain it. Ask it to start in one subsystem, one hotspot, or one class of changes. Broad search sounds powerful, but narrower search usually yields better, reviewable gains.
Tell the skill what must never change
Many poor outcomes come from missing guardrails. State non-negotiables such as:
- API compatibility
- test suite pass requirements
- dependency freeze
- memory ceilings
- style or security restrictions
This helps the agent reject locally good but globally bad changes.
Ask for experiment logging, not just final code
To get more value from the autoresearch workflow, ask the agent to summarize:
- each attempted change
- measured result
- keep/discard decision
- reason for rejection
This makes iteration auditable and helps you spot patterns in failed attempts.
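The four fields above map directly onto a small record type. This is an illustrative sketch of the per-attempt log you might request, not a format the skill defines:

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    change: str     # what was tried
    metric: float   # measured result
    kept: bool      # keep/discard decision
    reason: str     # why it was kept or rejected

def summarize(attempts: list[Attempt]) -> str:
    """One-line summary useful for spotting patterns in failures."""
    kept = sum(a.kept for a in attempts)
    return f"{kept}/{len(attempts)} attempts kept"
```

Scanning the `reason` fields of rejected attempts is often where the pattern-spotting payoff shows up, e.g. several rejections all citing the same regressed test.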
Iterate on prompts after the first run
If the first run disappoints, do not just rerun unchanged. Improve one of:
- the metric
- the allowed scope
- the stop rule
- the benchmark command
- the explicit hypotheses to test
Example:
On the next autoresearch run, focus only on allocation reduction in src/parser, ignore stylistic refactors, and compare median time across 7 runs.
That kind of refinement changes behavior materially.
Know the most common misfire patterns
Watch for:
- optimizing the wrong metric
- regressions hidden by weak tests
- too-large code changes per iteration
- benchmark commands that are slow or flaky
- stopping too early after one apparent win
These are usually setup problems, not proof that autoresearch is ineffective.
Review winners independently before merging
Even when autoresearch finds an improvement, validate it outside the loop:
- rerun the benchmark yourself
- run a broader test suite
- inspect maintainability tradeoffs
- confirm the gain matters in production terms
The skill is strongest at discovering candidates. Final acceptance should still be deliberate.
