
agent-eval

by affaan-m

agent-eval is a skill for benchmarking coding agents head-to-head on reproducible tasks, comparing pass rate, cost, time, and consistency. Use the agent-eval skill to evaluate Claude Code, Aider, Codex, or another agent in your own repo with clearer evidence than ad hoc prompting.

Stars: 156k
Favorites: 0
Comments: 0
Added: Apr 15, 2026
Category: Model Evaluation
Install Command
npx skills add affaan-m/everything-claude-code --skill agent-eval
Curation Score

This skill scores 78/100, which means it is a solid listing candidate for directory users who want a reproducible way to compare coding agents. The repository gives enough operational detail to understand when to use it and how it works, though users should still review the source before installing because there are no supporting scripts or reference files.

Strengths
  • Clear activation use cases for comparing agents, regression checks, and model/tool adoption decisions.
  • Concrete workflow elements: YAML task definitions, judge checks, and git worktree isolation for reproducible comparisons.
  • Strong install-decision value for teams wanting data-backed agent selection instead of ad hoc comparisons.
Cautions
  • No install command, scripts, or support files are provided, so adoption still depends on reading the main skill file.
  • The repository appears focused on one lightweight CLI workflow; users needing broader evaluation infrastructure may want more tooling.
Overview


agent-eval is a skill for benchmarking coding agents head-to-head on the same task, then comparing results by pass rate, cost, time, and consistency. If you are deciding whether to adopt Claude Code, Aider, Codex, or another agent in a real repo, the agent-eval skill helps you move from opinion to reproducible evidence.

It is best for teams and power users who need a fair comparison, not a generic “prompt it and see” test. The real job-to-be-done is to define a task once, run multiple agents against the same baseline, and judge which one performs best under your constraints.

What makes agent-eval useful

The key value of agent-eval is controlled comparison: same repo, same task, same success checks, separate worktrees. That makes the results easier to trust than ad hoc trials or one-off prompts.

When the skill fits

Use the agent-eval skill when you want to:

  • compare agents before standardizing a workflow
  • check whether a model update changed outcomes
  • test performance on your own codebase and rules
  • gather decision evidence for a team or procurement choice

When it may not fit

If you only need a single coding answer, a normal prompt is simpler. agent-eval is most valuable when you care about repeatability, evaluation criteria, and tradeoffs between speed, quality, and cost.

How to Use agent-eval skill

Install and inspect the skill

To install agent-eval, add the skill from its repo and read the core skill file first:
npx skills add affaan-m/everything-claude-code --skill agent-eval

Then open SKILL.md and any linked context you use in your workflow. In this repository, the main source is the skill file itself, so the install decision depends heavily on whether its task model matches your evaluation process.

Turn a vague goal into a usable task

agent-eval works best when you define a concrete task, a target repo, and objective checks. A weak task is "test which agent is better at refactoring." A stronger task would:

  • add retry logic to src/http_client.py
  • pin the repo to a commit for reproducibility
  • specify files that may change
  • define judge commands such as pytest or grep
  • state the maximum acceptable time or cost if that matters

The more the task can be verified automatically, the more useful the comparison.
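The task components above can be sketched as a YAML file. The field names below are illustrative guesses, not the skill's documented schema, so check SKILL.md for the real format before relying on it:

```yaml
# Hypothetical task definition; field names are illustrative,
# not the skill's documented schema.
task: add-retry-logic
repo: .
commit: 3f2c9a1          # pinned for reproducibility (example SHA)
prompt: |
  Add retry logic with exponential backoff to src/http_client.py.
files:                   # edits outside this list fail the run
  - src/http_client.py
  - tests/test_http_client.py
judges:                  # objective checks, e.g. pytest or grep
  - pytest tests/test_http_client.py
  - grep -q "backoff" src/http_client.py
limits:
  max_minutes: 15
  max_cost_usd: 2.00
```

Keeping every judge a shell command makes pass/fail mechanical, which is what lets runs from different agents be compared directly.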

Suggested workflow

A practical agent-eval workflow is:

  1. Pick one task that reflects a real decision you need to make.
  2. Write the task in YAML with repo path, files, prompt, and judges.
  3. Run multiple agents on the same task.
  4. Compare output quality, execution time, and cost.
  5. Repeat with another task before making a final choice.

The skill uses git worktree isolation, which helps avoid agents stepping on each other’s changes and makes side-by-side evaluation cleaner.
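The worktree isolation step can be sketched with plain git commands. Directory names and the agent list here are illustrative, not the skill's actual conventions:

```shell
# Sketch of worktree-based isolation: every agent gets its own checkout
# of the same pinned commit, so runs cannot step on each other's changes.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=eval -c user.email=eval@example.com \
  commit -q --allow-empty -m "baseline"
base=$(git rev-parse HEAD)

# One detached worktree per agent, all starting from the same commit.
for agent in agent-a agent-b; do
  git worktree add -q "../$(basename "$repo")-$agent" "$base"
done

git worktree list
```

Detached-HEAD worktrees are what make this work: the same commit can be checked out in several places at once, which a shared branch would not allow.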

Read these files first

Start with:

  • SKILL.md for the task format and workflow
  • any repo-local files that define your testing or judging rules
  • the files named in your YAML task definition

If you are evaluating agent-eval for Model Evaluation specifically, confirm that your tasks and judges are stable enough to produce comparable runs before you invest in larger benchmarks.

agent-eval skill FAQ

Is agent-eval only for coding-agent benchmarks?

Yes, primarily. The skill is designed for head-to-head coding agent comparison, not general prompt testing or broad LLM benchmarking.

Do I need Docker to use it?

No. The skill uses git worktree isolation, so you can keep runs separated without container overhead.

Is it beginner friendly?

It is approachable if you can define a task clearly and run a command-line workflow. It is less suited to users who want a one-click evaluator with no setup.

How is this different from a normal prompt?

A normal prompt asks one agent to solve one task. The agent-eval skill asks multiple agents to solve the same task with fixed judges, so you can compare outcomes with less bias.

How to Improve agent-eval skill

Use stronger task definitions

The best agent-eval results come from tasks with clear inputs, clear edit boundaries, and objective judges. If your prompt is too open-ended, the comparison will mostly measure interpretation differences instead of agent quality.

Add judges that reflect real success

Prefer checks that mirror how your team actually validates changes: tests, lint, file diffs, or pattern checks. If the judge is too loose, weak solutions can look good; if it is too strict, you may reward brittle hacks.
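As a minimal sketch of loose versus stricter judges (file names and commands are illustrative, not from the skill):

```shell
# Contrast a loose judge with a stricter one on a toy change.
set -e
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p src
printf 'def fetch(retries=3):\n    return "ok"\n' > src/http_client.py

# Loose judge: only checks the file is non-empty -- weak solutions pass.
test -s src/http_client.py && echo "loose judge: PASS"

# Stricter judge: a pattern check mirroring a real acceptance criterion.
grep -q "retries" src/http_client.py && echo "pattern judge: PASS"
```

A test-suite judge (e.g. a pytest command) is stricter still, since it asserts behavior rather than text.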

Iterate on the benchmark, not the answer

If one agent wins for the wrong reason, revise the task before drawing conclusions. Tighten the files list, clarify acceptance criteria, and pin the commit so the agent-eval skill measures the same target every time.

Watch for common failure modes

The most common mistakes are vague prompts, mismatched judges, and tasks that are too large for a fair comparison. For better agent-eval usage, keep the first benchmark small, reproducible, and representative of the work you actually want agents to do.
