eval-harness
by affaan-m
The eval-harness skill is a formal evaluation framework for Claude Code sessions and eval-driven development. It helps you define pass/fail criteria, build capability and regression evals, and measure agent reliability before shipping prompt or workflow changes.
This skill scores 78/100, which means it is a solid directory candidate with real workflow value for agents doing eval-driven development. Users should be able to trigger it and understand its purpose quickly, though they should expect a mostly documentation-driven skill rather than one backed by helper scripts or bundled references.
- Clear activation use cases for EDD setup, pass/fail criteria, regression evals, and benchmarking
- Substantial operational content with structured eval and grader templates plus multiple workflow sections
- Strong triggerability from the frontmatter and explicit 'When to Activate' guidance, making install intent easy to judge
- No install command, scripts, or support files, so adoption depends on reading and applying the markdown guidance manually
- No references/resources/tests bundled, which limits trust signals for users who want a turnkey evaluation harness
Overview of eval-harness skill
What eval-harness does
The eval-harness skill is a formal evaluation framework for Claude Code sessions and eval-driven development. It helps you define what “good” looks like before you ship, then measure whether an agent, prompt, or workflow actually meets that bar.
Who should use it
Use the eval-harness skill if you need repeatable checks for AI-assisted coding, prompt changes, or agent behavior. It is especially useful for teams comparing model versions, tracking regressions, or turning vague task expectations into pass/fail criteria.
Why it matters
The main value of eval-harness for model evaluation is reliability: instead of judging outcomes by feel, you write evals that expose when behavior changes. That makes it easier to debug agent performance, compare runs, and avoid shipping prompt updates that silently degrade quality.
When it is a good fit
It fits best when the task can be expressed as observable success criteria, output structure, or checkpointed behavior. It is less useful for open-ended creative work unless you can still define measurable acceptance conditions.
How to Use eval-harness skill
Install and activate
To install eval-harness, use the repo’s skill-install flow in your Claude Code environment, then open the skill file directly. The skill lives at skills/eval-harness/SKILL.md, and that is the first file to read because it defines when to activate the framework and how to structure evals.
Build a prompt the skill can evaluate
To get strong results from eval-harness, do not start with “test my agent.” Start with a concrete target: what task the agent must complete, what counts as success, what failure looks like, and whether you are checking capability or regression. A better input looks like: “Evaluate whether the agent can update a React form without breaking validation, and require three explicit success criteria.” That gives the harness something measurable.
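The skill does not prescribe a file format for evals, so the dataclass below is a hypothetical sketch of how to make that target concrete; the field names and the `is_well_formed` check are illustrative, not part of the skill.

```python
# Hypothetical eval definition: one way to capture the task, the eval kind,
# and explicit success criteria before handing anything to the harness.
from dataclasses import dataclass, field

@dataclass
class Eval:
    task: str                                      # what the agent must complete
    kind: str                                      # "capability" or "regression"
    success_criteria: list = field(default_factory=list)
    failure_signs: list = field(default_factory=list)

    def is_well_formed(self) -> bool:
        # Mirror the example input: require three explicit success criteria.
        return bool(self.task) and len(self.success_criteria) >= 3

form_eval = Eval(
    task="Update the React form without breaking validation",
    kind="capability",
    success_criteria=[
        "New field renders in the form",
        "Existing validation rules still fire",
        "Form submits with valid data",
    ],
    failure_signs=["Validation errors disappear", "Submit succeeds on invalid input"],
)

print(form_eval.is_well_formed())  # True
```

Writing the eval down in this shape forces the vague “test my agent” request into something the harness can actually score.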
Read the right files first
If you are adopting the eval-harness guide approach inside your own workflow, read SKILL.md first, then inspect any repository notes that describe evaluation style, grading logic, or output conventions. In this repo, there are no helper scripts or extra support folders, so the skill file itself is the source of truth.
Use it in a practical workflow
A good workflow is: define the behavior, write one eval for the happy path, add one regression eval for a known failure, then run the harness and refine the criteria. This keeps evals small enough to debug and reduces the chance of writing tests that are too broad to interpret.
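The workflow above can be sketched as a minimal harness loop. This is a hypothetical sketch, not code shipped with the skill: `run_agent` is a stand-in for your actual agent call, and the eval records are illustrative.

```python
# Minimal harness loop: one happy-path capability eval plus one regression
# eval for a known failure, each graded against named criteria.
def run_agent(task):
    # Placeholder for your real agent invocation.
    return {"validation_intact": True, "output_is_json": True}

def grade(result, criteria):
    # Each criterion is a (name, predicate) pair; all must pass.
    failures = [name for name, check in criteria if not check(result)]
    return ("PASS", []) if not failures else ("FAIL", failures)

evals = [
    ("capability: add form field", [
        ("validation intact", lambda r: r["validation_intact"]),
    ]),
    ("regression: JSON output still valid", [
        ("output is JSON", lambda r: r["output_is_json"]),
    ]),
]

for name, criteria in evals:
    status, failures = grade(run_agent(name), criteria)
    print(f"{name}: {status}", failures or "")
```

Keeping each eval to a handful of named criteria is what makes a FAIL debuggable: the failure list tells you which criterion broke, not just that something did.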
eval-harness skill FAQ
Is eval-harness only for Claude Code?
No. The skill is written around Claude Code sessions, but the underlying method is useful anywhere you need structured agent evaluation. If your stack uses different tools, you can still adapt the eval format and grading logic.
Is eval-harness the same as a normal prompt?
No. A normal prompt asks for an answer; eval-harness asks for a repeatable way to judge answers. That distinction matters when you need consistency across versions, not just a single good response.
Is it beginner-friendly?
Yes, if you can describe a task clearly. The harder part is not the syntax; it is writing good success criteria. Beginners usually do well when they start with one simple capability eval instead of trying to model an entire workflow at once.
When should I not use it?
Skip eval-harness if the work is highly subjective, if the output cannot be checked consistently, or if you only need a one-off answer. It is strongest when reliability, regression tracking, or model comparison is the real goal.
How to Improve eval-harness skill
Make criteria observable
The biggest quality gain comes from turning opinions into checks. Replace “make it better” with conditions like “preserve existing API shape,” “return valid JSON,” or “pass all three regression cases.” The more observable the criteria, the easier eval-harness becomes to run and trust.
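As one example, “return valid JSON” becomes observable with a few lines of grading code. This is a sketch of the idea, not part of the skill itself:

```python
# Turn the criterion "return valid JSON" into a check that either passes
# or fails, with no judgment call involved.
import json

def check_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(check_valid_json('{"name": "form", "fields": 3}'))  # True
print(check_valid_json("make it better"))                 # False
```

A criterion like “make it better” cannot be graded this way, which is exactly the signal that it needs rewriting.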
Separate capability from regression
If you mix new-feature checks with old-behavior checks, failures become hard to interpret. Keep capability evals focused on whether Claude can do something new, and regression evals focused on whether a known baseline still holds.
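One lightweight way to keep the two apart is to tag each eval with a kind and report them separately. The records below are hypothetical; the skill does not mandate this scheme.

```python
# Group results by kind so a capability failure never hides inside a
# regression report, and vice versa.
results = [
    {"name": "agent adds new form field", "kind": "capability", "passed": True},
    {"name": "existing validation still fires", "kind": "regression", "passed": False},
]

by_kind = {}
for r in results:
    by_kind.setdefault(r["kind"], []).append(r)

for kind, group in by_kind.items():
    failed = [r["name"] for r in group if not r["passed"]]
    passed = len(group) - len(failed)
    print(f"{kind}: {passed}/{len(group)} passed; failed: {failed}")
```

With this split, a red regression line means a known baseline broke, while a red capability line means the new behavior is not there yet: two different fixes.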
Give the harness real edge cases
Stronger evals include failure modes, not just happy paths. Add tricky inputs, incomplete context, or ambiguous instructions so the eval-harness skill can reveal whether the agent is robust or merely lucky on clean examples.
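A small edge-case set can sit next to the happy path in the same eval. The prompts below are hypothetical examples of the failure modes mentioned above:

```python
# Pair the clean prompt with variants that probe missing context and
# ambiguity, so one eval set covers robustness as well as capability.
edge_cases = {
    "happy path": "Add an email field to the signup form",
    "incomplete context": "Add the field we discussed earlier",  # no prior discussion supplied
    "ambiguous instruction": "Make the form handle emails",      # validate, send, or store?
    "adversarial naming": "Add an email field named 'submit'",   # collides with the submit button
}

for label, prompt in edge_cases.items():
    print(f"[{label}] {prompt}")
```

Running the same success criteria across all four inputs is what separates a robust agent from one that only passes on clean examples.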
Iterate after the first run
Treat the first run as calibration, not proof. If the result is unclear, tighten the success criteria, add a baseline, or split one broad eval into smaller checks. That is usually the fastest way to improve eval-harness usage and get results you can act on.
