Evaluation

Evaluation taxonomy generated by the site skill importer.

19 skills

healthcare-eval-harness

by affaan-m

healthcare-eval-harness is a patient safety evaluation harness for healthcare app deployments. It helps teams check CDSS accuracy, PHI exposure, data integrity, clinical workflow behavior, and integration compliance before release. Critical failures block deployment, which makes it useful for model evaluation and CI safety gates.

Model Evaluation

Favorites 0 | GitHub 156.2k

eval-harness

by affaan-m

The eval-harness skill is a formal evaluation framework for Claude Code sessions and eval-driven development. It helps you define pass/fail criteria, build capability and regression evals, and measure agent reliability before shipping prompt or workflow changes.

Model Evaluation

Favorites 0 | GitHub 156.1k

continuous-agent-loop

by affaan-m

continuous-agent-loop helps agents run repeatable autonomous loops with quality gates, evals, recovery steps, and clear stop rules for reliable task completion.

Agent Orchestration

Favorites 0 | GitHub 156.1k

self-eval

by alirezarezvani

self-eval is a prompt-only Claude Code skill for honest post-work review. It uses two-axis scoring, devil's advocate reasoning, score persistence, and anti-inflation checks to evaluate AI work quality after tasks, code reviews, or work sessions.

Model Evaluation

Favorites 0 | GitHub 22.2k

prompt-governance

by alirezarezvani

prompt-governance is a Claude skill for managing production prompts as versioned, reviewed, tested assets. Use it to plan prompt registries, regression tests, A/B experiments, eval pipelines, release approvals, and rollback workflows for AI features.

Prompt Governance

Favorites 0 | GitHub 22.2k

run

by alirezarezvani

run is an AgentHub orchestration skill for Claude that triggers /hub:run to initialize a task, spawn agents, evaluate results, and merge the winner. Use it for measurable code improvements or judged creative comparisons with clear task, agent, eval, metric, direction, and template parameters.

Agent Orchestration

Favorites 0 | GitHub 22.1k

eval

by alirezarezvani

eval ranks completed AgentHub agent results by configured metrics, LLM judge review, or a hybrid approach. Use it with /hub:eval to compare session branches, diffs, and result posts before choosing a winner.

Model Evaluation

Favorites 0 | GitHub 22.1k
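As a rough illustration of the metric-based ranking the eval entry describes, the sketch below orders agent results by a single score. The function name `rank_results`, the `direction` values, and the score dictionary are all hypothetical; this is not the skill's actual interface, and the LLM-judge and hybrid modes are not modeled.

```python
from typing import Dict, List


def rank_results(scores: Dict[str, float], direction: str = "higher") -> List[str]:
    """Return agent names ordered best-first by one metric (hypothetical helper).

    direction="higher" means larger scores win; "lower" means smaller scores win.
    """
    if direction not in ("higher", "lower"):
        raise ValueError("direction must be 'higher' or 'lower'")
    # sorted() is stable, so agents with equal scores keep their input order.
    return sorted(scores, key=lambda name: scores[name], reverse=(direction == "higher"))


if __name__ == "__main__":
    # Illustrative scores only, not real results.
    print(rank_results({"agent-a": 1.0, "agent-b": 3.0, "agent-c": 2.0}))
```

With `direction="lower"` the same call would rank a latency-style metric, where the smallest value is best.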

context-degradation

by muratcankoylan

context-degradation is a practical skill for diagnosing context failures in long workflows, including lost-in-the-middle, poisoning, distraction, confusion, and clash. Use it to identify where context breaks, decide what to change first, and apply a repeatable diagnostic process to skill authoring, prompt placement, and production agent debugging.

Skill Authoring

Favorites 0 | GitHub 15.6k

huggingface-community-evals

by huggingface

huggingface-community-evals helps you run Hugging Face Hub model evaluations locally with inspect-ai or lighteval. Use it for backend selection (vLLM, Transformers, or accelerate) and smoke tests. It is not intended for HF Jobs orchestration, model-card PRs, .eval_results publishing, or community-evals automation.

Model Evaluation

Favorites 0 | GitHub 10.4k

azure-ai-projects-py

by microsoft

azure-ai-projects-py is the Azure AI Projects Python SDK skill for Microsoft Foundry project clients. Use it for install, auth, client setup, versioned agents with PromptAgentDefinition, evaluations, connections, deployments, datasets, indexes, and OpenAI-compatible access. Best for backend development workflows in Python.

Backend Development

Favorites 0 | GitHub 2.2k

skill-optimizer

by mcollina

skill-optimizer helps authors improve AI skills for activation, clarity, and cross-model reliability. Use it when a skill is written but not reliably followed, when triggers are weak, when regressions appear, or when context cost needs trimming. It supports benchmark loops and release gates.

Skill Authoring

Favorites 0 | GitHub 1.8k

tree-of-thoughts

by NeoLabHQ

tree-of-thoughts is a reasoning workflow skill that helps agents explore multiple approaches, prune weak branches, and synthesize a better answer. It is useful for hard debugging, planning, architecture tradeoffs, and agent orchestration.

Agent Orchestration

Favorites 0 | GitHub 982

judge

by NeoLabHQ

judge is a two-phase evaluation skill that launches a meta-judge first, then a judge sub-agent that scores work with isolated context, evidence, and clear criteria. Use it for report-only reviews of code, writing, analysis, or skills when you need a defensible assessment instead of a casual opinion.

Skill Authoring

Favorites 0 | GitHub 982

judge-with-debate

by NeoLabHQ

judge-with-debate evaluates solutions through structured multi-agent debate, using a shared specification, evidence-based counterarguments, and up to 3 rounds to reach consensus. It is well suited for code review, rubric-based assessment, and multi-agent system workflows.

Multi-Agent Systems

Favorites 0 | GitHub 982

do-and-judge

by NeoLabHQ

The do-and-judge skill executes a single task with a sub-agent implementation step, an independent judge, and retry-based verification until the result passes or the maximum number of retries is reached. Use it for workflow automation when you need clear acceptance criteria, isolated execution, and less guesswork than a generic prompt.

Workflow Automation

Favorites 0 | GitHub 982
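The implement, judge, retry pattern described in the do-and-judge entry can be sketched as a small loop. This is only an illustration under assumptions: `do_and_judge`, `implement`, and `judge` are hypothetical names, the callables stand in for sub-agents, and `max_retries` is treated here as the total number of attempts. The skill's real mechanics may differ.

```python
from typing import Callable, Optional, Tuple


def do_and_judge(
    task: str,
    implement: Callable[[str, Optional[str]], str],
    judge: Callable[[str, str], Tuple[bool, str]],
    max_retries: int = 3,
) -> Tuple[bool, str, int]:
    """Run implement, then judge; retry with the judge's feedback.

    Returns (passed, last_result, attempts_used). Hypothetical sketch:
    max_retries is interpreted as the total number of attempts allowed.
    """
    feedback: Optional[str] = None
    result = ""
    for attempt in range(1, max_retries + 1):
        # The implementer sees the task plus any feedback from the last verdict.
        result = implement(task, feedback)
        # The judge is independent: it receives only the task and the result.
        passed, feedback = judge(task, result)
        if passed:
            return True, result, attempt
    return False, result, max_retries


if __name__ == "__main__":
    # Stub "agents" for demonstration only.
    def stub_implement(task: str, feedback: Optional[str]) -> str:
        return "fixed" if feedback else "draft"

    def stub_judge(task: str, result: str) -> Tuple[bool, str]:
        return result == "fixed", "needs a fix"

    print(do_and_judge("demo task", stub_implement, stub_judge))
```

Keeping the judge's inputs limited to the task and the result mirrors the isolation the entry emphasizes: the judge cannot be swayed by the implementer's intermediate reasoning.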

do-competitively

by NeoLabHQ

do-competitively helps you solve important tasks with parallel candidate generation, rubric-based judging, and evidence-based synthesis. It is best for workflow automation and other high-stakes requests where quality, robustness, and tradeoff handling matter more than speed.

Workflow Automation

Favorites 0 | GitHub 982

scholar-evaluation

by K-Dense-AI

scholar-evaluation helps evaluate scholarly and research work with structured scoring across problem formulation, methodology, analysis, writing, and publication readiness. Use it for academic review, revision planning, and consistent feedback on papers, proposals, literature reviews, and other scholarly drafts.

Academic Research

Favorites 0 | GitHub 0

evaluation

by muratcankoylan

The evaluation skill helps you design and run agent evaluations for non-deterministic systems. Use it for planning an evaluation setup, writing rubrics, running regression checks, defining quality gates, and testing skills. It fits LLM-as-judge workflows and multi-dimensional scoring when you need repeatable results.

Skill Testing

Favorites 0 | GitHub 0

critique

by NeoLabHQ

critique is a report-only review skill that uses multiple specialized judges, debate, and consensus to assess completed work. It helps with code review by checking correctness, quality, and missed issues before merging. Install critique in the NeoLabHQ context-engineering-kit and use it with file paths, commits, or context.

Code Review

Favorites 0 | GitHub 0