Evaluation

Evaluation taxonomy generated by the site skill importer.

15 skills
healthcare-eval-harness

by affaan-m

healthcare-eval-harness is a patient safety evaluation harness for healthcare app deployments. It helps teams check clinical decision support system (CDSS) accuracy, protected health information (PHI) exposure, data integrity, clinical workflow behavior, and integration compliance before release. Critical failures block deployment, which makes it useful for model evaluation and as a CI safety gate.

Model Evaluation
Favorites 0 · GitHub 156.2k
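
The "critical failures block deployment" rule can be illustrated with a minimal sketch. It is not the skill's actual implementation: the `CheckResult` structure, the `deployment_gate` function, and the check names are all hypothetical, chosen only to show how a gate might separate blocking failures from non-blocking ones.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CheckResult:
    """Outcome of one evaluation check (hypothetical structure)."""
    name: str
    passed: bool
    critical: bool  # True if a failure must block the release


def deployment_gate(results: List[CheckResult]) -> Tuple[bool, List[str]]:
    """Return (allowed, blocking_checks); any failed critical check blocks."""
    blocking = [r.name for r in results if r.critical and not r.passed]
    return (not blocking, blocking)


# Hypothetical check names, for illustration only.
results = [
    CheckResult("cdss_accuracy", passed=True, critical=True),
    CheckResult("phi_exposure", passed=False, critical=True),
    CheckResult("clinical_workflow", passed=False, critical=False),
]
allowed, blocking = deployment_gate(results)
print(allowed, blocking)  # False ['phi_exposure']
```

In a CI job, a script like this would typically exit with a non-zero status when `allowed` is false, so that the pipeline stops before release.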
eval-harness

by affaan-m

The eval-harness skill is a formal evaluation framework for Claude Code sessions and eval-driven development. It helps you define pass/fail criteria, build capability and regression evals, and measure agent reliability before shipping prompt or workflow changes.

Model Evaluation
Favorites 0 · GitHub 156.1k
continuous-agent-loop

by affaan-m

continuous-agent-loop helps agents run repeatable autonomous loops with quality gates, evals, recovery steps, and clear stop rules for reliable task completion.

Agent Orchestration
Favorites 0 · GitHub 156.1k
context-degradation

by muratcankoylan

context-degradation is a practical skill for diagnosing context failures in long workflows, including lost-in-the-middle, poisoning, distraction, confusion, and clash. Use it to identify where context breaks, decide what to change first, and apply a repeatable diagnostic process to skill authoring, prompt placement, and production agent debugging.

Skill Authoring
Favorites 0 · GitHub 15.6k
huggingface-community-evals

by huggingface

huggingface-community-evals helps you run Hugging Face Hub model evaluations locally with inspect-ai or lighteval. Use it for backend selection, smoke tests, and a practical guide to vLLM, Transformers, or accelerate. Not for HF Jobs orchestration, model-card PRs, .eval_results publishing, or community-evals automation.

Model Evaluation
Favorites 0 · GitHub 10.4k
azure-ai-projects-py

by microsoft

azure-ai-projects-py is the Azure AI Projects Python SDK skill for Microsoft Foundry project clients. Use it for install, auth, client setup, versioned agents with PromptAgentDefinition, evaluations, connections, deployments, datasets, indexes, and OpenAI-compatible access. Best for backend development workflows in Python.

Backend Development
Favorites 0 · GitHub 2.2k
skill-optimizer

by mcollina

skill-optimizer helps authors improve AI skills for activation, clarity, and cross-model reliability. Use it for skill authoring when a skill is written but not reliably followed, when triggers are weak, when regressions appear, or when context cost needs trimming. It supports benchmark loops, release gates, and tighter usage fidelity.

Skill Authoring
Favorites 0 · GitHub 1.8k
tree-of-thoughts

by NeoLabHQ

tree-of-thoughts is a reasoning workflow skill that helps agents explore multiple approaches, prune weak branches, and synthesize a better answer. It is useful for hard debugging, planning, architecture tradeoffs, and agent orchestration.

Agent Orchestration
Favorites 0 · GitHub 982
judge

by NeoLabHQ

Judge is a two-phase evaluation skill that launches a meta-judge first, then a judge sub-agent that scores work with isolated context, evidence, and clear criteria. Use it for report-only reviews of code, writing, analysis, or skill authoring when you need a defensible judgment instead of a casual opinion.

Skill Authoring
Favorites 0 · GitHub 982
judge-with-debate

by NeoLabHQ

judge-with-debate evaluates solutions through structured multi-agent debate, using a shared specification, evidence-based counterarguments, and up to 3 rounds to reach consensus. It is well suited for code review, rubric-based assessment, and multi-agent systems workflows.

Multi-Agent Systems
Favorites 0 · GitHub 982
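
A bounded debate of this kind can be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's own code: the `Judge` signature, the `tolerance` threshold used to define consensus, and the toy judges that drift toward the peer average are all hypothetical. Only the cap of 3 rounds comes from the description above.

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple

# Hypothetical judge signature: (solution, peer scores from the previous round) -> score
Judge = Callable[[str, Optional[List[float]]], float]


def debate(
    solution: str,
    judges: List[Judge],
    max_rounds: int = 3,     # cap taken from the description
    tolerance: float = 0.5,  # assumed definition of "consensus"
) -> Tuple[Optional[float], int, List[float]]:
    """Re-score for up to max_rounds; stop once the score spread is within tolerance."""
    scores: Optional[List[float]] = None
    for round_no in range(1, max_rounds + 1):
        # Each judge sees the previous round's scores before re-scoring.
        scores = [judge(solution, scores) for judge in judges]
        if max(scores) - min(scores) <= tolerance:
            return mean(scores), round_no, scores
    return None, max_rounds, scores or []  # no consensus reached


def make_judge(initial: float) -> Judge:
    """Toy judge that moves halfway toward the peer average each round."""
    state = {"score": initial}

    def judge(solution: str, peer_scores: Optional[List[float]]) -> float:
        if peer_scores is not None:
            state["score"] = (state["score"] + mean(peer_scores)) / 2
        return state["score"]

    return judge


consensus, rounds, final_scores = debate("candidate patch", [make_judge(2.0), make_judge(4.0)])
print(consensus, rounds, final_scores)  # 3.0 3 [2.75, 3.25]
```

Returning `None` when the rounds run out keeps "no consensus" distinct from a low score, which a caller may want to escalate rather than average away.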
do-and-judge

by NeoLabHQ

The do-and-judge skill executes a single task with a sub-agent implementation step, an independent judge, and retry-based verification until the work passes or the maximum number of retries is reached. Use do-and-judge for workflow automation when you need clear acceptance criteria, isolated execution, and less guesswork than a generic prompt.

Workflow Automation
Favorites 0 · GitHub 982
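
The implement, judge, retry cycle described above might look like the sketch below. It is an assumption-laden illustration rather than the skill's implementation: `do_and_judge`, its callable signatures, the `max_attempts` parameter, and the toy implementer and judge are hypothetical names introduced here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures:
#   implement(task, feedback_from_last_attempt) -> result
#   judge(task, result) -> (passed, feedback)
Implement = Callable[[str, Optional[str]], str]
Judge = Callable[[str, str], Tuple[bool, str]]


def do_and_judge(
    implement: Implement, judge: Judge, task: str, max_attempts: int = 3
) -> Tuple[bool, str, int]:
    """Run implement -> judge until the judge passes or attempts run out."""
    feedback: Optional[str] = None
    result = ""
    for attempt in range(1, max_attempts + 1):
        result = implement(task, feedback)      # implementation step
        passed, feedback = judge(task, result)  # independent verdict
        if passed:
            return True, result, attempt
    return False, result, max_attempts


def toy_implement(task: str, feedback: Optional[str]) -> str:
    # First try is a draft; once feedback arrives, produce the final version.
    return "final" if feedback else "draft"


def toy_judge(task: str, result: str) -> Tuple[bool, str]:
    return (result == "final", "needs another pass")


print(do_and_judge(toy_implement, toy_judge, "write the summary"))
# (True, 'final', 2)
```

Keeping the judge as a separate callable mirrors the "independent judge" idea: the implementer only ever sees the judge's feedback, never its internal criteria.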
do-competitively

by NeoLabHQ

do-competitively helps you solve important tasks with parallel candidate generation, rubric-based judging, and evidence-based synthesis. It is best for Workflow Automation and other high-stakes requests where quality, robustness, and tradeoff handling matter more than speed.

Workflow Automation
Favorites 0 · GitHub 982
scholar-evaluation

by K-Dense-AI

scholar-evaluation helps evaluate scholarly and research work with structured scoring across problem formulation, methodology, analysis, writing, and publication readiness. Use it for academic review, revision planning, and consistent feedback on papers, proposals, literature reviews, and other scholarly drafts.

Academic Research
Favorites 0 · GitHub 0
evaluation

by muratcankoylan

The evaluation skill helps you design and run agent evaluations for non-deterministic systems. Use it for install planning, rubrics, regression checks, quality gates, and skill testing. It fits LLM-as-judge workflows and multi-dimensional scoring when you need repeatable results.

Skill Testing
Favorites 0 · GitHub 0
critique

by NeoLabHQ

critique is a report-only review skill that uses multiple specialized judges, debate, and consensus to assess completed work. It supports code review by checking correctness, quality, and missed issues before merging. Install critique as part of the NeoLabHQ context-engineering-kit and use it with file paths, commits, or context.

Code Review
Favorites 0 · GitHub 0