evaluation-methodology
by wshobson
The evaluation-methodology skill explains PluginEval scoring for Model Evaluation, including layers, rubrics, composite scoring, badge thresholds, and practical guidance for interpreting results and improving weak dimensions.
This skill scores 83/100, a solid result for users who need a detailed reference on how PluginEval scores skills and plugins. The repository evidence shows substantial, non-placeholder methodology content with explicit dimensions, formulas, thresholds, anti-patterns, and improvement guidance, so an agent can use it as a reliable interpretation and calibration aid. It is an operational reference rather than a hands-on executable workflow, so users should install it when they need evaluation logic explained consistently rather than step-by-step automation.
- Strong triggerability from a specific description covering scoring interpretation, threshold calibration, and improvement use cases
- High operational substance: SKILL.md is extensive and explicitly covers evaluation layers, dimensions, blend weights, formulas, badges, anti-pattern flags, and Elo ranking
- Trustworthy reference structure with an authoritative rubric file in references/rubrics.md for anchored scoring standards
- Mostly documentation-driven; there are no scripts or install commands to turn the methodology into a directly executable workflow
- Some referenced implementation details point to analyzer files like `layers/static.py`, but the evidence shown here is mainly conceptual methodology rather than runnable evaluation tooling
Overview of evaluation-methodology skill
What the evaluation-methodology skill does
The evaluation-methodology skill explains the scoring system behind PluginEval for Model Evaluation. It is not a generic “how to evaluate models” prompt. It is a specific methodology reference covering the three evaluation layers, the scoring dimensions, blend logic, composite scoring, badge thresholds, anti-pattern flags, and ranking concepts used to assess plugin or skill quality.
Who should install evaluation-methodology
This skill is best for people who need to interpret or improve an evaluation result, not just generate one score. Good fits include:
- skill or plugin authors diagnosing a weak score
- marketplace or platform operators calibrating quality gates
- reviewers who need consistent language for score disputes
- teams explaining badges or rankings to partners and stakeholders
If your real task is “why did this score happen, and what should change first?” this is a strong fit.
Real job-to-be-done
Users usually care about four things before adoption:
- which dimensions matter most
- how static checks differ from judge-based scoring
- how Monte Carlo or blended layers affect the final number
- what changes will raise the score fastest
The evaluation-methodology skill is valuable because it gives those answers in a structured way rather than leaving you to infer them from scattered rubric notes.
What makes this different from a normal evaluation prompt
A normal prompt can ask an LLM to “evaluate this skill,” but it usually lacks:
- explicit layer separation
- anchored rubric references
- dimension-specific weighting logic
- threshold and badge interpretation
- methodology language suitable for calibration or dispute resolution
This skill is better when you need consistent evaluation reasoning, especially around triggering accuracy, orchestration quality, and score interpretation.
What to read before deciding
Read SKILL.md first for the full methodology, then references/rubrics.md for the anchored standards used by the judge layer. Those two files are enough to decide whether the evaluation-methodology skill matches your Model Evaluation workflow.
How to Use evaluation-methodology skill
Install context for evaluation-methodology install
Install from the repo with:
npx skills add https://github.com/wshobson/agents --skill evaluation-methodology
Then invoke it from your AI coding environment the same way you would use any installed skill: by giving a task that clearly asks for PluginEval scoring interpretation, methodology explanation, calibration guidance, or score-improvement advice.
What input the skill needs
The evaluation-methodology skill works best when you provide concrete evaluation context, such as:
- the SKILL.md or plugin content being judged
- the dimension or score that looks suspicious
- whether you care about static analysis, LLM judge output, or full blended scoring
- your goal: explain, calibrate, improve, or defend a score
- any marketplace threshold, badge cutoff, or acceptance bar you use
Without that context, the output will stay high-level because the methodology itself is broad.
Turn a rough goal into a strong prompt
Weak prompt:
Explain this evaluation score.
Stronger prompt:
Use the evaluation-methodology skill to interpret this PluginEval result. Focus on Triggering Accuracy and Orchestration Fitness, explain how the three evaluation layers likely contributed, identify which issues are static-document problems versus judge-layer reasoning problems, and suggest the smallest changes that would most improve the composite score.
Why this works:
- names the methodology explicitly
- narrows the dimensions
- asks for layer-aware reasoning
- requests prioritized improvement advice, not a summary
Best prompt pattern for evaluation-methodology usage
A high-quality evaluation-methodology usage prompt usually includes:
- the artifact being evaluated
- the score or dimension in question
- the decision you need to make
- the desired output format
Example:
Apply the evaluation-methodology skill to this skill draft. Estimate which dimensions are most at risk, cite the likely rubric anchors behind that judgment, and recommend edits that improve triggering precision without making the description too narrow.
Practical workflow that reduces guesswork
Use this sequence:
- read SKILL.md for the overall scoring system
- open references/rubrics.md for anchor-level interpretation
- identify the dimension you actually need to act on
- ask for layer-specific diagnosis
- revise the skill or plugin
- re-check whether the change improved the right dimension rather than just making the document longer
This matters because many score problems are misdiagnosed. For example, a triggering issue often comes from vague frontmatter description language, while an orchestration issue may come from unclear input/output contracts.
Repository files to read first
For this evaluation-methodology guide, prioritize:
- plugins/plugin-eval/skills/evaluation-methodology/SKILL.md
- plugins/plugin-eval/skills/evaluation-methodology/references/rubrics.md
Read SKILL.md to understand the framework, then use references/rubrics.md when you need grounded score interpretation or want to compare a draft against anchor points.
What the three layers mean in practice
The methodology stacks three layers:
- static analysis for deterministic document checks
- LLM judge scoring for rubric-based qualitative assessment
- Monte Carlo simulation for prompt-distribution behavior, especially around triggering
That separation is useful operationally. If you want a fast preflight check before publishing, static analysis is the first stop. If you need a defensible explanation of a low score, the judge rubrics matter more. If you care whether a skill fires on the right prompts across realistic variation, the Monte Carlo framing is the most decision-relevant.
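To make the blending concrete, here is a minimal sketch of how three layer scores could be combined into one composite number. PluginEval's actual blend weights and function names are not shown in this listing, so the weights and the `blend_composite` helper below are assumptions for illustration only.

```python
# Illustrative sketch only: PluginEval's real blend weights are not published
# in this listing, so the (0.2, 0.5, 0.3) split below is an assumption.

def blend_composite(static_score: float,
                    judge_score: float,
                    monte_carlo_score: float,
                    weights: tuple[float, float, float] = (0.2, 0.5, 0.3)) -> float:
    """Combine the three layer scores (each on a 0-100 scale) into one
    composite number via a weighted average."""
    w_static, w_judge, w_mc = weights
    assert abs(w_static + w_judge + w_mc - 1.0) < 1e-9, "weights must sum to 1"
    return (w_static * static_score
            + w_judge * judge_score
            + w_mc * monte_carlo_score)

# A skill strong on document hygiene but weak on triggering:
composite = blend_composite(static_score=92, judge_score=80, monte_carlo_score=61)
print(round(composite, 1))  # 76.7
```

The point of the sketch is the diagnosis it enables: if the judge layer carries the most weight, a clean static pass cannot rescue a weak rubric score, which is why layer-specific fixes matter.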
When to use evaluation-methodology for Model Evaluation
Use evaluation-methodology for Model Evaluation when your subject is not just model output quality but the quality of the skill or plugin wrapper around model behavior. This methodology is especially relevant when the key question is whether a skill is discoverable, appropriately triggered, well-scaffolded, and operationally reliable in an agent ecosystem.
It is less suitable if you only need benchmark design for raw model performance on tasks unrelated to plugin or skill orchestration.
Common adoption blockers
Users often hesitate because they are unsure whether this skill is actionable or just descriptive. In practice, it is actionable if you need to:
- trace a score back to a dimension
- understand what each dimension rewards
- choose edits that affect the composite score
- calibrate thresholds for publishing or badging
It is less actionable if you expect a turnkey evaluator script. The repository evidence here is methodology-first, with the strongest support in the written framework and rubrics.
evaluation-methodology skill FAQ
Is evaluation-methodology a scorer or a methodology reference?
Primarily a methodology reference. It tells you how PluginEval measures quality and how to interpret results. That makes it especially useful for audits, calibration, and improvement planning.
Is the evaluation-methodology skill beginner-friendly?
Yes, if the beginner already understands what a skill or plugin is. The writing is structured, but the concepts become much clearer when you bring a real example and ask about one dimension at a time instead of the full framework all at once.
How is this different from asking an LLM to review my skill?
A plain review prompt may produce decent advice, but it usually will not align with PluginEval’s layered scoring model or rubric anchors. The evaluation-methodology skill gives you a shared scoring language, which is more useful when multiple reviewers need consistency.
When should I not use evaluation-methodology?
Skip it when:
- you only need a generic writing critique
- you are evaluating raw model task accuracy rather than skill/plugin quality
- you want executable automation more than methodology guidance
- your ecosystem does not use PluginEval-like dimensions or badge logic
Does this help with low Triggering Accuracy scores?
Yes. The rubric reference explicitly treats triggering as precision-plus-recall behavior across representative prompts. That makes the skill especially useful when a description is either too vague to trigger reliably or too broad and fires on irrelevant prompts.
Can I use this outside PluginEval?
Yes, but mostly as a structured reference model. The dimensions, layer separation, and rubric thinking transfer well. The exact weights, thresholds, and badges are most useful when your process is close to PluginEval.
How to Improve evaluation-methodology skill
Start with the dimension that affects decisions
When using the evaluation-methodology skill, do not ask for “overall quality” first. Ask which single dimension is most likely blocking your decision. In practice, that often surfaces the most leverage quickly, especially for Triggering Accuracy or Orchestration Fitness.
Provide stronger inputs for better analysis
Better input:
- current score or suspected weak dimension
- the exact description frontmatter
- the relevant section of SKILL.md
- examples of prompts that should and should not trigger the skill
- your acceptance threshold
This lets the skill reason more like the methodology intends, especially for dimension-specific diagnosis.
Use positive and negative trigger examples
One of the highest-value upgrades is to provide both:
- prompts where the skill should activate
- prompts where it should stay silent
That directly improves analysis of routing quality. It mirrors the methodology’s concern with both precision and recall instead of only asking, “does this sound relevant?”
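The precision-plus-recall framing above can be sketched in a few lines. The labeled prompt set and the helper name below are hypothetical; this only illustrates how positive and negative trigger examples feed both metrics.

```python
# Hypothetical sketch: given prompts labeled with whether the skill *should*
# trigger and whether it *did*, compute triggering precision and recall.
# The result data is invented for illustration.

def trigger_precision_recall(results):
    """results: list of (should_trigger, did_trigger) boolean pairs."""
    tp = sum(1 for want, got in results if want and got)
    fp = sum(1 for want, got in results if not want and got)
    fn = sum(1 for want, got in results if want and not got)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

results = [
    (True, True),    # relevant prompt, skill fired: good
    (True, False),   # relevant prompt, skill stayed silent: recall miss
    (False, True),   # irrelevant prompt, skill fired: precision miss
    (False, False),  # irrelevant prompt, skill stayed silent: good
    (True, True),
]
precision, recall = trigger_precision_recall(results)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

An overly broad description tends to tank precision (firing on the negative prompts), while an overly vague one tanks recall, which is why supplying both kinds of examples matters.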
Separate static fixes from judge-layer fixes
Not all improvements are equal. Use the skill to classify issues into:
- structural fixes: frontmatter, missing contracts, poor progressive disclosure
- rubric fixes: weak explanations, vague guidance, poor actionability
- behavior-fit fixes: likely triggering mismatch under realistic prompt variation
This prevents over-editing the wrong part of the skill.
Avoid the most common failure mode
The most common mistake is making the skill broader in an attempt to improve discoverability. That can raise apparent coverage while hurting triggering precision. Ask the evaluation-methodology skill to check whether a revised description became too generic.
Iterate with rubric anchors, not intuition alone
After the first output, ask:
Which anchor in references/rubrics.md best matches this draft now, and what exact evidence keeps it from the next anchor?
That question produces more useful revision guidance than “how can I improve it?” because it ties changes to specific scoring movement.
Ask for smallest-change recommendations
For faster iteration, prompt for minimal edits:
Using the evaluation-methodology skill, recommend the three smallest wording or structure changes most likely to improve the composite score without changing scope.
This is usually better than a full rewrite because it preserves intent while targeting the evaluated dimensions.
Re-check whether improvements changed the right metric
A cleaner document can still fail the methodology. After revising, ask the skill to compare:
- expected effect on Triggering Accuracy
- expected effect on Orchestration Fitness
- likely effect on composite score
- possible new tradeoffs introduced by the edits
That final check is where the evaluation-methodology guide becomes most useful: not just explaining the framework, but helping you improve within it.
