evaluation-methodology
by wshobson
The evaluation-methodology skill explains PluginEval scoring for Model Evaluation, including layers, rubrics, composite scoring, badge thresholds, and practical guidance for interpreting results and improving weak dimensions.
This skill scores 83/100, a solid result for users who need a detailed reference on how PluginEval scores skills and plugins. The repository evidence shows substantial, non-placeholder methodology content with explicit dimensions, formulas, thresholds, anti-patterns, and improvement guidance, so an agent can use it as a reliable interpretation and calibration aid. It is an operational reference rather than a hands-on executable workflow, so users should install it when they need evaluation logic explained consistently rather than step-by-step automation.
- Strong triggerability from a specific description covering scoring interpretation, threshold calibration, and improvement use cases
- High operational substance: SKILL.md is extensive and explicitly covers evaluation layers, dimensions, blend weights, formulas, badges, anti-pattern flags, and Elo ranking
- Trustworthy reference structure with an authoritative rubric file in references/rubrics.md for anchored scoring standards
- Mostly documentation-driven; there are no scripts or install commands to turn the methodology into a directly executable workflow
- Some referenced implementation details point to analyzer files like `layers/static.py`, but the evidence shown here is mainly conceptual methodology rather than runnable evaluation tooling
Overview of evaluation-methodology skill
What the evaluation-methodology skill does
The evaluation-methodology skill explains the scoring system behind PluginEval for Model Evaluation. It is not a generic “how to evaluate models” prompt. It is a specific methodology reference covering the three evaluation layers, the scoring dimensions, blend logic, composite scoring, badge thresholds, anti-pattern flags, and ranking concepts used to assess plugin or skill quality.
Who should install evaluation-methodology
This skill is best for people who need to interpret or improve an evaluation result, not just generate one score. Good fits include:
- skill or plugin authors diagnosing a weak score
- marketplace or platform operators calibrating quality gates
- reviewers who need consistent language for score disputes
- teams explaining badges or rankings to partners and stakeholders
If your real task is “why did this score happen, and what should change first?” this is a strong fit.
Real job-to-be-done
Users usually care about four things before adoption:
- which dimensions matter most
- how static checks differ from judge-based scoring
- how Monte Carlo or blended layers affect the final number
- what changes will raise the score fastest
The evaluation-methodology skill is valuable because it gives those answers in a structured way rather than leaving you to infer them from scattered rubric notes.
What makes this different from a normal evaluation prompt
A normal prompt can ask an LLM to “evaluate this skill,” but it usually lacks:
- explicit layer separation
- anchored rubric references
- dimension-specific weighting logic
- threshold and badge interpretation
- methodology language suitable for calibration or dispute resolution
This skill is better when you need consistent evaluation reasoning, especially around triggering accuracy, orchestration quality, and score interpretation.
What to read before deciding
Read SKILL.md first for the full methodology, then references/rubrics.md for the anchored standards used by the judge layer. Those two files are enough to decide whether the evaluation-methodology skill matches your Model Evaluation workflow.
How to Use evaluation-methodology skill
Install context for evaluation-methodology install
Install from the repo with:
npx skills add https://github.com/wshobson/agents --skill evaluation-methodology
Then invoke it from your AI coding environment the same way you would use any installed skill: by giving a task that clearly asks for PluginEval scoring interpretation, methodology explanation, calibration guidance, or score-improvement advice.
What input the skill needs
The evaluation-methodology skill works best when you provide concrete evaluation context, such as:
- the SKILL.md or plugin content being judged
- the dimension or score that looks suspicious
- whether you care about static analysis, LLM judge output, or full blended scoring
- your goal: explain, calibrate, improve, or defend a score
- any marketplace threshold, badge cutoff, or acceptance bar you use
Without that context, the output will stay high-level because the methodology itself is broad.
Turn a rough goal into a strong prompt
Weak prompt:
Explain this evaluation score.
Stronger prompt:
Use the evaluation-methodology skill to interpret this PluginEval result. Focus on Triggering Accuracy and Orchestration Fitness, explain how the three evaluation layers likely contributed, identify which issues are static-document problems versus judge-layer reasoning problems, and suggest the smallest changes that would most improve the composite score.
Why this works:
- names the methodology explicitly
- narrows the dimensions
- asks for layer-aware reasoning
- requests prioritized improvement advice, not a summary
Best prompt pattern for evaluation-methodology usage
A high-quality evaluation-methodology usage prompt usually includes:
- the artifact being evaluated
- the score or dimension in question
- the decision you need to make
- the desired output format
Example:
Apply the evaluation-methodology skill to this skill draft. Estimate which dimensions are most at risk, cite the likely rubric anchors behind that judgment, and recommend edits that improve triggering precision without making the description too narrow.
Practical workflow that reduces guesswork
Use this sequence:
- read SKILL.md for the overall scoring system
- open references/rubrics.md for anchor-level interpretation
- identify the dimension you actually need to act on
- ask for layer-specific diagnosis
- revise the skill or plugin
- re-check whether the change improved the right dimension rather than just making the document longer
This matters because many score problems are misdiagnosed. For example, a triggering issue often comes from vague frontmatter description language, while an orchestration issue may come from unclear input/output contracts.
Repository files to read first
For this evaluation-methodology guide, prioritize:
- plugins/plugin-eval/skills/evaluation-methodology/SKILL.md
- plugins/plugin-eval/skills/evaluation-methodology/references/rubrics.md
Read SKILL.md to understand the framework, then use references/rubrics.md when you need grounded score interpretation or want to compare a draft against anchor points.
What the three layers mean in practice
The methodology stacks three layers:
- static analysis for deterministic document checks
- LLM judge scoring for rubric-based qualitative assessment
- Monte Carlo simulation for prompt-distribution behavior, especially around triggering
That separation is useful operationally. If you want a fast preflight check before publishing, static analysis is the first stop. If you need a defensible explanation of a low score, the judge rubrics matter more. If you care whether a skill fires on the right prompts across realistic variation, the Monte Carlo framing is the most decision-relevant.
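To make the blending concrete, here is a minimal sketch of how three layer scores could be combined into one composite number. PluginEval's actual blend weights and function names are not shown in this listing, so the weights and the `blend_composite` helper below are assumptions for illustration only.

```python
# Illustrative sketch only: PluginEval's real blend weights are not published
# in this listing, so the (0.2, 0.5, 0.3) split below is an assumption.

def blend_composite(static_score: float,
                    judge_score: float,
                    monte_carlo_score: float,
                    weights: tuple[float, float, float] = (0.2, 0.5, 0.3)) -> float:
    """Combine the three layer scores (each on a 0-100 scale) into one
    composite number via a weighted average."""
    w_static, w_judge, w_mc = weights
    assert abs(w_static + w_judge + w_mc - 1.0) < 1e-9, "weights must sum to 1"
    return (w_static * static_score
            + w_judge * judge_score
            + w_mc * monte_carlo_score)

# A skill strong on document hygiene but weak on triggering:
composite = blend_composite(static_score=92, judge_score=80, monte_carlo_score=61)
print(round(composite, 1))  # 76.7
```

The point of the sketch is the diagnosis it enables: if the judge layer carries the most weight, a clean static pass cannot rescue a weak rubric score, which is why layer-specific fixes matter.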
When to use evaluation-methodology for Model Evaluation
Use evaluation-methodology for Model Evaluation when your subject is not just model output quality but the quality of the skill or plugin wrapper around model behavior. This methodology is especially relevant when the key question is whether a skill is discoverable, appropriately triggered, well-scaffolded, and operationally reliable in an agent ecosystem.
It is less suitable if you only need benchmark design for raw model performance on tasks unrelated to plugin or skill orchestration.
Common adoption blockers
Users often hesitate because they are unsure whether this skill is actionable or just descriptive. In practice, it is actionable if you need to:
- trace a score back to a dimension
- understand what each dimension rewards
- choose edits that affect the composite score
- calibrate thresholds for publishing or badging
It is less actionable if you expect a turnkey evaluator script. The repository evidence here is methodology-first, with the strongest support in the written framework and rubrics.
evaluation-methodology skill FAQ
Is evaluation-methodology a scorer or a methodology reference?
Primarily a methodology reference. It tells you how PluginEval measures quality and how to interpret results. That makes it especially useful for audits, calibration, and improvement planning.
Is the evaluation-methodology skill beginner-friendly?
Yes, if the beginner already understands what a skill or plugin is. The writing is structured, but the concepts become much clearer when you bring a real example and ask about one dimension at a time instead of the full framework all at once.
How is this different from asking an LLM to review my skill?
A plain review prompt may produce decent advice, but it usually will not align with PluginEval’s layered scoring model or rubric anchors. The evaluation-methodology skill gives you a shared scoring language, which is more useful when multiple reviewers need consistency.
When should I not use evaluation-methodology?
Skip it when:
- you only need a generic writing critique
- you are evaluating raw model task accuracy rather than skill/plugin quality
- you want executable automation more than methodology guidance
- your ecosystem does not use PluginEval-like dimensions or badge logic
Does this help with low Triggering Accuracy scores?
Yes. The rubric reference explicitly treats triggering as precision-plus-recall behavior across representative prompts. That makes the skill especially useful when a description is either too vague to trigger reliably or too broad and fires on irrelevant prompts.
Can I use this outside PluginEval?
Yes, but mostly as a structured reference model. The dimensions, layer separation, and rubric thinking transfer well. The exact weights, thresholds, and badges are most useful when your process is close to PluginEval.
How to Improve evaluation-methodology skill
Start with the dimension that affects decisions
When using the evaluation-methodology skill, do not ask for “overall quality” first. Ask which single dimension is most likely blocking your decision. In practice, that often surfaces the most leverage quickly, especially for Triggering Accuracy or Orchestration Fitness.
Provide stronger inputs for better analysis
Better input:
- current score or suspected weak dimension
- the exact description frontmatter
- the relevant section of SKILL.md
- examples of prompts that should and should not trigger the skill
- your acceptance threshold
This lets the skill reason more like the methodology intends, especially for dimension-specific diagnosis.
Use positive and negative trigger examples
One of the highest-value upgrades is to provide both:
- prompts where the skill should activate
- prompts where it should stay silent
That directly improves analysis of routing quality. It mirrors the methodology’s concern with both precision and recall instead of only asking, “does this sound relevant?”
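The precision-plus-recall framing above can be sketched in a few lines. The labeled prompt set and the helper name below are hypothetical; this only illustrates how positive and negative trigger examples feed both metrics.

```python
# Hypothetical sketch: given prompts labeled with whether the skill *should*
# trigger and whether it *did*, compute triggering precision and recall.
# The result data is invented for illustration.

def trigger_precision_recall(results):
    """results: list of (should_trigger, did_trigger) boolean pairs."""
    tp = sum(1 for want, got in results if want and got)
    fp = sum(1 for want, got in results if not want and got)
    fn = sum(1 for want, got in results if want and not got)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

results = [
    (True, True),    # relevant prompt, skill fired: good
    (True, False),   # relevant prompt, skill stayed silent: recall miss
    (False, True),   # irrelevant prompt, skill fired: precision miss
    (False, False),  # irrelevant prompt, skill stayed silent: good
    (True, True),
]
precision, recall = trigger_precision_recall(results)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

An overly broad description tends to tank precision (firing on the negative prompts), while an overly vague one tanks recall, which is why supplying both kinds of examples matters.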
Separate static fixes from judge-layer fixes
Not all improvements are equal. Use the skill to classify issues into:
- structural fixes: frontmatter, missing contracts, poor progressive disclosure
- rubric fixes: weak explanations, vague guidance, poor actionability
- behavior-fit fixes: likely triggering mismatch under realistic prompt variation
This prevents over-editing the wrong part of the skill.
Avoid the most common failure mode
The most common mistake is making the skill broader in an attempt to improve discoverability. That can raise apparent coverage while hurting triggering precision. Ask the evaluation-methodology skill to check whether a revised description became too generic.
Iterate with rubric anchors, not intuition alone
After the first output, ask:
Which anchor in references/rubrics.md best matches this draft now, and what exact evidence keeps it from the next anchor?
That question produces more useful revision guidance than “how can I improve it?” because it ties changes to specific scoring movement.
Ask for smallest-change recommendations
For faster iteration, prompt for minimal edits:
Using the evaluation-methodology skill, recommend the three smallest wording or structure changes most likely to improve the composite score without changing scope.
This is usually better than a full rewrite because it preserves intent while targeting the evaluated dimensions.
Re-check whether improvements changed the right metric
A cleaner document can still fail the methodology. After revising, ask the skill to compare:
- expected effect on Triggering Accuracy
- expected effect on Orchestration Fitness
- likely effect on composite score
- possible new tradeoffs introduced by the edits
That final check is where the evaluation-methodology guide becomes most useful: not just explaining the framework, but helping you improve within it.
