
llm-evaluation

by wshobson

Use the llm-evaluation skill to design repeatable evaluation plans for LLM apps, prompts, RAG systems, and model changes with metrics, human review, benchmarking, and regression checks.

Stars: 32.6k
Favorites: 0
Comments: 0
Added: Mar 30, 2026
Category: Model Evaluation
Install Command
npx skills add wshobson/agents --skill llm-evaluation
Curation Score

This skill scores 68/100: it is worth listing for directory users who want structured guidance on evaluating LLM apps, but they should expect a documentation-heavy framework rather than a tightly operational skill with runnable assets or explicit execution steps.

Strengths
  • Strong triggerability: the skill clearly states when to use it, including regression testing, model/prompt comparison, and production validation.
  • Substantive workflow content: the document covers multiple evaluation modes such as automated metrics, human evaluation, benchmarking, and A/B testing rather than staying at a placeholder level.
  • Useful conceptual leverage: it gives agents a reusable evaluation taxonomy for text generation, classification, and RAG tasks that is more structured than a generic prompt.
Cautions
  • Operational clarity is limited by missing install/run guidance, scripts, and referenced support files, so agents still need to infer implementation details.
  • The evidence shows few explicit constraints or decision rules, which may make metric selection and execution inconsistent across real projects.
Overview

Overview of llm-evaluation skill

The llm-evaluation skill is a practical framework for designing evaluations for LLM apps, prompts, and model changes. It is best for builders who need more than “this feels better” and want a repeatable way to measure quality, compare variants, and catch regressions before shipping.

Who this llm-evaluation skill is for

This llm-evaluation skill fits teams and solo builders working on:

  • prompt iteration
  • model comparison
  • RAG quality checks
  • classification or extraction tasks
  • production QA for LLM features
  • benchmark creation for ongoing releases

If you are trying to answer “Did this change actually improve the system?”, this skill is a strong fit.

What job the skill helps you get done

The real job-to-be-done is to turn vague quality concerns into a usable evaluation plan. Instead of asking for generic testing advice, you use llm-evaluation to choose the right evaluation type, define metrics, add human review where automation is weak, and structure comparisons over time.

What makes llm-evaluation different from a generic prompt

A generic prompt might suggest “use BLEU, F1, and human review.” This llm-evaluation skill is more useful when you need to map evaluation methods to the actual shape of your application:

  • text generation tasks need different metrics than classification
  • RAG systems need retrieval metrics, not just output judgments
  • some qualities like helpfulness or tone need human evaluation
  • A/B tests and regression checks need baselines, not one-off scores

That makes it more decision-oriented than a casual “how do I evaluate my LLM?” request.

What matters most before you install

Before using llm-evaluation, be clear on three things:

  1. what task you are evaluating
  2. what “good” means for that task
  3. whether you need automated metrics, human review, or both

If those are still fuzzy, the skill can still help, but your outputs will stay high-level.

Main tradeoffs and limits

This skill gives evaluation strategy, not a packaged evaluation runner. It helps you design the framework and select methods, but you still need your own dataset, tooling, and execution setup. If you want a fully automated framework with built-in pipelines, treat this as planning guidance rather than drop-in infrastructure.

How to Use llm-evaluation skill

How to install llm-evaluation skill

Use the standard skill install flow:

npx skills add https://github.com/wshobson/agents --skill llm-evaluation

After install, invoke it when you want help designing or improving an evaluation plan for an LLM application.

What to read first in the repository

This skill is unusually self-contained. Start with:

  • plugins/llm-application-dev/skills/llm-evaluation/SKILL.md

Because there are no obvious helper scripts or resource files, most of the value is in the written framework itself. Read the “When to Use This Skill” and “Core Evaluation Types” sections first.

What inputs the skill needs to be useful

The quality of llm-evaluation’s output depends heavily on the inputs you provide. Give it:

  • your application type: summarization, chatbot, RAG, extraction, classification, etc.
  • the change being evaluated: new prompt, model swap, retrieval update, policy change
  • sample inputs and expected outputs
  • current failure modes
  • deployment constraints: speed, cost, safety, review bandwidth
  • whether you need offline benchmarking, human review, or online testing

Without this context, the skill will correctly stay generic.
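
If it helps to gather this context before prompting, one option is to jot the brief down as a small structure you can paste into your request. Every field name and value below is illustrative, not a format the skill requires.

# Illustrative evaluation brief; adapt the fields to your own system.
EVAL_BRIEF = {
    "application_type": "customer-support RAG assistant",
    "change_under_test": "new system prompt + swapped retriever",
    "sample_cases": "20 real billing questions with expected answers",
    "known_failure_modes": ["hallucinated refund policy", "missed relevant doc"],
    "constraints": {"latency_ms": 2000, "review_hours_per_week": 4},
    "evaluation_needs": ["offline retrieval metrics", "human answer review",
                         "pre-deploy regression checklist"],
}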

How to turn a rough goal into a strong prompt

Weak goal:

  • “Help me evaluate my LLM app.”

Stronger goal:

  • “Use the llm-evaluation skill to design an evaluation plan for a customer-support RAG assistant. We are comparing two prompts and one retriever change. We need offline metrics for retrieval quality, human review dimensions for answer quality, and a regression checklist we can run before deployment.”

That stronger version tells the skill what system is changing, what kind of evaluation is needed, and what decision the evaluation must support.

Prompt template for llm-evaluation usage

Include these details in your request:

  • task type
  • system architecture
  • variants being compared
  • evaluation dataset size and source
  • key risks
  • preferred metrics
  • acceptable tradeoffs

Example structure:

“Use llm-evaluation for Model Evaluation of a RAG assistant. Recommend automated metrics, human evaluation criteria, and an A/B testing approach. We care most about factual accuracy, citation usefulness, and regression detection. Suggest a minimal first version and an expanded version.”

Choosing the right evaluation type

The skill covers multiple evaluation modes. In practice:

  • use automated metrics for repeatability and scale
  • use human evaluation for qualities that are subjective or nuanced
  • use benchmarking to compare versions over time
  • use A/B testing when real-user behavior matters

A common mistake is overusing one method, for example relying only on BLEU for generative tasks or only on human review for large regression checks.

Metric selection by task

Use the task to drive metric choice:

  • text generation: BLEU, ROUGE, METEOR, BERTScore, perplexity
  • classification: accuracy, precision, recall, F1, confusion matrix, AUC-ROC
  • retrieval / RAG: MRR, NDCG, Precision@K, Recall@K

The important practical point: do not force text-generation metrics onto retrieval problems or vice versa. The llm-evaluation guide is most useful when you match metrics to the actual system layer being tested.
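
If it helps to see how lightweight the retrieval metrics can be, here is a minimal Python sketch of Precision@K, Recall@K, and MRR. The function names, input shapes, and example document ids are illustrative only; in practice you would likely reach for an existing evaluation library rather than hand-rolling these.

# Minimal retrieval-metric sketches (illustrative, not a library API).
# retrieved: ranked list of document ids; relevant: set of ids judged relevant.

def precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    # runs: list of (retrieved, relevant) pairs across the whole test set
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

# Example usage with made-up document ids:
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))  # 0.333...
print(recall_at_k(retrieved, relevant, 3))     # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.333...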

When to include human evaluation

Add human review when your success criteria include things like:

  • factual accuracy in open-ended answers
  • helpfulness
  • coherence
  • tone
  • instruction-following
  • safety or policy compliance

Human review is especially important when automated scores can look good while real answers are still poor.

A practical workflow that reduces guesswork

A good first workflow after installing llm-evaluation:

  1. define one task and one user outcome
  2. collect a small but representative test set
  3. choose 2–4 automated metrics that fit the task
  4. define 3–5 human review dimensions
  5. score a baseline system
  6. compare one change at a time
  7. record failures, not just averages

This keeps evaluation lightweight enough to adopt while still being rigorous.
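
Steps 5 through 7 can be as small as the sketch below. It assumes a run_system function and an is_correct check that you supply; both names, and the result fields, are placeholders rather than anything the skill provides.

# Sketch of scoring a baseline and recording failures, not just averages.
# run_system and is_correct are stand-ins for your own app and task-specific check.

def evaluate(run_system, test_cases, is_correct):
    results = []
    for case in test_cases:
        output = run_system(case["input"])
        passed = is_correct(output, case["expected"])
        results.append({"id": case["id"], "output": output, "passed": passed})
    score = sum(r["passed"] for r in results) / len(results)
    failures = [r for r in results if not r["passed"]]
    return {"score": score, "failures": failures, "results": results}

# Compare one change at a time: run the same test set against baseline and candidate,
# then inspect the failure lists on both rather than only the headline score.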

What the skill helps with best

This llm-evaluation skill is strongest when you need help with:

  • selecting evaluation methods
  • structuring a benchmark
  • combining human and automated assessment
  • planning comparisons between prompts or models
  • building confidence before deployment

It is less useful if you only need a one-line prompt to “judge outputs,” or if you already have a mature evaluation harness and just need implementation code.

Common usage mistake: evaluating without a baseline

Many teams ask whether version B is “good.” The more useful question is whether version B is better than version A on the cases that matter. In your prompt, ask the skill to define:

  • baseline metrics
  • comparison rules
  • pass/fail thresholds
  • regression criteria

That makes llm-evaluation for Model Evaluation much more actionable.
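
One way to make those comparison rules concrete is a small release-gate check like the sketch below; the thresholds, field names, and data shapes are placeholders you would set for your own system.

# Illustrative regression gate: the candidate must match or beat the baseline
# overall and must not regress on cases the baseline already passed.

def release_gate(baseline, candidate, min_score=0.85, max_regressions=0):
    regressions = [
        case_id
        for case_id, passed in baseline["per_case"].items()
        if passed and not candidate["per_case"].get(case_id, False)
    ]
    ok = (
        candidate["score"] >= min_score
        and candidate["score"] >= baseline["score"]
        and len(regressions) <= max_regressions
    )
    return {"pass": ok, "regressions": regressions}

# Example shapes (placeholders):
# baseline  = {"score": 0.88, "per_case": {"q1": True, "q2": False}}
# candidate = {"score": 0.90, "per_case": {"q1": True, "q2": True}}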

llm-evaluation skill FAQ

Is llm-evaluation good for beginners?

Yes, if you already know your app type and what you are trying to improve. The skill explains major evaluation categories clearly. It is less beginner-friendly if you have not yet defined your task, dataset, or success criteria.

Do I need a formal benchmark dataset first?

No, but you do need examples. Even a small curated test set is better than evaluating with ad hoc prompts every time. The skill is most useful once you can show representative cases and expected behavior.
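
A “small curated test set” can literally be a handful of records in a JSONL file. The cases and field names below are invented for illustration; the only point is that each case pairs an input with an expected behavior.

import json

# Tiny hand-curated test set written to JSONL (fields and cases are made up).
cases = [
    {"id": "billing-01", "input": "Why was I charged twice this month?",
     "expected": "Explains duplicate-charge causes and links the refund policy."},
    {"id": "billing-02", "input": "Cancel my subscription",
     "expected": "Confirms intent and describes the cancellation steps."},
]

with open("eval_cases.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")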

Is this skill only for academic-style evaluation?

No. The repository content is practical: model comparison, prompt validation, regression detection, production confidence, and A/B testing. It is applicable to product teams, not just research workflows.

When should I not use llm-evaluation?

Skip llm-evaluation if your need is purely implementation-specific, such as wiring a particular evaluation SDK or running a specific framework command. This skill is about strategy and design, not a turnkey code integration.

How is llm-evaluation different from asking an LLM to grade itself?

Self-grading can be part of a workflow, but it is not a full evaluation strategy. llm-evaluation helps you combine fit-for-purpose metrics, human judgment, baselines, and comparisons so you do not rely on a single noisy signal.

Can I use llm-evaluation for RAG systems?

Yes. In fact, it is a strong fit because it explicitly covers retrieval metrics like MRR, NDCG, Precision@K, and Recall@K. That matters because many weak evaluations score only answer text and ignore retrieval quality.

How to Improve llm-evaluation skill

Give the skill task-level detail, not just a general app description

Better input:

  • “Support chatbot that answers billing questions from a knowledge base”

Worse input:

  • “AI assistant”

The more specific your task framing, the better the skill can recommend the right metrics and review dimensions.

Separate system components in your prompt

For stronger llm-evaluation usage, ask the skill to evaluate layers separately:

  • retrieval quality
  • generation quality
  • classification accuracy
  • safety behavior

This avoids blending multiple failure sources into one vague score.
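
To picture what layer-separated results look like, the sketch below keeps retrieval and generation scores apart instead of blending them. The retrieve, generate, and scoring functions are hypothetical stand-ins for your own components.

# Illustrative: score each layer of a RAG case separately so failures are attributable.

def evaluate_rag_case(case, retrieve, generate, score_retrieval, score_answer):
    docs = retrieve(case["question"])
    answer = generate(case["question"], docs)
    return {
        "id": case["id"],
        "retrieval_score": score_retrieval(docs, case["relevant_docs"]),
        "answer_score": score_answer(answer, case["expected_answer"]),
    }

# A case can now fail for a retrieval reason, a generation reason, or both,
# instead of producing one vague combined score.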

Provide real failure examples

Include 5–10 bad outputs and explain why they failed. For example:

  • hallucinated product policy
  • missed relevant retrieved document
  • correct answer with poor tone
  • refusal when the query was actually safe

This helps the skill recommend evaluation dimensions that match your actual risks.

Ask for a minimum viable evaluation first

Do not start with a huge framework. Ask for:

  • the smallest useful benchmark
  • the fewest metrics worth tracking
  • the minimum human review rubric
  • a simple regression process

This makes adoption much easier and avoids evaluation plans that look impressive but never get run.

Use scorecards with explicit criteria

If you request human evaluation, ask the skill to define:

  • rating dimensions
  • scoring scales
  • examples of pass/fail
  • tie-break rules for ambiguous cases

That reduces reviewer inconsistency and makes repeated evaluations more trustworthy.
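
One way to pin those criteria down is to write the rubric as data before handing it to reviewers. The dimensions, scale, thresholds, and tie-break rule below are examples, not a format the skill prescribes.

# Illustrative human-review rubric: explicit dimensions, scale, and a tie-break rule.
RUBRIC = {
    "scale": {1: "unacceptable", 2: "poor", 3: "acceptable", 4: "good", 5: "excellent"},
    "dimensions": [
        {"name": "factual_accuracy", "pass_threshold": 4,
         "fail_example": "States a refund window that is not in the policy."},
        {"name": "helpfulness", "pass_threshold": 3,
         "fail_example": "Technically correct but does not answer the question asked."},
        {"name": "tone", "pass_threshold": 3,
         "fail_example": "Correct answer delivered dismissively."},
    ],
    "tie_break": "When reviewers disagree by one point, keep the lower score; "
                 "by two or more, escalate the case for discussion.",
}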

Compare one change at a time

A common failure mode is changing prompt, model, retriever, and post-processing together. Then the evaluation cannot explain what caused the result. Ask llm-evaluation to structure experiments so each test isolates a variable where possible.

Track regressions, not just average improvement

Averages can hide important losses. Ask the skill to identify:

  • worst-case categories
  • high-risk slices
  • user-critical scenarios
  • safety-sensitive prompts

This is one of the biggest practical upgrades over shallow evaluation plans.
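
A small sketch of that kind of slice-level comparison, assuming each result record carries a slice label; the labels and field names are placeholders.

from collections import defaultdict

# Illustrative: compare baseline vs candidate per slice to surface hidden regressions.
# Each result is {"id": ..., "slice": "billing" | "safety" | ..., "passed": bool}.

def slice_deltas(baseline_results, candidate_results):
    def rate_by_slice(results):
        totals, passes = defaultdict(int), defaultdict(int)
        for r in results:
            totals[r["slice"]] += 1
            passes[r["slice"]] += int(r["passed"])
        return {s: passes[s] / totals[s] for s in totals}

    base = rate_by_slice(baseline_results)
    cand = rate_by_slice(candidate_results)
    return {s: cand.get(s, 0.0) - base[s] for s in base}

# Negative deltas flag slices (e.g. safety-sensitive prompts) that regressed
# even when the overall average improved.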

Iterate after the first evaluation run

After your first pass, bring the results back and ask the skill to refine:

  • which metrics were noisy
  • which human dimensions overlapped
  • where the dataset was too narrow
  • which failure clusters deserve new test cases

That second iteration is often where llm-evaluation becomes truly valuable rather than just informative.

Improve llm-evaluation outputs with decision-focused requests

Instead of asking for a broad overview, ask for a decision artifact:

  • “Create a release-gate evaluation plan”
  • “Design a prompt-comparison benchmark”
  • “Build a human review rubric for hallucination risk”
  • “Recommend metrics for RAG retrieval regression checks”

Decision-focused prompts produce outputs you can use immediately.

Know the ceiling of the skill

llm-evaluation improves planning quality, but it cannot replace representative data, careful labeling, or disciplined review. If your examples are weak or your success criteria are contradictory, the output will also be weak. The fastest way to improve the skill’s usefulness is to improve the specificity and realism of your evaluation brief.

Ratings & Reviews

No ratings yet