healthcare-eval-harness

by affaan-m

healthcare-eval-harness is a patient safety evaluation harness for healthcare app deployments. It helps teams verify CDSS accuracy, PHI exposure, data integrity, clinical workflow behavior, and integration compliance before release. Critical failures block deployment, making the harness useful for model evaluation and CI safety gates.

Stars: 156.2k
Favorites: 0
Comments: 0
Added: Apr 15, 2026
Category: Model Evaluation
Install Command
npx skills add affaan-m/everything-claude-code --skill healthcare-eval-harness
Curation Score

This skill scores 78/100, which means it is a solid listing candidate for directory users who need a healthcare deployment safety harness. The repository shows a real, triggerable workflow for evaluating EMR/EHR changes, with explicit safety gates for CDSS accuracy, PHI exposure, data integrity, clinical workflow, and integration compliance. It is useful enough to install if you want a structured healthcare test harness rather than a generic prompt, though users should note that it is test-framework oriented and not bundled with helper scripts or references.

Strengths
  • Clear healthcare-specific trigger conditions: use before EMR/EHR deployments, CDSS changes, schema changes touching patient data, and auth changes.
  • Operationally meaningful gates: critical failures block deployment, with explicit pass thresholds for safety-focused categories.
  • Good workflow orientation: the body describes ordered test categories and framework-agnostic adaptation guidance, which helps an agent execute it with less guesswork.
Cautions
  • No install command, scripts, or supporting reference files are included, so adoption requires users to translate the harness into their own test framework.
  • The repository is labeled with experimental/test signals, so users should verify it fits their CI/CD and clinical validation standards before relying on it.
Overview

What healthcare-eval-harness is

healthcare-eval-harness is a deployment safety skill for healthcare software teams that need to verify patient-facing changes before release. It focuses on model- and rules-based evaluation for clinical decision support, PHI exposure, data integrity, workflow correctness, and integration behavior. The point is not generic QA; it is to stop unsafe healthcare changes from shipping.

Who should use it

This healthcare-eval-harness skill is a good fit for engineers, QA leads, MLOps teams, and clinical informatics teams working on EMR, EHR, CDSS, or adjacent healthcare apps. It is most useful when a failure could affect dosing, triage, access control, or regulated patient data handling. If you need a lightweight prompt for a non-clinical app, this is probably too strict.

What makes it different

The repository treats safety gates as hard release criteria: critical failures block deployment instead of being logged as warnings. That makes healthcare-eval-harness useful when you need an installable evaluation pattern, not just a checklist. It also expects you to adapt the harness to your test runner, which keeps it portable across Jest, Vitest, pytest, or PHPUnit.

How to Use healthcare-eval-harness skill

Install and inspect the skill

Install with npx skills add affaan-m/everything-claude-code --skill healthcare-eval-harness. Then read skills/healthcare-eval-harness/SKILL.md first, followed by any linked guidance in the repo root if you are using the broader package. For this skill, the main value is in the evaluation rules and thresholds, so do not skip the “When to Use” and “How It Works” sections.

Turn your task into a useful prompt

A strong healthcare-eval-harness usage prompt should name the system under test, the change type, the test runner, and the safety concern. For example: “Apply healthcare-eval-harness to our EHR medication order flow in pytest. We changed dose validation and role-based access, and I need the critical gates to block release on PHI leakage or unsafe dosing failures.” That is much better than “Run the healthcare skill.”

Use the skill when a change touches patient data, clinical logic, or deployment controls. First map your feature to the five evaluation categories, then decide which ones are critical versus high priority. Next, translate the rules into your existing framework and CI pipeline, and only then run the checks. The most important decision is whether your test suite actually reflects the clinical failure mode you want to prevent.
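
The mapping step above can be sketched in code. This is a hypothetical helper, not part of the skill itself: the five category names follow the skill description, but the function and severity labels are illustrative stand-ins for whatever structure your CI pipeline uses.

```python
# Hypothetical sketch: map a change under review to the harness's five
# evaluation categories, marking touched categories as critical gates
# (block release) and the rest as high-priority checks (warn only).
CATEGORIES = (
    "cdss_accuracy",
    "phi_exposure",
    "data_integrity",
    "clinical_workflow",
    "integration_compliance",
)

def plan_gates(change_touches: set[str]) -> dict[str, str]:
    """Return a severity per category for the change under review."""
    plan = {}
    for category in CATEGORIES:
        # Anything the change directly touches becomes a hard release gate.
        plan[category] = "critical" if category in change_touches else "high"
    return plan

# Example: a dose-validation change touches CDSS accuracy and data integrity.
plan = plan_gates({"cdss_accuracy", "data_integrity"})
```

The point of writing the plan down explicitly is that the critical/high split becomes reviewable before any test runs, which is where most clinical-risk disagreements surface.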

What to read first

Start with SKILL.md for the gate structure, pass thresholds, and usage boundaries. Note that the examples use Jest only as a reference; the skill is framework-agnostic, so you should adapt the file paths, commands, and assertions to your stack. If your repo has its own test organization, mirror that structure instead of forcing a generic layout.
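
As one way to adapt a Jest-style critical gate to another stack, here is a hedged pytest-flavored sketch. The `validate_dose` function and its threshold are placeholders for your real clinical logic, not anything shipped with the skill.

```python
# Hypothetical pytest translation of a Jest-style critical gate.
# Replace validate_dose with your real dose-validation entry point.
def validate_dose(dose_mg: float, max_safe_mg: float) -> bool:
    """Placeholder for real dose-validation logic."""
    return 0 < dose_mg <= max_safe_mg

def test_critical_unsafe_dose_is_rejected():
    # Critical gate: an over-limit dose must never validate.
    assert not validate_dose(dose_mg=500.0, max_safe_mg=100.0)

def test_critical_zero_dose_is_rejected():
    # Critical gate: a zero or missing dose must never validate.
    assert not validate_dose(dose_mg=0.0, max_safe_mg=100.0)
```

Whatever runner you use, the translation should preserve the gate's intent (unsafe input can never pass), not the Jest syntax.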

healthcare-eval-harness skill FAQ

Is healthcare-eval-harness only for Jest?

No. Jest is shown as an example, but healthcare-eval-harness is meant to work with any serious test runner. The important part is preserving the critical gate logic, category order, and pass thresholds in your own tooling.

Is this the same as a normal prompt for healthcare QA?

No. A normal prompt may generate tests, but the healthcare-eval-harness skill gives you an installable evaluation model with explicit blocking behavior. That matters when you need reliable deployment decisions for healthcare application changes.

When should I not use it?

Do not use healthcare-eval-harness for low-risk content changes, marketing pages, or features that do not touch patient safety, clinical workflows, or regulated data. It can be overkill if your team does not have the discipline to maintain tests that reflect real clinical risk.

Is it beginner-friendly?

Yes, if you already know basic testing and CI concepts. It is not a tutorial on healthcare compliance, so beginners will still need domain review for thresholds, edge cases, and what counts as a critical failure.

How to Improve healthcare-eval-harness skill

Give the skill sharper clinical context

The best healthcare-eval-harness results come from specific inputs: the patient workflow, the failure you fear, the data fields involved, and the expected safe behavior. “Test the app” is weak; “test that a medication order with an allergy match blocks submission and logs the reason” is actionable.
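
The allergy-match example above can be made concrete as a test. The order model and audit log here are hypothetical; the shape of the check (block the submission and record why) is the part worth keeping.

```python
# Illustrative check for the allergy-match example: an order matching a
# patient allergy must be blocked and the reason logged. submit_order
# and the audit log are stand-ins for your real order pipeline.
def submit_order(medication: str, patient_allergies: set[str],
                 audit_log: list[str]) -> bool:
    if medication in patient_allergies:
        audit_log.append(f"blocked: allergy match for {medication}")
        return False
    audit_log.append(f"submitted: {medication}")
    return True

log: list[str] = []
accepted = submit_order("penicillin", {"penicillin"}, log)
```

A prompt written at this level of specificity gives the harness both the behavior to assert and the evidence trail to verify.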

Make the failure gates explicit

State which failures must block deployment and which can be high-priority warnings. If you want the skill to evaluate healthcare AI for Model Evaluation, say whether you care more about hallucination risk, PHI leakage, guideline adherence, or workflow breakage. The more explicit the gate, the less guesswork in the output.
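
A minimal sketch of that blocking behavior, assuming a simple check-result shape of your own design (nothing here is the skill's actual output format): critical failures fail the release decision, high-priority failures only accumulate as warnings.

```python
# Hypothetical release gate: results maps check name -> (severity, passed).
# Critical failures block the release; anything else becomes a warning.
def release_decision(results: dict[str, tuple[str, bool]]) -> tuple[bool, list[str]]:
    warnings: list[str] = []
    blocked = False
    for name, (severity, passed) in results.items():
        if passed:
            continue
        if severity == "critical":
            blocked = True
        else:
            warnings.append(name)
    return (not blocked, warnings)

# A PHI leak (critical) blocks release; a latency miss (high) only warns.
ok, warns = release_decision({
    "phi_leak_scan": ("critical", False),
    "workflow_latency": ("high", False),
})
```

Encoding the severity split in code, rather than in reviewer judgment at merge time, is what turns the gates from a checklist into a deployment control.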

Iterate against real misses

After the first run, compare the harness output to actual incidents, near misses, or clinician feedback. Tighten the assertions where unsafe behavior slipped through, and relax only the checks that create noise without improving safety. That feedback loop is what makes healthcare-eval-harness useful beyond a one-time prompt.

Ratings & Reviews

No ratings yet