do-and-judge
by NeoLabHQ
The do-and-judge skill executes a single task with a sub-agent implementation step, an independent judge, and retry-based verification until the result passes or the retry limit is reached. Use do-and-judge for Workflow Automation when you need clear acceptance criteria, isolated execution, and less guesswork than a generic prompt.
This skill scores 78/100, which means it is a solid listing candidate for directory users who want a structured execute-and-verify workflow. The repository gives enough operational detail to understand when to use it and how it behaves, though it still lacks some adoption aids that would reduce setup and usage guesswork.
- Clear trigger and workflow: it is explicitly for a single task with an implementation step, independent judging, and retries until the result passes or the retry limit is reached.
- Strong agent leverage: the meta-judge plus judge loop, parallel dispatch, and feedback retry pattern should help agents execute with less self-check bias.
- Operational structure is substantial: valid frontmatter, long body, many headings, and multiple workflow/constraint signals suggest real procedural content rather than a placeholder.
- No install command, support files, or references are provided, so users must rely on the SKILL.md alone.
- The excerpted SKILL.md shows a hard orchestration constraint and is truncated, which may make the skill feel brittle or harder to adapt in broader agent setups.
Overview of do-and-judge skill
What do-and-judge does
The do-and-judge skill is a single-task execution pattern for workflow automation: it sends work to an implementation sub-agent, creates a separate judge rubric, then retries until the result passes or the retry limit is reached. It is best for jobs where quality depends on external verification, not just one-shot generation.
Who should use it
Use do-and-judge when you need an agent to complete a bounded task with measurable acceptance criteria, such as refactors, code edits, or structured content changes. It is a good fit if you want less self-critique and more independent checking before output is accepted.
Why it stands out
The main value of the do-and-judge skill is the separation of roles: the orchestrator does not do the task itself, the implementation agent works from fresh context, and the judge evaluates against a dedicated specification. That design reduces blind spots and makes installing do-and-judge worthwhile when correctness matters more than speed alone.
How to Use do-and-judge skill
do-and-judge install and setup
Install the do-and-judge skill in your skills workspace, then read SKILL.md first because it contains the operating rules and the control flow. There are no helper scripts or support folders to lean on here, so the skill file is the source of truth.
Turn a vague request into usable input
The do-and-judge usage pattern works best when the task is narrow, testable, and has a clear finish line. Instead of asking for “improve this module,” provide:
- the exact target file or component
- the desired outcome
- constraints that must not change
- a pass/fail condition or expected behavior
Strong prompt example: Refactor the UserService class to use dependency injection without changing public method names; verify that all existing tests still pass and that constructor wiring is explicit.
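The same request can also be captured as a small structured spec before dispatch. This is only an illustrative sketch; the field names are assumptions, not part of the skill's documented interface.

```python
# Illustrative task spec for the prompt above; field names are hypothetical,
# not part of do-and-judge's documented interface.
task_spec = {
    "target": "UserService class",
    "goal": "Refactor to use dependency injection",
    "constraints": [
        "Public method names must not change",
        "Constructor wiring must be explicit",
    ],
    "pass_condition": "All existing tests still pass",
}
```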
Suggested workflow
A practical do-and-judge workflow is: define the task, let the implementation agent work in isolation, generate a judge rubric, check the result against that rubric, then retry only on concrete failures. The pattern is designed for do-and-judge for Workflow Automation, where the goal is controlled execution, not open-ended brainstorming.
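As a rough mental model of that loop, here is a minimal Python sketch. It assumes you supply your own sub-agent dispatch and judge as callables; the names and verdict shape are hypothetical, and the skill's actual control flow lives in SKILL.md.

```python
from typing import Callable

# Minimal sketch of the execute-and-verify loop described above.
# run_implementation and run_judge are hypothetical stand-ins for your own
# sub-agent dispatch; SKILL.md defines the skill's real control flow.
def do_and_judge(
    task_spec: dict,
    rubric: list[str],
    run_implementation: Callable[[dict, list[str]], str],
    run_judge: Callable[[str, list[str]], dict],
    max_retries: int = 3,
) -> str:
    feedback: list[str] = []
    for attempt in range(1, max_retries + 1):
        result = run_implementation(task_spec, feedback)  # fresh-context sub-agent
        verdict = run_judge(result, rubric)               # independent rubric check
        if verdict.get("passed"):
            return result
        feedback = verdict.get("failures", [])            # carry only concrete failures
    raise RuntimeError(f"Task did not pass after {max_retries} attempts: {feedback}")
```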
What to watch in the repo
Read SKILL.md for the process, the critical constraints, and the retry threshold. Pay special attention to the sections on task scope, context handling, and red flags, because those determine whether the orchestrator behaves correctly. If you are adapting the skill to another stack, map those rules to your own tooling before using it on a real task.
do-and-judge skill FAQ
Is do-and-judge better than a normal prompt?
For simple requests, no. A normal prompt is faster. do-and-judge is better when you need a task to be implemented and independently verified, especially if the first answer is likely to miss edge cases or drift from requirements.
Is this skill beginner-friendly?
Yes, if you can describe the task clearly. The main learning curve is not the syntax; it is providing enough task context and acceptance criteria for the judge to evaluate output without guessing.
When should I not use do-and-judge?
Do not use do-and-judge for open-ended exploration, loose ideation, or tasks where success is hard to define. It is also a poor fit when you want the orchestrator to directly edit files or run tools, because the skill is built around role separation and verification.
How does it fit into Workflow Automation?
It fits best as a control layer for single, bounded jobs inside a larger automation system. If your workflow already has explicit checks, the skill adds value by structuring the agent loop; if your workflow has no acceptance criteria, the judge step will be too vague to help.
How to Improve do-and-judge skill
Give the judge better criteria
The biggest quality gain comes from stronger evaluation input. When using do-and-judge, specify what “good” means in observable terms: required behavior, forbidden changes, coverage targets, formatting constraints, or compatibility rules. The more concrete the criteria, the less likely the judge is to approve a weak result.
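As an example of observable criteria, a rubric might look like the following sketch. The entries are illustrative; the exact shape your judge expects depends on your own setup, not on this skill's docs.

```python
# Illustrative rubric: each entry is observable and checkable, not aspirational.
rubric = [
    "All public method names of UserService are unchanged",
    "Dependencies are injected via the constructor, not instantiated inside methods",
    "The existing test suite passes with no skipped tests",
    "No new runtime dependencies were added",
]
```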
Reduce common failure modes
The most common failure is underspecified scope. If the task is too broad, the implementation agent may optimize the wrong thing and the judge will only catch it late. Another failure mode is hidden constraints, such as backward compatibility, naming conventions, or environment limits, so include those up front instead of expecting the retry loop to infer them.
Iterate on the first output
If the first run misses the mark, do not restate the same task. Feed back the judge’s exact failures, tighten the acceptance criteria, and remove ambiguous language. For do-and-judge usage, the second attempt should be narrower and more testable than the first.
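A hypothetical second-attempt input might look like the sketch below; the field names are illustrative, and the point is that the judge's exact failures travel with the retry.

```python
# Hypothetical retry input: carry the judge's exact failures forward and
# tighten the pass condition instead of restating the original task verbatim.
retry_spec = {
    "target": "UserService class",
    "goal": "Refactor to use dependency injection",
    "judge_feedback": [
        "Constructor still instantiates EmailClient directly",
        "Two tests in test_user_service.py fail after the refactor",
    ],
    "pass_condition": (
        "All tests in test_user_service.py pass and UserService methods "
        "instantiate no collaborators directly"
    ),
}
```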
Improve fit before re-running
If you are adapting do-and-judge for another repository or agent stack, align the orchestration rules with your tooling first. Check whether your setup can actually support isolated implementation, independent judging, and bounded retries; if not, simplify the pattern rather than forcing it.
