python-resilience

by wshobson

python-resilience is a guidance skill for safer Python failure handling with retries, exponential backoff, jitter, timeouts, and bounded retry windows. Use it to apply practical resilience patterns to external calls and to write tenacity-style wrappers with clearer retry rules.

Stars: 32.6k

Favorites: 0

Comments: 0

Added: Mar 30, 2026

Category: Reliability

Install Command

npx skills add wshobson/agents --skill python-resilience

Curation Score

This skill scores 78/100, which makes it a solid listing candidate for directory users who need Python retry, timeout, and fault-tolerance patterns. The repository evidence shows real operational content with clear triggers, core concepts, and code examples, so an agent can likely apply it with less guesswork than a generic prompt; however, adoption confidence is moderated by the lack of companion files, install guidance, or executable reference assets.

Strengths

Clear triggerability: frontmatter and 'When to Use This Skill' explicitly cover retries, timeouts, transient failures, rate limiting, and circuit breakers.
Good practical leverage: the skill includes a Quick Start with concrete Python code using tenacity and resilience concepts like exponential backoff, jitter, and bounded retries.
Substantive standalone guidance: SKILL.md is long, structured, and non-placeholder, with multiple headings covering concepts and workflow-oriented advice.

Cautions

No support files, scripts, or references are included, so users must translate the guidance into their own project context without runnable reference assets.
SKILL.md has no install command or repository/file references, which limits confidence about dependencies, setup, and how the patterns should be integrated in real codebases.

Python Backend Microservices Distributed Systems API Workflow

Overview of python-resilience skill

What python-resilience does

The python-resilience skill helps you design Python code that fails more safely when dependencies are unreliable. Its focus is practical resilience patterns: retries, exponential backoff, jitter, timeouts, bounded retry windows, and fault-tolerant wrappers around external calls.

Who should install this skill

The python-resilience skill is best for developers, platform teams, and agent users working on services that talk to APIs, databases, queues, or other networked systems. It is especially useful when you need code-generation help that goes beyond “add a retry” and instead chooses sane failure-handling boundaries.

The real job to be done

Most users do not need a theory page on reliability. They need working Python patterns for questions like:

“Should this error be retried or fail fast?”
“What backoff strategy is safe under load?”
“Where should timeouts live?”
“How do I avoid infinite retries and thundering herds?”
“What decorator or helper should wrap this external call?”

The python-resilience skill is valuable because it frames those decisions explicitly instead of treating retry logic as a one-line patch.

What makes it different from a generic prompt

A generic coding prompt may add retries everywhere or ignore the difference between transient and permanent failures. python-resilience is more opinionated: retry only retryable failures, add jitter, cap both attempts and total time, and treat external boundaries as the main place to add resilience logic.
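
Those four rules can be sketched in a few lines of standard-library Python. This is a minimal illustration, not code taken from the skill: the names backoff_delay and call_with_retry, the default limits, and the choice of TimeoutError and ConnectionError as the retryable set are all assumptions made for the example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt))."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


def call_with_retry(
    func: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    max_attempts: int = 3,
    max_total_seconds: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Retry only retryable errors, bounded by attempts and by total elapsed time."""
    start = clock()
    for attempt in range(max_attempts):
        try:
            return func()
        except retryable:
            # Anything not listed in `retryable` is never caught, so it fails fast.
            delay = backoff_delay(attempt)
            out_of_attempts = attempt + 1 >= max_attempts
            out_of_time = (clock() - start) + delay > max_total_seconds
            if out_of_attempts or out_of_time:
                raise  # budget exhausted: surface the last error to the caller
            sleep(delay)
    raise RuntimeError("max_attempts must be >= 1")
```

The sleep and clock parameters are injectable so the policy can be tested without real waiting. tenacity offers comparable building blocks as decorator arguments, which is the style the skill's own examples use.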

What the source actually covers

The upstream skill is a single SKILL.md file, but it is substantive. It centers on:

transient vs permanent failures
exponential backoff
jitter
bounded retries
practical Python examples using tenacity

That makes it lightweight to inspect and fast to adopt, but it also means you should expect guidance rather than a packaged library or test harness.

When this skill is a strong fit

Use python-resilience when you are:

calling third-party APIs
wrapping flaky network or service interactions
building microservices or workers
adding reliability controls to shared client utilities
handling rate limiting, temporary outages, or intermittent timeouts

If your code is mostly pure in-process logic, this skill is probably not the highest-leverage install.

How to Use python-resilience skill

Install context for python-resilience

Install the skill from the wshobson/agents repository:

npx skills add https://github.com/wshobson/agents --skill python-resilience

After installation, open the skill file first:

plugins/python-development/skills/python-resilience/SKILL.md

This repository area appears to contain only the skill document, so adoption is straightforward: read the skill, then apply its patterns in your own codebase.

Read this file first

Start with SKILL.md from top to bottom. The highest-value sections to review first are:

When to Use This Skill
Core Concepts
Quick Start

That reading order gives you fit, design rules, and implementation shape before you ask the model to modify your code.

What input the skill needs from you

The quality of python-resilience output depends heavily on the context you provide. Before invoking the skill, gather:

the function or service boundary being protected
the dependency type: HTTP API, DB, queue, cache, filesystem
the exact exceptions or failure symptoms observed
whether failures are transient or permanent
timeout expectations
idempotency constraints
max acceptable latency
retry budget: attempts or total duration
whether many clients may retry at once

Without these inputs, the model will often produce overly broad retry logic.

Turn a rough goal into a strong prompt

Weak prompt:

Add resilience to this Python API client.

Better prompt:

Use the python-resilience skill to refactor this Python client method.

Context:
- Dependency: third-party HTTP API
- Library: httpx
- Traffic: moderate, bursty
- Common failures: read timeout, connect timeout, occasional 429 and 503
- Permanent failures: 400, 401, 403 should not be retried
- Idempotency: safe to retry GET requests only
- SLO: fail within 8 seconds total
- Requirement: use bounded retries, exponential backoff with jitter, and clear logging

Task:
- Propose a retry policy
- Implement the wrapper/decorator
- Explain which exceptions and status codes are retryable
- Show where timeout configuration should live

This works better because it gives the skill the decision boundaries it is designed to reason about.
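
To make those decision boundaries concrete, the context in the better prompt can be written down as a small retry predicate. This is a hedged sketch: should_retry and the three constant names are hypothetical, and the status sets simply mirror the example prompt rather than any recommendation from the skill.

```python
# Status codes and methods copied from the example prompt above.
RETRYABLE_STATUS = frozenset({429, 503})
PERMANENT_STATUS = frozenset({400, 401, 403})
IDEMPOTENT_METHODS = frozenset({"GET"})


def should_retry(method: str, status: int) -> bool:
    """Return True only for retryable statuses on methods that are safe to repeat."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # writes are not retried automatically
    if status in PERMANENT_STATUS:
        return False  # client and auth errors will not fix themselves
    return status in RETRYABLE_STATUS  # anything unlisted fails fast by default
```

Defaulting unlisted codes to "do not retry" is a design choice in this sketch: it keeps the policy an allowlist, so a new failure mode has to be classified on purpose before it is retried.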

Ask for policy before code

A strong python-resilience workflow is:

ask for failure classification
ask for a retry/timeout policy
review tradeoffs
then generate implementation code

This avoids jumping straight into decorators before deciding what should and should not be retried.

Use the skill at external boundaries

The skill is most effective when applied to code that crosses process or network boundaries, such as:

httpx or requests calls
message publishing or consumption
database queries with known transient failure modes
cloud SDK calls
service client methods

Do not start by wrapping large business workflows end to end. Put resilience controls around the unstable dependency first.

What good python-resilience output should include

When the skill is working well, the output should usually include:

explicit transient vs permanent failure rules
finite retry limits
exponential backoff
jitter
timeout placement
examples using Python tooling such as tenacity
notes on idempotency and side effects

If the result only says “retry 3 times,” ask for a more explicit retry policy.

Practical implementation pattern to request

The source skill includes a tenacity-based quick start. In practice, that means you can ask for patterns like:

a decorator around a service client method
a wrapper helper for all outbound HTTP calls
separate read vs write retry policies
retries filtered by exception type or status code

For mutation operations, ask the model to justify retry safety. Reliability patterns that ignore idempotency can create duplicate side effects.
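
A decorator with separate read and write policies might look like the following. It is a standard-library sketch rather than the skill's tenacity code; RetryPolicy, READ_POLICY, WRITE_POLICY, and with_retry are hypothetical names, and the limits are placeholder values.

```python
import functools
import random
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int
    retry_on: Tuple[Type[BaseException], ...]
    base_delay: float = 0.2
    max_delay: float = 2.0

    def __post_init__(self) -> None:
        if self.max_attempts < 1:
            raise ValueError("max_attempts must be >= 1")


# Reads are assumed idempotent; writes get a single attempt unless proven safe.
READ_POLICY = RetryPolicy(max_attempts=4, retry_on=(TimeoutError, ConnectionError))
WRITE_POLICY = RetryPolicy(max_attempts=1, retry_on=())


def with_retry(policy: RetryPolicy, sleep: Callable[[float], None] = time.sleep):
    """Decorator factory: retry per policy with exponential backoff and full jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, policy.max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except policy.retry_on:
                    if attempt == policy.max_attempts:
                        raise  # attempts exhausted
                    ceiling = min(policy.max_delay,
                                  policy.base_delay * 2 ** (attempt - 1))
                    sleep(random.uniform(0, ceiling))
        return wrapper
    return decorator
```

A client method would then be declared with @with_retry(READ_POLICY) or @with_retry(WRITE_POLICY), which keeps the retry-safety decision visible at the definition site.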

Common mistakes during python-resilience usage

Watch for these issues in generated code:

retrying authentication or validation failures
no timeout, only retries
retry loops with no total budget
backoff without jitter
wrapping too much code, hiding root cause
retrying non-idempotent writes by default

These are the practical blockers that matter more than code style.
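
Two of these issues, missing timeouts and retry loops with no total budget, can be addressed together by clipping each attempt's timeout to whatever remains of an overall deadline. The helper below is a sketch under that assumption; attempt_timeout is a hypothetical name, and the caller is expected to pass the result to its client's timeout setting.

```python
def attempt_timeout(deadline: float, per_attempt: float, now: float) -> float:
    """Timeout for the next attempt, never extending past the overall deadline."""
    remaining = deadline - now
    if remaining <= 0:
        raise TimeoutError("retry budget exhausted before the attempt started")
    return min(per_attempt, remaining)
```

With an 8-second budget and a 2-second per-attempt timeout, an attempt that starts at the 7-second mark gets only 1 second, so the total never overruns the budget.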

A useful workflow in your repo

For best results, provide the skill with:

the current client function
the exception classes you see in logs
sample status codes
your latency or retry budget
one or two representative failure traces

Then ask for:

policy summary
code changes
tests you should add
monitoring fields to log

That sequence usually produces better adoption-ready output than asking for code alone.

python-resilience skill FAQ

Is python-resilience only for web APIs?

No. The python-resilience skill is broadly about unreliable dependencies. HTTP calls are the easiest example, but the same reasoning applies to queues, databases, caches, and cloud services where transient failures are common.

Is this a library or a guidance skill?

It is a guidance skill, not a standalone Python package. It teaches patterns and shows implementation style, including tenacity-based examples, but you still apply those patterns inside your own codebase.

When should I not use python-resilience?

Do not use python-resilience as a default layer over every function. It is a poor fit for:

pure CPU-bound local logic
errors that are clearly permanent
workflows where retries would duplicate unsafe side effects
systems where latency budgets are too tight for retry windows

In those cases, fail fast or redesign the integration instead.

Is python-resilience suitable for beginners?

Yes, if you already know basic Python and exception handling. The skill’s core ideas are accessible, but the user still needs to supply business context like retry safety, timeout budgets, and which failures are acceptable to retry.

How is this better than asking an LLM for retries?

The advantage of python-resilience is not just code generation. It helps the model reason about failure categories, bounded retries, and backoff behavior. Generic prompts often miss those boundaries and produce retry logic that is unsafe or noisy under load.

Does python-resilience choose the exact retry policy for me?

Not automatically. It provides a strong pattern vocabulary, but the best policy depends on your dependency behavior, latency requirements, and idempotency rules. You should expect to tune attempts, wait ranges, and retry filters to your environment.

How to Improve python-resilience skill

Give the skill better failure classification

The fastest way to improve python-resilience results is to specify which failures are transient and which are permanent. For example:

transient: ConnectTimeout, ReadTimeout, 503, some 429
permanent: 400, 401, 403, schema errors, bad credentials

This single distinction usually determines whether the generated policy is safe.
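
The same classification can be expressed as a function over exceptions. In this sketch HttpStatusError is a hypothetical exception type standing in for whatever your HTTP client raises, the built-in TimeoutError and ConnectionError stand in for client-specific timeout classes, and every 429 is treated as transient, which is a simplification of "some 429".

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


class HttpStatusError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionError)
TRANSIENT_STATUS = frozenset({429, 503})


def classify(exc: BaseException) -> FailureKind:
    """Map an exception to transient or permanent; unknown errors are permanent."""
    if isinstance(exc, HttpStatusError):
        if exc.status in TRANSIENT_STATUS:
            return FailureKind.TRANSIENT
        return FailureKind.PERMANENT
    if isinstance(exc, TRANSIENT_EXCEPTIONS):
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT
```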

Provide latency and retry budgets

If you do not provide a budget, the model may choose arbitrary retry counts. State limits like:

max 3 attempts
total retry window under 8 seconds
single request timeout 2 seconds
background job can tolerate 30 seconds total

These constraints produce more realistic code.
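
It helps to check that the stated limits are mutually consistent before asking for code. The function below is a rough upper-bound calculation, assuming every attempt runs to its timeout and every wait reaches its exponential ceiling; worst_case_seconds and the 0.5-second base delay are illustrative choices, not values from the skill.

```python
def worst_case_seconds(max_attempts: int, per_attempt_timeout: float,
                       base_delay: float, max_delay: float) -> float:
    """Upper bound on total time if every attempt times out and every wait is maximal."""
    # There is one wait between each pair of consecutive attempts.
    waits = [min(max_delay, base_delay * 2 ** i) for i in range(max_attempts - 1)]
    return max_attempts * per_attempt_timeout + sum(waits)


# 3 attempts x 2 s timeout, waits of 0.5 s and 1.0 s: fits an 8 s window.
print(worst_case_seconds(3, 2.0, 0.5, 2.0))  # 7.5
# A fourth attempt would not fit the same window.
print(worst_case_seconds(4, 2.0, 0.5, 2.0))  # 11.5
```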

Tell it whether operations are idempotent

Many resilience mistakes come from missing side-effect context. Get better results from python-resilience by labeling operations as:

safe to retry
safe only with idempotency key
unsafe to retry automatically

That changes both the decorator design and the exception filters.
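
One way to carry these labels into code is an enum that the retry layer consults before repeating a call. This is a sketch with hypothetical names (RetrySafety, may_retry); how an idempotency key is generated and sent depends on the API you are calling.

```python
from enum import Enum
from typing import Optional


class RetrySafety(Enum):
    SAFE = "safe to retry"
    NEEDS_KEY = "safe only with idempotency key"
    UNSAFE = "unsafe to retry automatically"


def may_retry(safety: RetrySafety, idempotency_key: Optional[str] = None) -> bool:
    """Decide whether an operation may be retried automatically."""
    if safety is RetrySafety.SAFE:
        return True
    if safety is RetrySafety.NEEDS_KEY:
        # Only retry when the caller supplied a non-empty key to deduplicate on.
        return bool(idempotency_key)
    return False
```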

Ask for explicit non-retry rules

Do not only ask “what should be retried?” Also ask:

what should fail fast?
what should be surfaced to callers immediately?
what should be logged but not retried?

This makes the output much more production-usable.

Request observability with the implementation

Good python-resilience output should not stop at decorators. Ask the model to add:

attempt count in logs
exception type
elapsed time
final failure reason
retry exhaustion message

Without this, your resilience layer may hide why calls are failing.
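
The fields listed above map directly onto log calls inside the retry loop. The sketch below uses the standard logging module; call_logged and the "resilience" logger name are hypothetical, and the wait is a fixed delay purely to keep the focus on the logged fields.

```python
import logging
import time

logger = logging.getLogger("resilience")


def call_logged(func, retry_on=(TimeoutError, ConnectionError), max_attempts=3,
                sleep=time.sleep, delay=0.5):
    """Retry `func`, logging attempt count, exception type, and elapsed time."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as exc:
            elapsed = time.monotonic() - start
            if attempt == max_attempts:
                # Retry exhaustion message with the final failure reason.
                logger.error(
                    "retries exhausted: attempts=%d error=%s elapsed=%.3fs reason=%s",
                    attempt, type(exc).__name__, elapsed, exc)
                raise
            logger.warning(
                "retrying: attempt=%d/%d error=%s elapsed=%.3fs",
                attempt, max_attempts, type(exc).__name__, elapsed)
            sleep(delay)  # fixed delay for brevity; real code would back off with jitter
```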

Iterate after the first draft

After the first output, refine with concrete feedback such as:

“Do not retry POST requests.”
“Cap total time, not just attempts.”
“Handle 429 differently from 500.”
“Use jitter to avoid synchronized retries.”
“Separate timeout config from retry config.”

This kind of iteration materially improves the implementation.
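
As an illustration of the 429-versus-500 feedback, a rate-limit response can honor the server's Retry-After hint while server errors fall back to plain backoff. This sketch assumes a hypothetical helper named retry_wait, a case-sensitive header mapping, a 30-second cap, and a particular set of retryable 5xx codes; none of these come from the skill.

```python
from typing import Mapping, Optional


def retry_wait(status: int, headers: Mapping[str, str], backoff: float,
               max_wait: float = 30.0) -> Optional[float]:
    """Return seconds to wait before retrying, or None to fail fast."""
    if status == 429:
        raw = headers.get("Retry-After", "").strip()
        # Retry-After may also be an HTTP date; this sketch handles only whole seconds.
        hinted = float(raw) if raw.isdecimal() else 0.0
        return min(max_wait, max(backoff, hinted))
    if status in (500, 502, 503, 504):  # assumed retryable set for this example
        return min(max_wait, backoff)
    return None
```

Capping the hinted wait keeps a large Retry-After value from silently exceeding the caller's latency budget; whether to cap or to give up instead is a policy decision for your service.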

Test the failure paths the skill proposes

Ask the model to generate tests for:

transient exception retries
permanent exception fast-fail behavior
retry exhaustion
timeout enforcement
backoff policy boundaries

Resilience code that is not tested is easy to misconfigure and hard to trust.
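
A test module for the first three failure paths, plus a check on the backoff sequence, could be shaped like this. The retry_call helper is a hypothetical stand-in for your own wrapper, and sleep is injected so the tests run instantly; timeout enforcement is not covered here.

```python
import unittest


def retry_call(func, retry_on, max_attempts, sleep):
    """Helper under test: bounded retry with doubling waits on selected exceptions."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise
            sleep(0.1 * 2 ** (attempt - 1))


class RetryCallTests(unittest.TestCase):
    def setUp(self):
        self.sleeps = []
        self.calls = 0

    def _failing(self, exc, succeed_on=None):
        """Build a callable that raises `exc` until call number `succeed_on`."""
        def func():
            self.calls += 1
            if succeed_on is not None and self.calls >= succeed_on:
                return "ok"
            raise exc
        return func

    def test_transient_is_retried(self):
        func = self._failing(TimeoutError(), succeed_on=3)
        result = retry_call(func, (TimeoutError,), 5, self.sleeps.append)
        self.assertEqual(result, "ok")
        self.assertEqual(self.calls, 3)

    def test_permanent_fails_fast(self):
        func = self._failing(ValueError())
        with self.assertRaises(ValueError):
            retry_call(func, (TimeoutError,), 5, self.sleeps.append)
        self.assertEqual(self.calls, 1)
        self.assertEqual(self.sleeps, [])

    def test_exhaustion_reraises_and_backoff_doubles(self):
        func = self._failing(TimeoutError())
        with self.assertRaises(TimeoutError):
            retry_call(func, (TimeoutError,), 3, self.sleeps.append)
        self.assertEqual(self.calls, 3)
        self.assertEqual(self.sleeps, [0.1, 0.2])
```

Saved as a test module, this runs with python -m unittest.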

Improve the skill output with real traces

If you have logs or sample stack traces, include them. Real failure evidence helps python-resilience recommend narrower exception filters and more believable timeout/backoff settings than abstract prompts do.

Keep the abstraction level modest

A common failure mode is asking the skill to design a full resilience framework when you only need a reliable client wrapper. Start smaller:

one function
one dependency
one retry policy

Then expand after the pattern proves useful.

Use python-resilience as a review lens

Even if you already wrote the code, python-resilience is useful as a reviewer prompt. Ask it to inspect existing retry logic for:

unbounded retries
missing jitter
bad timeout placement
retrying permanent failures
hidden side-effect risks

That review-oriented use case is often the highest-value way to apply the skill in mature codebases.
