SRE

SRE taxonomy generated by the site skill importer.

7 skills

python-observability

by wshobson

python-observability helps you instrument Python services with structured logging, metrics, traces, correlation IDs, and bounded-cardinality patterns for production debugging and safer observability rollouts.

Observability

Favorites 0 | GitHub 32.6k

slo-implementation

by wshobson

Use the slo-implementation skill to define SLIs, SLOs, error budgets, and burn-rate alerts for reliability work. It helps teams turn service goals into measurable targets, with PromQL-style examples and practical guidance from SKILL.md.

Reliability

Favorites 0 | GitHub 32.6k

distributed-tracing

by wshobson

Use the distributed-tracing skill to design and explain request tracing across microservices with Jaeger and Tempo. It covers installation basics, trace and span concepts, Kubernetes setup patterns, context propagation, and practical usage for observability and latency debugging.

Observability

Favorites 0 | GitHub 32.6k

postmortem-writing

by wshobson

postmortem-writing helps teams write blameless incident postmortems after outages or near-misses, covering timelines, root cause analysis, contributing factors, impact, and actionable follow-up items.

Report Writing

Favorites 0 | GitHub 32.5k

on-call-handoff-patterns

by wshobson

Use the on-call-handoff-patterns skill for reliable shift transitions. It helps reliability teams structure incident handoffs and capture active issues, recent changes, escalation state, and next actions.

Reliability

Favorites 0 | GitHub 32.5k

incident-runbook-templates

by wshobson

incident-runbook-templates helps teams create structured incident response runbooks with clear triage, mitigation, escalation, communication, and recovery steps for outages and operational playbooks.

Playbooks

Favorites 0 | GitHub 32.5k

conducting-post-incident-lessons-learned

by mukul975

The conducting-post-incident-lessons-learned skill helps incident response teams run structured after-action reviews, build factual timelines, identify root causes, capture what worked and what failed, and turn each incident into measurable improvements with owners, deadlines, and playbook updates.

Incident Response

Favorites 0 | GitHub 0