
slo-implementation

by wshobson

Use the slo-implementation skill to define SLIs, SLOs, error budgets, and burn-rate alerts for Reliability work. It helps teams turn service goals into measurable targets with PromQL-style examples and practical guidance from SKILL.md.

Stars: 32.6k
Favorites: 0
Comments: 0
Added: Mar 30, 2026
Category: Reliability
Install Command
npx skills add wshobson/agents --skill slo-implementation
Curation Score

This skill scores 68/100, which means it is listable for directory users but should be approached as a document-driven framework rather than a turnkey implementation. The repository gives enough real content to help an agent recognize when to use it and provides practical SLI/SLO examples, but adoption still requires some interpretation because there are no support files, install steps, or visible operational rules beyond the markdown.

Strengths
  • Good triggerability: the description and "When to Use" section clearly scope reliability-target, SLI/SLO, error budget, and alerting tasks.
  • Substantive workflow content: the skill includes concrete SLI/SLO concepts and example PromQL for availability and latency, which is more actionable than a generic prompt.
  • Install-decision clarity is decent: users can tell this is a framework for defining SLIs, SLOs, and error budgets rather than a placeholder or demo-only skill.
Cautions
  • Operational execution is somewhat guesswork-heavy because the repository shows no scripts, references, resources, or install command to turn the framework into a runnable workflow.
  • The excerpt references an external file (`references/slo-definitions.md`), but structural signals show no reference files, which weakens trust and completeness.
Overview

Overview of slo-implementation skill

The slo-implementation skill helps you turn vague reliability goals into concrete Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and alerting logic. It is best for SREs, platform teams, backend engineers, and reliability-minded product owners who need a repeatable way to define what “good enough” service health actually means.

What the slo-implementation skill is for

Use the slo-implementation skill when you need to:

  • define measurable reliability targets for a service
  • choose the right SLI type, such as availability or latency
  • set an SLO target that matches business impact
  • derive an error budget from that target
  • create alerting based on burn rate or SLO consumption
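The last two bullets reduce to simple arithmetic once a target is chosen. As a hedged sketch (the 30-day window and request-based SLI are illustrative assumptions, not requirements from SKILL.md):

```python
# Derive an error budget from an SLO target (illustrative sketch).

def error_budget(slo_target: float, window_days: int = 30):
    """Return the error budget as a fraction and as downtime minutes."""
    budget_fraction = 1.0 - slo_target      # e.g. 99.9% -> 0.1%
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    budget_minutes = budget_fraction * window_minutes
    return budget_fraction, budget_minutes

fraction, minutes = error_budget(0.999)
print(f"SLO 99.9% over 30 days -> budget {fraction:.4f} ({minutes:.1f} min)")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget; burn-rate alerting then measures how quickly that allowance is being consumed.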

This is more useful than a generic “write me an SLO” prompt because it gives a structured hierarchy from SLI to SLO to SLA and grounds the work in implementation details like measurement windows and PromQL-style queries.

Who should install it

The slo-implementation skill is a strong fit if you already have telemetry or can get it soon. It is especially useful for teams that use Prometheus-style metrics and want SRE-aligned reliability practices without inventing the framework from scratch.

It is less useful if:

  • you have no meaningful service metrics yet
  • your main problem is incident response rather than reliability target design
  • you only need a legal or customer-facing SLA document

What users care about most before adopting

Most users evaluating a slo-implementation install want to know:

  1. whether it gives actionable SLO design help, not just theory
  2. whether it supports implementation details like queries and alerts
  3. whether it can help avoid bad SLOs, such as vanity uptime targets
  4. whether it is concise enough to use in a real workflow

On those points, the skill is practical: it covers common SLI types, target-setting examples, and the relationship between objectives and error budgets.

Key strengths and tradeoffs

The main differentiator of slo-implementation is that it stays focused on reliability measurement and policy design instead of drifting into generic observability advice. That focus makes it easier to invoke well.

The tradeoff is that the skill is only as good as the service context you provide. If you do not specify user journeys, traffic patterns, dependencies, thresholds, and metric names, the output will sound plausible but be hard to operationalize.

How to Use slo-implementation skill

Install context for slo-implementation skill

Install the skill in the environment where your agent can access custom skills. A typical pattern is:

  1. add the source repository to your skills setup
  2. enable the slo-implementation skill
  3. invoke it when your task is defining or revising SLIs, SLOs, error budgets, or SLO-based alerts

If your tooling supports direct skill installation, use your normal skill loader for the repository at:
https://github.com/wshobson/agents/tree/main/plugins/observability-monitoring/skills/slo-implementation

Because the repository evidence shows only SKILL.md for this skill, plan to read that file first rather than expecting helper scripts or extra references.

Read this file first

Start with:

  • plugins/observability-monitoring/skills/slo-implementation/SKILL.md

That file contains the real substance of the slo-implementation guide: purpose, when to use it, SLI/SLO/SLA hierarchy, common SLI types, target examples, and implementation patterns.

What input the skill needs to produce useful output

For high-quality slo-implementation usage, give the agent:

  • the service name and what users do with it
  • the most important user-facing journeys
  • current metrics and labels available
  • existing dashboards, alerts, or PromQL if any
  • traffic volume and seasonality
  • business criticality and outage cost
  • latency expectations by endpoint or operation
  • known failure modes
  • whether you need internal SLOs, external SLA alignment, or both

Without this, the skill can still draft an SLO, but it will tend to default to generic availability targets and simplistic request-based SLIs.

Turn a rough goal into a strong prompt

Weak prompt:

  • “Create SLOs for my API.”

Better prompt:

  • “Use the slo-implementation skill to define SLIs and SLOs for a multi-tenant payments API. Our critical user journeys are charge creation and webhook delivery. We use Prometheus. Available metrics include http_requests_total, http_request_duration_seconds_bucket, and queue retry counters. Propose 2 to 3 SLIs, recommend SLO targets, calculate monthly error budgets, and suggest burn-rate alerts. Exclude admin endpoints and health checks.”

Why this works:

  • it defines the service boundary
  • it points to real metrics
  • it limits scope to meaningful user journeys
  • it asks for outputs the skill is designed to produce
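To make that concrete, here is a hedged sketch of the kind of availability SLI such a prompt might yield. The routes, status codes, and counts are hypothetical, and the PromQL in the comment is an illustrative shape rather than output quoted from the skill:

```python
# Availability SLI sketch: a success ratio over request counters, with
# health checks excluded as the prompt asks. A PromQL-style equivalent
# (illustrative, not quoted from SKILL.md) would be roughly:
#   sum(rate(http_requests_total{code!~"5..", path!~"/healthz"}[5m]))
#     / sum(rate(http_requests_total{path!~"/healthz"}[5m]))

requests = [
    # (path, status, count) over some measurement window -- hypothetical
    ("/v1/charges", 200, 9_500),
    ("/v1/charges", 500, 40),
    ("/healthz", 200, 50_000),  # excluded: not a user-facing journey
]

def availability(requests, excluded=("/healthz",)) -> float:
    counted = [(p, s, c) for p, s, c in requests if p not in excluded]
    total = sum(c for _, _, c in counted)
    good = sum(c for _, s, c in counted if s < 500)
    return good / total

print(f"Availability SLI: {availability(requests):.4%}")
```

Note how excluding the health-check traffic changes the ratio: with it included, the huge count of trivially successful probes would mask real user-facing failures.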

Best workflow for first-time use

A practical slo-implementation usage flow is:

  1. identify one service, not your whole platform
  2. name 1 to 3 critical user journeys
  3. map each journey to existing signals
  4. ask the skill for candidate SLIs
  5. review whether those SLIs reflect user experience, not just system internals
  6. set an initial SLO target and error budget
  7. draft alerting logic
  8. test whether the metrics actually support the design
  9. revise thresholds and exclusions before rollout

This avoids the common failure mode of trying to define enterprise-wide reliability policy in one pass.

What the skill is likely to output well

The slo-implementation skill is strongest at:

  • proposing common SLI patterns like availability and latency
  • explaining SLI/SLO/SLA relationships
  • translating reliability goals into measurable ratios
  • suggesting target ranges and error budget framing
  • outlining alerting based on SLO consumption

It is particularly helpful when you need a first operational draft quickly and want it anchored in standard SRE language.

Where teams usually get blocked

Adoption usually stalls for one of these reasons:

  • the team cannot agree on the user-facing service boundary
  • only infrastructure metrics exist, not user-journey metrics
  • latency histograms are missing, so threshold-based SLIs are weak
  • metrics include bot traffic, internal jobs, or health checks that distort the numerator and denominator
  • targets are chosen politically rather than from risk and cost

The skill can help structure the conversation, but it cannot invent trustworthy measurement where telemetry is absent.

Practical prompt patterns that improve output quality

Ask the skill to produce outputs in a decision-ready format, for example:

  • “List candidate SLIs with rationale and tradeoffs.”
  • “Recommend one primary SLO and one secondary guardrail SLO.”
  • “Show PromQL-style formulas for each SLI.”
  • “Identify exclusions that should not count against the SLO.”
  • “Suggest alerting windows for fast and slow burn.”

These prompt patterns create implementation-grade output instead of abstract reliability advice.
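For the fast- and slow-burn request, it helps to know what a burn-rate threshold means numerically. The sketch below follows the common multiwindow convention (2% of budget spent in 1 hour pages, 5% in 6 hours opens a ticket); those specific numbers are an assumption borrowed from standard SRE practice, not from SKILL.md:

```python
# Burn-rate threshold sketch: how fast must the budget burn to alert?
# A burn rate of 1x consumes exactly the full budget over the SLO window.

def burn_rate_threshold(budget_consumed: float, window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_consumed` (a fraction of the whole
    error budget) is spent within `window_hours`."""
    return budget_consumed * (slo_window_hours / window_hours)

fast = burn_rate_threshold(0.02, 1)  # page: 2% of budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # ticket: 5% of budget in 6 hours
print(f"fast burn: {fast}x, slow burn: {slow}x")
```

With a 30-day window that yields a 14.4x fast-burn and 6x slow-burn threshold, which is why asking the skill for explicit windows produces directly usable alert rules.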

How to use slo-implementation for Reliability work

For Reliability work, invoke the slo-implementation skill at these moments:

  • before launching a new service
  • during observability improvement work
  • after recurring incidents show your current alerts are noisy
  • when leadership asks for reliability targets tied to customer impact
  • when you need to connect engineering velocity to error budget policy

It is most valuable when the team is moving from “monitor everything” to “measure what matters to users.”

slo-implementation skill FAQ

Is slo-implementation better than a normal prompt?

Yes, if your task is specifically SLI/SLO design. A normal prompt may generate acceptable definitions, but slo-implementation is more likely to preserve the hierarchy, include measurable formulas, and connect targets to error budgets and alerting.

Is the slo-implementation skill beginner friendly?

Moderately. Beginners can use it, but the best results come from knowing basic SRE concepts and having some telemetry context. If you are new to SLOs, use the skill for one service first and review each proposed metric before adopting it.

Does it require Prometheus?

No, but the skill content clearly aligns well with Prometheus and PromQL-style thinking. If you use Datadog, CloudWatch, Grafana, or another stack, you can still use the logic and translate the metric expressions to your platform.

When should I not use slo-implementation?

Do not use slo-implementation as your main tool if:

  • you need legal SLA language
  • you are missing any usable service telemetry
  • your real problem is ownership, not measurement
  • your service is too immature to define stable user journeys yet

In those cases, instrument first or solve the operating model problem before formalizing SLOs.

Can it help with alerting too?

Yes. The skill is not only about target definition; it also supports the operational side of error budgets and SLO-based alerts. That makes it more useful than a template that stops at percentage targets.

How to Improve slo-implementation skill

Give business context, not just technical metrics

To improve slo-implementation results, tell the agent what reliability means commercially:

  • Which workflow loses revenue when degraded?
  • Which users are premium or latency-sensitive?
  • What is the tolerated impact duration?

This helps the skill choose realistic targets instead of defaulting to aspirational numbers like 99.99%.
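A quick way to ground that conversation is to translate each candidate target into its monthly downtime allowance. The arithmetic below is illustrative, not a recommendation from SKILL.md:

```python
# Allowed downtime per 30-day month for common availability targets.
WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def allowed_downtime_minutes(target: float) -> float:
    """Minutes of full outage the target tolerates per 30-day window."""
    return (1 - target) * WINDOW_MIN

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.2f} min/month")
```

Seeing that 99.99% leaves under five minutes of monthly budget, while 99% leaves over seven hours, usually makes the commercial tradeoff discussion far more concrete.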

Define service boundaries explicitly

Stronger input to the slo-implementation guide names what counts and what does not. For example:

  • include public API write requests
  • exclude /healthz, admin routes, and internal batch jobs
  • measure only successful user-visible completion, not just request acceptance

Boundary clarity is one of the biggest determinants of whether an SLO will be trusted.

Provide metric names and sample queries

The skill becomes much more actionable when you share actual telemetry. Good input looks like:

  • metric names
  • label dimensions
  • histogram buckets
  • current alert queries
  • dashboard links or copied snippets

That allows the output to move from conceptual SLOs to near-implementable definitions.
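For example, once cumulative histogram buckets are available, a threshold-based latency SLI is just a bucket ratio. The bucket counts below are hypothetical, and the PromQL in the comment is an illustrative sketch rather than a query from SKILL.md:

```python
# Latency SLI sketch: fraction of requests completing under 300 ms,
# computed from cumulative histogram bucket counts. A PromQL-style
# form (illustrative) would be roughly:
#   sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
#     / sum(rate(http_request_duration_seconds_count[5m]))

# Hypothetical cumulative counts per `le` bucket over a window:
buckets = {0.1: 9_200, 0.3: 9_850, 1.0: 9_990, float("inf"): 10_000}

def latency_sli(buckets: dict, threshold: float) -> float:
    """Share of all requests at or under the latency threshold."""
    total = buckets[float("inf")]
    return buckets[threshold] / total

print(f"Requests under 300 ms: {latency_sli(buckets, 0.3):.2%}")
```

This is also why missing histogram buckets block threshold-based SLIs: if 0.3 is not an existing bucket boundary, the ratio cannot be computed without re-instrumenting.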

Avoid vanity SLIs

A common failure mode is choosing metrics that are easy to collect but weakly tied to user experience. Examples:

  • pod restarts
  • CPU saturation alone
  • raw uptime of a dependency with no service impact mapping

Ask the skill to justify why each SLI reflects user-perceived reliability. If it cannot, replace that SLI.

Iterate after the first draft

Your first output from slo-implementation should be treated as a draft. Improve it by asking:

  • “Which SLI is most representative of user harm?”
  • “What would make this SLO impossible to measure accurately?”
  • “Which exclusions are risky or easy to abuse?”
  • “How would this change for low-traffic services?”
  • “What alerting would reduce noise while protecting the error budget?”

This second pass usually produces a much better operational design than accepting the first target set.

Pressure-test with historical incidents

One of the best ways to improve the slo-implementation skill output is to compare proposed SLIs and alerts against real incidents. Ask:

  • would this SLO have detected the issue?
  • would it have over-counted harmless failures?
  • would the burn-rate policy have paged too early or too late?

That validation step turns a neat document into something teams can actually run.
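A minimal replay check might look like the sketch below; the incident numbers, the 99.9% SLO, and the 14.4x fast-burn threshold are all assumptions chosen for illustration:

```python
# Incident replay sketch: would a fast-burn alert have fired during a
# past incident? SLO, threshold, and traffic figures are hypothetical.

SLO = 0.999
FAST_BURN = 14.4  # 2% of a 30-day budget spent in 1 hour

def would_have_paged(total_requests: int, failed_requests: int) -> bool:
    """Given one hour of traffic, compare the observed burn rate
    against the fast-burn paging threshold."""
    error_rate = failed_requests / total_requests
    burn_rate = error_rate / (1 - SLO)
    return burn_rate >= FAST_BURN

print(would_have_paged(120_000, 3_000))  # 2.5% errors: well past 14.4x
print(would_have_paged(120_000, 60))     # 0.05% errors: below threshold
```

Running each historical incident through a check like this quickly exposes SLOs that would have slept through real harm or paged on noise.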

Use one service at a time

If results feel generic, the scope is probably too broad. The skill works best when focused on one service or one user journey. Split large systems into separate passes, then standardize patterns later.

Ratings & Reviews

No ratings yet