
slo-implementation

by wshobson

Use the slo-implementation skill to define SLIs, SLOs, error budgets, and burn-rate alerts for Reliability work. It helps teams turn service goals into measurable targets with PromQL-style examples and practical guidance from SKILL.md.

Stars: 32.6k
Favorites: 0
Comments: 0
Added: Mar 30, 2026
Category: Reliability
Install Command
npx skills add wshobson/agents --skill slo-implementation
Curation Score

This skill scores 68/100, which means it is listable for directory users but should be approached as a document-driven framework rather than a turnkey implementation. The repository gives enough real content to help an agent recognize when to use it and provides practical SLI/SLO examples, but adoption still requires some interpretation because there are no support files, install steps, or visible operational rules beyond the markdown.

Strengths
  • Good triggerability: the description and "When to Use" section clearly scope reliability-target, SLI/SLO, error budget, and alerting tasks.
  • Substantive workflow content: the skill includes concrete SLI/SLO concepts and example PromQL for availability and latency, which is more actionable than a generic prompt.
  • Install-decision clarity is decent: users can tell this is a framework for defining SLIs, SLOs, and error budgets rather than a placeholder or demo-only skill.
Cautions
  • Operational execution is somewhat guesswork-heavy because the repository shows no scripts, references, resources, or install command to turn the framework into a runnable workflow.
  • The excerpt references an external file (`references/slo-definitions.md`), but structural signals show no reference files, which weakens trust and completeness.
Overview

Overview of slo-implementation skill

The slo-implementation skill helps you turn vague reliability goals into concrete Service Level Indicators (SLIs), Service Level Objectives (SLOs), error budgets, and alerting logic. It is best for SREs, platform teams, backend engineers, and reliability-minded product owners who need a repeatable way to define what “good enough” service health actually means.

What the slo-implementation skill is for

Use the slo-implementation skill when you need to:

  • define measurable reliability targets for a service
  • choose the right SLI type, such as availability or latency
  • set an SLO target that matches business impact
  • derive an error budget from that target
  • create alerting based on burn rate or SLO consumption
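The last two bullets reduce to simple arithmetic once a target is chosen. As a hedged sketch (the 30-day window and request-based SLI are illustrative assumptions, not requirements from SKILL.md):

```python
# Derive an error budget from an SLO target (illustrative sketch).

def error_budget(slo_target: float, window_days: int = 30):
    """Return the error budget as a fraction and as downtime minutes."""
    budget_fraction = 1.0 - slo_target      # e.g. 99.9% -> 0.1%
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    budget_minutes = budget_fraction * window_minutes
    return budget_fraction, budget_minutes

fraction, minutes = error_budget(0.999)
print(f"SLO 99.9% over 30 days -> budget {fraction:.4f} ({minutes:.1f} min)")
```

A 99.9% target over 30 days leaves roughly 43 minutes of budget; burn-rate alerting then measures how quickly that allowance is being consumed.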

This is more useful than a generic “write me an SLO” prompt because it gives a structured hierarchy from SLI to SLO to SLA and grounds the work in implementation details like measurement windows and PromQL-style queries.

Who should install it

The slo-implementation skill is a strong fit if you already have telemetry or can get it soon. It is especially useful for teams that use Prometheus-style metrics and want SRE-aligned reliability practices without inventing the framework from scratch.

It is less useful if:

  • you have no meaningful service metrics yet
  • your main problem is incident response rather than reliability target design
  • you only need a legal or customer-facing SLA document

What users care about most before adopting

Most users evaluating a slo-implementation install want to know:

  1. whether it gives actionable SLO design help, not just theory
  2. whether it supports implementation details like queries and alerts
  3. whether it can help avoid bad SLOs, such as vanity uptime targets
  4. whether it is concise enough to use in a real workflow

On those points, the skill is practical: it covers common SLI types, target-setting examples, and the relationship between objectives and error budgets.

Key strengths and tradeoffs

The main differentiator of slo-implementation is that it stays focused on reliability measurement and policy design instead of drifting into generic observability advice. That focus makes it easier to invoke well.

The tradeoff is that the skill is only as good as the service context you provide. If you do not specify user journeys, traffic patterns, dependencies, thresholds, and metric names, the output will sound plausible but be hard to operationalize.

How to Use slo-implementation skill

Install context for slo-implementation skill

Install the skill in the environment where your agent can access custom skills. A typical pattern is:

  1. add the source repository to your skills setup
  2. enable the slo-implementation skill
  3. invoke it when your task is defining or revising SLIs, SLOs, error budgets, or SLO-based alerts

If your tooling supports direct skill installation, use your normal skill loader for the repository at:
https://github.com/wshobson/agents/tree/main/plugins/observability-monitoring/skills/slo-implementation

Because the repository evidence shows only SKILL.md for this skill, plan to read that file first rather than expecting helper scripts or extra references.

Read this file first

Start with:

  • plugins/observability-monitoring/skills/slo-implementation/SKILL.md

That file contains the real substance of the slo-implementation guide: purpose, when to use it, SLI/SLO/SLA hierarchy, common SLI types, target examples, and implementation patterns.

What input the skill needs to produce useful output

For high-quality slo-implementation usage, give the agent:

  • the service name and what users do with it
  • the most important user-facing journeys
  • current metrics and labels available
  • existing dashboards, alerts, or PromQL if any
  • traffic volume and seasonality
  • business criticality and outage cost
  • latency expectations by endpoint or operation
  • known failure modes
  • whether you need internal SLOs, external SLA alignment, or both

Without this, the skill can still draft an SLO, but it will tend to default to generic availability targets and simplistic request-based SLIs.

Turn a rough goal into a strong prompt

Weak prompt:

  • “Create SLOs for my API.”

Better prompt:

  • “Use the slo-implementation skill to define SLIs and SLOs for a multi-tenant payments API. Our critical user journeys are charge creation and webhook delivery. We use Prometheus. Available metrics include http_requests_total, http_request_duration_seconds_bucket, and queue retry counters. Propose 2 to 3 SLIs, recommend SLO targets, calculate monthly error budgets, and suggest burn-rate alerts. Exclude admin endpoints and health checks.”

Why this works:

  • it defines the service boundary
  • it points to real metrics
  • it limits scope to meaningful user journeys
  • it asks for outputs the skill is designed to produce
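To make that concrete, here is a hedged sketch of the kind of availability SLI such a prompt might yield. The routes, status codes, and counts are hypothetical, and the PromQL in the comment is an illustrative shape rather than output quoted from the skill:

```python
# Availability SLI sketch: a success ratio over request counters, with
# health checks excluded as the prompt asks. A PromQL-style equivalent
# (illustrative, not quoted from SKILL.md) would be roughly:
#   sum(rate(http_requests_total{code!~"5..", path!~"/healthz"}[5m]))
#     / sum(rate(http_requests_total{path!~"/healthz"}[5m]))

requests = [
    # (path, status, count) over some measurement window -- hypothetical
    ("/v1/charges", 200, 9_500),
    ("/v1/charges", 500, 40),
    ("/healthz", 200, 50_000),  # excluded: not a user-facing journey
]

def availability(requests, excluded=("/healthz",)) -> float:
    counted = [(p, s, c) for p, s, c in requests if p not in excluded]
    total = sum(c for _, _, c in counted)
    good = sum(c for _, s, c in counted if s < 500)
    return good / total

print(f"Availability SLI: {availability(requests):.4%}")
```

Note how excluding the health-check traffic changes the ratio: with it included, the huge count of trivially successful probes would mask real user-facing failures.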

Best workflow for first-time use

A practical slo-implementation usage flow is:

  1. identify one service, not your whole platform
  2. name 1 to 3 critical user journeys
  3. map each journey to existing signals
  4. ask the skill for candidate SLIs
  5. review whether those SLIs reflect user experience, not just system internals
  6. set an initial SLO target and error budget
  7. draft alerting logic
  8. test whether the metrics actually support the design
  9. revise thresholds and exclusions before rollout

This avoids the common failure mode of trying to define enterprise-wide reliability policy in one pass.

What the skill is likely to output well

The slo-implementation skill is strongest at:

  • proposing common SLI patterns like availability and latency
  • explaining SLI/SLO/SLA relationships
  • translating reliability goals into measurable ratios
  • suggesting target ranges and error budget framing
  • outlining alerting based on SLO consumption

It is particularly helpful when you need a first operational draft quickly and want it anchored in standard SRE language.

Where teams usually get blocked

Adoption usually stalls for one of these reasons:

  • the team cannot agree on the user-facing service boundary
  • only infrastructure metrics exist, not user-journey metrics
  • latency histograms are missing, so threshold-based SLIs are weak
  • metrics include bot traffic, internal jobs, or health checks that distort the numerator and denominator
  • targets are chosen politically rather than from risk and cost

The skill can help structure the conversation, but it cannot invent trustworthy measurement where telemetry is absent.

Practical prompt patterns that improve output quality

Ask the skill to produce outputs in a decision-ready format, for example:

  • “List candidate SLIs with rationale and tradeoffs.”
  • “Recommend one primary SLO and one secondary guardrail SLO.”
  • “Show PromQL-style formulas for each SLI.”
  • “Identify exclusions that should not count against the SLO.”
  • “Suggest alerting windows for fast and slow burn.”

These prompt patterns create implementation-grade output instead of abstract reliability advice.
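For the fast- and slow-burn request, it helps to know what a burn-rate threshold means numerically. The sketch below follows the common multiwindow convention (2% of budget spent in 1 hour pages, 5% in 6 hours opens a ticket); those specific numbers are an assumption borrowed from standard SRE practice, not from SKILL.md:

```python
# Burn-rate threshold sketch: how fast must the budget burn to alert?
# A burn rate of 1x consumes exactly the full budget over the SLO window.

def burn_rate_threshold(budget_consumed: float, window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_consumed` (a fraction of the whole
    error budget) is spent within `window_hours`."""
    return budget_consumed * (slo_window_hours / window_hours)

fast = burn_rate_threshold(0.02, 1)  # page: 2% of budget in 1 hour
slow = burn_rate_threshold(0.05, 6)  # ticket: 5% of budget in 6 hours
print(f"fast burn: {fast}x, slow burn: {slow}x")
```

With a 30-day window that yields a 14.4x fast-burn and 6x slow-burn threshold, which is why asking the skill for explicit windows produces directly usable alert rules.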

How to use slo-implementation for Reliability work

For Reliability work, invoke the slo-implementation skill at these moments:

  • before launching a new service
  • during observability improvement work
  • after recurring incidents show your current alerts are noisy
  • when leadership asks for reliability targets tied to customer impact
  • when you need to connect engineering velocity to error budget policy

It is most valuable when the team is moving from “monitor everything” to “measure what matters to users.”

slo-implementation skill FAQ

Is slo-implementation better than a normal prompt?

Yes, if your task is specifically SLI/SLO design. A normal prompt may generate acceptable definitions, but slo-implementation is more likely to preserve the hierarchy, include measurable formulas, and connect targets to error budgets and alerting.

Is the slo-implementation skill beginner friendly?

Moderately. Beginners can use it, but the best results come from knowing basic SRE concepts and having some telemetry context. If you are new to SLOs, use the skill for one service first and review each proposed metric before adopting it.

Does it require Prometheus?

No, but the skill content clearly aligns well with Prometheus and PromQL-style thinking. If you use Datadog, CloudWatch, Grafana, or another stack, you can still use the logic and translate the metric expressions to your platform.

When should I not use slo-implementation?

Do not use slo-implementation as your main tool if:

  • you need legal SLA language
  • you are missing any usable service telemetry
  • your real problem is ownership, not measurement
  • your service is too immature to define stable user journeys yet

In those cases, instrument first or solve the operating model problem before formalizing SLOs.

Can it help with alerting too?

Yes. The skill is not only about target definition; it also supports the operational side of error budgets and SLO-based alerts. That makes it more useful than a template that stops at percentage targets.

How to Improve slo-implementation skill

Give business context, not just technical metrics

To improve slo-implementation results, tell the agent what reliability means commercially:

  • Which workflow loses revenue when degraded?
  • Which users are premium or latency-sensitive?
  • What is the tolerated impact duration?

This helps the skill choose realistic targets instead of defaulting to aspirational numbers like 99.99%.
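A quick way to ground that conversation is to translate each candidate target into its monthly downtime allowance. The arithmetic below is illustrative, not a recommendation from SKILL.md:

```python
# Allowed downtime per 30-day month for common availability targets.
WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day window

def allowed_downtime_minutes(target: float) -> float:
    """Minutes of full outage the target tolerates per 30-day window."""
    return (1 - target) * WINDOW_MIN

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {allowed_downtime_minutes(target):.2f} min/month")
```

Seeing that 99.99% leaves under five minutes of monthly budget, while 99% leaves over seven hours, usually makes the commercial tradeoff discussion far more concrete.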

Define service boundaries explicitly

Stronger input to the slo-implementation guide names what counts and what does not. For example:

  • include public API write requests
  • exclude /healthz, admin routes, and internal batch jobs
  • measure only successful user-visible completion, not just request acceptance

Boundary clarity is one of the biggest determinants of whether an SLO will be trusted.

Provide metric names and sample queries

The skill becomes much more actionable when you share actual telemetry. Good input looks like:

  • metric names
  • label dimensions
  • histogram buckets
  • current alert queries
  • dashboard links or copied snippets

That allows the output to move from conceptual SLOs to near-implementable definitions.
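For example, once cumulative histogram buckets are available, a threshold-based latency SLI is just a bucket ratio. The bucket counts below are hypothetical, and the PromQL in the comment is an illustrative sketch rather than a query from SKILL.md:

```python
# Latency SLI sketch: fraction of requests completing under 300 ms,
# computed from cumulative histogram bucket counts. A PromQL-style
# form (illustrative) would be roughly:
#   sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
#     / sum(rate(http_request_duration_seconds_count[5m]))

# Hypothetical cumulative counts per `le` bucket over a window:
buckets = {0.1: 9_200, 0.3: 9_850, 1.0: 9_990, float("inf"): 10_000}

def latency_sli(buckets: dict, threshold: float) -> float:
    """Share of all requests at or under the latency threshold."""
    total = buckets[float("inf")]
    return buckets[threshold] / total

print(f"Requests under 300 ms: {latency_sli(buckets, 0.3):.2%}")
```

This is also why missing histogram buckets block threshold-based SLIs: if 0.3 is not an existing bucket boundary, the ratio cannot be computed without re-instrumenting.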

Avoid vanity SLIs

A common failure mode is choosing metrics that are easy to collect but weakly tied to user experience. Examples:

  • pod restarts
  • CPU saturation alone
  • raw uptime of a dependency with no service impact mapping

Ask the skill to justify why each SLI reflects user-perceived reliability. If it cannot, replace that SLI.

Iterate after the first draft

Your first output from slo-implementation should be treated as a draft. Improve it by asking:

  • “Which SLI is most representative of user harm?”
  • “What would make this SLO impossible to measure accurately?”
  • “Which exclusions are risky or easy to abuse?”
  • “How would this change for low-traffic services?”
  • “What alerting would reduce noise while protecting the error budget?”

This second pass usually produces a much better operational design than accepting the first target set.

Pressure-test with historical incidents

One of the best ways to improve the slo-implementation skill output is to compare proposed SLIs and alerts against real incidents. Ask:

  • would this SLO have detected the issue?
  • would it have over-counted harmless failures?
  • would the burn-rate policy have paged too early or too late?

That validation step turns a neat document into something teams can actually run.
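A minimal replay check might look like the sketch below; the incident numbers, the 99.9% SLO, and the 14.4x fast-burn threshold are all assumptions chosen for illustration:

```python
# Incident replay sketch: would a fast-burn alert have fired during a
# past incident? SLO, threshold, and traffic figures are hypothetical.

SLO = 0.999
FAST_BURN = 14.4  # 2% of a 30-day budget spent in 1 hour

def would_have_paged(total_requests: int, failed_requests: int) -> bool:
    """Given one hour of traffic, compare the observed burn rate
    against the fast-burn paging threshold."""
    error_rate = failed_requests / total_requests
    burn_rate = error_rate / (1 - SLO)
    return burn_rate >= FAST_BURN

print(would_have_paged(120_000, 3_000))  # 2.5% errors: well past 14.4x
print(would_have_paged(120_000, 60))     # 0.05% errors: below threshold
```

Running each historical incident through a check like this quickly exposes SLOs that would have slept through real harm or paged on noise.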

Use one service at a time

If results feel generic, the scope is probably too broad. The skill works best when focused on one service or one user journey. Split large systems into separate passes, then standardize patterns later.

Ratings & Reviews

No ratings yet