
elevenlabs-stt

by inferen-sh

High-accuracy ElevenLabs speech-to-text via inference.sh CLI using Scribe v1/v2 models. Supports transcription, speaker diarization, audio event tagging, word-level timestamps, forced alignment, and subtitle generation for meetings, podcasts, and other audio workflows.

Stars: 0
Favorites: 0
Comments: 0
Added: Mar 27, 2026
Category: Audio Editing
Install Command
npx skills add https://github.com/inferen-sh/skills --skill elevenlabs-stt
Overview


What is elevenlabs-stt?

elevenlabs-stt is a speech-to-text skill that connects your agent or CLI workflows to ElevenLabs Scribe models via the inference.sh (infsh) CLI.

It focuses on high-accuracy, time-aligned audio transcription rather than general note-taking. The skill is designed for media workflows such as:

  • Cleaning up voice recordings for audio and video editing
  • Creating accurate subtitles and captions with timing
  • Producing podcast and interview transcripts
  • Generating lip-sync and karaoke timing through word-level alignment
  • Tagging audio events and identifying different speakers in a recording

Key capabilities

Backed by ElevenLabs Scribe v1/v2 models (via the elevenlabs/stt app on inference.sh), elevenlabs-stt provides:

  • Transcription of audio into structured text
  • Speaker diarization and speaker identification (who spoke when)
  • Audio event tagging (e.g., music, silence, background sounds)
  • Word-level timestamps and forced alignment to existing text
  • Subtitle-friendly output suitable for captions and post-production
  • Multilingual support across 90+ languages with auto-detection

The models are described as delivering 98%+ transcription accuracy in supported conditions, making this skill suitable for production-quality audio and video projects.

Who is elevenlabs-stt for?

elevenlabs-stt is a strong fit if you:

  • Work in audio or video post-production and need reliable transcripts
  • Produce podcasts, webinars, interviews, or lectures and want automated text output
  • Need time-aligned subtitles or caption files as part of your workflow
  • Build developer tools, agents, or pipelines that must call ElevenLabs STT from scripts
  • Want to keep everything in a CLI- and JSON-first environment

It is less suitable if you:

  • Need a purely browser-based, non-technical interface with no CLI
  • Only require casual note-taking from audio and do not care about timing, diarization, or data structures
  • Cannot install or use the infsh CLI where your agent runs

How it fits in your toolchain

elevenlabs-stt sits in the audio-editing and voice tooling layer of your stack:

  • Upstream: audio capture (Zoom recordings, OBS, phone audio, raw WAV/MP3)
  • Core: elevenlabs-stt + infsh to transcribe, diarize, align, and tag
  • Downstream: NLE timelines (Premiere, Resolve), caption workflows, search indexes, AI summarization, or QA agents

Because the skill is defined in the inferen-sh/skills repo, it integrates cleanly with other inference.sh-based tools, using Bash (infsh *) under the hood.

How to Use

1. Prerequisites and environment

Before using elevenlabs-stt as a skill, you need:

  • inference.sh CLI (infsh) installed on the machine where the agent or user runs
  • A working inference.sh account and valid login
  • Network access so infsh can call the elevenlabs/stt app and (optionally) reach any remote audio URLs you provide

To install the CLI, follow the official instructions referenced in the skill:

  • CLI install docs: https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Once installed, authenticate:

infsh login

This sets up the credentials needed for subsequent infsh app run calls from the skill.

2. Installing the elevenlabs-stt skill

If you are using a skills-enabled environment that supports npx skills, you can add elevenlabs-stt directly from the inferen-sh/skills repository:

npx skills add https://github.com/inferen-sh/skills --skill elevenlabs-stt

This will:

  • Register the elevenlabs-stt skill by its slug
  • Make its configuration (including allowed tools and workflow logic) available to your agent runtime

If your environment manages skills differently, mirror the same repository and skill slug, ensuring the skill’s metadata (SKILL.md, metadata.json if present) is correctly loaded.

3. Core transcription workflow

Once the skill and CLI are installed, the underlying operation is a call to the elevenlabs/stt app via infsh.

A basic manual example (mirroring what the skill automates) looks like this:

# Transcribe a remote audio file
infsh app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'

This pattern is the foundation for how elevenlabs-stt works inside your agent. The skill:

  • Accepts your audio input (URL or path, depending on your integration)
  • Calls infsh app run elevenlabs/stt with JSON input
  • Returns structured JSON containing transcript text and timing information

Use this mental model when configuring prompts, tools, or pipelines around the skill.
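As a sketch, that mental model can be wrapped in a small script. Only the `audio` field appears in the documented example; the optional `model_id` field name below is an assumption, so check the `elevenlabs/stt` app's input schema before relying on it.

```python
import json
import subprocess


def build_stt_command(audio_url, model=None):
    """Build the infsh invocation for elevenlabs/stt.

    The "model_id" field name is an assumption; only "audio" appears
    in the documented example.
    """
    payload = {"audio": audio_url}
    if model:
        payload["model_id"] = model
    return ["infsh", "app", "run", "elevenlabs/stt", "--input", json.dumps(payload)]


def transcribe(audio_url, model=None):
    """Run the command and parse the JSON result (requires `infsh login`)."""
    cmd = build_stt_command(audio_url, model)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Keeping command construction separate from execution makes the wrapper easy to test without a live infsh session.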

4. Choosing models: Scribe v1 vs Scribe v2

The skill exposes the ElevenLabs Scribe v1 and Scribe v2 models:

  • Scribe v2 (scribe_v2) – Latest and highest accuracy (default). Best for most new projects.
  • Scribe v1 (scribe_v1) – Stable, proven version. Useful for consistency with existing workflows or when you have already validated behavior.

If your environment or agent allows passing through model parameters, you can select the model ID accordingly. Where no model is specified, expect Scribe v2 to be used by default as documented.

5. Practical usage patterns

Below are common ways to use elevenlabs-stt once installed.

Basic transcription

For straightforward meeting notes, podcasts, or lectures:

infsh app run elevenlabs/stt --input '{"audio": "https://meeting-recording.mp3"}'

Wrap this call in your agent workflow so that users can say things like:

  • “Transcribe this meeting recording with elevenlabs-stt.”
  • “Use elevenlabs-stt to turn this MP3 into a text transcript.”

The result is a structured transcript you can store, index, or summarize.

Speaker diarization and identification

If the upstream elevenlabs/stt app is configured for speaker diarization, the output JSON includes tokens or segments labeled by speaker.

In your agent prompts, you might specify instructions like:

  • “Run elevenlabs-stt and return speaker-separated transcript segments.”
  • “Group the transcript by speaker, preserving timestamps from elevenlabs-stt.”

This is especially useful for panel discussions, customer calls, or interview shows.
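A minimal post-processing sketch for speaker-separated output, assuming the result contains segments with `speaker`, `start`, `end`, and `text` fields (these field names are assumptions; the exact shape is defined by the `elevenlabs/stt` app's output schema):

```python
from itertools import groupby


def group_by_speaker(segments):
    """Merge consecutive segments from the same speaker into turns.

    Assumed segment shape: {"speaker": ..., "start": ..., "end": ..., "text": ...}
    """
    turns = []
    for speaker, run in groupby(segments, key=lambda s: s["speaker"]):
        run = list(run)
        turns.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": " ".join(s["text"] for s in run),
        })
    return turns
```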

Subtitle and caption generation

Because elevenlabs-stt outputs timestamps and word-level alignment (forced alignment), you can:

  • Convert segments into SRT or VTT caption files
  • Sync text with video tracks in post-production tools
  • Drive karaoke-style highlighting or lip-sync reference

In a workflow, you might:

  1. Call elevenlabs-stt on your audio track.
  2. Map the timing data into subtitle blocks.
  3. Export or feed the captions into your NLE or streaming platform.
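Step 2 above can be sketched as follows, again assuming segments carry `start`/`end` times in seconds and a `text` field (assumed names, not the documented schema):

```python
def to_srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render timed segments as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"
```

The same timing data maps to VTT with a `WEBVTT` header and `.` instead of `,` in timestamps.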

Audio event tagging

When audio event tagging is enabled in your calls to elevenlabs/stt, the output can mark music, silence, noise, or other events.

Use this to:

  • Mark cut points for editors
  • Skip non-speech segments when summarizing
  • Automatically detect segments where the main speaker is active
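The "skip non-speech segments" idea can be sketched like this, assuming each segment carries a `type` label such as `speech`, `music`, or `silence` (label names are assumptions about the app's output, not documented values):

```python
def speech_only(segments, non_speech=("music", "silence", "noise")):
    """Drop segments tagged as non-speech events before summarization."""
    return [s for s in segments if s.get("type", "speech") not in non_speech]
```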

6. File and repository structure

In the inferen-sh/skills repository, the elevenlabs-stt skill lives under:

  • tools/audio/elevenlabs-stt/

Key files to review if you are customizing or self-hosting the skill:

  • SKILL.md – Canonical description of the skill, its purpose, and triggers
  • Any associated rules/, resources/, or scripts/ directories (if present) for helper logic

These files document how the skill is wired to the infsh CLI and what prompts or constraints it expects.

FAQ

When should I use elevenlabs-stt instead of a simpler speech-to-text tool?

Use elevenlabs-stt when you need high accuracy, timestamps, and structure rather than just approximate text.

It is especially appropriate if your core job is:

  • Editing audio or video
  • Publishing podcasts or talking-head content
  • Creating captions and subtitles
  • Analyzing conversations with speaker labels and timing

If you only need casual transcripts without timing or speaker info, a lighter tool may be sufficient.

What accuracy and language coverage can I expect?

According to the skill description, ElevenLabs Scribe models provide:

  • 98%+ transcription accuracy in supported conditions
  • Coverage for 90+ languages with automatic language detection

Real-world performance depends on recording quality, accents, background noise, and microphone placement, but the models are positioned as high-accuracy options suitable for production use.

Do I need the inference.sh CLI to use elevenlabs-stt?

Yes. elevenlabs-stt is implemented around the inference.sh (infsh) CLI and the elevenlabs/stt app. The skill’s allowed tools explicitly list Bash with infsh commands.

If you cannot install or run infsh in your environment, you will not be able to use elevenlabs-stt as designed. In that case, you would need a different skill or a direct API integration outside this repository.

Can elevenlabs-stt handle local audio files, or only URLs?

The documentation example uses a remote URL:

infsh app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'

Inference.sh generally supports multiple input patterns, but the exact handling of local files depends on how your infsh environment is configured (e.g., upload semantics or mounted paths).

Within an agent, you can typically:

  • Provide a direct URL to hosted audio files, or
  • Use your runtime’s file handling to make local files accessible to infsh.

Check your own environment’s file-passing rules if you need strict local-only workflows.

Does elevenlabs-stt generate SRT or VTT files directly?

The skill itself integrates with the elevenlabs/stt app, which returns structured JSON with timestamps and alignment. The repo evidence focuses on JSON output, not direct SRT/VTT export.

You can, however:

  1. Take the JSON output from elevenlabs-stt.
  2. Map segments and timestamps to SRT or VTT blocks.
  3. Save that as subtitle files in your pipeline.

Many users wire this into simple scripts or agent post-processing steps.

How does forced alignment work in elevenlabs-stt?

Forced alignment uses the underlying Scribe models to align audio with text at the word level, returning precise timestamps per token or word.

This is useful when you:

  • Already have a script or show notes and want them aligned to the final recording
  • Need accurate lip-sync timing (for dubbing, karaoke, or caption highlighting)
  • Want to quickly locate where each line was spoken in the audio

The specifics of alignment output are controlled by the elevenlabs/stt app; elevenlabs-stt is the skill bridge that exposes it to your agent and CLI workflows.
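For the "locate where each line was spoken" use case, a sketch assuming word-level output of the form `{"word": ..., "start": ..., "end": ...}` (an assumed shape; verify against the app's actual alignment output):

```python
def find_phrase(words, phrase):
    """Return (start, end) of the first occurrence of phrase in aligned words.

    Matching is case-insensitive and ignores trailing punctuation.
    Returns None if the phrase is not found.
    """
    target = phrase.lower().split()
    tokens = [w["word"].lower().strip(".,!?") for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None
```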

Is elevenlabs-stt suitable for real-time streaming transcription?

The documentation and examples in the skill focus on file-based transcription via infsh app run with an audio input reference. There is no explicit mention of real-time streaming in the provided evidence.

As a result, elevenlabs-stt is best treated as a batch transcription tool for recorded audio files, not as a low-latency live captioning solution.

Where can I see or modify the elevenlabs-stt configuration?

You can explore the skill in the inferen-sh/skills GitHub repository:

  • Base repo: https://github.com/inferen-sh/skills
  • Skill path: tools/audio/elevenlabs-stt/

Start with SKILL.md to understand triggers, description, and usage. If your platform supports custom skills, you can fork and adapt the skill’s configuration, prompts, or allowed tools to match your environment.

Ratings & Reviews

No ratings yet