
elevenlabs-stt

by inferen-sh

High-accuracy ElevenLabs speech-to-text via inference.sh CLI using Scribe v1/v2 models. Supports transcription, speaker diarization, audio event tagging, word-level timestamps, forced alignment, and subtitle generation for meetings, podcasts, and other audio workflows.

Stars: 0
Favorites: 0
Comments: 0
Added: Mar 27, 2026
Category: Audio Editing
Install Command
npx skills add https://github.com/inferen-sh/skills --skill elevenlabs-stt
Overview


What is elevenlabs-stt?

elevenlabs-stt is a speech-to-text skill that connects your agent or CLI workflows to ElevenLabs Scribe models via the inference.sh (infsh) CLI.

It focuses on high-accuracy, time-aligned audio transcription rather than general note-taking. The skill is designed for media workflows such as:

  • Cleaning up voice recordings for audio and video editing
  • Creating accurate subtitles and captions with timing
  • Producing podcast and interview transcripts
  • Generating lip-sync and karaoke timing through word-level alignment
  • Tagging audio events and identifying different speakers in a recording

Key capabilities

Backed by ElevenLabs Scribe v1/v2 models (via the elevenlabs/stt app on inference.sh), elevenlabs-stt provides:

  • Transcription of audio into structured text
  • Speaker diarization and speaker identification (who spoke when)
  • Audio event tagging (e.g., music, silence, background sounds)
  • Word-level timestamps and forced alignment to existing text
  • Subtitle-friendly output suitable for captions and post-production
  • Multilingual support across 90+ languages with auto-detection

The models are described as delivering 98%+ transcription accuracy in supported conditions, making this skill suitable for production-quality audio and video projects.

Who is elevenlabs-stt for?

elevenlabs-stt is a strong fit if you:

  • Work in audio or video post-production and need reliable transcripts
  • Produce podcasts, webinars, interviews, or lectures and want automated text output
  • Need time-aligned subtitles or caption files as part of your workflow
  • Build developer tools, agents, or pipelines that must call ElevenLabs STT from scripts
  • Want to keep everything in a CLI- and JSON-first environment

It is less suitable if you:

  • Need a purely browser-based, non-technical interface with no CLI
  • Only require casual note-taking from audio and do not care about timing, diarization, or data structures
  • Cannot install or use the infsh CLI where your agent runs

How it fits in your toolchain

elevenlabs-stt sits in the audio-editing and voice tooling layer of your stack:

  • Upstream: audio capture (Zoom recordings, OBS, phone audio, raw WAV/MP3)
  • Core: elevenlabs-stt + infsh to transcribe, diarize, align, and tag
  • Downstream: NLE timelines (Premiere, Resolve), caption workflows, search indexes, AI summarization, or QA agents

Because the skill is defined in the inferen-sh/skills repo, it integrates cleanly with other inference.sh-based tools, using Bash (infsh *) under the hood.

How to Use

1. Prerequisites and environment

Before using elevenlabs-stt as a skill, you need:

  • inference.sh CLI (infsh) installed on the machine where the agent or user runs
  • A working inference.sh account and valid login
  • Network access so infsh can call the elevenlabs/stt app and (optionally) reach any remote audio URLs you provide

To install the CLI, follow the official instructions referenced in the skill:

  • CLI install docs: https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Once installed, authenticate:

infsh login

This sets up the credentials needed for subsequent infsh app run calls from the skill.

2. Installing the elevenlabs-stt skill

If you are using a skills-enabled environment that supports npx skills, you can add elevenlabs-stt directly from the inferen-sh/skills repository:

npx skills add https://github.com/inferen-sh/skills --skill elevenlabs-stt

This will:

  • Register the elevenlabs-stt skill by its slug
  • Make its configuration (including allowed tools and workflow logic) available to your agent runtime

If your environment manages skills differently, mirror the same repository and skill slug, ensuring the skill’s metadata (SKILL.md, metadata.json if present) is correctly loaded.

3. Core transcription workflow

Once the skill and CLI are installed, the underlying operation is a call to the elevenlabs/stt app via infsh.

A basic manual example (mirroring what the skill automates) looks like this:

# Transcribe a remote audio file
infsh app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'

This pattern is the foundation for how elevenlabs-stt works inside your agent. The skill:

  • Accepts your audio input (URL or path, depending on your integration)
  • Calls infsh app run elevenlabs/stt with JSON input
  • Returns structured JSON containing transcript text and timing information

Use this mental model when configuring prompts, tools, or pipelines around the skill.
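As a sketch, that mental model can be wrapped in a small script. Only the `audio` field appears in the documented example; the optional `model_id` field name below is an assumption, so check the `elevenlabs/stt` app's input schema before relying on it.

```python
import json
import subprocess


def build_stt_command(audio_url, model=None):
    """Build the infsh invocation for elevenlabs/stt.

    The "model_id" field name is an assumption; only "audio" appears
    in the documented example.
    """
    payload = {"audio": audio_url}
    if model:
        payload["model_id"] = model
    return ["infsh", "app", "run", "elevenlabs/stt", "--input", json.dumps(payload)]


def transcribe(audio_url, model=None):
    """Run the command and parse the JSON result (requires `infsh login`)."""
    cmd = build_stt_command(audio_url, model)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Keeping command construction separate from execution makes the wrapper easy to test without a live infsh session.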

4. Choosing models: Scribe v1 vs Scribe v2

The skill exposes the ElevenLabs Scribe v1 and Scribe v2 models:

  • Scribe v2 (scribe_v2) – Latest and highest accuracy (default). Best for most new projects.
  • Scribe v1 (scribe_v1) – Stable, proven version. Useful for consistency with existing workflows or when you have already validated behavior.

If your environment or agent allows passing through model parameters, you can select the model ID accordingly. Where no model is specified, expect Scribe v2 to be used by default as documented.

5. Practical usage patterns

Below are common ways to use elevenlabs-stt once installed.

Basic transcription

For straightforward meeting notes, podcasts, or lectures:

infsh app run elevenlabs/stt --input '{"audio": "https://meeting-recording.mp3"}'

Wrap this call in your agent workflow so that users can say things like:

  • “Transcribe this meeting recording with elevenlabs-stt.”
  • “Use elevenlabs-stt to turn this MP3 into a text transcript.”

The result is a structured transcript you can store, index, or summarize.

Speaker diarization and identification

If the upstream elevenlabs/stt app is configured for speaker diarization, the output JSON includes tokens or segments labeled by speaker.

In your agent prompts, you might specify instructions like:

  • “Run elevenlabs-stt and return speaker-separated transcript segments.”
  • “Group the transcript by speaker, preserving timestamps from elevenlabs-stt.”

This is especially useful for panel discussions, customer calls, or interview shows.
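A minimal post-processing sketch for speaker-separated output, assuming the result contains segments with `speaker`, `start`, `end`, and `text` fields (these field names are assumptions; the exact shape is defined by the `elevenlabs/stt` app's output schema):

```python
from itertools import groupby


def group_by_speaker(segments):
    """Merge consecutive segments from the same speaker into turns.

    Assumed segment shape: {"speaker": ..., "start": ..., "end": ..., "text": ...}
    """
    turns = []
    for speaker, run in groupby(segments, key=lambda s: s["speaker"]):
        run = list(run)
        turns.append({
            "speaker": speaker,
            "start": run[0]["start"],
            "end": run[-1]["end"],
            "text": " ".join(s["text"] for s in run),
        })
    return turns
```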

Subtitle and caption generation

Because elevenlabs-stt outputs timestamps and word-level alignment (forced alignment), you can:

  • Convert segments into SRT or VTT caption files
  • Sync text with video tracks in post-production tools
  • Drive karaoke-style highlighting or lip-sync reference

In a workflow, you might:

  1. Call elevenlabs-stt on your audio track.
  2. Map the timing data into subtitle blocks.
  3. Export or feed the captions into your NLE or streaming platform.
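Step 2 above can be sketched as follows, again assuming segments carry `start`/`end` times in seconds and a `text` field (assumed names, not the documented schema):

```python
def to_srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render timed segments as an SRT subtitle string."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{to_srt_timestamp(seg['start'])} --> "
            f"{to_srt_timestamp(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"
```

The same timing data maps to VTT with a `WEBVTT` header and `.` instead of `,` in timestamps.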

Audio event tagging

When audio event tagging is enabled in your calls to elevenlabs/stt, the output can mark music, silence, noise, or other events.

Use this to:

  • Mark cut points for editors
  • Skip non-speech segments when summarizing
  • Automatically detect segments where the main speaker is active
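The "skip non-speech segments" idea can be sketched like this, assuming each segment carries a `type` label such as `speech`, `music`, or `silence` (label names are assumptions about the app's output, not documented values):

```python
def speech_only(segments, non_speech=("music", "silence", "noise")):
    """Drop segments tagged as non-speech events before summarization."""
    return [s for s in segments if s.get("type", "speech") not in non_speech]
```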

6. File and repository structure

In the inferen-sh/skills repository, the elevenlabs-stt skill lives under:

  • tools/audio/elevenlabs-stt/

Key files to review if you are customizing or self-hosting the skill:

  • SKILL.md – Canonical description of the skill, its purpose, and triggers
  • Any associated rules/, resources/, or scripts/ directories (if present) for helper logic

These files document how the skill is wired to the infsh CLI and what prompts or constraints it expects.

FAQ

When should I use elevenlabs-stt instead of a simpler speech-to-text tool?

Use elevenlabs-stt when you need high accuracy, timestamps, and structure rather than just approximate text.

It is especially appropriate if your core job is:

  • Editing audio or video
  • Publishing podcasts or talking-head content
  • Creating captions and subtitles
  • Analyzing conversations with speaker labels and timing

If you only need casual transcripts without timing or speaker info, a lighter tool may be sufficient.

What accuracy and language coverage can I expect?

According to the skill description, ElevenLabs Scribe models provide:

  • 98%+ transcription accuracy in supported conditions
  • Coverage for 90+ languages with automatic language detection

Real-world performance depends on recording quality, accents, background noise, and microphone placement, but the models are positioned as high-accuracy options suitable for production use.

Do I need the inference.sh CLI to use elevenlabs-stt?

Yes. elevenlabs-stt is implemented around the inference.sh (infsh) CLI and the elevenlabs/stt app. The skill’s allowed tools explicitly list Bash with infsh commands.

If you cannot install or run infsh in your environment, you will not be able to use elevenlabs-stt as designed. In that case, you would need a different skill or a direct API integration outside this repository.

Can elevenlabs-stt handle local audio files, or only URLs?

The documentation example uses a remote URL:

infsh app run elevenlabs/stt --input '{"audio": "https://audio.mp3"}'

Inference.sh generally supports multiple input patterns, but the exact handling of local files depends on how your infsh environment is configured (e.g., upload semantics or mounted paths).

Within an agent, you can typically:

  • Provide a direct URL to hosted audio files, or
  • Use your runtime’s file handling to make local files accessible to infsh.

Check your own environment’s file-passing rules if you need strict local-only workflows.

Does elevenlabs-stt generate SRT or VTT files directly?

The skill itself integrates with the elevenlabs/stt app, which returns structured JSON with timestamps and alignment. The repo evidence focuses on JSON output, not direct SRT/VTT export.

You can, however:

  1. Take the JSON output from elevenlabs-stt.
  2. Map segments and timestamps to SRT or VTT blocks.
  3. Save that as subtitle files in your pipeline.

Many users wire this into simple scripts or agent post-processing steps.

How does forced alignment work in elevenlabs-stt?

Forced alignment uses the underlying Scribe models to align audio with text at the word level, returning precise timestamps per token or word.

This is useful when you:

  • Already have a script or show notes and want them aligned to the final recording
  • Need accurate lip-sync timing (for dubbing, karaoke, or caption highlighting)
  • Want to quickly locate where each line was spoken in the audio

The specifics of alignment output are controlled by the elevenlabs/stt app; elevenlabs-stt is the skill bridge that exposes it to your agent and CLI workflows.
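For the "locate where each line was spoken" use case, a sketch assuming word-level output of the form `{"word": ..., "start": ..., "end": ...}` (an assumed shape; verify against the app's actual alignment output):

```python
def find_phrase(words, phrase):
    """Return (start, end) of the first occurrence of phrase in aligned words.

    Matching is case-insensitive and ignores trailing punctuation.
    Returns None if the phrase is not found.
    """
    target = phrase.lower().split()
    tokens = [w["word"].lower().strip(".,!?") for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None
```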

Is elevenlabs-stt suitable for real-time streaming transcription?

The documentation and examples in the skill focus on file-based transcription via infsh app run with an audio input reference. There is no explicit mention of real-time streaming in the provided evidence.

As a result, elevenlabs-stt is best treated as a batch transcription tool for recorded audio files, not as a low-latency live captioning solution.

Where can I see or modify the elevenlabs-stt configuration?

You can explore the skill in the inferen-sh/skills GitHub repository:

  • Base repo: https://github.com/inferen-sh/skills
  • Skill path: tools/audio/elevenlabs-stt/

Start with SKILL.md to understand triggers, description, and usage. If your platform supports custom skills, you can fork and adapt the skill’s configuration, prompts, or allowed tools to match your environment.

Ratings & Reviews

No ratings yet