
dialogue-audio

by inferen-sh

Create realistic multi-speaker dialogue audio with Dia TTS and ElevenLabs via the inference.sh CLI. The dialogue-audio skill helps you control speakers, emotion, pacing, and conversation flow for podcasts, audiobooks, explainers, character scenes, and other conversational content.

Added: Mar 27, 2026
Category: Voice Generation
Install Command
npx skills add https://github.com/inferen-sh/skills --skill dialogue-audio

Overview

What the dialogue-audio skill does

The dialogue-audio skill turns written dialogue into natural-sounding multi-speaker audio using Dia TTS via the inference.sh (infsh) CLI, with support for ElevenLabs-quality voices behind the scenes.

It is designed for:

  • Two-character conversations
  • Podcast-style dialogues and interviews
  • Audiobook scenes with alternating speakers
  • Explainer content with host/guest formats
  • Character dialogue and voice acting prototypes

The skill focuses on:

  • Speaker separation using simple tags like [S1] and [S2]
  • Consistent voices per speaker in a session
  • Emotion and expression control through word choice and punctuation
  • Pacing and flow of back-and-forth conversation
  • Post-production guidance for integrating the audio into your media workflow

If you want an automated way to turn a two-person script into polished dialogue audio from the command line, dialogue-audio is built for that use case.

Who this skill is for

This skill is a good fit if you are:

  • A podcaster who wants to draft or simulate conversations
  • An audio producer or video editor adding voice tracks to timelines
  • An author or scriptwriter creating dialogue-heavy scenes
  • A developer or automation-focused creator who prefers CLI and reproducible workflows

It is not a great fit if you need:

  • More than two distinct speakers in a single generation
  • Complex sound design, music, or mixing handled automatically
  • A point-and-click GUI instead of a command-line tool

For those needs, you may want additional DAW tools or multi-speaker TTS services, then use dialogue-audio for the core two-voice conversation track.

Requirements at a glance

To use dialogue-audio effectively, you will need:

  • Access to the inference.sh CLI (infsh)
  • A terminal or command-line environment (macOS, Linux, or Windows with a shell)
  • Basic familiarity with editing text prompts and running CLI commands

How to Use

1. Install the dialogue-audio skill

You can add the dialogue-audio skill to your Agent environment using npx:

npx skills add https://github.com/inferen-sh/skills --skill dialogue-audio

This pulls the dialogue-audio configuration from the inferen-sh/skills repository and makes it available as a reusable workflow.

Next, ensure you have the inference.sh CLI (infsh) installed. Follow the official instructions:

  • CLI install instructions: https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Once installed, authenticate:

infsh login

2. Understand the core workflow

At its core, dialogue-audio uses the Dia TTS app via infsh:

infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Have you tried the new feature yet? [S2] Not yet, but I heard it saves a ton of time. [S1] It really does. I cut my workflow in half. [S2] Okay, I am definitely trying it today."
}'

This command:

  • Calls the falai/dia-tts app
  • Sends a JSON input payload with a prompt
  • Uses [S1] and [S2] tags to mark speaker turns
  • Returns generated dialogue audio for the full exchange

The dialogue-audio skill wraps this pattern into an opinionated workflow, helping you structure prompts, handle two speakers cleanly, and iterate on expression and pacing.
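If your dialogue lives in a text file, you can build the JSON payload in the shell before invoking the CLI. The sketch below is illustrative: the file name is an assumption, and the final infsh call is shown but commented out so the snippet runs anywhere.

```shell
# Sketch: build the --input JSON for falai/dia-tts from a script file.
# dialogue.txt is a hypothetical file name; adapt as needed.
script_file="dialogue.txt"
printf '%s\n' '[S1] Hello there.' '[S2] Hi, ready when you are.' > "$script_file"

# JSON-escape backslashes and double quotes, then fold real newlines
# into literal \n so the whole conversation fits in one "prompt" string.
prompt=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$script_file" | awk '{printf "%s\\n", $0}')
payload=$(printf '{"prompt": "%s"}' "$prompt")
echo "$payload"

# Then hand the payload to the CLI (not executed in this sketch):
# infsh app run falai/dia-tts --input "$payload"
```

Keeping the script in a file makes it easy to version, diff, and re-generate as you iterate on the conversation.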

3. Using speaker tags correctly

Dia TTS relies on speaker tags to know who is talking:

  • [S1] — Speaker 1 (automatically assigned voice A)
  • [S2] — Speaker 2 (automatically assigned voice B)

Key rules:

  • Always start each turn with the appropriate tag
  • Tags must be uppercase: [S1], [S2] (not [s1] or [speaker1])
  • Maximum of 2 speakers per generation
  • Each speaker keeps a consistent voice throughout a session

Example of a clean, two-speaker prompt:

[S1] Welcome back to the show. Today we are talking about productivity hacks.
[S2] I am excited for this. Some of these tricks saved me hours every week.
[S1] Let’s start with batching tasks. Why does it work so well?
[S2] Because you stay in the same mental mode instead of constantly context switching.
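Before generating, it can help to lint the script for tag mistakes. The sketch below is not part of the skill itself; it simply counts lines that are missing a valid uppercase [S1] or [S2] prefix, using a hypothetical file name.

```shell
# Sketch: flag lines that do not begin with "[S1] " or "[S2] ".
# Lowercase tags like [s3] or untagged lines are counted as errors.
script_file="dialogue.txt"
printf '%s\n' \
  '[S1] Welcome back to the show.' \
  '[S2] I am excited for this.' \
  '[s3] Oops, lowercase and a third speaker.' > "$script_file"

bad=$(grep -cvE '^\[S[12]\] ' "$script_file")
echo "lines without a valid speaker tag: $bad"
```

A non-zero count means the generation will likely mis-assign a turn, so it is worth catching before you spend a run on it.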

4. Shaping emotion, tone, and pacing

The dialogue-audio skill encourages you to use natural writing cues to influence the generated audio:

  • Use punctuation (commas, ellipses, exclamation points) to shape pauses and emphasis
  • Use shorter sentences for faster, snappier dialogue
  • Use longer phrases or descriptive language for a calmer, more reflective delivery
  • Insert stage directions in parentheses sparingly to hint at emotion, for example:
[S1] (laughing softly) I honestly did not expect that to work.
[S2] Me neither, but I am glad we tried.

Test small changes and re-run the command to hear how tone and pacing shift.

5. Iterating your conversation flow

For better results with dialogue-audio:

  • Draft the full conversation in a text editor
  • Ensure all lines are correctly tagged and there are no stray tags
  • Keep turns concise; long monologues can sound less conversational
  • Break complex scenes into multiple generations if needed, then stitch them together in your editor

You can quickly iterate by tweaking the prompt and re-running the infsh app run falai/dia-tts command until the timing and emotional feel match your project.
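One way to break a long scene into separate generations is to split the script on blank lines. This awk one-liner is a sketch under assumed file names, not part of the skill:

```shell
# Sketch: split full-script.txt on blank lines into seg-1.txt, seg-2.txt, ...
printf '%s\n' \
  '[S1] Scene one, line one.' \
  '[S2] Scene one, line two.' \
  '' \
  '[S1] Scene two, line one.' \
  '[S2] Scene two, line two.' > full-script.txt

awk 'BEGIN { n = 1 } /^$/ { n++; next } { print > ("seg-" n ".txt") }' full-script.txt
ls seg-*.txt   # each segment can now be generated and stitched in post
```

Generating per-segment also means a bad take only costs you one short re-run instead of the whole scene.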

6. Post-production and integration

The output from Dia TTS is an audio file you can bring into your usual tools. The dialogue-audio skill is focused on voice generation, not full mixing, but you can:

  • Import the generated dialogue into a DAW (e.g., Audacity, Reaper, Logic Pro)
  • Add background music, sound effects, or room tone
  • Adjust volume levels, EQ, and compression to match your overall production
  • Sync the dialogue track with video in editors like Premiere Pro, Final Cut, or DaVinci Resolve

This makes dialogue-audio a good building block in a larger audio or video workflow: generate the core multi-speaker performance, then polish it with your usual tools.
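If you prefer to stitch segments on the command line before opening a DAW, ffmpeg's concat demuxer is one option. Everything here is an assumption (segment file names, ffmpeg being installed), and the ffmpeg call is commented out so the sketch is safe to run anywhere:

```shell
# Sketch: build a concat list for ffmpeg from per-segment audio files.
# seg-1.wav and seg-2.wav are hypothetical outputs from earlier runs.
printf "file '%s'\n" seg-1.wav seg-2.wav > concat.txt
cat concat.txt

# Join them losslessly (requires ffmpeg; not executed in this sketch):
# ffmpeg -f concat -safe 0 -i concat.txt -c copy full-dialogue.wav
```

The `-c copy` flag avoids re-encoding, so the joined file keeps the original audio quality.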

7. Files to review in the repository

After installation, you can inspect the skill definition in the inferen-sh/skills repo for more context:

  • SKILL.md — Main description, quick start, and usage notes for the dialogue-audio workflow

Use it as a reference when adapting the setup to your own automations or CI pipelines.


FAQ

Is dialogue-audio limited to only two speakers?

Yes. The dialogue-audio skill, via Dia TTS, is designed for up to two speakers per generation, using [S1] and [S2] tags. If your scene has more characters, you can either:

  • Focus the generation on two at a time, or
  • Split the script into multiple dialogue segments and combine them in post-production.

Do I need the inference.sh CLI to use dialogue-audio?

Yes. The dialogue-audio skill depends on the inference.sh CLI (infsh). You must install it, run infsh login, and then call infsh app run falai/dia-tts with your prompts. Without infsh, the underlying Dia TTS app cannot be accessed from this workflow.

Can I choose specific ElevenLabs voices for each speaker?

The repository documentation states that voices are automatically assigned per speaker: [S1] maps to one voice and [S2] to another, remaining consistent within a session. The skill does not document direct, per-speaker manual voice selection, so treat voice choice as managed by the Dia TTS / inference.sh configuration rather than by explicit IDs in your prompt.

How do I control emotion or intensity in the dialogue?

The dialogue-audio skill relies on prompt design and punctuation rather than explicit emotion sliders. You can:

  • Use expressive wording (e.g., "shouted", "whispered", "nervously")
  • Adjust punctuation (..., !, ?) to influence pause length and emphasis
  • Add brief parenthetical cues like (whispering) or (frustrated) when needed

Run small variations to hear how the model responds and settle on a style that sounds right for your project.

Is dialogue-audio suitable for long-form audiobooks?

Yes, for dialogue-heavy sections with two speakers, dialogue-audio can work well. For very long content:

  • Break the script into logical scenes or chapters
  • Generate audio in segments and organize them in your DAW
  • Ensure consistent tagging and tone across segments

If your audiobook has many narrators or complex narration styles, you may need additional TTS setups beyond this two-speaker-focused skill.

Can I automate dialogue-audio in a larger workflow?

Yes. Because dialogue-audio is built around the infsh CLI, it works well in scripted or automated environments:

  • Integrate infsh app run falai/dia-tts commands into shell scripts
  • Run generations from CI/CD pipelines or scheduled jobs
  • Combine with other skills from inferen-sh/skills for larger content pipelines

The skill is especially useful for developers and technical users who want reproducible, text-driven audio generation.
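A minimal batch loop might look like the sketch below. The echo keeps it a dry run; swap it for a real infsh invocation in your own pipeline. The segment file names are assumptions.

```shell
# Sketch: print (dry run) one infsh command per segment file.
printf '%s\n' '[S1] Hello.' '[S2] Hi.' > seg-1.txt

for seg in seg-*.txt; do
  # JSON-escape the segment and wrap it in the prompt payload.
  prompt=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$seg" | awk '{printf "%s\\n", $0}')
  echo infsh app run falai/dia-tts --input "{\"prompt\": \"$prompt\"}"
done
```

From here you could redirect each run's output to a per-segment file and feed the results into the post-production steps above.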

When is dialogue-audio not the right choice?

Consider other options if you:

  • Need more than two distinct voices in a single pass
  • Prefer a GUI-only workflow without touching the command line
  • Want automatic mixing, music, or SFX instead of just dialogue generation

In those cases, pair dedicated audio tools or multi-speaker TTS services with your DAW, and use dialogue-audio only when a clean two-speaker conversation is the core requirement.

Where can I see the full configuration?

Open the dialogue-audio section in the inferen-sh/skills repository:

  • Repo: https://github.com/inferen-sh/skills
  • Skill path: tools/audio/dialogue-audio

Start with SKILL.md to understand the intended usage and any updated notes on the Dia TTS integration and CLI commands.
