
dialogue-audio

by inferen-sh

Create realistic multi-speaker dialogue audio with Dia TTS and ElevenLabs via the inference.sh CLI. The dialogue-audio skill helps you control speakers, emotion, pacing, and conversation flow for podcasts, audiobooks, explainers, character scenes, and other conversational content.

Added: Mar 27, 2026
Category: Voice Generation
Install Command
npx skills add https://github.com/inferen-sh/skills --skill dialogue-audio

Overview

What the dialogue-audio skill does

The dialogue-audio skill turns written dialogue into natural-sounding multi-speaker audio using Dia TTS via the inference.sh (infsh) CLI, with support for ElevenLabs-quality voices behind the scenes.

It is designed for:

  • Two-character conversations
  • Podcast-style dialogues and interviews
  • Audiobook scenes with alternating speakers
  • Explainer content with host/guest formats
  • Character dialogue and voice acting prototypes

The skill focuses on:

  • Speaker separation using simple tags like [S1] and [S2]
  • Consistent voices per speaker in a session
  • Emotion and expression control through word choice and punctuation
  • Pacing and flow of back-and-forth conversation
  • Post-production guidance for integrating the audio into your media workflow

If you want an automated way to turn a two-person script into polished dialogue audio from the command line, dialogue-audio is built for that use case.

Who this skill is for

This skill is a good fit if you are:

  • A podcaster who wants to draft or simulate conversations
  • An audio producer or video editor adding voice tracks to timelines
  • An author or scriptwriter creating dialogue-heavy scenes
  • A developer or automation-focused creator who prefers CLI and reproducible workflows

It is not a great fit if you need:

  • More than two distinct speakers in a single generation
  • Complex sound design, music, or mixing handled automatically
  • A point-and-click GUI instead of a command-line tool

For those needs, you may want additional DAW tools or multi-speaker TTS services, then use dialogue-audio for the core two-voice conversation track.

Requirements at a glance

To use dialogue-audio effectively, you will need:

  • Access to the inference.sh CLI (infsh)
  • A terminal or command-line environment (macOS, Linux, or Windows with a shell)
  • Basic familiarity with editing text prompts and running CLI commands

How to Use

1. Install the dialogue-audio skill

You can add the dialogue-audio skill to your Agent environment using npx:

npx skills add https://github.com/inferen-sh/skills --skill dialogue-audio

This pulls the dialogue-audio configuration from the inferen-sh/skills repository and makes it available as a reusable workflow.

Next, ensure you have the inference.sh CLI (infsh) installed. Follow the official instructions:

  • CLI install instructions: https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Once installed, authenticate:

infsh login

2. Understand the core workflow

At its core, dialogue-audio uses the Dia TTS app via infsh:

infsh app run falai/dia-tts --input '{
  "prompt": "[S1] Have you tried the new feature yet? [S2] Not yet, but I heard it saves a ton of time. [S1] It really does. I cut my workflow in half. [S2] Okay, I am definitely trying it today."
}'

This command:

  • Calls the falai/dia-tts app
  • Sends a JSON input payload with a prompt
  • Uses [S1] and [S2] tags to mark speaker turns
  • Returns generated dialogue audio for the full exchange

The dialogue-audio skill wraps this pattern into an opinionated workflow, helping you structure prompts, handle two speakers cleanly, and iterate on expression and pacing.
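If your dialogue lives in a text file, you can build the JSON payload in the shell before invoking the CLI. The sketch below is illustrative: the file name is an assumption, and the final infsh call is shown but commented out so the snippet runs anywhere.

```shell
# Sketch: build the --input JSON for falai/dia-tts from a script file.
# dialogue.txt is a hypothetical file name; adapt as needed.
script_file="dialogue.txt"
printf '%s\n' '[S1] Hello there.' '[S2] Hi, ready when you are.' > "$script_file"

# JSON-escape backslashes and double quotes, then fold real newlines
# into literal \n so the whole conversation fits in one "prompt" string.
prompt=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$script_file" | awk '{printf "%s\\n", $0}')
payload=$(printf '{"prompt": "%s"}' "$prompt")
echo "$payload"

# Then hand the payload to the CLI (not executed in this sketch):
# infsh app run falai/dia-tts --input "$payload"
```

Keeping the script in a file makes it easy to version, diff, and re-generate as you iterate on the conversation.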

3. Using speaker tags correctly

Dia TTS relies on speaker tags to know who is talking:

  • [S1] — Speaker 1 (automatically assigned voice A)
  • [S2] — Speaker 2 (automatically assigned voice B)

Key rules:

  • Always start each turn with the appropriate tag
  • Tags must be uppercase: [S1], [S2] (not [s1] or [speaker1])
  • Maximum of 2 speakers per generation
  • Each speaker keeps a consistent voice throughout a session

Example of a clean, two-speaker prompt:

[S1] Welcome back to the show. Today we are talking about productivity hacks.
[S2] I am excited for this. Some of these tricks saved me hours every week.
[S1] Let’s start with batching tasks. Why does it work so well?
[S2] Because you stay in the same mental mode instead of constantly context switching.
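Before generating, it can help to lint the script for tag mistakes. The sketch below is not part of the skill itself; it simply counts lines that are missing a valid uppercase [S1] or [S2] prefix, using a hypothetical file name.

```shell
# Sketch: flag lines that do not begin with "[S1] " or "[S2] ".
# Lowercase tags like [s3] or untagged lines are counted as errors.
script_file="dialogue.txt"
printf '%s\n' \
  '[S1] Welcome back to the show.' \
  '[S2] I am excited for this.' \
  '[s3] Oops, lowercase and a third speaker.' > "$script_file"

bad=$(grep -cvE '^\[S[12]\] ' "$script_file")
echo "lines without a valid speaker tag: $bad"
```

A non-zero count means the generation will likely mis-assign a turn, so it is worth catching before you spend a run on it.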

4. Shaping emotion, tone, and pacing

The dialogue-audio skill encourages you to use natural writing cues to influence the generated audio:

  • Use punctuation (commas, ellipses, exclamation points) to shape pauses and emphasis
  • Use shorter sentences for faster, snappier dialogue
  • Use longer phrases or descriptive language for a calmer, more reflective delivery
  • Insert stage directions in parentheses sparingly to hint at emotion, for example:
[S1] (laughing softly) I honestly did not expect that to work.
[S2] Me neither, but I am glad we tried.

Test small changes and re-run the command to hear how tone and pacing shift.

5. Iterating your conversation flow

For better results with dialogue-audio:

  • Draft the full conversation in a text editor
  • Ensure all lines are correctly tagged and there are no stray tags
  • Keep turns concise; long monologues can sound less conversational
  • Break complex scenes into multiple generations if needed, then stitch them together in your editor

You can quickly iterate by tweaking the prompt and re-running the infsh app run falai/dia-tts command until the timing and emotional feel match your project.
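One way to break a long scene into separate generations is to split the script on blank lines. This awk one-liner is a sketch under assumed file names, not part of the skill:

```shell
# Sketch: split full-script.txt on blank lines into seg-1.txt, seg-2.txt, ...
printf '%s\n' \
  '[S1] Scene one, line one.' \
  '[S2] Scene one, line two.' \
  '' \
  '[S1] Scene two, line one.' \
  '[S2] Scene two, line two.' > full-script.txt

awk 'BEGIN { n = 1 } /^$/ { n++; next } { print > ("seg-" n ".txt") }' full-script.txt
ls seg-*.txt   # each segment can now be generated and stitched in post
```

Generating per-segment also means a bad take only costs you one short re-run instead of the whole scene.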

6. Post-production and integration

The output from Dia TTS is an audio file you can bring into your usual tools. The dialogue-audio skill is focused on voice generation, not full mixing, but you can:

  • Import the generated dialogue into a DAW (e.g., Audacity, Reaper, Logic Pro)
  • Add background music, sound effects, or room tone
  • Adjust volume levels, EQ, and compression to match your overall production
  • Sync the dialogue track with video in editors like Premiere Pro, Final Cut, or DaVinci Resolve

This makes dialogue-audio a good building block in a larger audio or video workflow: generate the core multi-speaker performance, then polish it with your usual tools.
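If you prefer to stitch segments on the command line before opening a DAW, ffmpeg's concat demuxer is one option. Everything here is an assumption (segment file names, ffmpeg being installed), and the ffmpeg call is commented out so the sketch is safe to run anywhere:

```shell
# Sketch: build a concat list for ffmpeg from per-segment audio files.
# seg-1.wav and seg-2.wav are hypothetical outputs from earlier runs.
printf "file '%s'\n" seg-1.wav seg-2.wav > concat.txt
cat concat.txt

# Join them losslessly (requires ffmpeg; not executed in this sketch):
# ffmpeg -f concat -safe 0 -i concat.txt -c copy full-dialogue.wav
```

The `-c copy` flag avoids re-encoding, so the joined file keeps the original audio quality.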

7. Files to review in the repository

After installation, you can inspect the skill definition in the inferen-sh/skills repo for more context:

  • SKILL.md — Main description, quick start, and usage notes for the dialogue-audio workflow

Use it as a reference when adapting the setup to your own automations or CI pipelines.


FAQ

Is dialogue-audio limited to only two speakers?

Yes. The dialogue-audio skill, via Dia TTS, is designed for up to two speakers per generation, using [S1] and [S2] tags. If your scene has more characters, you can either:

  • Focus the generation on two at a time, or
  • Split the script into multiple dialogue segments and combine them in post-production.

Do I need the inference.sh CLI to use dialogue-audio?

Yes. The dialogue-audio skill depends on the inference.sh CLI (infsh). You must install it, run infsh login, and then call infsh app run falai/dia-tts with your prompts. Without infsh, the underlying Dia TTS app cannot be accessed from this workflow.

Can I choose specific ElevenLabs voices for each speaker?

The repository documentation states that voices are automatically assigned per speaker: [S1] maps to one voice and [S2] to another, remaining consistent within a session. The skill does not document direct, per-speaker manual voice selection, so treat voice choice as managed by the Dia TTS / inference.sh configuration rather than by explicit IDs in your prompt.

How do I control emotion or intensity in the dialogue?

The dialogue-audio skill relies on prompt design and punctuation rather than explicit emotion sliders. You can:

  • Use expressive wording (e.g., "shouted", "whispered", "nervously")
  • Adjust punctuation (..., !, ?) to influence pause length and emphasis
  • Add brief parenthetical cues like (whispering) or (frustrated) when needed

Run small variations to hear how the model responds and settle on a style that sounds right for your project.

Is dialogue-audio suitable for long-form audiobooks?

Yes, for dialogue-heavy sections with two speakers, dialogue-audio can work well. For very long content:

  • Break the script into logical scenes or chapters
  • Generate audio in segments and organize them in your DAW
  • Ensure consistent tagging and tone across segments

If your audiobook has many narrators or complex narration styles, you may need additional TTS setups beyond this two-speaker-focused skill.

Can I automate dialogue-audio in a larger workflow?

Yes. Because dialogue-audio is built around the infsh CLI, it works well in scripted or automated environments:

  • Integrate infsh app run falai/dia-tts commands into shell scripts
  • Run generations from CI/CD pipelines or scheduled jobs
  • Combine with other skills from inferen-sh/skills for larger content pipelines

The skill is especially useful for developers and technical users who want reproducible, text-driven audio generation.
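A minimal batch loop might look like the sketch below. The echo keeps it a dry run; swap it for a real infsh invocation in your own pipeline. The segment file names are assumptions.

```shell
# Sketch: print (dry run) one infsh command per segment file.
printf '%s\n' '[S1] Hello.' '[S2] Hi.' > seg-1.txt

for seg in seg-*.txt; do
  # JSON-escape the segment and wrap it in the prompt payload.
  prompt=$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$seg" | awk '{printf "%s\\n", $0}')
  echo infsh app run falai/dia-tts --input "{\"prompt\": \"$prompt\"}"
done
```

From here you could redirect each run's output to a per-segment file and feed the results into the post-production steps above.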

When is dialogue-audio not the right choice?

Consider other options if you:

  • Need more than two distinct voices in a single pass
  • Prefer a GUI-only workflow without touching the command line
  • Want automatic mixing, music, or SFX instead of just dialogue generation

In those cases, pair dedicated audio tools or multi-speaker TTS services with your DAW, and use dialogue-audio only when a clean two-speaker conversation is the core requirement.

Where can I see the full configuration?

Open the dialogue-audio section in the inferen-sh/skills repository:

  • Repo: https://github.com/inferen-sh/skills
  • Skill path: tools/audio/dialogue-audio

Start with SKILL.md to understand the intended usage and any updated notes on the Dia TTS integration and CLI commands.
