
ai-voice-cloning

by inferen-sh

ai-voice-cloning is an inference.sh-based skill for AI voice generation, text-to-speech, and voice cloning from the CLI. It wraps ElevenLabs, Kokoro TTS, DIA, Chatterbox, Higgs, and VibeVoice models for natural speech, multi-voice narration, and voice transformation for audio and video projects.

Category: Voice Generation
Install Command
npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning
Overview

What is ai-voice-cloning?

ai-voice-cloning is a CLI-focused AI voice generation and voice cloning skill built on top of the inference.sh platform. It lets you call text-to-speech and voice transformation models from the command line, including ElevenLabs, Kokoro TTS, DIA, Chatterbox, Higgs, and VibeVoice.

The skill is defined in the inferen-sh/skills repository and is designed to be embedded into agent workflows that can call Bash via infsh (the inference.sh CLI). It focuses on generating natural-sounding speech and transforming existing voice recordings, rather than model training or dataset management.

Key capabilities

  • Text-to-speech (TTS) from the CLI using infsh app run ...
  • Multiple AI voice models in one place (e.g., elevenlabs/tts, infsh/kokoro-tts)
  • Voice cloning / voice changing for existing recordings via ElevenLabs Voice Changer
  • Support for many voices and languages (via ElevenLabs models, per the upstream description)
  • Long-form narration suitable for voiceovers, audiobooks, and podcasts
  • Conversation-style and expressive reads using models tuned for natural speech

Because ai-voice-cloning is a skill definition rather than a standalone app, you interact with it through the inference.sh CLI and any agents or tools that are allowed to run Bash commands.

Who is ai-voice-cloning for?

This skill is a good fit if you:

  • Work with audio or video and need fast, scripted voice generation
  • Build AI agents, CLIs, or automation that should speak or narrate
  • Produce voiceovers, explainers, tutorials, or training videos
  • Want ElevenLabs-quality voices and other specialized TTS models behind a single CLI
  • Prefer command-line workflows over web GUIs

It is less suitable if you:

  • Need a purely graphical interface with no CLI usage
  • Want to train custom models from raw audio datasets (not covered by this skill)
  • Require in-browser or on-device operation without calling the inference.sh service

Typical use cases

  • Generating narration tracks for YouTube or marketing videos
  • Creating audiobook or podcast speech from text scripts
  • Producing multiple character voices for dialogue and conversation
  • Applying voice changing to existing recordings using ElevenLabs Voice Changer
  • Adding audio prompts and system voices to agents, bots, and interactive tools

How to Use

1. Prerequisites and installation options

To use ai-voice-cloning you need:

  • Access to the inference.sh CLI (infsh)
  • Network connectivity to inference.sh APIs
  • A shell environment where Bash commands are allowed

You can integrate the skill into your agent environment using:

npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning

This pulls the skill definition from inferen-sh/skills and registers it so your agent can call the associated tools (notably Bash with infsh).

For direct CLI use outside of an agent, install the inference.sh CLI itself. The skill’s SKILL.md links to CLI install instructions at:

  • https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Follow that document to install infsh on your system.

2. Log in to inference.sh

Once infsh is installed, authenticate:

infsh login

Follow the prompts to log in or configure your credentials as described in the CLI install guide.

3. Quick start: generate speech with Kokoro TTS

The SKILL.md provides a simple Kokoro TTS example. After logging in, you can generate speech with:

infsh app run infsh/kokoro-tts --input '{
  "prompt": "Hello! This is an AI-generated voice that sounds natural and engaging.",
  "voice": "af_sarah"
}'

What this does:

  • Calls the infsh/kokoro-tts app
  • Sends JSON input with a prompt (the text to read) and a voice selection
  • Produces synthesized speech as output (see CLI docs for output paths or streaming behavior)

You can adapt this pattern to different prompts and supported voices.
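As a sketch of how that adaptation might be scripted, the loop below prints one `kokoro-tts` command per voice as a dry run so you can inspect the JSON before executing anything. Note that only `af_sarah` comes from the quick start above; the other voice names are hypothetical placeholders, not confirmed Kokoro voice IDs.

```shell
#!/bin/sh
# Dry-run sketch: print one kokoro-tts command per voice.
# Only "af_sarah" is taken from the quick start; "voice_two" and
# "voice_three" are placeholders -- check the app's docs for real IDs.

# build_tts_cmd VOICE PROMPT -> prints the full infsh command
build_tts_cmd() {
  printf "infsh app run infsh/kokoro-tts --input '{\"prompt\": \"%s\", \"voice\": \"%s\"}'\n" \
    "$2" "$1"
}

for VOICE in af_sarah voice_two voice_three; do
  build_tts_cmd "$VOICE" "Hello! This is an AI-generated voice."
done
```

Once the voice IDs are confirmed, you can execute a printed command directly, e.g. `build_tts_cmd af_sarah "Hello" | sh`.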

4. Using different models (ElevenLabs, DIA, and more)

The SKILL.md lists available models in an Available Models table. From the visible excerpt, you can expect entries similar to:

  • ElevenLabs TTS – App ID: elevenlabs/tts
  • ElevenLabs Voice Changer – App ID: elevenlabs/voice-changer
  • Kokoro TTS – App ID: infsh/kokoro-tts
  • DIA – App ID starting with infsh/dia-...
  • Other models like Chatterbox, Higgs, and VibeVoice are also referenced in the skill description.

To call a different app, change the App ID in your CLI command. For example, a typical pattern for TTS with ElevenLabs might look like:

infsh app run elevenlabs/tts --input '{
  "text": "This audio was generated using the ai-voice-cloning skill.",
  "voice": "some_voice_id"
}'

Use the repository documentation and model-specific README content (if present) to confirm the exact input schema for each app, as different models may use different field names like prompt, text, or voice_id.
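One way to keep those schema differences out of your scripts is a small helper that picks the field name per app. The mapping below is an assumption inferred from the examples in this document (`prompt` for Kokoro, `text` for ElevenLabs); verify it against each app's README before relying on it.

```shell
#!/bin/sh
# tts_input APP TEXT VOICE -> prints a JSON input body using the text
# field name each app is assumed to expect. This mapping is a guess
# based on the examples above; confirm it with each app's docs.
tts_input() {
  case "$1" in
    infsh/kokoro-tts) field="prompt" ;;
    elevenlabs/tts)   field="text"   ;;
    *)                field="text"   ;;  # default assumption
  esac
  printf '{"%s": "%s", "voice": "%s"}' "$field" "$2" "$3"
}

# Usage sketch:
#   infsh app run elevenlabs/tts --input "$(tts_input elevenlabs/tts 'Hello' some_voice_id)"
tts_input infsh/kokoro-tts "Hello" af_sarah
echo
```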

5. Voice changing / voice cloning with ElevenLabs Voice Changer

The skill description explicitly includes ElevenLabs Voice Changer (App ID elevenlabs/voice-changer) for transforming existing recordings. A typical CLI call will:

  1. Reference an input audio file (your original recording)
  2. Specify target voice or settings
  3. Output a transformed audio file

A generic pattern might look like this:

infsh app run elevenlabs/voice-changer --input '{
  "audio_url": "https://.../your-input-audio.wav",
  "voice": "target_voice_id"
}'

Check the inference.sh app documentation to confirm the exact fields and supported formats.

6. Integrating ai-voice-cloning into agents

When you add ai-voice-cloning as a skill using npx skills add, an agent platform that understands the inferen-sh/skills format can:

  • See that Bash (infsh *) is an allowed tool
  • Use the examples and description from SKILL.md as guidance
  • Automatically generate appropriate infsh app run ... commands to create or transform audio

To tune the behavior for your agent:

  1. Open SKILL.md in the tools/audio/ai-voice-cloning directory.
  2. Review any examples, available model tables, or notes about use cases.
  3. Add your own prompt patterns, voice choices, or post-processing steps in your agent configuration or orchestration layer.

7. Files to inspect in the repository

For a deeper understanding of how the skill is defined and how it should be used:

  • tools/audio/ai-voice-cloning/SKILL.md – Core description, quick start, and model list
  • Root-level docs like README.md and cli-install.md – General inference.sh and CLI setup guidance

There may also be additional docs in the tools folder for broader tooling context.


FAQ

Is ai-voice-cloning a standalone app or a skill definition?

ai-voice-cloning is a skill definition inside the inferen-sh/skills repository. It describes how an agent can use the inference.sh CLI (infsh) for AI voice generation and voice cloning. You do not get a GUI application; instead, you get a clear way to call TTS and voice changer models from the command line or from agent workflows that can execute Bash.

What do I need installed to use ai-voice-cloning?

You need:

  • The inference.sh CLI (infsh) installed and accessible in your shell
  • Valid authentication for inference.sh (set up via infsh login)
  • An environment that allows Bash commands (for example, a local terminal or an agent runtime that exposes Bash)

Optionally, if you are integrating this into an agent platform that supports the skills format, install the skill with:

npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning

Which AI voice models are supported?

From the skill description and SKILL.md, ai-voice-cloning is designed to work with multiple models available via inference.sh, including:

  • ElevenLabs TTS – elevenlabs/tts
  • ElevenLabs Voice Changer – elevenlabs/voice-changer
  • Kokoro TTS – infsh/kokoro-tts
  • DIA TTS apps (App IDs starting with infsh/dia-...)
  • Additional models such as Chatterbox, Higgs, and VibeVoice mentioned in the description

Refer to the Available Models table in SKILL.md and inference.sh documentation for the current, complete list and their parameters.

Can ai-voice-cloning handle long-form narration?

Yes. The skill is explicitly described as suitable for long-form narration and use cases like audiobooks, podcasts, and video narration. That said, long-form handling details (such as chunking, maximum text length, and stitching behavior) depend on each underlying model’s limits and the inference.sh runtime. If you plan to process very long scripts, test with smaller sections first and consult model documentation.
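If you do need to process a long script, a simple pre-chunking pass keeps each request within model limits. The sketch below splits a script into one request per line and only prints the commands (a dry run); the one-line-per-chunk granularity, the `af_sarah` voice, and the stitching step (e.g. concatenating the resulting audio files afterwards) are all assumptions to adapt.

```shell
#!/bin/sh
# Sketch: synthesize a long script one line (sentence) at a time.
# Dry run only -- the infsh commands are printed, not executed.
SCRIPT_FILE="demo_script.txt"
printf '%s\n' "First sentence of the narration." \
              "Second sentence of the narration." > "$SCRIPT_FILE"

N=0
while IFS= read -r LINE; do
  N=$((N + 1))
  # Drop the leading echo/label to actually run each chunk.
  echo "chunk $N: infsh app run infsh/kokoro-tts --input '{\"prompt\": \"$LINE\", \"voice\": \"af_sarah\"}'"
done < "$SCRIPT_FILE"
echo "total chunks: $N"
```

In practice, point `SCRIPT_FILE` at your real script and collect each chunk's audio output for concatenation in a separate step.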

How is this different from using ElevenLabs or other providers directly?

ai-voice-cloning:

  • Uses the inference.sh CLI as a unified interface
  • Lets you switch between multiple TTS and voice changer models with similar infsh app run ... commands
  • Integrates naturally into agent skills, Bash scripts, and automated workflows

If you already use a provider’s native API directly, ai-voice-cloning may still be useful when you want:

  • A single CLI that abstracts multiple providers and models
  • Easier integration with agent frameworks that understand the skills format

Does ai-voice-cloning support real-time streaming audio?

The SKILL.md excerpt focuses on batch-style commands (infsh app run ...) and does not explicitly describe real-time streaming behavior. Any streaming or low-latency options depend on the specific apps on inference.sh, not on the skill wrapper itself. Check the inference.sh documentation for the models you plan to use if real-time output is important for your use case.

What output format do I get from ai-voice-cloning?

Output formats (e.g., wav, mp3) and delivery methods (local files, URLs, etc.) are determined by the underlying inference.sh apps like infsh/kokoro-tts or elevenlabs/tts. The skill does not enforce a particular audio format; it simply defines how agents can call these models. Consult each app’s documentation or run a test command to see the default output behavior.

When is ai-voice-cloning not a good fit?

You might want a different solution if:

  • You need a no-CLI, fully browser-based workflow
  • You require offline, on-device TTS with no external API calls
  • Your priority is training custom models from large datasets rather than using prebuilt voices

In those cases, look for desktop DAWs with integrated TTS plugins or on-device TTS libraries. If your focus is scripted, automated AI voice generation through a CLI or agents, ai-voice-cloning is a strong candidate.

Where can I learn more about configuration and advanced options?

Start with:

  • tools/audio/ai-voice-cloning/SKILL.md in the inferen-sh/skills repository
  • The CLI install doc: cli-install.md referenced in SKILL.md
  • Any model-specific docs linked from inference.sh for apps like infsh/kokoro-tts or elevenlabs/tts

These resources will give you the latest example commands, parameter lists, and usage notes beyond the quick-start patterns included here.
