
ai-voice-cloning

by inferen-sh

ai-voice-cloning is an inference.sh-based skill for AI voice generation, text-to-speech, and voice cloning from the CLI. It wraps ElevenLabs, Kokoro TTS, DIA, Chatterbox, Higgs, and VibeVoice models for natural speech, multi-voice narration, and voice transformation for audio and video projects.

Category: Voice Generation
Install Command
npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning
Overview

What is ai-voice-cloning?

ai-voice-cloning is a CLI-focused AI voice generation and voice cloning skill built on top of the inference.sh platform. It lets you call text-to-speech and voice transformation models from the command line, including ElevenLabs, Kokoro TTS, DIA, Chatterbox, Higgs, and VibeVoice.

The skill is defined in the inferen-sh/skills repository and is designed to be embedded into agent workflows that can call Bash via infsh (the inference.sh CLI). It focuses on generating natural-sounding speech and transforming existing voice recordings, rather than model training or dataset management.

Key capabilities

  • Text-to-speech (TTS) from the CLI using infsh app run ...
  • Multiple AI voice models in one place (e.g., elevenlabs/tts, infsh/kokoro-tts)
  • Voice cloning / voice changing for existing recordings via ElevenLabs Voice Changer
  • Support for many voices and languages (via ElevenLabs models, per the upstream description)
  • Long-form narration suitable for voiceovers, audiobooks, and podcasts
  • Conversation-style and expressive reads using models tuned for natural speech

Because ai-voice-cloning is a skill definition rather than a standalone app, you interact with it through the inference.sh CLI and any agents or tools that are allowed to run Bash commands.

Who is ai-voice-cloning for?

This skill is a good fit if you:

  • Work with audio or video and need fast, scripted voice generation
  • Build AI agents, CLIs, or automation that should speak or narrate
  • Produce voiceovers, explainers, tutorials, or training videos
  • Want ElevenLabs-quality voices and other specialized TTS models behind a single CLI
  • Prefer command-line workflows over web GUIs

It is less suitable if you:

  • Need a purely graphical interface with no CLI usage
  • Want to train custom models from raw audio datasets (not covered by this skill)
  • Require in-browser or on-device operation without calling the inference.sh service

Typical use cases

  • Generating narration tracks for YouTube or marketing videos
  • Creating audiobook or podcast speech from text scripts
  • Producing multiple character voices for dialogue and conversation
  • Applying voice changing to existing recordings using ElevenLabs Voice Changer
  • Adding audio prompts and system voices to agents, bots, and interactive tools

How to Use

1. Prerequisites and installation options

To use ai-voice-cloning you need:

  • Access to the inference.sh CLI (infsh)
  • Network connectivity to inference.sh APIs
  • A shell environment where Bash commands are allowed

You can integrate the skill into your agent environment using:

npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning

This pulls the skill definition from inferen-sh/skills and registers it so your agent can call the associated tools (notably Bash with infsh).

For direct CLI use outside of an agent, install the inference.sh CLI itself. The skill’s SKILL.md links to CLI install instructions at:

  • https://raw.githubusercontent.com/inference-sh/skills/refs/heads/main/cli-install.md

Follow that document to install infsh on your system.

2. Log in to inference.sh

Once infsh is installed, authenticate:

infsh login

Follow the prompts to log in or configure your credentials as described in the CLI install guide.

3. Quick start: generate speech with Kokoro TTS

The SKILL.md provides a simple Kokoro TTS example. After logging in, you can generate speech with:

infsh app run infsh/kokoro-tts --input '{
  "prompt": "Hello! This is an AI-generated voice that sounds natural and engaging.",
  "voice": "af_sarah"
}'

What this does:

  • Calls the infsh/kokoro-tts app
  • Sends JSON input with a prompt (the text to read) and a voice selection
  • Produces synthesized speech as output (see CLI docs for output paths or streaming behavior)

You can adapt this pattern to different prompts and supported voices.
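As a sketch of how that adaptation might be scripted, the loop below prints one `kokoro-tts` command per voice as a dry run so you can inspect the JSON before executing anything. Note that only `af_sarah` comes from the quick start above; the other voice names are hypothetical placeholders, not confirmed Kokoro voice IDs.

```shell
#!/bin/sh
# Dry-run sketch: print one kokoro-tts command per voice.
# Only "af_sarah" is taken from the quick start; "voice_two" and
# "voice_three" are placeholders -- check the app's docs for real IDs.

# build_tts_cmd VOICE PROMPT -> prints the full infsh command
build_tts_cmd() {
  printf "infsh app run infsh/kokoro-tts --input '{\"prompt\": \"%s\", \"voice\": \"%s\"}'\n" \
    "$2" "$1"
}

for VOICE in af_sarah voice_two voice_three; do
  build_tts_cmd "$VOICE" "Hello! This is an AI-generated voice."
done
```

Once the voice IDs are confirmed, you can execute a printed command directly, e.g. `build_tts_cmd af_sarah "Hello" | sh`.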

4. Using different models (ElevenLabs, DIA, and more)

The SKILL.md lists available models in an Available Models table. From the visible excerpt, you can expect entries similar to:

  • ElevenLabs TTS – App ID: elevenlabs/tts
  • ElevenLabs Voice Changer – App ID: elevenlabs/voice-changer
  • Kokoro TTS – App ID: infsh/kokoro-tts
  • DIA – App ID starting with infsh/dia-...
  • Other models like Chatterbox, Higgs, and VibeVoice are also referenced in the skill description.

To call a different app, change the App ID in your CLI command. For example, a typical pattern for TTS with ElevenLabs might look like:

infsh app run elevenlabs/tts --input '{
  "text": "This audio was generated using the ai-voice-cloning skill.",
  "voice": "some_voice_id"
}'

Use the repository documentation and model-specific README content (if present) to confirm the exact input schema for each app, as different models may use different field names like prompt, text, or voice_id.
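One way to keep those schema differences out of your scripts is a small helper that picks the field name per app. The mapping below is an assumption inferred from the examples in this document (`prompt` for Kokoro, `text` for ElevenLabs); verify it against each app's README before relying on it.

```shell
#!/bin/sh
# tts_input APP TEXT VOICE -> prints a JSON input body using the text
# field name each app is assumed to expect. This mapping is a guess
# based on the examples above; confirm it with each app's docs.
tts_input() {
  case "$1" in
    infsh/kokoro-tts) field="prompt" ;;
    elevenlabs/tts)   field="text"   ;;
    *)                field="text"   ;;  # default assumption
  esac
  printf '{"%s": "%s", "voice": "%s"}' "$field" "$2" "$3"
}

# Usage sketch:
#   infsh app run elevenlabs/tts --input "$(tts_input elevenlabs/tts 'Hello' some_voice_id)"
tts_input infsh/kokoro-tts "Hello" af_sarah
echo
```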

5. Voice changing / voice cloning with ElevenLabs Voice Changer

The skill description explicitly includes ElevenLabs Voice Changer (App ID elevenlabs/voice-changer) for transforming existing recordings. A typical CLI call will:

  1. Reference an input audio file (your original recording)
  2. Specify target voice or settings
  3. Output a transformed audio file

A generic pattern might look like this:

infsh app run elevenlabs/voice-changer --input '{
  "audio_url": "https://.../your-input-audio.wav",
  "voice": "target_voice_id"
}'

Check the inference.sh app documentation to confirm the exact fields and supported formats.

6. Integrating ai-voice-cloning into agents

When you add ai-voice-cloning as a skill using npx skills add, an agent platform that understands the inferen-sh/skills format can:

  • See that Bash (infsh *) is an allowed tool
  • Use the examples and description from SKILL.md as guidance
  • Automatically generate appropriate infsh app run ... commands to create or transform audio

To tune the behavior for your agent:

  1. Open SKILL.md in the tools/audio/ai-voice-cloning directory.
  2. Review any examples, available model tables, or notes about use cases.
  3. Add your own prompt patterns, voice choices, or post-processing steps in your agent configuration or orchestration layer.

7. Files to inspect in the repository

For a deeper understanding of how the skill is defined and how it should be used:

  • tools/audio/ai-voice-cloning/SKILL.md – Core description, quick start, and model list
  • Root-level docs like README.md and cli-install.md – General inference.sh and CLI setup guidance

There may also be additional docs in the tools folder for broader tooling context.


FAQ

Is ai-voice-cloning a standalone app or a skill definition?

ai-voice-cloning is a skill definition inside the inferen-sh/skills repository. It describes how an agent can use the inference.sh CLI (infsh) for AI voice generation and voice cloning. You do not get a GUI application; instead, you get a clear way to call TTS and voice changer models from the command line or from agent workflows that can execute Bash.

What do I need installed to use ai-voice-cloning?

You need:

  • The inference.sh CLI (infsh) installed and accessible in your shell
  • Valid authentication for inference.sh (set up via infsh login)
  • An environment that allows Bash commands (for example, a local terminal or an agent runtime that exposes Bash)

Optionally, if you are integrating this into an agent platform that supports the skills format, install the skill with:

npx skills add https://github.com/inferen-sh/skills --skill ai-voice-cloning

Which AI voice models are supported?

From the skill description and SKILL.md, ai-voice-cloning is designed to work with multiple models available via inference.sh, including:

  • ElevenLabs TTS – elevenlabs/tts
  • ElevenLabs Voice Changer – elevenlabs/voice-changer
  • Kokoro TTS – infsh/kokoro-tts
  • DIA TTS apps (App IDs starting with infsh/dia-...)
  • Additional models such as Chatterbox, Higgs, and VibeVoice mentioned in the description

Refer to the Available Models table in SKILL.md and inference.sh documentation for the current, complete list and their parameters.

Can ai-voice-cloning handle long-form narration?

Yes. The skill is explicitly described as suitable for long-form narration and use cases like audiobooks, podcasts, and video narration. That said, long-form handling details (such as chunking, maximum text length, and stitching behavior) depend on each underlying model’s limits and the inference.sh runtime. If you plan to process very long scripts, test with smaller sections first and consult model documentation.
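If you do need to process a long script, a simple pre-chunking pass keeps each request within model limits. The sketch below splits a script into one request per line and only prints the commands (a dry run); the one-line-per-chunk granularity, the `af_sarah` voice, and the stitching step (e.g. concatenating the resulting audio files afterwards) are all assumptions to adapt.

```shell
#!/bin/sh
# Sketch: synthesize a long script one line (sentence) at a time.
# Dry run only -- the infsh commands are printed, not executed.
SCRIPT_FILE="demo_script.txt"
printf '%s\n' "First sentence of the narration." \
              "Second sentence of the narration." > "$SCRIPT_FILE"

N=0
while IFS= read -r LINE; do
  N=$((N + 1))
  # Drop the leading echo/label to actually run each chunk.
  echo "chunk $N: infsh app run infsh/kokoro-tts --input '{\"prompt\": \"$LINE\", \"voice\": \"af_sarah\"}'"
done < "$SCRIPT_FILE"
echo "total chunks: $N"
```

In practice, point `SCRIPT_FILE` at your real script and collect each chunk's audio output for concatenation in a separate step.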

How is this different from using ElevenLabs or other providers directly?

ai-voice-cloning:

  • Uses the inference.sh CLI as a unified interface
  • Lets you switch between multiple TTS and voice changer models with similar infsh app run ... commands
  • Integrates naturally into agent skills, Bash scripts, and automated workflows

If you already use a provider’s native API directly, ai-voice-cloning may still be useful when you want:

  • A single CLI that abstracts multiple providers and models
  • Easier integration with agent frameworks that understand the skills format

Does ai-voice-cloning support real-time streaming audio?

The SKILL.md excerpt focuses on batch-style commands (infsh app run ...) and does not explicitly describe real-time streaming behavior. Any streaming or low-latency options depend on the specific apps on inference.sh, not on the skill wrapper itself. Check the inference.sh documentation for the models you plan to use if real-time output is important for your use case.

What output format do I get from ai-voice-cloning?

Output formats (e.g., wav, mp3) and delivery methods (local files, URLs, etc.) are determined by the underlying inference.sh apps like infsh/kokoro-tts or elevenlabs/tts. The skill does not enforce a particular audio format; it simply defines how agents can call these models. Consult each app’s documentation or run a test command to see the default output behavior.

When is ai-voice-cloning not a good fit?

You might want a different solution if:

  • You need a no-CLI, fully browser-based workflow
  • You require offline, on-device TTS with no external API calls
  • Your priority is training custom models from large datasets rather than using prebuilt voices

In those cases, look for desktop DAWs with integrated TTS plugins or on-device TTS libraries. If your focus is scripted, automated AI voice generation through a CLI or agents, ai-voice-cloning is a strong candidate.

Where can I learn more about configuration and advanced options?

Start with:

  • tools/audio/ai-voice-cloning/SKILL.md in the inferen-sh/skills repository
  • The CLI install doc: cli-install.md referenced in SKILL.md
  • Any model-specific docs linked from inference.sh for apps like infsh/kokoro-tts or elevenlabs/tts

These resources will give you the latest example commands, parameter lists, and usage notes beyond the quick-start patterns included here.
