Open Source Python Text to Speech Software - Page 2

Python Text to Speech Software

Browse free open source Python Text to Speech Software and projects below. Use the toggles on the left to filter the results by OS, license, language, programming language, and project status.

  • 1
    ElevenLabs Python

    The official Python SDK for the ElevenLabs API

    elevenlabs-python is the official Python SDK for the ElevenLabs API, giving developers a convenient way to access ElevenLabs’ high-quality, lifelike voices. The library wraps the HTTP API into a typed Python client, so you can perform text-to-speech, streaming, voice cloning, voice management, and agents-related operations with simple method calls. It exposes ElevenLabs’ main models such as Eleven Multilingual v2, Eleven Flash v2.5, and Eleven Turbo v2.5, each targeting different trade-offs between latency, cost, and quality. The SDK is designed for quick setup: after installing the package and setting an API key, you can generate speech in multiple languages and play or process the resulting audio bytes. It includes helper utilities (like play and stream) so you can either play audio locally or integrate it into your own playback or networking pipeline.
    Downloads: 4 This Week
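
    The quick-start flow described above is small enough to show end to end. A minimal sketch, assuming the v1-style typed client from elevenlabs-python; the voice ID is a placeholder to replace with one from your own account, and method names may shift between SDK releases:

      # pip install elevenlabs
      from elevenlabs.client import ElevenLabs
      from elevenlabs import play

      client = ElevenLabs(api_key="YOUR_API_KEY")  # or set the ELEVENLABS_API_KEY environment variable

      audio = client.text_to_speech.convert(
          text="Hello from the ElevenLabs Python SDK.",
          voice_id="YOUR_VOICE_ID",             # placeholder; use a voice from your account
          model_id="eleven_multilingual_v2",    # one of the models mentioned above
          output_format="mp3_44100_128",
      )

      play(audio)  # helper utility from the SDK; alternatively, write the audio bytes to a file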
  • 2
    MeloTTS

    High-quality multi-lingual text-to-speech library by MyShell.ai

    MeloTTS is an open-source, multi-lingual text-to-speech (TTS) library from MyShell.ai that generates natural-sounding speech from text input. It covers multiple languages and accents, including English (US, UK, Indian, and Australian), Spanish, French, Chinese, Japanese, and Korean, and is fast enough for real-time inference on CPU.
    Downloads: 4 This Week
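
    Because the library exposes a small Python API, a typical call is only a few lines. A minimal sketch, assuming the melo.api.TTS interface from the project README; the language code and speaker name are examples:

      # install MeloTTS per its README (pip install from the repo, then download the unidic dictionary)
      from melo.api import TTS

      model = TTS(language="EN", device="cpu")   # "auto" or "cuda" if a GPU is available
      speaker_ids = model.hps.data.spk2id        # speakers available for the chosen language

      model.tts_to_file(
          "Speech generated locally with MeloTTS.",
          speaker_ids["EN-US"],
          "melo_output.wav",
          speed=1.0,
      )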
  • 3
    Pocket TTS

    A TTS that fits in your CPU (and pocket)

    Pocket TTS is a lightweight text-to-speech project designed to run efficiently on CPUs, targeting developers who want local speech generation without depending on GPUs or hosted web APIs. It is built to feel practical in everyday applications, where installation and usage should be as simple as adding a dependency and calling a function. The project focuses on keeping the runtime footprint manageable while still producing natural-sounding speech, which makes it attractive for offline tools, prototypes, and privacy-sensitive workflows. Because it is CPU-oriented, it fits well in server environments where GPU access is limited, in desktop apps, or in edge deployments where simplicity matters more than maximum throughput. It also emphasizes developer ergonomics, providing a straightforward API surface that can be integrated into pipelines, assistants, accessibility tools, or batch generation scripts.
    Downloads: 4 This Week
  • 4
    Speech-AI-Forge

    Speech-AI-Forge is a project developed around TTS generation models

    Speech-AI-Forge is a full-stack project built around modern text-to-speech generation models, providing both an API server and a Gradio-based web UI for interactive use. At its core, it acts as a hub that wires together multiple speech-related capabilities, including TTS, speech-to-text, and LLM-based control flows, behind a consistent interface. The system is designed to be deployed in several ways: you can try it online via hosted demos, spin it up in a one-click Colab environment, run it in Docker containers, or set it up locally with its environment preparation scripts. It is model-agnostic and advertises support for a variety of TTS and speech models such as ChatTTS, CosyVoice, Fish-Speech, FireredTTS, and others, as well as Whisper-based ASR, giving you a flexible playground for experimenting with different speech stacks. The project also integrates with general-purpose LLMs (for example GPT- or LLaMA-style models), which can be used to pre-process text and manage conversations.
    Downloads: 4 This Week
  • 5
    Style-Bert-VITS2

    Style-Bert-VITS2: Bert-VITS2 with more controllable voice styles

    Style-Bert-VITS2 is a text-to-speech system based on Bert-VITS2 that focuses on highly controllable voice styles and emotional expression. It takes the original Bert-VITS2 v2.1 and its Japanese-Extra variant and extends them so you can control emotion and speaking style with fine-grained intensity, not just choose a generic tone. The project targets both power users and beginners: Windows users without Git or Python can install and run it using bundled .bat scripts, while advanced users can work with virtual environments, uv, and Python tooling. It includes a full GUI editor to script dialogue, set different styles per line, edit dictionaries, and save/load projects, plus a separate web UI and Colab notebooks for training and experimentation. For those who only need synthesis, the project is published as a Python library (pip install style-bert-vits2) and can run on CPU without an NVIDIA GPU, though training still requires GPU hardware.
    Downloads: 4 This Week
  • 6
    Audiblez

    Generate audiobooks from e-books

    Audiblez is a tool for generating high-quality .m4b audiobooks directly from .epub e-books using the Kokoro-82M neural text-to-speech model. It focuses on making audiobook creation easy and fast: from a single command, the tool splits an e-book into chapters, synthesizes audio for each section, and then merges the results into a structured audiobook with chapter-based WAV files and a final .m4b container. The Kokoro-82M model it uses is compact (82M parameters) yet natural sounding, trained on under 100 hours of audio, and supports multiple languages, including English (US/UK), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese. Audiblez can run entirely from the command line via a PyPI package or through a simple cross-platform GUI built on wxPython, giving both advanced users and non-technical users an accessible workflow.
    Downloads: 3 This Week
  • 7
    MLX-Audio

    A text-to-speech, speech-to-text and speech-to-speech library

    MLX-Audio is a speech library built on Apple’s MLX framework and optimized for Apple Silicon machines (M-series Macs). It focuses on text-to-speech, speech-to-text, and speech-to-speech workflows, with APIs and a command-line interface that make it easy to generate high-quality audio from text. Because it uses MLX and targets Apple Silicon, inference is fast and can take advantage of hardware acceleration and quantization for efficient on-device performance. The project provides a straightforward CLI (mlx_audio.tts.generate) as well as a Python API for programmatic generation of audio, including parameters for voice choice, speed, language hints, output format, and sample rate. It includes examples such as audiobook generation to demonstrate long-form synthesis and joined audio segments. On top of that, MLX-Audio offers a modern web interface powered by FastAPI, with real-time waveform and 3D visualizations, file upload, and audio management.
    Downloads: 3 This Week
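
    A minimal sketch of the programmatic route, assuming a generate_audio helper in mlx_audio.tts.generate mirroring the CLI named above; the parameter names and the Kokoro model path are assumptions for illustration and may differ between releases:

      # pip install mlx-audio  (requires Apple Silicon / MLX)
      from mlx_audio.tts.generate import generate_audio

      generate_audio(
          text="MLX-Audio runs text-to-speech on Apple Silicon.",
          model_path="prince-canuma/Kokoro-82M",   # assumed example model repository
          voice="af_heart",                        # assumed voice name
          speed=1.0,
          file_prefix="mlx_audio_out",             # assumed output naming parameter
          audio_format="wav",
          verbose=True,
      )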
  • 8
    NVIDIA NeMo

    Toolkit for conversational AI

    NVIDIA NeMo, part of the NVIDIA AI platform, is a toolkit for building new state-of-the-art conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of prebuilt modules that include everything needed to train on your data. Every module can easily be customized, extended, and composed to create new conversational AI model architectures. Conversational AI architectures are typically large and require a lot of data and compute for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node mixed-precision training. Supported models include Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), and LSTM-CTC, with pre-trained speech processing models available through the NGC collection.
    Downloads: 3 This Week
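
    As a concrete example of composing the prebuilt collections, here is a minimal TTS inference sketch, assuming a nemo_toolkit[all] install; the two checkpoint names are examples of pre-trained models published on NGC and may differ from the current catalog:

      import soundfile as sf
      from nemo.collections.tts.models import FastPitchModel, HifiGanModel

      spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")  # text -> mel spectrogram
      vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")             # mel spectrogram -> waveform

      tokens = spec_generator.parse("NVIDIA NeMo composes prebuilt modules into conversational AI models.")
      spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
      audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

      sf.write("nemo_tts.wav", audio.detach().cpu().numpy()[0], samplerate=22050)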
  • 9
    NVIDIA NeMo Framework

    Scalable generative AI framework built for researchers and developers

    NVIDIA NeMo is a scalable, cloud-native generative AI framework aimed at researchers and PyTorch developers working on large language models, multimodal models, and speech AI (ASR and TTS), with growing support for computer vision. It provides collections of domain-specific modules and reference implementations that make it easier to pre-train, fine-tune, and deploy very large models on multi-GPU and multi-node infrastructure. NeMo 2.0 introduces a Python-based configuration system, replacing YAML with more flexible, programmable configs that can be versioned and composed for different experiments. The framework builds on PyTorch Lightning–style modular abstractions, so training scripts are composed from reusable components for data loading, models, optimizers, and schedulers, which simplifies experimentation and adaptation. NeMo is designed to scale: with tools like NeMo-Run, users can orchestrate large-scale experiments across thousands of GPUs.
    Downloads: 3 This Week
  • 10
    Spark TTS

    Spark-TTS Inference Code

    Spark TTS is an open-source, PyTorch-based text-to-speech inference system that leverages large language models to produce highly natural, intelligible speech from text input. It uses an efficient single-stream architecture where speech tokens are directly reconstructed from the predictions of an LLM, removing the need for external acoustic models or complex vocoders and making the generation pipeline cleaner and faster. The project supports zero-shot voice cloning, meaning it can imitate a new speaker’s voice without dedicated training for that specific voice, and works across languages, including English and Chinese, even in cross-lingual code-switching scenarios. Spark-TTS allows users to control speech characteristics like gender, pitch, and speaking rate to customize synthesized output and support virtual speaker creation.
    Downloads: 3 This Week
  • 11
    StyleTTS 2

    Towards Human-Level Text-to-Speech through Style Diffusion

    StyleTTS2 is a state-of-the-art text-to-speech system that aims for human-level naturalness by combining style diffusion, adversarial training, and large speech language models. It extends the original StyleTTS idea by introducing a style diffusion model that can sample rich, realistic speaking styles conditioned on reference speech, allowing highly expressive and diverse prosody. The architecture uses a two-stage training process and leverages an auxiliary speech language model to guide generation toward more natural and coherent utterances. StyleTTS2 supports both single-speaker and multi-speaker configurations, with the ability to sample or transfer styles from reference audio, making it powerful for expressive TTS and character voices. The repository includes training scripts, configuration files, and pre-trained auxiliary modules such as a text aligner, pitch extractor, and PL-BERT-based linguistic encoder.
    Downloads: 3 This Week
  • 12
    VALL-E X

    Open source implementation of Microsoft's VALL-E X zero-shot TTS model

    VALL-E-X is an open-source implementation of Microsoft’s VALL-E X zero-shot text-to-speech model, focused on multilingual, cross-lingual voice cloning. It is capable of synthesizing speech in English, Chinese, and Japanese from text while mimicking the voice characteristics of a speaker given only a short 3–10 second prompt. The model attempts to match not just timbre, but also tone, pitch, emotion, and prosody of the reference audio, resulting in highly personalized output. VALL-E-X supports zero-shot cross-lingual synthesis, meaning a monolingual speaker’s voice can be used to speak other languages without additional training. It also preserves aspects of the acoustic environment, such as background noise or reverb, making the generated audio feel more like it came from the same setting as the prompt. The repository includes Python APIs, sample scripts, ready-to-use voice presets, and demos hosted on Hugging Face Spaces and Google Colab so users can try it.
    Downloads: 3 This Week
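
    A minimal sketch of the Python API, assuming the utils.generation helpers from this repository (run from a checkout, since it is not distributed as a package); function names may differ between revisions:

      from scipy.io.wavfile import write as write_wav
      from utils.generation import SAMPLE_RATE, preload_models, generate_audio

      preload_models()  # downloads the model checkpoints on first use

      # Zero-shot synthesis with a bundled voice preset; cloning a specific speaker
      # instead uses the repo's prompt-making utility with a 3-10 second reference clip.
      audio_array = generate_audio("Hello, this sentence was synthesized by VALL-E X.")
      write_wav("vallex_out.wav", SAMPLE_RATE, audio_array)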
  • 13
    VibeVoice ComfyUI

    ComfyUI integration for Microsoft's VibeVoice text-to-speech model

    VibeVoice ComfyUI is a comprehensive wrapper that integrates Microsoft’s VibeVoice text-to-speech models directly into ComfyUI workflows. It exposes VibeVoice as a set of custom nodes so you can build single-speaker and multi-speaker voice generation pipelines visually, combining TTS with other audio or generative components. The integration supports high-quality single-speaker synthesis as well as scripted multi-speaker conversations, with optional voice cloning from audio samples for each speaker. It includes advanced control over generation parameters like attention backend, diffusion steps, sampling temperature, guidance scale, and quantization settings, allowing users to tune the trade-offs between quality, VRAM usage, and speed. The project also introduces first-class LoRA support, making it possible to fine-tune and load custom LoRA adapters that modify voice identity or style while keeping the base VibeVoice model intact.
    Downloads: 3 This Week
  • 14
    EmotiVoice

    Multi-Voice and Prompt-Controlled TTS Engine

    EmotiVoice is a multi-voice, prompt-controlled text-to-speech engine designed to generate highly expressive speech across thousands of voices. It supports both English and Chinese and ships with over 2,000 preset voices, making it suitable for everything from characters and virtual anchors to narration and dialogue. The core idea is prompt-based emotional and style control: you can ask the engine to speak “happy,” “sad,” “excited,” or with other high-level style prompts that shape prosody, pitch, speed, and energy. EmotiVoice provides multiple ways to interact with it, including a web interface, a Docker image, an HTTP API (including an OpenAI-compatible TTS API), and Python scripts for batch synthesis. It also supports voice cloning with your own data, backed by recipes for popular datasets like DataBaker and LJSpeech, so you can train or adapt voices to custom personas.
    Downloads: 2 This Week
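
    Because the project exposes an OpenAI-compatible TTS API, the standard openai client can be pointed at a locally running EmotiVoice server. A minimal sketch; the base URL, model name, and voice ID below are assumptions for illustration rather than values taken from the EmotiVoice docs:

      from openai import OpenAI

      client = OpenAI(api_key="anything", base_url="http://localhost:8000/v1")  # assumed local server address

      response = client.audio.speech.create(
          model="emoti-voice",   # placeholder model identifier
          voice="8051",          # placeholder preset-voice ID (the engine ships 2,000+ presets)
          input="Welcome back! It is great to see you again.",
      )

      with open("emotivoice.mp3", "wb") as f:
          f.write(response.content)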
  • 15
    Open Vision Agents by Stream

    Build Vision Agents quickly with any model or video provider

    Open Vision Agents by Stream is an open-source framework from Stream for building real-time, multimodal AI agents that watch, listen, and respond to live video streams. It focuses on combining video understanding models, such as YOLO- and Roboflow-based detectors, with real-time large language models like OpenAI Realtime and Gemini Live to create interactive experiences. The framework uses Stream’s ultra-low-latency edge network so agents can join sessions quickly and maintain very low audio and video latency while processing frames and generating responses. Developers work with an agent abstraction that connects video edge providers, LLMs, and processors into pipelines, making it easier to orchestrate tasks like object detection, pose estimation, and conversational guidance. The project includes SDKs for React, Android, iOS, Flutter, React Native, and Unity, enabling integration into a wide variety of client environments such as mobile apps, web apps, and games.
    Downloads: 2 This Week
  • 16
    Orpheus TTS

    Towards Human-Sounding Speech

    Orpheus TTS is a state-of-the-art open-source text-to-speech system built on a Llama-3B backbone, treating speech synthesis as a large language model problem instead of a traditional TTS pipeline. It is designed to produce human-like speech with natural intonation, emotion, and rhythm, targeting quality comparable to or better than many closed-source systems. The project ships both pretrained and finetuned English models, as well as a family of multilingual models released as a research preview, and includes data-processing scripts so users can train or finetune their own variants. Inference is provided through a Python package that uses vLLM under the hood for high-throughput, low-latency generation, including streaming examples that show how to generate audio chunks in real time. The maintainers provide Colab notebooks, a standardized prompting format, and one-click deployment via Baseten for production-grade, FP8/FP16 optimized inference with ~200 ms streaming latency.
    Downloads: 2 This Week
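
    A minimal sketch of the streaming flow described above, assuming the orpheus_tts Python package backed by vLLM; the checkpoint name and voice are placeholders and the interface may change between releases:

      import wave
      from orpheus_tts import OrpheusModel

      model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")  # placeholder checkpoint

      with wave.open("orpheus_out.wav", "wb") as wf:
          wf.setnchannels(1)
          wf.setsampwidth(2)       # 16-bit PCM
          wf.setframerate(24000)   # Orpheus generates 24 kHz audio
          # generate_speech yields audio chunks as the LLM produces speech tokens
          for chunk in model.generate_speech(
              prompt="Orpheus treats speech synthesis as a language-modeling problem.",
              voice="tara",
          ):
              wf.writeframes(chunk)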
  • 17
    StreamSpeech

    StreamSpeech is a seamless model for offline and simultaneous speech recognition, translation, and synthesis

    StreamSpeech is an “all-in-one” speech model designed to perform offline and simultaneous speech recognition, speech translation, and speech synthesis within a single unified architecture. Developed as part of an ACL 2024 paper, it targets streaming and low-latency scenarios where intermediate results and final translations or synthetic speech must be produced continuously as audio is being received. The model supports eight tasks: offline ASR, speech-to-text translation, speech-to-speech translation, and TTS, as well as their streaming or simultaneous counterparts, all handled by the same underlying system. During simultaneous translation, StreamSpeech can optionally output intermediate ASR transcripts and text translations, giving users or downstream applications real-time visibility into what the system is hearing and how it is translating.
    Downloads: 2 This Week
  • 18
    VoxCPM

    TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

    VoxCPM is a tokenizer-free text-to-speech system that models speech in a continuous space, aiming for extremely realistic, context-aware synthesis and true-to-life zero-shot voice cloning. Instead of converting speech into discrete tokens, it uses an end-to-end diffusion-autoregressive architecture built on the MiniCPM-4 backbone, combining hierarchical language modeling, finite scalar quantization (FSQ), and local Diffusion Transformers. This design helps decouple semantic and acoustic information while preserving fine-grained prosody, leading to more stable and expressive generation than many discrete-token systems. Trained on a large 1.8-million-hour bilingual corpus, VoxCPM can infer appropriate speaking style from context, dynamically adjusting intonation, rhythm, and emotional tone. It supports zero-shot voice cloning from a short reference audio clip, capturing timbre, accent, and pacing to closely mimic a target speaker without per-speaker fine-tuning.
    Downloads: 2 This Week
  • 19
    ChatTTS webUI & API

    A simple native web interface that uses ChatTTS to synthesize text

    ChatTTS-ui is a local web interface and API wrapper around the ChatTTS speech synthesis system, designed to make advanced TTS models easy to use from a browser. It runs a small backend server (Python + Torch + ffmpeg) and exposes a simple webpage where you can type text, adjust parameters, and generate audio. The project supports Chinese, English, and mixed text with digits and control symbols, making it suitable for bilingual content and numerically heavy text like announcements or prompts. From version 0.96 onward, ffmpeg installation is required for deployment, and previous CSV/PT voice tables are no longer valid, so users instead work with updated “voice value” parameters. For convenience, there is a prepackaged Windows build: you download a release archive, extract it, and double-click app.exe to start the web UI, which opens on localhost:9966.
    Downloads: 1 This Week
  • 20
    GLM-TTS

    Controllable & emotion-expressive zero-shot TTS

    GLM-TTS is an advanced text-to-speech synthesis system built on large language model technologies that focuses on producing high-quality, expressive, and controllable spoken output, including features like emotion modulation and zero-shot voice cloning. It uses a two-stage architecture where a generative LLM first converts text into intermediate speech token sequences and then a Flow-based neural model converts those tokens into natural audio waveforms, enabling rich prosody and voice character even for unseen speakers. The system introduces a multi-reward reinforcement learning framework that jointly optimizes for voice similarity, emotional expressiveness, pronunciation, and intelligibility, yielding output that can rival commercial options in naturalness and expressiveness. GLM-TTS also supports phoneme-level control and hybrid text + phoneme input, giving developers precise control over pronunciation, which is critical for multilingual or polyphone-rich languages.
    Downloads: 1 This Week
  • 21
    HiFi-GAN

    Generative Adversarial Networks for Efficient and High Fidelity Speech

    HiFi-GAN is a GAN-based neural vocoder designed to generate high-fidelity speech waveforms from mel spectrograms with exceptional efficiency. It introduces a generator architecture tailored to model the periodic structure of speech and a set of discriminators that focus on different scales and periods of the waveform to better capture naturalness. The model targets a sweet spot between sample quality and generation speed, outperforming many previous GAN vocoders while being far faster than typical autoregressive models. In experiments on LJSpeech, HiFi-GAN was shown to achieve mean opinion scores close to human recordings while synthesizing 22.05 kHz audio up to ~168× faster than real time on an NVIDIA V100 GPU. A smaller configuration trades a bit of quality for even higher speed and can run more than 13× faster than real time on CPU, making it suitable for deployment scenarios without powerful GPUs.
    Downloads: 1 This Week
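
    In a full pipeline, HiFi-GAN sits after an acoustic model such as FastPitch or Tacotron 2 and turns its mel spectrograms into waveforms. A minimal inference sketch, assuming the reference implementation's layout (models.Generator, env.AttrDict, a config.json, and a released generator checkpoint); the file names are placeholders:

      import json
      import torch
      from env import AttrDict
      from models import Generator

      with open("config.json") as f:                  # config shipped alongside the checkpoint
          h = AttrDict(json.load(f))

      generator = Generator(h)
      state = torch.load("generator_v1", map_location="cpu")
      generator.load_state_dict(state["generator"])
      generator.eval()
      generator.remove_weight_norm()

      mel = torch.randn(1, 80, 200)                   # stand-in mel spectrogram [batch, n_mels, frames]
      with torch.no_grad():
          audio = generator(mel).squeeze()            # waveform in [-1, 1]; 22.05 kHz for the LJSpeech config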
  • 22
    IMS Toucan

    Controllable and fast Text-to-Speech for over 7000 languages

    IMS-Toucan is a toolkit for training, using, and teaching state-of-the-art text-to-speech systems, built at the Institute for Natural Language Processing (IMS), University of Stuttgart. It is the official home of ToucanTTS, a massively multilingual TTS system designed to support over 7,000 languages with a single unified framework. The toolkit focuses on being fast and controllable while not requiring huge amounts of compute, making it practical for research labs and smaller teams. It includes complete pipelines for preprocessing datasets, training models, and running inference, plus a storage configuration system to manage where models and caches are stored. IMS-Toucan ships with several ready-to-run scripts, including GUIs for interactive demos, prosody override tools, zero-shot language embedding injection, and text-to-audio file generation. Pretrained models are automatically downloaded when needed, and there is an online demo instance hosted on GPU that anyone can try.
    Downloads: 1 This Week
  • 23
    MARS5

    MARS5 speech model (TTS) from CAMB.AI

    MARS5-TTS is CAMB.AI’s open-source English speech model designed for high-quality text-to-speech and voice emulation. It uses a two-stage architecture that combines an autoregressive (AR) model with a non-autoregressive (NAR) model, giving it both expressiveness and speed. The model is built to handle prosodically challenging content such as sports commentary, anime dialogue, and other high-energy or highly varied speech patterns with realistic rhythm and intonation. To control speaker identity, MARS5 uses a short reference audio clip, typically between 2 and 12 seconds, from which it learns the voice characteristics. It supports two main inference modes: shallow clone, which is faster and only needs the reference audio, and deep clone, which additionally uses the transcript of the reference audio to increase similarity and naturalness at the cost of more computation.
    Downloads: 1 This Week
  • 24
    OuteTTS

    Interface for OuteTTS models

    OuteTTS is an interface library for running OuteTTS text-to-speech models across a range of backends, making it easier to deploy the same model on different hardware and runtimes. It provides a high-level Interface API that wraps model configuration, speaker handling, and audio generation so you can focus on integrating speech into your application rather than wiring up low-level engines. The project supports multiple backends including llama.cpp (Python bindings and server), Hugging Face Transformers, ExLlamaV2, vLLM, and a JavaScript interface via Transformers.js, allowing it to run on CPUs, NVIDIA CUDA GPUs, AMD ROCm, Vulkan-capable GPUs, and Apple Metal. It also includes a notion of speaker profiles: you can create a speaker from a short audio sample, save it as JSON, and reuse it for consistent voice identity across generations and sessions. For best quality, the model is designed to work with a reference speaker clip and will inherit emotion, style, and accent from that reference.
    Downloads: 1 This Week
  • 25
    WavTokenizer

    SOTA discrete acoustic codec models with 40/75 tokens per second

    WavTokenizer is a state-of-the-art discrete acoustic codec designed specifically for audio language modeling, capable of compressing 24 kHz audio into just 40 or 75 tokens per second while preserving high perceptual quality. It is built to represent speech, music, and general audio with extremely low bitrate, making it ideal as a front-end for large audio language models like GPT-4o and similar architectures. The model uses a single-quantizer design together with temporal compression to achieve extreme compression without sacrificing reconstruction fidelity. Its architecture incorporates a broader vector-quantization space, extended contextual windows, and improved attention networks, combined with multi-scale discriminators and inverse Fourier transform blocks to enhance waveform reconstruction. Extensive experiments show that WavTokenizer matches or surpasses previous neural codecs across speech, music, and general audio on both objective metrics and subjective listening tests.
    Downloads: 1 This Week