
Search Results for "transcribe audio to text"

Showing 746 open source projects for "transcribe audio to text"

  • 1
    Transcripciones con Whisper (Transcriptions with Whisper): This web-based desktop application lets you transcribe, or both transcribe and translate into English, audio or video files using OpenAI's Whisper model.
    Downloads: 1 This Week
  • 2
    Whishper

    Transcribe any audio to text, translate and edit subtitles 100% locally

    Open-source, local-first audio transcription and subtitling suite with a simple web UI. Thanks to open-source technologies, Whishper can run 100% offline. Your data never leaves your computer. Whishper allows you to translate your transcriptions to and from more than 60 languages thanks to Argos Translate and LibreTranslate. Download the transcriptions in many formats (json, txt, vtt, srt). Easily edit your subtitles right in the Web-UI.
    Downloads: 19 This Week
  • 3
    MLX-Audio

    A text-to-speech, speech-to-text and speech-to-speech library

    MLX-Audio is a speech library built on Apple’s MLX framework and optimized for Apple Silicon machines (M-series Macs). It focuses on text-to-speech and speech-to-speech workflows, with APIs and a command-line interface that make it easy to generate high-quality audio from text. Because it uses MLX and targets Apple Silicon, inference is fast and can take advantage of hardware acceleration and quantization for efficient on-device performance.
    Downloads: 3 This Week
  • 4
    Qwen-Audio

    Chat & pretrained large audio language model proposed by Alibaba Cloud

    Qwen-Audio is a large audio-language model developed by Alibaba Cloud, built to accept various types of audio input (speech, natural sounds, music, singing) along with text input, and output text. There is also an instruction-tuned version called Qwen-Audio-Chat which supports conversational interaction (multi-round), audio + text input, creative tasks and reasoning over audio.
    Downloads: 1 This Week
  • 5
    Bootleg Text Slicer

    Text transcription & slicing tool with visual timeline and WAV output.

    ... - Record the timeline position, along with the global and per‑word timing offsets for each exported word, into a cutTemplate.txt file so that the individual words can later be played using only the source audio file. GitHub repository: https://github.com/Northstrix/bootleg-text-slicer Successfully tested with English and Italian audio files. Both scripts work, but I wouldn’t advise you to use Bootleg Text Slicer V2.py to transcribe more than 60–90 seconds at a time. Otherwise, its UI might become laggy. You can easily adjust the transcription duration by moving the start and end sliders below the timeline.
    Downloads: 2 This Week
  • 6
    SpeechRecognition

    Speech recognition module for Python

    Library for performing speech recognition, with support for several engines and APIs, online and offline. It can recognize speech input from the microphone, transcribe an audio file, save audio data to an audio file, show extended recognition results, calibrate the recognizer energy threshold for ambient noise levels (see recognizer_instance.energy_threshold for details), and listen to a microphone in the background, among other recognizer features. The easiest way to install it is with pip install SpeechRecognition. ...
    Downloads: 20 This Week
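The SpeechRecognition entry mentions calibrating an energy threshold against ambient noise. The sketch below illustrates that general idea in plain Python and is not the library's implementation; the function names and the default values for `initial`, `damping`, and `ratio` are assumptions chosen for illustration.

```python
import math
from typing import Iterable, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square energy of one frame of signed PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(sample * sample for sample in frame) / len(frame))


def calibrate_threshold(
    ambient_frames: Iterable[Sequence[int]],
    initial: float = 300.0,   # assumed starting threshold
    damping: float = 0.15,    # assumed weight kept from the previous threshold
    ratio: float = 1.5,       # assumed margin above the ambient energy
) -> float:
    """Move the threshold toward ratio * ambient energy, one frame at a time."""
    threshold = initial
    for frame in ambient_frames:
        target = rms(frame) * ratio
        threshold = threshold * damping + target * (1 - damping)
    return threshold


def is_speech(frame: Sequence[int], threshold: float) -> bool:
    """Treat a frame as speech when its energy exceeds the threshold."""
    return rms(frame) > threshold


if __name__ == "__main__":
    quiet = [100, -100, 100, -100]       # stand-in for background noise
    loud = [1000, -1000, 1000, -1000]    # stand-in for a spoken frame
    threshold = calibrate_threshold([quiet] * 10)
    print(threshold, is_speech(quiet, threshold), is_speech(loud, threshold))
```

Listening to a second or so of silence before recording gives such a threshold time to settle just above the room's noise level, so that quiet background sound is not mistaken for speech.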
  • 7
    Concordia

    Crowdsourcing platform for full text transcription and tagging

    Concordia is a platform for crowdsourcing transcription and tagging of text in digitized images. It was developed by the Library of Congress so that volunteers of all backgrounds could transcribe and tag digitized images of manuscripts and typed materials from the Library’s collections that optical character recognition could not otherwise process.
    Downloads: 1 This Week
  • 8
    Qwen2-Audio

    Repo of Qwen2-Audio chat & pretrained large audio language model

    Qwen2-Audio is a large audio-language model by Alibaba Cloud, part of the Qwen series. It is trained to accept various audio signal inputs (including speech, sounds, etc.) and perform both voice chat and audio analysis, producing textual responses. It supports two major modes: Voice Chat (interactive voice only input) and Audio Analysis (audio + text instructions), with both base and instruction-tuned models.
    Downloads: 1 This Week
  • 9
    Handy STT

    A free, open source, and extensible speech-to-text application

    Handy is a free, open-source, offline speech-to-text application built for privacy, accessibility, and extensibility. Developed using Tauri (Rust + React/TypeScript), it runs natively across Windows, macOS, and Linux while performing local speech recognition without sending any audio to cloud servers. Handy allows users to start transcription instantly using a configurable keyboard shortcut—press to record, release to transcribe—and automatically pastes the resulting text into any active text field. ...
    Downloads: 48 This Week
  • 10
    Kimi-Audio

    Audio foundation model excelling in audio understanding

    Kimi-Audio is an ambitious open-source audio foundation model designed to unify a wide array of audio processing tasks — from speech recognition and audio understanding to generative conversation and sound event classification — within a single cohesive architecture. Instead of fragmenting work across specialized models, Kimi-Audio handles automatic speech recognition (ASR), audio question answering, automatic audio captioning, speech emotion recognition, and audio-to-text chat in one system, enabling developers to build rich, multimodal audio applications without stitching together disparate components. ...
    Downloads: 1 This Week
  • 11
    Step-Audio

    Open-source framework for intelligent speech interaction

    Step-Audio is a unified, open-source framework aimed at building intelligent speech systems that combine both comprehension and generation: it integrates large language models (LLMs) with speech input/output to handle not only semantic understanding but also rich vocal characteristics like tone, style, dialect, emotion, and prosody. The design moves beyond traditional separate-component pipelines (ASR → text model → TTS), instead offering a multimodal model that ingests speech or audio and produces speech accordingly, enabling natural dialogue, voice cloning, and expressive speech synthesis. ...
    Downloads: 0 This Week
  • 12
    Buzz

    Transcribe and translate audio offline on your personal computer

    Buzz transcribes and translates audio to text offline using OpenAI's Whisper. Import audio and video files into Buzz and export them as TXT, SRT, or VTT files. Buzz supports Whisper, Whisper.cpp, Faster Whisper, Whisper-compatible models from the Hugging Face repository, and the OpenAI Whisper API. Linux versions are available from https://flathub.org/apps/io.github.chidiwilliams.Buzz and https://snapcraft.io/buzz. Home page: https://github.com/chidiwilliams/buzz. Note for Windows: the app is not signed, so you will get a warning when you install it. ...
    Downloads: 3,922 This Week
  • 13
    Step-Audio-EditX

    LLM-based Reinforcement Learning audio edit model

    Step-Audio-EditX is an open-source, 3 billion-parameter audio model from StepFun AI designed to make expressive and precise editing of speech and audio as easy as text editing. Rather than treating audio editing as low-level waveform manipulation, this model converts speech into a sequence of discrete “audio tokens” (via a dual-codebook tokenizer) — combining a linguistic token stream and a semantic (prosody/emotion/style) token stream — thereby abstracting audio editing into high-level token operations. ...
    Downloads: 0 This Week
  • 14
    Text Generation Web UI

    Oobabooga - The definitive Web UI for local AI, with powerful features

    ...Instruct mode compatible with Alpaca and Open Assistant formats. Nice HTML output for GPT-4chan. Markdown output for GALACTICA, including LaTeX rendering. Custom chat characters. Advanced chat features (send images, get audio responses with TTS). Very efficient text streaming. Parameter presets, 8-bit mode. Layers splitting across GPU(s), CPU, and disk. CPU mode, FlexGen, DeepSpeed ZeRO-3, API with streaming and without streaming. LLaMA model, including 4-bit GPTQ. RWKV model, LoRA (loading and training), Softprompts, and extensions.
    Downloads: 45 This Week
  • 15
    Step-Audio 2

    Multi-modal large language model designed for audio understanding

    ...It integrates a latent-space audio encoder, discrete acoustic tokens, and reinforcement-learning–based training (CoT + RL) to enhance its ability to capture and reproduce voice styles, intonations, and subtle vocal cues. Moreover, Step-Audio2 supports tool-calling and retrieval-augmented generation (RAG), allowing it to access external knowledge sources or audio/text databases, thus reducing hallucinations and improving coherence in complex dialogues.
    Downloads: 0 This Week
  • 16
    OpenAI.fm

    Code for openai.fm, a demo for the OpenAI Speech API

    OpenAI.fm is an official interactive demo application built to showcase the OpenAI Speech API and its advanced text-to-speech capabilities, providing developers and creators with a hands-on web interface to convert text into high-quality, customizable audio using state-of-the-art TTS models. Developed using Next.js and the OpenAI Speech API, this demo illustrates how the latest neural voice models can produce natural, expressive speech with adjustable styles and voices, highlighting features like emotional range, tone, and real-time playback. ...
    Downloads: 185 This Week
  • 17
    Google AI Edge Gallery

    A gallery that showcases on-device ML/GenAI use cases

    Gallery is a curated collection of on-device machine learning examples, demo apps, and model artifacts designed to help developers experiment with and deploy ML at the edge. The project bundles runnable samples that show how to run TensorFlow Lite/Edge TPU models (and similar lightweight runtimes) on mobile and embedded platforms, demonstrating common tasks like image classification, object detection, audio recognition, and pose estimation. Each sample is intended to be both a learning aid...
    Downloads: 56 This Week
  • 18
    HeartMuLa

    A Family of Open Sourced Music Foundation Models

    ...The project also includes HeartCodec, a music codec optimized for high reconstruction fidelity, enabling efficient tokenization and reconstruction workflows that are critical for training and generation pipelines. For text extraction from audio, it provides HeartTranscriptor, a Whisper-based model tuned specifically for lyrics transcription, which helps bridge generated or recorded audio back into structured text. It also introduces HeartCLAP, which aligns audio and text into a shared embedding space.
    Downloads: 23 This Week
  • 19
    sherpa-onnx

    Speech-to-text, text-to-speech, and speaker recognition

    Speech-to-text, text-to-speech, and speaker recognition using next-gen Kaldi with onnxruntime without an Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter.
    Downloads: 216 This Week
  • 20
    abogen

    Generate audiobooks from EPUBs, PDFs and text with captions

    abogen is a tool designed to generate audiobooks (or speech narrations) from textual sources such as EPUBs, PDFs, or plain text, with synchronized captions. In other words, it automates the pipeline of reading a digital book (or document), converting its text into speech via a TTS engine, and packaging the result into an audiobook format — likely along with timestamped captions or subtitles that align with the spoken audio. This can be very useful for accessibility, content consumption on the go, or for users who prefer audio over reading. ...
    Downloads: 6 This Week
  • 21
    Frescobaldi

    LilyPond sheet music text editor

    Frescobaldi is a free and open source LilyPond sheet music text editor. Designed to be powerful yet lightweight and easy to use, Frescobaldi offers a host of useful features such as a music view with advanced two-way Point & Click, MIDI capturing to enter music, a Snippet Manager, and many more. Frescobaldi is named after Girolamo Frescobaldi (1583-1643), an Italian composer of keyboard music in the late Renaissance and early Baroque period.
    Downloads: 36 This Week
  • 22
    Vibe

    Transcribe on your own

    Vibe is an open-source desktop application by thewh1teagle for transcribing audio and video files offline with OpenAI's Whisper, so recordings stay on your own computer. ...
    Downloads: 19 This Week
  • 23
    Qwen3-Omni

    Qwen3-Omni is a natively end-to-end, omni-modal LLM

    Qwen3-Omni is a natively end-to-end multilingual omni-modal foundation model that processes text, images, audio, and video and delivers real-time streaming responses in text and natural speech. It uses a Thinker-Talker architecture with a Mixture-of-Experts (MoE) design, early text-first pretraining, and mixed multimodal training to support strong performance across all modalities without sacrificing text or image quality. The model supports 119 text languages, 19 speech input languages, and 10 speech output languages. ...
    Downloads: 3 This Week
  • 24
    Voice-Pro

    Comprehensive Gradio WebUI for audio processing

    Voice-Pro is a Gradio WebUI for transcription, translation, and text-to-speech. It installs with one click and creates a Miniconda virtual environment that runs completely separate from the Windows system (fully portable). It supports real-time transcription and translation, as well as batch mode.
    Downloads: 38 This Week
  • 25
    AudioCraft

    AudioCraft is a library for audio processing and generation

    AudioCraft is a PyTorch library for text-to-audio and text-to-music generation, packaging research models and tooling for training and inference. It includes MusicGen for music generation conditioned on text (and optionally melody) and AudioGen for text-conditioned sound effects and environmental audio. Both models operate over discrete audio tokens produced by a neural codec (EnCodec), which acts like a tokenizer for waveforms and enables efficient sequence modeling. ...
    Downloads: 0 This Week