Top Repositories
Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab.
ASR client for Triton ASR Service
Podcast Summarizer with LLM Technology
OpenAI-Compatible Frontend for Nvidia Triton Inference ASR/TTS Server
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
Repositories
Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.
Fun-ASR is an end-to-end speech recognition large model launched by Tongyi Lab.
ASR client for Triton ASR Service
Training library for Megatron-based models with bidirectional Hugging Face conversion capability
Scalable toolkit for efficient model reinforcement
FireRedASR2S is a state-of-the-art, industrial-grade, all-in-one ASR system with ASR, VAD, LID, and Punc modules. All modules achieve SOTA performance.
We Speech Toolkit, LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction
OpenAI-Compatible Frontend for Nvidia Triton Inference ASR/TTS Server
verl: Volcano Engine Reinforcement Learning for LLMs
🤗 R1-AQA Model: mispeech/r1-aqa
A Datacenter Scale Distributed Inference Serving Framework
Podcast Summarizer with LLM Technology
FlashCosyVoice: A lightweight vLLM implementation built from scratch for CosyVoice.
A framework for efficient model inference with omni-modality models
Real-time speech recognition using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, x86_64 servers, websocket server/client, C/C++, Python, Kotlin
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics recognition capability.
🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch.
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.
NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.
NeMo: a toolkit for conversational AI
A high-throughput and memory-efficient inference and serving engine for LLMs
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
Spark-TTS Inference Code
Pytriton ASR Server