Local Speech-to-Text with NVIDIA Parakeet ASR (TDT 0.6B)
A fully local, GPU-accelerated speech-to-text system using NVIDIA Parakeet-TDT 0.6B, with punctuation, timestamps, and real-world audio demos.

Ever spent hours cleaning up a transcript? Inserting commas, capitalizing words, adjusting timestamps, and fixing numbers spoken as "twenty-two thousand three hundred ten" rather than "22,310"? I was tired of cloud-based speech recognition tools that compromised privacy and desktop solutions that delivered flat, unpunctuated text without timestamps.
So I tried Parakeet-TDT.
TL;DR
Most speech-to-text tools either drop key elements such as punctuation and timestamps or depend on cloud APIs. This post showcases a fully local transcription system built on NVIDIA's Parakeet-TDT 0.6B model:
Auto punctuation & capitalization
Word/segment-level timestamps
Long audio support
Tested on financial news, lyrics, and tech conversations
Built using Streamlit + NeMo; runs 100% offline
The Problem: ASR That Misses the Metadata
Most ASR tools do a decent job with basic transcripts. But they fall short when real-world applications demand:
Business number accuracy
Structured formatting
Local processing with privacy
Subtitle alignment
Whether you're handling earnings calls, voice notes, or executive interviews, flat transcripts won't cut it.
The Solution: NVIDIA Parakeet-TDT 0.6B
Live Demo
Watch Parakeet transcribe business audio, lyrics, and interviews, entirely offline:
A full walkthrough of the local ASR system built with Parakeet-TDT 0.6B. Includes architecture overview and transcription demos for financial news, song lyrics, and a tech dialogue.
Note: The lyrics demo segment (Wavin' Flag) has been muted to comply with copyright restrictions on YouTube.

Key Features
Auto punctuation & casing
Word and segment-level timestamps
Handles long audio (up to 24 mins per chunk)
CUDA-accelerated
Free for commercial use (CC-BY-4.0)
Fast: RTFx 3380 (~56 min of audio/sec at batch size 128)
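To make the speed figure concrete: RTFx (inverse real-time factor) is audio duration divided by processing time, so an RTFx of 3380 means about 3380 seconds of audio per second of compute. Here is a minimal sketch of that arithmetic. The helper name `rtfx` is my own for illustration, not something NeMo provides.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# An RTFx of 3380 expressed as minutes of audio per second of compute.
minutes_per_second = 3380 / 60
print(f"{minutes_per_second:.1f} min of audio per second")  # 56.3 min of audio per second

# A 2:37 clip (157 s) transcribed in 2 s corresponds to an RTFx of 78.5.
print(rtfx(2 * 60 + 37, 2.0))  # 78.5
```

The headline number is reported at batch size 128, so a single file on a laptop GPU should be expected to land far below it, as the 2:37 example from my own runs suggests.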
Under the Hood: Architecture & Training
Architecture
FastConformer encoder + TDT decoder
600M parameters
Trained on over 120K hours
Training Overview
Pretrained with wav2vec on LibriLight
Fine-tuned on 500 hours of clean speech
Final training on YouTube-like public datasets
Trained using NVIDIA NeMo on 64× A100 GPUs
Setup: Run It Locally (Windows)
The code, requirements, and sample audio files are available on GitHub:
GitHub: SridharSampath/parakeet-asr-demo
1. Create Conda Environment
conda create -n parakeet-asr python=3.10 -y
conda activate parakeet-asr
2. Install Dependencies
pip install -r requirements.txt
Includes NeMo, PyTorch, Streamlit, and audio libraries.
3. Install FFmpeg
choco install ffmpeg
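Before launching the app, it can help to confirm that the three setup steps actually took effect. The snippet below is a small sanity check of my own, not part of the repo. The module names (`torch`, `nemo`, `streamlit`, `pydub`) are assumptions about what `requirements.txt` installs, so adjust them to match your environment.

```python
import importlib.util
import shutil


def check_environment(modules=("torch", "nemo", "streamlit", "pydub")):
    """Report which modules are importable and whether FFmpeg is on PATH."""
    report = {name: importlib.util.find_spec(name) is not None for name in modules}
    report["ffmpeg"] = shutil.which("ffmpeg") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name:10s} {'found' if ok else 'MISSING'}")
```

Run it inside the activated `parakeet-asr` environment. Anything reported as MISSING points back to the step that needs repeating.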
Code Walkthrough
Load the Model
import torch
from nemo.collections.asr.models import ASRModel

model = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model = model.to(torch.bfloat16)  # lower-precision weights on GPU
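Audio Preprocessing
from pydub import AudioSegment
audio = AudioSegment.from_file(audio_path)  # audio_path: any FFmpeg-readable file
audio = audio.set_frame_rate(16000).set_channels(1)  # 16 kHz mono
processed_path = "processed.wav"
audio.export(processed_path, format="wav")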
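For recordings longer than the roughly 24-minute per-chunk limit listed under Key Features, one option is to split the audio before transcription. The sketch below only computes the chunk boundaries in milliseconds. The function name `chunk_bounds` and the hard-cut strategy are my own assumptions, not what the repo does.

```python
def chunk_bounds(total_ms: int, max_chunk_ms: int = 24 * 60 * 1000):
    """Return (start_ms, end_ms) pairs that cover [0, total_ms) in order."""
    if max_chunk_ms <= 0:
        raise ValueError("max_chunk_ms must be positive")
    return [
        (start, min(start + max_chunk_ms, total_ms))
        for start in range(0, total_ms, max_chunk_ms)
    ]


# A 50-minute recording splits into 24 + 24 + 2 minutes.
for start, end in chunk_bounds(50 * 60 * 1000):
    print(start // 60000, end // 60000)
```

With pydub, `len(audio)` gives the duration in milliseconds and `audio[start:end]` slices by milliseconds, so each pair maps directly to a chunk you can export and transcribe. Hard cuts can split a word in half, and timestamps from later chunks need `start / 1000` added back, so treat this as a starting point.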
Transcription
output = model.transcribe([processed_path], timestamps=True)
for seg in output[0].timestamp["segment"]:
    print(f"{seg['start']}s - {seg['end']}s: {seg['segment']}")
Streamlit handles exporting to .csv, .srt, and .txt.
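To give a feel for the .srt export, here is a minimal sketch that turns segment timestamps into SubRip text. It is not the code from the repo. The function names are hypothetical, and the example segments are made up in the same shape as `output[0].timestamp["segment"]`.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'segment' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n{seg['segment']}\n"
        )
    return "\n".join(blocks)


# Hypothetical segments, shaped like output[0].timestamp["segment"].
example = [
    {"start": 0.0, "end": 2.48, "segment": "Good evening."},
    {"start": 2.64, "end": 6.0, "segment": "The Nifty 50 closed at 22,310 points."},
]
print(segments_to_srt(example))
```

The resulting string can be written to a file or handed to a download widget in the UI. The .csv and .txt exports follow the same idea with simpler formatting.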
Application Interface: Local ASR in Action
The system runs fully offline, loads the 600M-parameter model in seconds, and transcribes a 2:37 clip in under 2 seconds on CUDA.

Real-World Transcription Tests
1. Stock Market News (2:30 mins)
File: Stockmarketnews.wav
Simulates a financial update with spoken numbers, companies, and currencies.
Transcription wins:
Phrases like "The Nifty 50 closed at 22,310 points"
Correct formatting for "₹3,487" and percentage figures
Accurate punctuation and clarity

2. Song Lyrics: Wavin' Flag (3:40 mins)
File: Wavin-Flag-song.wav
Focuses on lyric structure and repetition.
Transcription wins:
Captures phrasing: "When I get older, I will be stronger..."
Punctuation preserves rhythm
Line breaks and structure detected

3. Tech Dialogue: Satya x Jensen (5:00 mins)
File: JensenHuang-SatyaNadella-Conference-talk.wav
First 5 minutes of a Build conference chat on AI.
Transcription wins:
Captures phrases like "tokens per dollar per watt"
Maintains sentence integrity and structure
Handles longer, multi-speaker content

Sample Audio Files
JensenHuang-SatyaNadella-Conference-talk.wav
Stockmarketnews.wav
Wavin-Flag-song.wav
Available in the GitHub repo
Parakeet vs Whisper (Medium)
| Feature | Parakeet-TDT 0.6B | Whisper Medium |
|---|---|---|
| Params | 600M | 769M |
| WER (test-clean) | 2.5% | 3.6% |
| WER (test-other) | 6.2% | 7.8% |
| RTFx (batch) | 3386 | ~300 |
| Word-level timestamps | Yes | Optional (segment-level by default) |
| Commercial license | CC-BY-4.0 | MIT |
| Financial number accuracy | Excellent | Good |
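For readers new to the WER rows in the table: word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. Below is a minimal sketch of that calculation. The function name `wer` is my own, and unlike leaderboard scoring it applies no text normalization, so its numbers are not directly comparable to the table.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Because the comparison is an exact string match per word, casing and punctuation differences count as errors here unless you normalize both transcripts first.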
Benchmark Leadership
Parakeet ranks #1 on the Hugging Face Open ASR Leaderboard (as of May 2025):
Average WER: 6.05% (best open model)
RTFx: 3386
License: CC-BY-4.0

Limitations
English-only
Requires GPU (CUDA) for optimal performance
No built-in speaker diarization
Final Thoughts
Parakeet-TDT 0.6B offers a strong open-source alternative to Whisper for English transcription, especially when speed, timestamps, and offline processing are critical.
Perfect for:
Executive interviews
Financial transcription
Subtitles & media apps
Research projects
Test Environment
GPU: NVIDIA RTX 3050 Laptop GPU
CUDA: 11.8
OS: Windows 11
Frameworks: NeMo + PyTorch