🎙️ Local Speech-to-Text with NVIDIA Parakeet ASR (TDT 0.6B)

A fully local, GPU-accelerated speech-to-text system using NVIDIA Parakeet-TDT 0.6B, with punctuation, timestamps, and real-world audio demos.


Ever spent hours cleaning up a transcript? Inserting commas, capitalizing words, adjusting timestamps, and fixing numbers transcribed as “twenty-two thousand three hundred ten” rather than “22,310”? I was tired of cloud-based speech recognition tools that compromised privacy and desktop solutions that delivered flat, unpunctuated text without timestamps.

So I tried Parakeet-TDT.

TL;DR

Most speech-to-text tools either miss key elements like punctuation and timestamps or rely on cloud APIs. This post showcases a fully local transcription system built on NVIDIA’s Parakeet-TDT 0.6B model.

✅ Auto punctuation & capitalization
✅ Word/segment-level timestamps
✅ Long audio support
✅ Tested on financial news, lyrics, and tech conversations
✅ Built using Streamlit + NeMo, runs 100% offline

🎯 The Problem: ASR That Misses the Metadata

Most ASR tools do a decent job with basic transcripts. But they fall short when real-world applications demand:

📈 Business number accuracy
🧾 Structured formatting
🔐 Local processing with privacy
🎬 Subtitle alignment

Whether you’re handling earnings calls, voice notes, or executive interviews, flat transcripts won’t cut it.

💡 The Solution: NVIDIA Parakeet-TDT 0.6B

🎥 Live Demo
Watch Parakeet transcribe business audio, lyrics, and interviews, entirely offline:

A full walkthrough of the local ASR system built with Parakeet-TDT 0.6B. Includes architecture overview and transcription demos for financial news, song lyrics, and a tech dialogue.

Note: The lyrics demo segment (Wavin’ Flag) has been muted to comply with copyright restrictions on YouTube.


⚙️ Key Features

  • Auto punctuation & casing

  • Word and segment-level timestamps

  • Handles long audio (up to 24 mins per chunk)

  • CUDA-accelerated

  • Free for commercial use (CC-BY-4.0)

  • Fast: RTFx 3380 (~56 min of audio/sec at batch size 128)
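To make the speed figure concrete: RTFx (inverse real-time factor) is the number of seconds of audio transcribed per second of compute. The sketch below only reproduces the arithmetic behind the numbers quoted in this post; the function names are my own and are not part of NeMo.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of compute (higher is faster)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def audio_minutes_per_second(rtfx_value: float) -> float:
    """Convert an RTFx value into minutes of audio handled per wall-clock second."""
    return rtfx_value / 60.0


# Benchmark figure quoted above: RTFx 3380 at batch size 128.
print(round(audio_minutes_per_second(3380), 1))  # 56.3

# Laptop run reported later in this post: a 2:37 clip (157 s) in under 2 s,
# so the RTFx on that machine is at least:
print(rtfx(157, 2))  # 78.5
```

The gap between the two numbers is expected: the benchmark figure comes from large-batch inference, while the laptop run processes a single file.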

🧠 Under the Hood: Architecture & Training

📐 Architecture

FastConformer encoder + TDT decoder
600M parameters
Trained on over 120K hours

🧪 Training Overview

  • Pretrained with wav2vec on LibriLight

  • Fine-tuned on 500 hours of clean speech

  • Final training on YouTube-like public datasets

  • Trained using NVIDIA NeMo on 64× A100 GPUs


💻 Setup: Run It Locally (Windows)

The code, requirements, and sample audio files are available on GitHub:
🔗 GitHub: SridharSampath/parakeet-asr-demo

1. Create Conda Environment

conda create -n parakeet-asr python=3.10 -y
conda activate parakeet-asr

2. Install Dependencies

pip install -r requirements.txt

The requirements file includes NeMo, PyTorch, Streamlit, and audio libraries.

3. Install FFmpeg

choco install ffmpeg
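Before launching the app, it can help to confirm that FFmpeg is actually reachable, since pydub relies on it to decode non-WAV inputs. This is a minimal sketch using only the standard library; `find_ffmpeg` is a hypothetical helper name, not something from the repo.

```python
import shutil
from typing import Optional


def find_ffmpeg(executable: str = "ffmpeg") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


path = find_ffmpeg()
if path:
    print(f"FFmpeg found at {path}")
else:
    print("FFmpeg not found on PATH; install it and reopen the terminal.")
```

On Windows, a freshly installed FFmpeg usually becomes visible only after a new terminal session picks up the updated PATH.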

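🧠 Code Walkthrough

🔌 Load the Model

import torch
from nemo.collections.asr.models import ASRModel

# Downloads the checkpoint on first use, then loads it from the local cache.
model = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model = model.to(torch.bfloat16)  # half-width floats to cut GPU memory use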

🎧 Audio Preprocessing

from pydub import AudioSegment

processed_path = "processed.wav"
audio = AudioSegment.from_file(audio_path)
audio = audio.set_frame_rate(16000).set_channels(1)  # 16 kHz mono for the model
audio.export(processed_path, format="wav")
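If a recording is longer than the roughly 24 minutes per chunk mentioned under Key Features, one option is to split it before transcription. The sketch below only plans the chunk boundaries in milliseconds; `plan_chunks` and its parameters are hypothetical names of my own, and the repo may handle long audio differently.

```python
from typing import List, Tuple

MAX_CHUNK_MS = 24 * 60 * 1000  # 24 minutes, the per-chunk figure quoted above


def plan_chunks(total_ms: int, max_chunk_ms: int = MAX_CHUNK_MS,
                overlap_ms: int = 0) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans covering total_ms, each at most max_chunk_ms long."""
    if total_ms < 0:
        raise ValueError("total_ms must be non-negative")
    if max_chunk_ms <= 0 or not 0 <= overlap_ms < max_chunk_ms:
        raise ValueError("need max_chunk_ms > 0 and 0 <= overlap_ms < max_chunk_ms")
    spans = []
    start = 0
    while start < total_ms:
        end = min(start + max_chunk_ms, total_ms)
        spans.append((start, end))
        if end == total_ms:
            break
        start = end - overlap_ms  # step back so consecutive chunks share some audio
    return spans


# A 60-minute recording splits into two full chunks and one 12-minute remainder.
print(plan_chunks(60 * 60 * 1000))
# [(0, 1440000), (1440000, 2880000), (2880000, 3600000)]
```

With pydub, each span can then be cut with slice syntax, `audio[start_ms:end_ms]`, and exported as its own WAV file. Remember to add each chunk's start offset back onto the timestamps the model returns for that chunk.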

📝 Transcription

# timestamps=True also returns word- and segment-level timing information.
output = model.transcribe([processed_path], timestamps=True)
for seg in output[0].timestamp["segment"]:
    print(f"{seg['start']}s - {seg['end']}s: {seg['segment']}")

Streamlit handles exporting to .csv, .srt, and .txt.
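The export code itself is not shown in this post, so the following is only a sketch of how segment timestamps could be turned into SRT subtitles. It assumes each segment is a dict with 'start', 'end' (in seconds), and 'segment' keys, as in the loop above; the function names and the sample timestamps are illustrative.

```python
from typing import Dict, Iterable


def format_srt_time(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT files."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Build SRT text: numbered cues, a time range line, then the cue text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        time_range = f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}"
        blocks.append(f"{index}\n{time_range}\n{seg['segment'].strip()}\n")
    return "\n".join(blocks)  # a blank line separates consecutive cues


# Illustrative segment; the timings are made up for the example.
sample = [{"start": 0.0, "end": 2.48, "segment": "The Nifty 50 closed at 22,310 points."}]
print(segments_to_srt(sample))
```

Writing the returned string to a file with a .srt extension (UTF-8 encoded) is enough for most video players to pick it up.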


🖥️ Application Interface: Local ASR in Action

The system runs fully offline, loads the 600M model in seconds, and transcribes a 2:37 clip in under 2 seconds on CUDA.

🧪 Real-World Transcription Tests

1. Stock Market News (2:30 mins)

🎧 File: Stockmarketnews.wav
Simulates a financial update with spoken numbers, companies, and currencies.

Transcription wins:

  • Phrases like “The Nifty 50 closed at 22,310 points”

  • Correct formatting for “₹3,487” and percentage figures

  • Accurate punctuation and clarity


2. Song Lyrics: Wavin’ Flag (3:40 mins)

🎧 File: Wavin-Flag-song.wav
Focuses on lyric structure and repetition.

Transcription wins:

  • Captures phrasing: “When I get older, I will be stronger…”

  • Punctuation preserves rhythm

  • Line breaks and structure detected


3. Tech Dialogue: Satya x Jensen (5:00 mins)

🎧 File: JensenHuang-SatyaNadella-Conference-talk.wav
First 5 minutes of a Build Conference chat on AI.

Transcription wins:

  • Captures phrases like “tokens per dollar per watt”

  • Maintains sentence integrity and structure

  • Handles longer, multi-speaker content

🧾 Sample Audio Files

  • JensenHuang-SatyaNadella-Conference-talk.wav

  • Stockmarketnews.wav

  • Wavin-Flag-song.wav

Available in the GitHub repo


📊 Parakeet vs Whisper (Medium)

| Feature | Parakeet-TDT 0.6B | Whisper Medium |
|---|---|---|
| Params | 600M | 769M |
| WER (test-clean) | 2.5% | 3.6% |
| WER (test-other) | 6.2% | 7.8% |
| RTFx (batch) | 3386 | ~300 |
| Word-level timestamps | Yes | No |
| Commercial license | CC-BY-4.0 | MIT |
| Financial number accuracy | Excellent | Good |
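For readers new to the metric, the WER rows report word error rate: the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. The sketch below is a minimal version that assumes plain whitespace tokenization; benchmark suites normally normalize casing and punctuation first, so it will not reproduce the figures in the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping only the previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because the denominator is the reference length, WER can exceed 100% when the hypothesis contains many inserted words.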

🏆 Benchmark Leadership

Parakeet ranks #1 on the Hugging Face Open ASR Leaderboard (as of May 2025):

  • WER: 6.05% (best open model)

  • RTFx: 3386

  • License: CC-BY-4.0

⚠️ Limitations

  • English-only

  • Requires GPU (CUDA) for optimal performance

  • No built-in speaker diarization

🧠 Final Thoughts

Parakeet-TDT 0.6B offers a strong open-source alternative to Whisper for English transcription, especially when speed, timestamps, and offline processing are critical.

Perfect for:

  • Executive interviews

  • Financial transcription

  • Subtitles & media apps

  • Research projects

⚙️ Test Environment

  • GPU: NVIDIA RTX 3050 Laptop GPU

  • CUDA: 11.8

  • OS: Windows 11

  • Frameworks: NeMo + PyTorch


🙌 Let’s Connect

If you're exploring ASR, real-time transcription, or multimodal RAG, I'd love to connect.