Ever spent hours cleaning up a transcript? Inserting commas, capitalizing words, adjusting timestamps, and fixing numbers spoken as “twenty-two thousand three hundred ten” rather than “22,310”? I was tired of cloud-based speech recognition tools that compromised privacy and desktop solutions that delivered flat, unpunctuated text without timestamps.

So I tried Parakeet-TDT.

TL;DR

Most speech-to-text tools either drop key elements such as punctuation and timestamps or depend on cloud APIs. This post walks through a fully local transcription system built on NVIDIA’s Parakeet-TDT 0.6B model.

✅ Auto punctuation & capitalization
✅ Word/segment-level timestamps
✅ Long audio support
✅ Tested on financial news, lyrics, and tech conversations
✅ Built using Streamlit + NeMo — runs 100% offline

🎯 The Problem: ASR That Misses the Metadata

Most ASR tools do a decent job with basic transcripts. But they fall short when real-world applications demand:

📈 Business number accuracy
🧾 Structured formatting
🔐 Local processing with privacy
🎬 Subtitle alignment

Whether you’re handling earnings calls, voice notes, or executive interviews, flat transcripts won’t cut it.

💡 The Solution: NVIDIA Parakeet-TDT 0.6B

🎥 Live Demo
Watch Parakeet transcribe business audio, lyrics, and interviews — entirely offline:

https://www.youtube.com/watch?v=9L8CLNsiVIg

A full walkthrough of the local ASR system built with Parakeet-TDT 0.6B, including an architecture overview and transcription demos for financial news, song lyrics, and a tech dialogue.

Note: The lyrics demo segment (Wavin’ Flag) has been muted to comply with copyright restrictions on YouTube.

⚙️ Key Features

Auto punctuation & casing
Word and segment-level timestamps
Handles long audio (up to 24 mins per chunk)
CUDA-accelerated
Free for commercial use (CC-BY-4.0)
Fast: RTFx of about 3380 at batch size 128 (roughly 56 minutes of audio transcribed per second of compute)
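
To make the speed figure concrete: RTFx (inverse real-time factor) is the audio duration divided by the processing time. The sketch below just does that arithmetic. The function names are illustrative and not part of NeMo, and the second example assumes a 2:37 clip (157 s) processed in exactly 2 seconds.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def audio_minutes_per_second(rtfx_value: float) -> float:
    """Convert an RTFx value into minutes of audio processed per wall-clock second."""
    return rtfx_value / 60.0


# Benchmark figure quoted above: RTFx 3380 at batch size 128
print(round(audio_minutes_per_second(3380), 1))  # 56.3

# Assumed single-file run: 157 s of audio transcribed in 2 s
print(round(rtfx(157, 2), 1))  # 78.5
```

Batched benchmark numbers and single-file laptop runs are not directly comparable, so expect a much lower RTFx on consumer hardware.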

🧠 Under the Hood: Architecture & Training

📐 Architecture

FastConformer encoder + TDT decoder
600M parameters
Trained on over 120K hours

🧪 Training Overview

Pretrained with wav2vec on LibriLight
Fine-tuned on 500 hours of clean speech
Final training on YouTube-like public datasets
Trained using NVIDIA NeMo on 64× A100 GPUs

💻 Setup: Run It Locally (Windows)

The code, requirements, and sample audio files are available on GitHub:
🔗 GitHub — SridharSampath/parakeet-asr-demo

1. Create Conda Environment

conda create -n parakeet-asr python=3.10 -y
conda activate parakeet-asr

2. Install Dependencies

pip install -r requirements.txt

The requirements file pulls in NeMo, PyTorch, Streamlit, and the audio libraries.

3. Install FFmpeg

choco install ffmpeg
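
The command above assumes the Chocolatey package manager is already installed. pydub relies on an ffmpeg executable for formats other than WAV, so a quick check that it is on PATH can save a confusing error later. A small sketch using only the standard library (the helper name find_ffmpeg is illustrative):

```python
import shutil
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the full path of the ffmpeg executable, or None if it is not on PATH."""
    return shutil.which("ffmpeg")


path = find_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH; try opening a new terminal after installing it")
else:
    print(f"ffmpeg found at {path}")
```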

🧠 Code Walkthrough

🔌 Load the Model

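import torch
from nemo.collections.asr.models import ASRModel

# Fetch the pretrained checkpoint (downloaded once, then read from the local cache)
model = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
if torch.cuda.is_available():
    model = model.to(torch.bfloat16)  # lower-precision weights to reduce GPU memory use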

🎧 Audio Preprocessing

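from pydub import AudioSegment
# Convert the input to 16 kHz mono WAV, the format the model expects
audio = AudioSegment.from_file(audio_path)
audio = audio.set_frame_rate(16000).set_channels(1)
processed_path = "processed.wav"
audio.export(processed_path, format="wav")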
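
If transcription quality looks off, it is worth confirming that the exported file really is 16 kHz mono. Here is a minimal sketch using Python's built-in wave module. The helper name is_16k_mono is illustrative, and the demo writes one second of silence to a temporary file so the snippet runs on its own.

```python
import os
import tempfile
import wave


def is_16k_mono(wav_path: str) -> bool:
    """Return True if the WAV file is 16 kHz, single-channel audio."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getframerate() == 16000 and wav_file.getnchannels() == 1


# Demo: write one second of 16-bit silence at 16 kHz mono, then check it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

print(is_16k_mono(demo_path))  # True
```

In the app, the same check could be pointed at processed.wav right after the export step.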

📝 Transcription

# timestamps=True attaches timing information to each transcription result
output = model.transcribe([processed_path], timestamps=True)
for seg in output[0].timestamp["segment"]:
    print(f"{seg['start']}s - {seg['end']}s: {seg['segment']}")

Streamlit handles exporting to .csv, .srt, and .txt.
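
The export code itself isn't shown here, so below is a minimal sketch of how segment timestamps could be turned into .srt text. It assumes each segment is a dict with start and end (in seconds) and segment (the text), as in the loop above. The names srt_time and segments_to_srt are illustrative, not functions from the repo or from NeMo, and the example segments are made up.

```python
def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with 'start', 'end' (seconds) and 'segment' (text)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}"
        blocks.append(f"{index}\n{timing}\n{seg['segment'].strip()}\n")
    # Each block ends with a newline; joining with another one leaves a blank line between cues
    return "\n".join(blocks)


# Hypothetical segments in the same shape as output[0].timestamp["segment"]
example = [
    {"start": 0.0, "end": 2.5, "segment": "The Nifty 50 closed at 22,310 points."},
    {"start": 2.5, "end": 61.04, "segment": "Markets were mixed."},
]
print(segments_to_srt(example))
```

The returned string can be written to a file as-is or, in a Streamlit app, handed to a download button.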

🖥️ Application Interface — Local ASR in Action

After the one-time model download, the system runs fully offline. It loads the 600M-parameter model in seconds and transcribes a 2:37 clip in under 2 seconds on CUDA.

🧪 Real-World Transcription Tests

1. Stock Market News (2:30 mins)

🎧 File: Stockmarketnews.wav
Simulates a financial update with spoken numbers, companies, and currencies.

Transcription wins:

Phrases like “The Nifty 50 closed at 22,310 points”
Correct formatting for “₹3,487” and percentage figures
Accurate punctuation and clarity

2. Song Lyrics — Wavin’ Flag (3:40 mins)

🎧 File: Wavin-Flag-song.wav
Focuses on lyric structure and repetition.

Transcription wins:

Captures phrasing: “When I get older, I will be stronger…”
Punctuation preserves rhythm
Line breaks and structure detected

3. Tech Dialogue — Satya x Jensen (5:00 mins)

🎧 File: JensenHuang-SatyaNadella-Conference-talk.wav
First 5 minutes of a Build Conference chat on AI.

Transcription wins:

Captures phrases like “tokens per dollar per watt”
Maintains sentence integrity and structure
Handles longer, multi-speaker content

🧾 Sample Audio Files

JensenHuang-SatyaNadella-Conference-talk.wav
Stockmarketnews.wav
Wavin-Flag-song.wav

Available in the GitHub repo

📊 Parakeet vs Whisper (Medium)

Feature	Parakeet-TDT 0.6B	Whisper Medium
Params	600M	769M
WER (test-clean)	2.5%	3.6%
WER (test-other)	6.2%	7.8%
RTFx (batch)	3386	~300
Word-level timestamps	Yes	Opt-in (word_timestamps=True)
Commercial license	CC-BY-4.0	MIT
Financial number accuracy	Excellent	Good
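
For readers new to the metric: WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Below is a minimal sketch of that calculation. It only lowercases and splits on whitespace, whereas published benchmarks typically apply fuller text normalization first, so treat it as an illustration rather than the leaderboard's scoring code. The function name and example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference word
                current[j - 1] + 1,      # insertion of a hypothesis word
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)


score = word_error_rate(
    "the nifty closed at a record high",
    "the nifty closed at record highs",
)
print(round(score, 3))  # 0.286 (one deletion + one substitution over 7 words)
```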

🏆 Benchmark Leadership

Parakeet ranks #1 on the Hugging Face Open ASR Leaderboard (as of May 2025):

WER: 6.05% averaged across the leaderboard's evaluation sets (best open model)
RTFx: 3386
License: CC-BY-4.0

⚠️ Limitations

English-only
Requires GPU (CUDA) for optimal performance
No built-in speaker diarization

🧠 Final Thoughts

Parakeet-TDT 0.6B offers a strong open-source alternative to Whisper for English transcription — especially when speed, timestamps, and offline processing are critical.

Perfect for:

Executive interviews
Financial transcription
Subtitles & media apps
Research projects

⚙️ Test Environment

GPU: NVIDIA RTX 3050 Laptop GPU
CUDA: 11.8
OS: Windows 11
Frameworks: NeMo + PyTorch

🔗 Resources

🙌 Let’s Connect

If you're exploring ASR, real-time transcription, or multimodal RAG — I'd love to connect:
