Text-to-Speech vs AI Podcasts
A Hacker News comment last week called Podidex “glorified text-to-speech.” Fair enough. The audio comes out of a speaker. A voice says words. From the outside, that does look like text-to-speech.
But it's the difference between a printer and a journalist. A printer takes a finished document and puts it on paper. A journalist finds sources, reads them, decides what matters, writes a story, and then sends it to the printer. Text-to-speech is the printer. An AI podcast is the journalist.
What text-to-speech actually does
Text-to-speech takes existing text and reads it aloud. You give it an article, it gives you audio of that article. The words don't change. Nothing gets summarized, restructured, or explained. TTS is a rendering step.
If you've used Speechify, Apple's Spoken Content, or a Read Aloud browser extension, you've used TTS. Paste text in, get a voice reading it back. That's the whole pipeline.
The tech behind it has changed a lot. Early TTS stitched together pre-recorded sound fragments, one short unit at a time. It sounded like a robot dictating a ransom note. Modern neural TTS models generate speech from scratch. They know where to pause, what to stress, how to pace a sentence. ElevenLabs' Eleven v3 takes inline emotion tags like [excited] and [whispers]. Google's Gemini 2.5 Pro TTS handles multi-speaker dialogue natively. OpenAI's gpt-4o-mini-tts lets you steer delivery with plain English instructions like “speak like a calm news anchor.”
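To make the inline-tag format concrete, here's a tiny parser sketch. This is not ElevenLabs' actual API (their models consume the tags server-side); it just illustrates the format by splitting a script into (tag, text) segments, assuming tags look like [excited]:

```python
import re

def split_emotion_tags(script: str) -> list[tuple[str, str]]:
    """Split a script with inline tags like [excited] into (tag, text) segments.
    Illustrative only -- real TTS APIs interpret these tags themselves."""
    segments = []
    tag = "neutral"  # default delivery before the first tag appears
    # re.split with a capture group alternates: text, tag, text, tag, ...
    for i, part in enumerate(re.split(r"\[(\w+)\]", script)):
        if i % 2 == 1:        # odd indices are the captured tags
            tag = part
        elif part.strip():    # even indices are the text between tags
            segments.append((tag, part.strip()))
    return segments
```

Feeding it `"[excited] We shipped it! [whispers] don't tell anyone"` yields two segments, each paired with the delivery style the voice model should use.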
The voices sound good now. Way better than five years ago. But TTS still has a hard constraint: it can only read what you give it. It can't decide what's interesting. It can't skip the boring parts. It can't connect two articles together or explain a term the author assumed you knew.
What an AI podcast does differently
An AI podcast uses TTS at the end, but everything before it is different. The pipeline looks like this:
- Scrape content from source URLs, PDFs, or feeds
- Analyze the content: extract key facts, identify themes, filter noise
- Write a script: structure a narrative, add context, explain terms
- Make editorial calls: what's important, what gets cut, what needs more depth
- Synthesize audio using TTS voices
TTS is step 5 out of 5. The first four steps are where the actual work happens. They're the reason an AI podcast can take three blog posts about battery chemistry and produce a 15-minute episode that connects the ideas, explains the technical bits, and skips the marketing fluff.
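The steps above can be sketched as a pipeline. Everything here is hypothetical scaffolding (the function names, the `Episode` type, the stub bodies would each wrap a real scraper, LLM call, or TTS model); the point is structural: synthesis is the last call, not the whole program.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    script: str
    audio: bytes

def scrape(urls):          # 1. fetch raw text from sources
    return [f"content of {u}" for u in urls]

def analyze(docs):         # 2. extract key facts, drop noise
    return [d for d in docs if d]

def write_script(facts):   # 3 + 4. structure a narrative, make editorial cuts
    return "Today: " + " Next: ".join(facts)

def synthesize(script):    # 5. the only step plain TTS performs
    return script.encode()  # stand-in for actual audio bytes

def make_episode(urls):
    script = write_script(analyze(scrape(urls)))
    return Episode(script=script, audio=synthesize(script))
```

A pure TTS tool is just `synthesize()` on its own; the other four stages are where the editorial work lives.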
In Podidex, you can paste a URL and get a single episode, or set up sources and a schedule so episodes generate automatically. Either way, the AI reads the source material, writes a script, and produces the audio. You don't write the script. You don't even pick which parts matter. The AI does.
The key differences at a glance
- Input: TTS takes finished text. AI podcasts start with URLs, documents, or just a topic.
- Content creation: TTS creates nothing. It reads what you give it, word for word.
- Editorial judgment: Does the third paragraph of that article actually matter? TTS reads it anyway. An AI podcast might skip it entirely.
- Multi-source: Paste one article into Speechify and you get one article read aloud. Give Podidex five articles and it weaves them into one episode.
- Format: TTS gives you a monologue. AI podcasts give you structured episodes with segments and narrative.
- Automation: TTS is always manual. Some AI podcasts run on a schedule without you touching anything.
The spectrum from TTS to full AI podcast
It's a spectrum. Different tools sit at different points.
Pure TTS: Speechify, Apple Spoken Content, Read Aloud extensions. Paste text, get audio. No changes to content.
AI summary + TTS: Some news apps and Google Daily Listen. AI condenses articles, then TTS reads the summary. Single voice, monologue.
AI script + multi-voice: NotebookLM, Wondercraft, Jellypod. AI generates a conversational script from uploaded documents, rendered with distinct voices. You upload content manually each time.
Full AI podcast platform: Podidex. Automated source monitoring, script generation, audio synthesis, scheduling. Episodes appear without you doing anything after initial setup.
When TTS is the right tool
Sometimes all you need is a voice reading text. You have one article, you want to hear it verbatim while commuting. Or you need accessibility. Or you're proofreading your own writing by listening to it read back. Speechify, iOS Spoken Content, and browser extensions handle this fine. Nothing wrong with that.
When you need an AI podcast instead
TTS falls short when you want more than a reading. When you follow 10 blogs and want one episode that pulls the interesting bits from all of them. When you're dealing with a dense research paper that needs someone to explain the jargon. When you want episodes showing up automatically every Monday without pasting anything.
When Mistral released Voxtral Transcribe 2 earlier this month, I had coverage from Mistral's own blog, VentureBeat, and a Hacker News thread with benchmarks. TTS would give me three separate readings with a lot of overlap. Podidex gave me one episode that covered the model specs, the pricing comparison to ElevenLabs Scribe, and the HN reaction, without repeating the same benchmark numbers three times.
I listen to about 2,000-3,000 minutes of Podidex per month. If that were TTS reading articles word-for-word, I'd need twice the time to get the same information. TTS can't cut filler. It can't connect ideas across sources. It just reads.
The voice quality question
Both TTS and AI podcasts use the same underlying voice models for the actual speech synthesis. The difference in audio quality comes down to which model you use and what you're willing to pay.
ElevenLabs' Eleven v3 has emotion tags and a multi-speaker dialogue API. Google's Gemini 2.5 Pro TTS handles podcast and audiobook generation. OpenAI's gpt-4o-mini-tts is steerable with plain English instructions. Cartesia's Sonic-3 has 40ms time-to-first-audio. All premium options, priced accordingly.
Podidex runs on Kokoro, an open-source TTS model. It's small but punches well above its weight. The audio might sound slightly more mechanical than ElevenLabs, but it's a long way from the robotic TTS of five years ago. The tradeoff is intentional: running Kokoro keeps the cost low enough that you can generate thousands of minutes per month at a reasonable price. The big labs are better, but they're not 10x better, and they are 10x more expensive.
Personally, I prefer AI voices that don't try too hard to sound human. Over-polished voices feel uncanny. A voice that's clearly synthetic but well-paced and natural in its cadence sits better with me over long listening sessions.
Where this is going
I care about personalized content. Not the “personalized” that ad platforms sell, where an algorithm picks what keeps you scrolling. I mean actually personal. My news, my topics, my pace. Personalization I control, for my benefit, not to keep me glued to a feed.
That's what I'm building toward. Not a better TTS engine. A way to discover, learn, and think more. And then close the app.
You have tabs to close
Paste a URL, pick a style, listen while you do literally anything else.
Create your first episode