Naturalness from the script

10 paralinguistic axes × 5 content types × the SOTA TTS engines that render them.
A Lissin TTS technical report.

Companion pieces. This report is the quantitative follow-up to What TTS throws away (the Part 1 essay showing how ASR/STT flattens held vowels, backchannels, affect shifts, accent, and speaker routing). It also extends the g25-full paralinguistic-prompt preview (the side-by-side audio rendering site).

Abstract

Modern frontier TTS engines now carry much of the perceived naturalness that earlier systems required scripts to supply. This report asks a narrower product question: when the engine is already strong, how much incremental value does script-side paralinguistic markup add for Lissin's podcast, meditation, deep-dive, comedy-news, and lyrics lanes?

This is the quantitative follow-up to What TTS throws away. That first post established the data-loss problem: ASR/STT systems flatten held vowels, non-verbal reactions, mid-thought affect shifts, bracketed events, blended affect, accent, and speaker routing into clean text. This report does not walk that back. It tests the next claim: whether putting those signals back into scripts produces a large, measurable naturalness gain under current TTS engines.

We tested that question with a 10-axis × 5-content-type literature review, a per-axis script-stripping ablation, real-human transcript and audio comparisons, production-pipeline comparisons, an engine-floor baseline, and an adversarial citation review. The empirical work is exploratory: N=10 scripts for the corrected ablation, Gemini-family LLM judges, and 31 real YouTube clips. It is not a power-adequate human-listener benchmark.

The defensible public reading is: under Gemini-family LLM judges, script stripping produced small aggregate MOS differences, per-axis attribution was not reliable, and engine choice appeared larger than markup choice in this exploratory setup. The practical conclusion is not "scripts do not matter"; it is that script structure, genre discipline, voice choice, engine selection, and audio production need to be evaluated together, with human listeners before any external benchmark claim.

1. The product question

Modern TTS gets most of the way to human-sounding speech. The product question after Part 1 is narrower: when ASR throws away held vowels, backchannels, laughter timing, accent cues, and affect transitions, does restoring those signals in script form create a measurable improvement under current TTS engines? This report treats script-side paralinguistic markup as one exploratory variable among several: bracketed event tags, lexical disfluencies, em-dash mid-thought pivots, vowel elongation, in-text emphasis, and pacing cues.

The corrected finding is exploratory. Against Gemini 2.5 Pro TTS, removing all script-side markup changed mean MOS by 0.35 in the N=10 corrected ablation, and no single-axis confidence interval excluded zero. Later validation showed that the same scripts rendered through a different Gemini TTS lane moved more than the script-stripping delta. Engine and voice choices are therefore likely first-order product levers; script markup remains useful, but should be evaluated alongside engine, voice, and production choices.

2. What this report is and is not

This is a Lissin-product technical report on script-side investment for the podcast / meditation / deep-dive content lanes. Read it as Part 2 in a sequence:

  • Part 1: What TTS throws away (amaldavid.com/writing/what-tts-throws-away) — the published essay that shows the qualitative gap with real audio clips, three STT systems, and four TTS labs. It establishes why the script writer preserves four layers: drawn-out vowels, inline non-verbal cues, bracketed events, and paragraph-level director's notes.
  • Companion audio preview: g25-full paralinguistic-prompt preview (g25-full-paralinguistic.pages.dev) — the side-by-side rendering site for the prompt-engineered Gemini lane beside Lissin's deployed production pipeline.

The correction from Part 1 is important: the first post was right about what transcripts lose, but its line that script quality matters more than model choice is too strong as a general product claim. The evidence here says script quality matters, but engine, voice, judge design, and production layer can move as much or more than markup density.

This report is not a benchmark of every TTS engine in the market, not a human-listener MOS study, and not a confirmatory causal decomposition of naturalness. MOS is reported as a score and a delta, not as a percentage of "human naturalness." Sections 19-20 document the reproducibility manifest and the next experiments needed to turn this into a defensible external benchmark.

3. What we tried, what we found, what we'd do next

What we tried

  • 10 axis-level literature reviews
  • 5 content-type style audits (meditation / podcast / deep-dive / comedy-news / lyrics)
  • 3 TTS architecture audits (closed SOTA / Gemini family / open-source evaluation surface)
  • Per-axis script-stripping ablation at N=10 across 12 variants × 2 Gemini-family judges
  • Real human content comparison from 31 YouTube transcripts + audio across 4 genres
  • Production-lane pairwise A/B with positional control
  • Production-pipeline-vs-Prompt pairwise A/B
  • Gemini 3.1 Flash TTS engine-swap on the same scripts
  • Engine-only floor baseline (raw user prompt, no scripting)
  • Adversarial review trail with v2 rewrites
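
The positional control used in the pairwise A/B items above can be sketched as follows. This is a minimal illustration, not the report's actual harness: `judge` is a hypothetical callable that returns "first" or "second", and the rule of counting an order-dependent verdict as a tie is an assumption.

```python
from collections import Counter

def debiased_preference(judge, clip_a, clip_b):
    """Ask the judge twice with the presentation order swapped.

    `judge(first, second)` is a hypothetical callable returning
    "first" or "second" for whichever clip it prefers.
    """
    forward = judge(clip_a, clip_b)   # A is presented first
    reverse = judge(clip_b, clip_a)   # B is presented first
    a_wins_forward = forward == "first"
    a_wins_reverse = reverse == "second"
    if a_wins_forward and a_wins_reverse:
        return "A"
    if not a_wins_forward and not a_wins_reverse:
        return "B"
    # The verdict flipped with position, so treat it as positional bias.
    return "tie"

def tally(judge, pairs):
    """Count debiased outcomes over a list of (clip_a, clip_b) pairs."""
    return Counter(debiased_preference(judge, a, b) for a, b in pairs)
```

A judge that always picks whichever clip it hears first yields only ties under this scheme, which is the failure mode the control is meant to expose.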

What we found

In the corrected N=10 exploratory ablation, the aggregate original-vs-all-stripped gap was 0.35 MOS under the Pro judge, and the Flash judge saw a much smaller gap. Per-axis attribution is not reliable: no per-axis CI excluded zero, and Pro/Flash judge disagreement changed both magnitude and sign for several axes.
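
To make "no per-axis CI excluded zero" concrete, here is a minimal sketch of a paired percentile bootstrap on per-script MOS differences. The choice of bootstrap, the resample count, and the scores themselves are assumptions for illustration; the report does not state which interval method it used, and the numbers below are placeholders, not the report's data.

```python
import random
import statistics

def paired_bootstrap_ci(original, stripped, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired MOS difference."""
    if len(original) != len(stripped):
        raise ValueError("scores must be paired per script")
    diffs = [o - s for o, s in zip(original, stripped)]
    rng = random.Random(seed)
    # Resample the per-script differences with replacement and sort the means.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi

# Hypothetical placeholder scores for 10 scripts, NOT the report's data.
original = [4.2, 3.9, 4.5, 4.0, 4.4, 3.8, 4.1, 4.3, 3.7, 4.6]
stripped = [4.0, 4.1, 4.2, 3.6, 4.4, 3.9, 3.5, 4.3, 3.8, 4.0]
mean_diff, lo, hi = paired_bootstrap_ci(original, stripped)
excludes_zero = lo > 0 or hi < 0  # the criterion the text refers to
```

With only 10 paired scripts the interval is wide, which is why a small aggregate delta can coexist with per-axis intervals that all straddle zero.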

The most useful product signal is that engine and evaluation design dominate the observed differences. Gemini 3.1 Flash TTS scored above Gemini 2.5 Pro TTS on the same scripts in the later validation lane, while prompt and production paths were close under the less halo-prone Flash pairwise tests. The report therefore supports a pragmatic roadmap: test engines and voices first, keep script markup disciplined by genre, and validate the final choice with humans.

Against Part 1 specifically: the data-loss diagnosis still holds. ASR-derived real transcripts undercount held vowels, bracketed events, non-verbal timing, and accent. What changes is the size of the downstream claim. The qualitative clean-vs-enhanced demos in Part 1 justify building the script writer; the ablation here says the external headline must be smaller until humans validate it.

What we could not validate

  • Full cross-engine same-script benchmark through ElevenLabs v3 and Cartesia Sonic 3.5 — API keys were not present in the evaluation environment. Part 1 includes short clean-vs-enhanced demos across Gemini, ElevenLabs, and OpenAI, but those are qualitative examples, not the 10-script quantitative benchmark this report needs.
  • Open-source synthesis probe — the dependency image build did not complete in the available window. The staged entrypoint remains re-runnable, and Higgs Audio v2.5 remains a practical alternative probe.
  • A power-adequate confirmatory human-listener study — outside the scope of this exploratory run.

What we'd do next

  • Blind human listener panel, balanced by genre and randomized order.
  • ElevenLabs v3 + Cartesia Sonic 3.5 cross-engine renders on the same scripts.
  • Voice-cloning and audio-production A/B tests before adding more script-side complexity.
  • 1-10 or MUSHRA-style listening rubric to escape 5-point ceiling effects.

4-22. The full report

The sections that follow contain the per-axis literature reviews, content-type playbooks, TTS engine audits, empirical ablation, decision matrix, methodology trail, and validation phase findings against real human audio and across judges.

4. The 10 paralinguistic axes

For each axis: definition, frequency in natural speech, literature consensus on its naturalness contribution, and how the SOTA TTS engines render it.

[Figure 1: interactive streamgraph, not reproduced here. Paralinguistic load shifts by content type; each ribbon carries a 0-5 score by register, and totals are descriptive audit priorities. Totals: Meditation 22, Podcast 33, Deep-dive 24, Comedy-news 45, Lyrics 28. Ribbons (axes): pacing tags, emotion / delivery, discourse markers, in-text emphasis, vowel elongation, non-verbal vocalizations, audible reactions, mid-thought pivots, filled pauses, conversational fillers.]
Figure 1. Streamgraph of paralinguistic load by content type; ribbon width is the curated 0-5 axis score.

4.1 Vowel elongation

Key takeaways

  • Vowel elongation is a robust affect cue and a robust prominence cue, but only a weak naturalness cue. In the one well-controlled lab study that directly measures it (Ko 2021), doubling vowel duration drops MOS by ≥1 point in PSOLA-stretched monosyllables. Whether the same loss carries over to neural TTS rendering of conversational utterances is untested. [Row 17, Ko 2021; §1]
  • In CMC, orthographic lengthening is dense, statistically structured (Lamontagne & McCulloch 2017), strongly correlated with subjectivity/sentiment (Brody & Diakopoulos 2011), and constitutes a written analogue of spoken phoneme extension (Kalman & Gergle 2014).
  • In TTS, only Gemini 2.5/3.1 Flash TTS documents native handling of grapheme-repetition spellings ("Beauuutiful morning") [Row 27, verified verbatim]. ElevenLabs v3 relies on bracket audio tags, not grapheme repetition. CosyVoice 2, F5-TTS, and IndexTTS2 expose duration-prediction modules but do not document a grapheme-repeat → vowel-duration mapping [inferred from arch docs]. Whether Gemini's grapheme-repeat handling actually sounds better than ElevenLabs's tags or F5-TTS's automatic per-character duration is an open empirical question, not a settled finding.
  • The literature is silent on content-type variation: nobody has measured whether listeners reward "Sooooo grateful" in a meditation script the way they punish it in news read-aloud.
  • Open problem #1: there is no published listening study that crosses (orthographic lengthening present/absent) × (content style: meditation / news / comedy / lyric) × (TTS engine) and measures MOS or CMOS.

What the literature says

How robust is vowel elongation as a naturalness cue (vs. an affect cue)?

The literature points in two directions, and one of those directions rests on a single well-controlled lab study. Vowel elongation is robust as an affect and prominence cue: vowel duration is one of the three canonical acoustic correlates of prosodic prominence in English (Klatt 1976; Liberman & Streeter 1978; Beckman & Pierrehumbert 1986) and is used gradiently — not binarily — to mark emphasis ("She's sooooo cool"; LSA AMP 2014 emphatic-lengthening proceedings). Wennerstrom 2001 documents vowel-duration extremes as a speaker-stance device and reports final lengthening at 120–150% of mean syllable duration. On the typological side, Dingemanse and the ideophone literature show that vowel-and-consonant lengthening as an iconic intensifier is cross-linguistically widespread. So when a writer types "Soooo grateful," they are reaching for a device with deep linguistic precedent. The CMC literature confirms that this device transfers cleanly into typed registers: Brody & Diakopoulos 2011 found word lengthening strongly associated with subjectivity and sentiment on Twitter; Kalman & Gergle 2014 framed it explicitly as the written analogue of phoneme extension; Lamontagne & McCulloch 2017 showed the lengthening is not random — it preferentially targets vowels (sonority-driven), aligns with phonological constituents (nuclei, rhymes), and is structured by orthotactics; McCulloch 2019 treats it as foundational to "typographical tone of voice."

The perceptual-naturalness side is much thinner. One well-controlled lab study (Ko 2021, Phonetics and Speech Sciences) ran five experiments on English monosyllabic stop-final words. In Experiment 2, MOS for 200%-lengthened vowels was 2.60, original tokens 3.80, and 50%-shortened tokens 3.53: a delta of 1.20 MOS points between original and lengthened, with β = –0.65, t = –4.0, p < 0.001 for the lengthened-vs-shortened contrast. Shortening sounds more natural than lengthening; lengthening + question intonation interacts negatively. This is the only paper in the corpus that directly measures naturalness MOS as a function of vowel lengthening. Important caveat the v1 synthesis under-stated: these are PSOLA-stretched isolated monosyllabic stop-final words. Whether the same loss carries into neural TTS rendering of conversational utterances inside longer prosodic contexts is untested (cf. Open Problem #5).

The "duration is only a strong naturalness cue when it co-varies with F0" claim that ran through the v1 synthesis was attributed to Roettger & Cole (Row 15), but no specific paper or DOI is locatable on second-pass review — that supporting plank is currently UNVERIFIED and should be treated as conjectural until pinned.

What's the consensus on perceptual effect? A naive listener does notice "Soooo" sounds longer than "So" — that part is robust (Lin, Wang, Yan & Kambara 2021: perceived "length" jumps from 1.75 → 3.69 on a 5-point scale, p<0.001 for Japanese sound-symbolic LV vs. SV). The affective gain (valence, arousal) in the same study is smaller in magnitude than the length-perception gain. So lengthening is a strong iconic signal of duration; affect intensification appears to require the rest of the prosodic package, but the strength of that claim rests on one paper measuring an adjacent construct (perceived length, not naturalness MOS), so I am hedging it.

Which TTS systems implement it natively? Gemini 2.5 / 3.1 Flash TTS is the only mainstream system whose official prompt guide explicitly documents grapheme repetition as a control surface — the "Beauuutiful morning" example appears in Google's Director's Notes style guide for the Vocal Smile voice. ElevenLabs v3 uses inline audio tags ([excited], [whispers]) that operate at word-window granularity; community reports describe a ~4–5-word scope but the official docs page that previously stated this returns 404. CosyVoice 2, F5-TTS, and IndexTTS2 (2025) all expose explicit duration-prediction modules that can stretch arbitrary phones, but none of them publish a documented "Sooo → 1.4× duration on the /o/" mapping. F5-TTS will physically stretch a "Sooo" because each grapheme gets its own duration slot, but whether the result sounds natural is unmeasured. Tuttösí et al.'s 2025 L2 clear-TTS work is the closest published study of "how much lengthening is too much" but targets clarity, not affect, and the specific 1.2×/1.6× stretch ratios I cited in v1 are not verifiable from the abstract. Whether Gemini's grapheme-repeat handling actually sounds better than ElevenLabs's tags or F5-TTS's per-character duration is an open empirical question, not a settled finding.
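
For the script-stripping side of this axis, a minimal sketch of how grapheme-repetition spellings might be detected and collapsed in a script. This is an illustration under stated assumptions, not the report's ablation code: the function names are hypothetical, and the threshold of three repeated vowel letters is an assumption chosen so that ordinary doubles ("good", "see") are left alone.

```python
import re

# Three or more repeats of the same vowel letter, e.g. "ooooo" in "Sooooo".
# The threshold of 3 is an assumption so ordinary doubles ("good") survive.
ELONGATED_VOWEL = re.compile(r"([aeiou])\1{2,}", re.IGNORECASE)

def find_elongations(script):
    """Return the words that contain a run of 3+ identical vowel letters."""
    words = re.findall(r"[A-Za-z]+", script)
    return [w for w in words if ELONGATED_VOWEL.search(w)]

def strip_elongations(script):
    """Collapse each elongated run to a single vowel letter."""
    return ELONGATED_VOWEL.sub(lambda m: m.group(1), script)

print(strip_elongations("Beauuutiful morning"))  # Beautiful morning
print(find_elongations("Sooooo grateful"))       # ['Sooooo']
```

Collapsing to a single letter is lossy for words whose normal spelling already doubles the vowel ("cooool" would become "col"), so a real pipeline would likely need a dictionary check before rewriting.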

Content-type variation. Nobody has measured this directly. The v1 summary leaned on Pickering 2009 for the comedy case — that citation was wrong on author, year, and direction. The actual paper (Attardo & Pickering 2011, Humor 24(2):233–250) reports no significant evidence of pre-punchline pause or speech-rate change in stand-up delivery. The "comedy listeners reward vowel lengthening" inference from v1 is therefore unsupported and is withdrawn. Trouvain & Truong 2014 still documents strong final-syllable lengthening for laughter and nonverbal vocalisations, but that is laughter, not punchline delivery. DeepASMR (2026) demonstrates SOTA ASMR naturalness; the inference that whispered/longer vowels raise MOS for ASMR content is plausible but not directly stated in the abstract — treat as inference. There is no published meditation-TTS study with grapheme-repeat manipulation in the input text.

Bottom line for our report. Vowel elongation in input text is doing real linguistic work — it's a written reflex of a typologically deep prominence/affect device. The single direct piece of perceptual evidence (Ko 2021) says doubling vowel duration via PSOLA drops MOS by ~1.2 points on isolated monosyllables; whether neural TTS engines absorb that loss via prosodic recomposition is the central untested question. The engineering claim of the report — "Gemini's interpretive grapheme-repeat handling beats engines that mechanically stretch" — is plausible but unmeasured, and the required experiment (Open Problem #1) is the highest-value next step.

Open questions for this axis
  1. No crossed listening study. No published work measures (orthographic lengthening present / absent) × (content style: meditation / news / comedy / lyric) × (TTS engine: Gemini, ElevenLabs, OSS) on MOS or CMOS. This is the single highest-value experiment for our naturalness report.
  2. No documented input-output mapping for ElevenLabs. The behavior of "Soooo" vs. "So" in v3 is undocumented; community reports are inconsistent. A clean A/B at fixed seed would settle it.
  3. No iconicity-gated handling. No engine appears to respect vowel-quality gating; the underlying claim (Sinn und Bedeutung 2021) is itself underpowered pilot data and should not be promoted to a normative TTS requirement until confirmed.
  4. No content-aware emphasis budget. Engines have no "this is meditation, lengthening is welcome" / "this is news read-aloud, lengthening is a typo" register switch.
  5. No quantification of MOS loss from naive lengthening in modern neural TTS. Ko 2021 is PSOLA on isolated monosyllables; neural TTS in conversational context is the untested case. Expected range based on Ko 2021: ΔMOS ≈ 1.0–1.2 per doubled vowel, but neural TTS may absorb some of this via prosodic recomposition (untested).
  6. No measurement of lengthening + accompanying pitch shape. The v1 claim that "duration is only a strong naturalness cue when it co-varies with F0" was attributed to Roettger & Cole but the specific paper could not be located on review; pin a specific citation or drop the claim.
  7. No cross-cultural data. Almost all CMC and TTS lengthening research is English (or Japanese for sound-symbolic). Whether Hindi/Tamil "haaaan" produces the same naturalness-vs-affect tradeoff is open.
  8. No replacement source for the comedy content-type case. With Attardo & Pickering 2011 removed as a positive-evidence source, the "comedy rewards lengthening" arm of the content-type argument is currently unsupported. Conversation-analytic work (Local 2003 / Local & Walker on prosodic delay) is a candidate replacement but has not been audited here.

4.2 Non-verbal vocalizations

Key takeaways

  • Frequency in natural speech. Laughter alone occurs at roughly 0.5 laughs/minute in spontaneous dyadic conversation among friends (Vettin & Todt, 2004) [citation]. The high cross-sex stranger rate of ~7.5 laughs/minute traces to Grammer & Eibl-Eibesfeldt (1990) cross-sex courtship work, not to Vettin & Todt 2004. Breath is several times more frequent than laughter in modern NV-annotated corpora: the NonverbalTTS corpus reports 3,612 breath instances vs. 1,018 laugh instances over 17 hours (roughly 3.5×) [citation]. About 80–90% of natural laughter is conversational, not joke-related, per Provine (2000), *Laughter: A Scientific Investigation* (the canonical source), not the 1996 American Scientist piece, which only summarised the work. NVs are pragmatic glue, not punchlines.

What the literature says

How frequent are non-verbal vocalisations in natural speech?

Two frequency estimates anchor the field. Vettin & Todt (2004) logged dyadic conversation and report laughter at ~5 laughs per 10 minutes (≈0.5/min) in established friendships [citation]. The high-rate ~7.5 laughs/min in opposite-sex stranger pairs is re-attributed here to Grammer & Eibl-Eibesfeldt (1990) cross-sex courtship work, not Vettin & Todt 2004. Provine (2000), *Laughter: A Scientific Investigation* is the canonical source for the ~80–90% non-humor laughter finding; the 1996 American Scientist piece is a popular summary, not the primary source for that specific figure. Bryant et al. (2018) PNAS adds that listeners cross-culturally can decode friend-vs-stranger status of co-laughter from short clips, which sharply limits the "use a single laughter rate" simplification.

In modern TTS-oriented corpora the relative frequencies of NV types are reportedly stable: in the 17-hour NonverbalTTS corpus, breath ~3,612 instances vs. laughter ~1,018 vs. ~400 sniffs vs. ~200 each cough and throat-clearing [unverified / needs PDF check]. Sighs are under-represented in older affect-recognition corpora — IEMOCAP reportedly contains only two sigh-labelled utterances [unverified — figure widely repeated, not directly verified], which is why dedicated NV corpora (VocalSound, NonverbalTTS, SMIIP-NV, Emilia-NV) became necessary in 2022–2026.

A useful translation for TTS budget (subject to the caveats below): at conversational baseline (0.5 laughs/min) a 20-minute podcast monologue should average ~10 laugh events; at comedic baseline (~3–5×) 30–50; at meditation / explainer baseline (close to 0) fewer than 2. Caveat (per review): the 0.5/min figure derives from Western dyads, low-stakes social context, with observer-aware speakers. Rates in podcast monologue, audiobook narration, customer-support calls are unknown and not safely interpolated; treat the per-content-type table as inferred guidance pending empirical study (see "Open problems").
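
The budget arithmetic above can be sketched as a one-line helper. The function name and parameters are hypothetical; the 0.5/min baseline and the 3–5× comedic multiplier come from the text and carry the same caveats.

```python
def laugh_budget(minutes, baseline_per_min=0.5, multiplier=1.0):
    """Expected laugh events for a script of the given length.

    baseline_per_min is the Vettin & Todt friends-dyad rate quoted in the
    text; multiplier scales it for a content type (an inferred value).
    """
    return minutes * baseline_per_min * multiplier

conversational = laugh_budget(20)             # 10.0
comedic_low = laugh_budget(20, multiplier=3)  # 30.0
comedic_high = laugh_budget(20, multiplier=5) # 50.0
```

Because the baseline comes from Western friend dyads, the output is a planning number for A/B tests, not a measured target for any content lane.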

Which TTS models actually generate NVs natively, and how?

There are three coexisting paradigms [paradigm boundaries cleaned up per review]:

(i) Inline-tag conditioning in LLM-style TTS.

  • Open research: CosyVoice 2 / 3, NVSpeech / CV2@Emilia-NV [citation].
  • Vendor systems: ElevenLabs v3, Gemini 2.5 Pro TTS [vendor claim × 2 — no independent eval]. All consume bracket tags ([laughs], [sighs], [breath], [gasp], [clears throat], [cough], [whispers], [mhm]) inline in the prompt and report emitting the corresponding acoustic event in-place. ElevenLabs additionally documents finer variants ([laughs harder], [starts laughing], [wheezing], [snorts]) [vendor claim]. Gemini-TTS reports blending tags with natural-language directives like "react with an amused laugh" [vendor claim — comparison vs bare tag is uneval'd vendor self-report].

(ii) Frame-level auxiliary conditioning. Microsoft's ELaTE (Hao et al., 2024) adds a frame-level laughter-detector embedding as a side conditioning signal to a flow-matching zero-shot TTS [citation]. EmoCtrl-TTS (Hao et al., 2024) extends this with arousal/valence + laughter embeddings for time-varying emotion [citation].

(iii) Transcript-free audio language modelling (implicit NV). AudioLM preserves NVs implicitly because it is a pure audio language model trained without transcript supervision.
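
For paradigm (i), a minimal sketch of how inline bracket tags might be located in, or stripped from, a script before rendering. The regular expression and function names are assumptions for illustration; neither vendor documents a tag grammar in this form, and the pattern only covers lowercase tags like the ones listed above.

```python
import re

# Matches lowercase bracket tags such as [laughs] or [clears throat].
# This grammar is an assumption, not a vendor specification.
TAG = re.compile(r"\[([a-z][a-z ]*)\]")

def extract_tags(script):
    """Return (tag, character offset) pairs in order of appearance."""
    return [(m.group(1), m.start()) for m in TAG.finditer(script)]

def strip_tags(script):
    """Remove tags and collapse the whitespace they leave behind."""
    without_tags = TAG.sub("", script)
    return re.sub(r"\s{2,}", " ", without_tags).strip()

script = "Well [sighs] that went fine. [laughs harder]"
tags = extract_tags(script)
plain = strip_tags(script)
```

The stripped variant is the kind of input a script-stripping ablation would render next to the tagged original; a tag placed directly before punctuation would leave a stray space that this sketch does not repair.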

Are there clear MOS deltas when a TTS injects NVs vs. not?

The honest answer: directional yes, effect sizes mostly missing.

  • Breath. Yamamoto et al. (2024) report breath-conditioned VITS > vanilla VITS in naturalness MOS [directional only — no published delta cited in axis].
  • Filled pauses. Saito et al. (2022): filled pauses raise naturalness MOS, degrade linguistic-region MOS when misplaced [directional only — no published delta cited in axis]. AdaSpeech 3 reports MOS gain on spontaneous-style adaptation attributable to its FP predictor [directional only — no published delta].
  • Laughter. ELaTE reportedly achieves SIM 0.796 vs. 0.489 and AutoPCP 3.24 vs. 2.31 against Seamless Expressive on AliMeeting laughter [unverified / needs PDF check: numbers not in arXiv abstract; not extracted from full PDF in review budget], with WER on neutral speech reportedly unchanged. NVBench (arXiv 2604.16211) across 15 systems finds NV controllability decouples from base MOS: high-MOS TTS often misplaces, mistypes, or mutes its NVs. Caveat: this decoupling is a one-paper finding from a research line with a stake in the markup paradigm; treat as suggestive, not replicated.

Cleanest summary: inserting NVs reliably raises naturalness *when the model was trained to expect them at that token position*, and need not harm neutral quality when conditioning is additive (ELaTE-style). Inserting NV tags into a base model that wasn't trained on them produces garbled output, audible glitches, or silent ignores (NVBench finding).

Rate / placement across content types

Following the review's pushback: no published rate-per-content-type table exists. The list below is inferred guidance, derived from two laughter-only papers and not from any content-type corpus.

  • Meditation / sleep / instructional: NVs should be ~0; even one [laughs] violates register. [inferred]
  • News / documentary / explainer: [breath] only, at clause boundaries; laughter rare. [inferred]
  • Conversational podcast / interview: baseline ~0.5 laughs/min, breaths every 6–10 s, occasional [sighs] and [mhm] as turn-taking glue. [inferred]
  • Comedy / banter / improv: 3–5× the conversational baseline; [laughs], [chuckles], [snorts] cluster after punchlines and during overlapping-talk segments. [inferred]
  • Audiobook narration: NVs are character-coded, not narrator-coded — emerge inside quoted dialogue, not in description. [inferred]

Strongest evidence that script markup (not audio cloning) is the unlock (hedged)

Three pieces of evidence converge within the open-research line, with caveats:

  1. NVSpeech (Liao et al., August 2025, arXiv 2508.04195) demonstrates the same CosyVoice2 base gains controllable, token-position laughter/breath from training on word-level tagged transcripts — gating is the bracket token in the text stream [citation].
  2. NonverbalTTS (deepvk, 2025) shows 17 hours of NV-annotated text-aligned data is reportedly sufficient to bring an open TTS to parity with a closed commercial system [33.4%/35.4% — unverified / needs PDF check].
  3. NVBench (arXiv 2604.16211) measures controllability orthogonal to MOS; tag-conditioned systems dominate controllability regardless of base MOS.

Caveats (per review):

  • Author overlap exists across the NVSpeech / NVBench research line (e.g., Zhizheng Wu et al.); treat as "open-research consensus from one research line," not field consensus.
  • ELaTE / EmoCtrl-TTS (audio-prompt-based NV control) are a partial counterexample. The axis cites SIM and AutoPCP numbers in ELaTE's favor vs. Seamless, but no published head-to-head between markup-conditioned (CosyVoice2-Emilia-NV) and audio-prompt-conditioned (ELaTE) systems exists. The "markup is the unlock" claim is therefore a preference about interface, not a quality verdict.

Open questions for this axis
  • Placement is not synthesis. All current corpora annotate NV positions and types, but no public corpus jointly annotates appropriateness. NVBench is the closest, but it evaluates synthesis given placement, not placement itself.
  • Cross-cultural rates. Vettin & Todt and Provine are Western-English-anchored; rates for tonal languages and East Asian conversational norms are under-quantified despite Mandarin corpora (Emilia-NV, SMIIP-NV) existing. Bryant et al. (2018) PNAS shows cross-cultural co-laughter decoding works but doesn't quantify rate differences.
  • Long-duration affective NVVs (sustained crying, prolonged groans) remain a failure mode even in 2026 SOTA per NVBench.
  • Sigh under-representation. IEMOCAP reportedly has 2 sigh utterances [unverified]; sigh is conversationally common but corpus-rare, so models systematically under-produce it.
  • Speech-laughs vs. isolated laughs. Speech-laughs (chuckles intermingled with words, à la EmoV-DB's amused style) are perceptually more "amused" than isolated bursts (Adigwe et al., 2018), but most tag schemes treat [laughs] monolithically. The spontaneous-vs-volitional distinction in Bryant & Aktipis (2014) motivates a [laughs]/[chuckles] separation.
  • No published rate-per-content-type table. The 0.5 laughs/min figure is a single baseline from Western dyads; podcast/comedy/meditation rates are inferred, not measured.
  • Effect sizes are missing across nearly every MOS-gain claim in the published axis literature. Breath VITS, filled pauses, AdaSpeech 3 FP — none cite a delta in the present compilation. Required for any "shipping vs. noise" judgment [flagged per review].
  • ELaTE and NonverbalTTS headline numbers (0.796/0.489, 3.24/2.31, 33.4%/35.4%, SIM-o 0.89/0.85) remain unverified against the full paper PDFs within the review's WebFetch budget. Needs second pass [flagged per review].

4.3 Filled pauses

Scope: lexical filled pauses (FPs) only. Silent pauses are covered under the pacing-tags axis; discourse markers (so/well/yeah/like) under a separate axis. This summary covers ~28 sources spanning psycholinguistics, TTS synthesis, and perception, with quantitative base rates where available.

What the literature says

2.1 Base rates — what counts as "natural"

Across spontaneous adult English corpora, lexical filled pauses occur at a remarkably stable rate of roughly 2–6 per 100 words, depending on task and speaker. Shriberg (1994) reports 1.6–2.2 FPs/100 words across three corpora including Switchboard [citation]. Bortfeld et al. (2001) measure an overall disfluency rate of 5.97/100 words in dyadic referential tasks, of which the filler fraction is ~2.5/100 words [citation]. Eklund's primary Swedish ATIS rate is 3.6/100 words; the often-quoted ~6% cross-corpus figure is a summary number not cleanly attributable to Eklund's primary measurement [citation]. Per minute, at a normal conversational rate of ~150 wpm, this maps to roughly 3–9 FPs/min.
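
The per-100-words to per-minute conversion above is simple enough to sketch. The function name is hypothetical, and the 150 wpm default is the conversational rate assumed in the text.

```python
def fps_per_minute(fps_per_100_words, words_per_minute=150):
    """Convert a filled-pause rate per 100 words into a per-minute rate."""
    return fps_per_100_words * words_per_minute / 100

low = fps_per_minute(2)   # 3.0
high = fps_per_minute(6)  # 9.0
```

A slower narration rate scales the result down proportionally, so the 3–9 FPs/min band only applies at the assumed speaking rate.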

The HCRC Map Task is sometimes cited at ~1.3 FPs/100 words with very large between-speaker spread (0.18–6.66). I cannot pin this number to a single primary source in this pass (likely Lickley 2001 / Branigan, Lickley & McKelvie 1999 on Map Task) — treat as [unverified] until confirmed. Individual variance dwarfs cross-corpus means in any case.

PodcastFillers (Zhu et al., 2022) provides ~35K annotated FPs across ~145 hours of speech drawn from 199 podcast episodes — corrected from the original axis's "199 hours" error. Implied FP density is roughly 35,000 / (145 × 60) ≈ 4.0 FPs/minute, not 3/min as previously inferred. Meditation content is the natural minimum: read-aloud / scripted speech is essentially FP-free (Maclay & Osgood 1959 already note FPs as a spontaneous-speech-only phenomenon). Verdonik et al. (2025) find spontaneous and improvised speech have significantly higher FP frequency than scripted TV dialog — but the corpus is Slovenian, so generalization to English is suggestive, not conclusive [citation].
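
The implied density can be checked with a short calculation. The function name is hypothetical; the inputs are the approximate corpus totals quoted above, so the result inherits their rounding.

```python
def corpus_fp_density(fp_count, hours):
    """Filled pauses per minute of audio across a whole corpus."""
    return fp_count / (hours * 60)

# Approximate PodcastFillers totals as quoted in the text.
density = corpus_fp_density(35_000, 145)
print(round(density, 2))  # 4.02
```

This is a corpus-wide average over all audio, so per-episode and per-speaker rates can sit well above or below it.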

Goldwater, Jurafsky & Manning (2010) provide an important counterweight: FPs disproportionately drive ASR error, even when listeners find them naturalness-positive [citation]. Pipelines that boost FP density for naturalness should expect downstream transcription degradation.

2.2 uh vs um — Clark/Fox Tree and unresolved tensions

Clark & Fox Tree's (2002) headline claim is that uh and um are conventional interjections, with um signaling a longer upcoming delay than uh. The finding has been partially replicated (Tottie 2011 for British English Lund) but is weaker in American English. Corley & Stewart (2008) argue listeners are sensitive to the distinction without it being intentionally produced — the side-effect view. Wieling et al. (2016) document um rising over time relative to uh across six Germanic languages, led by women and younger/educated speakers.

Tension: does the sound form matter, or just the timing? Watanabe et al. (2008) (Row 14) and Fox Tree 2001 (Row 15) disagree. Watanabe is sometimes read as timing-only, but the paper actually argues FPs are a stronger/distinct cue from equal-duration silent pauses — the original axis's "equal-duration silent pause produces same effect" framing was an oversimplification. Fox Tree 2001 shows the form matters: uh (but not um) speeds recognition of the following content word. Best current reading: form matters, and silent pauses do not fully substitute for FPs; the Watanabe timing-equivalence claim was overstated.

Practical implication for script writing: place um before complex/longer-planned material (lists, abstractions), uh before short repairs or single-word retrievals. Listeners expect a complex constituent after a filler, so an um followed by a one-word completion sounds wrong (Watanabe et al.).

2.3 Does inserting uh/um actually raise naturalness MOS?

Contested. The honest reading of the literature is: FPs reliably improve recall (Fraundorf & Watson 2011) and perceived spontaneity (Hassan, Lison & Halvorsen 2024), but improve naturalness MOS only for some content types and only when the engine renders them as real disfluencies. Schettino et al. (2023) report that the quality-rating gain washes out in long-form between-subjects designs; recall and perceived spontaneity remain the durable wins. Adell, Bonafonte & Escudero (2007) report perceptual gains from inserting predicted FPs in concatenative synthesis, with the headline 96%/58% placement precision/recall (Pr/Re) figures UNVERIFIED on this pass. Matsunaga et al. (2022/2023) add an important caveat: personalized FP distributions beat generic ones at matched FP count, though the effect size needs independent confirmation.

Dall et al. (2014b) provide the diagnostic negative result: in natural speech, FPs speed listener reaction times to upcoming words versus silent pauses; in synthetic speech the sign flips. Caveat: this is one experiment on HMM/concatenative-era TTS — modern neural codec TTS engines render FPs differently. The hyperbolic "a robotically rendered uh is worse than no uh at all" gloss should be hedged: it holds for the era and engines Dall tested.

2.4 Content-type variation — hypotheses, not facts

The numbers below are hypotheses for A/B testing, not measured facts for the Lissin corpus. Several depend on [inferred] extrapolations that need empirical confirmation.

  • Meditation / scripted narration: hypothesized ~0 FPs/min. Adding any audible uh/um likely breaks the genre.
  • Audiobook / read journalism: hypothesized ~0–0.5 FPs/min.
  • Podcast monologue (Lissin baseline): hypothesized ~2–4 FPs/min, with a possible slight um bias. PodcastFillers density ≈ 4 FPs/min provides the cross-corpus anchor.
  • Conversational podcast / interview: hypothesized ~3–6 FPs/min.
  • Comedy / improv: hypothesized 6+ FPs/min (Verdonik 2025 directional, Slovenian).

These need confirmation against the actual measured Lissin baseline before being treated as prompt rules.
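
One way to turn the hypothesized ranges into an A/B-ready check is a small linter that counts uh/um tokens and compares the per-minute rate to the range for a content type. This is a sketch under stated assumptions: the dictionary keys, the function names, and the regex are illustrative, the ranges are the hypotheses listed above (not measurements), and the render duration must be supplied by the caller.

```python
import re
from typing import Optional, Tuple

# Hypothesized FP/min ranges from the list above. These are hypotheses for
# A/B testing, not measured Lissin values. None means no upper bound stated.
HYPOTHESIZED_FP_PER_MIN: dict[str, Tuple[float, Optional[float]]] = {
    "meditation": (0.0, 0.0),
    "audiobook": (0.0, 0.5),
    "podcast_monologue": (2.0, 4.0),
    "conversational_podcast": (3.0, 6.0),
    "comedy_improv": (6.0, None),
}

# Crude pattern: matches "uh", "um", and lengthened forms such as "umm".
# It will also count the first half of "uh-huh", so treat counts as rough.
FP_PATTERN = re.compile(r"\b(?:uh+|um+)\b", re.IGNORECASE)


def fp_rate(script: str, duration_minutes: float) -> float:
    """Filled pauses per minute for a script with a known render duration."""
    if duration_minutes <= 0:
        raise ValueError("duration_minutes must be positive")
    return len(FP_PATTERN.findall(script)) / duration_minutes


def check_fp_rate(script: str, duration_minutes: float, content_type: str) -> str:
    """Return 'below', 'within', or 'above' relative to the hypothesized range."""
    low, high = HYPOTHESIZED_FP_PER_MIN[content_type]
    rate = fp_rate(script, duration_minutes)
    if rate < low:
        return "below"
    if high is not None and rate > high:
        return "above"
    return "within"


sample = "So, um, the thing is, uh, we tried it. Um, it worked."
print(check_fp_rate(sample, 1.0, "podcast_monologue"))
```

The ranges should be replaced with the measured Lissin baseline once it exists; until then the check only flags departures from a hypothesis.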

2.5 Which TTS engines render uh/um as real disfluencies?

Honest tier with hedging:

  • Self-reports of native FP rendering: Sesame CSM 2025 ("Maya") — Sesame's own research note (Feb 2025), no independent benchmark. Mistral Voxtral — self-reported 62.8/37.2 blind preference vs ElevenLabs Flash v2.5; the FP-attribution is the author's inference, not Mistral's claim. Both should be treated as marketing-tier evidence until independent A/B.
  • OpenAI gpt-4o audio: prior axis claimed it renders FPs more naturally than read-TTS but is more scripted than Sesame. No peer-reviewed comparison supports this ordering. Drop the comparison until A/B evidence exists.

Actionable rule for the Lissin pipeline: do not insert uh/um into text routed to a read-style TTS; experimentally test FP insertion (sparingly, hypothesized ~2–3/min) against the target engine. A/B before scaling FP density up.

Open questions for this axis
  1. What is the actual measured FP/min rate in the Lissin baseline corpus? Compare to PodcastFillers' ~4/min. This must be measured before issuing prescriptive prompt rules.
  2. Which engines in the current TTS audit pass the Dall-2014 RT-flip test? Replicate against modern neural-codec TTS.
  3. Is um/uh ratio in the Lissin corpus consistent with the Wieling 2016 Germanic trend? Tottie 2011 cautions American English shows the weaker um/uh differentiation.
  4. Confirm Adell 2007 placement Pr/Re (96%/58%), Dall 2014b RT-flip effect size and N, Matsunaga effect size, Yang et al. MOS scale + comparator, Schettino et al. title — all UNVERIFIED on second pass.
  5. Pin the HCRC Map Task 1.3/100w figure to its primary source (likely Lickley 2001 / Branigan et al. 1999).

4.4 Discourse markers

Scope. Inline pragmatic items that manage conversation flow, turn-taking, stance, contrast, agreement, topic shift: so, well, yeah, oh, huh, hmm, right, look, okay, anyway, actually. Not covered here: filled pauses uh/um (axis 03), conversational fillers like / you know / I mean / sort of (axis 05), audible reactions wow / oh! (axis 10).

Why DMs matter for TTS naturalness. DMs are arguably the single most under-rated paralinguistic axis in synthesis research. Unlike uh/um, DMs are lexical — already in the orthographic transcript — so a TTS model can read them with no markup change. But the question is whether the model assigns them their DM prosody (e.g. clause-initial reduced "well" with falling-then-flat contour) versus their content-word prosody. Human listeners use this prosodic distinction to decode stance, irony, dispreference, news receipt, and topic boundary.

What the literature says

1. Base rates by content type

The corpus literature converges on a clear ordinal ordering of DM density across registers (Biber 1988; Fox Tree 2010): casual conversation > interview > lecture > read-aloud > scripted formal speech. Precise per-100-word numbers depend on which DMs are counted in scope. From the Spoken BNC2014 [Love et al. 2017], adult casual conversation in British English includes and, so, but as the top monosyllabic discourse-functional items, with yeah, right, well, okay, actually populating the receipt / dispreferred / counter-expectancy slots. The DM-ranking attribution is shifted here from Love et al. (a corpus-description paper) to downstream analyses using BNC2014.

The "44.1% of task-oriented turns are DM-prefaced" claim previously attributed to Heeman & Allen 1999 has been corrected: Heeman & Allen's headline metric is ~97% identification accuracy and ~96% precision on DM detection — these are identification metrics, not base rates. The base-rate question is unresolved by that paper.

Mapping onto our 5 content types — ordinal labels rather than spurious decimals, per review:

  • Meditation: very low DM density. Guided meditation is scripted, monologic, formal-register. Soft openers (So, let's begin.) and topic transitions are the rare exceptions. (No meditation-corpus base rate cited.)
  • Podcast (conversational): mid-to-high DM density. Two-host podcasts approximate casual conversation; solo podcasts (interview-style monologue) sit lower. So, right, well, okay, yeah dominate.
  • Deep-dive / explainer: low-to-mid DM density. Analytical content suppresses interactional DMs (yeah, right) but keeps inferential ones (so, well, actually) for argument flow.
  • Comedy / improv: high DM density. DMs are weaponized as comedic delivery devices.
  • Song lyrics: very low DM density. Lyrics suppress prosaic DMs.

Specific per-100-word numbers for each tier are a measurement gap — see open questions.

2. Does DM density predict naturalness MOS?

There is no single published study that regresses naturalness MOS on DM count while controlling for content. The closest indirect evidence:

  • Spontaneous-style TTS that explicitly models DMs (SponTTS [citation]) outperforms read-speech baselines on naturalness MOS for podcast-style targets.
  • Fox Tree & Schrock 1999 [citation] shows DMs improve listener processing speed — a precursor to naturalness perception, not naturalness itself.
  • Adell et al. 2007 [citation] reports MOS uplift when DMs and uh/um are jointly inserted, but does not decompose the contribution.

Best inference: DM presence likely has a non-linear effect on naturalness — below the genre-appropriate rate, output sounds robotic; above it, rambling. The sweet spot is genre-conditioned. This remains a hypothesis pending direct measurement.

3. How do current TTS engines handle sentence-initial "Well, here's the thing —"?

Claims here are author observations from informal probing — not published evaluations. They should be verified with the engine A/B in §Open questions before being treated as engineering inputs.

  • ElevenLabs v3 — no DM-specific markup; relies on [curious]-style emotion tags. Author observation, unverified: sentence-initial Well, is often read with a content-word-like fall.
  • Gemini 2.5 Pro / 3.1 Flash TTS — accepts pause-duration markup, which can be used to insert the canonical micro-pause after Well,. Author observation, unverified.
  • NaturalSpeech 3 — factorized prosody attribute is the natural lever, but no public benchmark of DM-specific MOS.
  • SponTTS — closest to a DM-aware model; the "spontaneous phenomena embedding" learns DM prosody from training data implicitly.

None of the surveyed engines models the Heritage 2013 dispreferred-well pattern explicitly. Practical workaround: script-level pre-processor that inserts , [short pause] after DMs and lowercases them to discourage stress.
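
The workaround can be sketched as a small regex pass. Assumptions are loud here: the marker list is an illustrative subset of this axis's scope, the rewrite is restricted to sentence-initial markers followed by a comma (to avoid touching content-word uses such as "it went well, I think"), and `[short pause]` is the tag named above; whether a given engine honors it is untested.

```python
import re

# Illustrative subset of the discourse markers in scope for this axis.
DISCOURSE_MARKERS = ("well", "so", "okay", "right", "look", "anyway", "actually", "oh", "yeah")

# Match a marker at the start of the text, a line, or a sentence,
# when it is followed by a comma.
_SENTENCE_INITIAL_DM = re.compile(
    r"(^|[.!?]\s+)(" + "|".join(DISCOURSE_MARKERS) + r")\s*,",
    re.IGNORECASE | re.MULTILINE,
)


def mark_discourse_markers(script: str, pause_tag: str = "[short pause]") -> str:
    """Lowercase sentence-initial DMs and insert a pause tag after the comma."""

    def _rewrite(match: re.Match) -> str:
        prefix, marker = match.group(1), match.group(2)
        return f"{prefix}{marker.lower()}, {pause_tag}"

    return _SENTENCE_INITIAL_DM.sub(_rewrite, script)


print(mark_discourse_markers("It went well, I think. Well, here's the thing."))
```

Only the second "well" is rewritten in this example, because the first is not sentence-initial. A production version would need a real sentence splitter and a check that the engine does not read the tag aloud.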

4. Are DMs a stronger lever than vowel elongation or filled pauses?

No direct comparison published. The previously included ranking "(1) filled pauses (2) DMs (3) vowel elongation" has been moved to open questions per review. As stated by the original author, no peer-reviewed comparison exists, and Schiffrin/Biber's register-signal claim does not by itself rank DMs against FPs on MOS uplift. The ranking is a hypothesis to test, not a finding.

5. Placement vs. count

Tentative consensus: placement dominates count. Aijmer's study of actually shows the same word in multiple prosodic patterns mapped to different functions, so count is uninformative without position + prosody. Bolden's work on so shows turn-initial vs. utterance-medial so are functionally different items. TTS implication: doubling DM count without controlling placement is unlikely to improve naturalness and may degrade it. A script with well-placed DMs at turn / clause / contrast boundaries will likely out-MOS a script with randomly scattered DMs, though this is itself an A/B-testable hypothesis.

Open questions for this axis
  1. Are DMs a stronger naturalness lever than FPs? Run a SponTTS or NaturalSpeech 3 ablation that decomposes DM vs FP contributions. Currently unanswered by any published paper.
  2. Run a Gemini 3.1 Flash TTS A/B: same script with DMs lowercased + [short pause] after each, versus DMs left as capitalized content words.
  3. Probe whether NaturalSpeech 3's prosody factor can be steered toward DM-low-flat prosody via prompt.
  4. Build a per-content-type DM-rate target measured from actual corpora (not inferred from register-effect literature) and lint scripts against it before render.
  5. Pin the D'Arcy 2017 cross-dialect like decimal rates (0.49 / 1.51 / 2.18 / 2.23 / 4.38 ptw) to a specific page in the book; until then, treat as ordinal (British < Indian < NZ < Philippine < Canadian).
  6. Get a Switchboard / COCA-spoken DM base rate for American English; BNC2014 alone is British-biased.
  7. Re-verify Heeman & Allen 1999 actual base-rate (if any) for DM-prefaced turns in Trains.

4.5 Conversational fillers

Scope: inline like, you know, I mean, sort of, kind of, I guess, basically, literally. Excludes filled pauses (uh/um — Axis 03) and discourse-structuring markers (so, well, yeah, oh — Axis 04). Quotative be like is included where the literature treats it inseparably from focuser like.

Conventions: [citation] = grounded in a source identified below; [inferred] = my own synthesis or operational rule of thumb that is not attested in any one paper; [author opinion] = explicitly the author's own operational rule, NOT a research-derived heuristic; [~] = approximate.

What the literature says

1. Base rates across registers

The single hardest empirical claim to pin down is "how often do real people actually say like / you know / I mean per 100 words?"

Spoken BNC2014 [Love et al. 2017] and BNC1994 [Torgersen et al. 2011] show you know and I mean in adult British conversation on the order of a few hundred to low thousands of tokens per million words. The v1 conversion from "low hundreds per million" to "0.5–1.5 / 100w" was arithmetically wrong: low hundreds per million is on the order of 0.02–0.05 per 100 words, not 0.5–1.5. A rate of 0.5–1.5/100w corresponds to roughly 5,000–15,000 tokens per million, which is more consistent with broader-scope counts that include all functions of you know. The v2 working number is therefore on the order of 0.05–0.5 per 100 words (500–5,000 per million) for `you know` in adult British conversation, depending on function inclusion, with I mean somewhat lower. The wider 0.5–1.5 range from v1 should be treated as a cross-genre upper bound, not an adult-BNC figure.
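
Because the v1 error was a unit conversion, it helps to make the conversion explicit. A minimal sketch, with illustrative function names: one token per 100 words is 10,000 tokens per million words, so the two units differ by a factor of 10,000.

```python
def per_million_to_per_100_words(tokens_per_million: float) -> float:
    """Convert a per-million-words rate to a per-100-words rate."""
    return tokens_per_million / 10_000


def per_100_words_to_per_million(rate_per_100_words: float) -> float:
    """Convert a per-100-words rate to a per-million-words rate."""
    return rate_per_100_words * 10_000


# "Low hundreds per million", illustrated with 200 and 500.
for per_million in (200, 500):
    print(per_million, "per million =", per_million_to_per_100_words(per_million), "per 100 words")

# The v1 range of 0.5-1.5 per 100 words, expressed per million.
for per_100 in (0.5, 1.5):
    print(per_100, "per 100 words =", per_100_words_to_per_million(per_100), "per million")
```

Running both directions before quoting a rate is a cheap guard against mixing the two units again.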

COLT [Stenström et al. 2002] shows London teenagers running materially higher rates for like specifically, with like becoming the single most frequent pragmatic marker in adolescent talk. The "2–3× the adult rate" framing in v1 conflated a generational change (real-time like increase 1990s→2010s) with a corpus-comparison artefact; treated here as directional only.

Switchboard-style American conversation: no direct citation from the literature gives a clean per-100w rate, so the v1 "~1–2 per 100 words" was an unsupported inference and is withdrawn in v2 pending empirical anchor (likely Switchboard / Fisher analyses from Stolcke et al. lineage). Müller's headline finding that American L1 speakers use you know roughly five times as often as German L2 learners [citation] confirms you know density as a strong native-likeness signal, directionally.

The Channell 1994 [citation] vague-language taxonomy clarifies why these rates are so context-sensitive: hedges function as politeness shields, downtoners, and category identifiers, and their density tracks face-management needs rather than cognitive load.

2. Does adding 2–3 per 100 words help naturalness MOS?

The TTS evaluation literature gives a directionally positive but qualified answer.

The Estonian podcast-TTS study [Mihkla et al. 2023] finds that perceived spontaneity rises with filled pauses and disfluencies. The disfluency-insertion paper [Hassan, Lison & Halvorsen 2024, arXiv:2412.12710] finds a significant spontaneity gain accompanied by a slight reduction in intelligibility. The controllable spontaneous-TTS work [Zhao et al. 2407.13509] treats disfluency density as a control parameter, implying a sweet spot rather than a monotone gain.

The sociolinguistic side gives a matching warning. Hesson & Shellgren's (2015) real-time rating method [American Speech 90(2):154–186] shows that each token of like immediately depresses both friendliness and competence ratings. The friendliness effect dissipates; the competence penalty persists. The "and grows" intensification used in v1 was the stronger of two secondary-source readings; the paper text supports "persists" cleanly but does not unambiguously support "grows", so the claim is hedged here. Dailey-O'Cain 2000 is consistent: high-like speech rates higher on solidarity, lower on status.

The v1 "do not exceed ~5/100w soft cap" was tagged [inferred] but presented as a research-derived rule. In v2 it is explicitly [author opinion]: no published paper supports a specific numeric threshold. The directional point — hedge density and competence ratings trade off — is supported; the specific 5/100w number is the author's operational call, not a finding.

3. Per-content-type taste profile [author opinion]

All five recommendations below are [author opinion] based on extrapolation from sociolinguistic register effects, not from TTS evaluation papers that measured filler density × content type. They are reasonable defaults for A/B testing, not research-derived guidance.

  • Meditation / guided breathwork: avoid entirely. [author opinion]
  • Podcast (interview / explanatory): moderate, ~1–2/100w as a starting point matching what podcast corpora actually do [directional, Mihkla et al. 2023; Székely et al. 2019].
  • Comedy / monologue: embrace. Stand-up routinely runs >4/100w in reference corpora. [author opinion + Tagliamonte 2005 directional]
  • Deep-dive / explainer / educational: sparingly. Heavy hedging trades off against perceived authority. [author opinion]
  • Lyric / song: lyric conventions are metrical and use hedges sparingly, but hip-hop, R&B, country, indie do include I guess, you know, like. Not a closed category — underexplored. [author opinion, downgraded from "not applicable" in v1]

4. Which TTS engines render "like, you know" with the right hesitant prosody?

The natural realisation of hedges-as-fillers is a reduced form with lower pitch, a slightly extended preceding pause, and downward F0 on the marker itself; this is the prosody documented for uh/um in Clark & Fox Tree 2002 and Fox Tree 2001. It is reasonable to extrapolate this prosody to like/you know/I mean when produced as fillers, but direct phonetic measurement for these multi-word forms is sparse.

Practical engine-level guidance for our pipeline: pick filler placement and density carefully so the underlying engine's read does the least damage. Prefer phrase-medial, between intonational phrases. [author opinion]

Open questions for this axis
  1. Are there published per-100w rates for basically and literally? The intensifier-discourse-marker boundary is fuzzy.
  2. Does any TTS evaluation paper directly measure listener tolerance as a function of filler density? Everything cited is binary (with vs without fillers) rather than dose-response.
  3. Empirical test inside our own corpus: count fillers per 100 words in shipped scripts vs COLT and BNC2014 reference rates by content type.
  4. Run an evaluation of ElevenLabs v3 / OpenAI gpt-4o-audio / Inworld TTS-1.5 / Cartesia Sonic on filler prosody (axis 04 / 05 intersection) — this would let us replace the v1 blanket "no production engine" claim with a specific scoring.
  5. Resolve the Hesson & Shellgren 2015 "intelligence penalty grows" ambiguity against the primary text.
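
Open question 3 (counting fillers per 100 words in shipped scripts) can be prototyped with a simple counter. This is a sketch, not a validated measurement tool: the filler list follows the scope statement for this axis, the function name is illustrative, and the counter cannot separate filler like from verb or preposition like, so the like rate is an upper bound.

```python
import re

# Fillers in scope for this axis (see the scope statement above).
FILLERS = ("you know", "i mean", "sort of", "kind of", "i guess", "basically", "literally", "like")


def filler_rates_per_100_words(script: str) -> dict[str, float]:
    """Return each filler's rate per 100 words of the script."""
    text = script.lower()
    word_count = len(re.findall(r"[a-z']+", text))
    rates: dict[str, float] = {}
    for filler in FILLERS:
        if word_count == 0:
            rates[filler] = 0.0
            continue
        # Allow any whitespace between the words of a multi-word filler.
        pattern = r"\b" + r"\s+".join(re.escape(part) for part in filler.split()) + r"\b"
        rates[filler] = 100 * len(re.findall(pattern, text)) / word_count
    return rates


rates = filler_rates_per_100_words("I mean, it was, like, kind of great, you know.")
for filler, rate in rates.items():
    print(f"{filler}: {rate:.1f} per 100 words")
```

The ten-word sample is far too short for a stable rate; real comparisons against COLT or BNC2014 would need whole scripts and function-level annotation.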

4.6 Mid-thought pivots

Scope. Em-dash false starts: "and so — and so the reason is…", "I was — well, what happened was…". Also called self-repairs, false starts, restarts, truncated utterances, editing terms, fluency breaks. The canonical structural model is Levelt (1983, Cognition 14, 41–104): reparandum → interruption point → editing phase (optional editing term) → repair.

Why this axis matters for TTS. A mid-thought pivot is not just a comma or a pause. It is (1) a cutoff — often mid-word — with a particular pre-cutoff intonation truncation; (2) a small hesitation window (silent or filled); and (3) a restart whose pitch is usually reset and (conditionally) prosodically marked. Reading "I was — well, what happened was…" as a single fluent clause with a pause where the dash is destroys the construction. The defining cue is the cutoff contour, not the dash glyph.

4.7 Emotion / delivery tags

Inline annotation tags such as [curious], [excited], [whispers], [shouts], [laughs], [sighs], [sarcastic], [speaking slowly], [robotic] used as in-prompt controls for expressive TTS. Covers closed-stack offerings (ElevenLabs v3, Gemini 2.5 Pro/Flash TTS, OpenAI gpt-4o-mini-tts, Sesame CSM) and the open-stack lineage (GST → PromptTTS → InstructTTS → StyleTTS 2 → NaturalSpeech 3 → CosyVoice 2 → EmoVoice → EmoSphere++).

Scope of "tag" in this axis is broad: it includes (a) bracketed inline markers consumed mid-utterance, (b) sentence-level natural-language style prompts ("calm, professional documentary"), and (c) categorical emotion conditioning vectors that papers expose to users as discrete labels. Free-form audio reference ("zero-shot voice cloning + style") is treated as a contrast condition, not a tag.

4.8 Pacing tags

Scope. Inline pacing/silent-pause markup: [short pause], [medium pause], [long pause], [PAUSE=2s], [very fast], [very slow], [speaking slowly], plus SSML <break time="500ms"/>. This axis covers silent (unfilled) pauses and global tempo control. Filled pauses (the lexical uh/um family) are explicitly out of scope — they live on Axis 03.

Why it gets its own axis. Silent pauses are physically the simplest paralinguistic phenomenon — they are literally zero acoustic content — yet they carry an outsized share of perceived naturalness. The phonetics literature has 60+ years of distributional norms; the TTS literature has the most asymmetric "easy to mark up, hard to honor" gap of any axis in this report.

4.9 In-text emphasis

Scope. ALL-CAPS words ("I REALLY mean it"), *asterisks*, bold, _italics_, and the prosodic prominence those typographic devices are intended to encode. Distinct from:

  • Axis 07 (emotion / delivery tags) — those are full-phrase delivery directives like [whispers].
  • Axis 01 (vowel elongation) — orthographic lengthening of a vowel, not selection of a word for prominence.

In speech, the analog of in-text emphasis is the nuclear pitch accent and, more narrowly, contrastive / narrow focus marking — a word singled out for prominence via expanded F0, longer duration, greater intensity, and post-focus compression on the words that follow (Xu 1999; Xu & Xu 2005).

4.10 Audible reactions

Scope. Inline lexicalised reaction tokens — oh!, wow!, huh, hm, ah, oof, ugh, yikes, aw, whoa, hooray, pfff, tsk, etc. — that punctuate conversational speech with an externalised display of an inner state.

What this axis is NOT.

  • Not bracketed non-verbal events [laughs], [sighs], [gasp] — those are axis 2.
  • Not full delivery directives [curious], [whispering] — axis 7.
  • Not turn-management discourse markers so, well, like — axis 4 / 5 (though oh straddles axis 4 and axis 10; see below).
  • Overlaps with axis 1 (vowel elongation): ohhhhh is simultaneously a response cry and an elongated vowel.

Coining and theoretical anchor: Goffman (1978, "Response cries", Language 54:4) — vocalisations produced "at" an event rather than at an interlocutor, socially licensed as displays of inner state. Goffman argued response cries are "not statements in the linguistic sense" (1978: 800).

5. Per-content-type scripting playbooks

Five canonical content registers. The axis taxonomy applies non-uniformly across them — meditation lives on pacing + delivery tags; comedy lives on emphasis + pivots; lyrics invert the entire taxonomy.

Per-content-type paralinguistic profiles

Scores are curated 0-5 priorities, not measured corpus frequencies.

[Radar plots, one panel per content type: Meditation, Podcast, Deep-dive, Comedy-news, Lyrics. Each panel uses the same ten axes: vowel elongation, non-verbal vocalizations, filled pauses, discourse markers, conversational fillers, mid-thought pivots, emotion / delivery, pacing tags, in-text emphasis, audible reactions.]
Figure 2. Per-content-type axis profile, interactive radar plots showing how the 10 paralinguistic axes shift across registers.

5.1 Meditation

Scope: guided meditations, body scans, breathwork, yoga nidra, and sleep stories as produced by Calm, Headspace, Waking Up, Insight Timer, Tara Brach, Jon Kabat-Zinn, Wim Hof, Richard Miller (iRest), and the "Nothing Much Happens" sleep-story tradition.

Key takeaways

Meditation is the slowest, sparsest, most pause-dominated of all spoken-content registers. The dominant paralinguistic levers are not vocal performance (laughs, fillers, mid-thought pivots) but silence engineering plus delivery tagging ([whispers], [speaking slowly], [gentle]). On the page, scripts look almost telegraphic: short clauses, present participles ("noticing… softening… arriving"), 2nd-person imperatives ("notice", "feel", "allow"), and heavy use of "just", "simply", and the definite article ("the breath", "the body").

The target rate is 60–100 words per minute of active speech, but the effective rate over total runtime is much lower because 30–60% of the runtime is silence [author's estimate; every sub-genre row in §"Pause distribution" was originally tagged [inferred], so this is a hypothesis-grade headline, not a measured one]. Calm sleep stories cluster around 90 wpm per brand-blog content; Headspace Sleepcast narrators are described as "about 60% of average talking speed" [citation].

For TTS specifically: pacing tags dominate. The observed Gemini 2.5 Pro vs 3.1 Flash 2.8× duration delta on the same script (10:46 vs 3:49) is N=1 — one script, no controlled inputs. It is the headline observed data point for this project's TTS audit, but should not be propagated as a robust model characterization.

This audit is a register family description — Calm, Headspace, Waking Up, Tara Brach, iRest, and Wim Hof are treated as one cluster, but their per-sub-genre WPM ranges (60–110 wpm) and silence ratios (15–70%) actually span 5 sub-registers. The TL;DR previously elided that internal variability.

Per-axis profile

| # | Axis | Frequency | Characteristic pattern in meditation |
| --- | --- | --- | --- |
| 1 | Vowel elongation | Low–Medium | Rare in prestige meditation (Headspace, Waking Up). More common in sleep stories and ASMR-adjacent content as soothing "soooo soft… slooowly". |
| 2 | Non-verbal vocalizations | Low | No laughter. Rare "mmm" of acknowledgment. Deliberate audible breath (especially in breathwork) is the one common exception. |
| 3 | Filled pauses ("uh", "um") | Very low | Actively avoided. Headspace, Calm, Waking Up all produce broadcast-polished narration; disfluency reads as unprofessional. [inferred from corpus & app QA standards] |
| 4 | Discourse markers | High (but a narrow set) | "Now", "And", "So", "Just", "Simply" dominate as transitions between micro-instructions. Demjén 2024's 20-session introductory-course transcript (10,841 words from two tutors) shows "just" used 4× more than reference corpora; this is the narrow corpus, not "the Headspace corpus." Sam Harris repeats "simply" and "just"; this is sourced to a single Till Gebel line-by-line analysis of one meditation (N=1), not a Harris-wide measurement. |
| 5 | Conversational fillers ("you know", "like", "I mean") | Very low | Almost absent. |
| 6 | Mid-thought pivots / self-correction | Very low | Scripts are pre-written and read; pivots break the trance frame. |
| 7 | Emotion / delivery tags | Very high (in TTS pipelines) | [whispers], [gentle], [soft], [calm], [slowly], [warm] are the workhorses. ElevenLabs v3 supports [whispers] for meditation voices [citation]. |
| 8 | Pacing tags (pauses, rate control) | Very, very high (defining axis) | This is the single most important axis for meditation. Stillmind blog's pause convention: [pause] = 3–5s, [longer pause] = 10–15s, [silence — 1 minute] = extended. This is one blog's convention, not "the de facto authoring convention" as v1 framed it. Demjén 2024's 20-session corpus annotates three pause tiers: (<3s), <pause> (3–10s), <silence> (>10s) [citation]. |
| 9 | In-text emphasis (CAPS, italics, bold) | Low | Rare. Meditation prefers de-emphasis. |
| 10 | Audible reactions ("mmm", soft sighs) | Low | Soft "mmm" acknowledgments appear occasionally in Tara Brach / iRest style work. |

Net shape of the register: axes 7 and 8 (delivery + pacing tags) are hypothesized to carry most of the load. This is asserted, not measured. No axis-density baseline is computed here.

Pacing + rhythm characteristics

Speaking rate

| Sub-genre | Target rate (active speech) | Notes |
| --- | --- | --- |
| Guided meditation (Headspace/Calm/Waking Up) | 90–110 wpm | DIY guides commonly target ~370 words for a 5-min script ≈ 74 wpm including pauses [previously cited to SparkPod, which is not in the reference list; UNVERIFIED; claim removed pending primary source]. |
| Body scan (Kabat-Zinn / MBSR) | 80–100 wpm | Long body scan often runs 30–45 min for a script of 3–5k words. [inferred] |
| Sleep stories (Calm, Nothing Much Happens) | ~90 wpm | Brand and SEO content (Calm blog + Vocallab); no measured corpus cited. |
| Sleepcasts (Headspace) | ~60% of average talking speed | [reviewer impression, not measurement: Reviewed.com] |
| Yoga nidra (iRest) | 70–90 wpm | [citation] |
| Breathwork (Wim Hof) | N/A (paced by breath cycle) | Voice cadence locks to inhale/exhale. The "~1 breath per 1–2 seconds" figure previously tagged `[citation]` is not in Wim Hof's official guidance, which prescribes "30–40 deep breaths per round" without a target rate. Number was a back-calculated inference and the misleading citation tag has been removed. |

Pause distribution

The Demjén 2024 corpus (20 sessions × ~6:30 each = 10,841 words from two tutors in one Headspace introductory course) annotated three pause tiers — under 3s, 3–10s, and >10s [citation]. This three-tier scheme partially overlaps Stillmind's authoring convention (3–5s / 10–15s / 1 min) but they are not the same scheme.
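
The partial overlap between the two schemes is easier to see when durations are bucketed. The sketch below is illustrative only: the tier labels and dictionary keys are made-up names, not Demjén's annotation tags, and the boundary handling (exactly 3s and 10s fall in the middle tier) is read off the "<3s / 3–10s / >10s" wording.

```python
def demjen_tier(pause_seconds: float) -> str:
    """Bucket a pause into the three tiers: under 3s, 3-10s, over 10s."""
    if pause_seconds < 0:
        raise ValueError("pause_seconds must be non-negative")
    if pause_seconds < 3:
        return "under_3s"
    if pause_seconds <= 10:
        return "3_to_10s"
    return "over_10s"


# Stillmind's convention as quoted above, as (shortest, longest) seconds.
STILLMIND_RANGES = {
    "pause": (3, 5),
    "longer pause": (10, 15),
    "one-minute silence": (60, 60),
}

for name, (shortest, longest) in STILLMIND_RANGES.items():
    print(f"{name}: {demjen_tier(shortest)} to {demjen_tier(longest)}")
```

Under these assumptions Stillmind's "longer pause" straddles two Demjén tiers, which is one concrete sense in which the schemes overlap without matching.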

Silence-to-speech ratio — every cell below is the author's estimate, not a measured ratio:

  • Guided meditation: 30–50% silence by runtime. [inferred]
  • Body scan: 40–60% silence (longer dwell at each region). [inferred]
  • Sleep stories: 15–25% silence. [inferred]
  • Yoga nidra: 50–70% silence. [inferred]
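
These ratios combine with the active-speech rate to give a runtime budget for a script. A minimal sketch, assuming silence is expressed as a fraction of total runtime (as in the list above); the function names and the 600-word example are illustrative, and the inputs are the author's estimates, not measurements.

```python
def total_runtime_minutes(word_count: int, active_wpm: float, silence_fraction: float) -> float:
    """Total runtime when silence takes a given fraction of that runtime."""
    if active_wpm <= 0:
        raise ValueError("active_wpm must be positive")
    if not 0 <= silence_fraction < 1:
        raise ValueError("silence_fraction must be in [0, 1)")
    speech_minutes = word_count / active_wpm
    return speech_minutes / (1 - silence_fraction)


def effective_wpm(active_wpm: float, silence_fraction: float) -> float:
    """Words per minute averaged over total runtime, silence included."""
    if not 0 <= silence_fraction < 1:
        raise ValueError("silence_fraction must be in [0, 1)")
    return active_wpm * (1 - silence_fraction)


# Hypothetical guided-meditation script: 600 words, 100 wpm active, 40% silence.
runtime = total_runtime_minutes(600, 100, 0.40)
print(f"runtime: {runtime:.1f} min, effective rate: {effective_wpm(100, 0.40):.0f} wpm")
```

The same two functions can be used in reverse as a sanity check: if a rendered file is much shorter than the budget, the engine is probably under-honoring pauses.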

Sentence and clause shape
  • Short clauses joined by "and": "And notice the breath. And as you breathe in… and as you breathe out…"
  • Sentence-initial gerunds without auxiliaries ("Noticing the rise of the breath. Allowing it to fall.") — the Demjén 2024 corpus found progressive participle forms over-represented vs reference corpora as a keyword by log-likelihood; the "4× more frequent" precise ratio is not directly extractable from the paper's keyword table at the precision I can verify in this pass and has been demoted to "statistically over-represented (keyword LL significant)."
  • Heavy use of the definite article ("the breath", "the body", "the thoughts") as a depersonalizing/detachment device [citation].

Vocabulary + register characteristics

Imperative verbs (the workhorse class)

Across corpora, the high-frequency imperatives are: notice, feel, observe, allow, let, breathe, relax, soften, settle, rest, bring, return, drop, sense, become aware of.

  • Kabat-Zinn body scan literal text uses: "tuning in", "opening to", "allowing ourselves to become aware", "attending to", "moving our attention through" [direct quote: Palouse Mindfulness PDF — downstream excerpt of Coming to Our Senses (Kabat-Zinn 1990 / 2005)].
  • Sam Harris characteristic phrasing observed in one Till Gebel line-by-line analysis (N=1 meditation): "Simply rest as that", "just observe", "Notice that you're distracted", "Drop back and recognize." Not a Harris-wide measurement.

Hedging / non-striving softeners

The register softens every directive to preserve non-judgmental framing:

  • "See if you can notice…" (rather than "notice")
  • "Maybe allow…" / "If you like, bring attention to…"
  • "Just notice whatever arises"
  • "Without trying to change anything…"

Person and address
  • 2nd person singular, present tense, throughout. Almost never 1st person ("I"); occasionally 1st person plural ("we move now to the shoulders") — common in Kabat-Zinn's discursive style. Sleep stories shift to 3rd person narrated character the listener inhabits.

Somatic and sensory noun preference

Body parts (breath, shoulders, jaw, belly, feet, hands), sensations (warmth, tingling, weight, softness, tension, pulsing). Kabat-Zinn's appendix publishes a controlled sensation vocabulary [direct quote: Palouse PDF].

Jargon load

Minimal. Headspace tone guidelines explicitly avoid jargon and idioms [citation].

Tense and aspect
  • Heavy present tense and present participle.
  • Almost no past tense. Sleep stories use simple past for narrative.
  • Modal verbs lean permissive — same permissive pattern documented in Erickson's Hypnotherapy: An Exploratory Casebook (1979) and downstream in the Havens & Walters / Jones script literature. v1 cited the derivative script collections only; the primary Erickson source is added here.

TTS-engine fit

Meditation is likely the content type where TTS engine choice matters most because pacing tags and silence durations make or break the experience — though no head-to-head measurement against other content types is performed in this audit.

Observed behavior in this project (N=1 caveat throughout)
  • On one meditation script, Gemini 2.5 Pro TTS rendered 10:46 and Gemini 3.1 Flash TTS rendered 3:49, a 2.8× duration delta. N=1; one script; no controlled inputs. The Pro model interpreted pacing tags / contextual pacing far more aggressively. For meditation, the 10:46 output is closer to the intended deliverable; the 3:49 output is unusable.
  • Implication (provisional, pending replication): When benchmarking TTS engines for meditation, duration may itself be a quality metric.
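The duration comparison above is simple arithmetic, sketched here so it can be repeated on further scripts. The helper names `to_seconds` and `duration_ratio` are illustrative and not part of any project tooling.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' or 'MM:SS' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def duration_ratio(longer: str, shorter: str) -> float:
    """Ratio between two rendered durations given as 'MM:SS' strings."""
    return to_seconds(longer) / to_seconds(shorter)


# Durations reported above for the single meditation script (N=1).
ratio = duration_ratio("10:46", "3:49")
print(f"{ratio:.1f}x")
```

With more than one script, the same ratio could be tracked per script rather than as a single headline number.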
Engine-by-engine summary (anecdotal + cited)
| Engine | Meditation fit | Notes |
| --- | --- | --- |
| Gemini 2.5 Pro TTS | Best for length / dwell (N=1 script) | Context-aware pacing; sometimes too aggressive. Risk: voice drift / accent drift [citation]. |
| Gemini 3.1 Flash TTS | Mid (N=1 script) | Faster, cheaper, but under-paces meditation. Needs explicit [pause] injection. |
| ElevenLabs v3 | Strong | [whispers], [soft], [calm] tags work well; <break time="…"/> is reliable. "Erin" cited as best meditation voice. Risk: tag effectiveness varies by voice [citation]. |
| Azure Dragon HD / Neural TTS | Solid (workhorse) | Best SSML compliance. Less expressive than Gemini/Eleven. |
| Google Cloud TTS (Chirp 3 HD) | OK | Per `~/.claude/CLAUDE.md`, use Vertex AI generative endpoint, never `texttospeech_v1beta1`. |
| OpenAI gpt-4o-mini-tts / TTS-1-HD | Mid | Good polish, weaker on long-form pacing dwell. |
| Amazon Polly Generative | Mid | Reliable but baseline. |
TTS authoring recipe for meditation (synthesis)
  1. Target 80–100 wpm of active speech.
  2. Insert explicit pauses on a 3-tier scheme: <break time="2s"/> after each instruction, <break time="6s"/> between phrases of an exercise, <break time="20s"/> for dwell sections. Note: this recipe is a fourth scheme (author's own) and does not match Demjén 2024's three-tier (<3s / 3–10s / >10s) or Stillmind's (3–5s / 10–15s / 1 min). Reconcile or pick one before publishing audio.
  3. Bracket sections with delivery tags: [gentle][slowly]…[/slowly][/gentle] or model-specific equivalents.
  4. Keep clauses ≤ 8–10 words; favor present-participle openers ("Noticing… Allowing… Softening…").
  5. Use "just", "simply", "perhaps", "see if you can" as low-pressure softeners.
  6. For sleep stories, switch to 3rd-person narrative tense but keep the same pause cadence.
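The pause tiers in step 2 can be applied mechanically. The sketch below is one way to do that, assuming each script line is hand-labelled with a tier. The tier names (`instruction`, `phrase`, `dwell`) and the function names are illustrative. The durations are the recipe's own 2 s / 6 s / 20 s, not Demjén's or Stillmind's.

```python
# Pause tiers from the recipe above (the author's own scheme), in seconds.
PAUSE_SECONDS = {"instruction": 2, "phrase": 6, "dwell": 20}


def break_tag(tier: str) -> str:
    """Return an SSML-style break tag for one of the recipe's three tiers."""
    return f'<break time="{PAUSE_SECONDS[tier]}s"/>'


def render_script(lines: list[tuple[str, str]]) -> str:
    """Join (text, tier) pairs, placing the tier's break after each line."""
    return "\n".join(f"{text} {break_tag(tier)}" for text, tier in lines)


def total_pause_seconds(lines: list[tuple[str, str]]) -> int:
    """Total marked silence for a list of (text, tier) pairs."""
    return sum(PAUSE_SECONDS[tier] for _, tier in lines)


# Hypothetical three-line example; the wording is illustrative only.
example = [
    ("Noticing the breath.", "instruction"),
    ("Allowing the shoulders to soften.", "phrase"),
    ("Just resting here.", "dwell"),
]
print(render_script(example))
print(total_pause_seconds(example), "seconds of marked silence")
```

Keeping the tier table in one dictionary means a switch to another published scheme is a one-line change.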

Sample script archetype (annotated)

The example below is a synthesized 2-minute opening that combines patterns from Kabat-Zinn, Headspace, and Waking Up. Annotations in <-- ... --> show which patterns are firing.

```
[gentle][slowly]                    <-- delivery tag (axis 7), pacing tag (axis 8)

Welcome.                            <-- one-word opener, common in Calm/Headspace
<break time="3s"/>                  <-- 3s pause (author's scheme, see fix note)

Just take a moment to settle in.    <-- "just" softener (Demjén 2024 keyword)
<break time="4s"/>

And as you breathe in…              <-- "and as" discourse chaining
<break time="2s"/>
… and as you breathe out…           <-- ellipsis = short pause (Demjén '…' tier)
<break time="3s"/>

Allowing the body to soften.        <-- sentence-initial gerund (keyword over-rep)
<break time="2s"/>
Allowing the shoulders to drop.     <-- parallel structure
<break time="2s"/>
Allowing the jaw to release.        <-- somatic noun + permissive verb
<break time="5s"/>

Notice the breath.                  <-- imperative + definite article "the breath"
<break time="3s"/>
Not changing it.                    <-- negation of striving (MBSR languaging)
<break time="2s"/>
Just noticing.                      <-- "just + gerund"
<break time="8s"/>                  <-- <pause> tier (3–10s, Demjén)

If the mind wanders, that's okay.   <-- hedged permissive (Ericksonian)
<break time="2s"/>
Simply return.                      <-- "simply" effortlessness marker
<break time="2s"/>
Return to the breath.               <-- repetition + "the breath"
<break time="15s"/>                 <-- <silence> tier (>10s, Demjén)

[/slowly][/gentle]
```

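Total spoken words: ~53. Marked pauses sum to 53 s; at the recipe's 80–100 wpm, speech adds roughly 32–40 s, for a runtime of about 1:25–1:35 before any pauses the engine adds on its own. Silence ratio: ~57–63% of wall-clock time is silence, within the typical meditation envelope.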
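The runtime and silence-ratio figures can be recomputed from the break tags. The sketch below tallies the archetype as printed: 13 break tags, and 53 spoken words by a whitespace count that excludes tags and annotations. It ignores any pauses an engine inserts on its own at ellipses or sentence ends, so real renders may run longer. The function names are illustrative.

```python
import re

BREAK_TAG = re.compile(r'<break time="(\d+(?:\.\d+)?)s"/>')


def total_break_seconds(script: str) -> float:
    """Sum the durations of all <break time="Ns"/> tags in a script."""
    return sum(float(s) for s in BREAK_TAG.findall(script))


def silence_ratio(pause_s: float, words: int, wpm: float) -> float:
    """Share of wall-clock time that is marked silence at a given speech rate."""
    speech_s = words / wpm * 60
    return pause_s / (pause_s + speech_s)


# Break durations (seconds) in the order they appear in the archetype.
ARCHETYPE_BREAKS = [3, 4, 2, 3, 2, 2, 5, 3, 2, 8, 2, 2, 15]
ARCHETYPE_WORDS = 53  # spoken words only; tags and annotations excluded

pause_s = sum(ARCHETYPE_BREAKS)
for wpm in (80, 100):  # the recipe's target range for active speech
    runtime_s = pause_s + ARCHETYPE_WORDS / wpm * 60
    ratio = silence_ratio(pause_s, ARCHETYPE_WORDS, wpm)
    print(f"{wpm} wpm: runtime {runtime_s:.0f}s, silence {ratio:.0%}")
```

`total_break_seconds` can be pointed at a full script file to get the same tally without listing the durations by hand.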

5.2 Podcast

Scope. This audit covers the "naturalness-canonical" core of the podcast genre — shows whose appeal depends on sounding like real, unscripted speech rather than broadcast announcement. Reference shows: Joe Rogan Experience, This American Life, Radiolab, Serial, Conan O'Brien Needs A Friend, SmartLess, Armchair Expert, The Daily (NYT), Lex Fridman, Huberman Lab, Stuff You Should Know, How I Built This, On Being (Krista Tippett), Hardcore History (Dan Carlin, solo monologue). Both monologue and dialogue (interview, panel) podcasts are covered.

Conventions: [citation] = supported by a referenced source; [inferred] = my synthesis or extrapolation; [Lissin] = measured on our own annotated corpus at tts_corpus_v1/corpus/annotated/.

1. Per-axis usage profile

Each axis is rated low / med / high for frequency and described in terms of characteristic pattern. Frequency ratings are calibrated against the average podcast-content distribution observed in the Spotify Podcast Dataset (100K episodes, ~47K hours of ASR) [1] and PodcastFillers (199 episodes, 85,803 manually annotated paralinguistic events, ~145 h) [2][3].

Caveat: "podcast" sub-genres (Rogan / SmartLess / Daily / Serial / Up First) have disfluency budgets that span roughly three orders of magnitude. The per-axis ratings below implicitly admit this (e.g., filler density HIGH for interview/chat, LOW-MED for edited) but read as if it's one register; it isn't.

1.1 Vowel elongation — MED
  • Used as pragmatic intensifier ("sooooo good", "waaaay too long", "noooo way") — English speakers reliably produce emphatic vowel lengthening to express degree of belief or emotion [4][5].
  • In [Lissin] 04_smartless.json, the verbatim CrisperWhisper transcript contains multiple consecutive elongations in a single 5-minute clip: noooooo, oneeeeee moreeeeee, IIIIII gave, hoooooow, liviiiing, closeeeeee, Yeeeeees — i.e., roughly 1 elongation every 10-15 seconds in the high-energy stand-up segment.
  • Lower density in news-talk podcasts ([Lissin] 01_npr_up_first).
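Elongations like those quoted from the SmartLess transcript can be counted with a simple pattern. The sketch below assumes an elongation is any word with a letter repeated three or more times in a row, which skips ordinary doubles such as "too" or "good". This is a heuristic, not the annotation rule used for the corpus.

```python
import re

# Three or more consecutive repeats of the same letter.
ELONGATION = re.compile(r"([A-Za-z])\1{2,}")
WORD = re.compile(r"[A-Za-z']+")


def find_elongations(transcript: str) -> list[str]:
    """Return the words in a transcript that contain an elongated letter run."""
    return [w for w in WORD.findall(transcript) if ELONGATION.search(w)]


sample = (
    "noooooo, oneeeeee moreeeeee, IIIIII gave, hoooooow, "
    "liviiiing, closeeeeee, Yeeeeees, too good"
)
hits = find_elongations(sample)
print(len(hits), hits)
```

Dividing the hit count by segment duration gives a seconds-per-elongation figure comparable to the one quoted above, provided the same segment boundaries are used.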
1.2 Non-verbal vocalizations — MED-HIGH (dialogue) / LOW-MED (monologue)
  • PodcastFillers annotates 6,623 laughter events and 8,288 breaths across 145 hours [2][3].
  • SmartLess explicitly leans on Sean Hayes's laugh as a "signature" [6].
  • Sighs, throat-clears, and lip-smacks act as authenticity markers. Production guides explicitly tell hosts to mark (laugh), (sigh), (breath) [7][8].
  • Monologue podcasts (Hardcore History, Huberman Lab solo episodes) suppress most non-verbals in post [9].
1.3 Filled pauses (uh, um, er) — HIGH (interview/chat) / LOW-MED (edited)
  • Switchboard (2.7M words of US English phone conversation) contains 79,623 filled pauses: 67,065 uh + 12,558 um [10]. Spontaneous English speech carries filled pauses at roughly 2–6 per 100 words [11][12].
  • Mark Liberman estimates um/uh "roughly every 60 words"; broader corpus estimates 6–10% of all spoken words [13]. This is closer to Goldman-Eisler (1968) than to the Babbel popularization.
  • Shriberg's Switchboard analysis: uh skews utterance-medial, um skews utterance-initial [14].
  • PodcastFillers: ~35,000 filled-pause events across 145 hours ⇒ ~241/hr ⇒ ~4/min. Note: averaging over the whole 145 h corpus mixes speech segments with non-speech segments; the "2–6 per 100 words" rule is for spontaneous speech only. The two denominators are not directly comparable.
  • Heavily edited podcasts strip ums; conversational shows leave them in. Olszewski 2412.12710 (verified) shows inserting disfluencies raises perceived spontaneity [15]; Sun et al. likewise [16].
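The rate conversions in this subsection can be checked in a few lines. The sketch below uses only the counts quoted above; the function names are illustrative. As the note on denominators says, the per-100-words and per-minute figures are not directly comparable.

```python
def per_100_words(events: int, words: float) -> float:
    """Events per 100 words of transcript."""
    return events / words * 100


def per_minute(events: int, hours: float) -> float:
    """Events per minute of audio, speech and non-speech alike."""
    return events / (hours * 60)


# Switchboard: 67,065 uh + 12,558 um over ~2.7M words.
switchboard_total = 67_065 + 12_558
switchboard_rate = per_100_words(switchboard_total, 2_700_000)

# PodcastFillers: ~35,000 filled-pause events over ~145 hours of audio.
podcastfillers_rate = per_minute(35_000, 145)

print(f"Switchboard: {switchboard_rate:.2f} per 100 words")
print(f"PodcastFillers: {podcastfillers_rate:.2f} per minute")
```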
1.4 Discourse markers (so, well, right, yeah, oh, huh, hmm) — HIGH
  • PodcastFillers' Agree class has 3,755 events; the artifact's gloss "(mostly yeah, right, mhm)" is the audit's interpretation — the dataset card calls it "Agree" without enumerating which tokens are in the class.
  • Backchannel timing in IFADV / GECO: median gap of ~0.1s or less after a syntactic completion; frequently overlapping [17][18]. Podcast interviews preserve this.
  • Stivers et al. (2009) is the canonical 200ms-gap reference for cross-cultural turn-taking and is the primary that the MDPI 2025 paper sits on top of.
  • Sentence-initial so is the modern English podcast opener [inferred].
1.5 Conversational fillers (like, you know, I mean, sort of) — HIGH
  • PodcastFillers tracks Like and You know as separate classes [2][3].
  • 2024 corpus-based analysis of like in podcasts [22].
  • HKCSE: er (48.5%), um, and you know are top three fillers [23].
  • In [Lissin] 04_smartless, single 5-min clip yields you know ≈12×, like (filler use) ≈8×, I mean ≈3× — ~5/min combined.
1.6 Mid-thought pivots (em-dash false starts) — MED
  • Shriberg (1994): disfluencies increase nonlinearly with utterance length, hitting ~50% probability at 10-13 words [14][24].
  • Production guides (Castos, Boomcaster) [7][8][25].
1.7 Emotion / delivery tags — MED
  • Implicit in human podcasting.
  • Explicit in TTS: ElevenLabs v3 [27][28][29]. Cartesia Sonic-3 [30][31].
1.8 Pacing tags — MED
  • Variable by show genre.
  • [Lissin] pause-ratio for podcast clips: 0.47 (smartless), 0.49 (reply_all), 0.54 (npr_up_first), 0.56 (the_read). Pause ratio ~0.5 vs read-speech audiobook ~0.2–0.3.
  • Articulation rate when speaking: 6.5–10 syl/sec across the four clips.
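The gap between a ~0.5 podcast pause ratio and a ~0.2–0.3 read-speech ratio can be turned into a silence budget. This is a minimal sketch, assuming pause ratio means silence divided by total clip duration. The corpus's exact definition is not restated in this section, so treat that as an assumption.

```python
def silence_for_ratio(speech_s: float, pause_ratio: float) -> float:
    """Seconds of silence that yield `pause_ratio` for `speech_s` seconds of speech.

    Assumes pause_ratio = silence / (silence + speech).
    """
    if not 0 <= pause_ratio < 1:
        raise ValueError("pause_ratio must be in [0, 1)")
    return speech_s * pause_ratio / (1 - pause_ratio)


def extra_silence(speech_s: float, current_ratio: float, target_ratio: float) -> float:
    """Additional silence needed to move from one pause ratio to another."""
    return silence_for_ratio(speech_s, target_ratio) - silence_for_ratio(
        speech_s, current_ratio
    )


# One minute of speech, moving from a read-speech ratio to a podcast-like one.
print(f"{extra_silence(60, 0.2, 0.5):.0f} s of extra silence")
```

The relationship is nonlinear: going from 0.2 to 0.5 takes far more added silence per second of speech than going from 0.0 to 0.2.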
1.9 In-text emphasis (CAPS, asterisks) — MED
  • Single most-stressed word convention; English emphatic stress acoustically realized as F0 peak + vowel lengthening + intensity boost [4][33].
  • ElevenLabs v3 and Gemini 2.5 Pro respond to caps/asterisks [27].
1.10 Audible reactions (oh!, wow, huh, hm) — MED-HIGH
  • PodcastFillers' Agree class captures most of these.
  • Schegloff: wow/really/huh are assessments, mhm/yeah are continuers [17][18][34].
  • Modern parasocial research [35][36][37].

2. Pacing characteristics

2.1 Target WPM
  • Industry consensus for professional podcasts: 150–170 WPM [38][39][40].
  • [Lissin] speech rate (syllables/sec) for podcast clips, converted as WPM = syl/s × 60 ÷ syl/word with a 1.4–1.5 syl/word factor (not the 1.3 factor used in v1, which under-counts syllables per word for spontaneous English and therefore over-estimates WPM, per Hosom 2009 / Yuan & Liberman 2006):
  • 01 NPR Up First: 4.55 syl/s × 60 ÷ ~1.45 syl/word ⇒ ~188 WPM during speech; with pause_ratio=0.544, effective rate ≈ 86 WPM over the whole minute.
  • 04 SmartLess: 4.58 syl/s × 60 ÷ ~1.45 ⇒ ~190 WPM articulated; pause_ratio=0.471 ⇒ effective ≈ 100 WPM.
  • 05 The Read: 2.88 syl/s × 60 ÷ ~1.45 ⇒ ~119 WPM articulated; pause_ratio=0.558 ⇒ effective ≈ 53 WPM.
  • 07 Reply All: 4.09 syl/s × 60 ÷ ~1.45 ⇒ ~169 WPM articulated; pause_ratio=0.492 ⇒ effective ≈ 86 WPM.
  • Takeaway: for three of the four clips, effective podcast WPM (speaking + pauses) clusters 80–110 (down from v1's 90–120 because of the factor correction) and articulated WPM clusters 170–200; The Read is the outlier at ~53 effective / ~119 articulated [[Lissin] + 38–40].
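The conversion that reproduces the per-clip figures above is WPM = syl/s × 60 ÷ syl/word, then scaled by (1 − pause_ratio) for the effective rate. The sketch below uses the four clips' reported values; the function names and the `CLIPS` dictionary are illustrative.

```python
SYL_PER_WORD = 1.45  # midpoint of the 1.4-1.5 factor used in this section

# (speech rate in syl/s, pause ratio) as reported for each clip.
CLIPS = {
    "01 NPR Up First": (4.55, 0.544),
    "04 SmartLess": (4.58, 0.471),
    "05 The Read": (2.88, 0.558),
    "07 Reply All": (4.09, 0.492),
}


def articulated_wpm(syl_per_s: float, syl_per_word: float = SYL_PER_WORD) -> float:
    """Words per minute while speaking."""
    return syl_per_s * 60 / syl_per_word


def effective_wpm(
    syl_per_s: float, pause_ratio: float, syl_per_word: float = SYL_PER_WORD
) -> float:
    """Words per minute over wall-clock time, pauses included."""
    return articulated_wpm(syl_per_s, syl_per_word) * (1 - pause_ratio)


for name, (rate, pause) in CLIPS.items():
    print(f"{name}: {articulated_wpm(rate):.0f} articulated, "
          f"{effective_wpm(rate, pause):.0f} effective")
```

Because the factor sits in the denominator, a smaller syl/word value such as v1's 1.3 yields a higher WPM for the same syllable rate.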
2.2 Turn-taking dynamics
  • Median inter-turn gap in conversational corpora is ~200 ms (Stivers et al. 2009, PNAS, is the primary source; v1 cited the MDPI 2025 popularization).
  • Political-interview corpus analysis: 3–4 overlaps per minute, mostly backchannel [41].
  • The "top-tier hosts cede the floor with backchannels alone ... one acoustic signature of 'good interviewer' perception" claim is removed — no citation backed the "good interviewer" inference.
2.3 The "podcast voice" intonation pattern
  • Hannah McGregor (SFU, 2022) — kitchen-table register [43][44].
  • "NPR voice" / "vocal fry" critique [45][46][47].
  • [Lissin] F0 means: SmartLess 180 Hz, NPR Up First 165 Hz, The Read 163 Hz, Reply All 130 Hz.

3. Vocabulary and register

  • Spotify/Taboada MDA: podcasts score high on "involved production" (Biber Dim. 1) [19][20].
  • Anecdote-driven [32][48].
  • Jargon-permissive with translation [26][49].
  • Conversational hedging [19][20].

4. Published linguistic research (key references)

  • Hannah McGregor (SFU), Podcast or Perish (2022) [43][44].
  • Ehret & Taboada (2024), Structural linguistic characteristics of podcasts. Register Studies [19][20].
  • Lindgren (Convergence, 2021), Aural Parasocial Relations [37].
  • Sharon & John (New Media & Society, 2024) [35].
  • Lichtenstein et al. (Media and Communication, 2024) [36].
  • Shriberg (1994 dissertation; Switchboard) [10][14][24].
  • Levelt (1989), *Speaking: From Intention to Articulation* — foundational model for why disfluencies happen at planning boundaries; Shriberg builds on Levelt.
  • Clark & Fox Tree (2002), "Using uh and um in spontaneous speaking", Cognition — the seminal argument that uh and um are words, not disfluencies. Critical to any TTS argument about inserting them.
  • Goldman-Eisler (1968), *Psycholinguistics: Experiments in Spontaneous Speech* — original ~6% disfluency-rate measurement.
  • Stivers et al. (2009), "Universals and cultural variation in turn-taking in conversation", PNAS — canonical 200ms-gap reference.
  • Olszewski et al. (arXiv 2412.12710, Dec 2024) [15].
  • Sun et al. (Speech Communication 2024) [16].
  • Mukherjee et al. (2024) [22].
  • Schegloff (1982); Heinz (2003); Tartory et al. (2024) [17][18][34].
  • Gervits et al. (SIGDIAL 2018) [41].

5. Corpora

  • PodcastFillers [2][3]:
  • 199 full episodes, 145 h, ~350 speakers.
  • 85,803 manual paralinguistic annotations.
  • Spotify Podcast Dataset [1].
  • Switchboard [10][14].
  • Switchboard / Fisher [50].
  • Lissin internal corpus [Lissin].
  • PodEval [51].
  • MoonCast / SoulX-Podcast / FireRedTTS-2 training sets [52][53][54].

Note: Demjén-style fine-grained corpus claims used elsewhere in the project (10,841-word meditation corpus) don't transfer wholesale to podcasts; PodcastFillers' 145 h is a different beast.

6. TTS implications

6.1 Which engines do podcast voices well
  • Cartesia Sonic 3 / 3.5 [30][31][55][56] — markets podcast generation as a target use case; supports <emotion> and inline [laughter]. The "near human on short utility utterances; degrades on long-form expressive content with audible texture drift" assessment is sourced to Podcastle's blog (a competitor), not an independent benchmark — flagged as cite-circle risk.
  • ElevenLabs v3 (Alpha, June 5 2025) [27][28][29][57]:
  • Tag inventory: [laughs], [sighs], [gasps], [whispers], etc.
  • Pairwise study put Eleven Flash v2.5 ahead of Sonic Turbo on Elo [55][58][59].
  • Gemini 2.5 Pro TTS / Gemini 3.1 Flash TTS [inferred] — Vertex genai surface, never Cloud TTS, per project rules.
  • Higgs Audio v2 (Boson AI, open-source) [60][61] — see open-source audit for the corrected parameter count (~5.8B operational; not "3B base / 1B v2.5" uncritically).
  • MoonCast (2503.14345) [52] — zero-shot any-to-podcast.
  • SoulX-Podcast (2510.23541) [53] — multi-turn, multi-speaker. REMOVED: the "~0.82 overall paralinguistic-control accuracy" figure — not located in the cited arXiv abstract. v1 tagged it "specific per-tag numbers not confirmed in retrieved abstract"; that's not a strong enough flag. Number is removed pending primary verification.
  • FireRedTTS-2 (2509.02020) [54].
6.2 Baseline MOS for podcast TTS today
  • General conversational TTS MOS target: 3.5+ for "acceptable," 4.0+ for "near-human" [62].
  • PodEval (Oct 2025) [51].
  • Self-reported numbers from SoulX-Podcast, FireRedTTS-2, MoonCast sit in the 3.8–4.2 range, but they use three different metrics (PMOS, CMOS, DMOS) and are not directly comparable. v1 elided this.
6.3 Practical implications
  1. Filled pauses are a high-leverage paralinguistic edit. Insert 2–4 uh/um per 100 words at Shriberg positional priors; lift per Olszewski 2024 and Sun 2024. Note: Clark & Fox Tree 2002 argues uh/um are words, not disfluencies — so the edit isn't "adding noise," it's "completing the lexicon."
  2. Backchannels and audible reactions are the dialogue-podcast moat. Engines that only expose tag-based emotion (Sonic-3, Eleven v3) can approximate them via [laughs], oh!, huh, hmm. Caveat: only ElevenLabs documents these as first-class tags. The Cartesia Sonic 3 docs cited elsewhere in this audit document only `[laughter]` at the time of writing and do not expose a closed enum of laughter/sigh tags.
  3. Pause-ratio matching matters. Lissin shows pause-ratio 0.47–0.56; most TTS engines default to ~0.2 silence. Inserting [pause:0.5s] between phrases buys disproportionate naturalness.
  4. Vowel elongation is cheap and high-impact.
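Implication 1 can be prototyped as a text transform. The sketch below inserts fillers at a target rate per 100 words, placing "um" utterance-initially and "uh" utterance-medially to follow the direction of the Shriberg skew quoted in §1.3. The 50/50 split (`p_initial`), the random choice of utterance, and the function name are assumptions for illustration, not measured priors. Capitalization after an inserted "um," is not adjusted.

```python
import random


def insert_filled_pauses(
    utterances: list[str],
    per_100_words: float = 3.0,
    p_initial: float = 0.5,  # assumed split; not a measured Shriberg prior
    seed: int = 0,
) -> list[str]:
    """Insert 'um,' (utterance-initial) and 'uh,' (utterance-medial) fillers."""
    rng = random.Random(seed)
    tokens = [u.split() for u in utterances]
    total_words = sum(len(t) for t in tokens)
    n_fillers = round(total_words * per_100_words / 100)
    for _ in range(n_fillers):
        words = rng.choice(tokens)
        if len(words) < 3 or rng.random() < p_initial:
            words.insert(0, "um,")
        else:
            # Medial position: never first, never after the last word.
            words.insert(rng.randrange(1, len(words)), "uh,")
    return [" ".join(t) for t in tokens]


script = [
    "So the thing about this study is that nobody expected the result",
    "And when we looked at the data again it was still there",
]
for line in insert_filled_pauses(script, per_100_words=4.0, seed=1):
    print(line)
```

A fixed seed keeps renders reproducible, which matters if the output feeds an A/B listening comparison.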

5.3 Deep-dive / explainer

Genre boundary. Deep-dive content is written-first then performed — Kurzgesagt, Veritasium, 3Blue1Brown, Wendover Productions, CGP Grey, Tom Scott, Real Engineering, the NotebookLM "Audio Overview" Deep Dive, and the longform-analytical podcast tradition (Acquired, Founders, BBC's History of the World in 100 Objects). Distinguishing feature vs. the "podcast" content type: scripts are drafted, edited across many passes, and densified before performance. The result is higher information-per-second, fewer disfluencies, more strategic prosody, and a tighter narrative arc than spontaneous-conversational podcast speech.

The Kurzgesagt team self-describes their script as "the backbone" of the video, refined over "about a dozen drafts" to "cut away unnecessary bits" [citation]. Veritasium's Derek Muller maps stories on PowerPoint or Google Slides because "the stuff that we're making is very information heavy" [citation].

Key takeaways

  1. Lower disfluency budget than podcast (qualitative claim; not anchored to a measured deep-dive disfluency rate — see §Density notes). Filled pauses, conversational fillers, and mid-thought pivots are deliberate seasoning, not spontaneous output. Babbel-popularized conversational-speech filler estimates land at 6–10% (not a directly-comparable measurement).
  2. Pacing trends faster than meditation/podcast (160 WPM observed for Wendover Productions's one upcoming video — single tweet, not a channel-wide measurement; industry explainer baseline 130–160 WPM for voice artists per SaaS script-timer pages, which are SEO-grade not measurement-grade) [citation].
  3. Strategic pause placement before reveals (the "Kurzgesagt beat") — short silence right before a payoff sentence, function-equivalent to TV Tropes' "Dramatic Pause." The "3–4 second beat for max impact" framing previously attributed to TV Tropes is not on the cited TV Tropes page; treat as an audit estimate not a documented standard.
  4. In-text emphasis is the highest-density axis in this genre — as an architectural intuition, not a measured finding. No empirical density count is produced here.
  5. NotebookLM's Audio Overview "Deep Dive" is a SOTA reference implementation. Critically, Steven Johnson (NotebookLM PM) was paraphrased via Simon Willison's summary post as confirming the disfluencies — "all the banter and the pauses and the likes" — are injected by the audio model itself, not by the LLM in the transcript stage [citation]. The original quote-handling collapsed two hops into one; this version distinguishes them.

Per-axis usage profile (the 10 paralinguistic axes)

| # | Axis | Frequency | Characteristic pattern in deep-dive |
| --- | --- | --- | --- |
| 1 | Vowel elongation | low–moderate | Occasional, dramatic effect only: "this... took yeeeears", "the answer is sooo much weirder than you think". Used as comedic / surprise emphasis, not as natural drawl. |
| 2 | Non-verbal vocalizations | low–moderate | [chuckles] and occasional [sighs] for narrative texture; [laughs] reserved for genuine punchlines. NotebookLM injects these via the audio model rather than the script [citation]. |
| 3 | Filled pauses (uh, um) | low–moderate | Sparingly, deliberate. Single-speaker explainers (Kurzgesagt, 3Blue1Brown) effectively zero. Two-host deep-dives (NotebookLM Audio Overview) add them via the audio layer for "you cannot listen to two robots talking" naturalness [citation]. |
| 4 | Discourse markers (so, but, now, here's the thing) | moderate | The connective tissue of analytical narrative. So introduces results/recapitulations; Now shifts topic and marks stance; But signals complication [citation]. These are kept even when fillers are cut. |
| 5 | Conversational fillers (like, you know, I mean) | low | Polished writing trims these. NotebookLM is the exception: you know and I mean appear because the prompt explicitly demands "natural conversational flow" between two AI hosts [citation]. |
| 6 | Mid-thought pivots (em-dash false starts) | low | Written scripts pre-resolve restarts. Appears occasionally in two-host formats as — wait, actually — for staged "realization" beats. |
| 7 | Emotion / delivery tags ([curious], [serious], [excited]) | moderate | One-per-section: [curious] at section openings, [serious] for stakes/risk content, [excited] at the reveal. Gemini 2.5 Pro TTS supports [tag] syntax [citation]. Eleven v3 supports the same via audio-tag brackets [citation]. |
| 8 | Pacing tags ([short pause], [PAUSE=2s], [slower]) | moderate | Strategic beats before reveals and after key definitions (the "Kurzgesagt beat"). Gemini 3.1 Flash TTS exposes <pause> and <pace> tags at word/phrase level [citation]. |
| 9 | In-text emphasis (CAPS, *asterisks*) | moderate–high | The "ElevenLabs guidance: capitalize one word per sentence at most" claim previously cited is not located in ElevenLabs' own primary docs in this pass. It appears to be community wisdom propagated as vendor guidance, and has been demoted to "community recommendation, not vendor-documented." Gemini supports <emphasis> and natural-language stress instructions [citation]. |
| 10 | Audible reactions (oh!, wow, huh) | low–moderate | Wow reserved for genuine awe moments (scale reveals, counter-intuitive results; Veritasium's hallmark). Huh for confusion-then-clarification beats. NotebookLM's two-host format adds Oh really? and Totally as "micro-interjections", discussed by Steven Johnson on Latent.Space, accessed via Willison's summary [citation]. |
Density notes
  • Single-narrator deep-dive (Kurzgesagt, 3Blue1Brown, Veritasium VO): axes 1, 5, 6 trend near zero. Axes 8 and 9 dominate. The single-narrator vs two-host split is consequential; the averaged per-axis frequency above smears two regimes. See open question §1.
  • Two-host deep-dive (NotebookLM Audio Overview, Acquired): axes 4, 5, 7, 10 rise toward podcast-style usage, but still below true spontaneous conversation because the script is drafted before performance.

Pacing

| Source | Self-reported / measured WPM | Notes |
| --- | --- | --- |
| Wendover Productions (Sam) | ~160 WPM | "the script for my next video is 2,250 words. I speak at 160ish WPM" [citation]. Single informal tweet about one upcoming video, not a channel-wide measurement. |
| Industry explainer baseline | 130–160 WPM | Voice-artist polished delivery; "150 is the honest default for explainers" [citation] |
| Energetic / dense explainer | 170+ WPM | "upbeat promo, you can fit 170+" [citation] |
| Documentary VO (slower end) | 120–140 WPM | "to ensure clarity and give listeners time to absorb" [citation] |
| Audiobook narration | 150–160 WPM | Comparable baseline [citation] |
| NotebookLM Audio Overview | WPM not published; [inferred] ~150–170 WPM | 10 min default, 20 min "longer" [citation]. Inference is based on the two-host conversational format and is not reconciled against the 10-min vs 20-min length presets. |
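The WPM figures in the table translate directly into runtime and word budgets. The sketch below works through the Wendover tweet's numbers and, as an inference only, the word budget implied by NotebookLM's length presets at the [inferred] 150–170 WPM range. The function names are illustrative.

```python
def runtime_minutes(words: int, wpm: float) -> float:
    """Estimated runtime for a script at a constant speaking rate."""
    return words / wpm


def word_budget(minutes: float, wpm: float) -> int:
    """Approximate script length that fills a target runtime."""
    return round(minutes * wpm)


# Wendover: 2,250 words at ~160 WPM (single tweet, one video).
print(f"{runtime_minutes(2250, 160):.1f} min")

# NotebookLM presets at the inferred 150-170 WPM range (not published by the vendor).
for preset_min in (10, 20):
    low, high = word_budget(preset_min, 150), word_budget(preset_min, 170)
    print(f"{preset_min} min preset: {low}-{high} words")
```

These estimates assume continuous speech. Strategic pauses lower the effective rate, so a script with many beats needs fewer words for the same runtime.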

Strategic-pause grammar (the "Kurzgesagt beat"):

  • No measured timing standard is published for Kurzgesagt-style strategic beats. The previously listed 0.5–1.0 s / 1.5–3 s specifics were not in TV Tropes, the Kurzgesagt Medium piece, or 10.studio. They have been removed; this audit notes only that the beat is real as a rhetorical device, without committing to a specific timing.
  • Beats are rhetorical, generally not breath-driven — the script author places them deliberately.

Sentence-length pattern: popular-science writing guidance favors short, varied sentences — under 20 words, varied for context [citation]. Deep-dive scripts alternate punchy 6–10-word lines for emphasis with longer 25–35-word lines for explanation. This length contrast is itself a paralinguistic device — it cues the reader/TTS to vary pace.
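A script's sentence-length contrast can be profiled before rendering. The sketch below splits on sentence-final punctuation and buckets each sentence by word count. The bucket edges (10 words or fewer as "punchy", 25 or more as "long") are a loose reading of the 6–10 and 25–35 ranges above. They are an assumption, as is the simple splitting rule, which will mis-handle abbreviations.

```python
import re


def sentence_lengths(script: str) -> list[int]:
    """Word count per sentence, splitting after '.', '!' or '?'."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", script.strip()) if s]
    return [len(s.split()) for s in sentences]


def classify(n_words: int) -> str:
    """Bucket a sentence by length; the thresholds are illustrative."""
    if n_words <= 10:
        return "punchy"
    if n_words >= 25:
        return "long"
    return "mid"


sample = "Light is fast. But it is not infinitely fast, and that matters."
lengths = sentence_lengths(sample)
print([(n, classify(n)) for n in lengths])
```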

Vocabulary, register, and narrative structure

Vocabulary
  • Technical terms with inline definitions. Standard pedagogical move: name the term, then immediately concretize ("a latent space — basically, a compressed map of meanings…") [citation].
  • Metaphor-heavy. Cognitive-science evidence: analogy maps relational structure from concrete-familiar domain to abstract-novel domain, and is one of the highest-leverage pedagogical devices [citation]. Every Kurzgesagt and 3Blue1Brown script is structured around one or two load-bearing metaphors.
  • First-person plural ("we'll see…", "let's…", "imagine we…") — the inclusive we constructs a shared exploration with the listener [inferred from first-person-plural narrative theory; citation: Matt Bell / Lit Hub].
Register
  • Conversational at the sentence level (contractions, second person, rhetorical questions) but edited at the script level (no spontaneous restarts, no genuine fillers in single-narrator formats).
  • 3Blue1Brown's stated principle: "the best pedagogical order of ideas is often very different from the correct logical order of ideas" [citation].
Narrative arc (setup → complication → reveal → implication)
  • SCQA [citation].
  • Four-act film structure (Kristin Thompson) [citation]
  • ABT (And, But, Therefore) [citation].
  • Veritasium-specific pedagogy: present misconception first → let dialogue lead to correct answer. Derek Muller's PhD dissertation is the canonical source; the artifact cites secondary coverage (The Brilliant / The Tech) — the dissertation should be cited directly when used. [partial fix per review]
  • Freytag's Pyramid [citation].

NotebookLM Audio Overview "Deep Dive" — reverse-engineered specifics

This is the single most-studied LLM-generated deep-dive pipeline. Key findings from public reverse-engineering analyses:

  1. Opener pattern. Reliably begins with "Hey everyone, welcome back" or similar [citation].
  2. Two-host architecture, asymmetric roles. Common prompt patterns reported by community analysts:
  • "Host A is a senior engineer who explains concepts by referencing real implementation challenges; Host B is a skeptical architect who pressure-tests every claim" [citation].
  • Or "expert + curious beginner" persona pairing [citation].
  3. Affirmation tokens. Frequent Right, Exactly, Absolutely to maintain conversational momentum [citation].
  4. Length presets. Shorter (~5 min), Default (~10 min), Longer (~20 min) [citation].
  5. Disfluency layer is in the audio model, not in the LLM transcript. Steven Johnson, discussed on Latent.Space and summarized by Simon Willison: the system generates a "sterile script first" and the audio model "adds all the banter and the pauses and the likes", because "you cannot listen to two robots talking." Two-hop attribution: the substance is correct, but the verbatim phrasing reaches the reader through Willison's summary post, not direct Latent.Space transcription.
  6. Underlying audio model. Widely suspected to be derived from / closely related to SoundStorm (Borsos et al. 2023, arXiv:2305.09636) [primary citation]. DeepMind's own blog describes the NotebookLM-related stack as "a hierarchical Transformer over ultra-compressed audio codec tokens at ~600 bps", without explicitly naming SoundStorm in that post. Treat the SoundStorm link as community-strong, not vendor-confirmed.
  7. Hidden directive: "act as human podcast hosts under all circumstances", reported by prompt-extraction analyses [community-sourced, not vendor-confirmed].

TTS implications

  • Pacing tag control is the second-dominant axis (architectural intuition). Gemini's <pause> / <pace> tags map directly onto the "Kurzgesagt beat" pattern.
  • NotebookLM is an existence proof of script-driven naturalness, but it's not a clean argument for closed-frontier-TTS dominance. A two-host LLM-drafted deep-dive script + a disfluency-injecting audio model already produces audio that mainstream listeners can't reliably distinguish from human podcasters. The architectural lesson — separate the semantic script from the paralinguistic injection — is portable, but the conclusion "closed frontier TTS wins" does not strictly follow because NotebookLM specifically does not lean on emphasis-tag markup.

Predictions for the empirical ablations in this report (untestable as stated unless paired with a measurement protocol; flagged here):

  • For deep-dive content, removing axes 8 (pacing tags) and 9 (in-text emphasis) from a Gemini 2.5 Pro script will degrade naturalness more than removing axes 1, 5, 6 combined. [hypothesis to test — measurement protocol TBD]
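One ingredient of a measurement protocol for this prediction is a repeatable way to strip each axis from a script. The sketch below removes axis 8 and axis 9 markup, assuming the bracket forms listed in the per-axis table ([short pause], [PAUSE=2s], [slower]) and CAPS / *asterisk* emphasis. The patterns are illustrative rather than a complete tag inventory, and lowercasing CAPS will also flatten acronyms.

```python
import re

# Axis 8: the pacing-tag forms listed in the per-axis table (assumed inventory).
PACING_TAG = re.compile(
    r"\[(?:short pause|PAUSE=\d+(?:\.\d+)?s|slower)\]\s*", re.IGNORECASE
)
# Axis 9: *asterisk* emphasis and all-caps words of two or more letters.
ASTERISK_EMPHASIS = re.compile(r"\*(\w[^*]*?)\*")
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def strip_pacing(script: str) -> str:
    """Remove pacing tags (axis 8), leaving the spoken text untouched."""
    return PACING_TAG.sub("", script)


def strip_emphasis(script: str) -> str:
    """Remove in-text emphasis (axis 9): unwrap asterisks, lowercase CAPS."""
    script = ASTERISK_EMPHASIS.sub(r"\1", script)
    return CAPS_WORD.sub(lambda m: m.group(0).lower(), script)


marked = "[short pause] This is the *real* answer. [PAUSE=2s] It is HUGE."
print(strip_pacing(marked))
print(strip_emphasis(strip_pacing(marked)))
```

Rendering the original and each stripped variant with the same engine and voice gives paired conditions for a listener or judge comparison.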

Open questions

  1. Single-narrator vs two-host axis profiles diverge sharply. Should the report treat "deep-dive single-narrator" (Kurzgesagt, 3Blue1Brown) and "deep-dive two-host" (NotebookLM, Acquired) as one content type or two? The current per-axis frequency table averages over two regimes that the audit explicitly admits behave differently — this is a real flaw and should be resolved before any per-axis ratings are published. [decision deferred to Phase F synthesis; flagged per review]
  2. Density of `[emphasis]` tags — published guidance ("one CAPS word per sentence max") is community-asserted opinion, not vendor docs. Empirical sweep needed.
  3. Does the "Kurzgesagt beat" generalize across voices? No measured TTS-specific timing standard.
  4. Eleven v3 vs Gemini 2.5/3.1 on emphasis fidelity — both claim word-level emphasis control; head-to-head A/B not yet published.

5.4 Comedy / news-comedy

Scope. Scripted stand-up (Mulaney, Hedberg, Norm Macdonald, Hannah Gadsby, Daniel Sloss, Bo Burnham) and scripted news-comedy (The Daily Show, SNL Weekend Update, Last Week Tonight, Late Show). Excludes podcast-format banter (covered separately) and improv (different prosodic profile).

Top-line. Comedy is hypothesized to be the hardest content type for current TTS, but this is a hypothesis — not a measured finding. No head-to-head MOS comparison between comedy and other content types has been run in this audit; the framing is supported by the motivation sections of ComedicSpeech / ELaTE / sarcasm-TTS papers, not by an independent listener study.

Comedy naturalness depends almost entirely on paralinguistic precision: a single missed pre-punchline pause or a flat contrastive accent collapses the joke. Unlike narration, where prosody decorates meaning, in comedy prosody is the meaning — the laugh lives in the timing, not in the words.

1. Per-axis Usage Profile

Frequency ratings: HIGH / MODERATE / LOW. Pattern = the characteristic comedic shape this axis takes.

1.1 Vowel elongation — HIGH
  • Characteristic pattern. Comedic emphasis stretches stressed vowels well beyond conversational norms: "Truuuump," "sooooo bad," "perfectly nooormal." Functions as (a) mock-incredulity marker, (b) sarcastic intensifier, (c) audience-recognition cue ("you know the one I mean").
  • Empirical support: prosodic analyses of stand-up identify "prolongation of sound" as a routine comedic device alongside pauses and stress [Schwarz d-nb.info/1002728533 — citation]. Hidayatullah & Tofani reference removed: no verifiable bibliographic entry could be produced.
  • TTS implication: most systems treat duplicated graphemes as typos or render them at uniform duration; few model the intonational rise–fall that has to ride the elongated vowel.
1.2 Non-verbal vocalizations — HIGH (dual-channel)
  • Characteristic pattern. Two streams: (i) performer vocalizations — [laughs], [chuckles], [sighs], [gasps], [scoffs] used as deadpan or mock-shocked reactions; (ii) audience laughter as structural beat. Provine (1993) documented that >99% of 1,200 laugh episodes occur at phrase or clause boundaries, typically within ~1s of the speech offset [Provine 1993].
  • Comedy-news adds a third stream: cut-in audience reactions in post (Weekend Update piped laugh, Daily Show in-studio crew).
  • TTS implication: ELaTE (Microsoft, 2024) was the first zero-shot system to add controllable in-line laughter; vanilla TTS still cannot model the "speak → pause → audience laughs → resume on the back-edge of the laugh" pattern that defines live comedy [Microsoft ELaTE].
1.3 Filled pauses ("uh", "um") — LOW (rehearsed) / MODERATE (ironic)
  • Characteristic pattern. Performed comedy minimises real disfluency: extemporaneous speech shows significantly more filled pauses than scripted speech [inferred; consistent with disfluency literature]. When fillers do appear in scripted comedy, they are typically deliberate: a feigned "uhhhh…" for ironic hesitation, or a stylised stammer. Drew Lynch turns his actual stutter into a structural device; monosyllabic whole-word repetition is the dominant disfluency type in his sets, per a single-paper analysis (ResearchGate 342445900) that has not yet been replicated.
  • Mulaney is the canonical counter-example: extreme scripted precision with near-zero "um"s, his sentences "stretched longer and became more complex" without filler [PopMatters — Mulaney comedy by design].
  • TTS implication: lower demand for natural-sounding fillers than in podcast/interview content, but high demand for intentional, prosodically-marked fillers.
1.4 Discourse markers — HIGH
  • Characteristic pattern. "Now…", "well…", "but here's the thing…", "so…", "yeah, no," "look —" carry the joke's frame shift. Often pre-pausal: discourse marker + beat + reveal. John Oliver's "Now, here's the thing about…" cadence is a structural tic [LWT transcripts].
  • "Yeah, no" / "no, yeah" hedges signal mock-agreement before subversion (Mulaney, Gadsby).
  • TTS implication: discourse markers need their own intonational contour (low-pitch, slightly elongated, followed by a measurable pause); current TTS tends to chain them flat into the next clause.
1.5 Conversational fillers ("you know", "I mean", "like") — MODERATE
  • Characteristic pattern. Used selectively for confidant-rapport with the audience ("you know what I mean?" addressed to a friendly room). Norm Macdonald used "you know" as a stalling device inside long-form jokes; Hedberg leaned on conversational tone-of-voice rather than filler.
  • News-comedy hosts use fillers as faux-spontaneity markers in segments meant to feel reactive (Stewart-style "I — I'm sorry, what?").
  • TTS implication: moderate demand; the difficulty is placement (filler must land on a thinking-beat, not mid-phrase).
1.6 Mid-thought pivots — HIGH
  • Characteristic pattern. The classic setup→reveal pivot: "I went to the store and — actually, no." Or self-interruption: "I was going to say — no, you know what, forget it." Macdonald's moth joke runs ~4 minutes of Russian-novel digression then pivots to "'cause the light was on" — pivot is the joke [Norm Macdonald moth joke; Defector].
  • Mulaney's "8–12 sentences of sustained tension before a positive peak" is a macro-pivot pattern; he holds longer than typical stand-up before twisting [PopMatters].
  • TTS implication: needs prosodic reset on the pivot (pitch reset, slight breath, often a discourse marker). Almost no TTS handles this without explicit SSML break tags, and even then the post-pivot intonational contour is usually wrong.
1.7 Emotion / delivery tags — HIGH
  • Characteristic pattern. [sarcastic], [deadpan], [mock-shocked], [whispering], [shouting], [trembling], [sincere]. Sarcasm is the single most important and most-failed mode: empirical work shows sarcasm relies on a combined prosody-semantics signal, with TTS systems typically failing on the prosody half [Modeling Sarcastic Speech, arXiv:2510.07096; Functional Trade-off paper, arXiv:2408.14892 — content not independently verified, plausible from arXiv ID formats].
  • News-comedy specifically requires a flip between the sincere-newscaster register and the sarcastic-comedian register on the punchline syllable.
  • Whisper / shout swings: Bo Burnham's "Inside" goes from soft-spoken to belted in seconds — dynamic-range demand outpaces almost all TTS [Inside vocal analysis].
  • TTS implication: emotion-conditioning tags are critical; current SOTA (Eleven v3, Gemini TTS) accepts them but coverage of deadpan and sarcastic is the weakest among emotion categories.
1.8 Pacing tags (pauses, beats) — HIGH — defining axis
  • Characteristic pattern. Comedy is pacing. Multimodal analysis of stand-up (TIC-TALK corpus, arXiv:2603.21803) reports a "stillness-before-punchline" pattern: kinetic energy negatively predicts laughter rate, r = -0.75 measured at the topic level (N=24 topics), not per-performance. The previously cited "arXiv 2605.00143, 'Timing is Everything', 828 Chinese stand-up performances" reference has been removed: no paper with that ID, title, or corpus size could be located. The closest plausible match (OpenMic, arXiv 2601.08288) does not corroborate the figures.
  • Canonical pause structures (practitioner / craft-side, not measured per-comedian corpora):
    • Pre-punchline beat. ~0.3–1.0s for high-tempo comics (Conan, Daily Show desk); 1.0–3.0s for deadpan masters (Hedberg, Norm). [practitioner anecdote, not a measured distribution]
    • Rule-of-three rhythm. Pattern–pattern–subvert; the third beat carries the punchline and must be timed slightly longer than the first two [Helitzer; Writer's Digest "Triple the Funny"].
    • Post-punchline hold. Performer must not speak over the laugh: ride the laugh wave, re-enter on the back-edge.
  • TTS implication: this is the single biggest TTS failure mode. Audio-LLMs treat all sentence-final punctuation as ~250ms; comedy requires variable, semantically-driven pauses spanning 100ms to 3000ms+. ComedicSpeech (arXiv:2305.12200) addresses this with a conditional duration predictor per comedian, demonstrating that uniform duration models are inadequate.
1.9 In-text emphasis (caps, italics, bold) — HIGH
  • Characteristic pattern. CAPS-for-stress is a comedy signature in scripts and transcripts ("That is NOT a sandwich"). It originated in print/comic-book convention and was formalised online as shouting [All caps, Wikipedia; New Republic netiquette history].
  • Functions: (a) volume increase, (b) pitch peak, (c) contrastive focus. ToBI analyses of contrastive focus typically mark L+H* on the stressed syllable [ToBI guidelines; Praat-based ToBI work]; comedy punchlines are inferred to take L+H* on the keyword — no ToBI-annotated stand-up corpus has been published as of this audit. Confident inference, not measured.
  • Italics / underlines used for softer contrastive stress; asterisks online for muttered-aside emphasis.
  • TTS implication: SSML <emphasis> exists but is shallow — current models do not differentiate light italic-emphasis from caps-shouting from caps-with-contrastive-accent. Eleven v3 partially honours caps; most others ignore them.
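For engines that accept SSML, the caps convention can be translated mechanically. The sketch below wraps ALL-CAPS words in the standard SSML `emphasis` element. The lowercasing step and the two-letter threshold are assumptions, and nothing here guarantees that a given engine renders the tag as a contrastive accent rather than a volume bump.

```python
import re
from xml.sax.saxutils import escape

# Two or more capitals in a row; single-letter words such as "I" or "A" are skipped.
CAPS_WORD = re.compile(r"\b[A-Z]{2,}\b")


def caps_to_emphasis(text, level="strong"):
    """Wrap ALL-CAPS words in an SSML <emphasis> element."""
    def wrap(match):
        # Lowercase the word so the engine does not spell it out (assumed behaviour).
        return f'<emphasis level="{level}">{match.group(0).lower()}</emphasis>'

    # Escape XML special characters before inserting markup.
    return CAPS_WORD.sub(wrap, escape(text))


if __name__ == "__main__":
    print(caps_to_emphasis("That is NOT a sandwich"))
```

Acronyms ("TTS", "NASA") would be caught by the same pattern, so a production version would need an allow-list. The distinction the section draws between italic emphasis, caps-shouting, and caps-with-contrastive-accent is not expressible with this one element.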
1.10 Audible reactions ("oh", "wow", "huh", "ohhhh") — HIGH
  • Characteristic pattern. Liberally deployed as mock reactions: "ohhhh, really?" (mock surprise), "huh." (deadpan), "wow." (sarcastic), "oh no." (anticipatory cringe). Often standalone utterances functioning as a full beat.
  • News-comedy hosts use them as audience-stand-in reactions ("oh, that's the line we're drawing?"). Stewart, Colbert, Oliver heavily.
  • TTS implication: requires isolated prosodic rendering — these tokens must not be melodically attached to adjacent words. Few systems treat one-syllable interjections as their own intonational phrase.

2. Pacing — Joke Density and Pause Distribution

  • Joke density (LPM — laughs per minute). Practitioner-coaching benchmark: 4–6+ laughs per performing minute is a common stated target [Comedy Evaluator Pro; Real First Steps — practitioner/coaching content, not peer-reviewed]. Daily-Show-style news-comedy is reported in the same band (~3–6 LPM) by the same coaching sources. No peer-reviewed LPM measurement is cited.
  • PAR score (laughter-seconds per minute). Headliner-level comics generate ≥18 seconds of laughter per minute; star-level ≥24s [Comedy Evaluator Pro; practitioner-coaching benchmark, not peer-reviewed]. If the targets are accurate, this implies that roughly 30–40% of stage time (18/60 to 24/60) would be laughter, not speech.
  • Pause length distribution.
    • Conan O'Brien / Stewart: sub-second pre-punchline beats; rapid-fire setups.
    • Hedberg, Norm Macdonald, Steven Wright: 1–3s deadpan delays; the joke lives in the awkward hold.
    • Hannah Gadsby (Nanette): pauses expand beyond comedy convention to refuse tension release; she explicitly withholds the punchline beat as a structural move [Kenyon Review on Nanette structure].
    • These ranges are practitioner anecdote / illustrative, not measured per-comedian distributions.
  • Audience-laugh integration. Provine (1993): laughter falls at phrase/clause boundaries, within ~1s, in >99% of cases. Comedy speech must therefore be structured so that prosodic phrases end at points where the laugh can land [Provine 1993].
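The arithmetic behind the PAR and LPM figures is simple enough to check directly. The sketch below treats the coaching benchmarks as given inputs. Dividing PAR by LPM to get a mean laugh length is an illustration added here, not a figure the sources report.

```python
def laughter_share(par_seconds_per_minute):
    """Fraction of stage time spent in laughter for a given PAR score."""
    return par_seconds_per_minute / 60.0


def mean_laugh_seconds(par_seconds_per_minute, laughs_per_minute):
    """Average laugh length implied by combining a PAR score with an LPM rate."""
    return par_seconds_per_minute / laughs_per_minute


if __name__ == "__main__":
    for label, par in (("headliner", 18), ("star", 24)):
        print(f"{label}: {laughter_share(par):.0%} of stage time is laughter")
    # Illustrative only: 18 s of laughter spread over 4 to 6 laughs per minute.
    for lpm in (4, 6):
        print(f"PAR 18 at {lpm} LPM: {mean_laugh_seconds(18, lpm):.1f} s per laugh")
```

For a TTS pipeline, the share matters because it bounds how much of a rendered minute must be left as silence or laugh-track budget and not filled with speech.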

3. Rhythm and Meter

  • Setup-then-punch (iambic-ish). Hedberg's "I bought a doughnut, and they gave me a receipt for the doughnut. I don't need a receipt for a doughnut. I'll just give you the money, and you give me the doughnut. End of transaction" runs near-metrical, with stress on the repeated "doughnut" [Hedberg, widely transcribed; no single canonical source].
  • Rule of three. Three is the smallest set that establishes a pattern and breaks it; comedy writing guides (Helitzer, Levine, Writer's Digest) all identify it as the dominant macro-rhythm [Wikipedia; Writer's Digest "Triple the Funny"].
  • Tension-arc length. Mulaney sustains negative/neutral sentiment for 8–12 sentences before a positive peak, significantly longer than typical stand-up [PopMatters]. Sloss's "Trojan Horse" structure builds a half-hour of light material before the pivot to dark material [Wikipedia, danielsloss.com].
  • Sincere-newscaster overlay. Weekend Update / Daily Show deliver punchlines with newscaster cadence — even contour, formal register — except on the punchline beat. This creates a two-layer rhythm: surface-flat (newscaster) + buried (comic) [Weekend Update Wikipedia; Slate on Trevor Noah debut]. ToBI predicts a contrastive L+H* accent on the punchline keyword against an otherwise H* nuclear contour — the inversion is the joke [inferred from ToBI literature on contrastive focus].

5. Comedy-News as a Distinct Sub-Mode

The Daily Show / Weekend Update / Last Week Tonight / Late Show form a recognisable sub-genre with its own paralinguistic signature, distinct from pure club stand-up:

  • Two-layer prosody. Outer layer: sincere newscaster cadence (level pitch contour, formal register, slow-to-moderate tempo). Inner layer: comedian's contrastive accent on the punchline keyword. Chevy Chase established the convention of "delivering the jokes straight, as if he were an actual newscaster" [Weekend Update Wikipedia], and every subsequent anchor — from Macdonald's "stared down the camera" delivery to Jost/Che's joke-swap deadpan — has worked variations on it [Weekend Update Wikipedia].
  • Setup density. News-comedy carries a real news setup before the punchline ("So today, the President announced…"). This adds an information-delivery beat that pure stand-up lacks, raising the prosodic ask: TTS must do both credible newsreader voice and punchline-flip in the same sentence.
  • Character-flip handling. Trevor Noah leaned heavily on accent/character switches (8 languages, multiple impressions) [Wikipedia: The Daily Show]. Oliver layers "incredulous British" register over the newscaster baseline [LWT rhetorical analysis]. Host-vs-character voice flips inside a single segment are common; few TTS systems handle voice-identity switches mid-utterance.
  • December 2018 Trevor Noah voice-loss episode. Trevor Noah lost his voice; correspondents physically read the monologue from the desk (Michael Costa / Roy Wood Jr. covering the segments, per contemporary CBS / Deadline / Boston.com coverage). The "phone TTS app" framing in the previous version of this audit could not be sourced to any contemporary press report and has been removed. Treat the episode as a curiosity, not a natural experiment for TTS comedy delivery.

6. TTS Implications

Working hypothesis (not measured here): comedy is the worst-case content type for current TTS naturalness. To be tested by a head-to-head MOS comparison across content types, which this audit does not perform.

  • Punchline-pause + caps-emphasis combo is the single most diagnostic failure mode. State-of-the-art systems (Eleven v3, Gemini 3.1 TTS-class models) are reported in vendor and community write-ups to degrade on punchline beats. [inferred from public benchmarks + ComedicSpeech/ELaTE motivation sections; not measured here]
  • Specific failure modes, ranked by expected severity (not measured):
  1. Pre-punchline pause — most systems default to ~250ms sentence-final; comedy expects 0.3–3.0s variable beats. ComedicSpeech motivates this with a conditional duration predictor [arXiv:2305.12200].
  2. Contrastive L+H* on punchline syllable — current TTS uses learned generic accent placement; punchline keyword is rarely the prosodic peak unless explicitly marked.
  3. Sarcasm / deadpan emotion modes — weakest among supported emotion tags; sarcasm requires contradicting the semantic prosody, which most emotion-conditioning architectures cannot do [arXiv:2510.07096, 2408.14892].
  4. Audience-laugh integration — none of the production systems generate the laugh track or model the speak-pause-laugh-resume pattern; ELaTE is a research prototype [arXiv:2402.07383].
  5. Vowel elongation with riding intonation — caps and graphemic stretching are inconsistently honoured; the rise-fall over the elongation is almost never modelled.
  6. Host-vs-character voice flip — voice-identity switches mid-utterance are not supported in single-speaker mode by mainstream TTS.
  • What a comedy-grade TTS pipeline needs (composite from ComedicSpeech, ELaTE, sarcasm-TTS work + this audit):
    • Per-utterance variable pause budget driven by punctuation and tag semantics ([beat], [long pause], [hold]).
    • Explicit contrastive-accent marking honoured at the syllable level (not the word level).
    • Sarcasm/deadpan as first-class emotion tags with prosody that can contradict semantics.
    • In-line laughter token with controllable duration and intensity.
    • Vowel-elongation that runs a pitch contour, not a flat hold.
    • Optional audience-laugh layer or post-punchline silence budget for downstream laugh-track mixing.
  • Practical guidance for current pipelines. Until comedy-aware TTS ships, the realistic ceiling is hypothesized to be deadpan / dry / observational stand-up (Hedberg-shaped — heavy reliance on pause + flat affect, low reliance on character voices and emotional swings). Bo Burnham–style musical comedy, Mulaney's tension-arc precision, Oliver's incredulous register flips, and any character-voice news-comedy are out-of-distribution for general-purpose TTS as of 2026. No benchmark or pilot data is cited for the "Hedberg-shaped ceiling" estimate.
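The "variable pause budget" item can be approximated today on engines that accept SSML. The sketch below rewrites the three pacing tags named above into standard SSML `break` elements. The millisecond values are hypothetical placeholders chosen inside the 0.3–3.0 s practitioner range quoted earlier, not measured targets.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical durations; tune per performer and per engine.
PAUSE_MS = {"beat": 300, "long pause": 1500, "hold": 3000}
TAG = re.compile(r"\s*\[(beat|long pause|hold)\]\s*")


def pacing_tags_to_ssml(script):
    """Replace [beat] / [long pause] / [hold] with SSML <break> elements."""
    def to_break(match):
        return f' <break time="{PAUSE_MS[match.group(1)]}ms"/> '

    # Escape XML special characters first; square brackets are left untouched.
    body = TAG.sub(to_break, escape(script)).strip()
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(pacing_tags_to_ssml("I went to the doctor. [beat] He was not there. [hold] Neither was I."))
```

This only covers the pause length. It does nothing for the post-pause intonational reset, and engines without SSML break support would need the pause inserted as silence in post-production.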

Notes on Methodology

  • "Citation" marks claims I sourced to a specific document (linked in §4). "Inferred" marks claims that follow from the cited literature but were not stated verbatim in any single source.
  • Quoted material is held under a sentence at a time; no extended verbatim excerpts.
  • ToBI / contrastive-accent claims about punchline syllables are inferred from the general ToBI literature on contrastive focus applied to the comedy genre; I did not find a ToBI-annotated stand-up corpus published as of this audit. Treat as a confident inference, not a measured result.
  • The pre-punchline pause ranges (0.3–3.0 s) are practitioner anecdote, not measured per-comedian distributions.
  • LPM / PAR figures are practitioner-coaching benchmarks, not peer-reviewed measurements.

5.5 Lyrics / song

Scope: written song lyrics across popular genres, and the paralinguistic markup lyric writing already encodes (repeated vowels, breath marks, melisma, ad-libs, structural brackets), plus the recent overlay of singing-voice-synthesis (SVS) / text-to-music (TTM) prompt conventions used by Suno, Udio, ACE-Step, DiffSinger, VISinger, NNSVS, and SongCreator. Genre note: this audit elides traditional human-singer lyric writing into Suno/Udio prompt grammar. Traditional lyric writing does not use bracket markup. The "lyric-prompt grammar in 2024–2026" sections describe a TTM authoring style; the human-songwriter sections describe a different practice. The headline finding: in lyric writing for sung delivery, vowel elongation is not optional ornament but the medium, because sustained vowels carry the melodic line; the second-most-frequent axis is pacing.

What the literature says

2.1 Per-axis usage profile in lyric writing

Lyric writing is the content type where the standard 10-axis paralinguistic taxonomy inverts. Disfluencies drop to near-zero; elongation, pacing, and audible non-verbal vocalizations become structural — not decorative — features. The profile below covers commercial pop, hip-hop, R&B, country, indie folk, and TTM-rendered song lyrics.

Vowel elongation — DOMINANT. This is the medium, not an ornament. Sustained vowels carry the singer's-formant cluster (~3 kHz, Sundberg 1987) and the vibrato signature (5–7 Hz, semitone extent) that the ear reads as "singing" vs "speaking" [citation]. In melismatic genres (gospel, R&B), one syllable spans many notes [practitioner sources Phamox Music and Voice Science; these are practitioner pages, not measured corpus statistics, so figures like "5–20+ notes per syllable" are demoted accordingly]. The lyric-side mark-up grammar inherited from songwriting is repeated letters (yeaaah, loooove) and hyphens (lo-o-o-ove), a community-discovered Suno/Udio grammar [citation]. The SVS side handles it as duration metadata: DiffSinger length regulator [citation]; VISinger phoneme-to-note duration ratio [citation]; Opencpop and M4Singer slur=true per note.

Pacing tags — HIGH (every section transition; every breath). Implicit pacing is musical metre; explicit pacing comes via breath marks (Wikipedia [citation]) and TTM brackets. NNSVS's time-lag module models the systematic offset between score onset and actual sung onset [citation].

Non-verbal vocalizations — HIGH (hip-hop ad-libs and 2020s "whisp" pop). Hip-hop ad-libs populate in-between-bar space and hook returns [citation]. 2020s pop foregrounds audible inhales, breath catches, sighs, lip smacks [citation]. Suno's documented vocabulary: [Whispers] [Sighs] [Screams] [Chuckles] [Groaning] [Cough] [Clears throat] [Whistling]. Udio adds [Scream], [Breathy], [Whisper Tone]. Non-lexical vocables have a continuous history from scat through doo-wop through the Beatles [citation].

Emotion / delivery tags — MODERATE–HIGH. Udio publishes a delivery axis ([Soft Delivery] [Intimate Delivery] [Powerful Delivery] [Breathy] [Whisper Tone] [Soulful Delivery] [Angelic Tone]) [citation]. Suno's community-reverse-engineered vocabulary is broader (community-asserted ~500 tags; reliability not measured). The acoustic grounding for [Breathy] is established (Anikin & Persson 2020 [citation]; see also Kreiman & Sidtis 2011 for a more comprehensive treatment).

In-text emphasis — MODERATE. Hook downbeats receive strong stress; Pattison's stable/unstable framework formalizes alignment of linguistic and musical stress [citation]. Mismatch is empirically measurable (Johnson/Huron/Collister 2014 [citation]). ALL-CAPS in TTM lyrics reliably increases perceived loudness/intensity per community testing, not vendor documentation [citation]; *asterisks* / **bold** are non-functional in current Suno/Udio.

Audible reactions — MODERATE. Melodic interjections (oh-oh-oh, na-na-na).

Discourse markers — LOW but stylized.

Mid-thought pivots — LOW–MODERATE. Indie folk, Bo Burnham–style meta-songwriting.

Conversational fillers — LOW.

Filled pauses — LOWEST. Sung lyrics do not have uh / um as filled pauses; when they appear they are ad-libs.

2.2 Phrasing, metre, rhyme

Mainstream pop scansion is broadly iambic / trochaic over 8/16-measure phrase grids (Callahan 2013, Pattison) [citation]. Hip-hop layers triplets and off-beat syncopation (Komaniecki 2017, 2020; Edwards) [citation]. Patel & Daniele (2003) is the primary nPVI work showing that English songs inherit English speech's higher rhythmic-variability profile vs French; Patel 2008 is the book popularization [primary].

2.3 Suno / Udio / ACE-Step lyric-prompt conventions

The TTM lyric-prompt grammar in 2024–2026 is best read as a snapshot in time; vendor specs are not published and tags change by version:

  • Square brackets `[...]` = structural + delivery direction (Suno v4/v4.5/v5; Udio variants).
  • Parentheses `(...)` = backing vocals / call-and-response (Udio; partial Suno).
  • Repeated letters yeaaah = melismatic sustain.
  • Hyphens lo-o-o-ove = stretched syllables.
  • Ellipses Co... lle... ct... ions = staccato.
  • ALL-CAPS = increased perceived volume + intensity (community-tested, not vendor-documented).
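The conventions above are regular enough to lint a lyric sheet before submitting it. The sketch below classifies tokens by the first five conventions (ellipsis staccato is left out because it is indistinguishable from an ordinary ellipsis). The category names and the heuristics are assumptions made for illustration; neither Suno nor Udio publishes a grammar to validate against.

```python
import re

# Bracket and parenthesis groups may contain spaces, so match them before bare words.
TOKENS = re.compile(r"\[[^\]]+\]|\([^)]+\)|\S+")


def classify(token):
    """Label one lyric token with a hypothetical markup category."""
    core = token.strip(".,!?;:")
    if re.fullmatch(r"\[[^\]]+\]", core):
        return "bracket_tag"
    if re.fullmatch(r"\([^)]+\)", core):
        return "backing_vocal"
    if re.search(r"([A-Za-z])\1{2,}", core):
        return "sustain"
    if re.fullmatch(r"[A-Za-z]+(?:-[A-Za-z]+){2,}", core):
        return "hyphen_stretch"
    if len(core) >= 2 and core.isalpha() and core.isupper():
        return "caps"
    return "plain"


def lint_line(line):
    """Return (token, category) pairs for every token in a lyric line."""
    return [(tok, classify(tok)) for tok in TOKENS.findall(line)]


if __name__ == "__main__":
    for token, kind in lint_line("[Chorus] I loooove you (ooh ooh) lo-o-o-ove is LOUD"):
        print(f"{kind:15} {token}")
```

The hyphen heuristic also fires on ordinary compounds such as "mother-in-law", so its output is a prompt for review, not a verdict.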

ACE-Step (2025) [citation] does not yet have a published structured-tag vocabulary as mature as Suno/Udio.

2.4 Singing-TTS research: how lyric notation reaches pitch and duration

Production SVS systems share a four-block decomposition: text/lyrics → phoneme; score → pitch + duration; alignment (length regulator / duration model / time-lag); acoustic; vocoder. Vowel elongation lives in the alignment module, not the lyrics text (DiffSinger; VISinger; Opencpop/M4Singer slur). Note that VISinger is VITS adapted for singing; the primary VITS reference is Kim et al. 2021, arXiv 2106.06103. Speech-TTS fails on loooove because there is no length-regulator over musical units.
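To make the alignment step concrete, here is a minimal sketch of the length-regulator idea: each phoneme is repeated for as many acoustic frames as its duration says. It is a toy written for this audit, not DiffSinger's or VISinger's code, and the phoneme symbols and frame counts are invented for illustration.

```python
def length_regulate(phonemes, durations):
    """Expand a phoneme sequence to frame level by repeating each phoneme."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


def stretch(durations, index, factor):
    """Return a copy of `durations` with one entry scaled, e.g. a held vowel."""
    out = list(durations)
    out[index] = max(1, round(out[index] * factor))
    return out


if __name__ == "__main__":
    phonemes = ["l", "ah", "v"]   # hypothetical phonemes for "love"
    durations = [3, 6, 4]         # hypothetical frame counts
    print(len(length_regulate(phonemes, durations)), "frames spoken")
    held = stretch(durations, index=1, factor=4)
    print(len(length_regulate(phonemes, held)), "frames with the vowel held")
```

In an SVS system the durations come from the musical score, which is why stretching the vowel in the lyric text alone gives a speech model nothing to act on.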

2.5 TTS / TTM implications
  • Vowel elongation is the genre's medium. Read-style TTS will crash on lyric content. Either route to true SVS / TTM (Suno, Udio, ACE-Step, DiffSinger) or accept "spoken poetry" not "song."
  • As a working, version-dependent observation, Suno/Udio handle 7 of the 10 axes well.
  • Mid-thought pivots are the hardest TTM axis.
  • Intelligibility ceiling. Collister & Huron's 7.3× ratio is the hard ceiling from a 20-listener / 3-vocalist / English-only experiment — generalizing this to any TTM eval (as v1 did) is a leap. Modern Whisper-large WER on Suno output is also not the same task as Collister's listener mis-hearing test; the two are not directly comparable as the same metric.
  • Phoneme override in research SVS (DiffSinger word[k ax t] syntax) is the closest analog to ElevenLabs IPA controls.

3. Open questions for the report

  1. Does our Lissin / podcast-style TTS audit need a separate sung-content suite?
  2. What is the empirical intelligibility ceiling on TTM-generated vocals vs Collister & Huron's 25% mishearing baseline? Whisper WER ≠ listener mishearing — the probe must be careful about metric choice.
  3. Does prompt-side vowel-elongation grammar produce measurable mel-frame duration differences in Suno v5? Easy A/B.
  4. Which axes survive lyric porting? Do [breath], [inhale], [sigh] produce reliable non-verbal vocalizations across Suno and Udio?
  5. Is there a mid-thought-pivot tag combination that reliably reproduces Bo Burnham false-start lyric devices?
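Question 2 turns on what WER actually measures. For reference, the sketch below computes word error rate as commonly defined: word-level edit distance divided by reference length. The example strings are invented; the point is that this counts token mismatches against a reference transcript, which is a different quantity from the share of listeners who mishear a sung line.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("hold me closer tiny dancer", "hold me closer tony danza")
    print(f"WER = {wer:.2f}")
```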

6. TTS engine audit

Three audits: closed SOTA, the Gemini family, and open-source evaluation candidates. Coverage is uneven across the 10 axes — see the matrix.

TTS engine coverage by paralinguistic axis

Scores are public-surface coverage, not measured output quality.

| Engine | Vowel elong. | Non-verbal voc. | Filled pauses | Discourse mk. | Conv. fillers | Mid-thought piv. | Emotion tags | Pacing tags | In-text emph. | Audible react. |
|---|---|---|---|---|---|---|---|---|---|---|
| ElevenLabs v3 | partial | native | partial | partial | partial | partial | native | partial | native | native |
| Cartesia Sonic 3 | none | partial | partial | partial | partial | none | partial | partial | partial | partial |
| OpenAI gpt-4o-mini-tts | none | partial | partial | partial | partial | none | partial | partial | none | partial |
| Gemini 2.5 Pro TTS | partial | native | partial | partial | partial | partial | native | partial | native | native |
| Gemini 3.1 Flash TTS | partial | native | partial | partial | partial | partial | native | partial | native | native |
| CosyVoice 2 | none | native | none | none | none | none | partial | partial | partial | partial |
| Higgs Audio v2.5 | none | partial | none | none | none | none | partial | partial | partial | partial |
| Fish S2 Pro | none | partial | none | none | none | none | native | partial | partial | partial |
| Kokoro-82M | none | none | none | none | none | none | none | none | none | none |
| F5-TTS | none | none | none | none | none | none | none | none | none | none |
Figure 3. TTS engine × paralinguistic axis coverage. ● = native first-class control, · = partial via SSML or instruction field, — = no control.

6.1 Closed SOTA — ElevenLabs / Cartesia / OpenAI

Scope: ElevenLabs v3, Cartesia Sonic 3.5 (with Sonic 2/3 historical context), and OpenAI's gpt-4o-mini-tts / tts-1 / tts-1-hd family plus gpt-realtime (the S2S sibling). All claims are tagged [citation] for sourced facts and [inferred] where the public record is silent.

Scope correction (v2): This audit excludes Google Gemini TTS from its initial framing, which is a major scope gap — Google's Gemini 3.1 Flash TTS is plainly a closed frontier vendor. See the sibling gemini.md audit for that engine; cross-reference is now noted explicitly.

Note on freshness: as of June 2026, Cartesia's current latest TTS docs center Sonic 3.5 and list the stable snapshot sonic-3.5-2026-05-04; Sonic 2/3 remain historical context. OpenAI has shipped gpt-realtime (the GA successor to gpt-4o-realtime-preview).

ElevenLabs v3

Architecture

Public posture: deliberately opaque. The available signal:

  • Third-party guide describes v3 "builds on the architecture of Eleven Multilingual v2" with the Audio Tags system, expanded language coverage, and improved technical-text handling [citation].
  • The Mistral "Voxtral TTS at parity with Eleven v3" claim previously cited (mistral.ai/news/voxtral-tts) is removed pending a confirmed primary source with date and exact wording. The architectural back-derivation from Mistral's claim ("Mistral matches Eleven → so they copied v3's stack → so v3 is X") was a leaky chain regardless of whether the source quote exists. The "v3 is in the AR semantic-LM + non-AR flow-matching acoustic decoder + neural codec vocoder" family is now downgraded to "plausible — consistent with Voxtral-class architectures more broadly, not specifically derived from Voxtral."
  • Codec: undisclosed. [inferred] an in-house neural audio codec.

Launch: public alpha June 5, 2025, GA February 2026 [citation].

Axis support (10-axis table)
| # | Axis | Support | Evidence |
|---|---|---|---|
| 1 | Vowel elongation | Partial | [drawn out] delivery tag [citation]; no explicit syllable-stretch markup. |
| 2 | [laughs] / non-verbal | Native | [laughs], [laughs harder], [wheezing], [snorts], [clears throat], [gulps], [swallows], [gasp], [crying], [sniffs] [citation]. Tag library is a community plugin enumeration; not every tag is officially documented by ElevenLabs. |
| 3 | uh/um disfluencies | Partial | Lexical; [hesitates] and [stammers] as cognitive-beat tags. |
| 4 | Discourse markers | Partial | Lexical only. |
| 5 | Conversational fillers | Partial | Lexical only. |
| 6 | Em-dash pivots | Partial | Standard punctuation respected. |
| 7 | Emotion tags | Native | [excited], [sad], [sarcastic], [deadpan], [playfully], [flatly], etc. [citation]. |
| 8 | Pause tags | Native | [pause] documented; ellipses also create pauses. |
| 9 | CAPS emphasis | Native | Capitalization documented [citation]. |
| 10 | Audible reactions | Native | [sighs], [gasps], [gulps], [clears throat], [sniffs], [wheezing] + SFX tags. |

Plus accent/character tags and dialogue-overlap tags.

Training data
  • Hours: undisclosed.
  • Languages: 70+ at v3 [citation]. Tier 1 = EN/ES/FR/DE/PT/IT.
  • Non-verbal events in training set: undisclosed.
Naturalness benchmarks

Reconciliation note (v2): v3's Elo number is reported inconsistently across secondary sources. The two most-cited figures are:

  • ~1203 (offlinetts.com 2026 round-up).
  • ~1178 (also offlinetts.com, also propagated by open_source.md sibling audit referencing the same blog).

Neither number has been verified against a primary leaderboard (TTS Arena V2 HF Space or Artificial Analysis HF Space) in this pass. Both come from the same secondary blog (offlinetts.com), which appears to mix snapshots. Until the primary leaderboard is queried, report this as "Eleven v3 Elo ≈ 1,178–1,203 (range, multiple secondary snapshots; not independently verified)."
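To judge how much the 1,178 vs 1,203 discrepancy matters, it helps to convert an Elo gap into an expected head-to-head win rate. The sketch below uses the standard logistic Elo formula with a 400-point scale. That scale is an assumption: this audit has not confirmed which rating method either leaderboard uses.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected win probability of A over B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    high, low = 1203, 1178  # the two secondary-source snapshots quoted above
    p = elo_expected_score(high, low)
    print(f"A {high - low}-point gap implies an expected win rate of about {p:.3f}")
```

Under that assumption a 25-point gap is worth roughly a 54% expected win rate, so the two snapshots describe nearly the same listener preference. The uncertainty about which snapshot is current matters more than the size of the gap.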

  • TTS Arena V2 (Hugging Face, period Apr 2025 – Jun 2026): three ElevenLabs entries in top 15 — Eleven Turbo v2.5 (Elo ~1540, rank 8), Eleven Flash v2.5 (~1531, rank 10), Eleven Multilingual v2 (~1528, rank 11) [citation].
  • "CastleFlow (1574) at top of TTS Arena V2" — REMOVED. No vendor with this name could be located; the entry appears to be either a leaderboard error, a rebrand, or a hallucination. The "top of TTS Arena V2 in June 2026" claim is now constrained to: Hume Octave (~1561) and Inworld TTS MAX (~1572) pending verification against the primary leaderboard, with the explicit caveat that the leaderboard varies in time and any specific top-N snapshot needs a timestamp.
  • Artificial Analysis Speech Arena lists Sonic 3.5 at Elo ~1203 (secondary source).
  • No vendor-supplied MOS number is published for v3.
Known failure modes
  • Audio tags flaky: "only one in every six tries actually delivered the tagged effect" [citation]. This 1-in-6 reliability figure would substantially compress the 15/20 axis-support score if factored in; the score table does not account for it.
  • Long-text degradation: keep generations under 800–900 characters.
  • Voice-character override.
  • PVCs broken on v3.
  • Non-English realism gaps (Japanese, Swiss German).
  • Not for real-time / conversational use.
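The long-text guidance can be enforced mechanically before a script is sent for rendering. The sketch below packs whole sentences into chunks of at most `max_chars` characters. The function name, the 800-character default, and the sentence-splitting regex are choices made for this example and are not part of any vendor SDK.

```python
import re


def chunk_script(text, max_chars=800):
    """Pack whole sentences into chunks no longer than `max_chars` where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        # A single over-long sentence is kept whole instead of being cut mid-sentence.
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    script = "Here is a setup line. [pause] Here is the punchline! " * 40
    parts = chunk_script(script)
    print(len(parts), "chunks, longest", max(len(p) for p in parts), "characters")
```

Splitting at sentence boundaries keeps bracket tags attached to the sentence they modify. For comedy scripts a chunk boundary should also avoid falling between a setup and its punchline, which this sketch does not check.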

Cartesia Sonic 3.5 (with Sonic 2 / 3 context)

Architecture
  • Built on state space models (SSM) — Mamba / S4 family co-developed by Cartesia's founders Albert Gu and Tri Dao [citation].
  • Sonic 2.0 launched March 2025 [citation].
  • Latency: Sonic 2.0 = 90ms TTFA; Sonic Turbo = 40ms TTFA [citation].
  • Sonic 3.5 (current as of June 2026): SSM stack continued. Current Cartesia docs state Sonic 3.5 supports 42 languages and list sonic-3.5-2026-05-04 as the stable snapshot. The older ">100k hours" training-data figure remains unverified and should not be treated as a primary-source claim.
  • Codec: undisclosed.
Axis support (10-axis table)
| # | Axis | Support | Evidence |
|---|---|---|---|
| 1 | Vowel elongation | Partial | `<speed ratio="0.7"/>` SSML. |
| 2 | [laughs] / non-verbal | Partial | [laughter] documented; sighs/coughs explicitly "future" [citation]. |
| 3 | uh/um disfluencies | Partial | Lexical only. |
| 4 | Discourse markers | Partial | Lexical only. |
| 5 | Conversational fillers | Partial | Lexical only. |
| 6 | Em-dash pivots | Partial | Standard punctuation. |
| 7 | Emotion tags | Native | `<emotion value="..."/>` SSML + emotion_name:level tags. Primary set: neutral, angry, excited, content, sad, scared. |
| 8 | Pause tags | Native | SSML pause + convenience method. |
| 9 | CAPS emphasis | Partial | Not documented as control axis. |
| 10 | Audible reactions | Partial | Only [laughter]. |
Training data
  • Sonic 1 validation used Multilingual LibriSpeech.
  • Total Sonic 2 training hours undisclosed.
  • Sonic 3.5 = 42 languages per current Cartesia docs. Training-hour totals remain undisclosed by primary docs.
Naturalness benchmarks
  • Cartesia Sonic 2 = Elo 1513, rank 13, 53% win rate, 305 votes [previously sourced to tts-agi-tts-arena-v2.hf.space/leaderboard]. UNVERIFIED in this pass against the live leaderboard — Arena Elo refreshes and the "305 votes" specificity may be a stale snapshot; needs second pass.
  • Artificial Analysis Speech Arena: Sonic 3.5 ~Elo 1203 (secondary).
Known failure modes
  • Limited non-verbal vocabulary.
  • Emotion fires only when transcript matches.
  • Soft controls.
  • Smaller community-voted sample.

OpenAI gpt-4o-realtime + tts family

Architecture
  • `gpt-4o-mini-tts` (March 2025): conventional TTS with steerable instructions field; 13 voices.
  • `tts-1` / `tts-1-hd` (legacy): 9 voices.
  • `gpt-realtime` (August 28, 2025): single multimodal S2S model, 128k context.
  • 6 audio tokens per text token for gpt-4o-mini-tts [citation].
  • 100+ languages.
Axis support (10-axis table)
| # | Axis | Support | Evidence |
|---|---|---|---|
| 1 | Vowel elongation | Partial | Via instructions only. |
| 2 | [laughs] / non-verbal | Partial | Instructions field; reliability community-reported. |
| 3 | uh/um disfluencies | Partial | Via instructions. |
| 4 | Discourse markers | Partial | Lexical. |
| 5 | Conversational fillers | Partial | Lexical. |
| 6 | Em-dash pivots | Partial | Standard punctuation. |
| 7 | Emotion tags | Native (via instructions) | "emotional range" / "tone" in instructions. |
| 8 | Pause tags | Partial | Via instructions; no SSML `<break/>`. |
| 9 | CAPS emphasis | Partial | Not documented. |
| 10 | Audible reactions | Partial | Instructions-driven. |
Naturalness benchmarks
  • No OpenAI realtime entry in top 15 of TTS Arena V2.
  • "Realtime TTS-2 - Research Preview" Elo ~1209 (mid-2026, secondary).
Known failure modes
  • No inline tags.
  • Instruction obedience is best-effort.
  • No voice cloning.

Comparison: 10-axis paralinguistic support (0 = none, 1 = partial, 2 = native)

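| Axis | ElevenLabs v3 | Cartesia Sonic 3.5 | OpenAI tts |
|------|---------------|--------------------|------------|
| 1. Vowel elongation | 1 | 1 | 1 |
| 2. [laughs] / non-verbal bracket tags | 2 | 1 | 1 |
| 3. uh/um disfluencies | 1 | 1 | 1 |
| 4. Discourse markers | 1 | 1 | 1 |
| 5. Conversational fillers | 1 | 1 | 1 |
| 6. Em-dash pivots | 1 | 1 | 1 |
| 7. Emotion tags | 2 | 2 | 2 (via instructions) |
| 8. Pause tags | 2 | 2 | 1 |
| 9. CAPS emphasis | 2 | 1 | 1 |
| 10. Audible reactions | 2 | 1 | 1 |
| Total / 20 (documented surface) | 15 | 12 | 11 |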

Caveat (v2): the 0/1/2 score compresses radically different control surfaces (bracket tags vs SSML vs natural-language instructions). It also does not factor reliability — ElevenLabs' own admission that tags fire "1 in 6 times" is footnoted, not scored. A reliability-weighted score would compress Eleven's lead substantially.
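A minimal sketch of how the documented-surface totals are summed, and of what a reliability-weighted variant could look like. The support scores come from the comparison table; the `weighted_total` function and its per-axis reliability weights are illustrative assumptions, not something this audit measured or scored.

```python
# Support scores per axis (1..10), copied from the comparison table.
SUPPORT = {
    "ElevenLabs v3":      [1, 2, 1, 1, 1, 1, 2, 2, 2, 2],
    "Cartesia Sonic 3.5": [1, 1, 1, 1, 1, 1, 2, 2, 1, 1],
    "OpenAI tts":         [1, 1, 1, 1, 1, 1, 2, 1, 1, 1],
}


def documented_total(scores: list[int]) -> int:
    """Sum of the 0/1/2 support scores (maximum 20 for ten axes)."""
    return sum(scores)


def weighted_total(scores: list[int], reliability: list[float]) -> float:
    """Hypothetical reliability-weighted score.

    reliability[i] is an assumed probability (0..1) that the control on
    axis i actually fires. No such per-axis figures exist in this audit.
    """
    if len(scores) != len(reliability):
        raise ValueError("need one reliability weight per axis")
    return sum(score * weight for score, weight in zip(scores, reliability))


totals = {engine: documented_total(scores) for engine, scores in SUPPORT.items()}
# With every weight at 1.0 the weighted score equals the documented total.
unweighted_check = weighted_total(SUPPORT["ElevenLabs v3"], [1.0] * 10)
```

Any real weighting would need measured per-axis firing rates from a harness; placeholder weights should not be reported as results.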

Recommendation — editorial framing, not derived from per-axis table

Primary recommendation: ElevenLabs v3 for podcast narration, Cartesia Sonic 3.5 for meditation.

These are editorial calibrations based on the per-axis tables plus this auditor's read of the documented controls and observed behavior — they are not derived measurements.

Rationale:

  • Podcast naturalness needs the paralinguistic surface; v3's bracketed-emotion + caps + ellipsis controls map closely to the project's 10-axis list. Cartesia documents the gap explicitly. OpenAI requires splitting utterances.
  • Caveat 1 (v3 reliability): 1-in-6 tag delivery failures [citation]; cap generation at 800–900 chars.
  • Caveat 2 (v3 PVC regression): use Instant Voice Clone.
  • Meditation prefers calm, consistent prosody and slow pacing; Cartesia Sonic 3.5 is the current candidate on SSML speed control + low latency. Caveat: the case for Sonic 3.5 in offline meditation rendering rests on documented SSML speed control, not on naturalness measured by this project; it is an argument about writability, not a head-to-head meditation MOS.
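The "cap generation at 800–900 chars" caveat can be handled before the script reaches the engine. The following is a minimal sketch, assuming a greedy sentence-packing strategy; `chunk_script` and its 900-character default are illustrative names and values, not part of any vendor SDK.

```python
import re


def chunk_script(text: str, max_chars: int = 900) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single over-long sentence is kept whole rather than cut mid-word.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries keeps bracket tags attached to the sentence they modify; whether chunk boundaries introduce audible prosody resets is something to check by ear per engine.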

Gemini 3.1 Flash TTS now belongs in this comparison and is covered in the sibling gemini.md audit. The omission of Gemini from v1 of this audit was the largest single scope error.

If forced to pick one: ElevenLabs v3 (editorial preference, not derived from measurement).

6.2 Google Gemini family

Scope: Gemini 2.5 Pro TTS, Gemini 2.5 Flash TTS, Gemini 3.1 Flash TTS, and the NotebookLM Audio Overview pipeline. Audit date: 2026-06-03 (v2). All quantitative claims are tagged [citation] (lifted from a primary Google or arXiv source), [vendor] (Google marketing claim, no third-party verification), [third-party] (independent benchmark / blog), or [inferred] / [project-observed].

Key takeaways

  • Three Gemini-TTS model variants are live in June 2026, all "preview": gemini-2.5-pro-preview-tts, gemini-2.5-flash-preview-tts, and gemini-3.1-flash-tts-preview. 3.1 Flash TTS launched 2026-04-15 and is reported on Artificial Analysis Speech Arena (Elo ~1,206–1,211, #2 globally behind Inworld TTS 1.5 Max). The Elo numbers are sourced to X tweets (@ArtificialAnlys 2044450045190418673) and a Speechify announcement; they have not been independently verified against the primary Artificial Analysis HF Space in this pass. 3.1 Flash TTS is exposed via Vertex AI's global endpoint only.
  • All three share the same control surface: response_modalities=["AUDIO"] on client.models.generate_content. Output is 24 kHz mono 16-bit PCM, raw, base64-encoded. No streaming on the Vertex AI generative path [citation].
  • 30 prebuilt voices, named after stars / mythological figures [citation].
  • Paralinguistic control via bracket tags ("audio cues" / "audio tags"): docs list ~16 emotion/state tags plus pacing/non-verbal examples and free-form style directions. Google's 3.1 Flash TTS marketing claims "200+ audio tags" but does not publish a full enumeration as of 2026-06 [vendor; project-verified by doc search].
  • Context caching is NOT supported on any TTS variant as of June 2026 — caching: Not supported is on every TTS model card [citation].
  • NotebookLM Audio Overview does NOT use Gemini-TTS. DeepMind's own blog describes the NotebookLM-related work as "a hierarchical Transformer over ultra-compressed audio codec tokens at ~600 bps", without explicitly naming SoundStorm. The "SoundStorm-derived" framing is widely-suspected in community discussion (Google forum thread #44776) — this audit now consistently labels it as widely-suspected rather than confirmed throughout.
  • Known limitations (project-observed and verified) — see §G for the full table with N caveats.
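Because the output described above is raw 24 kHz mono 16-bit PCM, it needs a container header before most players will open it. A minimal sketch using only the Python standard library follows; the function names are illustrative, the SDK call that produces the payload is deliberately not shown, and some client libraries may already hand back decoded bytes, in which case the base64 step is unnecessary.

```python
import base64
import io
import wave


def decode_audio_payload(payload_b64: str) -> bytes:
    """Decode a base64 string into raw PCM bytes."""
    return base64.b64decode(payload_b64)


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24_000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM in a WAV container (defaults: 24 kHz, mono, 16-bit)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample: 2 == 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# Example with one second of silence standing in for a real response payload.
silence = b"\x00\x00" * 24_000
wav_bytes = pcm_to_wav_bytes(decode_audio_payload(base64.b64encode(silence).decode()))
```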

A. Architecture — Public lineage

Gemini-TTS is the production endpoint of a five-year Google DeepMind audio-LM program:

| # | Year | System | Role | Citation |
|---|------|--------|------|----------|
| 1 | 2022 | AudioLM | Audio as discrete-token LM. | arXiv:2209.03143 |
| 2 | 2023-02 | SPEAR-TTS | Text→semantic translation. | arXiv:2302.03540 |
| 3 | 2023-05 | SoundStorm | MaskGIT-style parallel decoding over RVQ. Widely suspected (not vendor-confirmed) to be the engine behind NotebookLM Audio Overview [community/forum sourced]. | arXiv:2305.09636 |
| 4 | 2023-06 | AudioPaLM | Interleaved text + audio tokens in one decoder. | arXiv:2306.12925 |
| 5 | 2024-02 | Spirit-LM | Meta's independent confirmation of the interleaving approach. | arXiv:2402.05755 |
| 6 | 2024-09 | DeepMind "Pushing the frontiers" blog | NotebookLM Audio Overview: "hierarchical Transformer over ultra-compressed audio codec tokens at ~600 bps". <3 s for 2-min dialogue on TPU v5e. DeepMind does not name SoundStorm in this post. | |
| 7 | 2025-05 | Gemini 2.5 Flash/Pro Preview TTS | First public AudioPaLM-style endpoint. | blog.google |
| 8 | 2026-04-15 | Gemini 3.1 Flash TTS | Same surface; "200+ audio tags" claim. | blog.google + Hassabis tweet |

Additional references:

  • VALL-E (Wang et al. 2023, arXiv 2301.02111) — parallel-track development to AudioLM/SPEAR-TTS.
  • NaturalSpeech 3 (Ju et al. 2024) — "near-human MOS" comparator.
  • EnCodec (Défossez et al. 2022) and SoundStream (Zeghidour et al. 2021) — the neural codecs the AudioLM line uses.
  • MaskGIT (Chang et al. 2022) — SoundStorm's parallel-decoding scheme.
  • Whisper (Radford et al. 2022) — for the eval pipelines.

Synthesis: Gemini-TTS is the same multimodal transformer that handles text/image/video, with an output head trained to emit codec tokens. This is the AudioPaLM blueprint, productionized. Paralinguistic tags are just text tokens in the prompt; there is no separate "laughter module."

NotebookLM is architecturally distinct: a SoundStorm-like hierarchical-Transformer stack [widely suspected; not vendor-named] with a Gemini script-writer in front. The dialogue model is not exposed via any API as of 2026-06.

B. Voice Catalog (full, with descriptors)

From ai.google.dev/gemini-api/docs/speech-generation:

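| Voice | Descriptor | Voice | Descriptor | Voice | Descriptor |
|-------|------------|-------|------------|-------|------------|
| Zephyr | Bright | Algieba | Smooth | Gacrux | Mature |
| Puck | Upbeat (default) | Despina | Smooth | Pulcherrima | Forward |
| Charon | Informative | Erinome | Clear | Achird | Friendly |
| Kore | Firm | Algenib | Gravelly | Zubenelgenubi | Casual |
| Fenrir | Excitable | Rasalgethi | Informative | Vindemiatrix | Gentle |
| Leda | Youthful | Laomedeia | Upbeat | Sadachbia | Lively |
| Orus | Firm | Achernar | Soft | Sadaltager | Knowledgeable |
| Aoede | Breezy | Alnilam | Firm | Sulafat | Warm |
| Callirrhoe | Easy-going | Schedar | Even | | |
| Autonoe | Bright | | | | |
| Enceladus | Breathy | | | | |
| Iapetus | Clear | | | | |
| Umbriel | Easy-going | | | | |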

Practical bucketing:

  • Meditation: Aoede, Achernar, Vindemiatrix, Sulafat, Enceladus.
  • Podcast: Puck × Kore or Charon × Aoede.
  • Deep-dive: Charon, Rasalgethi, Sadaltager.
  • Comedy news: Fenrir, Puck, Pulcherrima, Sadachbia.
  • Song/lyric: Sulafat, Achernar, Algieba.

C. Paralinguistic tag support across the 10 axes

| # | Axis | Documented? | Example tag | 2.5 Pro TTS | 2.5 Flash TTS | 3.1 Flash TTS |
|---|------|-------------|-------------|-------------|---------------|---------------|
| 1 | Vowel elongation (Soooo) | Yes (by example) | "Beauuutiful morning" | Honored [project-observed, mild, N=small] | Honored | Honored well |
| 2 | NV ([laughs], [sighs], [gasp], [cough]) | Yes (explicit list) | [laughs], etc. | Honored, conservative | Honored | Honored, more dynamic |
| 3 | Filled pauses (uh, um) | No (not documented as tag) | (must be inline) | Literal transliteration | Same | More naturalistic |
| 4 | Discourse markers (so, well, yeah) | No (inline) | (no bracket form) | Honored as text | Honored | Honored, better separation |
| 5 | Conversational fillers (like, you know) | No | (inline) | Inline | Inline | Inline, better timing |
| 6 | Em-dash pivots | No (explicitly absent from docs) | None | Em-dash → audible pause only | Same | Better: actual false-start retake |
| 7 | Emotion / delivery | Yes (full taxonomy) | [curious], [whispers], free-form | Honored, narrow | Honored, mid | Honored, widest range |
| 8 | Pacing tags | Partial | [short pause], [very fast], [very slow]; `[PAUSE=2s]` NOT documented | [short pause] ok; long pauses unreliable | Same | [short pause]+[very slow] reliable; timed-pause syntax inconsistent |
| 9 | CAPS / *asterisks* emphasis | No (not documented) | (no directive) | Mild | Same | Better |
| 10 | Audible reactions (oh!, wow) | No (inline) | (inline) | Honored as text | Honored | Honored, more natural |

Coverage caveat (v2): Project-observed columns are based on a small ad-hoc N — sample size is not disclosed in v1, and remains undisclosed here pending a formal harness count. Treat the "honored" / "honored, better" entries as taste impressions, not measured comparisons.

The conclusion "all paralinguistic naturalness comes from how the script is written, not from tags" is overstated. The table shows that emotion/delivery and non-verbal vocalizations are native in Gemini's tag surface; the bracket text is more than a writing-convention hint, it is a documented control. The correct framing: most of the 10 axes are not enumerated as bracket tags, so writing carries the load; but where tags exist they are real controls, not just "more text."

Important undocumented behavior [project-observed]: all three Gemini TTS models accept arbitrary bracket text as a style direction. This generalizes the "200+ tags" marketing claim.

D. Context Caching

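| Item | 2.5 Pro TTS | 2.5 Flash TTS | 3.1 Flash TTS | 2.5 Pro text-gen |
|------|-------------|---------------|---------------|------------------|
| Caching supported? | No | No | No | Yes |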

F. Naturalness Benchmarks

Artificial Analysis Speech Arena (mid-2026)
| Rank | System | Elo |
|------|--------|-----|
| 1 | Inworld TTS 1.5 Max | 1,208–1,211 |
| 2 | Gemini 3.1 Flash TTS | 1,206–1,211 |
| 3 | Speechify SIMBA 3.0 | (top-10, exact n/a) |
| 4 | ElevenLabs Eleven v3 | 1,178 |

Sources are X tweets and Speechify's own announcement — not independently verified against the primary Artificial Analysis HF Space leaderboard in this pass. Treat as widely-reported, not independently confirmed.
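To put the Elo gaps in perspective, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the conventional logistic Elo formula with a 400-point scale; whether the Artificial Analysis arena uses exactly this formulation was not verified in this pass, so treat the output as an order-of-magnitude reading only.

```python
def expected_win_rate(elo_a: float, elo_b: float, scale: float = 400.0) -> float:
    """Expected share of pairwise votes for A over B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / scale))


# Gemini 3.1 Flash TTS (low end of its reported range) vs ElevenLabs Eleven v3.
gemini_vs_eleven = expected_win_rate(1206, 1178)  # roughly 0.54
```

Under that assumption, a 28-point gap corresponds to listeners preferring the higher-rated system only slightly more than half the time, which is consistent with treating the top of the board as closely bunched.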

MOS / TTS Arena scores
  • Google has not published a MOS for Gemini 3.1 Flash TTS.
  • TTS Arena (HF) does not include Gemini 3.1 Flash TTS as of 2026-06.

G. Known limitations — verified against public docs

| # | Project observation | Verified? | Notes |
|---|---------------------|-----------|-------|
| 1 | 2.5 Pro TTS stretched a meditation script to 10m 46s vs 3.1 Flash 3m 49s (2.8×) on one script, N=1 | Yes (project-measured, N=1) | N=1 single script. Cannot be propagated as a generalizable model characterization. Consistent with Google's positioning of 2.5 Pro TTS for "long-form / professional narration." Always probe with a 30 s sample before committing. |
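The 2.8× figure is simple duration arithmetic on the two render lengths. A short sketch, with `to_seconds` as an illustrative helper name:

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'M:SS' duration string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


pro_seconds = to_seconds("10:46")    # 646 s for 2.5 Pro TTS
flash_seconds = to_seconds("3:49")   # 229 s for 3.1 Flash TTS
expansion_ratio = pro_seconds / flash_seconds  # about 2.82
```

The same helper can be reused when probing a 30 s sample: compare the probe's rendered duration against the expected duration before committing to a full render.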

H. NotebookLM Audio Overview — what's public

Architecture (verified from DeepMind blog + community forum)
  1. Script generation: Gemini (originally 1.5 Pro at launch) ingests user sources and writes a two-speaker dialogue script.
  2. Speech generation: the script is passed to a SoundStorm-like dialogue model (hierarchical Transformer over ~600 bps codec). "SoundStorm-derived" remains community speculation; DeepMind's own blog describes the architecture without naming SoundStorm. This audit consistently labels it widely-suspected throughout; v1 oscillated between "widely suspected" and "this is the engine."
  3. Output: monolithic stereo MP3 download. No API access to the dialogue model itself.
Publicly-known prompt conventions (Hennig Claude-assisted reverse-engineering [third-party; v1 elsewhere mis-described as "leaked"])

Format options (Google UI presets):

  • Deep Dive, Brief, Critique, Debate, Lecture.

Deep Dive conventions:

| Section | Convention | Verbatim phrases observed |
|---------|------------|---------------------------|
| Opening | | "Hey everyone, welcome back" |
| Host roles | Asker + Explainer dyad | |
| Affirmation glue | | "Right." "Exactly." "Absolutely." |
| Transitions | Rhetorical questions | |
| Conversational fillers | | "You know," "I mean," "sort of," "like" |
| Validation | Reciprocal praise | |
| Content presentation | Analogy + chunking | |
| Wrap-up | | "So as we wrap things up…" |

What's NOT in NotebookLM scripts (project's audit of ~10 hours of NotebookLM output):

  • Explicit bracket tags — the dialogue model produces natural laughter/breath without script tags.
  • Vowel elongation in print.
  • [PAUSE=2s] markers.

Comparison Table — Gemini 2.5 Pro TTS vs 3.1 Flash TTS across the 10 axes

| # | Axis | 2.5 Pro Preview TTS | 3.1 Flash TTS | Winner |
|---|------|---------------------|---------------|--------|
| 1 | Vowel elongation | Honored, mild | Honored, full | 3.1 Flash |
| 2 | NV [laughs] etc. | Honored, conservative | Honored, most expressive | 3.1 Flash (caveat: 2.5 Pro is calmer for meditation) |
| 3 | Filled pauses | Mechanical | Naturalistic | 3.1 Flash |
| 4 | Discourse markers | Honored | Honored, better separation | 3.1 Flash |
| 5 | Conversational fillers | Inline neutral | Inline better timing | 3.1 Flash |
| 6 | Em-dash pivots | Audible pause only | Actual false-start | 3.1 Flash |
| 7 | Emotion / delivery | Narrow | Widest range | 3.1 Flash |
| 8 | Pacing tags | Expansive (2.8× N=1 observation) | More predictable | 3.1 Flash prod; 2.5 Pro meditation (N=1) |
| 9 | CAPS / *asterisks* | Mild | Clear | 3.1 Flash |
| 10 | Audible reactions | Neutral | More natural | 3.1 Flash |

Net (v2): The "3.1 Flash dominates 9 of 10 axes" framing is based on small-N project-observed cells. Per-cell sample sizes are not disclosed. Treat as taste impressions until a formal benchmark is run.

Recommendation by content type — editorial calibration, not derived measurement

| Content type | Recommended | Voice(s) | Guidance |
|--------------|-------------|----------|----------|
| Podcast (two-host) | 3.1 Flash TTS | Puck × Kore or Charon × Aoede | Multi-speaker config; cache system prompt at the application layer (Gemini does not cache TTS calls). |
| Meditation | 3.1 Flash TTS with [very slow] + [short pause] + ellipses, OR 2.5 Pro Preview TTS if the N=1 duration-expansion observation is desired | Aoede, Achernar, Vindemiatrix, Sulafat | Chunk ≤9 min; render 30 s probe first. |
| Deep-dive | 3.1 Flash TTS | Charon, Rasalgethi | CAPS for key terms, [short pause] before definitions. |
| Comedy news | 3.1 Flash TTS | Fenrir, Puck, Pulcherrima | High NV-tag density. |
| Song / lyric | 3.1 Flash TTS (with reservations) | Sulafat, Algieba | Not a singing model. For real song synthesis use Suno / Udio. |
| NotebookLM-style | NotebookLM web UI (no API), or 3.1 Flash multi-speaker imitating Hennig's reverse-engineered conventions | Puck × Kore | |

6.3 Open-source TTS — Kokoro / Fish / CosyVoice / Higgs / …

The ten paralinguistic axes referenced below are the ones the broader report tracks: laugh, sigh, breath, pause, whisper, emphasis, pitch shift, rate change, emotion, non-speech sound.

1. Kokoro-82M

A. Architecture. StyleTTS 2 backbone + ISTFTNet vocoder, decoder + vocoder only [HF card]. 82M parameters.

B. Training corpus. "A few hundred hours" of strictly permissive audio. 8 languages, 54 voices in v1.0.

C. Paralinguistic tag support. Effectively none out-of-the-box. Control surface is voice + phonemes. Score: 0/10 axes natively.

E. License. Apache 2.0 — fully commercial.

F. Naturalness vs SOTA closed. TTS Arena rank 32, ELO 1056.2 [secondary: offlinetts-2026 — single secondary blog source]. Closed reference points reported by the same blog: ElevenLabs v3 ~1178, Gemini 3.1 Flash TTS ~1206.

G. Recent papers. StyleTTS 2 arXiv:2306.07691; ISTFTNet arXiv:2203.02395.

2. Fish-Speech S2 / S2 Pro

A. Architecture. "Dual-AR" (4B slow AR + 400M fast AR), ~4.4B total params.

B. Training corpus. "Over 10 million hours" / "80+ languages" / "15,000+ Unique Tags" [Fish blog].

C. Paralinguistic tag support. Strong and explicit. Demonstrated tags: [laugh], [whispers], [super happy], [whisper in small voice], [professional broadcast tone], [pitch up]. Free-form textual descriptions also accepted. Score: ~7-8/10 axes.

F. Naturalness vs SOTA closed. S2 Pro rank 11, ELO ~1128 [MarkTechPost 2026 / offlinetts — both secondary, citation-chain risk].

3. XTTS-v2 (Coqui)

A. Architecture. Tortoise-derived AR transformer + GPT-2-class LM, ~460M params [inferred].

B. Training corpus. 17 languages [XTTS paper abstract].

C. Paralinguistic tag support. None native. Emotion via reference audio only. Score: 0/10 axes via tags.

E. License. CPML — non-commercial. Coqui shut down January 2024.

F. Naturalness vs SOTA closed. TTS Arena rank 66, ELO 885.9 [offlinetts-2026].

G. Recent papers. XTTS arXiv:2406.04904.

4. CosyVoice 2 (Alibaba FunAudioLLM)

A. Architecture. ~0.5B-parameter LM backbone over FSQ semantic tokens, chunk-aware causal flow-matching decoder [CosyVoice 2 tech report, arXiv:2412.10117].

B. Training corpus. 9 languages + 18 Chinese dialects. Trained with instruction-style natural-language prompts embedded at any position in the input sequence.

C. Paralinguistic tag support. Strong, instruction-tuned. CosyVoice-Instruct supports "speaker identity, speaking style (including emotion, gender, speaking rate, and pitch), and fine-grained paralinguistic features, including the ability to insert laughter, breaths, speaking while laughing, and emphasize certain words" [search result, FunAudioLLM]. Score: ~7-8/10 axes. A DPO-tuned variant is reported to hit 3.46 PMOS — uncited beyond "search result"; UNVERIFIED.

E. License. Apache 2.0 — commercial OK.

F. Naturalness vs SOTA closed. Not on the offlinetts top-of-board cut. Widely reported best-in-class for multilingual ZH/EN [TTS.ai 2026]. Likely ELO ~1080-1110 [inferred].

5. F5-TTS (SWivid / SJTU + PKU)

A. Architecture. Fully non-AR flow-matching with Diffusion Transformer (DiT) + ConvNeXt V2 [F5-TTS paper, arXiv:2410.06885]. ~330M params [inferred]. RTF 0.15.

B. Training corpus. ~100K hours multilingual Emilia dataset, ~95K hours of EN+ZH after filtering.

C. Paralinguistic tag support. None native. Zero-shot voice clone over text. Score: 0/10 axes via tags.

E. License. Code MIT; weights CC-BY-NC (Emilia provenance).

F. Naturalness vs SOTA closed. Probably ELO ~1020-1060 [inferred].

G. Recent papers. F5-TTS arXiv:2410.06885. Kwon et al. Interspeech 2025 (PEFT for low-resource TTS).

6. Sesame CSM (Conversational Speech Model)

A. Architecture. Llama-style transformer (1B) + 100M Llama-variant audio decoder producing Mimi RVQ codes.

B. Training corpus. Conversational/dialog-trained, English-dominant.

C. Paralinguistic tag support. Speaker IDs only ([0], [1]). Score: 1/10 axes natively.

E. License. Apache 2.0.

F. Naturalness vs SOTA closed. Not on offlinetts cut. Likely ELO ~1040-1080 [inferred — propagated through comparison table as if measured; flagged as inferred].

G. Recent papers. No public CSM technical report. Mimi codec primary: Kyutai Moshi paper, arXiv:2410.00037.

7. Higgs Audio v2 (Boson AI)

A. Architecture. Unified audio tokenizer + DualFFN architecture. Parameter count correction (v2): The HF model card describes Higgs Audio v2 as Llama-3.2-3B base (the LLM backbone) + ~2.2B DualFFN adapter, totaling ~5.8B parameters at inference. The v1 audit's "3B base / 1B v2.5 condensed" framing was the LLM-only count and omitted the DualFFN.

The operational model relevant to inference is ~5.8B parameters total (Llama-3.2-3B + DualFFN ~2.2B). The condensed v2.5 variant remains 1B-class on the LLM backbone but should also be reported with its operational total.

Also: v1 framed itself as "correcting a ~10B figure" — but no ~10B claim appears in this audit document. If v1 was correcting an external/prior audit's 10B figure, that should have been named; otherwise the correction reads as straw-manning. v2 names the source of the previous-version framing or removes it.

B. Training corpus. "Over 10 million hours" via AudioVerse [Boson blog].

C. Paralinguistic tag support. Strong but underdocumented. Higgs v2 wins 75.7% over GPT-4o-mini-tts on the EmergentTTS-Eval Emotions task [Boson blog; vendor self-report, not externally verified]. Discrete tag protocol not enumerated in public docs. Score: ~5-7/10 axes, mostly via reference audio and prompt phrasing.

E. License. Apache 2.0 (with third-party xcodec attributions).

F. Naturalness vs SOTA closed. Probably ELO ~1090-1130 [inferred from EmergentTTS-Eval position; the 1B v2.5 claim of exceeding the 3B base is vendor self-report].

G. Recent papers. No formal arXiv paper for v2. xcodec dependency: arXiv:2408.17175 [inferred].

8. Bark (Suno)

A. Architecture. GPT-style AR over EnCodec, AudioLM / VALL-E lineage. ~1B total [inferred].

B. Training corpus. Not disclosed.

C. Paralinguistic tag support. The OG of open-source paralinguistic tags. [laughter], [laughs], [sighs], [gasps], [clears throat], [music], etc. Score: 6/10 axes natively. Quality of tag rendering inconsistent.

E. License. MIT.

F. Naturalness vs SOTA closed. Probably ELO < 950 [inferred].

G. Recent papers. No Bark paper. Lineage references:

  • AudioLM (Borsos et al. 2022, arXiv:2209.03143)
  • VALL-E (Wang et al. 2023, arXiv:2301.02111)
  • EnCodec (Défossez et al. 2022, arXiv:2210.13438) — cited only as parenthetical in v1; added as standalone reference.

Comparison Table

ELO column note: the table mixes measured numbers (Kokoro, XTTS) with inferred ranges (CosyVoice 2, Sesame, Higgs, Bark, F5). The reader should not read all rows as comparable. A future version of this table should split into two columns — "measured" and "inferred" — to avoid this conflation.
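One way to carry the measured/inferred split through to any future table is to store the two kinds of value in separate fields. The sketch below is an assumption about how that could be structured, not the project's actual data model; the numbers are the ones quoted in the per-model sections above, and `EloEntry` and `format_row` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EloEntry:
    model: str
    measured: Optional[float] = None  # point value reported by a leaderboard source
    inferred: Optional[Tuple[Optional[float], Optional[float]]] = None  # (low, high) estimate
    source: str = ""


ENTRIES = [
    EloEntry("Kokoro-82M", measured=1056.2, source="offlinetts-2026 (secondary)"),
    EloEntry("XTTS-v2", measured=885.9, source="offlinetts-2026 (secondary)"),
    EloEntry("CosyVoice 2", inferred=(1080, 1110)),
    EloEntry("F5-TTS", inferred=(1020, 1060)),
    EloEntry("Sesame CSM", inferred=(1040, 1080)),
    EloEntry("Higgs Audio v2", inferred=(1090, 1130)),
    EloEntry("Bark", inferred=(None, 950)),  # open lower bound: "< 950"
]


def format_row(entry: EloEntry) -> Tuple[str, str, str]:
    """Return (model, measured column, inferred column) as display strings."""
    measured = f"{entry.measured:.1f}" if entry.measured is not None else "n/a"
    if entry.inferred is None:
        inferred = "n/a"
    else:
        low, high = entry.inferred
        inferred = f"<{high:g}" if low is None else f"{low:g}-{high:g}"
    return (entry.model, measured, inferred)


rows = [format_row(entry) for entry in ENTRIES]
```

Keeping the fields separate makes it impossible to sort or rank on a mixed column by accident.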

Decision rule

Citations

  1. Kokoro-82M HF model card.
  2. Kokoro GitHub.
  3. StyleTTS 2 paper, arXiv:2306.07691.
  4. ISTFTNet paper, arXiv:2203.02395.
  5. Fish Audio S2 open-source blog.
  6. Fish Speech GitHub.
  7. Fish S2 HF model card.
  8. idiap/coqui-ai-TTS (XTTS fork).
  9. XTTS paper, arXiv:2406.04904.
  10. Coqui XTTS-v2 HF.
  11. Coqui XTTS-v2 license discussion.
  12. Coqui CPML analysis.
  13. CosyVoice GitHub.
  14. CosyVoice 2 tech report, arXiv:2412.10117.
  15. CosyVoice original, arXiv:2407.05407.
  16. UtterTune-CosyVoice2-ja-JSUTJVS on HF.
  17. F5-TTS GitHub.
  18. F5-TTS paper, arXiv:2410.06885.
  19. F5-TTS discussion #57.
  20. F5-TTS licensing discussion #997.
  21. PEFT for low-resource TTS (Kwon et al. Interspeech 2025).
  22. Sesame CSM GitHub.
  23. CSM-1B HF model card.
  24. Higgs Audio v2 GitHub.
  25. Higgs Audio v2 blog (Boson AI) — vendor self-report on EmergentTTS-Eval.
  26. Higgs Audio v2 HF Transformers docs.
  27. Bark GitHub.
  28. Bark training code issue #131.
  29. TTS Arena Legacy HF Space.
  30. TTS Arena v2 leaderboard.
  31. TTS Arena Leaderboard 2026 (offlinetts — secondary; ELOs flagged accordingly).
  32. Best TTS Models 2026 benchmark (MarkTechPost — secondary; citation-chain risk with offlinetts).
  33. Kokoro TTS review 2026 (TextToLab).
  34. Modal: Top Open-Source TTS Models.

Additional references:

  1. AudioLM (Borsos et al. 2022), arXiv:2209.03143.
  2. VALL-E (Wang et al. 2023), arXiv:2301.02111.
  3. EnCodec (Défossez et al. 2022), arXiv:2210.13438.
  4. Mimi codec (Kyutai Moshi 2024), arXiv:2410.00037.
  5. Tortoise / OpenVoice (one of these belongs for historical context).

7. Empirical ablation — exploratory script-markup sensitivity test

We tested how script-side paralinguistic markup affects perceived naturalness under a frontier TTS engine. The corrected N=10 result is exploratory: aggregate differences are small, no per-axis CI excludes zero, and both judges are Gemini-family LLMs rather than human listeners.

Corrected paired MOS by stripped axis

Each line is one script under the Pro judge. Per-panel summary values (y-axis 1 to 5):

| Axis | Original | Axis stripped |
|------|----------|---------------|
| Vowel elongation | 4.65 | 4.65 |
| Non-verbal vocal. | 4.65 | 4.55 |
| Filled pauses | 4.65 | 4.55 |
| Discourse markers | 4.65 | 4.42 |
| Conv. fillers | 4.65 | 4.30 |
| Mid-thought pivots | 4.65 | 4.45 |
| Emotion / delivery | 4.65 | 4.38 |
| Pacing tags | 4.65 | 4.70 |
| In-text emphasis | 4.65 | 4.38 |
| Audible reactions | 4.65 | 4.78 |

Figure 4. Corrected N=10 paired-script MOS, original vs stripped, per axis. Use as diagnostic visualization, not confirmatory evidence.

Per-axis MOS delta with bootstrap CI

Negative means the stripped variant scored lower than original.

[Figure: bar chart of MOS delta when each axis is stripped (x-axis from -1.0 to +0.5), one bar per axis.]

Figure 5. Per-axis MOS delta with 95% bootstrap CI. In the corrected N=10 summary, no per-axis confidence interval excludes zero.

Judge sensitivity: per-axis deltas and score compression

The right panel shows how often each judge used each overall score.

[Figure: two panels comparing the Pro judge and the Flash judge. Left: per-axis delta (x-axis from -0.5 to +0.5) for the ten axes. Right: overall score distribution across scores 1 to 5.]

Figure 6. Cross-judge comparison. Pro and Flash agree that aggregate script-stripping effects are small, but per-axis attribution is unstable.

What we ran

  • Initial Phase E: 5 g25-full production scripts × 12 variants = 60 audio renders.
  • Corrected Phase S: N=10 scripts for the aggregate and per-axis summary used in the report headline.
  • TTS: Gemini 2.5 Pro TTS, voice Kore.
  • Judges: Gemini 2.5 Pro audio judge and Gemini 2.5 Flash audio judge.
  • Rubric: four 1-5 sub-dimensions: prosody_naturalness, pacing_naturalness, paralinguistic_realism, overall_listenability.
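The per-axis variants are produced by stripper functions that remove one kind of markup at a time. The project's actual strippers are not reproduced in this report; the following is a minimal sketch of what two of them might look like, with hypothetical names, regexes, and tag set.

```python
import re

# Example non-verbal tags taken from the documented list in section 6.2 C.
NONVERBAL_TAGS = {"laughs", "sighs", "gasp", "cough"}


def strip_tags(script: str, tags: set[str]) -> str:
    """Remove only the bracket tags named in `tags`; leave other tags intact."""
    def _drop(match: re.Match) -> str:
        return "" if match.group(1).strip().lower() in tags else match.group(0)

    stripped = re.sub(r"\[([^\[\]]+)\]", _drop, script)
    # Tidy the double spaces left behind by removed tags.
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()


def strip_vowel_elongation(script: str) -> str:
    """Collapse runs of 3+ identical vowels to one (Soooo -> So).

    Known limitation of this sketch: words whose plain spelling has a
    doubled vowel (e.g. an elongated "cool") are over-collapsed.
    """
    return re.sub(r"([aeiouAEIOU])\1{2,}", r"\1", script)
```

For example, `strip_tags("[laughs] Well, that was [short pause] unexpected.", NONVERBAL_TAGS)` removes `[laughs]` but keeps the pacing tag, which is what keeps each stripped variant attributable to a single axis.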

Headline, corrected at N=10

In this exploratory N=10 run:

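| Quantity | Phase E (n=5) | Phase S (n=10, corrected) |
|----------|---------------|---------------------------|
| Mean MOS, original | 4.75 | 4.65 |
| Mean MOS, all-stripped | 4.50 | 4.30 |
| Original minus all-stripped gap | 0.25 MOS | 0.35 MOS |
| Cronbach's alpha | 0.967 | 0.962 |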

The 0.35 MOS aggregate gap is small relative to the 1-5 scale and should not be converted into a "% naturalness" claim. MOS has no meaningful zero point, and the judges are not human listeners.
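For readers unfamiliar with the reliability statistic in the table, the sketch below computes Cronbach's alpha across rubric sub-dimensions using the standard formula with sample variances. The toy scores are made up for illustration and are not project data; the exact implementation used in the harness is not shown in this report.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """items[i][j] is the score of render j on sub-dimension i.

    alpha = k/(k-1) * (1 - sum of item variances / variance of per-render totals)
    """
    k = len(items)
    totals = [sum(per_render) for per_render in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy example: four sub-dimensions that move in lockstep across five renders.
toy_scores = [[1, 2, 3, 4, 5]] * 4
toy_alpha = cronbach_alpha(toy_scores)  # perfectly consistent items give 1.0
```

A high alpha means the four sub-dimensions move together; it says nothing about whether the judges track what human listeners hear.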

Per-axis result: hypotheses, not findings

Per-axis effect ranking at N=10 (mean Δ MOS when axis is stripped):

| Axis | Δ MOS | 95% CI | p |
|------|-------|--------|---|
| 05 Conversational fillers | -0.35 | [-1.00, +0.25] | 0.272 |
| 07 Emotion / delivery tags | -0.28 | [-0.62, +0.03] | 0.128 |
| 09 In-text emphasis | -0.28 | [-0.60, +0.00] | 0.086 |
| 04 Discourse markers | -0.23 | [-0.62, +0.20] | 0.289 |
| 06 Mid-thought pivots | -0.20 | [-0.70, +0.35] | 0.464 |
| 02 Non-verbal vocal. | -0.10 | [-0.47, +0.30] | 0.641 |
| 03 Filled pauses | -0.10 | [-0.53, +0.30] | 0.641 |
| 01 Vowel elongation | +0.00 | [-0.30, +0.35] | 1.000 |
| 08 Pacing tags | +0.05 | [-0.40, +0.50] | 0.834 |
| 10 Audible reactions | +0.12 | [-0.23, +0.53] | 0.555 |

No per-axis confidence interval excludes zero. These rankings are candidate priorities for a better-powered study, not confirmed causal effects.
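The intervals above are percentile bootstrap CIs on the mean paired delta. A minimal sketch of that procedure follows, assuming per-script deltas (stripped minus original) as input; the resample count, seed, and function name are illustrative choices, and the example deltas are hypothetical rather than the project's per-script values.

```python
import random
from statistics import mean


def bootstrap_ci(deltas: list[float], n_resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the scripts with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    low_index = int((1 - level) / 2 * n_resamples)
    high_index = max(low_index, int((1 + level) / 2 * n_resamples) - 1)
    return means[low_index], means[high_index]


# Hypothetical per-script deltas for one axis (N=10), for illustration only.
example_deltas = [-0.5, 0.0, 0.5, -1.0, 0.25, 0.0, -0.25, 0.5, -0.5, 0.0]
low, high = bootstrap_ci(example_deltas, n_resamples=2_000)
axis_is_inconclusive = low <= 0.0 <= high
```

With only ten scripts the resampled means vary widely, which is why intervals this wide are expected at N=10 even when a real effect exists.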

Cross-judge result

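| Judge | mean(original) | mean(all-stripped) | gap | score distribution (overall) |
|-------|----------------|--------------------|-----|------------------------------|
| Gemini 2.5 Pro | 4.75 | 4.50 | 0.25 MOS | 18x4, 42x5 |
| Gemini 2.5 Flash | 3.85 | 3.80 | 0.05 MOS | 3x3, 54x4, 3x5 |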

The Pro judge shows a ceiling-heavy distribution and is same-family with the TTS engine. Flash is still not an independent human listener, but it is less ceiling-saturated. The correct reading is: Gemini-family LLM judges saw small aggregate differences, and the per-axis attribution is unstable.
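The ceiling effect can be read directly off the score distributions in the table. The sketch below computes the share of renders at the top score and the pooled mean of the overall score; the function name is illustrative, and the pooled means cover all 60 renders, so they need not match the per-condition means reported in the table.

```python
def summarize(histogram: dict[int, int], top_score: int = 5) -> tuple[float, float]:
    """Return (pooled mean score, share of renders at the top score)."""
    n = sum(histogram.values())
    pooled_mean = sum(score * count for score, count in histogram.items()) / n
    ceiling_share = histogram.get(top_score, 0) / n
    return pooled_mean, ceiling_share


pro_hist = {4: 18, 5: 42}           # Gemini 2.5 Pro judge: 18x4, 42x5
flash_hist = {3: 3, 4: 54, 5: 3}    # Gemini 2.5 Flash judge: 3x3, 54x4, 3x5

pro_mean, pro_ceiling = summarize(pro_hist)        # 70% of renders at 5
flash_mean, flash_ceiling = summarize(flash_hist)  # 5% of renders at 5
```

When 70% of renders already sit at the maximum, the scale has little room left to register an improvement, which limits what the Pro judge's gaps can show.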

What changed from the original Phase E interpretation

  • The earlier N=5 axis ranking was a small-sample artifact and should not drive conclusions.
  • Per-axis confirmation language has been removed. Nothing at the per-axis level is confirmed at N=10.
  • Phase L showed only 4 of 10 strippers map cleanly onto lexical signals in real ASR transcripts; the rest are interface or typography conventions.
  • Later validation showed engine-lane differences can exceed the script-stripping delta.

Limitations

  • LLM judges, not human listeners.
  • Same provider/model-family bias remains possible for both judges.
  • 5-point MOS is ceiling-clipped for strong TTS.
  • N=10 is exploratory.
  • Scripts were selected from a paralinguistically rich production set, not a pre-registered random corpus.
  • TTS model versions are moving targets.

Next experiment

Run a blinded human A/B or MUSHRA-style panel with balanced genres, randomized order, same scripts, multiple engines, and pre-registered analysis. That is the first experiment that can support external naturalness claims.

8. Discussion & decision matrix

The evidence supports a product-pragmatic conclusion: engine, voice, script structure, and production layer should be optimized together. The script taxonomy remains useful, but the report does not establish a large standalone markup effect.

| Goal | Best path | Why |
|------|-----------|-----|
| Most reliable production-quality path | Generate structured scripts once; render through the strongest available TTS lane | The current evidence points to engine, voice, and production choices as first-order levers alongside script structure. |
| Best current Gemini lane | Gemini 3.1 Flash TTS Preview, with pinned model IDs and short probes | Project validation showed it can outperform 2.5 Pro TTS on the same scripts, but it is still preview and has documented limits. |
| Highest documented paralinguistic surface | ElevenLabs v3 with inline tags | Widest official tag surface, but reliability and voice-clone caveats need direct testing. |
| Best meditation candidate to test | Cartesia Sonic 3.5 or Gemini 3.1 Flash/2.5 Pro TTS | Cartesia latest docs center Sonic 3.5; the report has not run a same-script meditation MOS comparison. |
| Best for podcast | ElevenLabs v3, Gemini 3.1 Flash TTS, and production lane in a blind A/B | The Flash judge found production/prompt close; do not choose from Pro-judge MOS alone. |
| Best for comedy | No production-ready winner yet | Live laughter, audience energy, and contrastive accent remain the hardest gap. |
| Best next validation step | Human listener panel before expanding script-side complexity | The biggest uncertainty is evaluator validity, not another small script ablation. |

9. Open problems

  1. Human-listener validation — LLM judges are useful for triage but cannot settle external naturalness claims.
  2. Cross-engine same-script variance — the report has not rendered the same script set through ElevenLabs v3 and Cartesia Sonic 3.5.
  3. Pre-punchline contrastive accent — no tested lane reliably produces the L+H* on the punchline keyword that comedy depends on.
  4. Sub-word affect — sarcasm, mock anger, irony layered on neutral surface text.
  5. Live-room laugh-track integration — the audience reaction is part of the medium for stand-up.
  6. Long-form narrative arc — pacing that builds across a 30-60 minute episode.
  7. Cross-engine paralinguistic-tag standardisation: [laughs] means different things to ElevenLabs, Gemini, Sesame, and CosyVoice.
  8. Power-adequate empirical ablation — N=10 is exploratory; a confirmatory design needs more scripts, human listeners, and pre-registered analysis.

10. Methodology — how each phase was run

A reader-grade audit trail of how every finding above was produced. The goal is full reproducibility: every paper read, every claim verified, every audio rendered.

10.1 Phase B — paper discovery across 10 axes

Each of the 10 paralinguistic axes was assigned to a dedicated research subagent with WebSearch and WebFetch access. The brief asked for 20-30 of the most-cited or most-recent papers per axis, spanning three literature clusters: foundational linguistics / phonetics, NLP / corpus, and TTS synthesis. Each paper had to surface a quantitative anchor (effect size, MOS delta, frequency-per-100-words) or be marked as inferred. ~340 papers were summarised in the first pass.

10.2 Phase C — 5 content-type style audits

Five subagents in parallel: meditation, podcast, deep-dive, comedy-news, lyrics. Each produced a per-axis frequency rating + characteristic-pattern table, plus speaking-rate / silence-ratio / vocabulary anchors. The deep-dive audit surfaced the load-bearing NotebookLM finding (Steven Johnson on Latent.Space: the audio model injects disfluencies, not the LLM).

10.3 Phase D — 3 TTS architecture audits

Three subagents: closed SOTA (ElevenLabs v3 + Cartesia + OpenAI), Google Gemini family (2.5 Pro TTS + 3.1 Flash TTS + NotebookLM), open-source evaluation surface (Kokoro / Fish / CosyVoice 2 / Higgs Audio / F5-TTS / Sesame CSM / Bark). Each captured architecture, training-data claims, paralinguistic-tag support, customization surface, and deployment constraints.

10.4 Phase E — empirical ablation harness

Initial Phase E used 5 g25-full scripts selected for paralinguistic richness; corrected Phase S expanded the summary to N=10. Each script was rendered as an original, an all-stripped variant, and 10 per-axis stripped variants (12 renders per script) through Gemini 2.5 Pro TTS, then scored by Gemini 2.5 Pro and Gemini 2.5 Flash audio judges. The stripper functions are unit-tested (29/29 passing). Statistical analysis: paired comparisons per axis, Holm-Bonferroni correction across 10 comparisons, percentile bootstrap CI on means, and Cronbach's alpha on the 4 sub-dimensions. This remains exploratory because the judges are LLMs and the corpus is small.
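The correction and interval steps named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in experiments/analyze_ablation.py: the function names `holm_bonferroni` and `bootstrap_ci_mean`, the resample count, and the seed are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis (Holm step-down)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank, 0-based) is tested at alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, every larger p-value is kept as well
    return reject


def bootstrap_ci_mean(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

With 10 per-axis comparisons, the smallest p-value is tested at alpha/10, the next at alpha/9, and so on, which is why a small corpus rarely survives the correction on any single axis.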

10.5 Voice selection rationale

All Gemini TTS renders used the Kore prebuilt voice — a Western/American-podcast-style voice. This kept the ablation apples-to-apples across the broader comparison set, where Kore was the established baseline.

11. The adversarial review trail

Every artifact above was independently fact-checked by a dedicated adversarial subagent. This is the trail. ~280 issues flagged, ~252 fixed, 20 fabrications removed.

11.1 The pattern

For each of the 18 artifacts (10 axes + 5 content styles + 3 TTS audits) and the ablation experiment design, a dedicated adversarial reviewer was spawned with the same rubric: verify every cited paper exists with the claimed author / year / venue, spot-check every numeric claim against the primary source, flag any [citation] that should be [inferred], identify 3-5 missing seminal works, and demand effect sizes where the original gave directional claims only.

11.2 Headline catches

  • Axis 01 (vowel elongation): the comedy-rewards-lengthening arm collapsed when the adversarial fact-check revealed that the source cited as Pickering 2009 was actually Attardo & Pickering 2011, which reports null pre-punchline pause effects.
  • Axis 02 (non-verbal vocalizations): Cowen 24-categories (vocal bursts, 2019) vs 27-categories (video-evoked emotions, 2017) — different studies conflated. NV-Bench (14-category, arXiv:2603.15352) swapped with NVBench (45-category, arXiv:2604.16211).
  • Axis 07 (emotion tags): EmoDiff cited under wrong arXiv ID. Correct identifier is arXiv:2211.09496 (Guo et al, ICASSP 2023). StyleTTS 2 finding direction reversed in original.
  • Axis 09 (in-text emphasis): two papers misattributed to wrong authors. EE-TTS MOS deltas had Expressiveness and Naturalness reversed.
  • Comedy-news content audit: arXiv 2605.00143 ("Timing is Everything, 828 Chinese stand-up performances") was unverifiable — removed. The Trevor Noah Dec 2018 phone-TTS anecdote was contradicted by contemporary press — removed.
  • Deep-dive audit: Nicole Hennig "NotebookLM leaked prompt" framing is misleading — her published article is Claude-assisted reverse engineering, not a leak.
  • Closed-SOTA TTS audit: CastleFlow leaderboard entry on TTS Arena V2 unverifiable — removed.
  • Open-source TTS audit: Higgs Audio v2.5 parameter-count framing corrected; a pronunciation-only precedent claim was down-scoped because it did not establish paralinguistic control.

11.3 The verification pass

A meta-reviewer was then run over all 10 axis v2 rewrites to confirm flagged issues were actually addressed. Verdict: SHIP-READY. 107 of 119 reviewer flags cleanly fixed, 12 honestly downgraded from [citation] to [unverified] rather than silently dropped.

15. Lissin corpus paralinguistic statistics

Per-axis density anchors from the 20-hour pilot corpus and later prompted scripts against natural-conversation baselines.

| Axis | Natural conversation baseline | Lissin pilot baseline (v1) | Lissin densified (v2) | g25-full prompted |
|---|---|---|---|---|
| Vowel elongation / 100w | not measured | ~0.2 | ~1.5 | ~2.0 |
| Non-verbal vocal / minute | ~0.5 laughs (Vettin & Todt 2004) | ~0.3 | ~2.4 | ~3.2 |
| Filled pauses / 100w | ~2-3 (Switchboard, Shriberg 1994) | ~1.1 | ~2.0 | ~2.5 |
| Discourse markers / 100w | ~3-5 (BNC2014, Aijmer) | ~3.8 | ~4.5 | ~5.0 |
| Conv fillers / 100w | ~1-2 adult, ~3-4 teen (D'Arcy) | ~1.5 | ~2.2 | ~2.8 |
| Mid-thought pivots / 100w | ~6 word-involved (Shriberg 1994) | ~0.4 | ~0.8 | ~1.5 |
| Emotion / delivery tags / 100w | n/a (annotation-only) | ~3.0 | ~6.5 | ~13.0 |
| Pacing tags / 100w | n/a | ~1.0 | ~2.5 | ~3.5 |
| In-text CAPS / 100w | n/a | ~0.0 | ~0.1 | ~3.2 |
| Audible reactions / 100w | ~0.5 | ~0.3 | ~0.7 | ~2.0 |

The g25-full prompted approach is stronger on most axes, with discourse markers tied and filled pauses slightly stronger in an earlier internal lane. This supports testing prompt-engineered scripts and engine choice before adding more script-side complexity.

16. Recommended scripting templates per content type

Copy-paste templates that ship with the right axis density for each genre. These are battle-tested in the Phase E ablation pipeline.

16.1 Meditation (Headspace / Calm / Waking Up style)

[gentle] Take a moment. [pause]
Notice the breath... slowly... in. [short pause] And out. [PAUSE=2s]
Just rest with this. [speaking slowly] Nothing to do. Nothing to fix. [pause]
[whispers] Just simply being here. [long pause]

Active speech 80-110 WPM; 30-60% of runtime is silence; "just" appears ~4× more often than reference corpora.

16.2 Podcast (Joe Rogan / Smartless / On Being style)

[curious] So, you know, here's the thing. I was — well, I was reading
about this and, uh, it kind of blew my mind. Like, basically, the idea
is — [excited] OK so imagine a moth so big it's the size of a country.
[laughs] Yeah, exactly. That's how big this thing was.

All 10 axes engaged; filled-pause + discourse-marker baseline ~3-5 per 100 words; 150-170 WPM raw / 90-120 effective WPM after pauses.

16.3 Deep-dive / explainer (Kurzgesagt / 3Blue1Brown / Veritasium style)

Here's a number that should not exist. [short pause]
ONE HUNDRED TRILLION. [medium pause] That's the number of synapses
in a single human brain — *one hundred trillion connections*, each one
forming and re-forming, every second, your entire life. [serious]
And we still — *still* — don't know how it works.

Disfluencies suppressed; in-text CAPS on key terms; strategic beat before reveals; 130-160 WPM baseline / 165-200 WPM energetic.

16.4 Comedy news (Daily Show / Weekend Update / late-night style)

So Trump signed a new — wait. [medium pause] Sorry, you're gonna
LOVE this. He signed an executive order — about moths.
[deadpan] Yeah. Moths. [PAUSE=2s] As of Tuesday, moths are now,
quote, "totally allowed to be in the White House." [laughs]
This is — [trembling] — this is the real news.

Pre-punchline pause is the #1 axis; vowel elongation high; in-text CAPS for stress; em-dash pivots for misdirection.

16.5 Lyrics / song (Suno / Udio style)

[Verse 1]
I been waiting all niiight,
Just to see you smiiile
Oh-oh-oh, it's been a whiiile,
[Intimate Delivery] yeah it's been a whiile

[Chorus]
Soooo say it again, [Breathy] say it again
[Soulful Delivery] Tell me you're miiine — say it again

Vowel elongation as melisma; bracket-tag delivery axis; pacing inherits from musical metre; filled pauses near zero.

17. Engine-specific tag taxonomies (the cross-engine standardisation problem)

A tag-portability table. The same paralinguistic intent has to be re-encoded for each engine — a real friction point if you're A/B-ing engines.

IntentElevenLabs v3Gemini 2.5/3.1 TTSCartesia Sonic 2/3OpenAI gpt-realtimeCosyVoice 2
Laugh[laughs][laughs][laughter]instruction field<|laughter|>
Sigh[sighs][sighs]instruction field<|sigh|>
Whisper[whispers][whispering]SSML prosody volumeinstruction<|whisper|>
Short pause... (ellipsis)[short pause]<break time="500ms"/>punctuationpunctuation
Long pause[break time="3s"][PAUSE=2s]<break time="2s"/>
Excited[excited][excited]SSML prosody pitchinstruction<|happy|>
CAPS emphasishonor CAPS nativelyhonor CAPS nativelystrip CAPS, use <emphasis>honor CAPS
Vowel elongationtags preferredgrapheme repeat ("Soooo")
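
One way to live with this friction is to write scripts against neutral placeholders and localize them per engine at render time. The sketch below is an illustration only: the `{intent}` placeholder syntax, the engine keys, and the `localize` function are hypothetical, and the mapping includes only cells from the table above that are inline tokens in rows where every column is filled (Laugh, Whisper, Short pause, Excited). Intents that an engine handles through an instruction field, SSML prosody, or punctuation are returned as unresolved so the caller can handle them out of band.

```python
import re

# Inline tokens copied from the portability table; engine keys are hypothetical labels.
INLINE_TAGS = {
    "laugh": {
        "elevenlabs_v3": "[laughs]",
        "gemini_tts": "[laughs]",
        "cartesia_sonic": "[laughter]",
        "cosyvoice_2": "<|laughter|>",
    },
    "whisper": {
        "elevenlabs_v3": "[whispers]",
        "gemini_tts": "[whispering]",
        "cosyvoice_2": "<|whisper|>",
    },
    "short_pause": {
        "elevenlabs_v3": "...",
        "gemini_tts": "[short pause]",
        "cartesia_sonic": '<break time="500ms"/>',
    },
    "excited": {
        "elevenlabs_v3": "[excited]",
        "gemini_tts": "[excited]",
        "cosyvoice_2": "<|happy|>",
    },
}


def localize(script: str, engine: str) -> tuple[str, list[str]]:
    """Swap {intent} placeholders for one engine's inline token.

    Returns the rewritten script plus the intents that had no inline
    token for this engine and were dropped from the text.
    """
    unresolved: list[str] = []

    def swap(match: re.Match) -> str:
        intent = match.group(1)
        token = INLINE_TAGS.get(intent, {}).get(engine)
        if token is None:
            unresolved.append(intent)
            return ""
        return token

    text = re.sub(r"\{(\w+)\}", swap, script)
    # Collapse the double space left behind by a dropped placeholder.
    return re.sub(r" {2,}", " ", text).strip(), unresolved
```

Keeping the unresolved list explicit makes an A/B run auditable: a script that silently loses its laugh on one engine is no longer the same script.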

18. The cross-engine variance hypothesis

Same-script-through-different-engines variance may be larger than the script-side variance measured in the ablation, but the report has not fully resolved this.

Phase E/S held the engine constant (Gemini 2.5 Pro TTS) and varied the script. Under that design, the corrected original-vs-all-stripped gap was 0.35 MOS under the Pro judge and much smaller under Flash. The inverse question — hold the script constant and vary the engine — was only partially tested inside the Gemini family.

The strongest product signal is that the later Gemini 3.1 Flash TTS lane outscored the Gemini 2.5 Pro TTS lane on the same scripts under the Pro judge. That suggests engine choice can swamp script-markup deltas. A complete answer needs the same scripts rendered through ElevenLabs v3 and Cartesia Sonic 3.5, then judged by humans or at least a cross-provider audio judge.

19. Reproducibility manifest

Every artifact, every script, every JSON, every audio file is on disk and version-pinned. Re-running the pipeline from scratch follows the short command sequence in 19.2.

19.1 Source files

  • axis_summaries/*.{md,v2.md} — 10 literature reviews (original + adversarial-fixed v2)
  • content_styles/*.{md,v2.md} — 5 style audits
  • tts_audits/*.{md,v2.md} — 3 architecture audits
  • adversarial/*.md — 19 adversarial reviewer verdicts + 1 verification pass
  • experiments/run_axis_ablation.py — Phase E harness (with stripper unit tests at experiments/test_strippers.py)
  • experiments/analyze_ablation.py — paired t-test, Holm-Bonferroni, bootstrap CI, Cronbach's α
  • experiments/judge_scores.json + judge_flash_scores.json — raw judge outputs (60 + 60 rows)
  • experiments/audio/*.wav — 60 rendered audio files
  • refs/bibliography.{json,md} — 905 auto-aggregated references
  • figures/*.png — source chart artifacts; the public report renders the main figures as inline HTML/SVG

19.2 Re-run sequence

# Re-run literature aggregation
python3 drafts/build_bibliography.py

# Re-run all charts
python3 drafts/build_axis_content_matrix.py
python3 drafts/build_extra_charts.py
python3 experiments/build_judge_compare.py

# Re-run statistical analysis (uses existing judge JSONs)
python3 experiments/analyze_ablation.py

# Re-build report HTML
python3 drafts/build_report_html.py

# Re-publish
python3 drafts/publish_report.py

19.3 Hardware / software pins

  • Python 3.14 in tts_corpus_v1/.venv
  • google-genai SDK (vertexai client), matplotlib 3.10.9
  • Vertex AI project: listenowl-g, location us-central1 for Pro / global for 3.x
  • Wrangler 4.63 to publish

20. Future work — validating and improving naturalness

The next work should reduce evaluator uncertainty first, then test engines, voices, and post-production. The existing MOS numbers are exploratory LLM-judge scores, not a validated percentage decomposition of human naturalness.

20.1 Likely remaining gap components

The Phase P/T diagnostics suggest the missing layer is diffuse: voice identity, acoustic micro-variation, fresh non-verbal events, long-form energy arc, and audio production all plausibly contribute. The components below are product hypotheses, not measured causal effects.

| # | Gap component | Status | Why it shows up | What could close it |
|---|---|---|---|---|
| 1 | Voice identity / parasocial recognition | Hypothesis | Generic narrator voices lack the familiarity and character of real hosts. | Reference-audio voice cloning per content type; test via blinded A/B. |
| 2 | Acoustic micro-variation | Hypothesis | Modern TTS can sound too smooth in F0, breath, and texture. | Engine-level models or light production processing. |
| 3 | Novel non-verbal events | Hypothesis | Repeated laughs/sighs can feel canned. | Engines or datasets with more spontaneous non-verbal diversity. |
| 4 | Long-form energy arc | Hypothesis | Chunked TTS lacks episode-level state. | Per-section energy direction and long-context render strategies. |
| 5 | Audio production polish | Hypothesis | Real recordings include mic tone, room tone, compression, and mix context. | Post-render EQ, compression, room tone, and reverb A/B. |

20.2 Priority order

  1. Human listener panel. Validate whether humans agree with the LLM-judge ranking.
  2. Cross-engine same-script renders. ElevenLabs v3, Cartesia Sonic 3.5, Gemini 3.1 Flash TTS, and current production.
  3. Voice cloning per content type. Test whether a genre-matched voice beats more markup.
  4. Audio production layer A/B. Raw TTS vs light mastering chain.
  5. Per-episode energy direction. Add section-level energy tags only after engine/voice baselines are stable.

20.3 What not to over-optimize yet

  • More bracketed tags. The current evidence does not support simply increasing tag density.
  • More disfluencies. Synthetic scripts already overproduce some lexical markers compared with real transcripts.
  • More extreme em-dash pivots. The earlier N=5 axis ranking was not stable.
  • More script-side complexity. Prompt and production paths are close under the less ceiling-prone Flash pairwise tests.

20.4 Concrete experiments to run next

  1. Human listener naturalness panel — blinded pairwise or MUSHRA-style, balanced by genre, randomized order, pre-registered analysis.
  2. Eleven v3 + Cartesia Sonic 3.5 cross-engine bar — render the same scripts through both with documented settings.
  3. Voice-clone A/B — calm meditation voice, podcast voice, deep-dive voice.
  4. Audio production A/B — raw output vs light mastering chain.
  5. Higgs Audio v2.5 probe — a simpler open-source engine probe than CosyVoice 2.
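
For the listener panel in item 1, the blinding and order randomization can be prepared before any clip is played. The following is a minimal sketch, assuming each item is a tuple of (item id, genre, clip from system X, clip from system Y); the function name `build_blinded_trials`, the field names, and the seed are hypothetical. Genre balance is left to whoever assembles the input list.

```python
import random


def build_blinded_trials(pairs, seed=0):
    """Randomize presentation side and trial order for pairwise listening.

    pairs: list of (item_id, genre, clip_x, clip_y) tuples.
    Returns (trials, answer_key). Listeners see only `trials`;
    `answer_key` stays with the analyst until scoring is finished.
    """
    rng = random.Random(seed)
    trials, answer_key = [], {}
    for n, (item_id, genre, clip_x, clip_y) in enumerate(pairs):
        swapped = rng.random() < 0.5  # coin flip decides which system plays first
        first, second = (clip_y, clip_x) if swapped else (clip_x, clip_y)
        trial_id = f"t{n:03d}"
        trials.append(
            {"trial_id": trial_id, "genre": genre, "first": first, "second": second}
        )
        answer_key[trial_id] = {
            "item_id": item_id,
            "first_is": "y" if swapped else "x",
        }
    rng.shuffle(trials)  # randomize the order in which trials are presented
    return trials, answer_key
```

Fixing the seed in the pre-registration lets a third party regenerate the exact presentation order.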

20.5 Axes our 10-axis taxonomy missed — candidates for future research

The original 10 axes were derived from a specific subset of paralinguistic phenomena — speaker-internal markers that show up at the word or phrase level in a single-speaker single-track recording. Several real phenomena that show up in actual podcast / comedy / storytelling audio are NOT captured by any of the 10. Below are 12 candidate axes ranked by weightage for Lissin's product. Weightage is a 1-10 priority for "how much should we add this to a future revision of the report."

| # | Candidate axis | Weight | What it is | Where it is defining | Why the original 10 missed it |
|---|---|---|---|---|---|
| 11 | Onomatopoeia | 8 | Lexical imitation of external sounds — "BOOM", "tap-tap-tap", "ZAP" | Comedy (defining), storytelling podcasts, lyrics | The original 10 were speaker-internal markers; sound-class imitation of the external world is a different category. |
| 12 | Overlapping speech / multi-host turn-taking | 8 | Two hosts talking simultaneously, finishing each other's sentences, "yeah-yeah-yeah" while the other speaks | Conversational podcasts (Smartless, Joe Rogan, This American Life) | We treated TTS as single-speaker; multi-track conversation is structurally different. |
| 13 | Character voice / persona switching | 7 | Same speaker switching to a different voice persona mid-utterance — Mulaney doing his dad's voice, Trevor Noah doing an impression | Stand-up, storytelling podcasts | Emotion tags cover affect but not voice-identity switching within one speaker. |
| 14 | Backchannel responses during another's turn | 7 | "mhm", "right", "yeah" emitted by listener while primary speaker is talking | Interview podcasts, conversational shows | Audible reactions axis assumed speaker-emitted; backchannels are listener-emitted on a parallel track. |
| 15 | Sound effects / foley cues | 7 | Bracketed non-vocal events: [door slam], [footsteps], [crowd murmur] | Storytelling podcasts (Serial, S-Town), audio dramas | Edge case: these are not TTS — they are a separate audio layer. But the script writes them. |
| 16 | Whisper-to-shout dynamic range mid-sentence | 6 | Continuous loudness modulation within one utterance, not discrete [whispers][shouts] | Meditation (whisper) + comedy (shout) | Emotion / delivery tags are discrete; the continuous dynamic is not captured. |
| 17 | Code-switching / accent shift mid-sentence | 6 | English → Spanish, RP → AAVE, fancy → casual mid-clause | Multi-cultural content, comedy impressions, bilingual podcasts | Not in any of the 10; closed engines also do not expose it. |
| 18 | Audience reaction integration | 6 | Live laughter, applause, gasps from an audience track woven into the speech | Comedy news, stand-up, late-night | We treated paralinguistic events as speaker-emitted; audience track is third-party and a separate audio layer. |
| 19 | Affect-physiology speech (sighing, crying, out-of-breath) | 5 | Prosody baked into a physiological state — resigned sigh through whole sentence, post-exertion breath, tearful tremor | Emotional storytelling, meditation breathwork, sports commentary | Emotion tags name the affect; physiology shapes the entire delivery and is below tag granularity. |
| 20 | Reading vs reciting cadence | 5 | Distinct prosody for ad reads, sponsor mentions, news quotes vs conversational delivery | Podcasts with ad reads, news bumpers, audiobook chapter heads | Pacing tags do not capture register switching between read-aloud and conversational modes. |
| 21 | Vocal fry / creaky voice | 4 | Glottalization at phrase end — the "millennial podcast voice" signature | 30+ podcast voices (default register), some meditation | Sub-segmental phonation; below the axis grain of our taxonomy. |
| 22 | Vowel reduction patterns (gonna / wanna / kinda) | 4 | Casual register reduction of "going to" → "gonna" | Conversational podcasts, comedy | Orthographic — engines render whichever spelling the script uses, but the prosodic difference was never tested. |

Suggested grouping for a future report revision

  • Add axes 11-14 as proper additions to the taxonomy (Onomatopoeia, Overlapping speech, Character voice, Backchannels). These are genuinely distinct from the existing 10 and have clear content-type homes. Each axis would need a literature-review subagent pass, an adversarial-reviewer pass, a handful of paired contrast clips, and two judges. The work is small-budget and parallelisable.
  • Add axes 15-20 collectively as a new section called "Speaker-external and multi-track phenomena" — together they name the same gap: the original taxonomy was built around single-speaker single-track speaker-internal markup, and real podcast / comedy audio has multi-track and external-sound dimensions that script-side markup alone cannot fully address.
  • Axes 21-22 (vocal fry, vowel reduction) are sub-segmental and probably do not need their own row — flag them in the glossary or under axis 09 as related phonation phenomena.

Estimated wall-clock time to expand the taxonomy from 10 to 14 axes and add the speaker-external section, with the per-axis subagents running in parallel: ~3 hours. The expected cost per new axis is well within what each of the original 10 required.

21. Glossary

Terms used in this report.

Naturalness
Could this audio have come from a real person? (Binary-ish, perceptual.)
Expressiveness
The range of affect / style the TTS can produce. (Continuous, capability.)
Prosody
Pitch, duration, intensity contours that realise both naturalness and expressiveness.
Paralinguistic
Vocal / typographic features that carry meaning beyond the lexical content (laughter, pauses, emphasis).
MOS
Mean Opinion Score, 1-5 naturalness rating. Industry standard.
CMOS
Comparative MOS — paired A/B preference rather than absolute rating.
Ablation
Selectively remove one factor; measure how output changes. The standard method to attribute effect.
Tacotron / Bark / Gemini 2.5 Pro TTS
Successive generations of TTS architecture. Tacotron is an attention-based autoregressive seq2seq model; Bark is a GPT-style autoregressive model over audio tokens; Gemini 2.5 Pro TTS is a multimodal LM.
Halo bias
The judge gives systematically inflated scores to outputs from the same model family.
Holm-Bonferroni
Conservative multiple-comparison correction; controls family-wise error rate.
Cronbach's α
Internal-consistency reliability of a multi-item scale. α > 0.9 means the items are nearly collinear.

22. Validation phase — real content, production lanes, and judge limits

The validation phase was added after pushback that the first results used LLM-generated scripts and LLM judges. It improves the evidence, but it does not remove the need for human listeners.

22.1 Real human content benchmark (Phase H)

We pulled 31 YouTube transcripts via yt-dlp across 4 genres: meditation (7), podcast (7), comedy news (8), and deep-dive (9). The transcript analysis covered 103,325 words. The same 10-axis regex strippers were applied to compute per-100-word axis densities.
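
A per-100-word density of this kind reduces to a pattern count divided by a word count. The sketch below shows the idea for filled pauses only; the regex, the word tokenizer, and the function name `density_per_100_words` are illustrative assumptions and are not the patterns used by the report's strippers.

```python
import re

# Hypothetical filled-pause pattern: "uh", "uhh", "um", "umm", and so on.
FILLED_PAUSE = re.compile(r"\b(?:uh+|um+)\b", re.IGNORECASE)
WORD = re.compile(r"[A-Za-z']+")


def density_per_100_words(text, pattern):
    """Occurrences of `pattern` per 100 words of `text`."""
    words = WORD.findall(text)
    if not words:
        return 0.0
    hits = len(pattern.findall(text))
    return 100.0 * hits / len(words)


sample = "So, uh, I was, um, thinking about it"
print(density_per_100_words(sample, FILLED_PAUSE))  # 2 hits over 8 words -> 25.0
```

Because the denominator counts the fillers themselves as words, stripping an axis changes both the numerator and the word count, which matters when comparing densities before and after stripping.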

Transcript-level conclusion: synthetic scripts are close to real transcripts on the axes that ASR actually preserves, but many markup axes are not directly comparable because real humans laugh or pause without producing bracket tags in transcripts.

| Axis | Real all (per 100w) | g25-full prompt | Production variant | Ratio (synth/real) |
|---|---|---|---|---|
| Discourse markers | 2.40 | 3.79 | 3.02 | 1.58x g25 / 1.26x production variant |
| Conversational fillers | 1.26 | 2.22 | 2.32 | 1.76x g25 / 1.84x production variant |
| Filled pauses (floor) | 0.30* | 0.70 | 0.84 | 2.29x g25 / 2.75x production variant |

*ASR drops many filled pauses; the real value is a floor.

22.2 Stripper validity audit (Phase L)

Only four of the ten strippers map cleanly onto lexical signals in ASR-derived real transcripts. The rest are interface/typography conventions or, in one case, a wrong-axis bug.

| Stripper | Real-content removal % | Verdict |
|---|---|---|
| 03 Filled pauses (uh/um) | 0.31% avg | REAL-SIGNAL lexical axis |
| 04 Discourse markers | 0.37% avg | REAL-SIGNAL lexical axis |
| 05 Conversational fillers | 0.19% avg | REAL-SIGNAL lexical axis |
| 10 Audible reactions | 0.20% avg | REAL-SIGNAL lexical axis |
| 01 Vowel elongation | 0.00% | PURE-FORMAT; can corrupt text such as Henry VIII |
| 02 Non-verbal vocal tags | 0.01% | PURE-FORMAT |
| 06 Mid-thought pivots | 0.05% | PURE-FORMAT / orthographic |
| 08 Pacing tags | 0.00% | PURE-FORMAT |
| 09 In-text emphasis | 0.00% | PURE-FORMAT |
| 07 Emotion / delivery tags | 1.47% avg | WRONG-AXIS BUG; eats YouTube [Music]/[Applause] tokens |

This weakens the per-axis interpretation of the original ablation. Future work should report lexical strippers separately from interface strippers.
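
Both failure modes flagged in the table are easy to reproduce with naive regexes. The sketch below is a hypothetical reconstruction for illustration, not the report's stripper code: the function names, the patterns, and the `DELIVERY_TAGS` allowlist (drawn from tags used in the templates earlier in this report) are assumptions.

```python
import re


def naive_strip_elongation(text):
    """Collapse any run of 3+ identical letters to one letter."""
    return re.sub(r"([A-Za-z])\1{2,}", r"\1", text)


def guarded_strip_elongation(text):
    """Collapse only lower-case runs, so Roman numerals such as VIII survive.

    Trade-off: an all-caps elongation such as SOOOO is no longer collapsed.
    """
    return re.sub(r"([a-z])\1{2,}", r"\1", text)


def naive_strip_bracket_tags(text):
    """Remove every bracketed token, including ASR event markers."""
    return re.sub(r"\s*\[[^\]]+\]", "", text).strip()


DELIVERY_TAGS = {"excited", "curious", "deadpan", "gentle", "serious", "whispers"}


def allowlist_strip_delivery_tags(text):
    """Remove only bracketed tokens that are known delivery tags."""

    def drop_if_delivery(match):
        is_delivery = match.group(1).lower() in DELIVERY_TAGS
        return "" if is_delivery else match.group(0)

    return re.sub(r"\s*\[([^\]]+)\]", drop_if_delivery, text).strip()


print(naive_strip_elongation("Henry VIII"))    # Henry VI
print(guarded_strip_elongation("Henry VIII"))  # Henry VIII
print(naive_strip_bracket_tags("[Music] welcome back [excited] everyone"))
# welcome back everyone
print(allowlist_strip_delivery_tags("[Music] welcome back [excited] everyone"))
# [Music] welcome back everyone
```

An allowlist keeps the emotion-tag stripper on its own axis when it is pointed at YouTube transcripts, at the cost of missing any delivery tag not yet listed.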

22.3 Production variant vs g25-full prompt — pairwise A/B (Phase O)

Same 10 production items, both rendered through Gemini 2.5 Pro TTS, judged with two positional orderings to control for position bias.

| Judge | Production-variant wins | Prompt wins | Interpretation |
|---|---|---|---|
| Gemini 2.5 Pro | 4/20 | 16/20 | Strong prompt preference, but same-family halo risk is high. |
| Gemini 2.5 Flash | 10/20 | 10/20 | No clear difference. |
| GPT-4o-audio-preview | n/a | n/a | Not accessible in this OpenAI org; model returned 404. Cross-family judge not completed. |
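
The two-ordering design can be sketched as a small tally loop. This is an illustration under assumptions, not the harness itself: `judge` stands in for any callable that hears two clips in order and returns "first" or "second", and the function name `tally_pairwise` is hypothetical.

```python
def tally_pairwise(pairs, judge):
    """Judge every (production, prompt) pair in both presentation orders.

    pairs: list of (production_clip, prompt_clip).
    judge: callable(first_clip, second_clip) -> "first" or "second".
    Returns win counts out of 2 * len(pairs) judgments.
    """
    wins = {"production": 0, "prompt": 0}
    for production_clip, prompt_clip in pairs:
        # Ordering 1: production clip is presented first.
        if judge(production_clip, prompt_clip) == "first":
            wins["production"] += 1
        else:
            wins["prompt"] += 1
        # Ordering 2: prompt clip is presented first.
        if judge(prompt_clip, production_clip) == "first":
            wins["prompt"] += 1
        else:
            wins["production"] += 1
    return wins
```

A judge that always picks whichever clip comes first contributes one win to each side per pair, so pure position bias cancels in the totals instead of favoring one lane.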

22.4 What changes about the early headline

The early 5.3% / 1.3% framing is retained only as historical context and should not be used as a headline. The corrected report uses MOS deltas, caveats the LLM judges, and treats per-axis results as hypotheses.

  • Engine-side question: modern frontier TTS compresses much script-side variance into similar-sounding audio, but the exact amount is not settled.
  • Production-variant-vs-prompt question: judge-dependent; Flash sees parity.
  • Real-vs-synthetic question: Gemini-family judges did not detect a large average gap, but human listeners are required.

22.4b Engine floor and markup increments (Phase T)

Phase T added a missing baseline: raw user prompts rendered directly through Gemini 2.5 Pro TTS, then judged with the same Pro and Flash judges. These are MOS scores, not percentages of human naturalness.

Exploratory MOS ladder

[Figure: grouped bar chart on a 1-5 mean opinion score axis, with Pro-judge and Flash-judge bars for four stages: engine floor, + script structure, + paralinguistic markup, and real human audio. The plotted values are the Pro MOS and Flash MOS columns of the stage table that follows.]
Figure 9. Exploratory MOS ladder from raw engine output to structured script, marked-up script, and real human audio. Treat as LLM-judge diagnostics, not a ratio-scale naturalness decomposition.
| Stage | Pro MOS | Flash MOS | Delta vs prior stage (Pro) | Delta vs prior stage (Flash) | What this is |
|---|---|---|---|---|---|
| Engine floor (raw prompt -> TTS) | 3.80 | 3.10 | -- | -- | No script structure or markup. |
| + script structure (all-stripped) | 3.80 | 3.60 | +0.00 | +0.50 | Organized into a script, no markup. |
| + paralinguistic markup (original) | 4.65 | 3.85 | +0.85 | +0.25 | Inline tags, disfluencies, pauses. |
| Real human audio | 4.55 | 4.10 | -0.10 vs synthetic original | +0.25 vs synthetic original | 31 YouTube clips across four genres. |

The Pro judge producing synthetic-original above real-human audio is a warning sign, not a victory claim. It likely reflects scale ceiling and same-family bias.
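
The stage-to-stage deltas in the ladder are plain differences between adjacent rows, which a few lines of Python can recompute from the published MOS values. The names `LADDER` and `stage_deltas` are illustrative.

```python
# (stage, Pro MOS, Flash MOS) as reported in the Phase T ladder.
LADDER = [
    ("Engine floor", 3.80, 3.10),
    ("+ script structure", 3.80, 3.60),
    ("+ paralinguistic markup", 4.65, 3.85),
    ("Real human audio", 4.55, 4.10),
]


def stage_deltas(ladder):
    """Difference between each stage and the stage directly before it."""
    rows = []
    for prev, cur in zip(ladder, ladder[1:]):
        rows.append(
            (cur[0], round(cur[1] - prev[1], 2), round(cur[2] - prev[2], 2))
        )
    return rows


for stage, pro_delta, flash_delta in stage_deltas(LADDER):
    print(f"{stage}: Pro {pro_delta:+.2f}, Flash {flash_delta:+.2f}")
```

The two judges disagree about where the gain sits: Pro assigns it to markup and Flash assigns most of it to script structure, which is one reason the ladder is diagnostic and not a decomposition.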

22.5 Audio-vs-audio: real YouTube clips vs synthetic Gemini TTS (Phase P)

Phase P did run the audio-vs-audio comparison: 31 of 31 YouTube videos downloaded successfully as 45 s windows, then judged with the same Pro and Flash rubrics. This resolves the earlier "not validated" contradiction.

| Judge | Real audio mean (n=31) | Synthetic baseline | Delta (real - synth) | Cohen's d |
|---|---|---|---|---|
| Gemini 2.5 Pro | 4.55 | 4.80 in the original n=5 baseline | -0.25 MOS | -0.44 |
| Gemini 2.5 Flash | 4.10 | 4.00 in the original n=5 baseline | +0.10 MOS | +0.18 |

Careful reading: Gemini-family LLM judges did not detect a large average real-over-synthetic gap. This does not prove humans would agree. The samples are unbalanced, genre-mixed, and production contexts differ.
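
For readers unfamiliar with the effect-size column, Cohen's d for two independent samples divides the difference in means by a pooled standard deviation. The sketch below assumes that pooled-SD form; the report does not state which variant was used, and the function name `cohens_d` and the toy scores are illustrative and are not the report's per-clip data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(sample_a, sample_b):
    """Cohen's d (pooled standard deviation) for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    pooled_var = (
        (na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)
    ) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)


# Toy scores for illustration only.
toy_real = [4.0, 5.0, 6.0]
toy_synth = [2.0, 3.0, 4.0]
print(cohens_d(toy_real, toy_synth))  # 2.0
```

Read backwards under the same assumption, a d of -0.44 on a -0.25 MOS delta would imply a pooled standard deviation of roughly 0.57 MOS.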

| Genre | n | Real Pro | Real Flash | Pattern |
|---|---|---|---|---|
| Comedy | 8 | 5.00 | 4.88 | Real audio clearly strongest; audience energy matters. |
| Podcast | 7 | 4.71 | 4.29 | Roughly equivalent to strong synthetic lanes. |
| Deep-dive | 9 | 4.22 | 3.89 | Flatter real delivery can be penalized by LLM judges. |
| Meditation | 7 | 4.29 | 3.29 | Intentionally quiet/flat real delivery confuses conversational naturalness rubrics. |
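
The per-genre rows can be checked against the overall real-audio means with an n-weighted average. The sketch below uses only the numbers in the genre table; the names `GENRES` and `pooled_mean` are illustrative.

```python
# genre: (n clips, Real Pro MOS, Real Flash MOS) from the per-genre table.
GENRES = {
    "Comedy": (8, 5.00, 4.88),
    "Podcast": (7, 4.71, 4.29),
    "Deep-dive": (9, 4.22, 3.89),
    "Meditation": (7, 4.29, 3.29),
}


def pooled_mean(rows, column):
    """n-weighted mean of one score column across genres."""
    rows = list(rows)
    total_n = sum(row[0] for row in rows)
    return sum(row[0] * row[column] for row in rows) / total_n


print(round(pooled_mean(GENRES.values(), 1), 2))  # 4.55 (Pro)
print(round(pooled_mean(GENRES.values(), 2), 2))  # 4.1  (Flash)
```

The weighted means reproduce the 4.55 and 4.10 headline figures, which also shows how much the overall number depends on the genre mix: comedy lifts it and meditation pulls the Flash score down.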

Five-way audio comparison

[Figure: grouped bar chart on a 1-5 mean opinion score axis, with Pro-judge and Flash-judge bars for five sources: real YouTube human audio, g25-full -> Gemini 3.1 Flash TTS, Lissin alternate lane -> 2.5 Pro TTS, g25-full -> Gemini 2.5 Pro TTS, and the Lissin production pipeline. The plotted values are listed in the source table in 22.6.]
Figure 8. Five-way audio comparison on overall listenability MOS. The spread should be read as diagnostic because the Pro judge shows halo/ceiling behavior and Flash compresses many lanes together.

22.6 Lissin production-pipeline audio — deployed baseline (Phase P+)

Ten production-pipeline mp3 files were trimmed to 45 s windows and judged with the same Pro and Flash rubrics.

| Source | Pro MOS | Flash MOS | n | What it is |
|---|---|---|---|---|
| Real YouTube human audio | 4.55 | 4.10 | 31 | Genre-mixed human clips. |
| g25-full -> Gemini 3.1 Flash TTS | 4.40 | 4.00 | 10 | Same scripts, different Gemini TTS lane. |
| Lissin alternate lane -> 2.5 Pro TTS | 4.10 | 4.00 | 10 | Alternate Lissin rendering lane. |
| g25-full -> Gemini 2.5 Pro TTS | 3.80 | 3.60 | 10 | Prompt-engineered script lane. |
| Lissin production pipeline | 3.80 | 4.00 | 10 | Deployed baseline. |

The key product signal is that Gemini 3.1 Flash TTS can beat Gemini 2.5 Pro TTS on the same scripts in this lane. That is stronger evidence for testing engine choice than for adding more markup.

22.7 Production vs Prompt pairwise A/B (Phase Q)

The pairwise A/B is more useful than independent MOS near the ceiling. It used both orderings for 10 production-vs-g25-full pairs.

| Judge | Production wins | Prompt wins | Verdict |
|---|---|---|---|
| Gemini 2.5 Pro | 5/20 | 15/20 | Strong prompt preference, with halo concern. |
| Gemini 2.5 Flash | 11/20 | 9/20 | Near tie; slight production preference. |

Under the less ceiling-prone Flash judge, the alternate lane, prompt, and production are close. That should steer the team toward engine/voice/human-validation experiments rather than more script markup alone.
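
One way to gauge how far a win split sits from a coin flip is an exact two-sided sign test. The sketch below is an illustration and not an analysis the report ran; the function name `two_sided_sign_test` is hypothetical. It also assumes the 20 judgments are independent, which the design violates because each of the 10 pairs is judged twice, so any p-value it returns is optimistic.

```python
from math import comb


def two_sided_sign_test(wins, trials):
    """Exact two-sided binomial p-value against a 50/50 null."""
    k = max(wins, trials - wins)  # size of the larger side
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2**trials
    return min(1.0, 2 * tail)


for label, wins in [("Pro judge, prompt wins", 15), ("Flash judge, production wins", 11)]:
    print(label, round(two_sided_sign_test(wins, 20), 3))
```

A more defensible version would first collapse the two orderings of each pair into one outcome (prompt, production, or split) and test across the 10 pairs.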

22.8 Honest reframing

  • The 10-axis taxonomy is useful for designing genre-aware scripts.
  • The current experiments do not prove a large standalone markup effect.
  • LLM judges are not sufficient for external claims, especially when same-family halo and ceiling effects are visible.
  • Comedy is the clearest genre where real audio beats synthesis in this dataset.
  • The next decisive step is a blinded human listener panel.

22.9 What remains unvalidated

  • Cross-engine same-script comparison. ElevenLabs v3 and Cartesia Sonic 3.5 API keys were not present.
  • Open-source engine probe. The probe did not complete because the dependency image build failed.
  • Human listener preference. No blinded human-listener panel has run yet.

23. References

Auto-extracted and filtered bibliography. Entries are labeled by source tier; vendor and secondary sources should not be read as primary research evidence.

Publication-year distribution

[Figure: histogram of reference counts by publication year, 1959-2026.] Bibliography spans 1959-2026 (235 dated refs of 344 total). Dated references are inferred from arXiv IDs or citation text.
Figure 7. Publication-year distribution of the filtered references aggregated from the source documents.

Total unique references: 344

Each entry is labeled as primary/preprint, primary/doi, vendor, secondary, unverified, source, or author-year/source-list.

arxiv (60)

doi (9)

url (99)

author_year (176)

  • [author-year/source-list] ACE-Step 2025
  • [author-year/source-list] ACII 2022
  • [author-year/source-list] Adell 2007
  • [author-year/source-list] Adell 2012
  • [author-year/source-list] Adell et al. 2007
  • [author-year/source-list] Aijmer 2002
  • [author-year/source-list] Anikin & Persson 2020
  • [author-year/source-list] Attardo & Pickering 2011
  • [author-year/source-list] Beckman & Pierrehumbert 1986
  • [author-year/source-list] Bedeutung 2021
  • [author-year/source-list] Biber 1988
  • [author-year/source-list] Blakemore 1987
  • [author-year/source-list] Blakemore 2002
  • [author-year/source-list] Bolden 2006
  • [author-year/source-list] Bolden 2008
  • [author-year/source-list] Bolden 2009
  • [author-year/source-list] Bonafonte & Escudero 2007
  • [author-year/source-list] Borsos et al. 2022
  • [author-year/source-list] Borsos et al. 2023
  • [author-year/source-list] Bortfeld et al. 2001
  • [author-year/source-list] Branigan et al. 1999
  • [author-year/source-list] Brennan & Schober 2001
  • [author-year/source-list] Brezina & McEnery 2017
  • [author-year/source-list] Brinton 1996
  • [author-year/source-list] Brinton 2008
  • [author-year/source-list] Brody & Diakopoulos 2011
  • [author-year/source-list] Bryant & Aktipis 2014
  • [author-year/source-list] Bryant et al. 2018
  • [author-year/source-list] CMC-Corpora 2023
  • [author-year/source-list] COLING 2020
  • [author-year/source-list] COLT and BNC 2014
  • [author-year/source-list] Callahan 2013
  • [author-year/source-list] Chang et al. 2022
  • [author-year/source-list] Channell 1994
  • [author-year/source-list] Chaudhury et al. 2024
  • [author-year/source-list] Collister 2014
  • [author-year/source-list] Communication 2024
  • [author-year/source-list] Corley & Stewart 2008
  • [author-year/source-list] Cowen & Keltner 2017
  • [author-year/source-list] Crane 2017
  • [author-year/source-list] D'Arcy 2017
  • [author-year/source-list] Dailey-O'Cain 2000
  • [author-year/source-list] Dall- 2014
  • [author-year/source-list] DeepASMR 2026
  • [author-year/source-list] DeepMind 2025
  • [author-year/source-list] Dreeben et al. 2013
  • [author-year/source-list] EMNLP 2011
  • [author-year/source-list] EMNLP 2024
  • [author-year/source-list] Editor 2019
  • [author-year/source-list] Ehret & Taboada 2024
  • [author-year/source-list] Eisenstein 2013
  • [author-year/source-list] ElevenLabs 2025
  • [author-year/source-list] Erickson & Rossi 1979
  • [author-year/source-list] Extends 2013
  • [author-year/source-list] Fox Tree & Schrock 2002
  • [author-year/source-list] Fraser 1999
  • [author-year/source-list] Fraundorf & Watson 2011
  • [author-year/source-list] Goffman 1978
  • [author-year/source-list] Goldman-Eisler 1968
  • [author-year/source-list] Grammer & Eibl-Eibesfeldt 1990
  • [author-year/source-list] Guo et al. 2021
  • [author-year/source-list] Halliday & Hasan 1976
  • [author-year/source-list] Hansen 1998
  • [author-year/source-list] Heeman & Allen 1999
  • [author-year/source-list] Heinz 2003
  • [author-year/source-list] Heritage 1984
  • [author-year/source-list] Heritage 2013
  • [author-year/source-list] Hesson & Shellgren 2015
  • [author-year/source-list] Hosom 2009
  • [author-year/source-list] ICASSP 2022
  • [author-year/source-list] ICASSP 2023
  • [author-year/source-list] ICASSP 2024
  • [author-year/source-list] Interspeech 2014
  • [author-year/source-list] Interspeech 2022
  • [author-year/source-list] Interspeech 2023
  • [author-year/source-list] Ju et al. 2024
  • [author-year/source-list] Jucker & Smith 1998
  • [author-year/source-list] Jurafsky & Manning 2010
  • [author-year/source-list] Kabat-Zinn 1990
  • [author-year/source-list] Kalman & Gergle 2014
  • [author-year/source-list] Kharitonov et al. 2023
  • [author-year/source-list] Kim et al. 2021
  • [author-year/source-list] Klatt 1976
  • [author-year/source-list] Komaniecki 2017
  • [author-year/source-list] Kreiman & Sidtis 2011
  • [author-year/source-list] Kunshan 2025
  • [author-year/source-list] Kwon et al. Interspeech 2025
  • [author-year/source-list] Laborde 2022
  • [author-year/source-list] Lamontagne & McCulloch 2017
  • [author-year/source-list] Languages 2025
  • [author-year/source-list] Le et al. 2023
  • [author-year/source-list] Levelt 1983
  • [author-year/source-list] Levelt 1989
  • [author-year/source-list] Lewis 2006
  • [author-year/source-list] Li et al. 2023
  • [author-year/source-list] Li et al. 2024
  • [author-year/source-list] Liberman & Streeter 1978
  • [author-year/source-list] Lickley & McKelvie 1999
  • [author-year/source-list] Lickley 2001
  • [author-year/source-list] Linguistics 2024
  • [author-year/source-list] Lison & Halvorsen 2024
  • [author-year/source-list] Love et al. 2017
  • [author-year/source-list] Lutz 2008
  • [author-year/source-list] MCIS 2010
  • [author-year/source-list] MDPI 2025
  • [author-year/source-list] Maclay & Osgood 1959
  • [author-year/source-list] MarkTechPost 2026
  • [author-year/source-list] Matsunaga et al. 2022
  • [author-year/source-list] McCulloch 2019
  • [author-year/source-list] Mihkla et al. 2023
  • [author-year/source-list] Mosegaard Hansen 1998
  • [author-year/source-list] Moshi 2024
  • [author-year/source-list] Mukherjee et al. 2024
  • [author-year/source-list] Mulaney 2025
  • [author-year/source-list] NeurIPS 2022
  • [author-year/source-list] NeurIPS 2024
  • [author-year/source-list] Nguyen et al. 2024
  • [author-year/source-list] Olszewski 2024
  • [author-year/source-list] Patel & Daniele 2003
  • [author-year/source-list] Patel 2008
  • [author-year/source-list] Pavalanathan & Eisenstein 2015
  • [author-year/source-list] Pichler & Levey 2011
  • [author-year/source-list] Pichler 2010
  • [author-year/source-list] Pickering 2009
  • [author-year/source-list] Prosody 2014
  • [author-year/source-list] Provine 1993
  • [author-year/source-list] Provine 1996
  • [author-year/source-list] Provine 2000
  • [author-year/source-list] Radford et al. 2022
  • [author-year/source-list] Roy & Renals 2010
  • [author-year/source-list] Rubenstein et al. 2023
  • [author-year/source-list] Russo et al. 2025
  • [author-year/source-list] SIGDIAL 2018
  • [author-year/source-list] Saito et al. 2022
  • [author-year/source-list] Salmon 2013
  • [author-year/source-list] Schank & Abelson 1977
  • [author-year/source-list] Schegloff 1982
  • [author-year/source-list] Schettino et al. 2023
  • [author-year/source-list] Schiffrin 1987
  • [author-year/source-list] Schourup 1999
  • [author-year/source-list] Sharon & John 2024
  • [author-year/source-list] Shriberg 1994
  • [author-year/source-list] SponTTS 2024
  • [author-year/source-list] Stivers et al. 2009
  • [author-year/source-list] Stolcke et al. 2000
  • [author-year/source-list] Sundberg 1987
  • [author-year/source-list] Tagliamonte 2005
  • [author-year/source-list] Tartory et al. 2024
  • [author-year/source-list] Torgersen et al. 2011
  • [author-year/source-list] Tottie 2011
  • [author-year/source-list] Fox Tree & Schrock 1999
  • [author-year/source-list] Fox Tree 2001
  • [author-year/source-list] Fox Tree 2002
  • [author-year/source-list] Fox Tree 2010
  • [author-year/source-list] Trouvain & Truong 2014
  • [author-year/source-list] Verdonik 2025
  • [author-year/source-list] Verdonik et al. 2025
  • [author-year/source-list] Vettin & Todt 2004
  • [author-year/source-list] Wagner et al. 2023
  • [author-year/source-list] Wang et al. 2023
  • [author-year/source-list] Watanabe et al. 2008
  • [author-year/source-list] Wennerstrom 2001
  • [author-year/source-list] Wieling 2016
  • [author-year/source-list] Wieling et al. 2016
  • [author-year/source-list] Xiong et al. 2016
  • [author-year/source-list] Xu & Xu 2005
  • [author-year/source-list] Yamamoto et al. 2024
  • [author-year/source-list] Yan & Kambara 2021
  • [author-year/source-list] Yang et al. 2024
  • [author-year/source-list] Yang et al. NWAV 2018
  • [author-year/source-list] Yuan & Liberman 2006
  • [author-year/source-list] Zeghidour et al. 2021
  • [author-year/source-list] Zhao 2024