Building a Local AI Audiobook Pipeline with Python and Kokoro

Nima Moradi · 3 min read

During the holidays, I wanted to listen to a specific book series. However, I found myself in a common predicament: there were no official audiobooks available, and the existing mobile text-to-speech accessibility tools sounded too robotic for a novel.

So I decided to engineer a solution myself. I built a Python pipeline that ingests a PDF, parses the structure, and uses a high-quality local AI model to generate audio files.

Here is the technical breakdown of how I built it using PyMuPDF and Kokoro-82M.

The Architecture

The pipeline consists of three main stages:

1. Extraction: Reading the raw text from the PDF file.
2. Structuring: Using regular expressions to split the continuous text into logical chapters.
3. Synthesis: Feeding the text into the Kokoro-82M model to generate WAV files.

Step 1: Text Extraction with PyMuPDF

I chose PyMuPDF (imported as fitz) for this task because of its speed and accuracy in preserving text order.

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    full_text = ""
    for page in doc:
        # "text" mode returns plain text in natural reading order
        full_text += page.get_text("text") + "\n"
    doc.close()
    return full_text

Step 2: Intelligent Chapter Splitting

The raw text from a PDF is a single continuous stream. To make the audiobook navigable, I needed to split it by chapter. I used Python's re module to identify chapter headings (e.g., "Chapter 1" or "Chapter IV").

This function splits the text and returns a list of dictionaries, allowing me to process one chapter at a time.

import re

def split_into_chapters(full_text):
    # Split by "Chapter X" or "Chapter RomanNum"
    pattern = r"(?i)(Chapter\s+\d+|Chapter\s+[IVX]+)"
    parts = re.split(pattern, full_text)

    chapters = []

    # Check for Intro (text before the first "Chapter 1")
    if parts[0].strip():
        chapters.append({"title": "Intro", "content": parts[0]})

    # Group the rest (Title + Content)
    for i in range(1, len(parts), 2):
        header = parts[i]
        content = parts[i+1] if i+1 < len(parts) else ""
        chapters.append({"title": header, "content": content})

    return chapters
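A quick sanity check on a made-up sample string shows why this works: because the pattern is wrapped in a capturing group, `re.split` keeps each matched heading in the result, so headings and bodies alternate in the list.

```python
import re

pattern = r"(?i)(Chapter\s+\d+|Chapter\s+[IVX]+)"
sample = "Prologue text. Chapter 1 It began at dawn. Chapter II The road went on."
parts = re.split(pattern, sample)
print(parts)
# ['Prologue text. ', 'Chapter 1', ' It began at dawn. ', 'Chapter II', ' The road went on.']
```

That alternating structure is exactly what the `for i in range(1, len(parts), 2)` loop relies on.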

Step 3: Audio Generation with Kokoro

For the audio generation, I used Kokoro-82M (specifically the prince-canuma/Kokoro-82M implementation). This model is highly efficient and optimized for local inference, offering voice quality that rivals commercial cloud APIs without the cost or latency.

I configured the pipeline to save each chapter as a distinct WAV file.

import os

# Assumes the mlx-audio package, which provides generate_audio
from mlx_audio.tts.generate import generate_audio

def process_chapter(chapter, output_folder="output_audio"):
    """
    Generates audio for a single chapter dict ({"title", "content"}).
    """
    if not chapter:
        print("No text found.")
        return

    title = chapter["title"].strip()
    content = chapter["content"]

    print(f"Processing: {title}...")

    # Make sure the output directory exists before writing
    os.makedirs(output_folder, exist_ok=True)

    # MLX-Audio generation
    # Using the 'af_heart' voice profile for a natural tone
    generate_audio(
        text=content,
        model_path="prince-canuma/Kokoro-82M",
        voice="af_heart",
        output_file=f"{output_folder}/{title}.wav",
    )

    print(f"Done. Saved to {output_folder}/{title}.wav")
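One caveat: the chapter title goes straight into the file path, so a heading containing a slash, colon, or quote would produce an invalid filename. A small helper along these lines can normalize titles first (`safe_filename` is my own addition for illustration, not part of the pipeline above):

```python
import re

def safe_filename(title):
    # Replace path separators and other characters that are
    # invalid or risky in file names with underscores
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()

print(safe_filename('Chapter 1: "The Start"'))
# Chapter 1_ _The Start_
```

Calling `safe_filename(title)` before building the `.wav` path keeps the save step from failing on unusual headings.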

Conclusion

I built this simply because I wanted to listen to a book and no audio version existed.

It serves as a reminder that we don't always have to wait for official tools or features. With the current state of Python libraries and local AI, building your own custom automation is often easier than it seems.