Aug 22, 2025

Refresh Your YouTube Videos using OpenAI and ElevenLabs

Transform your livestreams and presentations into polished YouTube content using AI tools like ElevenLabs, GPT-4, and FFmpeg. Complete tutorial with code.

Introduction

Creating polished, professional video content has never been more accessible. In today’s fast-paced digital landscape, content creators are constantly seeking ways to improve their production quality while streamlining their workflow. We’ve developed an innovative approach to transforming existing long-form presentations and livestreams into refined, professional-grade content using AI-powered tools.

This comprehensive tutorial explores a complete pipeline that uses modern AI tools to re-record and refresh your live presentations into sleek videos that will make the YouTube algorithm happy. We’ll walk through every step of the process, from extracting and polishing transcripts to generating synthetic narration using ElevenLabs and creating professional slide presentations. By the end of this guide, you’ll have the knowledge and tools to transform your content into polished, engaging material that resonates with broader audiences.

As usual, you can follow along with either the video (now re-recorded and refreshed) or the written tutorial.

Video Refresh Motivation

Audience Shift: Live presentations work for real-time engagement, but YouTube viewers expect different pacing and fewer interruptions. Converting live content requires removing meeting delays, technical issues, and Q&A segments that don't serve asynchronous viewers.

Technical Updates: Content becomes outdated quickly in tech fields. Video refresh preserves core insights while updating delivery and correcting obsolete information, rather than recreating from scratch.

Production Quality: Natural conversational speech doesn't always work for recorded content. Refreshing allows audio cleanup and more professional delivery while maintaining authentic messaging.

Global Reach: AI tools like ElevenLabs enable multi-language content creation without requiring native fluency, expanding audience reach and democratizing content distribution.

The Complete Pipeline Overview

The transformation process involves six key stages, each designed to build upon the previous step while maintaining content integrity and improving presentation quality.

  1. Transcript Extraction and Processing

  2. Content Summarization and Enhancement

  3. Visual Enhancement with Modern Design Tools

  4. Synthetic Voice Generation and Cloning

  5. Video Assembly and Synchronization

  6. Final Assembly and Distribution

You can download all the code we are using for this tutorial at https://github.com/godfreynolan/rerecording.

Stage 1: Transcript Extraction and Processing

The foundation begins with extracting accurate transcripts from existing YouTube videos. Using the YouTube Transcript API, we can programmatically download both the text content and timing information for every segment of our original video. This creates the raw material for our transformation process. Open up step1.py in the example code, and you should see this block.

import os

import pandas as pd
from youtube_transcript_api import YouTubeTranscriptApi

def generate_summaries(video_id, xlsx_path, output_folder="summaries"):
    os.makedirs(output_folder, exist_ok=True)

    # Load and prepare data from the slide-timing spreadsheet
    df = pd.read_excel(xlsx_path)
    df['start_sec'] = df['Start time'].apply(time_to_seconds)

    # Get YouTube transcript for the original video
    transcript = YouTubeTranscriptApi.get_transcript(video_id)

    # Build time ranges for all slides
    slide_ranges = []
    for i in range(len(df)):
        start = df.loc[i, 'start_sec']
        if i + 1 < len(df):
            end = df.loc[i + 1, 'start_sec']
        else:
            end = float('inf')
        slide_ranges.append((start, end))

    # Process each slide: collect its transcript lines and summarize the kept ones
    for idx, (start, end) in enumerate(slide_ranges):
        slide_number = df.loc[idx, 'Slide Number']
        keep = df.loc[idx, 'skip/keep'] == 1
        lines = [entry['text'] for entry in transcript if start <= entry['start'] < end]
        raw_text = "\n".join(lines)

        if keep:
            summary = summarize(raw_text)
            summary_path = os.path.join(output_folder, f"slide_{slide_number}.txt")
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(summary)
            print(f"slide_{slide_number} summary saved.")
        else:
            print(f"Skipped slide_{slide_number}")

With the handy youtube_transcript_api you can see it is very easy to pull in the transcript from the video ID, but to make this useful for the rest of the generation and compositing process we also need to prepare an Excel file with the timestamps for our slides. The rest of the code then slots the transcript into the appropriate slides so it can be used for script generation and pacing later.
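
The code above calls a time_to_seconds() helper that isn’t shown in the snippet. A minimal version, assuming the spreadsheet stores its Start time values as HH:MM:SS or MM:SS strings, might look like this:

def time_to_seconds(value):
    # Accept either "HH:MM:SS" or "MM:SS" timestamps from the spreadsheet
    parts = [int(p) for p in str(value).split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes) with zero
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds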

Slide-to-Transcript Mapping

The Excel spreadsheet serves as the project’s organizational backbone, containing crucial information about slide timing, content inclusion decisions, and file naming conventions. Key columns include:

  • Slide Number: Sequential identification for organization

  • Start Time: Timestamp marking when each slide begins in the original video

  • Skip/Keep Column: Binary decision flag for content inclusion

  • PNG Name: Generated filename for corresponding slide image

This structured approach allows for selective content refinement, enabling creators to focus improvement efforts on the most valuable segments while excluding outdated or less relevant material.
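
For illustration, the first few rows of such a spreadsheet (the values here are made up) and the keep filter might look like this when loaded with pandas:

import pandas as pd

df = pd.read_excel("slides.xlsx")  # hypothetical filename
print(df.head())
#    Slide Number Start time  skip/keep      PNG Name
# 0             1    0:00:00          1  slide_01.png
# 1             2    0:02:15          0  slide_02.png
# 2             3    0:05:40          1  slide_03.png

# Only rows flagged 1 in the skip/keep column are carried through the pipeline
keep_df = df[df['skip/keep'] == 1]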

The final section of code in the block above takes the slides marked keep and ships them off to ChatGPT to be cleaned up for narration.

Stage 2: Content Summarization and Enhancement

Now we can leverage ChatGPT to transform these transcripts into polished, concise narration blocks before passing them back.

def summarize(text, prompt="Rewrite this text as narration:"):
    try:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "Create a summary for the following slide as if you were the one presenting. Get rid of any 'um's and 'ah's."},
                {"role": "user", "content": f"{prompt}\\n\\n{text}"}
            ]
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"[OpenAI Error] {e}")
        return text

The GPT-4 call removes filler words, tightens language, and maintains the core message. This step is crucial for creating content that feels intentional and polished rather than spontaneous. The output from this will be what we pass off to ElevenLabs for our audio generation.
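
The snippet assumes an OpenAI client named client was created earlier in step1.py. With the current openai Python package (v1 and later), and assuming your key is exported as the OPENAI_API_KEY environment variable, that setup is just:

from openai import OpenAI

# Reads OPENAI_API_KEY from the environment by default
client = OpenAI()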

Stage 3: Visual Enhancement with Modern Design Tools

To make our new slides for the video, we decided to go with gamma.app, which provides AI-powered slide recreation that transforms basic presentations into visually compelling, professionally designed content.

The process involves uploading existing slide content to Gamma and allowing the AI to suggest modern layouts, color schemes, and visual elements. While this step requires manual review and refinement, the dramatic improvement in visual quality justifies the effort investment.

Once you have a workable presentation, open up step2.py in the example files.

Navigate to convert_pptx_to_pdf_and_images() to see how we get our slides into the pipeline. Because PowerPoint files can’t easily be converted straight to .png, we first convert them to PDF as an intermediate step. Note that this function shells out to LibreOffice (the soffice command) and ImageMagick (the convert command), so both need to be installed.

def convert_pptx_to_pdf_and_images(pptx_path, pdf_path, image_dir):
    print("Converting PPTX → PDF...")
    subprocess.run([
        'soffice', '--headless', '--convert-to', 'pdf', pptx_path,
        '--outdir', WORK_DIR
    ], check=True)

    print("Converting PDF → PNGs...")

    subprocess.run([
        'convert', '-density', '300', pdf_path,
        '-quality', '100', os.path.join(image_dir, 'slide_%02d.png')
    ], check=True)
    print(f"Slide images saved in: {image_dir}")

You’ll notice the next function is generate_audio_for_slide(), but first we are going to need a voice for our narration. Off to ElevenLabs we go.

Stage 4: Synthetic Voice Generation and Cloning

ElevenLabs enables the creation of custom voice clones that maintain the presenter’s authentic sound while delivering cleaner, more consistent audio quality. The professional voice cloning service requires a minimum of 30 minutes of source audio, though several hours of content produce superior results.

Once you have a voice that sounds authentic, it’s relatively easy to generate the audio for each of the slides. If you remember from before, we had GPT-4 clean up our original transcript, slide by slide. Moving down the step2.py file, find generate_audio_for_slide() so we can look at it in more detail.

def generate_audio_for_slide(row):
    png_file = row['png_name']
    slide_idx = os.path.splitext(png_file)[0].split('_')[1]
    text_path = os.path.join(TEXT_FOLDER, f"slide_{row['Slide Number']}.txt")
    audio_path = os.path.join(WORK_DIR, f"audio_{slide_idx}.mp3")
    
    if not os.path.exists(text_path):
        raise FileNotFoundError(f"Text not found for slide {row['Slide Number']}: {text_path}")
    
    with open(text_path, 'r') as sf:
        rewritten_text = sf.read()

    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{ELEVEN_VOICE_ID}",
        headers={"xi-api-key": ELEVEN_API_KEY, "Content-Type": "application/json"},
        json={
            "text": rewritten_text,
            "model_id": "eleven_multilingual_v2",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
        }
    )
    
    if response.status_code != 200:
        raise Exception(f"Audio generation failed for slide {row['Slide Number']}: {response.text}")
    
    with open(audio_path, 'wb') as out:
        out.write(response.content)
    print(f"Audio saved: {audio_path}")

The call to ElevenLabs is mostly boilerplate; {ELEVEN_VOICE_ID} is how you select your specific voice. For each of the slide .png files from the previous step, we generate audio in our new voice using the corresponding GPT-4 narration summary produced by step1.py.
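
If you are not sure which voice ID to plug in, the ElevenLabs voices endpoint will list every voice on your account, including any clones you have created. A quick sketch using the same API key as above:

import requests

ELEVEN_API_KEY = "your-api-key"  # same constant used in step2.py

response = requests.get(
    "https://api.elevenlabs.io/v1/voices",
    headers={"xi-api-key": ELEVEN_API_KEY},
)
response.raise_for_status()

# Print each voice's name next to the ID you would use for ELEVEN_VOICE_ID
for voice in response.json()["voices"]:
    print(voice["name"], voice["voice_id"])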

Stage 5: Video Assembly and Synchronization

With polished audio and updated visuals in hand, we combine these elements into synchronized video segments. FFmpeg, the granddaddy of all video editing software, handles the technical work of merging each static slide image with its corresponding audio track into a seamless video.

def generate_video_for_slide(row):
    png_file = row['png_name']
    slide_idx = os.path.splitext(png_file)[0].split('_')[1]
    image_path = os.path.join(IMAGE_DIR, png_file)
    audio_path = os.path.join(WORK_DIR, f"audio_{slide_idx}.mp3")
    video_path = os.path.join(WORK_DIR, f"slide_{slide_idx}.mp4")
    if not os.path.exists(image_path) or not os.path.exists(audio_path):
        print(f"Skipping video for slide {row['Slide Number']}: missing {image_path if not os.path.exists(image_path) else audio_path}")
        return None
    print(f"Generating video for slide {row['Slide Number']}...")
    subprocess.run([
        FFMPEG_PATH, '-y',
        '-loop', '1', '-i', image_path,
        '-i', audio_path,
        '-c:v', 'libx264', '-tune', 'stillimage',
        '-c:a', 'aac', '-b:a', '192k',
        '-pix_fmt', 'yuv420p', '-shortest', video_path
    ], check=True)
    print(f"Video saved: {video_path}")
    return video_path

This function creates individual video files for each slide, ensuring proper encoding and synchronization between visual and audio elements. The FFmpeg parameters optimize for static images with audio overlay, creating smooth transitions and consistent quality.

There’s one more intermediary step before assembly. We generate a list of our video files so they can be combined in the right order with the right timing.

def generate_videos(keep_dataframe):
    videos = []
    for _, row in keep_dataframe.iterrows():
        vid = generate_video_for_slide(row)
        if vid:
            videos.append(vid)
    return videos

Stage 6: Final Assembly and Distribution

The final step concatenates all individual slide videos into a complete presentation, ready for upload to YouTube or other distribution platforms.

def concatenate_videos(video_paths, output_file):
    print("Concatenating videos...")
    concat_txt = os.path.join(WORK_DIR, 'concat_list.txt')
    with open(concat_txt, 'w') as f:
        for path in video_paths:
            f.write(f"file '{os.path.abspath(path)}'\\n")
    cmd = [FFMPEG_PATH, '-f', 'concat', '-safe', '0', '-i', concat_txt, '-c', 'copy', output_file]
    subprocess.run(cmd, check=True)
    os.remove(concat_txt)
    print(f"Final video: {output_file}")

The concatenation process seamlessly joins individual segments while maintaining video quality and ensuring smooth playback across the entire presentation. FFmpeg is truly awesome and takes a lot of the pain out of video encoding.
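
To see how the stages hang together end to end, here is a simplified driver in the spirit of step2.py; the file names and constants are illustrative, not copied from the repo:

import pandas as pd

# Illustrative paths; the real values live as constants in step2.py
XLSX_PATH = "slides.xlsx"
PPTX_PATH = "presentation.pptx"
PDF_PATH = "presentation.pdf"
IMAGE_DIR = "slide_images"
OUTPUT_FILE = "final_video.mp4"

def main():
    # 1. Render the redesigned deck to per-slide PNGs
    convert_pptx_to_pdf_and_images(PPTX_PATH, PDF_PATH, IMAGE_DIR)

    # 2. Keep only the slides flagged in the spreadsheet
    df = pd.read_excel(XLSX_PATH)
    keep_df = df[df['skip/keep'] == 1]

    # 3. Generate ElevenLabs narration, build per-slide videos, then concatenate
    for _, row in keep_df.iterrows():
        generate_audio_for_slide(row)
    videos = generate_videos(keep_df)
    concatenate_videos(videos, OUTPUT_FILE)

if __name__ == "__main__":
    main()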

Quality Control and Human Oversight

Error Prevention and Correction

Common issues include transcript misalignment, slide timing discrepancies, and technical term mispronunciation. Implementing checkpoints throughout the process prevents these issues from propagating to the final output. The systematic approach allows for easy identification and correction of problems at each stage.
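
One cheap checkpoint worth adding before final assembly is a sanity check that every kept slide actually has its narration text, image, and audio on disk. A minimal sketch, assuming the folder layout and naming conventions used in the code above (this helper is not part of the example repo):

import os

def check_slide_assets(keep_dataframe, text_folder, image_dir, work_dir):
    # Report any kept slide missing its narration .txt, slide .png, or audio .mp3
    problems = []
    for _, row in keep_dataframe.iterrows():
        slide_num = row['Slide Number']
        slide_idx = os.path.splitext(row['png_name'])[0].split('_')[1]
        expected = [
            os.path.join(text_folder, f"slide_{slide_num}.txt"),
            os.path.join(image_dir, row['png_name']),
            os.path.join(work_dir, f"audio_{slide_idx}.mp3"),
        ]
        problems += [p for p in expected if not os.path.exists(p)]
    for path in problems:
        print(f"Missing asset: {path}")
    return not problems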

Cost Considerations and Resource Planning

Monthly Subscription Requirements

The complete pipeline requires several paid services:

  • ElevenLabs Professional - $22/month (voice cloning)

  • Gamma Pro - $10/month (advanced design)

  • OpenAI API - Usage-based (minimal cost for summarization)

Time Investment Analysis

Initial Overhead: 4-6 hours processing time to condense a 47-minute presentation into 10 minutes of polished content.

Efficiency Gains: Setup time decreases with experience. Established workflows streamline the process.

Value Returns: Higher content quality, multi-language reach, and updatable content without full recreation justify the time investment.

Current Limitations and Areas for Improvement

Technical Constraints

Voice cloning occasionally retains some original speech patterns, including the “ums” and “ahs” the process aims to eliminate. This occurs when the training data heavily features these patterns. Future improvements in AI training methodology may address this limitation.

Manual Process Requirements

Several pipeline stages still require manual intervention, limiting scalability for high-volume content production. Automation opportunities exist in slide timing detection, content quality assessment, and visual design optimization.

Conclusion

By completing this tutorial, you've learned to transform live presentations into polished YouTube content using AI tools: extracting transcripts, cleaning narration with GPT-4, generating synthetic audio with ElevenLabs, redesigning slides with Gamma, and assembling everything with FFmpeg. You now have a complete pipeline to refresh existing content, expand into new languages, and create professional videos without starting from scratch. Adapt these techniques to your own presentations and take your YouTube content game to the next level.

Additional Resources

If you had fun with this tutorial, be sure to join the OpenAI Application Explorers Meetup Group to learn more about awesome apps you can build with AI.