Automate the creation of handout notes using Amazon Bedrock Data Automation

Organizations across various sectors face significant challenges when converting meeting recordings or recorded presentations into structured documentation. Creating handouts from presentations requires considerable manual effort: reviewing recordings to identify slide transitions, transcribing spoken content, capturing and organizing screenshots, synchronizing visual elements with speaker notes, and formatting content. These challenges impact productivity and scalability, especially when dealing with multiple presentation recordings, conference sessions, training materials, and educational content.

In this post, we show how you can build an automated, serverless solution to transform webinar recordings into comprehensive handouts using Amazon Bedrock Data Automation for video analysis. We walk you through the implementation of Amazon Bedrock Data Automation to transcribe and detect slide changes, as well as the use of Amazon Bedrock foundation models (FMs) for transcription refinement, combined with custom AWS Lambda functions orchestrated by AWS Step Functions. Through implementation details, architectural patterns, and code, you will learn how to build a workflow that automates the handout creation process.

Amazon Bedrock Data Automation

Amazon Bedrock Data Automation uses generative AI to automate the transformation of multimodal data (such as images, videos and more) into a customizable structured format. Examples of structured formats include summaries of scenes in a video, unsafe or explicit content in text and images, or organized content based on advertisements or brands. The solution presented in this post uses Amazon Bedrock Data Automation to extract audio segments and different shots in videos.

Solution overview

Our solution uses a serverless architecture orchestrated by Step Functions to process presentation recordings into comprehensive handouts. The workflow consists of the following steps:

  1. The workflow begins when a video is uploaded to Amazon Simple Storage Service (Amazon S3), which triggers an event notification through Amazon EventBridge rules that initiates our video processing workflow in Step Functions.
  2. After the workflow is triggered, Amazon Bedrock Data Automation initiates a video transformation job to identify different shots in the video. In our case, this is represented by a change of slides. The workflow moves into a waiting state, and checks for the transformation job progress. If the job is in progress, the workflow returns to the waiting state. When the job is complete, the workflow continues, and we now have extracted both visual shots and spoken content.
  3. These visual shots and spoken content feed into a synchronization step. In this Lambda function, we use the output of the Amazon Bedrock Data Automation job to match the spoken content to the corresponding shots based on their timestamps.
  4. After the function has matched the spoken content to the visual shots, the workflow moves into a parallel state. One of the steps of this state is the generation of screenshots. We use an FFmpeg-enabled Lambda function to create images for each identified video shot.
  5. The other step of the parallel state is the refinement of our transcriptions. Amazon Bedrock processes and improves each raw transcription section through a Map state. This helps us remove speech disfluencies and improve the sentence structure.
  6. Lastly, after the screenshots and refined transcript are created, the workflow uses a Lambda function to create handouts. We use the python-pptx library, which generates the final presentation with synchronized content. These final handouts are stored in Amazon S3 for distribution.

The following diagram illustrates this workflow.

AWS Step Functions workflow diagram for data automation process

If you want to try out this solution, we have created an AWS Cloud Development Kit (AWS CDK) stack available in the accompanying GitHub repo that you can deploy in your account. It deploys the Step Functions state machine to orchestrate the creation of handout notes from the presentation video recording. It also provides you with a sample video to test out the results.

To deploy and test the solution in your own account, follow the instructions in the GitHub repository’s README file. The following sections describe in more detail the technical implementation details of this solution.

Video upload and initial processing

The workflow begins with Amazon S3, which serves as the entry point for our video processing pipeline. When a video is uploaded to a dedicated S3 bucket, it triggers an event notification that, through EventBridge rules, initiates our Step Functions workflow.
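
As a rough sketch of how this trigger can be wired up in the AWS CDK (Python), a rule like the following can route S3 object-created events to the state machine. The construct IDs and the state_machine variable are illustrative, and the actual stack in the GitHub repository may differ:

# Inside the CDK stack's __init__ (sketch, not the repository's exact code)
from aws_cdk import aws_s3 as s3, aws_events as events, aws_events_targets as targets

# The bucket must emit events to EventBridge for the rule to fire
upload_bucket = s3.Bucket(self, "UploadBucket", event_bridge_enabled=True)

# Rule that matches "Object Created" events for the upload bucket
rule = events.Rule(
    self, "VideoUploadedRule",
    event_pattern=events.EventPattern(
        source=["aws.s3"],
        detail_type=["Object Created"],
        detail={"bucket": {"name": [upload_bucket.bucket_name]}},
    ),
)

# Start the Step Functions state machine (defined elsewhere in the stack)
rule.add_target(targets.SfnStateMachine(state_machine))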

Shot detection and transcription using Amazon Bedrock Data Automation

This step uses Amazon Bedrock Data Automation to detect slide transitions and create video transcriptions. To integrate this as part of the workflow, you must create an Amazon Bedrock Data Automation project. A project is a grouping of output configurations. Each project can contain standard output configurations as well as custom output blueprints for documents, images, video, and audio. The project has already been created as part of the AWS CDK stack. After you set up your project, you can process content using the InvokeDataAutomationAsync API. In our solution, we use the Step Functions service integration to execute this API call and start the asynchronous processing job. A job ID is returned for tracking the process.
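
For reference, a direct SDK call looks roughly like the following boto3 sketch. The S3 URIs and ARN variables are placeholders; in the solution itself, the Step Functions service integration makes this call rather than a Lambda function:

import boto3

# Bedrock Data Automation jobs are invoked through a dedicated runtime client
bda_runtime = boto3.client("bedrock-data-automation-runtime")

response = bda_runtime.invoke_data_automation_async(
    inputConfiguration={"s3Uri": "s3://my-input-bucket/presentation.mp4"},  # uploaded recording (placeholder)
    outputConfiguration={"s3Uri": "s3://my-output-bucket/bda-results/"},    # where results are written (placeholder)
    dataAutomationConfiguration={
        "dataAutomationProjectArn": data_automation_project_arn,  # project created by the CDK stack
        "stage": "LIVE",
    },
    dataAutomationProfileArn=data_automation_profile_arn,
)

# The invocation ARN is used later to poll the job status
invocation_arn = response["invocationArn"]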

The workflow must now check the status of the processing job before continuing with the handout creation process. This is done by periodically polling Amazon Bedrock Data Automation for the job status using the GetDataAutomationStatus API. Using a combination of the Step Functions Wait and Choice states, we can ask the workflow to poll the API at a fixed interval. This not only gives you the ability to customize the interval depending on your needs, but it also helps you control workflow costs, because every state transition is billed in Standard workflows, which this solution uses.

When the GetDataAutomationStatus API output shows as SUCCESS, the loop exits and the workflow continues to the next step, which will match transcripts to the visual shots.
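
In the workflow, this loop is implemented with Wait and Choice states rather than code, but the equivalent logic in boto3 looks roughly like the following sketch (the 30-second interval and the exact status strings are assumptions):

import time
import boto3

bda_runtime = boto3.client("bedrock-data-automation-runtime")

# Poll the asynchronous job until it finishes
while True:
    status_response = bda_runtime.get_data_automation_status(invocationArn=invocation_arn)
    status = status_response["status"]
    if status == "Success":  # exact casing of the status value may differ
        break
    if status in ("ServiceError", "ClientError"):
        raise RuntimeError(f"Data Automation job failed with status {status}")
    time.sleep(30)  # fixed interval, analogous to the Wait state duration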

Matching audio segments with corresponding shots

To create comprehensive handouts, you must establish a mapping between the visual shots and their corresponding audio segments. This mapping is crucial to make sure the final handouts accurately represent both the visual content and the spoken narrative of the presentation.

A shot represents a series of interrelated consecutive frames captured during the presentation, typically indicating a distinct visual state. In our presentation context, a shot corresponds to either a new slide or a significant slide animation that adds or modifies content.

An audio segment is a specific portion of an audio recording that contains uninterrupted spoken language, with minimal pauses or breaks. This segment captures a natural flow of speech. The Amazon Bedrock Data Automation output provides an audio_segments array, with each segment containing precise timing information such as the start and end time of each segment. This allows for accurate synchronization with the visual shots.

The synchronization between shots and audio segments is critical for creating accurate handouts that preserve the presentation’s narrative flow. To achieve this, we implement a Lambda function that manages the matching process in three steps:

  1. The function retrieves the processing results from Amazon S3, which contains both the visual shots and audio segments.
  2. It creates structured JSON arrays from these components, preparing them for the matching algorithm.
  3. It executes a matching algorithm that compares the timestamps of the audio segments and the shots and matches them, taking into account timestamp overlaps between shots and audio segments.

For each shot, the function examines the audio segments and identifies those whose timestamps overlap with the shot’s duration, making sure the relevant spoken content is associated with its corresponding slide in the final handouts. The function returns the matched results directly to the Step Functions workflow, where they serve as input for the next steps: refining the transcribed content with Amazon Bedrock and generating screenshots in parallel.
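
The following is a minimal sketch of this overlap matching. The input field names (start_ms, end_ms, text) are illustrative and do not mirror the exact Amazon Bedrock Data Automation output schema:

def match_segments_to_shots(shots, audio_segments):
    """Attach every audio segment that overlaps a shot's time window to that shot."""
    matched = []
    for shot in shots:
        shot_start, shot_end = shot["start_ms"], shot["end_ms"]
        # A segment belongs to this shot if its time range overlaps the shot's range
        overlapping = [
            seg for seg in audio_segments
            if seg["start_ms"] < shot_end and seg["end_ms"] > shot_start
        ]
        matched.append({
            "shot_start_ms": shot_start,
            "shot_end_ms": shot_end,
            "transcript": " ".join(seg["text"] for seg in overlapping),
        })
    return matched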

Screenshot generation

After you get the timestamps of each shot and associated audio segment, you can capture the slides of the presentation to create comprehensive handouts. Each detected shot from Amazon Bedrock Data Automation represents a distinct visual state in the presentation—typically a new slide or significant content change. By generating screenshots at these precise moments, we make sure our handouts accurately represent the visual flow of the original presentation.

This is done with a Lambda function using the ffmpeg-python library. This library acts as a Python binding for the FFmpeg media framework, so you can run FFmpeg commands using Python methods. In our case, we extract frames from the video at the specific timestamps identified by Amazon Bedrock Data Automation. The screenshots are stored in an S3 bucket for use when creating the handouts. To use ffmpeg-python in Lambda, we created a Lambda ZIP deployment containing the required dependencies to run the code. Instructions on how to create the ZIP file can be found in our GitHub repository.

The following code shows how a screenshot is taken using ffmpeg-python. You can view the full Lambda code on GitHub.

## Taking a screenshot at a specific timestamp 
ffmpeg.input(video_path, ss=timestamp).output(screenshot_path, vframes=1).run()

Transcript refinement with Amazon Bedrock

In parallel with the screenshot generation, we refine the transcript using a large language model (LLM). We do this to improve the quality of the transcript and filter out errors and speech disfluencies. This process uses an Amazon Bedrock model to enhance the quality of the matched transcription segments while maintaining content accuracy. We use a Lambda function that integrates with Amazon Bedrock through the Python Boto3 client, using a prompt to guide the model’s refinement process. The function can then process each transcript segment, instructing the model to do the following:

  • Fix typos and grammatical errors
  • Remove speech disfluencies (such as “uh” and “um”)
  • Maintain the original meaning and technical accuracy
  • Preserve the context of the presentation

In our solution, we used the following prompt with three example inputs and outputs:

prompt=""'This is the result of a transcription. 
I want you to look at this audio segment and fix the typos and mistakes present. 
Feel free to use the context of the rest of the transcript to refine (but don't leave out any info). 
Leave out parts where the speaker misspoke. 
Make sure to also remove words like "uh" or "um". 
Only make changes to the info or sentence structure when there are mistakes. 
Only give back the refined transcript as output, don't add anything else or any context or title. 
If there are no typos or mistakes, return the original object input. 
Do not explain why you have or have not made any changes; I just want the JSON object. 

These are examples: 
Input:  
Output: 

Input:  
Output: 

Input:  
Output: 

Here is the object: ''' + text

The following is an example input and output:

Input: Yeah. Um, so let's talk a little bit about recovering from a ransomware attack, right?

Output: Yes, let's talk a little bit about recovering from a ransomware attack.

To optimize processing speed while adhering to the maximum token limits of the Amazon Bedrock InvokeModel API, we use the Step Functions Map state. This enables parallel processing of multiple transcriptions, each corresponding to a separate video segment. Because these transcriptions must be handled individually, the Map state efficiently distributes the workload. Additionally, it reduces operational overhead by managing the integration: taking an array as input, passing each element to the Lambda function, and automatically reconstructing the array upon completion. The Map state returns the refined transcript directly to the Step Functions workflow, maintaining the structure of the matched segments while providing cleaner, more professional text content for the final handout generation.
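
As an illustration, a per-segment refinement call might look like the following boto3 sketch, which uses the Bedrock Converse API. The model ID is an assumption, and the actual Lambda function may use InvokeModel with a model-specific request body instead:

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def refine_segment(text, prompt_template):
    # Build the full prompt by appending the raw transcript segment
    prompt = prompt_template + text

    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model; any Bedrock text model works
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )

    # The refined transcript comes back as the first text block of the reply
    return response["output"]["message"]["content"][0]["text"]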

Handout generation

The final step in our workflow involves creating the handouts using the python-pptx library. This step combines the refined transcripts with the generated screenshots to create a comprehensive presentation document.

The Lambda function processes the matched segments sequentially, creating a new slide for each screenshot while adding the corresponding refined transcript as speaker notes. The implementation uses a custom Lambda layer containing the python-pptx package. To enable this functionality in Lambda, we created a custom layer using Docker. By using Docker to create our layer, we make sure the dependencies are compiled in an environment that matches the Lambda runtime. You can find the instructions to create this layer and the layer itself in our GitHub repository.

The Lambda function implementation uses python-pptx to create structured presentations:

import os
import boto3  # used by the full function to download screenshots and upload the deck (omitted here)
from pptx import Presentation

def lambda_handler(event, context):
    # The refined transcript segments and screenshot locations are passed in
    # by the Step Functions workflow (the event field names here are illustrative)
    transcription_segments = event['refined_segments']
    image_paths = event['screenshot_paths']  # screenshots already downloaded from S3 to /tmp
    tmp_dir = '/tmp'

    # Create new presentation with 16:9 widescreen dimensions (values in EMU)
    prs = Presentation()
    prs.slide_width = int(12192000)   # Standard presentation width
    prs.slide_height = int(6858000)   # Standard presentation height

    # Process each segment: one slide per detected shot
    for i, image_path in enumerate(image_paths):
        # Add new slide
        slide = prs.slides.add_slide(prs.slide_layouts[5])

        # Add screenshot as full-slide image
        slide.shapes.add_picture(image_path, 0, 0, width=prs.slide_width)

        # Add the matching refined transcript as speaker notes
        notes_slide = slide.notes_slide
        transcription_text = transcription_segments[i].get('transcript', '')
        notes_slide.notes_text_frame.text = transcription_text

    # Save presentation to the Lambda ephemeral storage before distribution
    pptx_path = os.path.join(tmp_dir, "lecture_notes.pptx")
    prs.save(pptx_path)

The function processes segments sequentially, creating a presentation that combines visual shots with their corresponding audio segments, resulting in handouts ready for distribution.

The following screenshot shows an example of a generated slide with notes. The full deck has been added as a file in the GitHub repository.

Slide presentation showing an example output

Conclusion

In this post, we demonstrated how to build a serverless solution that automates the creation of handout notes from recorded slide presentations. By combining Amazon Bedrock Data Automation with custom Lambda functions, we’ve created a scalable pipeline that significantly reduces the manual effort required in creating handout materials. Our solution addresses several key challenges in content creation:

  • Automated detection of slide transitions, content changes, and accurate transcription of spoken content using the video modality capabilities of Amazon Bedrock Data Automation
  • Intelligent refinement of transcribed text using Amazon Bedrock
  • Synchronized visual and textual content with a custom matching algorithm
  • Handout generation using the ffmpeg-python and python-pptx libraries in Lambda

The serverless architecture, orchestrated by Step Functions, provides reliable execution while maintaining cost-efficiency. By using Python packages for FFmpeg and a Lambda layer for python-pptx, we’ve overcome technical limitations and created a robust solution that can handle various presentation formats and lengths. This solution can be extended and customized for different use cases, from educational institutions to corporate training programs. Certain steps, such as the transcript refinement, can also be improved, for instance by adding translation capabilities to support diverse audiences.

To learn more, refer to the Amazon Bedrock Data Automation documentation and the accompanying GitHub repository.


About the authors

Laura Verghote is the GenAI Lead for PSI Europe at Amazon Web Services (AWS), driving generative AI adoption across public sector organizations. She partners with customers throughout Europe to accelerate their GenAI initiatives through technical expertise and strategic planning, bridging complex requirements with innovative AI solutions.

Elie Elmalem is a solutions architect at Amazon Web Services (AWS) and supports Education customers across the UK and EMEA. He works with customers to effectively use AWS services, providing architectural best practices, advice, and guidance. Outside of work, he enjoys spending time with family and friends and loves watching his favorite football team play.
