[Programming question]: pdf2pptx #105

haesleinhuepf · 2024-11-23T05:58:24Z

Question: I would like to convert PDF files of slides into pptx files. A python script should do it.

Python Tools

List of Python libraries we should use for answering this:
- python-pptx
- a PDF library that is permissively licensed.

Git-bob think about this

github-actions · 2024-11-23T06:01:03Z

^{This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.}

Hey there! 👋

For converting PDF to PPTX, we can use pdf2image (based on poppler) to convert PDF pages to images and then use python-pptx to create slides. Here's a Jupyter notebook that does this:

# Convert PDF to PPTX
import io
from pdf2image import convert_from_path
from pptx import Presentation

def pdf_to_pptx(pdf_path, output_path):
    # Convert each PDF page to a PIL image
    images = convert_from_path(pdf_path)

    # Create presentation
    prs = Presentation()

    # Match the slide aspect ratio to the first page so pages are not stretched
    if images:
        page_width, page_height = images[0].size
        prs.slide_width = int(prs.slide_height * page_width / page_height)

    # For each image/page, create a slide
    for image in images:
        slide = prs.slides.add_slide(prs.slide_layouts[6])  # blank layout

        # Write the image to an in-memory PNG instead of a temporary file
        image_stream = io.BytesIO()
        image.save(image_stream, format="PNG")
        image_stream.seek(0)

        # Add image to slide, filling the whole slide
        slide.shapes.add_picture(image_stream, 0, 0,
                                 width=prs.slide_width,
                                 height=prs.slide_height)

    # Save presentation
    prs.save(output_path)

# Example usage
pdf_to_pptx("input.pdf", "output.pptx")

Note: You'll need to install the required packages:

pip install pdf2image python-pptx

Also make sure you have poppler installed:

On Windows: Download and install poppler
On Mac: brew install poppler
On Linux: apt-get install poppler-utils
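
Since pdf2image relies on poppler's command-line tools, it can help to check that they are on the PATH before converting. A minimal sketch, assuming `pdftoppm` is the executable to look for; the helper name `poppler_available` is made up for illustration:

```python
import shutil


def poppler_available(tool="pdftoppm"):
    """Return True if the given poppler command-line tool is on the PATH."""
    return shutil.which(tool) is not None


if __name__ == "__main__":
    if poppler_available():
        print("poppler found")
    else:
        print("poppler not found; install it before calling convert_from_path")
```

If the check fails on Windows, the poppler `bin` folder probably still needs to be added to the PATH.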

Let me know if you need help with anything else! 😊

haesleinhuepf · 2024-11-23T12:24:24Z

Instead of converting the slides to images, I'd like to identify text and image objects on each slide and iterate through them using a for-loop.

git-bob think about how to do this

github-actions · 2024-11-23T12:26:33Z

^{This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.}

Hey! 👋

For extracting text and image objects from PDF slides, we can use PyMuPDF (fitz) which has good support for PDF parsing. Here's a notebook showing how to do this:

import io

import fitz  # PyMuPDF
from PIL import Image
from pptx import Presentation
from pptx.util import Pt

def pdf_to_pptx_with_objects(pdf_path, output_path):
    # Open PDF
    pdf = fitz.open(pdf_path)

    # Create presentation
    prs = Presentation()

    # Match the slide size to the first page; PDF coordinates are in points
    if pdf.page_count > 0:
        prs.slide_width = Pt(pdf[0].rect.width)
        prs.slide_height = Pt(pdf[0].rect.height)

    for page in pdf:
        # Add blank slide
        slide = prs.slides.add_slide(prs.slide_layouts[6])

        # Extract text blocks
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, text, block_no, block_type = block
            if block_type != 0:
                continue  # skip image blocks, they are handled below
            textbox = slide.shapes.add_textbox(
                Pt(x0), Pt(y0), Pt(x1 - x0), Pt(y1 - y0)
            )
            textbox.text_frame.text = text

        # Extract images
        for img_info in page.get_images():
            xref = img_info[0]
            image_bytes = pdf.extract_image(xref)["image"]

            # Re-encode as PNG in memory so python-pptx can read it
            image = Image.open(io.BytesIO(image_bytes))
            if image.mode not in ("RGB", "RGBA", "L"):
                image = image.convert("RGB")
            image_stream = io.BytesIO()
            image.save(image_stream, format="PNG")

            # An image can be placed more than once on a page
            for rect in page.get_image_rects(xref):
                image_stream.seek(0)
                slide.shapes.add_picture(
                    image_stream, Pt(rect.x0), Pt(rect.y0),
                    width=Pt(rect.width), height=Pt(rect.height)
                )

    pdf.close()
    prs.save(output_path)

# Usage
pdf_to_pptx_with_objects("input.pdf", "output.pptx")

You'll need to install PyMuPDF:

pip install PyMuPDF

Note: The positioning might need some fine-tuning depending on your specific PDF layout.
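
One likely source of positional drift is a mismatch between the PDF page size and the slide size, for example a 16:9 PDF on python-pptx's default 10 x 7.5 inch slide, or a PDF with mixed page sizes. Below is a minimal sketch of scaling a rectangle from PDF points into slide coordinates in EMU. It assumes a top-left origin (as PyMuPDF reports it) and a page that is scaled uniformly and centred on the slide. The helper name `scale_rect_to_slide` and the constant `DEFAULT_SLIDE_SIZE` are made up for illustration.

```python
# Default python-pptx slide size (10 x 7.5 inches), in EMU (914400 EMU per inch)
DEFAULT_SLIDE_SIZE = (9144000, 6858000)


def scale_rect_to_slide(rect, page_size, slide_size=DEFAULT_SLIDE_SIZE):
    """Map a rectangle in PDF points (top-left origin) onto a slide.

    rect:       (x0, y0, x1, y1) in points
    page_size:  (width, height) of the PDF page in points
    slide_size: (width, height) of the slide in EMU
    Returns (left, top, width, height) in EMU as integers.
    """
    x0, y0, x1, y1 = rect
    page_width, page_height = page_size
    slide_width, slide_height = slide_size
    # Uniform scale so the page fits the slide without distortion
    scale = min(slide_width / page_width, slide_height / page_height)
    # Centre the scaled page on the slide
    offset_x = (slide_width - page_width * scale) / 2
    offset_y = (slide_height - page_height * scale) / 2
    return (
        round(offset_x + x0 * scale),
        round(offset_y + y0 * scale),
        round((x1 - x0) * scale),
        round((y1 - y0) * scale),
    )
```

The four returned integers can be passed straight to `slide.shapes.add_textbox(left, top, width, height)`, since python-pptx accepts plain EMU values.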

haesleinhuepf · 2024-11-23T12:54:05Z

PyMuPDF is not permissively licensed. Are there alternatives?

git-bob comment

github-actions · 2024-11-23T12:56:52Z

^{This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.}

Looking at permissively licensed PDF libraries, we could use pdfminer.six (MIT license) instead. Here's a revised notebook:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTImage, LTFigure
from pptx import Presentation
from pptx.util import Pt

def pdf_to_pptx_with_objects(pdf_path, output_path):
    # Create presentation
    prs = Presentation()

    # Iterate through PDF pages
    for page_index, page_layout in enumerate(extract_pages(pdf_path)):
        # Match the slide size to the first page; PDF coordinates are in points
        if page_index == 0:
            prs.slide_width = Pt(page_layout.width)
            prs.slide_height = Pt(page_layout.height)
        slide = prs.slides.add_slide(prs.slide_layouts[6])

        # Process each element on the page
        for element in page_layout:
            if isinstance(element, LTTextBox):
                # pdfminer measures y from the bottom of the page,
                # python-pptx from the top, so flip the vertical position
                textbox = slide.shapes.add_textbox(
                    Pt(element.x0),
                    Pt(page_layout.height - element.y1),
                    Pt(element.width),
                    Pt(element.height)
                )
                textbox.text_frame.text = element.get_text()
            
            elif isinstance(element, (LTImage, LTFigure)):
                # Note: Image extraction with pdfminer is limited
                # Consider falling back to pdf2image for images
                pass
    
    prs.save(output_path)

# Usage
pdf_to_pptx_with_objects("input.pdf", "output.pptx")

Note: For better image handling, we might need to combine this with the pdf2image approach for images only.
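
One way to combine the two approaches is to render the page with pdf2image and crop out the region that pdfminer reports for an `LTImage` or `LTFigure`. The sketch below only covers the coordinate conversion. It assumes the page was rendered at a known dpi (200 is pdf2image's default) and that the bounding box is in points with a bottom-left origin, as pdfminer reports it. The helper name `bbox_to_crop_box` is made up for illustration.

```python
def bbox_to_crop_box(bbox, page_height, dpi=200):
    """Convert a bbox in PDF points (bottom-left origin) to a PIL crop box.

    bbox:        (x0, y0, x1, y1) in points
    page_height: height of the PDF page in points
    dpi:         resolution the page image was rendered at
    Returns (left, upper, right, lower) in pixels, top-left origin.
    """
    x0, y0, x1, y1 = bbox
    scale = dpi / 72  # pixels per point
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the vertical axis: the top edge of the box is y1 in PDF space
    upper = round((page_height - y1) * scale)
    lower = round((page_height - y0) * scale)
    return (left, upper, right, lower)
```

The result could then be used as `page_image.crop(bbox_to_crop_box(element.bbox, page_layout.height))`, where `page_image` is the matching entry of `convert_from_path(pdf_path, dpi=200)`. Note that the crop contains everything drawn in that region, including any text that overlaps the image.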

haesleinhuepf self-assigned this Nov 23, 2024
