[Programming question]: pdf2pptx #105

Open
haesleinhuepf opened this issue Nov 23, 2024 · 5 comments
@haesleinhuepf
Owner

Question: I would like to convert PDF files of slides into pptx files. A python script should do it.

Python Tools

  • List of Python libraries we should use for answering this:
    • python-pptx
    • a PDF library that is permissively licensed.

Git-bob think about this

@haesleinhuepf haesleinhuepf self-assigned this Nov 23, 2024
Contributor

This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

Hey there! 👋

For converting PDF to PPTX, we can use pdf2image (based on poppler) to convert PDF pages to images and then use python-pptx to create slides. Here's a Jupyter notebook that does this:

# Convert PDF to PPTX
import io
from pdf2image import convert_from_path
from pptx import Presentation

def pdf_to_pptx(pdf_path, output_path):
    # Render each PDF page to a PIL image (requires poppler)
    images = convert_from_path(pdf_path)

    # Create presentation
    prs = Presentation()

    # For each image/page, create a slide
    for image in images:
        slide = prs.slides.add_slide(prs.slide_layouts[6])  # blank layout

        # Write the page image to an in-memory PNG instead of a temp file
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        buffer.seek(0)

        # Add image to slide, stretched to fill it
        # (this distorts pages whose aspect ratio differs from the slide's)
        slide.shapes.add_picture(buffer, 0, 0,
                               width=prs.slide_width,
                               height=prs.slide_height)

    
    # Save presentation
    prs.save(output_path)

# Example usage
pdf_to_pptx("input.pdf", "output.pptx")

Note: You'll need to install the required packages:

pip install pdf2image python-pptx

Also make sure you have poppler installed:

  • On Windows: Download and install poppler
  • On Mac: brew install poppler
  • On Linux (Debian/Ubuntu): sudo apt-get install poppler-utils
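
One thing to watch: the function above stretches every page image to the full slide, so a 16:9 PDF on python-pptx's default 4:3 slide will look squashed. Here's a minimal sketch of one way around that, assuming you want to center the image and keep its aspect ratio. The helper name `fit_within` is made up for illustration and is not part of python-pptx:

```python
def fit_within(img_w, img_h, box_w, box_h):
    """Return (left, top, width, height) that centers an image inside a box
    while preserving the image's aspect ratio.

    The returned values are in the same units as box_w and box_h.
    """
    # Use the smaller of the two scale factors so the image fits in both directions
    scale = min(box_w / img_w, box_h / img_h)
    width = round(img_w * scale)
    height = round(img_h * scale)
    # Split the leftover space evenly on both sides
    left = (box_w - width) // 2
    top = (box_h - height) // 2
    return left, top, width, height


# Example: a 1600x900 px page image on the default 10 in x 7.5 in slide,
# with the slide size expressed in EMU (914400 EMU per inch)
EMU_PER_INCH = 914400
slide_w, slide_h = 10 * EMU_PER_INCH, int(7.5 * EMU_PER_INCH)
left, top, width, height = fit_within(1600, 900, slide_w, slide_h)
print(left, top, width, height)  # → 0 857250 9144000 5143500
```

The four values could then be passed to `slide.shapes.add_picture(...)` in place of `0, 0, prs.slide_width, prs.slide_height`, using `image.size` for the pixel dimensions.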

Let me know if you need help with anything else! 😊

@haesleinhuepf
Owner Author

Instead of converting the slides to images, I'd like to identify text and image objects on each slide and iterate through them using a for-loop.

git-bob think about how to do this

Contributor

This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

Hey! 👋

For extracting text and image objects from PDF slides, we can use PyMuPDF (fitz) which has good support for PDF parsing. Here's a notebook showing how to do this:

import io

import fitz  # PyMuPDF
from PIL import Image
from pptx import Presentation
from pptx.util import Inches

def pdf_to_pptx_with_objects(pdf_path, output_path):
    # Open PDF
    pdf = fitz.open(pdf_path)

    # Create presentation
    prs = Presentation()

    for page in pdf:
        # Add blank slide
        slide = prs.slides.add_slide(prs.slide_layouts[6])

        # Extract text blocks; PDF coordinates are in points (72 per inch)
        for block in page.get_text("blocks"):
            x0, y0, x1, y1, text, block_no, block_type = block
            if block_type != 0:  # skip image blocks, they are handled below
                continue
            textbox = slide.shapes.add_textbox(
                Inches(x0/72), Inches(y0/72),
                Inches((x1-x0)/72), Inches((y1-y0)/72)
            )
            textbox.text_frame.text = text.strip()

        # Extract images
        for img_info in page.get_images():
            xref = img_info[0]
            image_bytes = pdf.extract_image(xref)["image"]

            # Re-encode as PNG in memory, since python-pptx
            # only accepts some image formats
            image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
            buffer = io.BytesIO()
            image.save(buffer, format="PNG")

            # The same image can appear several times on a page;
            # add one picture per placement
            for rect in page.get_image_rects(xref):
                buffer.seek(0)
                slide.shapes.add_picture(
                    buffer,
                    Inches(rect.x0/72),
                    Inches(rect.y0/72),
                    width=Inches(rect.width/72),
                    height=Inches(rect.height/72)
                )

    prs.save(output_path)

# Usage
pdf_to_pptx_with_objects("input.pdf", "output.pptx")

You'll need to install PyMuPDF:

pip install PyMuPDF

Note: The positioning might need some fine-tuning depending on your specific PDF layout.
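
A large part of that fine-tuning is usually the size mismatch between the PDF page and the slide. Here's a rough sketch of a helper that maps a rectangle given in PDF points (top-left origin, as PyMuPDF reports it) onto a slide measured in EMU, scaling the whole page uniformly to fit. The name `rect_to_emu` and its parameters are assumptions for illustration, not part of PyMuPDF or python-pptx:

```python
EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def rect_to_emu(x0, y0, x1, y1, page_w, page_h, slide_w, slide_h):
    """Map a rectangle in PDF points (origin top-left) to slide EMU.

    page_w and page_h are the page size in points; slide_w and slide_h are
    the slide size in EMU. Returns (left, top, width, height) in EMU.
    """
    # One uniform factor so that the whole page fits on the slide
    scale = min(slide_w / (page_w * EMU_PER_POINT),
                slide_h / (page_h * EMU_PER_POINT))
    factor = EMU_PER_POINT * scale
    left = round(x0 * factor)
    top = round(y0 * factor)
    width = round((x1 - x0) * factor)
    height = round((y1 - y0) * factor)
    return left, top, width, height


# Example: a 360 x 270 pt page placed on the default 10 in x 7.5 in slide
# (9144000 x 6858000 EMU), so everything is scaled up by a factor of 2
print(rect_to_emu(36, 36, 72, 54, 360, 270, 9144000, 6858000))
# → (914400, 914400, 914400, 457200)
```

The result can be passed straight to `add_textbox` or `add_picture`, since python-pptx accepts plain integers as EMU values. With PyMuPDF, `page.rect.width` and `page.rect.height` give the page size in points.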

@haesleinhuepf
Owner Author

PyMuPDF is not permissively licensed. Are there alternatives?

git-bob comment

Contributor

This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

Among permissively licensed PDF libraries, pdfminer.six (MIT license) is an option we could use instead. Here's a revised notebook:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextBox, LTImage, LTFigure
from pptx import Presentation
from pptx.util import Inches

def pdf_to_pptx_with_objects(pdf_path, output_path):
    # Create presentation
    prs = Presentation()

    # Iterate through PDF pages
    for page_layout in extract_pages(pdf_path):
        slide = prs.slides.add_slide(prs.slide_layouts[6])

        # Process each element on the page
        for element in page_layout:
            if isinstance(element, LTTextBox):
                # pdfminer measures y from the bottom of the page,
                # python-pptx from the top, so flip the vertical axis
                top = page_layout.y1 - element.y1

                # Add text (coordinates are in points, 72 per inch)
                textbox = slide.shapes.add_textbox(
                    Inches(element.x0/72),
                    Inches(top/72),
                    Inches(element.width/72),
                    Inches(element.height/72)
                )
                textbox.text_frame.text = element.get_text().strip()

            elif isinstance(element, (LTImage, LTFigure)):
                # Note: Image extraction with pdfminer is limited
                # Consider falling back to pdf2image for images
                pass

    prs.save(output_path)

# Usage
pdf_to_pptx_with_objects("input.pdf", "output.pptx")

Note: For better image handling, we might need to combine this with the pdf2image approach for images only.
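
A minimal sketch of how that combination might look: render the page with pdf2image, then crop the region that pdfminer reports for an image or figure. The tricky part is the coordinate conversion, because pdfminer bounding boxes are in points with the origin at the bottom-left, while a rendered page image is in pixels with the origin at the top-left. The helper name `bbox_to_crop_box` is hypothetical, and the 200 DPI default assumes the page was rendered with pdf2image's default resolution:

```python
import math


def bbox_to_crop_box(bbox, page_height_pt, dpi=200):
    """Convert a pdfminer-style bbox (points, origin bottom-left) into a
    (left, upper, right, lower) pixel box for a page rendered at `dpi`."""
    x0, y0, x1, y1 = bbox
    # Horizontal axis: only the unit changes (72 points per inch)
    left = math.floor(x0 * dpi / 72)
    right = math.ceil(x1 * dpi / 72)
    # Vertical axis: flip, because pixel rows are counted from the top
    upper = math.floor((page_height_pt - y1) * dpi / 72)
    lower = math.ceil((page_height_pt - y0) * dpi / 72)
    return left, upper, right, lower


# Example: a 1 in x 1 in figure, one inch from the bottom-left corner
# of a US Letter page (792 pt tall)
crop_box = bbox_to_crop_box((72, 72, 144, 144), page_height_pt=792)
print(crop_box)  # → (200, 1800, 400, 2000)
```

The resulting tuple has the order that PIL's `Image.crop` expects, so something like `page_image.crop(crop_box)` would give a picture that can be added to the slide at the element's position.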
