[Programming question]: pdf2pptx #105
Comments
This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.
Hey there! 👋 For converting PDF to PPTX, we can use pdf2image and python-pptx:
    # Convert PDF to PPTX
    import os
    from pdf2image import convert_from_path
    from pptx import Presentation

    def pdf_to_pptx(pdf_path, output_path):
        # Convert PDF pages to PIL images (one per page)
        images = convert_from_path(pdf_path)
        # Create presentation
        prs = Presentation()
        # For each image/page, create a slide
        for image in images:
            slide = prs.slides.add_slide(prs.slide_layouts[6])  # blank layout
            # Save image temporarily
            temp_image_path = "temp_slide.png"
            image.save(temp_image_path)
            # Add image to slide, stretched to the full slide area
            slide.shapes.add_picture(temp_image_path, 0, 0,
                                     width=prs.slide_width,
                                     height=prs.slide_height)
            # Clean up temp file
            os.remove(temp_image_path)
        # Save presentation
        prs.save(output_path)

    # Example usage
    pdf_to_pptx("input.pdf", "output.pptx")
Note: You'll need to install the required packages:
    pip install pdf2image python-pptx
Also make sure you have poppler installed, since pdf2image depends on it.
Let me know if you need help with anything else! 😊
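The snippet above stretches every page image to the full slide, which distorts pages whose aspect ratio differs from the slide's. A minimal sketch of a helper that fits an image inside the slide while keeping its proportions follows. `fit_within` is a hypothetical name, not part of python-pptx, and the slide size used in the example assumes the default python-pptx template of 10 in x 7.5 in (914400 EMU per inch).

```python
def fit_within(img_w, img_h, box_w, box_h):
    """Return (left, top, width, height) that centers an image inside a box
    while preserving the image's aspect ratio.

    The result is in the same units as box_w / box_h (EMU in the example).
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the smaller scale factor so the image fits in both directions
    scale = min(box_w / img_w, box_h / img_h)
    width = round(img_w * scale)
    height = round(img_h * scale)
    # Center the scaled image inside the box
    left = (box_w - width) // 2
    top = (box_h - height) // 2
    return left, top, width, height


# Example: a 16:9 page image (1920 x 1080 px) on an assumed 4:3 slide, in EMU
SLIDE_W, SLIDE_H = 9144000, 6858000
print(fit_within(1920, 1080, SLIDE_W, SLIDE_H))  # (0, 857250, 9144000, 5143500)
```

The four values could be passed as `left`, `top`, `width` and `height` to `slide.shapes.add_picture`, with `image.size` from PIL supplying the pixel dimensions.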
Instead of converting the slides to images, I'd like to identify text and image objects on each slide and iterate through them using a for-loop. git-bob think about how to do this
This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.
Hey! 👋 For extracting text and image objects from PDF slides, we can use PyMuPDF:
    import io
    import fitz  # PyMuPDF
    from PIL import Image
    from pptx import Presentation
    from pptx.util import Inches

    def pdf_to_pptx_with_objects(pdf_path, output_path):
        # Open PDF
        pdf = fitz.open(pdf_path)
        # Create presentation
        prs = Presentation()
        for page in pdf:
            # Add blank slide
            slide = prs.slides.add_slide(prs.slide_layouts[6])
            # Extract text blocks (coordinates are in points, 72 per inch)
            text_blocks = page.get_text("blocks")
            for block in text_blocks:
                x0, y0, x1, y1, text, block_no, block_type = block
                if block_type != 0:
                    continue  # skip image blocks; images are handled below
                textbox = slide.shapes.add_textbox(
                    Inches(x0/72), Inches(y0/72),
                    Inches((x1-x0)/72), Inches((y1-y0)/72)
                )
                textbox.text_frame.text = text
            # Extract images
            for img_info in page.get_images():
                xref = img_info[0]
                base_image = pdf.extract_image(xref)
                image_bytes = base_image["image"]
                # Re-encode as PNG in memory instead of writing temp files
                image = Image.open(io.BytesIO(image_bytes))
                png_buffer = io.BytesIO()
                image.save(png_buffer, format="PNG")
                # Get image locations from PDF; get_image_rects needs the
                # image reference and returns one rect per placement
                for image_rect in page.get_image_rects(xref):
                    png_buffer.seek(0)
                    slide.shapes.add_picture(
                        png_buffer,
                        Inches(image_rect.x0/72),
                        Inches(image_rect.y0/72),
                        width=Inches((image_rect.x1-image_rect.x0)/72)
                    )
        prs.save(output_path)

    # Usage
    pdf_to_pptx_with_objects("input.pdf", "output.pptx")
You'll need to install PyMuPDF:
    pip install PyMuPDF
Note: The positioning might need some fine-tuning depending on your specific PDF layout.
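On the note about fine-tuning positions: PDF coordinates are in points (1/72 in), while python-pptx works in EMU (914400 per inch, so 12700 per point). If the slide keeps its default size but the PDF page has a different size, objects end up misplaced or off the slide. One way around this is to make the slide the same size as the page, so coordinates map one to one. The sketch below shows only the unit conversion. `points_to_emu` and `rect_to_emu` are hypothetical helper names, and the 960 x 540 pt page size is an assumed example of a 16:9 slide export.

```python
EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def points_to_emu(value_pt):
    """Convert a length in PDF points to integer EMU."""
    return round(value_pt * EMU_PER_POINT)


def rect_to_emu(x0, y0, x1, y1):
    """Convert a top-left-origin rectangle in points to
    (left, top, width, height) in EMU."""
    return (
        points_to_emu(x0),
        points_to_emu(y0),
        points_to_emu(x1 - x0),
        points_to_emu(y1 - y0),
    )


# Slide size matching an assumed 960 x 540 pt page
slide_width, slide_height = points_to_emu(960), points_to_emu(540)
print(slide_width, slide_height)      # 12192000 6858000
print(rect_to_emu(72, 36, 216, 108))  # (914400, 457200, 1828800, 914400)
```

With python-pptx, the two slide values could be assigned to `prs.slide_width` and `prs.slide_height` before any slide is added, taking the page size from `page.rect.width` and `page.rect.height` of the first PyMuPDF page.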
PyMuPDF is not permissively licensed. Are there alternatives? git-bob comment
This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.
Looking at permissively licensed PDF libraries, we could use pdfminer.six:
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox, LTImage, LTFigure
    from pptx import Presentation
    from pptx.util import Inches

    def pdf_to_pptx_with_objects(pdf_path, output_path):
        # Create presentation
        prs = Presentation()
        # Iterate through PDF pages
        for page_layout in extract_pages(pdf_path):
            slide = prs.slides.add_slide(prs.slide_layouts[6])
            # Process each element on the page
            for element in page_layout:
                if isinstance(element, LTTextBox):
                    # pdfminer measures y from the bottom of the page,
                    # python-pptx from the top, so flip the y-axis
                    top = page_layout.height - element.y1
                    # Add text
                    textbox = slide.shapes.add_textbox(
                        Inches(element.x0/72),
                        Inches(top/72),
                        Inches((element.x1-element.x0)/72),
                        Inches((element.y1-element.y0)/72)
                    )
                    textbox.text_frame.text = element.get_text()
                elif isinstance(element, (LTImage, LTFigure)):
                    # Note: Image extraction with pdfminer is limited
                    # Consider falling back to pdf2image for images
                    pass
        prs.save(output_path)

    # Usage
    pdf_to_pptx_with_objects("input.pdf", "output.pptx")
Note: For better image handling, we might need to combine this with the pdf2image approach for images only.
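One possible way to combine the two approaches is to render the page with pdf2image and crop the rendered image to each figure's bounding box. The sketch below covers only the coordinate mapping that this needs: pdfminer bounding boxes are in points with the origin at the bottom-left, while a rendered image is in pixels with the origin at the top-left. `bbox_to_pixel_box` is a hypothetical helper, and the default of 200 dpi is assumed to match the default rendering resolution of `convert_from_path`.

```python
def bbox_to_pixel_box(bbox, page_height_pt, dpi=200):
    """Map a bounding box (x0, y0, x1, y1) in points with a bottom-left
    origin to a (left, upper, right, lower) pixel box with a top-left
    origin, for a page rendered at the given dpi."""
    x0, y0, x1, y1 = bbox
    scale = dpi / 72  # pixels per point
    left = round(x0 * scale)
    right = round(x1 * scale)
    # Flip the y-axis: the box's top edge (y1) becomes the upper pixel row
    upper = round((page_height_pt - y1) * scale)
    lower = round((page_height_pt - y0) * scale)
    return left, upper, right, lower


# A figure covering the top-left quarter of a 720 x 540 pt page at 144 dpi
print(bbox_to_pixel_box((0, 270, 360, 540), page_height_pt=540, dpi=144))
# (0, 0, 720, 540)
```

The returned tuple has the shape that PIL's `Image.crop` expects, so a rendered page could be cropped to an element's `bbox` (with `page_layout.height` as the page height) and the crop added to the slide as a picture.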
Question: I would like to convert PDF files of slides into pptx files. A python script should do it.
Python Tools
Git-bob think about this