A pure-Python utility to extract text and images from docx files.

badbye/docxpy

docxpy

This project is a fork of ankushshah89/python-docx2txt. It adds one feature: extracting hyperlinks together with their corresponding text.

It is a pure-Python utility for extracting text from docx files. The code is taken and adapted from python-docx. Unlike the original, it can also extract text from headers, footers, and hyperlinks, and it can extract images.
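
To show where header, body, and footer text lives inside a docx file, here is a minimal standard-library sketch. It is not docxpy's implementation; the helper names `part_text` and `extract_text` are hypothetical. It assumes the usual package layout, in which the body is stored in word/document.xml and headers and footers in word/headerN.xml and word/footerN.xml.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by paragraph (w:p) and text (w:t) elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def part_text(xml_bytes):
    """Join the text runs of each paragraph in one XML part, one paragraph per line."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return "\n".join(paragraphs)


def extract_text(docx):
    """Return text from headers, the main body, and footers, in that order.

    `docx` is a path or a binary file-like object.
    Limitation of this sketch: paragraphs nested inside other paragraphs
    (for example text boxes) are reported twice.
    """
    with zipfile.ZipFile(docx) as z:
        names = z.namelist()
        headers = sorted(n for n in names if re.fullmatch(r"word/header\d*\.xml", n))
        footers = sorted(n for n in names if re.fullmatch(r"word/footer\d*\.xml", n))
        parts = headers + ["word/document.xml"] + footers
        return "\n".join(part_text(z.read(n)) for n in parts)
```

Calling `extract_text("file.docx")` returns a single string; for real documents, `docxpy.process` remains the intended entry point.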

How to install?

pip install docxpy

How to run?

  1. From the command line:
# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
  2. From Python:
import docxpy

file = 'file.docx'

# extract text
text = docxpy.process(file)

# extract text and write images in /tmp/img_dir
text = docxpy.process(file, "/tmp/img_dir")


# if you want the hyperlinks
doc = docxpy.DOCReader(file)
doc.process()  # process file
hyperlinks = doc.data['links']
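
For readers curious about what hyperlink extraction involves, the following is a minimal standard-library sketch, not docxpy's own code. It relies on how the docx format stores links: each `w:hyperlink` element in word/document.xml carries an `r:id` attribute, and word/_rels/document.xml.rels maps that id to the target URL. The function name `extract_hyperlinks` and the `(text, url)` return shape are assumptions made for this illustration and may differ from what `doc.data['links']` contains.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the main document part and its relationships file.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"

RELS_PATH = "word/_rels/document.xml.rels"


def extract_hyperlinks(docx):
    """Return (text, url) pairs for hyperlinks in the main document body.

    `docx` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as z:
        body = ET.fromstring(z.read("word/document.xml"))
        # A document without any relationships has no rels part at all.
        rels_xml = z.read(RELS_PATH) if RELS_PATH in z.namelist() else None

    # Map relationship ids (e.g. "rId1") to their targets.
    targets = {}
    if rels_xml is not None:
        for rel in ET.fromstring(rels_xml).iter(f"{{{PKG}}}Relationship"):
            targets[rel.get("Id")] = rel.get("Target")

    links = []
    for link in body.iter(f"{{{W}}}hyperlink"):
        url = targets.get(link.get(f"{{{R}}}id"))
        if url is None:
            continue  # internal anchor or dangling id: no external target
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        links.append((text, url))
    return links
```

This sketch only looks at the main body; links in headers and footers have their own relationships files and would need the same lookup per part.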
