This project is a fork of ankushshah89/python-docx2txt. It adds one feature: extracting hyperlinks together with their corresponding text.
It is a pure Python utility for extracting text from docx files. The code is taken and adapted from python-docx. In addition to body text, it can extract text from headers, footers, and hyperlinks, and it can also extract images.
pip install docxpy
- From command line:
# extract text
docx2txt file.docx
# extract text and images
docx2txt -i /tmp/img_dir file.docx
- From python:
import docxpy
file = 'file.docx'
# extract text
text = docxpy.process(file)
# extract text and write images in /tmp/img_dir
text = docxpy.process(file, "/tmp/img_dir")
# if you want the hyperlinks
doc = docxpy.DOCReader(file)
doc.process() # process file
hyperlinks = doc.data['links']
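
For readers curious about what hyperlink extraction involves, the following is a minimal standard-library sketch. It is not docxpy's actual implementation, and the function name `extract_hyperlinks` is hypothetical. It relies only on the docx file layout: a docx file is a zip archive, `word/document.xml` holds `w:hyperlink` elements, and `word/_rels/document.xml.rels` maps each relationship id to its target URL. The sketch covers the main document body only (not headers or footers) and assumes the result is wanted as (text, url) pairs, which may differ from the shape of `doc.data['links']`.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the package relationships part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

RELS_PART = "word/_rels/document.xml.rels"
DOCUMENT_PART = "word/document.xml"


def extract_hyperlinks(docx_file):
    """Hypothetical helper: return (text, url) pairs from the main document body."""
    with zipfile.ZipFile(docx_file) as archive:
        doc_root = ET.fromstring(archive.read(DOCUMENT_PART))
        # The relationships part can be absent in very small documents.
        if RELS_PART not in archive.namelist():
            return []
        rels_root = ET.fromstring(archive.read(RELS_PART))

    # Map relationship ids (for example "rId1") to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_REL_NS}}}Relationship")
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = hyperlink.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # internal bookmarks carry no resolvable r:id
        # A link's visible text may be split across several w:t runs.
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append((text, targets[rel_id]))
    return links
```

Calling `extract_hyperlinks('file.docx')` would return a list of (text, url) tuples. For everyday use, the `DOCReader` interface shown above is the intended entry point.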