[Programming question]: docx2md #106

Closed
haesleinhuepf opened this issue Nov 23, 2024 · 3 comments

@haesleinhuepf
Owner

I need a Python function that converts a docx document to markdown and a Python function that converts a markdown file to docx.

git-bob think about this

@haesleinhuepf haesleinhuepf self-assigned this Nov 23, 2024
Contributor

This message was generated by git-bob (version: 0.14.1dev, model: claude-3-5-sonnet-20241022, log), an experimental AI-based assistant. It can make mistakes and has limitations. Check its messages carefully.

Hi! I'm git-bob, happy to help with document conversion. Here's a notebook showing how to convert between docx and markdown using python-docx and pypandoc:

import jupytext

# Notebook source as a Python script. The outer string uses ''' so that the
# """ docstrings inside it do not end the string early.
source = '''
# Document conversion functions

import pypandoc
from docx import Document  # imported here, but not used by the functions below
import os

def docx_to_markdown(docx_path, output_path):
    """Convert docx to markdown"""
    pypandoc.convert_file(docx_path, 'md', outputfile=output_path)

def markdown_to_docx(markdown_path, output_path):
    """Convert markdown to docx"""
    pypandoc.convert_file(markdown_path, 'docx', outputfile=output_path)

# Example usage:
if __name__ == "__main__":
    # Convert docx to markdown
    docx_to_markdown("example.docx", "output.md")

    # Convert markdown to docx
    markdown_to_docx("example.md", "output.docx")
'''

# Parse the script text into a notebook object, then print it as .ipynb JSON
notebook = jupytext.reads(source, fmt='py:percent')
print(jupytext.writes(notebook, fmt='ipynb'))

This requires installing:

pip install python-docx pypandoc

You'll also need to have pandoc installed on your system. On Ubuntu/Debian:

sudo apt-get install pandoc

Let me know if you need any adjustments!

@haesleinhuepf
Owner Author

git-bob implement this as an actual Jupyter notebook, without the jupytext stuff around it. Use the file "test.docx" as example input and save the result as "test.md".
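
For reference, a notebook file is plain JSON, so it can be written without jupytext. The following is a minimal sketch, not the notebook git-bob produced: it assumes the nbformat 4 layout, reuses the pypandoc call from the comment above as the cell source, and the output filename `docx_to_markdown.ipynb` and the helper names `build_notebook` and `write_notebook` are hypothetical.

```python
import json

# Code that will live inside the notebook's code cell (it is not executed here).
# Running the notebook itself requires pypandoc and a pandoc installation.
CELL_SOURCE = '''import pypandoc

def docx_to_markdown(docx_path, output_path):
    """Convert docx to markdown"""
    pypandoc.convert_file(docx_path, 'md', outputfile=output_path)

docx_to_markdown("test.docx", "test.md")
'''


def build_notebook(code):
    """Return a minimal nbformat-4 notebook dict with one markdown and one code cell."""
    return {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": "# Document conversion functions",
            },
            {
                "cell_type": "code",
                "metadata": {},
                "execution_count": None,
                "outputs": [],
                "source": code,
            },
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


def write_notebook(path, code=CELL_SOURCE):
    """Serialize the notebook dict to an .ipynb file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(build_notebook(code), f, indent=1)


if __name__ == "__main__":
    # Hypothetical output filename, chosen for illustration only
    write_notebook("docx_to_markdown.ipynb")
```

Opening the resulting file in Jupyter and running the cell would read "test.docx" and write "test.md", provided pandoc is available on the system.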

@haesleinhuepf
Owner Author

For the record: I implemented this without pandoc and without reusing pre-existing libraries, to get a permissively licensed, Python-only implementation for simple docx files: https://github.com/haesleinhuepf/docx2markdown/
