Comments in DOCX files are not preserved during conversion #31

Closed
coatless opened this issue Dec 15, 2024 · 2 comments · Fixed by #38
Labels
enhancement: New feature or request
open for contribution: Invites open-source developers to contribute to the project.

Comments

@coatless

Overview

The current implementation of the DocxConverter class does not preserve comments when converting DOCX files to Markdown. When a DOCX file containing comments is processed, the comments are completely lost in the output Markdown text.

(This is a great initiative. I've had fun exploring it this weekend.)

Steps to Reproduce

  1. Run the script below
  2. Observe that the output Markdown contains only the main text, without any comments
  3. Open the original DOCX file in Word to confirm the comment exists

from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from markitdown import MarkItDown

def add_comment(doc, paragraph, comment_text):
    # Get the paragraph element
    p = paragraph._p
    
    # Create comment
    comment = OxmlElement("w:comment")
    comment.set(qn("w:id"), "1")
    comment.set(qn("w:author"), "Author")
    comment.set(qn("w:date"), "2024-01-01T12:00:00")
    comment.set(qn("w:initials"), "A")
    
    # Add comment text
    comment_p = OxmlElement("w:p")
    comment_r = OxmlElement("w:r")
    comment_t = OxmlElement("w:t")
    comment_t.text = comment_text
    comment_r.append(comment_t)
    comment_p.append(comment_r)
    comment.append(comment_p)
    
    # Make sure we have a comments part
    if not doc.part.comments_part:
        doc.part.add_comments_part()
    
    # Add comment to document
    doc.part.comments_part._element.append(comment)
    
    # Create comment range start
    comment_start = OxmlElement("w:commentRangeStart")
    comment_start.set(qn("w:id"), "1")
    p.addprevious(comment_start)
    
    # Create comment reference
    comment_ref = OxmlElement("w:commentReference")
    comment_ref.set(qn("w:id"), "1")
    r = p.find(qn("w:r"))
    if r is not None:
        r.append(comment_ref)
    
    # Create comment range end
    comment_end = OxmlElement("w:commentRangeEnd")
    comment_end.set(qn("w:id"), "1")
    p.addnext(comment_end)

# Create document
doc = Document()
doc.add_heading('Document with Comments', 0)
p = doc.add_paragraph('This is the main text. It should have a comment attached.')

# Add comment
add_comment(doc, p, "This is a comment on the paragraph.")

# Save document
doc.save('test_with_comments2.docx')

# Convert and print
converter = MarkItDown()
result = converter.convert('test_with_comments2.docx')
print("Converted content:")
print(result.text_content)
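
As a cross-check that does not require opening Word, the stored comments can be read straight from the DOCX archive. The following is a minimal sketch using only the standard library. It assumes the comments live in the conventional `word/comments.xml` part, and `read_docx_comments` is a hypothetical helper name, not something provided by MarkItDown or python-docx. The demo archive contains only the comments part, so it is not a valid DOCX file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in the script above).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_comments(docx_file):
    """Return (id, author, text) for each comment stored in a DOCX file.

    Hypothetical helper; assumes the comments part is word/comments.xml.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/comments.xml" not in archive.namelist():
            return []
        root = ET.fromstring(archive.read("word/comments.xml"))

    comments = []
    for node in root.iter(f"{{{W_NS}}}comment"):
        # Join the text of every w:t run inside the comment.
        text = "".join(t.text or "" for t in node.iter(f"{{{W_NS}}}t"))
        comments.append(
            (node.get(f"{{{W_NS}}}id"), node.get(f"{{{W_NS}}}author"), text)
        )
    return comments


# Demo: a stripped-down archive holding only the comments part.
comments_xml = (
    f'<w:comments xmlns:w="{W_NS}">'
    '<w:comment w:id="1" w:author="Author">'
    "<w:p><w:r><w:t>This is a comment on the paragraph.</w:t></w:r></w:p>"
    "</w:comment></w:comments>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/comments.xml", comments_xml)
buffer.seek(0)

print(read_docx_comments(buffer))
```

Pointing the same function at `test_with_comments2.docx` should show whether the comment actually made it into the saved file before the conversion step is blamed for dropping it.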

Expected Behavior

Comments from the DOCX file should be preserved in the Markdown output, possibly in a format like:

This is the main text. It should have a comment attached.[^1]

[^1]: Comment by Author: This is a comment on the paragraph.
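
To make the proposed format concrete, here is a small sketch that produces it from a paragraph and its comments. `append_comment_footnotes` is a hypothetical helper, not part of MarkItDown, and it makes a simplifying assumption: all footnote markers go at the end of the paragraph instead of at the exact range each comment is anchored to.

```python
def append_comment_footnotes(text, comments):
    """Append one Markdown footnote per (author, comment_text) pair.

    Hypothetical helper illustrating the proposed output format only.
    """
    if not comments:
        return text

    # Footnote markers, numbered from 1, attached to the end of the text.
    markers = "".join(f"[^{i}]" for i in range(1, len(comments) + 1))
    notes = [
        f"[^{i}]: Comment by {author}: {body}"
        for i, (author, body) in enumerate(comments, start=1)
    ]
    return f"{text}{markers}\n\n" + "\n".join(notes)


print(
    append_comment_footnotes(
        "This is the main text. It should have a comment attached.",
        [("Author", "This is a comment on the paragraph.")],
    )
)
```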

Actual Behavior

Comments are completely stripped from the output, resulting in only the main text being preserved.

@MarkEdmondson1234

Very pertinent for legal documents, as discussions around clauses are often held within comments.

@VillePuuska
Contributor

Looks like Mammoth, which this project uses for the docx -> html conversion, does support keeping comments. You just need to pass a custom style_map that specifies how to format the comments; see https://pypi.org/project/mammoth/#comments. I think this kwarg just needs to be passed through from the MarkItDown converter to Mammoth and comments will be preserved. I did this in PR #38.

It might be useful to add a more specific parameter and expose it in the CLI options, but this is at least a quick fix.
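
For reference, a minimal sketch of the style-map approach when calling Mammoth directly. The `comment-reference => sup` rule is the example given in the Mammoth README linked above; `docx_to_html_with_comments` is a hypothetical wrapper name, and `mammoth` is imported inside the function only so the snippet loads even where the package is not installed.

```python
import io

# Style-map rule from the Mammoth README: wrap each comment reference in <sup>.
# With a mapping like this, Mammoth appends the comment bodies to the end of
# the generated HTML instead of ignoring them.
COMMENT_STYLE_MAP = "comment-reference => sup"


def docx_to_html_with_comments(docx_bytes: bytes) -> str:
    """Convert DOCX bytes to HTML, keeping comments via a Mammoth style map.

    Hypothetical wrapper for illustration; not part of MarkItDown.
    """
    import mammoth  # third-party dependency, imported lazily

    result = mammoth.convert_to_html(
        io.BytesIO(docx_bytes), style_map=COMMENT_STYLE_MAP
    )
    return result.value  # the generated HTML string
```

If the keyword is forwarded as described in this comment, the MarkItDown equivalent would presumably be `converter.convert('test_with_comments2.docx', style_map=COMMENT_STYLE_MAP)`. That call shape is an assumption based on the description here, not something verified against PR #38.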

@gagb added the enhancement and open for contribution labels Dec 15, 2024