Fix:#3230 When parsing a docx file using the Book parsing method, to_page is always -1, resulting in a block count of 0 even if parsing is successful (#3249)

### What problem does this PR solve?

When a docx file is parsed with the Book parsing method, to_page is
always -1, so the block count is 0 even when parsing succeeds.

Fix:#3230
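
The failure mode described above can be illustrated with a minimal sketch. It assumes the parser keeps a paragraph only when its page index falls in the half-open range `[from_page, to_page)`; that filter is not shown in this diff, and `count_kept_paragraphs` and the `pages` list are hypothetical names used only for illustration.

```python
def count_kept_paragraphs(page_of_paragraph, from_page=0, to_page=100000000):
    """Count paragraphs whose page index lies in [from_page, to_page).

    Assumed filter, not copied from the repository.
    """
    return sum(1 for pn in page_of_paragraph if from_page <= pn < to_page)


# Hypothetical page index for each paragraph of a small document.
pages = [0, 0, 1, 2]

# Old default: no page index can satisfy 0 <= pn < -1, so nothing is kept.
old_count = count_kept_paragraphs(pages, to_page=-1)

# New default: the upper bound is large enough to admit every page.
new_count = count_kept_paragraphs(pages)

print(old_count, new_count)
```

Under that assumption, a stored `to_page` of -1 empties the range entirely, which would match the reported block count of 0, while the large sentinel effectively means "no upper limit".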

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
kuschzzp and KevinHuSh authored Nov 8, 2024
1 parent 7c0d28b commit 9c6cc20
Showing 2 changed files with 3 additions and 3 deletions.
2 changes: 1 addition & 1 deletion api/db/db_models.py
@@ -840,7 +840,7 @@ class Task(DataBaseModel):
 doc_id = CharField(max_length=32, null=False, index=True)
 from_page = IntegerField(default=0)

-to_page = IntegerField(default=-1)
+to_page = IntegerField(default=100000000)

 begin_at = DateTimeField(null=True, index=True)
 process_duation = FloatField(default=0)
4 changes: 2 additions & 2 deletions deepdoc/parser/docx_parser.py
@@ -110,7 +110,7 @@ def blockType(b):
 return lines
 return ["\n".join(lines)]

-def __call__(self, fnm, from_page=0, to_page=100000):
+def __call__(self, fnm, from_page=0, to_page=100000000):
 self.doc = Document(fnm) if isinstance(
 fnm, str) else Document(BytesIO(fnm))
 pn = 0 # parsed page
@@ -130,7 +130,7 @@ def __call__(self, fnm, from_page=0, to_page=100000):
 if 'lastRenderedPageBreak' in run._element.xml:
 pn += 1

-secs.append(("".join(runs_within_single_paragraph), p.style.name)) # then concat run.text as part of the paragraph
+secs.append(("".join(runs_within_single_paragraph), p.style.name if hasattr(p.style, 'name') else '')) # then concat run.text as part of the paragraph

 tbls = [self.__extract_table_content(tb) for tb in self.doc.tables]
 return secs, tbls
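
The `hasattr` guard added in the second hunk can be tried in isolation. This is a sketch only: `SimpleNamespace` objects stand in for python-docx paragraphs, `style_name` is a hypothetical helper, and the case of a paragraph whose `style` is `None` is an assumption about what the guard is meant to handle.

```python
from types import SimpleNamespace


def style_name(paragraph):
    """Return the paragraph's style name, or '' when the style has no name."""
    style = paragraph.style
    # Same expression as in the diff: fall back to '' instead of raising
    # AttributeError when the style object lacks a `name` attribute.
    return style.name if hasattr(style, "name") else ""


# Stand-ins for paragraphs (assumed shapes, not real python-docx objects).
styled = SimpleNamespace(style=SimpleNamespace(name="Heading 1"))
unstyled = SimpleNamespace(style=None)

print(repr(style_name(styled)))
print(repr(style_name(unstyled)))
```

With the guard, a paragraph without a usable style contributes an empty style name to `secs` rather than aborting the whole parse.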
