[Question]: When parsing a docx file using the Book parsing method, to_page is always -1, resulting in a block count of 0 even if parsing is successful #3230

Closed
kuschzzp opened this issue Nov 6, 2024 · 2 comments
Labels
bug Something isn't working question Further information is requested

Comments

@kuschzzp
Contributor

kuschzzp commented Nov 6, 2024

Describe your problem

With a docx file and the Book parsing method, to_page is always -1.

I don't know which code to change so that to_page gets a correct value.

    def __call__(self, fnm, from_page=0, to_page=100000):
        self.doc = Document(fnm) if isinstance(
            fnm, str) else Document(BytesIO(fnm))
        pn = 0 # parsed page
        secs = [] # parsed contents
        for p in self.doc.paragraphs:
            if pn > to_page:
                break
            runs_within_single_paragraph = [] # save runs within the range of pages
            for run in p.runs:
                if pn > to_page:
                    break
                if from_page <= pn < to_page and p.text.strip():
                    runs_within_single_paragraph.append(run.text) # append run.text first
                # wrap page break checker into a static method
                if 'lastRenderedPageBreak' in run._element.xml:
                    pn += 1
            secs.append(("".join(runs_within_single_paragraph), p.style.name)) # then concat run.text as part of the paragraph
        tbls = [self.__extract_table_content(tb) for tb in self.doc.tables]
        return secs, tbls

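When parsing a docx file using the Book parsing method, to_page is always -1. Because pn starts at 0, the `pn > to_page` check is true on the very first paragraph, so the loop breaks immediately and the block count is 0 even though parsing is reported as successful.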
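To make the failure easy to reproduce without a real docx file, here is a minimal sketch of the same page-range loop on plain data. The function name `collect_sections` and the `(text, ends_page)` tuples are hypothetical stand-ins for python-docx paragraphs and runs; only the loop logic is taken from the code above.

```python
def collect_sections(paragraphs, from_page=0, to_page=100000):
    """Mimic the page-range loop from the issue on plain data.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    (text, ends_page) tuples standing in for python-docx runs
    (hypothetical representation, not the real API).
    """
    pn = 0      # current page number
    secs = []   # collected paragraph texts
    for runs in paragraphs:
        if pn > to_page:
            break
        para_text = "".join(text for text, _ in runs)
        kept = []
        for text, ends_page in runs:
            if pn > to_page:
                break
            if from_page <= pn < to_page and para_text.strip():
                kept.append(text)
            if ends_page:  # stands in for the lastRenderedPageBreak check
                pn += 1
        secs.append("".join(kept))
    return secs


doc = [[("Chapter 1", False)], [("Some text.", True)], [("Chapter 2", False)]]
print(collect_sections(doc, to_page=-1))  # nothing is collected
print(collect_sections(doc))              # all three paragraphs are collected
```

With `to_page=-1` the first `pn > to_page` check already succeeds, so the loop exits before anything is appended.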

To fix a separate error, I changed the following line

secs.append(("".join(runs_within_single_paragraph), p.style.name)) # then concat run.text as part of the paragraph

to

secs.append(("".join(runs_within_single_paragraph), p.style.name if hasattr(p.style, 'name') else ''))
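The same guard can also be written as a small helper, sketched below. The name `style_name` is hypothetical and not part of the project; the `SimpleNamespace` objects only imitate paragraphs for the demonstration. Besides a style without a `name` attribute, this version also falls back to an empty string when the name itself is `None`.

```python
from types import SimpleNamespace


def style_name(paragraph):
    """Return the paragraph's style name, or '' when it is unavailable."""
    style = getattr(paragraph, "style", None)
    return getattr(style, "name", "") or ""


# Stand-ins for paragraphs with and without a usable style.
with_style = SimpleNamespace(style=SimpleNamespace(name="Heading 1"))
without_style = SimpleNamespace(style=None)

print(style_name(with_style))
print(repr(style_name(without_style)))
```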

I found that the default value is set here:

to_page = IntegerField(default=-1)

@kuschzzp kuschzzp added the question Further information is requested label Nov 6, 2024
@KevinHuSh
Collaborator

Could you submit a PR about this issue?
Change the default value of to_page to 100000000.
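As an illustration only, a large default works because the page counter never exceeds it in practice. A related option, which is not what was requested in this thread, is to map a negative or missing limit to that large value at the call site. The names `normalize_to_page` and `UNBOUNDED_PAGE` below are hypothetical.

```python
UNBOUNDED_PAGE = 100000000  # the value suggested in this thread


def normalize_to_page(to_page, unbounded=UNBOUNDED_PAGE):
    """Map a missing or negative page limit to an effectively unbounded one."""
    if to_page is None or to_page < 0:
        return unbounded
    return to_page


print(normalize_to_page(-1))
print(normalize_to_page(12))
```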

@kuschzzp
Contributor Author

kuschzzp commented Nov 7, 2024

Could you submit a PR about this issue? Change the default value of to_page to 100000000.

I have submitted it. Please check whether it meets the requirements.

KevinHuSh added a commit that referenced this issue Nov 8, 2024
…page is always -1, resulting in a block count of 0 even if parsing is successful (#3249)

### What problem does this PR solve?

When parsing a docx file using the Book parsing method, to_page is
always -1, resulting in a block count of 0 even if parsing is successful

Fix:#3230

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Kevin Hu <kevinhu.sh@gmail.com>
jhaiq pushed a commit to jhaiq/ragflow that referenced this issue Nov 30, 2024
…ethod, to_page is always -1, resulting in a block count of 0 even if parsing is successful (infiniflow#3249)
