Add possibility to pass additional PDFMiner parameters for get_page_layout() #170
Comments
Having an option to specify kwargs for PDFMiner sounds good. Can you show me the structure of one of these PDFs? Just curious.
Hi @vinayak-mehta,
This is result with
Line 1 for example appends the "L" after "es Blancs". Compared to this output with
I've quickly looked at the underlying issue with letters in the wrong order in the cells in the example above. I believe it's because the x-position is not taken into account when building text in cells (at least for my virtually all-horizontal data). When debugging, I noticed that
Thanks for the detailed report and for looking into the text setter method! You're correct: it doesn't compare the x-position of horizontal and vertical text when assigning text to a cell. This behavior should be corrected. At the same time, users should be able to pass in PDFMiner kwargs to get the best parsing results. Let me look into this.
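To make the ordering problem concrete, here is a minimal sketch of building cell text in reading order: group fragments into lines by vertical position, then sort each line by x-position. This is not camelot's actual text setter; `Fragment`, `join_cell_text`, and `line_tol` are hypothetical names used only for illustration, and the coordinates assume PDF-style axes where y grows upward.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with its position (hypothetical structure)."""
    text: str
    x0: float  # left edge of the fragment
    y0: float  # bottom edge; PDF coordinates grow upward


def join_cell_text(fragments, line_tol=2.0, sep=""):
    """Join fragments top-to-bottom, then left-to-right within a line.

    Fragments whose y0 differs from the first fragment of the current
    line by at most `line_tol` are treated as part of that line.
    """
    lines = []
    # Highest y0 first, so lines come out top-to-bottom.
    for frag in sorted(fragments, key=lambda f: -f.y0):
        if lines and abs(lines[-1][0].y0 - frag.y0) <= line_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, the x-position decides the order.
    return "\n".join(
        sep.join(f.text for f in sorted(line, key=lambda f: f.x0))
        for line in lines
    )


frags = [
    Fragment("es Blancs", 12.0, 100.0),
    Fragment("L", 10.0, 100.5),
    Fragment("2017", 10.0, 80.0),
]
print(join_cell_text(frags))
```

With the x-position taken into account, the stray "L" is placed before "es Blancs" instead of being appended after it.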
@redapple You can now pass PDFMiner LAParam kwargs using
Thanks for the heads up @vinayak-mehta!
On some PDFs, PDFMiner has issues when `detect_vertical` is passed as `True`, and hence the generation of rows is wrong, with some letters not following reading order.

On a local version of camelot-py, I'm getting better results by forcing `detect_vertical=False` here.

Would it be possible to have an argument in `.read_pdf()` to set `detect_vertical`? Just like there is a `margins` argument.
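As a rough illustration of the request, the sketch below merges caller-supplied layout options over a set of defaults, so a single option such as `detect_vertical` can be overridden per call. The names `DEFAULT_LAYOUT` and `resolve_layout_kwargs` are hypothetical and not part of camelot-py; the `all_texts` default is an illustrative assumption, while `detect_vertical=True` reflects the behavior described above.

```python
# Illustrative defaults (assumed values, not taken from camelot-py's source).
DEFAULT_LAYOUT = {"detect_vertical": True, "all_texts": True}


def resolve_layout_kwargs(layout_kwargs=None, defaults=DEFAULT_LAYOUT):
    """Return the defaults with any caller-supplied overrides applied.

    A copy is made so the shared defaults are never mutated.
    """
    merged = dict(defaults)
    merged.update(layout_kwargs or {})
    return merged


# Override only the option that causes trouble on a given PDF.
params = resolve_layout_kwargs({"detect_vertical": False})
print(params)
# The merged dict could then be forwarded to PDFMiner, for example
# as pdfminer.layout.LAParams(**params).
```

Accepting a dict of overrides, rather than one argument per option, keeps the signature of `.read_pdf()` small while still exposing every PDFMiner layout parameter.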