BUG: Add support for GBK2K cmaps #2385

Merged 1 commit on Jan 2, 2024

stefan6419846
Collaborator

This adds support for the GBK2K-H and GBK2K-V cmaps mentioned in #2356, which I stumbled upon directly.

I do not consider #2356 solved with this for now as it is not clear to me why we do not add the complete mapping once and how to get nice test data. For now, I unfortunately have private test data only.
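
As a rough illustration of what "supporting" a predefined CMap name can mean for text extraction, here is a minimal sketch. It assumes that the CMap names are simply mapped to a Python codec and that `gb18030` is a suitable codec for GBK2K (the PDF specification describes GBK2K-H/V as using the GB 18030-2000 character set). The dictionary and function names are hypothetical and are not taken from the pypdf source.

```python
from typing import Optional

# Hypothetical lookup table: predefined CMap name -> Python codec name.
# Assumption: GBK2K-H/V byte strings can be decoded with the gb18030 codec.
PREDEFINED_CMAP_CODECS = {
    "/GBK2K-H": "gb18030",  # horizontal writing mode
    "/GBK2K-V": "gb18030",  # vertical writing mode
}


def decode_with_predefined_cmap(cmap_name: str, raw: bytes) -> Optional[str]:
    """Decode raw string bytes if the CMap name is known, else return None."""
    codec = PREDEFINED_CMAP_CODECS.get(cmap_name)
    if codec is None:
        # Unknown name: the caller decides whether to log and fall back.
        return None
    return raw.decode(codec, errors="replace")


sample = "中文".encode("gb18030")
decoded = decode_with_predefined_cmap("/GBK2K-H", sample)
unknown = decode_with_predefined_cmap("/Not-A-Known-CMap", sample)
```

Strictly speaking, predefined CMaps map character codes to CIDs rather than to Unicode, so decoding through a codec is a shortcut that may not cover every font.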

codecov bot commented Jan 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b8a877c) 94.35% compared to head (01fe0e0) 94.35%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2385   +/-   ##
=======================================
  Coverage   94.35%   94.35%           
=======================================
  Files          43       43           
  Lines        7584     7584           
  Branches     1519     1519           
=======================================
  Hits         7156     7156           
  Misses        265      265           
  Partials      163      163           


@MartinThoma
Member

Huh, interesting. Did that really solve your problem for that specific PDF? (I wouldn't have thought that it's so easy 😅 )

@stefan6419846
Collaborator Author

Yes, the warning went away and the corresponding characters are now displayed. I have not yet fully verified that the correct characters are emitted, though, as this is hard for me as a non-Asian speaker.

As already mentioned above and in the original issue report #2356, I was surprised by this as well, as I would have expected that we would already provide these simple mappings by default.
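
For readers who face the same verification problem, a coarse sanity check is possible without reading the language. The sketch below is an assumption about what such a check could look like, not something used in this PR: it counts Unicode replacement characters (a hint that decoding failed) and CJK unified ideographs in a string of extracted text. The function name is hypothetical.

```python
import unicodedata


def summarize_extracted_text(text: str) -> dict:
    """Return coarse statistics about a piece of extracted text."""
    # U+FFFD usually means some bytes could not be decoded.
    replacement_chars = text.count("\ufffd")
    # unicodedata.name() gets a default so unnamed code points do not raise.
    cjk_ideographs = sum(
        1
        for ch in text
        if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")
    )
    return {
        "total": len(text),
        "replacement_chars": replacement_chars,
        "cjk_ideographs": cjk_ideographs,
    }


summary = summarize_extracted_text("中文 PDF \ufffd")
```

This only shows that plausible characters came out. It cannot confirm that they are the right ones, which still needs a reference text or a reader of the language.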

@MartinThoma MartinThoma merged commit b085798 into py-pdf:main Jan 2, 2024
15 checks passed
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Jan 2, 2024
@MartinThoma
Member

Thanks!

Please note that the next release might take a bit longer. I want to ensure that we do all breaking changes we want to do.

It will be in January, but it might be at the end of January.

@stefan6419846 stefan6419846 deleted the gbk2k branch January 2, 2024 10:10
@stefan6419846
Collaborator Author

> Please note that the next release might take a bit longer. I want to ensure that we do all breaking changes we want to do.

No worries. If this issue becomes too annoying in our monitoring in the meantime, I am sure that I will manage to find a temporary solution ;) We/I just want to make sure that such issues have been properly analyzed and reported/fixed here before simply ignoring them.

MartinThoma added a commit that referenced this pull request Jan 19, 2024
## What's new

pypdf==4.0.0 is a big milestone forward:

* We finally have a layout-mode text extraction.
  This enables users who want to detect / extract tables
  with heuristics to give it a try.
* We deprecated a lot of the old PyPDF2 API that was either
  not following PEP8 naming styles or was not using a
  property. Users coming from PyPDF2 might want to switch
  first to pypdf<4.0.0 to get helpful error messages
  that show the new API in their specific cases.

A big 'Thank you!' to the whole pypdf community for your
work. Thanks to you, pypdf is better than ever.

Kudos to @shartzog who added the layout-mode with his first
contribution!

### Deprecations (DEP)
-  Drop Python 3.6 support (#2369) by @MartinThoma
-  Remove deprecated code (#2367) by @MartinThoma
-  Remove deprecated XMP properties (#2386) by @stefan6419846

### New Features (ENH)
-  Add "layout" mode for text extraction (#2388) by @shartzog
-  Add Jupyter Notebook integration for PdfReader (#2375) by @MartinThoma
-  Improve/rewrite PDF permission retrieval (#2400) by @stefan6419846

### Bug Fixes (BUG)
-  PdfWriter.add_uri was setting the wrong type (#2406) by @pmiller66
-  Add support for GBK2K cmaps (#2385) by @stefan6419846

### Documentation (DOC)
-  Add pmiller66 for #2406 as a contributor by @MartinThoma
-  Add missing expand parameter (#2393) by @Atomnp
-  Resolve build warnings (#2380) by @stefan6419846
-  Fix testing prerequisites (#2381) by @stefan6419846
-  Improve formatting of contributors page (#2383) by @stefan6419846
-  Add Tobeabellwether as a contributor for #2341 by @MartinThoma

### Developer Experience (DEV)
-  Make dependabot aware of our PR prefixes (#2415) by @stefan6419846
-  Fail on Sphinx issues (#2405) by @stefan6419846
-  Move title check to own workflow (#2384) by @MasterOdin
-  Write to temporary files instead of the working directory (#2379) by @stefan6419846
-  Ensure that the PR titles have the correct format (#2378) by @stefan6419846

### Maintenance (MAINT)
-  Complete FileSpecificationDictionaryEntries constants (#2416) by @MartinThoma
-  Return None instead of -1 when page is not attached (#2376) by @MartinThoma
-  Replace warning with logging.error (#2377) by @MartinThoma

### Testing (TST)
-  Add missing pytest.mark.samples annotations (#2412) by @kitterma
-  Correctly close temporary files (#2396) by @stefan6419846
-  Fix side effect #2379 (#2395) by @pubpub-zz
-  Add test for layout extraction mode (#2390) by @MartinThoma

### Code Style (STY)
-  Use the UserAccessPermissions enum (#2398) by @MartinThoma
-  Run black (#2370) by @MartinThoma

[Full Changelog](3.17.4...4.0.0)
cppntn commented Oct 8, 2024

@MartinThoma please add support for this error as well, thanks.

Advanced encoding /StandardEncoding not implemented yet

@stefan6419846
Collaborator Author

@cppntn Please open a proper bug report with the necessary code and an example file to reproduce this.

cppntn commented Oct 8, 2024

@stefan6419846 The file is confidential, so I cannot share it. The complete error is:

Advanced encoding /StandardEncoding not implemented yet

as /StandardEncoding is not present in the _cmap, I guess

@stefan6419846
Collaborator Author

This is indeed a warning, not an exception. Nevertheless, we are not able to do anything about this without analyzing the actual file to check what the expected behavior should be. (I have seen exactly this message in the past as well, but finding the culprit would require me to dig through the logs, which I honestly do not have the time for at the moment.)

cppntn commented Oct 8, 2024

@stefan6419846 This causes my task to catch an exception and exit, and I don't know why. Could it be that you are emitting the warning with logger_error instead of warnings.warn?

https://github.com/py-pdf/pypdf/blob/main/pypdf/_utils.py#L405C8-L428C1

@stefan6419846
Copy link
Collaborator Author

The message is apparently logged using logger_error:

logger_error(f"Advanced encoding {enc} not implemented yet", __name__)
AFAIK there is no exception. In this case, I consider it up to your code to gracefully handle such cases - it seems like you have some custom configuration which propagates logger.error to an exception.
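
To make the distinction above concrete, here is a minimal sketch using only the standard `logging` module. It shows that a plain `logger.error()` call does not raise, and how a custom handler could turn such a record into an exception. The logger name `demo.pdf_library` and the handler class are hypothetical stand-ins. Whether the reporter's setup behaves like this is an assumption.

```python
import logging

MESSAGE = "Advanced encoding /StandardEncoding not implemented yet"

# Hypothetical logger standing in for a library logger.
logger = logging.getLogger("demo.pdf_library")
logger.propagate = False  # keep the demo isolated from the root logger
logger.addHandler(logging.NullHandler())

# 1) By default, logging an error only emits a record; nothing is raised.
logger.error(MESSAGE)
raised_by_default = False


class RaiseOnError(logging.Handler):
    """Hypothetical custom handler that converts ERROR records to exceptions."""

    def emit(self, record: logging.LogRecord) -> None:
        if record.levelno >= logging.ERROR:
            raise RuntimeError(record.getMessage())


# 2) With such a handler attached, the same call now raises in the caller.
logger.addHandler(RaiseOnError())
caught = None
try:
    logger.error(MESSAGE)
except RuntimeError as exc:
    caught = str(exc)
```

If an application sees an exception at this point, its handlers and any error-reporting integration hooked into `logging` are a reasonable first place to look.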

cppntn commented Oct 8, 2024

Just use logger_warning instead of logger_error, since it is indeed a warning, no?

@stefan6419846
Copy link
Collaborator Author

This depends on how you interpret it. As far as I can tell from the code I have seen, we tend to use logger_warning when something can be fixed, while logger_error is for something which cannot be fixed and needs further attention, for example adding a new encoding as in this case. IMHO every log message (starting at level warning) can be seen as a warning about something not being correct: if there were a hard error making it impossible to continue, we would not issue a log statement. In this case with the encoding, the extracted text layer might be incomplete or partially wrong.

Apart from this, I do not consider the comments of an old PR suitable for such discussions. You are always invited to open a corresponding issue or discussion for this. For me, the current approach is sufficiently correct.
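
For applications that prefer to treat this particular message as a warning, one option is to handle it on the consumer side, as suggested earlier in the thread. The following is a minimal sketch under stated assumptions: the filter class, the list-collecting handler, and the logger name `demo.pdf_library` are hypothetical, and matching on the message text is a guess at what a caller might key on.

```python
import logging


class DowngradeEncodingErrors(logging.Filter):
    """Hypothetical filter: relabel 'not implemented yet' errors as warnings."""

    def filter(self, record: logging.LogRecord) -> bool:
        if (
            record.levelno == logging.ERROR
            and "not implemented yet" in record.getMessage()
        ):
            record.levelno = logging.WARNING
            record.levelname = logging.getLevelName(logging.WARNING)
        return True  # never drop the record, only relabel it


class ListHandler(logging.Handler):
    """Collects records in memory so the effect can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
# Attached to the handler, so it applies to every record reaching it.
handler.addFilter(DowngradeEncodingErrors())

demo_logger = logging.getLogger("demo.pdf_library")
demo_logger.propagate = False
demo_logger.addHandler(handler)

demo_logger.error("Advanced encoding /StandardEncoding not implemented yet")
demo_logger.error("Some other failure")
levels = [record.levelname for record in handler.records]
```

The filter sits on the handler rather than on a logger because logger-level filters only see records created on that exact logger, not those coming from its child loggers.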
