Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyPDF2 cleanup #658

Closed
MartinThoma opened this issue Apr 3, 2022 · 9 comments
Closed

PyPDF2 cleanup #658

MartinThoma opened this issue Apr 3, 2022 · 9 comments

Comments

@MartinThoma
Copy link
Member

MartinThoma commented Apr 3, 2022

I'd be interested in (organizing/doing) a cleanup for Python 3.7+.

The breaking changes I'd like to see:

  • PEP8 naming conventions: getPage should be get_page
  • Using magic methods: Instead of pdf_reader.get_page(0) I'd rather use pdf_reader[0] - that needs a lot of discussion!
  • Type annotations (breaking support for Python 3.5 and lower; preferably I'd like to drop support for Python 3.6 as well due to type annotation features)

Step 1: Code changes

  1. Copy the current code in a new repository
  2. Setup unit tests / CI
  3. Release a new package with a clean interface (and the current implementation of PyPDF2)
  4. Wait and see if it actually works

New package name

It would be pretty amazing if we could get pdf on PyPI. I've just contacted the current author; let's see 😇

Then the new package could simply be called pdf and PyPDF2 would be the adapter package using pdf

Step 2: Use PyPDF2 as an adapter

Make PyPDF2 use the new package (dropping <= Python 3.6 support!)

In order to make sure that people move to the new project, there could be a drop-in adapter package for People who are on Python 3.7+. That package could then use the new package under the hood, but provide the old interface to its users (additionally to the new one, maybe with deprecation warnings?).

If possible, that adapter package would be PyPDF2 itself.

Step 3: Spread the word

To make sure people actually use the new PDF package

  1. Usable documentation (e.g. on https://readthedocs.org/ )
  2. Tutorials for common use-cases (e.g. get text from PDF, rotate a single page, merge two PDFs, get number of pages, ... )
  3. Looking on Stackoverflow for python + pdf questions: https://stackoverflow.com/questions/tagged/python+pdf - for old ones and new ones
@MasterOdin
Copy link
Member

Assuming ownership of PyPDF2, why not just make any and all breaking changes as part of this repository, bumping the major version? For some stuff like the pep8 function names, could take a phased approach where you wrap getPage around get_page and deprecate the former, but let it get used for some time on PyPDF2 v2, and then remove it on PyPDF2 v3.

If the intention is that no one should use PyPDF2, but instead some theoretical future package, then why not just archive this repo / package, and point it to the new one and just have the ecosystem migrate? Much less effort to try an easy to follow migration guide than keep up an adapter you don't want people to actually use.

@MartinThoma
Copy link
Member Author

why not just make any and all breaking changes as part of this repository, bumping the major version

Good point! I'm hesitant with making breaking changes to a widely used library, but if it's properly communicated I guess it's OK 🤔

could take a phased approach

Yes, I like that a lot! Scipy / flask do this pretty well

If the intention is that no one should use PyPDF2

The problem I have with PyPDF is that it was not maintained for such a long time. I have recommended multiple times not to use it because of that.

On the other hand, a change to a new package only improves the situation of most of the community moves and if that project keeps getting progress.

If we would get the "pdf" package on pypi, that could work.

I've just started fiddling around with a new package that should define a clean and modern interface to interact with pdf : https://pypi.org/project/pdffile
Let's see how long I stay motivated / how long it takes me to go through the core usecases 😁 input is welcome, but please be gentle : I do have a day job that is not related to pdf most of the time 😊

@MasterOdin
Copy link
Member

What in particular in 3.7+ is required in your opinion to drop 3.6 support for a new major version?

@MartinThoma
Copy link
Member Author

Python 3.6 is deprecated since 23.12.2021: https://peps.python.org/pep-0494/

While I would not break compatibility on purpose, this means to me that we can use new features. While I'm not sure if it's necessary for PyPDF2, posponed annotation evaluation is a feature I typically miss in Python 3.6 (link)

@LightningMan711
Copy link

I'm sure as a full blown programmer, you like what you refer to as magic methods, but I like knowing what is being called. If I couldn't have understood what process was being accessed (such as getPage or get_page in your example) I would have had no idea what the code was doing.

@johns1c
Copy link

johns1c commented Apr 7, 2022

A search for PDF and Python shows an amazing number of projects that use PyPDF2 and therefore the existing interface. I think that we could go a long way without changing the interface by applying the latest standards to the internals.

@MasterOdin MasterOdin pinned this issue Apr 7, 2022
@MasterOdin
Copy link
Member

While I would not break compatibility on purpose, this means to me that we can use new features. While I'm not sure if it's necessary for PyPDF2, posponed annotation evaluation is a feature I typically miss in Python 3.6 (link)

I feel like the right course might be that once work for a 2.0 release happens, we try for 3.6 compatibility (as indicated in #659, a number of people still use it), in that if it turns out that some feature from 3.7 would make our lives drastically better, so be it, 3.6 is dropped. I'd just like to avoid dropping 3.6 support (as it is still somewhat widely used yet) just for the sake of dropping support.

I do agree on dropping anything before 3.6 for both f-strings and annotations though. 👍

PEP8 naming conventions: getPage should be get_page

This could be rolled out without breaking BC and also not duplicating function names by doing something like:

class Foo:
  def getPage(self, num):
    print(num)

  def _get_attribute_error(self, name):
    return AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")
    
  def _get_func(self, name):
    words = name.split('_')
    new_name = words[0] + ''.join(word.title() for word in words[1:])
    try:
      return getattr(self, new_name)
    except AttributeError as exc:
      pass
    raise self._get_attribute_error(name)
  
  def __getattr__(self, name):
    if '_' not in name:
      raise self._get_attribute_error(name)

    return self._get_func(name)

a = Foo()
a.getPage(1)
a.get_page(2)
a.invalid_func(3)

Then in 2.0, we reverse some logic here (so that we have def get_page(self, name) and __getattr__ rewrites getPage to get_page, and also maybe issues a deprecation warning?), and then in 3.0, remove the __getattr__ shim altogether.

@MartinThoma
Copy link
Member Author

Sounds awesome @MasterOdin ! Let's do it like that!

By the way: I've just released 1.27.0 with several tiny changes and a few bugfixes 🎉

@MartinThoma
Copy link
Member Author

If we want to support it, we need to test it: #667 - Python 3.6 is now in the CI

@MartinThoma MartinThoma added this to the PyPDF2 version 2.0.0 milestone Apr 9, 2022
@py-pdf py-pdf locked and limited conversation to collaborators Apr 9, 2022
@MartinThoma MartinThoma converted this issue into discussion #684 Apr 9, 2022
@MartinThoma MartinThoma unpinned this issue Apr 16, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants