PyPDF2 cleanup #658

MartinThoma · 2022-04-03T15:10:58Z

I'd be interested in (organizing/doing) a cleanup for Python 3.7+.

The breaking changes I'd like to see:

PEP8 naming conventions: getPage should be get_page
Using magic methods: Instead of pdf_reader.get_page(0) I'd rather use pdf_reader[0] - that needs a lot of discussion!
Type annotations (breaking support for Python 3.5 and lower; preferably I'd like to drop support for Python 3.6 as well due to type annotation features)

Step 1: Code changes

Copy the current code in a new repository
Setup unit tests / CI
Release a new package with a clean interface (and the current implementation of PyPDF2)
Wait and see if it actually works

New package name

It would be pretty amazing if we could get pdf on PyPI. I've just contacted the current author; let's see 😇

Then the new package could simply be called pdf and PyPDF2 would be the adapter package using pdf

Step 2: Use PyPDF2 as an adapter

Make PyPDF2 use the new package (dropping <= Python 3.6 support!)

In order to make sure that people move to the new project, there could be a drop-in adapter package for People who are on Python 3.7+. That package could then use the new package under the hood, but provide the old interface to its users (additionally to the new one, maybe with deprecation warnings?).

If possible, that adapter package would be PyPDF2 itself.

Step 3: Spread the word

To make sure people actually use the new PDF package

Usable documentation (e.g. on https://readthedocs.org/ )
Tutorials for common use-cases (e.g. get text from PDF, rotate a single page, merge two PDFs, get number of pages, ... )
Looking on Stackoverflow for python + pdf questions: https://stackoverflow.com/questions/tagged/python+pdf - for old ones and new ones

The text was updated successfully, but these errors were encountered:

MasterOdin · 2022-04-04T15:41:39Z

Assuming ownership of PyPDF2, why not just make any and all breaking changes as part of this repository, bumping the major version? For some stuff like the pep8 function names, could take a phased approach where you wrap getPage around get_page and deprecate the former, but let it get used for some time on PyPDF2 v2, and then remove it on PyPDF2 v3.

If the intention is that no one should use PyPDF2, but instead some theoretical future package, then why not just archive this repo / package, and point it to the new one and just have the ecosystem migrate? Much less effort to try an easy to follow migration guide than keep up an adapter you don't want people to actually use.

MartinThoma · 2022-04-04T21:20:59Z

why not just make any and all breaking changes as part of this repository, bumping the major version

Good point! I'm hesitant with making breaking changes to a widely used library, but if it's properly communicated I guess it's OK 🤔

could take a phased approach

Yes, I like that a lot! Scipy / flask do this pretty well

If the intention is that no one should use PyPDF2

The problem I have with PyPDF is that it was not maintained for such a long time. I have recommended multiple times not to use it because of that.

On the other hand, a change to a new package only improves the situation of most of the community moves and if that project keeps getting progress.

If we would get the "pdf" package on pypi, that could work.

I've just started fiddling around with a new package that should define a clean and modern interface to interact with pdf : https://pypi.org/project/pdffile
Let's see how long I stay motivated / how long it takes me to go through the core usecases 😁 input is welcome, but please be gentle : I do have a day job that is not related to pdf most of the time 😊

MasterOdin · 2022-04-06T18:23:07Z

What in particular in 3.7+ is required in your opinion to drop 3.6 support for a new major version?

MartinThoma · 2022-04-06T18:28:12Z

Python 3.6 is deprecated since 23.12.2021: https://peps.python.org/pep-0494/

While I would not break compatibility on purpose, this means to me that we can use new features. While I'm not sure if it's necessary for PyPDF2, posponed annotation evaluation is a feature I typically miss in Python 3.6 (link)

LightningMan711 · 2022-04-06T21:11:01Z

I'm sure as a full blown programmer, you like what you refer to as magic methods, but I like knowing what is being called. If I couldn't have understood what process was being accessed (such as getPage or get_page in your example) I would have had no idea what the code was doing.

johns1c · 2022-04-07T14:00:38Z

A search for PDF and Python shows an amazing number of projects that use PyPDF2 and therefore the existing interface. I think that we could go a long way without changing the interface by applying the latest standards to the internals.

MasterOdin · 2022-04-07T18:18:58Z

While I would not break compatibility on purpose, this means to me that we can use new features. While I'm not sure if it's necessary for PyPDF2, posponed annotation evaluation is a feature I typically miss in Python 3.6 (link)

I feel like the right course might be that once work for a 2.0 release happens, we try for 3.6 compatibility (as indicated in #659, a number of people still use it), in that if it turns out that some feature from 3.7 would make our lives drastically better, so be it, 3.6 is dropped. I'd just like to avoid dropping 3.6 support (as it is still somewhat widely used yet) just for the sake of dropping support.

I do agree on dropping anything before 3.6 for both f-strings and annotations though. 👍

PEP8 naming conventions: getPage should be get_page

This could be rolled out without breaking BC and also not duplicating function names by doing something like:

class Foo:
  def getPage(self, num):
    print(num)

  def _get_attribute_error(self, name):
    return AttributeError(f"'{self.__class__.__name__}' object has no attribute '{name}'")
    
  def _get_func(self, name):
    words = name.split('_')
    new_name = words[0] + ''.join(word.title() for word in words[1:])
    try:
      return getattr(self, new_name)
    except AttributeError as exc:
      pass
    raise self._get_attribute_error(name)
  
  def __getattr__(self, name):
    if '_' not in name:
      raise self._get_attribute_error(name)

    return self._get_func(name)

a = Foo()
a.getPage(1)
a.get_page(2)
a.invalid_func(3)

Then in 2.0, we reverse some logic here (so that we have def get_page(self, name) and __getattr__ rewrites getPage to get_page, and also maybe issues a deprecation warning?), and then in 3.0, remove the __getattr__ shim altogether.

MartinThoma · 2022-04-07T19:17:16Z

Sounds awesome @MasterOdin ! Let's do it like that!

By the way: I've just released 1.27.0 with several tiny changes and a few bugfixes 🎉

MartinThoma · 2022-04-07T19:42:14Z

If we want to support it, we need to test it: #667 - Python 3.6 is now in the CI

MartinThoma mentioned this issue Apr 3, 2022

PyPDF2 Ownership/Collaborators Available #657

Closed

This was referenced Apr 5, 2022

pypdf is back! sfneal/PyPDF3#18

Open

PyPDF4 is dead claird/PyPDF4#103

Open

MasterOdin pinned this issue Apr 7, 2022

MartinThoma added this to the PyPDF2 version 2.0.0 milestone Apr 9, 2022

py-pdf locked and limited conversation to collaborators Apr 9, 2022

MartinThoma converted this issue into discussion #684 Apr 9, 2022

MartinThoma unpinned this issue Apr 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

PyPDF2 cleanup #658

PyPDF2 cleanup #658

MartinThoma commented Apr 3, 2022 •

edited

Loading

MasterOdin commented Apr 4, 2022

MartinThoma commented Apr 4, 2022

MasterOdin commented Apr 6, 2022

MartinThoma commented Apr 6, 2022

LightningMan711 commented Apr 6, 2022

johns1c commented Apr 7, 2022

MasterOdin commented Apr 7, 2022

MartinThoma commented Apr 7, 2022

MartinThoma commented Apr 7, 2022

This issue was moved to a discussion.

This issue was moved to a discussion.

PyPDF2 cleanup #658

PyPDF2 cleanup #658

Comments

MartinThoma commented Apr 3, 2022 • edited Loading

Step 1: Code changes

New package name

Step 2: Use PyPDF2 as an adapter

Step 3: Spread the word

MasterOdin commented Apr 4, 2022

MartinThoma commented Apr 4, 2022

MasterOdin commented Apr 6, 2022

MartinThoma commented Apr 6, 2022

LightningMan711 commented Apr 6, 2022

johns1c commented Apr 7, 2022

MasterOdin commented Apr 7, 2022

MartinThoma commented Apr 7, 2022

MartinThoma commented Apr 7, 2022

This issue was moved to a discussion.

MartinThoma commented Apr 3, 2022 •

edited

Loading