From 38d2d2080601855aaafca7ddd1cd1b6997bca8bc Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 21:25:49 +0000 Subject: [PATCH 1/7] Draft PEP for Simple JSON API Co-authored-by: Cooper Lees Co-authored-by: Donald Stufft Co-authored-by: Pradyun Gedam --- pep-9999.rst | 452 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 452 insertions(+) create mode 100644 pep-9999.rst diff --git a/pep-9999.rst b/pep-9999.rst new file mode 100644 index 00000000000..606bc4a34dd --- /dev/null +++ b/pep-9999.rst @@ -0,0 +1,452 @@ +PEP: 9999 +Title: JSON-based Simple API for Python Package Indexes +Author: Donald Stufft , + Pradyun Gedam , + Cooper Lees , + Dustin Ingram +Status: Draft +Type: Informational +Content-Type: text/x-rst +BDFL-Delegate: Donald Stufft +Discussions-To: https://discuss.python.org/t/AAAAAA/999999 +Created: 04-May-2022 + + +Abstract +======== + +The "Simple Repository API" that was defined in :pep:`503` (and was in use much +longer than that) has served us reasonably well for a very long time. However, +the reliance on using HTML as the data exchange mechanism has several +shortcomings. + +There are two major issues with an HTML-based API: + +- While HTML5 is a standard, it's an incredibly complex standard and ensuring + completely correct parsing of it involves complex logic that does not + currently exist within the Python standard library (nor the standard library + of many other languages). + + This means that to actually accept everything that is technically valid, tools + have to pull in large dependencies or they have to rely on the standard library's + ``html.parser`` library, which is lighter weight but potentially doesn't + fully support HTML5. + +- HTML5 is primarily designed as a markup language to present documents for human + consumption. 
Our use of it is driven largely for historical reasons and accidental + reasons, and it's unlikely anyone would design an API that relied on it if + they were starting from scratch. + + The primary issue with using a markup format designed for human consumption + is that there's not a great way to actually encode data within HTML. We've + gotten around this by limiting the data we put in this API and being creative + with how we can cram data into the API (for instance, hashes are embedded as + URL fragments, adding the ``data-yanked`` attribute in :pep:`592`). + +:pep:`503` was largely an attempt to standardize what was already in use, so it +did not propose any large changes to the API. + +In the intervening years, we've regularly talked about an "API V2" that would +re-envision the entire API of PyPI. However, due to limited time constraints, +that effort has not gained much if any traction beyond people thinking that it +would be nice to do it. + +This PEP attempts to take a different route. It doesn't fundamentally change +the overall API structure, but instead specifies a new representation of the +existing data contained in existing :pep:`503` responses in a format that is +easier for software to parse rather than using a human centric document format. + +Goals +===== + +- **Enable zero configuration discovery.** Clients of the simple API **MUST** be + able to gracefully determine whether a target repository supports this PEP + without relying on any form of out of band communication (configuration, prior + knowledge, etc). Individual clients **MAY** choose to require configuration + to enable the use of this API, however. +- **Enable clients to drop support for "legacy" HTML parsing.** While it is expected + that most clients will keep supporting HTML-only repositories for a while, if not + forever, it should be possible for a client to choose to support only the new + API formats and no longer invoke an HTML parser. 
+- **Enable repositories to drop support for "legacy" HTML formats.** Similar to + clients, it is expected that most repositories will continue to support HTML + responses for a long time, or forever. It should be possible for a repository to + choose to only support the new formats. +- **Maintain full support for existing HTML-only clients.** We **MUST** not break + existing clients that are accessing the API as a strictly :pep:`503` API. The only + exception to this, is if the repository itself has chosen to no longer support + the HTML format. +- **Minimal additional HTTP requests.** Using this API **MUST** not drastically + increase the amount of HTTP requests an installer must do in order to function. + Ideally it will require 0 additional requests, but if needed it may require one + or two additional requests (total, not per dependency). +- **Minimal additional unique reponses.** Due to the nature of how large + repositories like PyPI cache responses, this PEP should not introduce a + significantly or combinatorially large number of additional unique responses + that the repository may produce. +- **Supports TUF.** This PEP **MUST** be able to function within the bounds of + what TUF can support (:pep:`458`), and must be able to be secured using it. +- **Require only the standard library, or small external dependencies for clients.** + Parsing an API response should ideally require nothing but the standard + library, however it would be acceptable to require a small, pure Python + dependency. + + +Specification +============= + +To enable parsing responses with only the standard library, this PEP specifies that +all responses (besides the files themselves, and the HTML responses from +:pep:`503`) should be encoded using `JSON `_. 
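As a rough illustration of the standard-library-only goal, the following is a minimal sketch rather than a reference implementation. It assumes a response body shaped like the project example given later in this PEP. The helper name ``usable_files`` and the placeholder digests are hypothetical, and treating the mere presence of a ``yanked`` key as "yanked" follows the wording of this draft.

```python
import json

# Body modeled on the project example later in this PEP.
# The digests are placeholders, not real hashes.
BODY = """
{
  "name": "holygrail",
  "files": [
    {
      "filename": "holygrail-1.0.tar.gz",
      "url": "https://example.com/files/holygrail-1.0.tar.gz",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "yanked": "Had a vulnerability"
    },
    {
      "filename": "holygrail-1.0-py3-none-any.whl",
      "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl",
      "hashes": {"sha256": "...", "blake2b": "..."},
      "requires-python": ">=3.7",
      "dist-info-metadata-available": True
    }
  ]
}
""".replace("True", "true")  # keep the embedded document valid JSON


def usable_files(body: str) -> list:
    """Return (filename, url) pairs for files that are not yanked.

    Hypothetical helper: only the ``json`` module is needed to decode
    the response, no HTML parser is involved.
    """
    data = json.loads(body)
    return [
        (entry["filename"], entry["url"])
        for entry in data["files"]
        if "yanked" not in entry  # presence of the key marks the file as yanked
    ]
```

Calling ``usable_files(BODY)`` keeps only the wheel, because the sdist entry carries a ``yanked`` key.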
+ +To enable zero configuration discovery and to minimize the amount of additional HTTP +requests, this PEP extends :pep:`503` such that all of the API endpoints (other than the +files themselves) will utilize HTTP content negotiation to allow client and server to +select the correct format to serve, i.e. either HTML or JSON. + +Format Selection +---------------- + +A HTML response will be the default when requesting in version 1.0: + +- ``/simple/`` +- ``/simple/foo/`` + - Like :pep:`503`, the trailing ``/`` is expected + +To request a JSON response, the ``Accept`` header will need to be added to the +request specify the response type and version. For version 1.0 this will look like: + + ``Accept: application/vnd.pypi.simple.v1+json`` + +The version is also optional and will then always return the latest version: + + ``Accept: application/vnd.pypi.simple+json`` + +This is for clients who always want latest and should expect potential +breakages. Additionally, it is potential useful way to run integration tests +against a possibly breaking version. + +Specifying HTML is also allowed so clients can be explicit to backends (e.g if we +switch to JSON default in the future): + + ``Accept: application/vnd.pypi.simple.v1+html`` + +Using ``text/html`` will also work, which will serve the latest API version. To +be explicit, clients should use specific HTML ``Accept``. If no +``Accept`` is specified, the latest HTTP version will be returned unless +the backend *only* supports JSON. Backends may default to returning JSON in the +future. + +The ``Accept:`` header also allows you to say that you prefer the the V1 Simple JSON API, +if that's not available then you prefer the V1 HTML API, and if that's not available, +just ``text/html``. 
To do this would look like: + + ``Accept: application/vnd.pypi.simple.v1+json, application/vnd.pypi.simple.v1+html, text/html`` + +Versioning +---------- + +Versioning will adhere to :pep:`629` format (``Major.Minor``) and will be +included in the ``Accept`` request that clients add to obtain a JSON +response. We don't foresee the use of *Minor* versioning but will support it if +the need does arise. + +The header for clients accessing version 1.0 of the API will be: + + ``application/vnd.pypi.simple.index.v1+json`` + +An example for Accept values that a newer APIs could support **would** look like: + + ``application/vnd.pypi.simple.index.v2+json`` + +If a version that does not exist is requested, the server will explicitly return a +`406 Not Acceptable +`_ HTTP status +code. The response will also indicate available API versions and links to +version formats. + + +TUF Support - PEP 458 +--------------------- + +:pep:`458` states that the "Simple Index" needs to be hashable. To adhere to the TUF +standard, we will need a target for each response, i.e. the HTML and JSON (plus any +future type) response. To provide this we could have two targets per API endpoint: + +- ``/simple/foo/vnd.pypi.simple.v1.html`` +- ``/simple/foo/vnd.pypi.simple.v1.json`` + +Additionally, when calculating the digest of a JSON response, indices should +use the `Canonical JSON `_ format. + + +Root URL +-------- + +The root URL ``/`` for this PEP (which represents the base URL) will be a JSON encoded +dictionary where each key is a string of the normalized project name, and the value is +a dictionary with a single key, ``url``, which represents the URL that the project can +be fetched from. As an example:: + + { + "frob": {"url": "/frob/"}, + "spamspamspam": {"url": "/spamspamspam/"} + } + +Below the root URL is another URL for each individual project contained within +a repository. 
The format of this URL is ``//`` where the ```` +is replaced by the :pep:`503`-canonicalized name for that project, so a project named +"Holy_Grail" would have a URL like ``/holy-grail/``. This URL must respond with a +JSON encoded dictionary that has two keys, ``name``, which represents the normalized +name of the project and ``files``. The ``files`` key is a list of dictionaries, +each one representing an individual file. + +Each individual file dictionary has the following keys: + +- ``filename``: The filename that is being represented. +- ``url``: The URL that the file can be fetched from. +- ``hashes``: A dictionary mapping a hash name to a hex encoded digest of the file. + Multiple hashes can be included, and it is up to the client to decide what to do + with multiple hashes (it may validate all of them or a subset of them, or nothing + at all). These hash names **SHOULD** always be normalized to be lowercase. + + The ``hashes`` dictionary **MUST** be present, even if no hashes are available + for the file, however it is **HIGHLY** recommended that at least one secure, + guaranteed to be available hash is always included. +- ``requires-python``: An **optional** key that exposes the *Requires-Python* + metadata field, specified in :pep:`345`. Where this is present, installer tools + **SHOULD** ignore the download when installing to a Python version that + doesn't satisfy the requirement. +- ``dist-info-metadata-available``: An **optional** key that indicates + that metadata for this file is available, via the same location as specified in + :pep:`658` (`{file_url}.metadata`). Where this is present, it **MUST** be true, + or a dictionary mapping a hash name to a hex encoded digest of the metadata hash. +- ``gpg-sig``: An **optional** key that acts a boolean to indicate if the file has + an associated GPG signature or not. If this key does not exist, then the signature + may or may not exist. 
+- ``yanked``: An **optional** key which may have no value, or may have an + arbitrary string as a value. The presence of a ``yanked`` key SHOULD + be interpreted as indicating that the file pointed to by the ``url`` field + has been "Yanked" as per :pep:`592`. + +As an example:: + + { + "name": "holygrail", + "files": [ + { + "filename": "holygrail-1.0.tar.gz", + "url": "https://example.com/files/holygrail-1.0.tar.gz", + "hashes": {"sha256": "...", "blake2b": "..."}, + "requires-python": ">=3.7", + "yanked": "Had a vulnerability" + }, + { + "filename": "holygrail-1.0-py3-none-any.whl", + "url": "https://example.com/files/holygrail-1.0-py3-none-any.whl", + "hashes": {"sha256": "...", "blake2b": "..."}, + "requires-python": ">=3.7", + "dist-info-metadata-available": true + }, + ] + } + +In addition to the above, the following constraints are placed on the API: + +* While JSON doesn't natively support an URL type, any value that represents an + URL in this API may be either absolute or relative as long as they point to + the correct location. If relative, they are relative to the current URL as if + it were HTML. + +* Additional keys may be added to any dictionary objects in the API responses + and clients **MUST** ignore keys that they don't understand. + +* By default, any hash algorithm available via `hashlib + ` (specifically any that can + be passed to ``hashlib.new()`` and do not require additional parameters) can + be used as a key for the hashes dictionary. At least one secure algorithm from + ``hashlib.algorithms_guaranteed`` **SHOULD** always be included. At the time + of this PEP, ``sha256`` specifically is recommended. + +* Unlike ``data-requires-python`` in :pep:`503`, the ``requires-python`` key does not + require any special escaping other than anything JSON does naturally. + +* Future features **MAY** be implemented or only supported when operating under JSON. 
+ This would be decided on a case by case basis depending on how important the feature + is, how widely used HTML is at that point, and how difficult representing the feature + in HTML would be. + +* All requirements of :pep:`503` that are not HTML specific still apply. + + +FAQ +=== + + +Why JSON instead of X format? +----------------------------- + +JSON parsers are widely available in most, if not every, language. A JSON +parser is also available in the Python standard library. It's not the perfect +format, but it's good enough. + + +Why not add X feature? +---------------------- + +The general goal of this PEP is to change or add very little. We will instead focus +largely on translating the existing information contained within our HTML responses +into a sensible JSON representation. This will include :pep:`658` metadata required +for packaging tooling. + +The only real new capability that is added in this PEP is the ability to have +multiple hashes for a single file. That was done because the current mechanism being +limited to a single hash has made it painful in the past to migrate hashes +(md5 to sha256) and the cost of making the hashes a dictionary and allowing multiple +is pretty low. + +The API was generally designed to allow further extension through adding new keys, +so if there's some new piece of data that an installer might need, future PEPs can +easily make that available. + + +Why is the root URL a dictionary instead of a list? +--------------------------------------------------- + +The most natural direct translation of the root URL being a list of links is to turn +it into a list of objects. However, stepping back, that's not the most natural way +to actually represent this data. This was a result of a HTML limitation that we had to +work around. With a list (either of ```` tags, or objects) there's nothing stopping +you from listing the same project twice and other unwanted patterns. 
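The dictionary form of the root response can be sketched as follows. This is only an illustration under stated assumptions: the response body is copied from the root URL example in this PEP, the name normalization follows the rule described in :pep:`503`, and the helper names ``normalize`` and ``project_url`` are hypothetical.

```python
import json
import re
from typing import Optional

# Root response copied from the "Root URL" example in this PEP.
ROOT_RESPONSE = """
{
  "frob": {"url": "/frob/"},
  "spamspamspam": {"url": "/spamspamspam/"}
}
"""


def normalize(name: str) -> str:
    """Normalize a project name using the rule described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_url(root_body: str, name: str) -> Optional[str]:
    """Look up a project's URL by name, or return None if it is not listed.

    Hypothetical helper: because the root response is a dictionary keyed by
    normalized name, the lookup is a single key access rather than a scan.
    """
    projects = json.loads(root_body)
    entry = projects.get(normalize(name))
    return None if entry is None else entry["url"]
```

With a list of objects, the same lookup would need a linear scan, and nothing in the structure itself would rule out two entries for the same project.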
+ +A dictionary also allows for average-case constant-time lookup given the project name. + + +Why include the filename when the URL has it already? +----------------------------------------------------- + +We could reduce the size of our responses by removing the ``filename`` key and expecting +clients to pull that information out of the URL. + +Currently this PEP chooses not to do that, largely because :pep:`503` explicitly required +that the filename be available via the anchor tag of the links, though that was largely +because *something* had to be there. It's not clear if repositories in the wild always +have a filename as the last part of the URL or if they're relying on the filename in the +anchor tag. + +It also makes the responses slightly nicer to read for a human, as you get a nice short +unique identifier. + +If we gain reasonable confidence that we can mandate that the filename is in the URL, then +we could drop this data and reduce the size of the JSON response. + + +Why not break out other pieces of information from the filename? +---------------------------------------------------------------- + +Currently clients are expected to parse a number of pieces of information from the +filename such as project name, version, ABI tags, etc. We could break these out +and add them as keys to the file object. + +This PEP has chosen not to do that because doing so would increase the size of the +API responses, and most clients are going to require the ability to parse that +information out of file names anyway, regardless of what the API does. Thus it makes +sense to keep that functionality inside of the clients. + + +Why Content Negotiation instead of multiple URLs? +------------------------------------------------- + +Another reasonable way to implement this would be to duplicate the API routes and +include some marker in the URL itself for JSON, such as making the URLs something +like ``/simple/foo.json``, ``/simple/_index.json``, etc.
+ +This makes some things simpler like TUF integration and fully static serving of a +repository (since ``.json`` files can just be written out). + +However, this has two pretty major issues: + +- Our current URL structure relies on the fact that there is a URL that represents + the "root", ``/``, to serve the list of projects. If we want to have separate URLs + for JSON and HTML, we would need to come up with some way to have two root URLs. + + Something like ``/`` being HTML and ``/_index.json`` being JSON could work, since + ``_index`` isn't a valid project name. But ``/`` being HTML doesn't work great if + a repository wants to remove support for HTML. + + Another option could be moving all of the existing HTML URLs under a namespace while + making a new namespace for JSON. Since ``//`` was defined, we would have to + make these namespaces not valid project names, so something like ``/_html/`` and + ``/_json/`` could work, then just redirect the non-namespaced URLs to whatever the + "default" for that repository is (likely HTML, unless they've disabled HTML, then JSON). +- With separate URLs, there's no good way to support zero configuration discovery + that a repository supports the JSON URLs without making additional HTTP requests to + determine if the JSON URL exists or not. + + The most naive implementation of this would be to request the JSON URL and fall back + to the HTML URL for *every* single request, but that would perform horribly + and violate the goal of minimal additional HTTP requests. + + The most likely implementation of this would be to make some sort of repository level + configuration file that somehow indicates what is supported. We would have the same + namespace problem as above, with the same solution: something like ``/_config.json`` + could hold that data, and a client could first make an HTTP request to that, + and if it exists pull it down and parse it to learn about the capabilities of this + particular repository.
+- The use of ``Accept`` also allows us to add versioning into this field + +All being said, it is the opinion of this PEP that those three issues combined make +using separate API routes a less desirable solution than relying on content +negotiation to select the most ideal representation of the data. + + +Appendix 1: Survey of use cases to cover +======================================== + +This was done through a discussion between ``pip`` and ``bandersnarch`` +maintainers, where the two first potential use cases for the new API. This is +how they use the Simple + JSON APIs today: + +- ``pip``: + + - List of all files for a particular release + - Metadata of each individual artifact: + + - was it yanked? (`data-yanked`) + - what's the python-requires? (`data-python-requires`) + - what's the hash of this file? (currently, hash in URL) + - Full metadata (`data-dist-info-metadata`) + - [Bonus] what are the declared dependencies, if available (list-of-strings, null if unavailable)? + +- ``bandersnatch`` - Only uses legacy JSON API + XMLRPC today: + + - Generates Simple HTML rather than copying from PyPI + + - Maybe this changes with the new API and we verbatim pull these API assets from PyPI + + - List of all files for a particular release. + + - Workout URL for release files to download + + - Metadata of each individual artifact. 
+ + - Write out the JSON to mirror storage today (disk/S3) + + - Required metadata used (via Package class - https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py): + + - metadata["info"] + - metadata["last_serial"] + - metadata["releases"] + + - digests + - URL + + - XML-RPC calls (we'd love to deprecate - but we don't think should go in the Simple API) + + - [Bonus] Get packages since serial X (or all) + + - XML-RPC Call: ``changelog_since_serial`` + + - [Bonus] Get all packages with serial + + - XML-RPC Call: ``list_packages_with_serial`` From b8f03ef6d69de8b81b1fb421e3f6e19a19f5af2d Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 21:31:17 +0000 Subject: [PATCH 2/7] Adopt PEP number 691 --- pep-9999.rst => pep-0691.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename pep-9999.rst => pep-0691.rst (99%) diff --git a/pep-9999.rst b/pep-0691.rst similarity index 99% rename from pep-9999.rst rename to pep-0691.rst index 606bc4a34dd..56262b38834 100644 --- a/pep-9999.rst +++ b/pep-0691.rst @@ -1,4 +1,4 @@ -PEP: 9999 +PEP: 691 Title: JSON-based Simple API for Python Package Indexes Author: Donald Stufft , Pradyun Gedam , From 8625754e4a958227be0c190fabaca9c49f498bae Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 17:37:38 -0400 Subject: [PATCH 3/7] Update pep-0691.rst Co-authored-by: Jelle Zijlstra --- pep-0691.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pep-0691.rst b/pep-0691.rst index 56262b38834..3ce81bf29f5 100644 --- a/pep-0691.rst +++ b/pep-0691.rst @@ -262,7 +262,7 @@ In addition to the above, the following constraints are placed on the API: and clients **MUST** ignore keys that they don't understand. * By default, any hash algorithm available via `hashlib - ` (specifically any that can + `_ (specifically any that can be passed to ``hashlib.new()`` and do not require additional parameters) can be used as a key for the hashes dictionary. 
At least one secure algorithm from ``hashlib.algorithms_guaranteed`` **SHOULD** always be included. At the time From fad86a75ef85f73df79afc3eb173a2755e1f6e10 Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 17:40:33 -0400 Subject: [PATCH 4/7] Update pep-0691.rst Co-authored-by: Jelle Zijlstra --- pep-0691.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pep-0691.rst b/pep-0691.rst index 3ce81bf29f5..e44f009d187 100644 --- a/pep-0691.rst +++ b/pep-0691.rst @@ -404,7 +404,7 @@ Appendix 1: Survey of use cases to cover ======================================== This was done through a discussion between ``pip`` and ``bandersnarch`` -maintainers, where the two first potential use cases for the new API. This is +maintainers, who are the two first potential users for the new API. This is how they use the Simple + JSON APIs today: - ``pip``: From 551887e9018de101a81d6f03f3695d20d24aa8fb Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 21:42:45 +0000 Subject: [PATCH 5/7] Update .github/CODEOWNERS --- .github/CODEOWNERS | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index bb98413d806..ed0b3283827 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -571,6 +571,7 @@ pep-0687.rst @encukou pep-0688.rst @jellezijlstra pep-0689.rst @encukou pep-0690.rst @warsaw +pep-0691.rst @dstufft # ... # pep-0754.txt # ... 
From 1f2e3b49f8179a1b7714acd0bfaff6f7e753a356 Mon Sep 17 00:00:00 2001 From: Dustin Ingram Date: Thu, 5 May 2022 17:53:23 -0400 Subject: [PATCH 6/7] Update pep-0691.rst Co-authored-by: Jelle Zijlstra --- pep-0691.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pep-0691.rst b/pep-0691.rst index e44f009d187..5b8c4281054 100644 --- a/pep-0691.rst +++ b/pep-0691.rst @@ -5,7 +5,7 @@ Author: Donald Stufft , Cooper Lees , Dustin Ingram Status: Draft -Type: Informational +Type: Standards Track Content-Type: text/x-rst BDFL-Delegate: Donald Stufft Discussions-To: https://discuss.python.org/t/AAAAAA/999999 From 28225a0401a2c27a422d679759582e9d600f7402 Mon Sep 17 00:00:00 2001 From: Jelle Zijlstra Date: Thu, 5 May 2022 14:56:47 -0700 Subject: [PATCH 7/7] Update pep-0691.rst --- pep-0691.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/pep-0691.rst b/pep-0691.rst index 5b8c4281054..267d75d9284 100644 --- a/pep-0691.rst +++ b/pep-0691.rst @@ -432,14 +432,14 @@ how they use the Simple + JSON APIs today: - Write out the JSON to mirror storage today (disk/S3) - - Required metadata used (via Package class - https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py): + - Required metadata used (via Package class - https://github.com/pypa/bandersnatch/blob/main/src/bandersnatch/package.py): - - metadata["info"] - - metadata["last_serial"] - - metadata["releases"] + - metadata["info"] + - metadata["last_serial"] + - metadata["releases"] - - digests - - URL + - digests + - URL - XML-RPC calls (we'd love to deprecate - but we don't think should go in the Simple API)