todo.org

https://www.lpalmieri.com/posts/2020-09-27-zero-to-production-4-are-we-observable-yet/#GDPR

bundle licenses:

  • cargo-about
  • cargo-bundle-licenses
  • cargo-lichking

git resolving: keep git clones in cache
  • https://docs.rs/git2/latest/git2/struct.Repository.html#method.resolve_reference_from_short_name
  • https://docs.rs/git2/latest/git2/struct.Reference.html#method.peel_to_commit
  • https://docs.rs/git2/latest/git2/struct.Commit.html#method.id
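
a minimal sketch of that chain using the git2 calls linked above (resolve_git_rev is a made-up helper name; cache layout and error handling not shown):

fn resolve_git_rev(repo: &git2::Repository, short_name: &str) -> Result<git2::Oid, git2::Error> {
    // short_name is e.g. "main" or "v1.2.3"
    let reference = repo.resolve_reference_from_short_name(short_name)?;
    // follow branches/tags down to the underlying commit
    let commit = reference.peel_to_commit()?;
    Ok(commit.id())
}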

quoting on posix: surround with ' and replace each internal ' with '"'"'
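
a tiny sketch of that rule (posix_quote is a hypothetical helper, not necessarily how posy will implement it):

fn posix_quote(s: &str) -> String {
    let mut out = String::from("'");
    for ch in s.chars() {
        if ch == '\'' {
            // close the single-quoted run, emit the ' inside double quotes, reopen
            out.push_str(r#"'"'"'"#);
        } else {
            out.push(ch);
        }
    }
    out.push('\'');
    out
}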

in .posy/ workspace, create .gitignore with '*' in it, so automatically gitignored

swap over to this for our newtypes? https://docs.rs/aliri_braid/latest/aliri_braid/index.html and also short strings

https://blog.yossarian.net/2022/05/09/A-most-vexing-parse-but-for-Python-packaging (and PEP 625 discussion) https://discuss.python.org/t/pep-625-file-name-of-a-source-distribution/4686/27

switch to this for error reporting? https://docs.rs/miette/latest/miette/ can make pretty toml errors (https://docs.rs/toml_edit/latest/toml_edit/struct.TomlError.html), but even better would be if we can convince it to make pretty explanations of failed resolution! (well, might have to do this on our own) maybe also look at https://docs.rs/ariadne/0.1.5/ariadne/

Actual todo/planning list

  • Pybis
    • build latest patch releases
    • clean up build tooling
    • set up GHA or similar automation
    • post PEP
  • UI
    • Tracing
      • better rendering of errors (esp.) and warnings etc.
      • make verbose mode make sense
        • maybe: higher levels of -v on their own show our debug stuff and also our span entry/exit so you can see what’s happening, but not other modules; and also some kind of EnvFilter support, so we can get stuff like TLS handshake traces or HTML parsing traces if we want (with span context)
        • maybe our context! spans should have different priorities, and higher priority ones should be printed as we go to give info on progress?
      • progress bars
    • Command line parsing
    • Finding pyproject.toml
    • pyproject language for describing environments (steal heavily from Hatch, or even just merge the projects I guess)
    • ‘posy new’
      • ‘posy new --adopt’
    • update checking (crates.io/crate/update-informer?)
  • storage
    • GC
  • installer
    • use saved env markers to pick between multiple blueprints
    • figure out what to do with RECORD
    • environment export
      • factor out common blueprint installation logic
  • sdist handling
    • get metadata during resolution
    • build during install
  • @ dependencies
    • git
    • relative paths
    • random urls
  • workspaces?
  • package db
    • multiple hash support + json support
    • .metadata support
  • Resolver
    • proper pybi resolution (combine version ranking code with wheels)
    • minimal-upgrade support
    • Readable failure reporting
    • progress reporting (progress bar showing (max(candidates.len()) - current(candidates.len())? show the top package/version pairs in each candidate request?)
    • performance:
      • better range representation
      • better next-package heuristics
        • maybe pulling in the priority queue PR
      • more efficient network usage
  • Trampolines
    • relative path support for exporting environments
    • fix quoting in windows wrappers
      • actual parsing/quoting algorithm ref
      • but what we actually need is much simpler than that! ref (which also means our current code is accidentally correct, b/c the broken quoting never executes)
    • maybe remove link to shell32.dll? ref
  • ‘posy self upgrade’ (or ‘up’ or ‘update’ or whatever we pick)
  • ‘posy self info’ -> version, license, detected platform, paths, …?
  • Add #[serde(deny_unknown_fields)] on most structs?

when examining saved blueprints to create env for brief:

  • probably want a copy of that brief
  • not clear to me whether we should save the marker exprs + revalidate them, or save the original requirements + revalidate them. latter seems maybe simpler?
  • probably want a version of resolve() that returns a consolidated bundle of blueprints? maybe take either Any([&PybiPlatform]) or All([&PybiPlatform]) and consolidate the massively overlapping stuff, like hashes + wheel metadata (might also want to record which platforms each sub-blueprint was resolved for, just for documentation, and maybe to short-circuit validation, and so we can detect when the user has changed the set of desired platforms and we need to re-resolve). see the sketch after this list
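
a rough sketch of what that resolve() variant might look like; PlatformSpec, ConsolidatedBlueprint, and the field names are all speculative, and PybiPlatform / Blueprint stand in for the real types:

struct PybiPlatform;  // placeholder for the real platform type
struct Blueprint;     // placeholder for one resolved blueprint

enum PlatformSpec<'a> {
    // the environment only has to work on at least one of these platforms
    Any(&'a [PybiPlatform]),
    // the environment has to work on all of these platforms
    All(&'a [PybiPlatform]),
}

struct ConsolidatedBlueprint {
    // hashes + wheel metadata would be shared across sub-blueprints; per entry we keep
    // which platforms that sub-blueprint was resolved for (documentation, short-circuit
    // validation, and detecting when the user changes the desired platform set)
    per_platform: Vec<(Vec<PybiPlatform>, Blueprint)>,
}

// fn resolve(brief: &Brief, platforms: PlatformSpec) -> Result<ConsolidatedBlueprint>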

sdist support

given sdist AI + envmaker, get metadata / wheel

so db.get_metadata needs to have access to envmaker+context somehow. maybe need an sdistbuilder object to encapsulate? (would also be a good place to stash stuff like build-constraints if we add them in the future)

WheelBuilderConfig { PackageDB, EnvMaker, /* constraints, …? */ }

maybe the Config should also keep a tempdir cache of unpacked sdists, b/c it’s likely that within a single session we may want to first run get_requires_for_build_wheel and then build_wheel

// whenever we start building an sdist, we first need to walk the stack to make sure
// there's no loop, so we can also access the root's WheelBuilderConfig at the same time
enum WheelBuilder<'a> {
    Root(WheelBuilderConfig),
    Nested(&'a WheelBuilder<'a>, &'a PackageName),
}

impl<'a> WheelBuilder<'a> {
    fn check(&self, new_name: &PackageName) -> Result<&WheelBuilderConfig> {
        match self {
            WheelBuilder::Root(config) => Ok(config),
            WheelBuilder::Nested(parent, name) => {
                if *name == new_name {
                    bail!("sdist build loop");
                } else {
                    parent.check(new_name)
                }
            }
        }
    }

    fn new_child<'b>(
        &'b self,
        name: &'b PackageName,
    ) -> Result<(WheelBuilder<'b>, &'b WheelBuilderConfig)> {
        let config = self.check(name)?;
        Ok((WheelBuilder::Nested(self, name), config))
    }

    // pybi needs to come in here somewhere too… part of the config, or part of the
    // invocation?
    fn make_metadata(&self, sdist: &Sdist) -> Result<WheelCoreMetadata> {
        let child = self.new_child(&sdist.name.distribution)?;
        // unpack, read pyproject.toml, make brief
        // pass 'child' into the resolver as the builder for any sdists it needs
        // do pep517 stuff
        // this should have the option of stashing the built wheel in the cache,
        // in case it's forced to build one
    }

    fn make_wheel(&self, sdist: &Sdist) -> Result<Wheel> { … }
}

so &WheelBuilder going into resolver, package_db.get_metadata, maybe get_artifact::<Wheel>?

wheel caching: store mapping sdist -> dir; dir maps compat key -> wheel

compat keys: if wheel has ‘any’ tag, use its actual tag as the key:
  sdist_hash/py3-none-any/foo-12.3-py3-none-any.whl
  sdist_hash/py37-none-any/foo-12.3-py37-none-any.whl
(allow dotted names here, we can parse and expand during retrieval)

if it has an abi, take the most-restrictive (highest priority) wheel tag. …maybe we should have some hack like defining our own “posy_local_manylinux_2_24_x86_64” tags? in practice locally-built wheels will end up with -linux_x86_64 tags, and currently we don’t believe those are compatible, so that’s an issue. and we can’t just add those tags in general, b/c then we’d conflate locally-built manylinux + musllinux wheels in the cache

oh shoot, build-constraints would also need to be included in the cache key. or maybe better, a record of which build-dependencies were actually used, that we can check against when looking it up?

so the local-wheel cache is more like a map sdist -> set<(build context, wheel)>, where we treat all the wheels as candidates and loop through to pick the one we like best (or make a new one). maybe need a new KV*Store for this honestly

choosing pybi to build with: for metadata we already assume that any wheel is as good as another, so we might as well do the same for building metadata? …though, we assume that any wheel is as good as another for wheels that exist, but b/c of python-requires (explicit or implicit) building a wheel on the wrong pybi might just fail. so maybe we should pass in a python version?

for building (installing) a wheel, have a specific pybi in mind. we want to use exactly that one if possible. otherwise want something “close” (e.g. same version but different platform tag).

maybe pass in the AI for the pybi we’re actually using for (install/resolve), and then have fallback logic inside the SdistBuilder that tries “next best” if it can?

and in cache, for each wheel include (see the sketch after this list):

  • the wheel
  • the build tool versions used (so can filter for build-constraints post hoc)
  • the pybi name used?
  • the platform built on (could just dump the full WheelPlatform tag set) then accept if either wheel’s actual tag matches, or build platform is a subset of target platform?
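
a possible shape for those cache entries, with all names invented here (not posy's actual cache format):

use std::collections::HashMap;
use std::path::PathBuf;

// one locally-built wheel plus the context it was built in, so we can filter
// post hoc against build-constraints and the target platform
struct BuiltWheelEntry {
    wheel_path: PathBuf,
    build_tool_versions: Vec<(String, String)>, // (package, version) actually used
    pybi_name: Option<String>,                  // pybi used for the build
    build_platform_tags: Vec<String>,           // full WheelPlatform tag set of the builder
}

// keyed by sdist hash; every entry is a candidate, and lookup loops through them
// to pick the best match (or decides to build a fresh wheel)
type LocalWheelCache = HashMap<String, Vec<BuiltWheelEntry>>;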

PEP 517 loop:
  goal -> {metadata, wheel}, also backend-path
  state: initial -> all_requires -> prepare_metadata_for_build_wheel OR build_wheel

so the python script will get passed the relevant pyproject.toml data, the state, and the goal, and will return either a wheel or a .dist-info directory (see the sketch after the list below)

ideally:

  • some way to pass in config settings
  • some way to hold onto unpacked sdists and .dist-info directories, so if the same process both gets metadata and then builds a wheel we can reuse that part
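
a rough sketch of the driver-side types that protocol implies; every name here is made up for illustration, not posy's actual API:

use std::path::PathBuf;

enum BuildGoal {
    Metadata, // we only need a .dist-info directory
    Wheel,    // we need an actual built wheel
}

enum Pep517State {
    Initial,
    RequiresKnown, // get_requires_for_build_wheel has already run
}

// what we hand the helper script
struct Pep517Request {
    build_system: String, // the relevant [build-system] table from pyproject.toml
    config_settings: Vec<(String, String)>, // "some way to pass in config settings"
    state: Pep517State,
    goal: BuildGoal,
}

// what the helper script hands back
enum Pep517Output {
    DistInfoDir(PathBuf), // from prepare_metadata_for_build_wheel
    Wheel(PathBuf),       // from build_wheel
}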

resolution:

  • need to know target pybi + platform, pass into db.metadata

ideally can also check for built wheels? which means “find all wheels from this sdist” operation, which is tricky if they have unique ids… could also just unconditionally cache metadata when creating, which seems easier

install:

  • given sdist + target info, want an id and maybe a wheel …yeah I think we always fetch id+wheel. if we already have the wheel unpacked in the forest, then fetching the wheel itself is a bit of overhead (substantial if it’s not cached anymore, minor but non-zero if it is). but realistically we can’t figure out which id is correct without consulting some version of the wheel cache, and if we really care about optimizing time-to-enter-cached-environment then we’ll want another layer of whole environment caching anyway.
  • in env.rs will want to tweak pick_pinned and WheelResolveMetadata::from. right now pick_pinned returns wheel (cool) and wheel_ai, which is needed for hash (okay) and also for attaching provenance to the wheel metadata for error reporting (ugh). maybe move provenance into the Wheel object itself?

thinking about more interpreter types

probably simpler to model universal2 pybis as a special type of artifact: it’s really 2 pybis, so maybe just generate 2 separate ArtifactInfos (or rename to ArtifactRef?), each with an extra hint for which platform it is. they share a hash, but the two ArtifactInfos report a different bin/ directory (for adding to PATH and setting POSY_PYTHON), and then during install we create two separate sets of arch-specific trampolines.

@ dependencies: also a special kind of ArtifactInfo I guess?
  • git urls: have a hash, act as an sdist
  • binary urls: might have a #sha256=… style hash, might not. I guess might be able to use HTTP cache info to decide whether to re-fetch? idk. need to see what people want I guess.

adopting an external python: maybe compute a “hash” based on interpreter file fingerprint (last-changed, inode, etc.), and “unpack” it by creating a venv under that hash? https://apenwarr.ca/log/20181113
probably need to `import packaging` to generate metadata, so that needs an envforest… I guess we can use the sdist one. PackageDB can import it as a usable python if configured to do so, and cache its metadata

artifact types

ArtifactRef -> (package, version) or (url) or… ReleaseRef?

Artifact -> Wheel/Pybi/Sdist, each a wrapper around a Read+Seek (might be a file, might be a lazy remote file…), with methods to fetch metadata, unpack?
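
a loose sketch of that split; all of these names and shapes are guesses for illustration, not the real definitions:

use std::io::{Read, Seek};

// "where does it come from": an index (package, version) entry, a bare url, ...
enum ArtifactRef {
    Registry { package: String, version: String },
    Url(String),
}

// the artifact itself: a typed wrapper around something Read + Seek, which might be
// a local file or a lazily-fetched remote file
enum Artifact<R: Read + Seek> {
    Wheel(R),
    Pybi(R),
    Sdist(R),
}

impl<R: Read + Seek> Artifact<R> {
    // methods to fetch metadata, unpack into the forest, etc. would live here
}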

PEP 643 (reliable metadata in sdists)

apparently this is a thing now! in an sdist, look in PKG-INFO in the root, and if Metadata-Version >= 2.2 and the fields we need are not listed in Dynamic: then we’re good.

…and actually, I feel like a good resolution algorithm might be, trust PKG-INFO for all sdists, and then do the expensive prepare_metadata_for_build_wheel thing for all the unreliable sdists and replan if any of them turn out to have been wrong?

(the idea is that in most cases, the PKG-INFO will be reliable, so 99% of the time we can avoid building wheels for packages unless we’re actually going to install them)
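
roughly, the check described above (field names per PEP 643; the helper itself is hypothetical):

// metadata parsed out of an sdist's PKG-INFO is trustworthy if Metadata-Version is
// at least 2.2 and none of the fields we rely on are declared Dynamic
fn pkg_info_is_reliable(
    metadata_version: (u32, u32),
    dynamic_fields: &[String],
    needed_fields: &[&str],
) -> bool {
    metadata_version >= (2, 2)
        && needed_fields.iter().all(|needed| {
            !dynamic_fields
                .iter()
                .any(|d| d.eq_ignore_ascii_case(needed))
        })
}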

check if we’re using the same method of finding .dist-info as pip

https://github.com/pypa/pip/blob/bf91a079791f2daf4339115fb39ce7d7e33a9312/src/pip/_internal/utils/wheel.py#L84-L114

better version ranges

I’m thinking: for each package, split available versions into three “tiers”:
  • tier 1 (preferred): any “hinted” versions (like previously pinned version)
  • tier 2 (neutral): the regular non-yanked non-prerelease versions, in order
  • tier 3 (dispreferred): all prereleases (some question about whether to consider yanked here too; or that could be tier 4)

within each tier, intern to get a vector of ints; a version set is represented as 3 sets of ranges, one for each tier
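
something like the following, though note the pre-release reasoning below ends up sinking this for tier 3; names are invented:

// each tier's versions are interned to small integers (ordered by preference within
// the tier), so a version set becomes three lists of integer ranges
struct TieredVersionSet {
    hinted: Vec<std::ops::Range<u32>>,       // tier 1: e.g. previously pinned versions
    normal: Vec<std::ops::Range<u32>>,       // tier 2: regular non-yanked, non-prerelease
    dispreferred: Vec<std::ops::Range<u32>>, // tier 3: prereleases (maybe tier 4: yanked)
}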

when picking the next (package, version) to try, we always prefer candidates from a higher tier, so e.g. we never try any version from tier 3 until all packages have exhausted all their tier 1 and tier 2 options

…or, hmm. Does this actually work? when pubgrub asks us for the next (package, version) candidate assignment, then it restricts it to only a subset of packages. so we could be in a situation where the only valid candidates from those packages are pre-releases, b/c of the constraints set by the versions already chosen, even though there still exists some other resolution that doesn’t use the pre-releases.

What this might do though is give the equivalent to “we only return pre-releases if explicitly requested”? …ah, but no, if someone says `foo >= 10` and the only version >=10 is a pre-release, it could be selected. So this whole approach for pre-releases doesn’t work.

make resolution less wildly inefficient when choosing next candidate

[NOTE: there’s a PR for this: pubgrub-rs/pubgrub#104]

right now, every time pubgrub wants to consider a candidate, it gives us a set of ranges for all the packages under consideration, and then for every one, we do an O(n) loop through every package version, filtering out which ones fit into the range. This is at least accidentally quadratic, quite possibly worse. There’s gotta be a better data structure here.

One idea: with pubgrub custom Range trait support, have the range objects themselves aware of the complete version set and track which packages fit, propagating this incrementally through range operations?

(Or just storing the candidate versions sorted could also help quite a lot, b/c that could make counting ~O(number of spans in range * log(n)) and “find max in range” even ~O(log(n)).)
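
for instance, with the versions interned to integers and kept sorted, both operations fall out of partition_point (a sketch, not what the resolver currently does):

// largest interned version in the half-open range [lo, hi), if any
fn max_in_range(sorted: &[u32], lo: u32, hi: u32) -> Option<u32> {
    let start = sorted.partition_point(|&v| v < lo); // first index with v >= lo
    let end = sorted.partition_point(|&v| v < hi);   // first index with v >= hi
    if start < end { Some(sorted[end - 1]) } else { None }
}

// how many interned versions fall in [lo, hi)
fn count_in_range(sorted: &[u32], lo: u32, hi: u32) -> usize {
    sorted.partition_point(|&v| v < hi) - sorted.partition_point(|&v| v < lo)
}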

COMPLICATION: @ dependencies. I think we… actually cannot support these within pubgrub’s model? They make the intern-all-versions thing tricky of course, because they’re new versions we can discover as we go. But even setting interning aside, a set we previously told pubgrub was empty could suddenly become non-empty, which could break the inferences it made from that, etc. Fortunately, @ dependencies are supposed to be forbidden inside packages, so… say that @ dependencies are only supported at the top level, and must be specifically mentioned in pyproject.toml? The prodigy-teams kind of case might want to also allow them in sibling projects within the same workspace. Then we can process them up-front, and simply tell pubgrub that these are the only versions available of those packages, the end. (Probably also want some kind of support for ‘override’ requirements there, which are regular dependencies that will usually be @ in practice, and that cause all other version constraints on that package to be totally ignored.)

optimizing network usage during resolution

when we pull up the list of requirements for some package, we can immediately fetch the simple pages for all of them in parallel, and even the metadata for the most-preferred version, ideally over HTTP/2 or HTTP/3, and/or in the background

We could even prime the pump by pre-fetching all the packages listed in the .lock or user-requirements
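
a sketch of that prefetch, assuming an async HTTP stack (reqwest + futures here purely for illustration; posy's actual I/O layer may look nothing like this):

async fn prefetch_simple_pages(client: &reqwest::Client, packages: &[String]) {
    let fetches = packages.iter().map(|pkg| {
        let url = format!("https://pypi.org/simple/{}/", pkg);
        let client = client.clone(); // reqwest::Client is cheap to clone (Arc inside)
        async move {
            // fire-and-forget cache warm-up; errors just mean we fetch later as usual
            let _ = client.get(url).send().await;
        }
    });
    futures::future::join_all(fetches).await;
}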