Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admin::render_readmes: Support out of order tar files #3971

Closed
wants to merge 4 commits into from

Conversation

nipunn1313
Copy link
Contributor

Depends on #3970

(Because of community/community#4477 - github ends up rendering both diffs together. Go to the commits tab and just look at the most recent commit to review).

Achieves this by pulling the entire tar file into
memory in the form of a hashmap. Then we can look for
the files as needed.

Before, we would stream the pkg response from the server
and un-tar it on the fly, but this behavior is incorrect
if the Cargo.toml isn't at the beginning of the archive.

Could imagine this might be problematic if the entire
archive won't fit in memory - but that seems unlikely?
We could mitigate by limiting to certain file types, but
we won't know what type the README will be.

Added a test to for this case - the test fails
before this change and passes with this change
@Turbo87
Copy link
Member

Turbo87 commented Oct 5, 2021

hmm, I do admit that I'm a bit worried about pulling all of the contents into memory. we do have a few crates that are several dozens of megabytes gzipped, so uncompressing all of their contents and keeping them in memory might be problematic.

@nipunn1313
Copy link
Contributor Author

One saving grace is that it's only pulling them in from memory one at a time. Also it's only doing it on the machines running the admin scripts. Several dozens of megabytes in memory seems fine?

Mitigating ideas include

  • Streaming the entire crate off of crates.io multiple times to get the various files (seems inelegant). One pass to get the Cargo.toml and another pass to get the README.
  • heuristically stripping out parts that we're definitely not going to read. Since we're looking for README files, we can filter by extension to eliminate large files that aren't going to be read.

If I were deploying this, I'd probably try to avoid adding such heuristics/complexity until they are proven necessary. I've often found that unnecessary complexity like that can be a cure that's worse than the disease.

Since it's an admin functionality, we could try it out and fix based on the issues that come up. I don't have access to run the admin functionality (of course) - but if you assigned me a task with the failing crates, I could look at them and add heuristics around the specific issues that come up. My suspicion is that nothing would come up - as it seems unlikely that we'd have gigabyte sized crates (is there a way to check easily ahead of time?)

@nipunn1313
Copy link
Contributor Author

Closing based on discussion - not planning to maintain admin::render_readmes in this state.

Ideally admin::render_readmes is not downloading anything from crates-io.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants