-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add --unstable-feature=regex_link_parsing for a 12% resolve speedup #8488
Conversation
f2bf236
to
922ce86
Compare
I really don't think we should be adding regex-based link parsing to pip -- HTML is based on a context free grammar, and regex can only model/match regular languages (which is strictly a subset of CFGs). There's a lot of "don't use regex for processing [X]HTML" content on the internet (for good reason). Here's one that comes to mind easily: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/ |
First off, should I have created an issue for this instead of a draft PR first? Second: Yes! Everything you said is totally correct! One thing that I find is often overlooked is that regular expressions + conditional logic and/or state can create a parser that can parse languages equivalent to CFGs and above. This is how emacs's The above bit about linguistic power with conditional logic is what wikipedia would call "original research", but I believe it is true. I am actually also working on publishing a novel parsing algorithm which is highly parallelizable and fully general in the background right now. The one concern I have for this being feasible long-term is that html parsers like html5lib will handle broken HTML, which would be hard to replicate, but:
If this was able to pass the same test suite as the html5lib parser, and there were many more tests added for things like broken HTML, could this feature ever be considered for merging (even under an unstable flag)? |
One concern I'd have is maintainability. If we were to adopt this code, it would in future be supported, and quite likely modified, by people who are not well-versed in the intricacies and risks of correctly parsing (possibly broken) HTML code. So there's a very real chance of bitrot. And describing this algorithm as "original research" doesn't do much to alleviate that concern 🙂 One of the maybe less obvious advantages of pip relying on external libraries (even if we have to vendor them) is that we delegate the expertise in the respective subject areas onto those downstream library maintainers. How viable would it be to split this code out into a reusable library? Maybe a dedicated PEP 503 parser? Then pip could relatively safely vendor that package without taking on the maintenance of that code. A 12% speed increase is not something we should ignore lightly, but speed is very definitely only one consideration out of many here. |
FWIW, there is a dedicated PEP 503 parser library -- https://github.com/brettcannon/mousebender. Adopting it is blocked on dropping Python 2 support. |
@pfmoore thank you for your thoughtful and detailed response! :D
Yep! I don't think this idea is really baked enough to work out right now. Thank you so much for talking it through with me!! ^_^ @pradyunsg: that dedicated parser library sounds like it would be a great place to focus my efforts in line with @pfmoore's comment. I do think a dedicated parser library is exactly where I would expect this kind of performance-sensitive code to be. |
TODO
Problem
In performing some profiling (with
py-spy
) for #8448 and #8480, I found that callingpip._internal.models.wheel.Wheel.supported()
andpip._internal.index.collector.parse_links()
(specificallyhtml5lib.parse()
) took up almost the entire time of finding candidates.I previously worked on this method in #7729, which was an optimization that
pex
version 1 had made before moving to invoking pip as a subprocess in pex versions >=2.Solution
--unstable-feature=regex_link_parsing
to use an experimental parser that uses regular expressions instead ofhtml5lib
.Result
A 12% decrease in the mean time to download
tensorflow==1.14.0
and all of its dependencies over 6 runs.