-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test_parser.py fails most tests when running with vendored feedparser #350
Comments
I managed to hunt down the root cause of the parser failures. It's caused by this part of the vendored feedparser code (which is different from the latest official feedparser release): reader/src/reader/_vendor/feedparser/api.py Lines 320 to 332 in 63e8ec3
Apparently, There is a secondary problem in that the error is not detected because the code here tries to catch an I don't know why the --- a/src/reader/_vendor/feedparser/api.py
+++ b/src/reader/_vendor/feedparser/api.py
@@ -316,13 +316,7 @@ def _parse_file_inplace(
saxparser.setContentHandler(feed_parser)
saxparser.setErrorHandler(feed_parser) # type: ignore[arg-type]
source = xml.sax.xmlreader.InputSource()
-
- # If an encoding was detected, decode the file on the fly;
- # otherwise, pass it as-is and let the SAX parser deal with it.
- try:
- source.setCharacterStream(stream_factory.get_text_file())
- except MissingEncoding:
- source.setByteStream(stream_factory.get_binary_file())
+ source.setByteStream(stream_factory.get_binary_file())
try:
saxparser.parse(source) So I think there are two action items here:
|
Great debugging work! Funnily enough, the api.py code you link above was last touched by myself in kurtmckee/feedparser#302; that PR was merged into feedparser/develop some time ago, so if it is that code that has the problem, we'll have to fix it upstream. I should note that I am (obviously) not seeing the Re. error handling in feedparser.parse():
Specifically, I think the following sequence happens:
I will assume the same kind of shadowing happens before my kurtmckee/feedparser#302 PR, but it is never apparent because only the setByteStream branch exists (setCharacterStream is never used): Re. why setCharacterStream() doesn't work as intended: I will try to trace that issue more carefully later (ran out of time for today), but one possible root cause is that your Python ends up with different Meanwhile, can you please help me confirm the text file is indeed returning strings by adding the following to reader._vendor.feedparser.api line 321, and running one of the failing tests with file = stream_factory.get_text_file()
print('one:', repr(file.read()[:40]))
print('two:', repr(file.read())
raise Exception('failing intentionally') (If both of those are strings, it may be reasonable to assume your libxml2 only supports bytes / binary files.) |
I think you're right in your analysis except for this:
This doesn't happen, because if no exception is raised, then That snippet prints:
which works as expected because the first call consumes all the input. But something weird happens if I change the code to:
Then it prints:
(I truncated the second line.) Note that the first call returns 40 characters as expected, but the second read call returns the entire file including the first 40 bytes. That's not how read() is supposed to behave; the final read() call should return just the remaining text. I created a pull request to fix this upstream: kurtmckee/feedparser#469, but I don't think it's the root cause of the problem here. |
So I hunted the problem down to libxml2. It looks like it doesn't support text input at all. Here's a test case that you can use to reproduce the "result is not a String" error: https://gist.github.com/maksverver/7ec9221f163070cbd98f4f38a3932036 @lemon24 can you check which XML parser is being used on your system? Is it libxml2 or something else? If libxml2, which version? (I'm using version 2.13.3.) |
I tracked down the issue to libxml2 and filed an upstream issue, which was promptly fixed: https://gitlab.gnome.org/GNOME/libxml2/-/issues/790 The summary is that the current version of feedparser (which calls So I think the issue is resolved when a new libxml2 release contains the fix. That does leave the other issue I mentioned: fix the error handling around the subsequent parse() call, since the current |
Previously these errors were ignored, since an exception is raised only on fatal errors. With this change, when a non-fatal error occurs, the bozo bit is still set, but the feed is not reparsed with the loose parser. Background and discussion here: lemon24/reader#350
Previously these errors were ignored, since an exception is raised only on fatal errors. With this change, when a non-fatal error occurs, the bozo bit is still set, but the feed is not reparsed with the loose parser. Background and discussion here: lemon24/reader#350
Previously these errors were ignored, since an exception is raised only on fatal errors. With this change, when a non-fatal error occurs, the bozo bit is still set, but the feed is not reparsed with the loose parser. Background and discussion here: lemon24/reader#350
Previously these errors were ignored, since an exception is raised only on fatal errors. With this change, when a non-fatal error occurs, the bozo bit is still set, but the feed is not reparsed with the loose parser. Background and discussion here: lemon24/reader#350
Indeed, xml.sax.expatreader.ExpatParser, on both my old-ish Mac and on an Ubuntu 22.04. |
@maksverver thank you for following up in many different places! I will try to review the remaining feedparser PR by the end of the weekend to help out. Once merged into develop, I can update the vendored version too. |
@maksverver, are you OK until then? I was going to suggest monkey-patching feedparser.api.PREFERRED_XML_PARSERS, but that's made a bit more complicated by me vendoring it. I am reluctant to monkeypatch feedparser itself in reader, since other things may be using it, but there should be no issue with monkeypatching the vendored version. (Users can make this choice, since in theory they have more visibility into the deployment environment.) Possible mitigations on reader side (easiest first):
|
I'm good for now. I plan to wait for the libxml2 patch to be released before I update the AUR package to use the vendored version of feedparser. (If there are any bugs in the 6.0.11 release I haven't run into them yet.) It would be cool if at some point the feedparser fixes are released so you don't need to bundle the development version with reader. I dislike having different versions of the same library on my system; it often leads to difficult-to-debug problems. I do find it strange that feedparser seems to prefers libxml2 via PREFERRED_XML_PARSERS, yet all the tests run against Expat only. This seems like a recipe for trouble, if there are bugs on systems with libxml2 installed, they won't be detected by the tests. My preference would be to run the tests against both expat and libxml2 to make sure both work as intended, but I don't know how easy that would be to setup. |
libxml2 2.13.4 was released two weeks ago, so this is no longer an issue :) |
Don't disagree, but it is what it is :) Reopening this so I can add an environment variable to allow using "system" (non-vendored) feedparser; having no escape hatch for dealing with issues like this one sits a bit wrong with me. (In theory I could do away with the vendored one entirely and recommend users do |
Too many failures to list them all, but here is a representative sample:
It looks like no XML feeds (neither RSS nor Atom) are recognized. Deleting the src/reader/_vendor/ subdirectory (and applying this patch to update the imports) fixes the issue.
I'm running on Arch Linux, using system packages. Installed Python package versions:
The text was updated successfully, but these errors were encountered: