Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace lxml with Python's built in html.parser #301

Merged
merged 4 commits into from
Jun 20, 2018

Conversation

orenyomtov
Copy link
Contributor

After my server has been running for some time, fbchat starts throwing the following exception:

Failed loading session
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/fbchat/client.py", line 366, in setSession
    self._postLogin()
  File "/usr/local/lib/python2.7/dist-packages/fbchat/client.py", line 212, in _postLogin
    soup = bs(r.text, "lxml")
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 215, in __init__
    self._feed()
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 239, in _feed
    self.builder.feed(self.markup)
  File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 240, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1194, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:119773)
  File "src/lxml/parser.pxi", line 1237, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:118685)
  File "src/lxml/parser.pxi", line 827, in lxml.etree._BaseParser._getPushParserContext (src/lxml/lxml.etree.c:113960)
  File "src/lxml/parser.pxi", line 844, in lxml.etree._BaseParser._createContext (src/lxml/lxml.etree.c:114186)
  File "src/lxml/parsertarget.pxi", line 110, in lxml.etree._TargetParserContext._setTarget (src/lxml/
  File "src/lxml/parsertarget.pxi", line 41, in lxml.etree._PythonSaxParserTarget.__cinit__ (src/lxml/
  File "/usr/lib/python2.7/inspect.py", line 814, in getargspec
    func = func.im_func
RuntimeError: restricted attribute

Restarting the server fixes the problem temporarily, but it returns after a while.

Researching the issue brought up this problem being reported several times and considered "The unpredictable BS behavior again."
image

The only remedy being offered is replacing lxml.

This pull request replaces lxml with Python's built in html.parser.

@madsmtm
Copy link
Member

madsmtm commented Jun 18, 2018

The main reason for using lxml is the speed benefit, but I went ahead and ran a few tests against data from the three Facebook urls we're be parsing with BeautifulSoup:

from requests import get
from timeit import timeit
from bs4 import BeautifulSoup as bs
n = 2000

d = get("https://m.facebook.com/").text
timeit('bs(d, "html.parser")', number=n, globals=globals())/n
# -> 0.0039508302819886015
timeit('bs(d, "lxml")', number=n, globals=globals())/n
# -> 0.0026858313914999597

d = get("https://www.facebook.com/").text
timeit('bs(d, "html.parser")', number=n, globals=globals())/n
# -> 0.037392912552997586
timeit('bs(d, "lxml")', number=n, globals=globals())/n
# -> 0.03604348213601043

d = get("https://m.facebook.com/login.php?login_attempt=1").text
timeit('bs(d, "html.parser")', number=n, globals=globals())/n
# -> 0.0038544781890086596
timeit('bs(d, "lxml")', number=n, globals=globals())/n
# -> 0.002652162682003109

The speed benefit is mostly negligible (compared to IO), and non-existent in the most time consuming case, so I'd argue this was a classic example of premature optimisation.
I've also seen a lot of people having trouble installing lxml in the past, see #101, #119, #113, #230 and #272.
I'll test it with TravisCI when I get it up and running again, and I'll make sure to include it in the next release. Thanks ;)

@madsmtm madsmtm added the bug label Jun 18, 2018
@madsmtm madsmtm merged commit eaaa526 into fbchat-dev:master Jun 20, 2018
madsmtm added a commit that referenced this pull request Jun 20, 2018
* Added `removeFriend` method, #298
* Removed `lxml` from dependencies, #301
* Moved configuration to setup.cfg instead of setup.py
@chelsybradley15 chelsybradley15 linked an issue Mar 25, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Screenshot (Mar 25, 2023 6:10:08 AM)
2 participants