Merge #356
356: Try to fix Scrapy ERROR: Spider error processing r=alallema a=alallema

Currently, when running docs-scraper, this error occurs:
```sh
2023-03-14 12:10:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://docs.meilisearch.com/learn/advanced/geosearch.html> (referer: https://docs.meilisearch.com/sitemap.xml)
Traceback (most recent call last):
  File "/Users/amelielallemand/.local/share/virtualenvs/docs-scraper-vWaWSN46/lib/python3.9/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/documentation_spider.py", line 170, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/documentation_spider.py", line 151, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/strategies/default_strategy.py", line 44, in get_records_from_response
    records = self.get_records_from_dom(response.url)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/strategies/default_strategy.py", line 67, in get_records_from_dom
    sys.exit('DefaultStrategy.dom is not defined')
SystemExit: DefaultStrategy.dom is not defined
```
This PR tries to fix it.
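
The last frame of the traceback is a `sys.exit` call with a string argument, which is why the log ends in `SystemExit` carrying that message rather than an ordinary exception. A minimal sketch of that behavior follows; `get_records_from_dom` here is a simplified, hypothetical stand-in and not the project's actual code:

```python
import sys


def get_records_from_dom(dom):
    # Hypothetical, simplified stand-in for the check seen in the traceback.
    if dom is None:
        sys.exit('DefaultStrategy.dom is not defined')
    return []


try:
    get_records_from_dom(None)
except SystemExit as exc:
    # sys.exit(<str>) raises SystemExit and stores the string in .code
    message = exc.code
```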

Co-authored-by: alallema <amelie@meilisearch.com>
meili-bors[bot] and alallema authored Mar 14, 2023
2 parents d36dd0e + 4c9514d commit 4df835a
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions scraper/src/strategies/abstract_strategy.py
@@ -39,9 +39,8 @@ def get_dom(response):
         try:
             body = response.body.decode(response.encoding)
             result = lxml.html.fromstring(body)
-        except (UnicodeError, ValueError):
+        except (UnicodeError, ValueError, lxml.etree.ParserError):
             result = lxml.html.fromstring(response.body)

         return result

     def get_strip_chars(self, level, selectors):
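
The change can be tried outside Scrapy with a small sketch, assuming `lxml` is installed. `FakeResponse` is a hypothetical stand-in that exposes only the `body` and `encoding` attributes used by `get_dom`, and `raises_parser_error` is an illustrative helper, not part of the repository. It shows that `lxml.html.fromstring` raises `lxml.etree.ParserError` on an empty document, which is the exception the new `except` clause also catches:

```python
import lxml.etree
import lxml.html


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response: only body and encoding."""

    def __init__(self, body, encoding="utf-8"):
        self.body = body
        self.encoding = encoding


def get_dom(response):
    # Same shape as the patched method, written as a plain function.
    try:
        body = response.body.decode(response.encoding)
        result = lxml.html.fromstring(body)
    except (UnicodeError, ValueError, lxml.etree.ParserError):
        # Fall back to letting lxml parse the raw bytes itself.
        result = lxml.html.fromstring(response.body)
    return result


def raises_parser_error(text):
    # Illustrative helper: report whether lxml rejects the input outright.
    try:
        lxml.html.fromstring(text)
    except lxml.etree.ParserError:
        return True
    return False


dom = get_dom(FakeResponse(b"<html><body><h1>Geosearch</h1></body></html>"))
heading = dom.findtext(".//h1")
empty_fails = raises_parser_error("")
```

Note that the fallback only helps when parsing the raw bytes succeeds where parsing the decoded string did not. A body that is empty in both forms would still raise from the second `fromstring` call.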