Fix for Issue #840 - URL pattern too broad #879

Merged (3 commits) on Mar 9, 2017

@rappdw (Contributor) commented Mar 9, 2017

Changed _URL_PATTERN in language/tokenizer_exceptions.py to use the URL validation regex from https://mathiasbynens.be/demo/url-regex

Description

This change screens out a number of patterns that are not URLs and picks up some URL patterns that were previously missed.
Test cases were added to test_urls to expose the issue and validate the fix.
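
To illustrate the kind of difference described above, here is a minimal sketch contrasting a loose pattern with a stricter one. Both `BROAD_URL` and `STRICT_URL` are hypothetical, simplified patterns written for this example; neither is the old or new `_URL_PATTERN` from spaCy, and the stricter one is far less thorough than the regex at the linked page. The helper `is_url` is likewise an assumption made for illustration.

```python
import re

# Hypothetical loose pattern: any "scheme:rest" string is treated as a URL.
BROAD_URL = re.compile(r"^[A-Za-z]{3,9}:(?://)?\S+$")

# Hypothetical stricter pattern (a simplification, not spaCy's actual regex).
STRICT_URL = re.compile(
    r"^(?:(?:https?|ftp)://|www\.)"             # known scheme with "//", or a "www." prefix
    r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+"   # one or more host labels, each followed by a dot
    r"[a-z]{2,}"                                # alphabetic top-level domain
    r"(?::\d{2,5})?"                            # optional port
    r"(?:[/?#]\S*)?$",                          # optional path, query, or fragment
    re.IGNORECASE,
)


def is_url(text, pattern=STRICT_URL):
    """Return True if the whole string matches the given URL pattern."""
    return pattern.match(text) is not None


if __name__ == "__main__":
    candidates = ["http://www.example.com/path?x=1", "www.example.com",
                  "NASDAQ:GOOG", "http://foo"]
    for candidate in candidates:
        # Print the candidate, then the loose result, then the strict result.
        print(candidate, is_url(candidate, BROAD_URL), is_url(candidate))
```

Under these assumed patterns, a string such as `NASDAQ:GOOG` is accepted by the loose pattern but rejected by the stricter one, while `www.example.com` is rejected by the loose pattern (it has no scheme) and accepted by the stricter one.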

Types of changes

  • [x] Bug fix (non-breaking change fixing an issue)
  • [ ] New feature (non-breaking change adding functionality to spaCy)
  • [ ] Breaking change (fix or feature causing change to spaCy's existing functionality)
  • [ ] Documentation (addition to documentation of spaCy)

Checklist:

  • [ ] My change requires a change to spaCy's documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.

@honnibal (Member) commented Mar 9, 2017

🎉
Awesome before and after here --- thanks for taking the time to lay out the regex nicely (not always easy!).

@honnibal merged commit dd13aac into explosion:master Mar 9, 2017
@rappdw deleted the rappdw/tokenizer_exceptions_url_fix branch March 9, 2017 19:44
@ines added the lang / all (Global language data) and performance labels Sep 26, 2017
@antonpibm mentioned this pull request Jul 12, 2022