Handle pycld2 errors #1
Comments
Agreed, error handling is pretty pitiful at the moment. Do you happen to have a string that throws the error you see there?
I changed your code to print the document on error (and to then ignore the error, classifying the sentence as "error"). But I got the following document, which parses fine when checking manually with cld.
This is the source of the document. For some reason, it includes the Unicode character 'DELETE' (U+007F) at position 977.
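For anyone trying to track down a similar stray character, here is a rough sketch of how the offending position could be located. It uses only the standard library; the helper names `find_control_chars` and `strip_control_chars` are made up for illustration and are not part of pycld2 or spacy-cld. It reports both the character index and the UTF-8 byte offset, since the pycld2 message counts bytes. Whether stripping such characters is enough to make pycld2 accept a given text is an assumption, not something verified here.

```python
import unicodedata

# Control characters that are normally harmless in text and should be kept.
ALLOWED = ("\n", "\r", "\t")


def find_control_chars(text, allowed=ALLOWED):
    """Return (char_index, utf8_byte_offset, codepoint) for each control character.

    Hypothetical helper: "control character" here means Unicode category Cc,
    which includes DELETE (U+007F).
    """
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cc" and ch not in allowed:
            # Byte offset of this character in the UTF-8 encoding of the text.
            byte_offset = len(text[:index].encode("utf-8"))
            hits.append((index, byte_offset, f"U+{ord(ch):04X}"))
    return hits


def strip_control_chars(text, allowed=ALLOWED):
    """Return text with category-Cc characters removed, except those in allowed."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) != "Cc" or ch in allowed
    )


sample = "caf\u00e9\x7f ok"
print(find_control_chars(sample))
print(strip_control_chars(sample))
```

Note that the character index and the byte offset differ as soon as the text contains non-ASCII characters, so "position 977" in a Python string is not necessarily "byte 977" in the error message.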
The exact parsing problem should probably be fixed in the pycld2 library. Nevertheless, I sent you a pull request so that spacy-cld won't crash the whole pipeline due to issues in pycld2.
Thanks @nickdavidhaynes for writing this wrapper.
I believe it should handle errors thrown by pycld2. Currently, the whole spaCy pipeline may crash if pycld2 throws an error such as
pycld2.error: input contains invalid UTF-8 around byte 977 (of 4604)
Instead, it may be better to return an empty language tuple (as is done if the language is unknown). What do you think?
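A minimal sketch of the fallback idea, not the actual spacy-cld code: wrap the detection call and return an empty tuple when it raises. The function `detect_or_default` and the stand-in `fake_detect` are hypothetical names used so the example runs without pycld2 installed. The shape of the "empty language tuple" that spacy-cld uses internally is assumed to be a plain `()`.

```python
def detect_or_default(text, detect_fn, error_types, default=()):
    """Call detect_fn(text); return default if it raises one of error_types.

    Hypothetical wrapper: detect_fn and error_types are passed in so that
    the caller decides which detector and which exceptions to guard against.
    """
    try:
        return detect_fn(text)
    except error_types:
        # Swallow the detector error so one bad document does not
        # abort processing of the remaining documents.
        return default


def fake_detect(text):
    """Stand-in detector for demonstration only; rejects DELETE (U+007F)."""
    if "\x7f" in text:
        raise ValueError("input contains invalid UTF-8")
    return ("en",)


print(detect_or_default("plain text", fake_detect, ValueError))
print(detect_or_default("bad\x7ftext", fake_detect, ValueError))
```

With pycld2 installed, the same wrapper could presumably be called as `detect_or_default(text, pycld2.detect, pycld2.error)`, since `pycld2.error` is the exception type shown in the message above. Catching only that exception type, rather than every `Exception`, keeps unrelated bugs visible.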