[WIP] A small pipeline tweak: tokenization (x-ray) #174
"x-ray" was tokenized as "ray": fix that by changing default tokenizer.
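To illustrate the problem, here is a minimal sketch using only the `re` module. `DEFAULT_PATTERN` is the default `token_pattern` of scikit-learn's text vectorizers, which requires at least two word characters per token. `HYPHEN_PATTERN` and both helper functions are hypothetical names introduced for this example; the pattern is one possible replacement and not necessarily the one used in this PR.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer:
# a token is two or more word characters between word boundaries.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption, not necessarily this PR's pattern):
# one or more word characters, optionally followed by hyphen-joined parts.
HYPHEN_PATTERN = r"(?u)\b\w+(?:-\w+)*\b"


def default_tokens(text):
    """Tokenize lowercased text with the scikit-learn default pattern."""
    return re.findall(DEFAULT_PATTERN, text.lower())


def hyphen_tokens(text):
    """Tokenize lowercased text, keeping hyphenated words together."""
    return re.findall(HYPHEN_PATTERN, text.lower())


print(default_tokens("an x-ray image"))  # "x" is dropped, only "ray" survives
print(hyphen_tokens("an x-ray image"))   # "x-ray" is kept as one token
```

Note that the alternative pattern also keeps single-character tokens such as "a", which the default pattern drops, so it may change the vocabulary in other ways.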
Fixes #172. I also tried other changes mentioned in the issue:
' inside words), but "isn't" still isn't present in the default scikit-learn stop-word list.

For some reason a few other things changed slightly in the notebook, probably because of a newer library version. I only added blocks 17 and 18 (with the text before and after them) and re-ran the first half of the notebook. I haven't updated the tutorial yet: if you like the changes, I'll update it and re-run the whole notebook.
Here is the link to the notebook: https://github.com/TeamHG-Memex/eli5/blob/text-tutorial-x-ray/notebooks/Debugging%20scikit-learn%20text%20classification%20pipeline.ipynb