[WIP] A small pipeline tweak: tokenization (x-ray) #174

lopuhin · 2017-04-27T20:03:42Z

"x-ray" was tokenized as "ray": fix that by changing default tokenizer.

Fixes #172. I also tried other changes mentioned in the issue:

- made the tokenizer parse "isn't" as a single token (allowing ' inside words), but "isn't" still isn't present in the default scikit-learn stop-word list
- tried bigrams: they made accuracy a bit worse and did not produce any noticeable changes in the example we are analyzing. They are also slower to run, so I left them as an exercise to the reader :)
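
To illustrate the tokenization issue described above, here is a minimal sketch using only the standard `re` module. `DEFAULT_TOKEN_PATTERN` is the documented default `token_pattern` of scikit-learn's text vectorizers; `CUSTOM_TOKEN_PATTERN` and the `tokenize` helper are hypothetical names and an assumed pattern for illustration, not necessarily what this PR uses.

```python
import re

# Documented default token_pattern of scikit-learn's CountVectorizer /
# TfidfVectorizer: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical replacement (an assumption, not taken from this PR):
# still requires at least two characters, but allows hyphens and
# apostrophes inside a token.
CUSTOM_TOKEN_PATTERN = r"(?u)\b\w[\w'-]*\w\b"


def tokenize(text, pattern):
    """Lowercase the text and return all matches of the token pattern."""
    return re.findall(pattern, text.lower())


sample = "An x-ray isn't cheap"
print(tokenize(sample, DEFAULT_TOKEN_PATTERN))  # ['an', 'ray', 'isn', 'cheap']
print(tokenize(sample, CUSTOM_TOKEN_PATTERN))   # ['an', 'x-ray', "isn't", 'cheap']
```

A pattern like this could be passed to a vectorizer through its `token_pattern` parameter. Note that a token such as "isn't" would then only be filtered out if the stop-word list in use also contains it.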

For some reason, a few other things changed slightly in the notebook, probably due to a newer version. I only added blocks 17 and 18 (with the text before and after them) and re-ran the first half of the notebook. I haven't updated the tutorial yet: if you like the changes, I'll update it and re-run the whole notebook.

Here is the link to the notebook: https://github.com/TeamHG-Memex/eli5/blob/text-tutorial-x-ray/notebooks/Debugging%20scikit-learn%20text%20classification%20pipeline.ipynb

codecov-io · 2017-04-27T20:14:32Z

Codecov Report

Merging #174 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #174   +/-   ##
=======================================
  Coverage   97.25%   97.25%           
=======================================
  Files          39       39           
  Lines        2405     2405           
  Branches      452      452           
=======================================
  Hits         2339     2339           
  Misses         34       34           
  Partials       32       32

A small pipeline tweak: tokenization (x-ray)

809cc43

"x-ray" was tokenized as "ray": fix that by changing default tokenizer.

lopuhin requested a review from kmike April 27, 2017 20:03
