Data scrubber #1

Open
joesondow opened this issue Mar 7, 2017 · 0 comments

joesondow commented Mar 7, 2017

JSON tweet data is unnecessarily bulky. Write a JSON preprocessor that can be run once per Twitter data export and converts the big JSON data into smaller JSON data.

Exclude tweets that aren't useful for training the Markov chain: duplicates and retweets.
Exclude JSON fields that aren't being used.
Better still, pre-process the whole thing into a JSON file that contains just the Markov chain mapping of keys to arrays of values. Put that data in a file and read it in verbatim at runtime. Why recalculate the data set each time the program runs? (See the sketch after this list.)
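A minimal sketch of what that one-time preprocessor could look like, in Python. The file names (`tweets.json`, `markov_chain.json`) and the tweet field names (`full_text`, `text`, `retweeted_status`) are assumptions about the export format, and the word-level mapping is just one possible shape for the chain data:

```python
import json

def build_chain(tweets):
    """Map each word to the list of words that follow it across all kept tweets."""
    chain = {}
    seen = set()
    for tweet in tweets:
        # Field names here are assumptions about the export format.
        text = tweet.get("full_text") or tweet.get("text", "")
        # Skip retweets and duplicate tweets; they add nothing new to the chain.
        if "retweeted_status" in tweet or text.startswith("RT @"):
            continue
        if text in seen:
            continue
        seen.add(text)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            chain.setdefault(prev, []).append(nxt)
    return chain

if __name__ == "__main__":
    # One-time run per Twitter data export: read the big export,
    # write only the mapping the bot needs at runtime.
    with open("tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)
    with open("markov_chain.json", "w", encoding="utf-8") as f:
        json.dump(build_chain(tweets), f, ensure_ascii=False)
```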

In theory this should mean less memory to load the JSON data and less time to traverse it all. Significant savings? Not sure. Try it and see.
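At runtime the bot would then load the precomputed mapping verbatim and walk it, instead of rebuilding the chain from the full export on every run. A rough sketch, again assuming the hypothetical `markov_chain.json` file from above:

```python
import json
import random

def load_chain(path="markov_chain.json"):
    # Read the precomputed mapping as-is; no per-run chain building.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def generate(chain, start_word, max_words=20):
    # Walk the chain from a starting word, picking random successors.
    words = [start_word]
    while len(words) < max_words and words[-1] in chain:
        words.append(random.choice(chain[words[-1]]))
    return " ".join(words)
```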
