Data scrubber #1

Open
joesondow opened this issue Mar 7, 2017 · 0 comments

joesondow commented Mar 7, 2017

JSON tweet data is unnecessarily bulky. Write a JSON preprocessor that can be run once per Twitter data export and converts the big JSON data into smaller JSON data.

Exclude tweets that aren't useful for training the Markov chain: duplicates and retweets.
Exclude JSON fields that aren't being used.
Better still, pre-process the whole thing into a JSON file that contains just the Markov chain mapping of keys to arrays of values. Put that data in a file and read it in verbatim at runtime. Why recalculate the data set each time the program runs? (See the sketch after this list.)
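A minimal sketch of what that one-time preprocessor could look like, in Python. The file names (`tweets.json`, `markov_chain.json`) and the tweet field names (`full_text`, `text`, `retweeted_status`) are assumptions about the export format, and the word-level mapping is just one possible shape for the chain data:

```python
import json

def build_chain(tweets):
    """Map each word to the list of words that follow it across all kept tweets."""
    chain = {}
    seen = set()
    for tweet in tweets:
        # Field names here are assumptions about the export format.
        text = tweet.get("full_text") or tweet.get("text", "")
        # Skip retweets and duplicate tweets; they add nothing new to the chain.
        if "retweeted_status" in tweet or text.startswith("RT @"):
            continue
        if text in seen:
            continue
        seen.add(text)
        words = text.split()
        for prev, nxt in zip(words, words[1:]):
            chain.setdefault(prev, []).append(nxt)
    return chain

if __name__ == "__main__":
    # One-time run per Twitter data export: read the big export,
    # write only the mapping the bot needs at runtime.
    with open("tweets.json", encoding="utf-8") as f:
        tweets = json.load(f)
    with open("markov_chain.json", "w", encoding="utf-8") as f:
        json.dump(build_chain(tweets), f, ensure_ascii=False)
```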

In theory this should mean less memory to load the JSON data and less time to traverse it all. Significant savings? Not sure. Try it and see.
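At runtime the bot would then load the precomputed mapping verbatim and walk it, instead of rebuilding the chain from the full export on every run. A rough sketch, again assuming the hypothetical `markov_chain.json` file from above:

```python
import json
import random

def load_chain(path="markov_chain.json"):
    # Read the precomputed mapping as-is; no per-run chain building.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def generate(chain, start_word, max_words=20):
    # Walk the chain from a starting word, picking random successors.
    words = [start_word]
    while len(words) < max_words and words[-1] in chain:
        words.append(random.choice(chain[words[-1]]))
    return " ".join(words)
```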
