Investigated different NLP tools. The plan is to find a good Java one (OpenNLP sounds promising) and build my classes in Java.
Researched Project Gutenberg. Tried to download the collection via a mirror, but this proved difficult and wget did not work out. Project Gutenberg very deliberately makes bulk downloading difficult.
Looked around GitHub to see how other Java libraries are laid out. Many have src and scripts directories plus Gradle folders/files, but there's a lot of variation in structure. I'll also have a resources folder to store books in, and scripts for downloading books.
Initial commit! Plus, I continue to work on understanding the wget flags and what mirroring a site with wget looks like.
Spent lots of time getting familiar with Gradle and Maven (which a lot of the Java libraries I've looked at appear to use). Lots of stuff to learn about creating build.gradle, etc. Still not sure I entirely know what's happening, but I have a very minimal build.gradle.
Maven install errors have blocked me from installing OpenNLP, which I need for my project. Started work on the skeleton AverageWordLength class; however, because I can't install OpenNLP I have to figure out how to proceed.
Back to figuring out the Gutenberg side of things: Tried to use rsync. Wiped computer :(
Back on track. Began researching BeautifulSoup; don't dare venture down the rsync path again.
Planned out the structure of my classes. Think I'll focus on 1) word length and 2) part-of-speech tagging. The plan is to create a table of average word length for the top 100 books and another for the distribution of parts of speech, then use those to find the closest match for a given Gutenberg file.
Spent the weekend reading through BeautifulSoup documentation and working on scripts. Think it'll be easier to just focus on the top 100 books anyway, so worked on scraping that page.
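For my own reference, roughly the kind of scraping involved (the URL and the assumption that books show up as /ebooks/NNN links are based on how the Top 100 page is laid out; the actual script may differ):

```python
# Sketch: collect book titles and ebook ids from Gutenberg's Top 100 page.
import requests
from bs4 import BeautifulSoup

TOP_100_URL = "https://www.gutenberg.org/browse/scores/top"

response = requests.get(TOP_100_URL)
soup = BeautifulSoup(response.text, "html.parser")

books = {}
for link in soup.find_all("a", href=True):
    href = link["href"]
    # Book entries look like <a href="/ebooks/1342">Pride and Prejudice...</a>
    if href.startswith("/ebooks/") and href[len("/ebooks/"):].isdigit():
        books[link.get_text(strip=True)] = href.split("/")[-1]

# books now maps "Title by Author" -> Gutenberg ebook id
```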
Switched to Python. Much easier. Rather than all of that Gradle business, I just need a requirements.txt, which is a blessing. Also, finished the BeautifulSoup work! It really is a fantastic tool.
Successfully created a script that downloads all 100 books and names the files after their titles. Also created an individualized version that downloads a single text. I think I'll have that act as an "update" type of function, which writes the chosen book under a fixed filename so the other functions can use that file without needing the URL every time they're called.
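Roughly what the download side looks like; the plain-text URL pattern, the fixed filename, and the function names here are assumptions for the sketch, not necessarily what's in the real scripts:

```python
# Sketch of the bulk downloader plus the single-book "update" helper.
import requests

def plain_text_url(book_id):
    # One common Gutenberg plain-text location; some books use other suffixes.
    return f"https://www.gutenberg.org/files/{book_id}/{book_id}-0.txt"

def download_all(books):
    """books: dict mapping title -> ebook id; saves each as <title>.txt."""
    for title, book_id in books.items():
        text = requests.get(plain_text_url(book_id)).text
        with open(f"{title}.txt", "w", encoding="utf-8") as f:
            f.write(text)

def update(url, filename="current_book.txt"):
    """Download a single text to a fixed filename so the other functions
    can read it without needing the URL each time they're called."""
    text = requests.get(url).text
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
```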
Completed AverageWordLength.py. Think I might rework it so that there's a parameterized function and a non-parameterized one. We'll see how it's coming together Friday/over the weekend.
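The parameterized/non-parameterized split would look something like this (the tokenizing regex and the default filename are assumptions carried over from the "update" sketch above):

```python
# Sketch of the average-word-length calculation.
import re

def average_word_length(text):
    """Parameterized version: takes raw text, returns mean word length."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

def average_word_length_from_file(filename="current_book.txt"):
    """Non-parameterized convenience version that reads the current book."""
    with open(filename, encoding="utf-8") as f:
        return average_word_length(f.read())
```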
In the interest of preparing for next week, added CONTRIBUTING and CONDUCT. Plan to use the weekend to add documentation and finish up my POS tagging.
Partner work days!
Reworked the file-naming system, as special characters kept raising issues while I parsed the top 100 books. Still ironing out wrinkles in some files; I almost have a complete and accurate dictionary of word lengths and should finish this tonight. Hopefully we can reuse lots of this code for sentence lengths.
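The cleanup is basically stripping anything that breaks file paths; a minimal sketch (the exact rules I settled on may differ):

```python
# Sketch of the filename cleanup for book titles.
import re

def safe_filename(title):
    """Drop punctuation that breaks paths (: ? / etc.) and collapse spaces."""
    cleaned = re.sub(r"[^\w\s-]", "", title)
    return re.sub(r"\s+", "_", cleaned).strip("_")
```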
Got AverageWordLength and AverageSentLength working; both now return the book with the closest average length to the text at the given Project Gutenberg txt URL. Will likely work on POS tagging now, along with contributing more to my partner's project and expanding to other text files (not just Gutenberg URLs).
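The closest-match step itself is small: compare the given text's average against the precomputed table and take the minimum absolute difference. A sketch, assuming the table is a dict of title -> average:

```python
# Sketch of the closest-match lookup shared by AverageWordLength and
# AverageSentLength.
def closest_book(target_average, averages):
    """averages: dict mapping book title -> precomputed average length."""
    return min(averages, key=lambda title: abs(averages[title] - target_average))

# e.g. closest_book(average_word_length(text), word_length_table)
```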
FreqPOS is now up and running! It creates a CSV containing each book's parts of speech and their counts.
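For reference, the general shape of that step, sketched with NLTK's tagger as a stand-in (I'm not asserting that's the tagger FreqPOS actually uses, and the CSV column layout here is an assumption):

```python
# Sketch: count part-of-speech tags per book and write them to a CSV.
import csv
from collections import Counter

import nltk  # needs nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

def pos_counts(text):
    """Tag the text and tally how often each POS tag appears."""
    tokens = nltk.word_tokenize(text)
    return Counter(tag for _, tag in nltk.pos_tag(tokens))

def write_pos_csv(book_texts, out_path="pos_frequencies.csv"):
    """book_texts: dict mapping title -> raw text."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["book", "pos_tag", "count"])
        for title, text in book_texts.items():
            for tag, count in pos_counts(text).items():
                writer.writerow([title, tag, count])
```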