UofG grad course CIS-6190 Assignment 1
Test Environment: Python 3.7.2 linux.socs.uoguelph.ca
python split.py --Input YourInputFile --Output YourOutputFile
By default, the input is samples.txt and the output is sample.splitted.
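The --Input/--Output options with their defaults could be declared roughly as follows. This is only a sketch using the standard argparse module; the function name build_parser and the help strings are hypothetical, and the actual scripts may parse their arguments differently.

```python
import argparse

def build_parser():
    # Hypothetical helper: mirrors the --Input/--Output flags shown above,
    # with the defaults stated in this README for split.py.
    parser = argparse.ArgumentParser(description="Split a text file into sentences.")
    parser.add_argument("--Input", default="samples.txt", help="path of the input file")
    parser.add_argument("--Output", default="sample.splitted", help="path of the output file")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.Input, args.Output)
```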
The goal is to split the text into sentences based on three common end-of-sentence marks: period ("."), question mark ("?"), and exclamation mark ("!"). The program also avoids splitting labels, which start with a "$".
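A minimal sketch of this kind of splitting is shown below. It is not the code in split.py: it assumes a label occupies a whole line starting with "$" and that a sentence ends at ".", "?" or "!" followed by whitespace. The function name split_sentences and the sample label are hypothetical.

```python
import re

# Split at whitespace that directly follows ".", "?" or "!".
_SENTENCE_END = re.compile(r"(?<=[.?!])\s+")

def split_sentences(lines):
    """Return one sentence per list item; lines starting with '$' are kept whole."""
    sentences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("$"):
            # Assumption: a label line is never split, even if it contains a period.
            sentences.append(line)
        else:
            sentences.extend(s for s in _SENTENCE_END.split(line) if s)
    return sentences

if __name__ == "__main__":
    for sentence in split_sentences(["$DOC doc.1", "Hello world. How are you? Fine!"]):
        print(sentence)
```

A rule this simple also splits after abbreviations such as "Dr." when they are followed by a space, so a real splitter would likely need extra checks.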
python tokenization.py --Input YourInputFile --Output YourOutputFile
By default, the input is samples.splitted and the output is sample.tokenized.
The goal is to perform tokenization using the scanner-generation tool Ply.lex. There are seven token categories in total: LABEL, WORD, NUMBER, APOSTROPHIZED, HYPHENATED, DELIMITER, and PUNCTUATION. After tokenization, a post-processing step is applied to HYPHENATED, APOSTROPHIZED, and WORD tokens to split certain ones into sequences of tokens.
Test cases, such as "this-is-just-a-test's" and "-123abc", have been added at the top of the sample.txt file. Please check the file for full details.
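To illustrate how ordered token rules classify such inputs, here is a small sketch that uses Python's built-in re module instead of Ply.lex, so it runs without extra packages. Every pattern below is an assumption made for illustration, not the rule set in tokenization.py. Whitespace is treated as DELIMITER and dropped from the output, and the post-processing step is not included.

```python
import re

# Illustrative patterns only; the first alternative that matches wins, so order matters.
TOKEN_SPEC = [
    ("LABEL",         r"\$[A-Za-z]+"),
    ("APOSTROPHIZED", r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*(?:'[A-Za-z]+)+"),
    ("HYPHENATED",    r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)+"),
    ("NUMBER",        r"[+-]?\d+(?:\.\d+)?(?![A-Za-z0-9])"),
    ("WORD",          r"[A-Za-z0-9]+"),
    ("DELIMITER",     r"\s+"),
    ("PUNCTUATION",   r"."),
]
_SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (category, lexeme) pairs, skipping whitespace delimiters."""
    tokens = []
    for match in _SCANNER.finditer(text):
        if match.lastgroup != "DELIMITER":
            tokens.append((match.lastgroup, match.group()))
    return tokens

if __name__ == "__main__":
    for sample in ["this-is-just-a-test's", "-123abc", "$DOC 3.14 ok!"]:
        print(sample, "->", tokenize(sample))
```

Under these assumed patterns, "-123abc" becomes a PUNCTUATION token "-" followed by the WORD "123abc", because the NUMBER rule refuses to match digits that run directly into letters.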
python pos-tag.py --Input YourInputFile --Output YourOutputFile
By default, the input is samples.tokenized and the output is sample.tagged.
The goal of this program is to apply the POS tagger in NLTK to tag the words with their POS tags. All output tokens are paired with their POS tags.
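The pairing step might look like the sketch below. nltk.pos_tag is the standard NLTK tagging function, but the "token/TAG" output format and the helper name tag_tokens are assumptions; the format written by pos-tag.py may differ. NLTK is a third-party package and its tagger model data must be downloaded before the default path works.

```python
def tag_tokens(tokens, tagger=None):
    """Pair each token with its POS tag, formatted as 'token/TAG' (assumed format)."""
    if tagger is None:
        # Imported lazily so the helper can be exercised with a stand-in tagger.
        import nltk
        tagger = nltk.pos_tag
    return " ".join(f"{token}/{tag}" for token, tag in tagger(tokens))

if __name__ == "__main__":
    # Requires NLTK and its perceptron tagger data to be installed.
    print(tag_tokens(["The", "program", "tags", "words", "."]))
```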
python data-analysis.py --Input YourInputFile --Output YourOutputFile
By default, the input is samples.tagged and the output is sample.analysis.
The goal of this program is to analyze the tagged file and generate a summary file. The summary reports the total number of documents; the minimum, average, and maximum document lengths in sentences and in tokens; the average sentence length in tokens for the whole data collection as well as for each individual document; etc.
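The statistics listed above can be computed along these lines. This sketch assumes the tagged file has already been parsed into a list of documents, each a list of sentences, each a list of tokens. That in-memory layout, the function name summarize, and the dictionary keys are all hypothetical.

```python
def summarize(docs):
    """Compute collection statistics for docs = [[[token, ...], ...], ...]."""
    if not docs:
        return {"documents": 0}
    sentence_counts = [len(doc) for doc in docs]
    token_counts = [sum(len(sentence) for sentence in doc) for doc in docs]
    total_sentences = sum(sentence_counts)
    total_tokens = sum(token_counts)
    return {
        "documents": len(docs),
        # (min, avg, max) document length measured in sentences and in tokens
        "sentences_per_doc": (min(sentence_counts), total_sentences / len(docs), max(sentence_counts)),
        "tokens_per_doc": (min(token_counts), total_tokens / len(docs), max(token_counts)),
        # average sentence length in tokens, for the collection and per document
        "avg_sentence_length": total_tokens / total_sentences if total_sentences else 0.0,
        "avg_sentence_length_per_doc": [
            tokens / sentences if sentences else 0.0
            for tokens, sentences in zip(token_counts, sentence_counts)
        ],
    }

if __name__ == "__main__":
    sample = [[["a", "b"], ["c"]], [["d", "e", "f"]]]
    for key, value in summarize(sample).items():
        print(key, value)
```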