UofG grad course CIS-6190 Assignment 1
Test Environment: Python 3.7.2 on linux.socs.uoguelph.ca
-
python split.py --Input YourInputFile --Output YourOutputFile
The input is set to samples.txt and the output to samples.splitted by default.
The goal is to split the text into sentences based on three common end-of-sentence marks: period ("."), question mark ("?"), and exclamation mark ("!"). The program should also avoid splitting inside labels that start with a "$".
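A minimal sketch of one way to do this (regex-based; the masking helper and patterns below are illustrative assumptions, not the actual split.py logic):

    import re

    SENTENCE_END = re.compile(r'(?<=[.?!])\s+')   # split point: mark + whitespace
    LABEL = re.compile(r'\$\S+')                  # "$"-prefixed labels

    def split_sentences(text):
        # Mask "$" labels so a "." inside them is not treated as a sentence end.
        labels = []
        def mask(m):
            labels.append(m.group(0))
            return '\x00%d\x00' % (len(labels) - 1)
        masked = LABEL.sub(mask, text)
        sentences = SENTENCE_END.split(masked)
        # Restore the masked labels in each sentence.
        return [re.sub(r'\x00(\d+)\x00',
                       lambda m: labels[int(m.group(1))], s)
                for s in sentences]

    print(split_sentences('See $fig.1 here. Really? Yes!'))
    # ['See $fig.1 here.', 'Really?', 'Yes!']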
-
python tokenization.py --Input YourInputFile --Output YourOutputFile
The input is set to samples.splitted and the output to samples.tokenized by default.
The goal is to perform tokenization with the scanner-generator tool PLY (ply.lex). There are seven categories in total: LABEL, WORD, NUMBER, APOSTROPHIZED, HYPHENATED, DELIMITER, and PUNCTUATION. After tokenization, a post-processing step is applied to the HYPHENATED, APOSTROPHIZED, and WORD types to split certain tokens into sequences of tokens.
Test cases, such as "this-is-just-a-test's" or "-123abc", are added at the top of the samples.txt file. Please check the file for full details.
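A minimal ply.lex sketch of this setup (the patterns and the sample input are illustrative assumptions; the assignment's actual rules are more detailed):

    import ply.lex as lex

    tokens = ('LABEL', 'WORD', 'NUMBER', 'APOSTROPHIZED',
              'HYPHENATED', 'DELIMITER', 'PUNCTUATION')

    t_LABEL         = r'\$\S+'                       # e.g. "$doc1"
    t_APOSTROPHIZED = r"[A-Za-z]+(?:'[A-Za-z]+)+"    # e.g. "it's"
    t_HYPHENATED    = r'[A-Za-z]+(?:-[A-Za-z]+)+'    # e.g. "well-known"
    t_NUMBER        = r'\d+(?:\.\d+)?'               # e.g. "123", "1.5"
    t_WORD          = r'[A-Za-z]+'
    t_DELIMITER     = r'[.?!]'                       # end-of-sentence marks
    t_PUNCTUATION   = r'[,;:()"\[\]-]'
    t_ignore        = ' \t\n'                        # skipped characters

    def t_error(t):                                  # skip anything unmatched
        t.lexer.skip(1)

    lexer = lex.lex()
    lexer.input("It's a well-known fact: 3.5 dogs bark! $doc2")
    for tok in iter(lexer.token, None):
        print(tok.type, tok.value)

Note that ply.lex tries string-defined rules in order of decreasing regular-expression length, which is why the longer APOSTROPHIZED and HYPHENATED patterns take precedence over WORD here.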
-
python pos-tag.py --Input YourInputFile --Output YourOutputFile
The input is set to samples.tokenized and the output to samples.tagged by default.
The goal of this program is to apply the POS tagger in NLTK to tag the words with their POS tags. Every output token is paired with its POS tag.
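A minimal sketch of the tagging call in NLTK (the averaged-perceptron model is the library's default tagger; the download line is only needed once):

    import nltk
    from nltk import pos_tag

    # Fetch the default tagger model on first run (cached afterwards).
    nltk.download('averaged_perceptron_tagger', quiet=True)

    tokens = ['The', 'program', 'tags', 'every', 'token', '.']
    print(pos_tag(tokens))   # list of (token, tag) pairs, e.g. ('The', 'DT')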
-
python data-analysis.py --Input YourInputFile --Output YourOutputFile
The input is set to samples.tagged and the output to samples.analysis by default.
The goal of this program is to analyze the tagged file and generate a summary file. The summary reports: the total number of documents; the min, average, and max document lengths by number of sentences and of tokens; the average sentence length in tokens for the whole data collection as well as for each individual document; etc.
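A minimal sketch of the statistics involved (the per-document input layout here is hypothetical, not the actual samples.tagged format):

    def summarize(doc_sentences):
        # doc_sentences maps a document label to the token count of each of
        # its sentences, e.g. {'$doc1': [12, 8, 15], '$doc2': [7, 9]}.
        sent_counts  = [len(s) for s in doc_sentences.values()]
        token_counts = [sum(s) for s in doc_sentences.values()]
        all_sents    = [n for s in doc_sentences.values() for n in s]
        return {
            'documents': len(doc_sentences),
            'doc_len_sentences': (min(sent_counts),
                                  sum(sent_counts) / len(sent_counts),
                                  max(sent_counts)),
            'doc_len_tokens': (min(token_counts),
                               sum(token_counts) / len(token_counts),
                               max(token_counts)),
            'avg_sentence_len': sum(all_sents) / len(all_sents),
            'avg_sentence_len_per_doc': {d: sum(s) / len(s)
                                         for d, s in doc_sentences.items()},
        }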