Quality4Dataset

This project evaluates the quality of a dataset.

Author: Hao-Ying Cheng, Nickname: MaskerTim

Data Format of Quality

The project outputs two files, quality_issues.json and quality_score.json. This section describes the data format of both.

Quality Issues

Quality issues describe which quality problems were found in the dataset. There are five issues, grouped into three classes.

Three classes:

Column Issue
- issue: the name of the issue
- metric: prevalence
- count: the number of values in the column affected by this issue
Row Issue
- issue: the name of the issue
- metric: confidence
- count: the number of rows affected by this issue
Cell Issue
- issue: the name of the issue
- metric: confidence
- count: the number of cells in the dataset affected by this issue

An issue marks data that is not expected in the dataset. Column, Row, and Cell are the levels at which an issue occurs. Prevalence is the proportion of the data affected by the issue. Confidence is the degree of belief that the issue is present in the dataset.
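
As a rough illustration of the three record shapes, the sketch below builds one entry per class and derives prevalence as a fraction of affected values. The field names issue, metric, and count come from this README. The concrete values, the prevalence helper, and the reading of prevalence as count / total are assumptions, not taken from the project's code. The README also does not say whether metric stores the metric's name or its numeric value, so the sketch assumes the name.

```python
import json


def prevalence(count: int, total: int) -> float:
    """Fraction of values affected by an issue (assumed definition: count / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Hypothetical records, one per class; all values are made up for illustration.
column_issue = {"issue": "Uniqueness Violation", "metric": "prevalence", "count": 12}
row_issue = {"issue": "duplicated row", "metric": "confidence", "count": 3}
cell_issue = {"issue": "missing value", "metric": "confidence", "count": 40}

print(json.dumps([column_issue, row_issue, cell_issue], indent=2))

# 12 affected values in a column of 200 values (hypothetical numbers).
print(prevalence(column_issue["count"], 200))
```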

Five Issues:

Uniqueness Violation
- type: column issue
- description: an unexpected duplicate value in a column.
distinct format
- type: column issue
- description: an unexpected variation of data format within a column.
missing value
- type: cell issue
- description: an unexpected missing value in a cell.
outlier
- type: cell issue
- description: an unexpected outlier value in a cell.
duplicated row
- type: row issue
- description: an unexpected duplicate of another row.
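
To make the counts concrete, here is a minimal sketch of how three of the five issues could be counted with the Python standard library. It is not the project's implementation. The helper names, the treatment of None and the empty string as "missing", and the sample rows are all assumptions for illustration.

```python
from collections import Counter

# Issue name -> class, as listed in this README.
ISSUE_TYPES = {
    "Uniqueness Violation": "column",
    "distinct format": "column",
    "missing value": "cell",
    "outlier": "cell",
    "duplicated row": "row",
}


def count_missing_cells(rows):
    """Count cells that are None or an empty string (assumed notion of 'missing')."""
    return sum(1 for row in rows for cell in row if cell is None or cell == "")


def count_duplicated_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in seen.values() if n > 1)


def count_uniqueness_violations(rows, col):
    """Count values in column index `col` that repeat an earlier value."""
    seen = Counter(row[col] for row in rows)
    return sum(n - 1 for n in seen.values() if n > 1)


# Hypothetical sample data.
rows = [
    [1, "a", 3.5],
    [2, "", 4.0],
    [1, "a", 3.5],  # exact duplicate of the first row
    [3, None, 2.0],
]

print(count_missing_cells(rows))
print(count_duplicated_rows(rows))
print(count_uniqueness_violations(rows, 0))
```

Detecting distinct formats and outliers needs a format rule and a statistical threshold respectively, which this README does not specify, so they are left out of the sketch.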

Quality Scores

Quality scores report a quality score for each column. Each entry has four keys.

Keys:

id: the identifier of the column
colName: the name of the column
issues: the issues found in the column
score: the quality score of the column (unit: %)
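
The sketch below shows how a file with these four keys might be read and filtered. Only the key names come from this README. The top-level list layout, the representation of issues as a list of issue names, the sample values, and the columns_below helper are assumptions.

```python
import json

# Hypothetical content of quality_score.json; only the four key names are from the README.
sample = """
[
  {"id": 0, "colName": "age", "issues": ["missing value"], "score": 92.5},
  {"id": 1, "colName": "city", "issues": [], "score": 100.0}
]
"""


def columns_below(columns, threshold):
    """Return the names of columns whose score (in percent) is below the threshold."""
    return [c["colName"] for c in columns if c["score"] < threshold]


columns = json.loads(sample)
print(columns_below(columns, 95.0))
```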

Reference

How to quantify Data Quality?
