Data organization
François Briatte edited this page Aug 17, 2015 · 3 revisions
This page checks whether the bills and sponsors intermediate datasets conform to Karl Broman's advice on organising spreadsheet data.
Boxes are checked when the advice is fully applied. A few items are unchecked because each country uses its own names for bill and sponsor variables in the CSV datasets. Network attributes, however, are fully standardised.
- Consistency:
  - Use consistent codes for categorical variables
  - Use a single fixed code for any missing values
  - Use consistent variable names
  - Use consistent subject IDs
  - Use a common data layout in multiple files
  - Use consistent file names
  - Use a single common format for all dates
  - Use consistent phrases in your notes
  - Be careful about extra spaces within cells
- Write dates as YYYY-MM-DD
- No empty cells
- Put just one thing in a cell
- Make it a rectangle
- Create a data dictionary
- No calculations in the raw data files
- Don't use font color or highlighting as data
- Choose good names for things
- Make backups
- Use data validation to avoid errors
- Save the data in plain text files
- Other things to avoid:
  - Beware of long integers turning to scientific notation
  - Avoid screen splits
  - Fill in all zeroes
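
Several of the items above (rectangular layout, no empty cells, no extra spaces within cells, dates written as YYYY-MM-DD) can be checked mechanically. The following is a minimal Python sketch of such a check, not the procedure used to fill in this page. The function name `check_csv`, the `NA` missing-value code and the sample rows are all hypothetical and are not taken from the project's datasets.

```python
import csv
import io
import re

# Matches dates written strictly as YYYY-MM-DD (format only, not calendar validity).
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_csv(text, date_columns=(), missing_code="NA"):
    """Return a list of (row_number, column, problem) tuples.

    `missing_code` is an assumed fixed code for missing values.
    """
    problems = []
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    for i, row in enumerate(body, start=2):  # row 1 is the header
        # "Make it a rectangle": every row has as many cells as the header.
        if len(row) != len(header):
            problems.append((i, None, "row is not the same width as the header"))
            continue
        for name, cell in zip(header, row):
            if cell == "":
                problems.append((i, name, "empty cell"))
            elif cell != cell.strip():
                problems.append((i, name, "extra spaces"))
            elif (name in date_columns and cell != missing_code
                  and not ISO_DATE.fullmatch(cell)):
                problems.append((i, name, "date is not YYYY-MM-DD"))
    return problems


# Invented sample data with one problem of each kind.
sample = (
    "id,date,sponsor\n"
    "1,2015-08-17,A\n"
    "2,17/08/2015,B \n"
    "3,,C\n"
    "4,2015-08-18\n"
)
problems = check_csv(sample, date_columns=("date",))
for problem in problems:
    print(problem)
```

Because each country uses its own variable names in the CSV datasets, the date columns are passed in explicitly rather than guessed from the header.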