Ingest data from different sources and formats into a DuckDB
database.
Create methods to find duplicate records, run quality checks,
and apply corrections.
Analyze the training data, focusing on its fitness aspects.
Analyze and aggregate the location data for presence statistics and GIS
applications.
This will probably always be a work in progress.
- Include .csv files from smartphone logs
- Include .hrv files from Polar
- Include .json files from Garmin
- Include .tcx files from Polar
- Include sqlite from Gadgetbridge
- Include sqlite from Amazfitbip
- Include sqlite from GarminDB
- Include data from Google location service
- Include .fit files from Garmin
- Include .gpx files from other sources
- Include .json files from GoldenCheetah
- Database maintenance.
- Check for duplicated records.
- Check variable names for similarity.
- Create new variables automatically.
- Remove database data from deleted files.
- Remove database data from modified files.
- Deduplicate points.
- Remove errors in records.
- Combine columns/variables.
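
The variable name similarity check from the list above could be sketched as follows. This is only an illustration using Python's standard `difflib`, not the project's actual implementation; the helper names, the column names, and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase a name and drop separators, so 'Heart_Rate' matches 'heartrate'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similar_names(names, threshold=0.85):
    """Return (name_a, name_b, score) for each pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical column names, not taken from the actual database.
columns = ["heart_rate", "HeartRate", "hr", "altitude", "Altitude_m", "cadence"]
candidates = similar_names(columns)
for a, b, score in candidates:
    print(f"{a} ~ {b} ({score})")
```

Pairs reported this way are only candidates for merging; short aliases such as `hr` are not caught by string similarity and would still need a manual mapping.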
The main database collects all available data from the source files. The intent is to aggregate as much data as possible and then analyze the raw data to find source files that can be deleted or excluded from the main database. Reading every file also lets us detect file and formatting problems. The source files were produced by different devices and processed by different software. We want to collect all the information gathered over a period of more than 10 years, so we expect more than 100 variables (columns) and more than 30M records (rows). The processing scheme should work on modest hardware (8 GB of RAM or even less).
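
One way to find source files that can be excluded, while staying within a small memory budget, is to hash each file in fixed-size chunks and group identical digests. The sketch below is a minimal illustration under that assumption; the function names and the 1 MiB chunk size are hypothetical, and it only detects byte-identical files, not files that hold the same records in a different format.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so memory use stays bounded regardless of file size."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicate_files(root: Path) -> list[list[Path]]:
    """Group files under root by content digest; return groups with 2+ members."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Storing the digest next to each file's path in the database would also make it possible to notice modified files later, since a changed file produces a different digest.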
With further analysis, we can merge some of the variables and check the data quality.
Once we are confident in the quality and content of the data, we can use it to create the other datasets we need.
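
Merging two variables that describe the same quantity can double as a quality check: rows where both are present but disagree deserve a look before the merge is trusted. The following is a rough sketch on plain Python dictionaries rather than on the database itself; the function name and the column names are hypothetical.

```python
def coalesce_columns(rows, primary, secondary, merged):
    """Merge two columns into `merged`, preferring `primary`; report conflicts.

    Returns (new_rows, conflicts), where conflicts lists (row_index, a, b)
    for rows in which both columns are set but hold different values.
    """
    out, conflicts = [], []
    for i, row in enumerate(rows):
        a, b = row.get(primary), row.get(secondary)
        if a is not None and b is not None and a != b:
            conflicts.append((i, a, b))
        # Copy every other column unchanged and add the merged one.
        new_row = {k: v for k, v in row.items() if k not in (primary, secondary)}
        new_row[merged] = a if a is not None else b
        out.append(new_row)
    return out, conflicts


# Hypothetical records, for illustration only.
rows = [
    {"t": 1, "hr": 60, "heart_rate": None},
    {"t": 2, "hr": None, "heart_rate": 62},
    {"t": 3, "hr": 70, "heart_rate": 71},
]
merged_rows, conflicts = coalesce_columns(rows, "hr", "heart_rate", "heart_rate_bpm")
```

An empty conflict list suggests the two columns are safe to combine; a non-empty one points at records to inspect first.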
- Location History JSON Converter, used to convert the huge JSON file into a simple CSV.
- garmin-connect-export
- GarminDB
fit | gpx | json | Rds |
---|---|---|---|
1277 | 4474 | 2314 | 1 |
Table: File types
fit | gpx | gz | json | Rds | zip |
---|---|---|---|---|---|
82 | 4470 | 492 | 2314 | 1 | 707 |
Table: File extensions
Total rows: 48629335
Total files: 8066
Total days: 4018
Total vars: 147
DB Size: 2.4 GiB
Source Size: 6.5 GiB