This repository has been archived by the owner on Dec 10, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 10 changed files with 160 additions and 9 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
name: Unit and integration tests | ||
name: tests | ||
on: | ||
workflow_dispatch: | ||
push: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# Getting started | ||
|
||
## Installation | ||
|
||
AutoParser is a Python package that can either be built into your code or run as a | ||
command-line interface (CLI). You can install AutoParser using pip: | ||
|
||
```bash | ||
python3 -m pip install git+https://github.com/globaldothealth/autoparser | ||
``` | ||
|
||
Installing into a virtual environment is usually recommended. We recommend using [uv](https://github.com/astral-sh/uv) to manage the virtual environment. To create and activate a virtual environment for AutoParser using `uv`, run the following commands: | ||
|
||
```bash | ||
uv sync | ||
. .venv/bin/activate | ||
``` | ||
|
||
To use the CLI, type `autoparser` into the command line to see the available options. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,27 @@ | ||
# autoparser | ||
Autparser is a tool for semi-automated data parser creation. | ||
# AutoParser | ||
AutoParser is a tool for semi-automated data parser creation. The package allows you | ||
to generate a new data parser for converting your source data into a new format specified | ||
using a schema file, ready to use with the data transformation tool [adtl](https://adtl.readthedocs.io/en/latest/index.html). | ||
|
||
## Key Features | ||
- Data Dictionary Creation: Automatically create a basic data dictionary framework | ||
- Parser Generation: Generate data parsers to match a given schema | ||
|
||
## Framework | ||
|
||
```{figure} images/flowchart.png | ||
Flowchart showing the inputs (bright blue), outputs (green blocks) and functions | ||
(dashed diamonds) of AutoParser. | ||
``` | ||
|
||
## Documentation | ||
```{toctree} | ||
--- | ||
maxdepth: 2 | ||
caption: Contents: | ||
--- | ||
self | ||
getting_started/index | ||
usage/data_dict | ||
usage/parser_generation | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
sphinx==8.0.2 | ||
myst_parser==4.0.0 | ||
sphinx-book-theme==1.1.3 | ||
sphinxcontrib-mermaid==0.9.2 | ||
sphinxcontrib-mermaid==0.9.2 | ||
myst_nb |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
# Creating a Data Dictionary | ||
|
||
## Motivation | ||
|
||
A data dictionary is a structured guide which contains the details of a data file. | ||
It should contain, at minimum, a list of field/column names, and some kind of description | ||
of what data each field holds. This often takes the form of a textual description, plus | ||
a note of the data type (text, decimals, date, boolean...) and/or a set of expected values. | ||
|
||
A data dictionary is required by AutoParser for [parser generation](parser_generation). | ||
This avoids having to send potentially sensitive or confidential data to an external | ||
body (in this case an externally hosted LLM); instead, a *description* of what the | ||
data looks like, taken from the dictionary, can be sent to the LLM. This allows the mapping to | ||
be generated without risking the unintentional release of data. | ||
|
||
Many data capture services such as [REDCap](https://projectredcap.org/) will generate | ||
a data dictionary automatically when surveys are set up. However, where data is being | ||
captured either rapidly, or by individuals/small teams, a formal data dictionary may not | ||
have been created for a corresponding dataset. For this scenario, AutoParser provides | ||
functionality to generate a simple dictionary based on your data. This dictionary can | ||
then be used in other AutoParser modules. | ||
|
||
## Create a basic data dictionary | ||
AutoParser will take your raw data file and create a basic data dictionary. For an example | ||
dataset of animals, a generated data dictionary looks like this: | ||
|
||
| source_field | source_description | source_type | common_values | | ||
|-------------------|--------------------|-------------|----------------------------------------------------------| | ||
| Identité | | string | | | ||
| Province | | choice | Equateur, Orientale, Katanga, Kinshasa | | ||
| DateNotification | | string | | | ||
| Classicfication | | choice | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | | ||
| Nom complet | | string | | | ||
| Date de naissance | | string | | | ||
| AgeAns | | number | | | ||
| AgeMois | | number | | | ||
| Sexe | | choice | F, M, f, m, f, m , inconnu | | ||
|
||
`source_field` contains each column header from the source data, and `source_type` shows the | ||
data type in each column. `choice` denotes columns where a small set of strings has been detected, | ||
so AutoParser assumes that a fixed set of terms is being used, and lists them in `common_values`. | ||
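
The type detection described above can be illustrated with a minimal sketch. This is not AutoParser's implementation: the helper names `infer_type` and `build_dictionary`, the `max_choices` threshold, and the rule that a `choice` column must contain repeated values are all assumptions made for illustration.

```python
import csv
import io


def infer_type(values, max_choices=10):
    """Guess a column type from its string values (illustrative heuristic only)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string", []
    try:
        for v in non_empty:
            float(v)  # raises ValueError if any value is not numeric
        return "number", []
    except ValueError:
        pass
    unique = sorted(set(non_empty))
    # Assumption: a small set of repeated strings is treated as a 'choice' column.
    if len(unique) <= max_choices and len(unique) < len(non_empty):
        return "choice", unique
    return "string", []


def build_dictionary(csv_text):
    """Build one dictionary row per column, leaving source_description empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    result = []
    for field in reader.fieldnames or []:
        source_type, common = infer_type([row[field] or "" for row in rows])
        result.append(
            {
                "source_field": field,
                "source_description": "",
                "source_type": source_type,
                "common_values": ", ".join(common),
            }
        )
    return result


# Hypothetical sample data, loosely modelled on the animals example.
sample = """Province,AgeAns,Nom complet
Equateur,3,Rex
Kinshasa,7,Milo
Equateur,2,Luna
"""
data_dict = build_dictionary(sample)
for entry in data_dict:
    print(entry)
```

Here `Province` repeats a small set of values and is classed as `choice`, `AgeAns` parses as `number`, and `Nom complet` has all-unique values and stays a `string`.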
|
||
Notice that the `source_description` column is empty. It is left blank by default so that | ||
you can add a short text description *in English* (this column is read by the LLM | ||
in later steps, which assumes the text is written in English). For example, the description | ||
for the `AgeMois` column might be 'Age in Months'. | ||
|
||
If instead you would like to auto-generate these descriptions, AutoParser can use an LLM | ||
to automate this step. Note, we strongly encourage all users to check the results of the | ||
auto-generated descriptions for accuracy before proceeding to use the described data dictionary | ||
to generate a data parser. | ||
|
||
## API | ||
|
||
```{eval-rst} | ||
.. autofunction:: autoparser.create_dict | ||
:noindex: | ||
.. autofunction:: autoparser.generate_descriptions | ||
:noindex: | ||
``` | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# Write a Data Parser | ||
|
||
AutoParser assumes the use of Global.Health's [adtl](https://github.com/globaldothealth/adtl) | ||
package to transform your source data into a standardised format. To do this, adtl requires a | ||
[TOML](https://toml.io/en/) specification file which describes how raw data should be | ||
converted into the new format, on a field-by-field basis. Every unique data file format | ||
(i.e. each unique set of fields and data types) should have a corresponding parser file. | ||
|
||
AutoParser exists to semi-automate the process of writing new parser files. This requires | ||
a data dictionary (which can be created if it does not already exist, see [Creating a Data Dictionary](data_dict)), | ||
and the JSON schema of the target format. | ||
|
||
Parser generation is a two-step process. | ||
|
||
## Generate intermediate mappings (CSV) | ||
First, an intermediate mapping file is created which can look like this: | ||
|
||
| target_field | source_description | source_field | common_values | target_values | value_mapping | | ||
|-------------------|--------------------|------------------|----------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------| | ||
| identity | Identity | Identité | | | | | ||
| name | Full Name | Nom complet | | | | | ||
| loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | | | ||
| country_iso3 | | | | | | | ||
| notification_date | Notification Date | DateNotification | | | | | ||
| classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish | | ||
| case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive | | ||
|
||
`target_x` refers to the desired output format, while `source_x` refers to the raw data. | ||
In this example, the final row shows that the `case_status` field in the desired output | ||
format should be filled using data from the `StatusCas` field in the raw data. The `value_mapping` | ||
column indicates that all instances of `décédé` in the raw data should be mapped to `dead` | ||
in the converted file, and `vivant` should map to `alive`. | ||
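
As a rough sketch of how a `value_mapping` cell could be interpreted, the snippet below parses the `source=target` pairs and applies them to raw values. The function names are hypothetical, and the case-insensitive matching is an assumption suggested by the lower-cased keys in the example table; it is not taken from AutoParser or adtl.

```python
def parse_value_mapping(cell):
    """Parse a cell such as 'a=b, c=d' into a dict with lower-cased keys."""
    mapping = {}
    for pair in cell.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed entries
        source, target = pair.split("=", 1)
        mapping[source.strip().lower()] = target.strip()
    return mapping


def map_value(raw, mapping):
    """Return the mapped value, or None when the raw value has no mapping."""
    return mapping.get(raw.strip().lower())


mapping = parse_value_mapping("décédé=dead, vivant=alive")
# 'inconnu' is a hypothetical raw value with no mapping, so it maps to None.
converted = [map_value(v, mapping) for v in ["Vivant", "Décédé", "inconnu"]]
print(converted)
```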
|
||
These intermediate mappings should be manually curated, as they are generated using an | ||
LLM which may be prone to errors and hallucinations, generating incorrect matches for either | ||
the field, or the values within that field. | ||
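
One check that can support manual curation is listing any `common_values` that have no entry in `value_mapping`. The helper below is a hypothetical sketch, not part of AutoParser, and it assumes values are compared case-insensitively.

```python
def unmapped_values(common_values, value_mapping):
    """List common values (lower-cased) that have no entry in the value mapping."""
    mapped = {
        pair.split("=", 1)[0].strip().lower()
        for pair in value_mapping.split(",")
        if "=" in pair
    }
    missing = []
    for value in common_values.split(","):
        key = value.strip().lower()
        if key and key not in mapped and key not in missing:
            missing.append(key)
    return missing


# The classification row from the example table: every common value is covered.
complete = unmapped_values(
    "FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU",
    "mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, "
    "amphibie=amphibian, poisson=fish",
)

# Hypothetical variant of the case_status row with an extra, unmapped value.
incomplete = unmapped_values("Vivant, Décédé, Inconnu", "décédé=dead, vivant=alive")
print(complete, incomplete)
```

Rows that return a non-empty list are candidates for a closer look, although an unmapped value may be intentional if it should not be carried over to the target format.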
|
||
## Generate TOML | ||
|
||
This step is automated and should produce a TOML file that conforms to the adtl parser | ||
schema, ready to use for transforming data. | ||
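
To give a feel for this step, the sketch below renders one curated mapping row as a single TOML line. It is illustrative only: `render_field` is a hypothetical helper, and the `field`/`values` layout is an assumption loosely modelled on adtl's field-mapping style, so the adtl documentation remains the reference for the actual parser schema.

```python
import json


def toml_str(text):
    """Quote text as a TOML basic string (JSON string escaping is compatible)."""
    return json.dumps(text, ensure_ascii=False)


def render_field(target_field, source_field, value_mapping=None):
    """Render one mapping row as an inline TOML table (assumed layout).

    Assumes target_field is a valid bare TOML key (letters, digits, '_' or '-').
    """
    parts = [f"field = {toml_str(source_field)}"]
    if value_mapping:
        pairs = ", ".join(
            f"{toml_str(source)} = {toml_str(target)}"
            for source, target in value_mapping.items()
        )
        parts.append(f"values = {{ {pairs} }}")
    return f"{target_field} = {{ {', '.join(parts)} }}"


line = render_field(
    "case_status", "StatusCas", {"décédé": "dead", "vivant": "alive"}
)
print(line)
```

Source values are written as quoted keys because bare TOML keys cannot contain characters such as `é`.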
|
||
## API | ||
|
||
```{eval-rst} | ||
.. autofunction:: autoparser.create_mapping | ||
:noindex: | ||
.. autofunction:: autoparser.create_parser | ||
:noindex: | ||
``` |