This repository has been archived by the owner on Dec 10, 2024. It is now read-only.

Commit

Add ReadTheDocs doccumentation (#1)
Add basic documentation for API
pipliggins authored Nov 20, 2024
1 parent 766d381 commit c167219
Showing 10 changed files with 160 additions and 9 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
name: Unit and integration tests
name: tests
on:
workflow_dispatch:
push:
2 changes: 2 additions & 0 deletions .readthedocs.yaml
@@ -23,4 +23,6 @@ sphinx:
# Optionally declare the Python requirements required to build your docs
python:
install:
- method: pip
path: .
- requirements: docs/requirements.txt
7 changes: 5 additions & 2 deletions README.md
@@ -1,7 +1,8 @@
# autoparser

[![](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![tests](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml/badge.svg)](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml)
[![Test Status](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml/badge.svg)](https://github.com/globaldothealth/autoparser/actions/workflows/tests.yml)
[![Documentation Status](https://readthedocs.org/projects/insightboard/badge/?version=latest)](https://insightboard.readthedocs.io/en/latest/?badge=latest)
<!-- [![codecov](https://codecov.io/gh/globaldothealth/autoparser/graph/badge.svg?token=AINU8PNJE3)](https://codecov.io/gh/globaldothealth/autoparser) -->
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

@@ -10,6 +11,8 @@ TOML files, which can then be processed by
[adtl](https://github.com/globaldothealth/adtl) to transform files from the
source schema to a specified schema.

Documentation: [ReadTheDocs](https://autoparser.readthedocs.io/en/latest)

Contains functionality to:
1. Create a basic data dictionary from a raw data file (`create-dict`)
2. Use an LLM (currently only ChatGPT via the OpenAI API) to add descriptions to the
@@ -93,7 +96,7 @@ defaultDateFormat = "%d/%m/%Y"
which should automatically convert the dates for you.

2. ADTL can't find my schema (error: No such file or directory ..../x.schema.json)
autoparser puts the path to the schema at the top of the TOML file, relative to the
AutoParser puts the path to the schema at the top of the TOML file, relative to the
*current location of the parser* (i.e., where you ran the autoparser command from).
If you have since moved the parser file, you will need to update the schema path at the
top of the TOML parser.
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -4,7 +4,7 @@
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = "InsightBoard"
project = "AutoParser"
copyright = "2024, globaldothealth"
author = "globaldothealth"
# -- General configuration ---------------------------------------------------
@@ -14,9 +14,9 @@
"sphinx.ext.napoleon",
"sphinx.ext.coverage",
"sphinx.ext.graphviz",
"myst_parser",
"sphinx_book_theme",
"sphinxcontrib.mermaid",
"myst_nb",
]
templates_path = [
"_templates",
19 changes: 19 additions & 0 deletions docs/getting_started/index.md
@@ -0,0 +1,19 @@
# Getting started

## Installation

AutoParser is a Python package that can either be built into your code or run as a
command-line interface (CLI). You can install AutoParser using pip:

```bash
python3 -m pip install git+https://github.com/globaldothealth/autoparser
```

Note that it is usually recommended to install into a virtual environment. We recommend using [uv](https://github.com/astral-sh/uv) to manage the virtual environment. To create and activate a virtual environment for AutoParser using `uv`, run the following commands:

```bash
uv sync
. .venv/bin/activate
```

To see the available CLI options, type `autoparser` at the command line.
Binary file added docs/images/flowchart.png
17 changes: 15 additions & 2 deletions docs/index.md
@@ -1,14 +1,27 @@
# autoparser
Autparser is a tool for semi-automated data parser creation.
# AutoParser
AutoParser is a tool for semi-automated data parser creation. The package allows you
to generate a new data parser for converting your source data into a new format specified
using a schema file, ready to use with the data transformation tool [adtl](https://adtl.readthedocs.io/en/latest/index.html).

## Key Features
- Data Dictionary Creation: Automatically create a basic data dictionary framework
- Parser Generation: Generate data parsers to match a given schema

## Framework

```{figure} images/flowchart.png
Flowchart showing the inputs (bright blue), outputs (green blocks) and functions
(dashed diamonds) of AutoParser.
```

## Documentation
```{toctree}
---
maxdepth: 2
caption: Contents:
---
self
getting_started/index
usage/data_dict
usage/parser_generation
```
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -1,4 +1,4 @@
sphinx==8.0.2
myst_parser==4.0.0
sphinx-book-theme==1.1.3
sphinxcontrib-mermaid==0.9.2
sphinxcontrib-mermaid==0.9.2
myst_nb
63 changes: 63 additions & 0 deletions docs/usage/data_dict.md
@@ -0,0 +1,63 @@
# Creating a Data Dictionary

## Motivation

A data dictionary is a structured guide which contains the details of a data file.
It should contain, at minimum, a list of field/column names, and some kind of description
of what data each field holds. This often takes the form of a textual description, plus
a note of the data type (text, decimals, date, boolean...) and/or a set of expected values.

A data dictionary is required by AutoParser for [parser generation](parser_generation).
This is to avoid having to send potentially sensitive or confidential data to an external
body (in this case an externally hosted LLM); instead a *description* of what the
data looks like from the dictionary can be sent to the LLM, which allows for mapping to
occur without risking the unintentional release of data.

Many data capture services such as [REDCap](https://projectredcap.org/) will generate
a data dictionary automatically when surveys are set up. However, where data is being
captured either rapidly, or by individuals/small teams, a formal data dictionary may not
have been created for a corresponding dataset. For this scenario, AutoParser provides
functionality to generate a simple dictionary based on your data. This dictionary can
then be used in other AutoParser modules.

## Create a basic data dictionary
AutoParser will take your raw data file and create a basic data dictionary. For an example
dataset of animals, a generated data dictionary looks like this:

| source_field | source_description | source_type | common_values |
|-------------------|--------------------|-------------|----------------------------------------------------------|
| Identité | | string | |
| Province | | choice | Equateur, Orientale, Katanga, Kinshasa |
| DateNotification | | string | |
| Classicfication | | choice | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU |
| Nom complet | | string | |
| Date de naissance | | string | |
| AgeAns | | number | |
| AgeMois | | number | |
| Sexe | | choice | F, M, f, m, f, m , inconnu |

`source_field` contains each column header from the source data, and `source_type` shows the
data type in each column. 'choice' denotes where a small set of strings has been detected,
so AutoParser assumes that specified terms are being used, and lists them in `common_values`.
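
The type detection described above can be illustrated with a minimal sketch. This is not AutoParser's implementation: the function names, the sample data and the `max_choice_ratio` threshold are all hypothetical, chosen only to show how a column might be labelled `number`, `choice` or `string`.

```python
import csv
import io

# Hypothetical sample data, loosely based on the animals example above.
SAMPLE_CSV = """Province,Nom complet,AgeAns,Sexe
Equateur,Luna,3,F
Katanga,Max,5,M
Equateur,Coco,2,f
Kinshasa,Rex,7,M
Katanga,Bella,1,F
Kinshasa,Milo,4,M
"""


def is_number(text):
    """Return True if the text can be read as a number."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def infer_type(values, max_choice_ratio=0.5):
    """Guess a column type; the ratio threshold is an assumption of this sketch."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string", []
    if all(is_number(v) for v in non_empty):
        return "number", []
    # dict.fromkeys keeps the first-seen order of the distinct values.
    distinct = list(dict.fromkeys(non_empty))
    if len(distinct) / len(non_empty) <= max_choice_ratio:
        return "choice", distinct
    return "string", []


def sketch_data_dict(csv_text):
    """Build one dictionary row per column, leaving the description empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for field in reader.fieldnames or []:
        source_type, common = infer_type([row[field] or "" for row in rows])
        entries.append(
            {
                "source_field": field,
                "source_description": "",
                "source_type": source_type,
                "common_values": ", ".join(common),
            }
        )
    return entries


if __name__ == "__main__":
    for entry in sketch_data_dict(SAMPLE_CSV):
        print(entry)
```

In this sketch a column counts as 'choice' when its distinct values make up at most half of its non-empty entries; the cut-off AutoParser actually uses may differ.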

Notice that the `source_description` column is empty. This is the default, so that the
user can add a short text description *in English* (the LLM reads this column in later
steps and assumes the text is written in English). For example, the description
for the `AgeMois` column might be 'Age in Months'.

If instead you would like to auto-generate these descriptions, AutoParser can use an LLM
to automate this step. Note, we strongly encourage all users to check the results of the
auto-generated descriptions for accuracy before proceeding to use the described data dictionary
to generate a data parser.

## API

```{eval-rst}
.. autofunction:: autoparser.create_dict
:noindex:
.. autofunction:: autoparser.generate_descriptions
:noindex:
```


51 changes: 51 additions & 0 deletions docs/usage/parser_generation.md
@@ -0,0 +1,51 @@
# Write a Data Parser

AutoParser assumes the use of Global.Health's [adtl](https://github.com/globaldothealth/adtl)
package to transform your source data into a standardised format. To do this, adtl requires a
[TOML](https://toml.io/en/) specification file which describes how raw data should be
converted into the new format, on a field-by-field basis. Every unique data file format
(i.e. unique sets of fields and data types) should have a corresponding parser file.

AutoParser exists to semi-automate the process of writing new parser files. This requires
a data dictionary (which can be created if it does not already exist, see [Creating a Data Dictionary](data_dict)),
and the JSON schema of the target format.

Parser generation is a 2-step process.

## Generate intermediate mappings (CSV)
First, an intermediate mapping file is created which can look like this:

| target_field | source_description | source_field | common_values | target_values | value_mapping |
|-------------------|--------------------|------------------|----------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------------------------------------|
| identity | Identity | Identité | | | |
| name | Full Name | Nom complet | | | |
| loc_admin_1 | Province | Province | Equateur, Orientale, Katanga, Kinshasa | | |
| country_iso3 | | | | | |
| notification_date | Notification Date | DateNotification | | | |
| classification | Classification | Classicfication | FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU | mammal, bird, reptile, amphibian, fish, invertebrate, None | mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish |
| case_status | Case Status | StatusCas | Vivant, Décédé | alive, dead, unknown, None | décédé=dead, vivant=alive |

`target_x` refers to the desired output format, while `source_x` refers to the raw data.
In this example, the final row shows that the `case_status` field in the desired output
format should be filled using data from the `StatusCas` field in the raw data. The `value_mapping`
column indicates that all instances of `décédé` in the raw data should be mapped to `dead`
in the converted file, and `vivant` should map to `alive`.
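
How a `value_mapping` cell translates into a lookup can be sketched as follows. The helper names are hypothetical, and the case-insensitive matching is an assumption drawn from the table above, where `Vivant` in `common_values` pairs with `vivant` in `value_mapping`. It is an illustration of the idea, not a description of how adtl applies the mapping.

```python
def parse_value_mapping(cell):
    """Turn a 'source=target, source=target' cell into a dictionary."""
    mapping = {}
    for pair in cell.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        source, target = pair.split("=", 1)
        # Lower-casing the source key is an assumption of this sketch.
        mapping[source.strip().lower()] = target.strip()
    return mapping


def apply_mapping(value, mapping):
    """Look up a raw value; returns None when no mapping is defined."""
    return mapping.get(value.strip().lower())


if __name__ == "__main__":
    case_status = parse_value_mapping("décédé=dead, vivant=alive")
    for raw in ["Vivant", "Décédé", "inconnu"]:
        print(raw, "->", apply_mapping(raw, case_status))
```

Values with no entry in the mapping return `None` here, which makes unmapped terms easy to spot when curating the intermediate file.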

These intermediate mappings should be manually curated, as they are generated using an
LLM which may be prone to errors and hallucinations, generating incorrect matches for either
the field, or the values within that field.

## Generate TOML

This step is automated and should produce a TOML file that conforms to the adtl parser
schema, ready to use for transforming data.
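
To give a rough idea of what one curated mapping row could become, the sketch below renders a single field as a TOML fragment. It is not AutoParser's generator: `render_field` and the table name `animals` are hypothetical, and the `field`/`values` layout is assumed here and should be checked against the adtl documentation.

```python
import json


def toml_str(text):
    """Quote a string for TOML; JSON string escaping is used as a stand-in."""
    return json.dumps(text, ensure_ascii=False)


def render_field(table, target_field, source_field, value_mapping=None):
    """Render one target field as a TOML table (layout is an assumption).

    `table` and `target_field` are assumed to be valid bare TOML keys
    (letters, digits, underscores and dashes only).
    """
    lines = [
        f"[{table}.{target_field}]",
        f"field = {toml_str(source_field)}",
    ]
    if value_mapping:
        pairs = ", ".join(
            f"{toml_str(src)} = {toml_str(dst)}" for src, dst in value_mapping.items()
        )
        lines.append(f"values = {{ {pairs} }}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        render_field(
            "animals",  # hypothetical table name
            "case_status",
            "StatusCas",
            {"décédé": "dead", "vivant": "alive"},
        )
    )
```

Source values are quoted as keys because bare TOML keys cannot contain accented characters.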

## API

```{eval-rst}
.. autofunction:: autoparser.create_mapping
:noindex:
.. autofunction:: autoparser.create_parser
:noindex:
```
