From d76bd0fcbd8ff894d4d9f5842866b098d12181fe Mon Sep 17 00:00:00 2001
From: "Robert (Bob) Borges"
Date: Thu, 17 Oct 2024 13:45:27 +0200
Subject: [PATCH 1/7] Revert "Decision 8 Documenting quality dimensions"

---
 .../decisions/data-integrity-test-template.md | 10 ---------
 ...on-3-API-structure-of-data-repositories.md |  2 +-
 .../decision-8-quality-dimensions.md          | 21 -------------------
 docs/decisions/quality-dimension-template.md  | 19 -----------------
 4 files changed, 1 insertion(+), 51 deletions(-)
 delete mode 100644 docs/decisions/data-integrity-test-template.md
 delete mode 100644 docs/decisions/decision-8-quality-dimensions.md
 delete mode 100644 docs/decisions/quality-dimension-template.md

diff --git a/docs/decisions/data-integrity-test-template.md b/docs/decisions/data-integrity-test-template.md
deleted file mode 100644
index c4c3dca..0000000
--- a/docs/decisions/data-integrity-test-template.md
+++ /dev/null
@@ -1,10 +0,0 @@
-# Title
-
-## Summary
-Short description of the data integrity test(s).
-
-## What is the problem
-What do we want to check/test? What is the the problem, why has it been included as an automated test?
-
-## Previous experiences
-Previous experience with the test that might be relevant. E.g. has there been difficulties before?
diff --git a/docs/decisions/decision-3-API-structure-of-data-repositories.md b/docs/decisions/decision-3-API-structure-of-data-repositories.md
index 53fc714..62a7519 100644
--- a/docs/decisions/decision-3-API-structure-of-data-repositories.md
+++ b/docs/decisions/decision-3-API-structure-of-data-repositories.md
@@ -24,7 +24,7 @@ We store the data, test, and quality estimation in the following way.
 /quality/... -> scripts used for quality estimation
 /quality/data/... -> data used for quality estimation
 /quality/estimates/... -> estimates by version stored for easy access
-/quality/docs/... -> quality dimension descriptions
+
 
 ## Consequences
 This will make clear where different data sources go.
diff --git a/docs/decisions/decision-8-quality-dimensions.md b/docs/decisions/decision-8-quality-dimensions.md
deleted file mode 100644
index cfa92b1..0000000
--- a/docs/decisions/decision-8-quality-dimensions.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Title
-## Status
-proposed
-
-## Context
-An important part of the continous integration of the swerik corpora is the quality control step, consisting of automated tests and the estimation of quality dimensions of interest.
-
-The estimation of quality is commonly done using sampling strategies and are then automatically estimated during quality control. Currently the description of the quality estimation is found in a google doc. We want the quality estimation to be stored together with the actual corpora to document how the estimation is done.
-
-The test files for automated tests are already included as separate tests.
-
-Quality dimension estimation is commonly done by counting or by estimation based on a sample. There is a need to understand what is estimated and why it is estimated/motivation behind the quality estimation. Hence we need to describe how the each quality dimension more thoroughly.
-
-## Decision
-All separate quality dimensions sheet should be included in the actual repository where they are used/analyzed for each quality dimension we estimate in that corpus. See the quality dimension template.
-
-Each test should be stored as `test-[what-is-checked-automatically].py/r`. Each test should include a header in comment describing the test and why it has been included. See the data-integrity-test template.
-
-## Consequences
-This makes it easier to follow and understand the quality dimension, unit testing and to automatically parse the descriptions of the tests and how it is computed for each individual corpus. It also makes the corpus quality control process an integrated part of each individual corpus that then will work independently.
-
diff --git a/docs/decisions/quality-dimension-template.md b/docs/decisions/quality-dimension-template.md
deleted file mode 100644
index c584559..0000000
--- a/docs/decisions/quality-dimension-template.md
+++ /dev/null
@@ -1,19 +0,0 @@
-# Title
-
-## Summary
-Short description of the quality dimension.
-
-## What is the problem
-What do we want to estimate? What is the the problem, why has it been included as a quality dimension?
-
-## Estimation procedure
-How is the estimation conducted (in words) What dataset has been been used to estimate
-
-### Sampling plan [if applicable]
-How has the sampling been done?
-
-### Annotation guidelines [if applicable]
-What is the change that we're proposing and/or doing?
-
-## Previous experiences
-E.g. how long time does it take to annotate?

From ce76b9fa7372d94f5433fef884c95e21fbc335a0 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 13:52:59 +0200
Subject: [PATCH 2/7] chore: revert undo merge for decision-3

---
 ...on-3-API-structure-of-data-repositories.md |  2 +-
 .../decision-8-quality-dimensions.md          | 21 +++++++++++++++++++
 docs/decisions/quality-dimension-template.md  | 19 +++++++++++++++++
 3 files changed, 41 insertions(+), 1 deletion(-)
 create mode 100644 docs/decisions/decision-8-quality-dimensions.md
 create mode 100644 docs/decisions/quality-dimension-template.md

diff --git a/docs/decisions/decision-3-API-structure-of-data-repositories.md b/docs/decisions/decision-3-API-structure-of-data-repositories.md
index 62a7519..53fc714 100644
--- a/docs/decisions/decision-3-API-structure-of-data-repositories.md
+++ b/docs/decisions/decision-3-API-structure-of-data-repositories.md
@@ -24,7 +24,7 @@ We store the data, test, and quality estimation in the following way.
 /quality/... -> scripts used for quality estimation
 /quality/data/... -> data used for quality estimation
 /quality/estimates/... -> estimates by version stored for easy access
-
+/quality/docs/... -> quality dimension descriptions
 
 ## Consequences
 This will make clear where different data sources go.
diff --git a/docs/decisions/decision-8-quality-dimensions.md b/docs/decisions/decision-8-quality-dimensions.md
new file mode 100644
index 0000000..cfa92b1
--- /dev/null
+++ b/docs/decisions/decision-8-quality-dimensions.md
@@ -0,0 +1,21 @@
+# Title
+## Status
+proposed
+
+## Context
+An important part of the continous integration of the swerik corpora is the quality control step, consisting of automated tests and the estimation of quality dimensions of interest.
+
+The estimation of quality is commonly done using sampling strategies and are then automatically estimated during quality control. Currently the description of the quality estimation is found in a google doc. We want the quality estimation to be stored together with the actual corpora to document how the estimation is done.
+
+The test files for automated tests are already included as separate tests.
+
+Quality dimension estimation is commonly done by counting or by estimation based on a sample. There is a need to understand what is estimated and why it is estimated/motivation behind the quality estimation. Hence we need to describe how the each quality dimension more thoroughly.
+
+## Decision
+All separate quality dimensions sheet should be included in the actual repository where they are used/analyzed for each quality dimension we estimate in that corpus. See the quality dimension template.
+
+Each test should be stored as `test-[what-is-checked-automatically].py/r`. Each test should include a header in comment describing the test and why it has been included. See the data-integrity-test template.
+
+## Consequences
+This makes it easier to follow and understand the quality dimension, unit testing and to automatically parse the descriptions of the tests and how it is computed for each individual corpus. It also makes the corpus quality control process an integrated part of each individual corpus that then will work independently.
+
diff --git a/docs/decisions/quality-dimension-template.md b/docs/decisions/quality-dimension-template.md
new file mode 100644
index 0000000..c584559
--- /dev/null
+++ b/docs/decisions/quality-dimension-template.md
@@ -0,0 +1,19 @@
+# Title
+
+## Summary
+Short description of the quality dimension.
+
+## What is the problem
+What do we want to estimate? What is the the problem, why has it been included as a quality dimension?
+
+## Estimation procedure
+How is the estimation conducted (in words) What dataset has been been used to estimate
+
+### Sampling plan [if applicable]
+How has the sampling been done?
+
+### Annotation guidelines [if applicable]
+What is the change that we're proposing and/or doing?
+
+## Previous experiences
+E.g. how long time does it take to annotate?

From 93bf221d5cb52408006df57d2567df4366e88312 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 13:56:25 +0200
Subject: [PATCH 3/7] chore: revert unmerged file

---
 docs/decisions/data-integrity-test-template.md | 10 ++++++++++
 1 file changed, 10 insertions(+)
 create mode 100644 docs/decisions/data-integrity-test-template.md

diff --git a/docs/decisions/data-integrity-test-template.md b/docs/decisions/data-integrity-test-template.md
new file mode 100644
index 0000000..c4c3dca
--- /dev/null
+++ b/docs/decisions/data-integrity-test-template.md
@@ -0,0 +1,10 @@
+# Title
+
+## Summary
+Short description of the data integrity test(s).
+
+## What is the problem
+What do we want to check/test? What is the the problem, why has it been included as an automated test?
+
+## Previous experiences
+Previous experience with the test that might be relevant. E.g. has there been difficulties before?
From e2593076791388181d348fc88fb18bd70d4c9bb6 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 14:30:18 +0200
Subject: [PATCH 4/7] feat: elaborate description

---
 docs/decisions/decision-8-quality-dimensions.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/decisions/decision-8-quality-dimensions.md b/docs/decisions/decision-8-quality-dimensions.md
index cfa92b1..f4f173d 100644
--- a/docs/decisions/decision-8-quality-dimensions.md
+++ b/docs/decisions/decision-8-quality-dimensions.md
@@ -1,9 +1,10 @@
-# Title
+# How do document quality dimensions and data integrity tests
 ## Status
+
 proposed
 
 ## Context
-An important part of the continous integration of the swerik corpora is the quality control step, consisting of automated tests and the estimation of quality dimensions of interest.
+An important part of the continuous integration of the SWERIK corpora is the quality control step, consisting of automated tests and the estimation of quality dimensions of interest.
 
 The estimation of quality is commonly done using sampling strategies and are then automatically estimated during quality control. Currently the description of the quality estimation is found in a google doc. We want the quality estimation to be stored together with the actual corpora to document how the estimation is done.
 
@@ -16,6 +17,8 @@ All separate quality dimensions sheet should be included in the actual repositor
 
 Each test should be stored as `test-[what-is-checked-automatically].py/r`. Each test should include a header in comment describing the test and why it has been included. See the data-integrity-test template.
 
+Python functions contained in the text files are to be described with docstrings that "include" the relevant .md files. API documentation is then generated with the pdoc module and becomes available under `swerik-project.github.io//quality` or `swerik-project.gihub.io//data-integrity`
+
 ## Consequences
 This makes it easier to follow and understand the quality dimension, unit testing and to automatically parse the descriptions of the tests and how it is computed for each individual corpus. It also makes the corpus quality control process an integrated part of each individual corpus that then will work independently.
 

From 159f5ae283ba2c00d383f764b6e2bdea11fd6406 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 14:31:30 +0200
Subject: [PATCH 5/7] style: UNDERSCORE

---
 ...n-8-quality-dimensions.md => decision-8_quality-dimensions.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename docs/decisions/{decision-8-quality-dimensions.md => decision-8_quality-dimensions.md} (100%)

diff --git a/docs/decisions/decision-8-quality-dimensions.md b/docs/decisions/decision-8_quality-dimensions.md
similarity index 100%
rename from docs/decisions/decision-8-quality-dimensions.md
rename to docs/decisions/decision-8_quality-dimensions.md

From 2abee830bb8d9434a5c9ab55743622a9a6b62204 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 15:12:16 +0200
Subject: [PATCH 6/7] chore: specify qe vs test

---
 docs/decisions/decision-8_quality-dimensions.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/decisions/decision-8_quality-dimensions.md b/docs/decisions/decision-8_quality-dimensions.md
index f4f173d..2802318 100644
--- a/docs/decisions/decision-8_quality-dimensions.md
+++ b/docs/decisions/decision-8_quality-dimensions.md
@@ -13,11 +13,12 @@ The test files for automated tests are already included as separate tests.
 Quality dimension estimation is commonly done by counting or by estimation based on a sample. There is a need to understand what is estimated and why it is estimated/motivation behind the quality estimation. Hence we need to describe how the each quality dimension more thoroughly.
 
 ## Decision
-All separate quality dimensions sheet should be included in the actual repository where they are used/analyzed for each quality dimension we estimate in that corpus. See the quality dimension template.
 
-Each test should be stored as `test-[what-is-checked-automatically].py/r`. Each test should include a header in comment describing the test and why it has been included. See the data-integrity-test template.
+All quality estimations and data-integrity tests should be included in the actual repository where they are used/analyzed. Each quality dimension estimation should be stored as `qe-[what-dimension-is-estimated].py/r` and should include a short docstring / header describing the the estimation as well as a reference to or "inclusion" of the markdown file that contains more detailed descriptions of what is estimated and the estimation process. See the quality dimension template.
 
-Python functions contained in the text files are to be described with docstrings that "include" the relevant .md files. API documentation is then generated with the pdoc module and becomes available under `swerik-project.github.io//quality` or `swerik-project.gihub.io//data-integrity`
+Similarly each data-integrity test should be stored as `test-[what-is-checked-automatically].py/r` and should include a short docstring / header describing the test as well as a reference to or "inclusion" of the markdown file that contains more detailed descriptions of what is tested and the testing process. See the data-integrity-test template.
+
+Python functions contained in the quality estimation and data-integrity test files described with docstrings that "include" the relevant .md files provide the basis to end-user API documentation. API documentation is then generated with the pdoc module and becomes available under `swerik-project.github.io//quality-estimation` or `swerik-project.gihub.io//data-integrity`
 
 ## Consequences
 This makes it easier to follow and understand the quality dimension, unit testing and to automatically parse the descriptions of the tests and how it is computed for each individual corpus. It also makes the corpus quality control process an integrated part of each individual corpus that then will work independently.

From ffd7e48a3e2b5f09dddf4dc1172823b2404fd4b4 Mon Sep 17 00:00:00 2001
From: Bob Borges
Date: Thu, 17 Oct 2024 15:24:23 +0200
Subject: [PATCH 7/7] feat: preliminary approval

---
 docs/decisions/decision-8_quality-dimensions.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/decisions/decision-8_quality-dimensions.md b/docs/decisions/decision-8_quality-dimensions.md
index 2802318..13a847c 100644
--- a/docs/decisions/decision-8_quality-dimensions.md
+++ b/docs/decisions/decision-8_quality-dimensions.md
@@ -1,7 +1,9 @@
 # How do document quality dimensions and data integrity tests
 ## Status
 
-proposed
+Decided
+
+- Decision: approved
 
 ## Context
 An important part of the continuous integration of the SWERIK corpora is the quality control step, consisting of automated tests and the estimation of quality dimensions of interest.
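
Note on the naming convention in decision 8 (as of PATCH 6/7): quality estimation scripts are named `qe-[what-dimension-is-estimated].py/r` and data-integrity tests `test-[what-is-checked-automatically].py/r`. A minimal sketch of how such names could be checked and mapped to the two documentation sections is given below. The helper `classify_script`, the example file names, and the reading of `.py/r` as "`.py`, `.r` or `.R`" are assumptions made for illustration and are not part of the patch series.

```python
import re
from typing import Optional, Tuple

# Assumption: ".py/r" in the decision text means a ".py", ".r" or ".R" extension.
_NAME_PATTERN = re.compile(r"^(qe|test)-([A-Za-z0-9_-]+)\.(py|r|R)$")

# Section names follow the documentation paths mentioned in the decision text.
_SECTIONS = {"qe": "quality-estimation", "test": "data-integrity"}


def classify_script(filename: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (documentation section, subject) for a
    file name that follows the convention, or None if it does not."""
    match = _NAME_PATTERN.match(filename)
    if match is None:
        return None
    prefix, subject, _extension = match.groups()
    return _SECTIONS[prefix], subject


if __name__ == "__main__":
    # The file names below are made up for the example.
    for name in ("qe-ocr-accuracy.py", "test-unique-ids.R", "estimate.py"):
        print(name, "->", classify_script(name))
```

A check of this kind could run alongside the existing automated tests so that scripts outside the convention are flagged before documentation is generated; whether that is desirable is left to the maintainers.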