refactor!: clean up app #474

Merged
merged 120 commits into from
Aug 7, 2023
120 commits
b17d52c
refactor: update TokenType + remove protein termination classifier
korikuzma May 10, 2023
008b632
refactor: Remove TokenMatchType (not necessary)
korikuzma May 10, 2023
c2a47bb
forgot to remove additional match_type
korikuzma May 10, 2023
111d104
refactor: clean up handling unknown tokens
korikuzma May 10, 2023
cc79fd1
refactor: Rename GeneMatchToken to GeneToken
korikuzma May 10, 2023
5cd918d
refactor: create AltType str enum
korikuzma May 10, 2023
095c9e9
refactor: update TokenType type
korikuzma May 10, 2023
b79410c
refactor: remove LookupType enum (not used)
korikuzma May 10, 2023
67ab56a
wip: add work for tokenizers
korikuzma May 22, 2023
944d6db
wip: storing progress
korikuzma May 26, 2023
a4c2bf7
wip: store progress for delins
korikuzma Jun 6, 2023
8b76419
wip: store progress for cdna insertion
korikuzma Jun 12, 2023
7857767
wip: store progress for ref agree
korikuzma Jun 12, 2023
6778521
wip: minor cleanup of validators
korikuzma Jun 12, 2023
d567c5a
wip: store progress for protein stop gain
korikuzma Jun 13, 2023
59e7942
wip: store initial work for genomic dup
korikuzma Jun 13, 2023
4dd8043
wip: store progress for genomic ambiguous dups
korikuzma Jun 20, 2023
b652db8
wip: progress for genomic del
korikuzma Jun 20, 2023
d2cf750
wip: remove canonical variation work
korikuzma Jun 22, 2023
d909427
wip: fix classifiers
korikuzma Jun 22, 2023
80dd57c
wip: handle if mane none in translators
korikuzma Jun 22, 2023
1b39114
wip: tmp stop running gh actions
korikuzma Jun 22, 2023
8dd9d92
wip: more progress for normalize
korikuzma Jun 26, 2023
df8e845
wip: store more progress
korikuzma Jun 27, 2023
450981e
wip: add genomic del ambiguous
korikuzma Jun 27, 2023
8310cdf
Merge branch 'main' into issue-332-kori-merge-main
korikuzma Jun 28, 2023
8512eea
wip: fix tokenizers
korikuzma Jun 28, 2023
6ee5110
wip: clean up classifier tests
korikuzma Jun 28, 2023
666ab23
wip: rename coding dna --> cdna
korikuzma Jun 28, 2023
f194615
wip: storing progress for dup1
korikuzma Jun 28, 2023
aa4d895
wip: dup/del progress
korikuzma Jun 29, 2023
082ebe0
wip: more progress for dup del
korikuzma Jun 30, 2023
89aee85
wip: get hgvs dup del mode tests to pass
korikuzma Jun 30, 2023
91c56b0
wip: fix bug in to_vrs
korikuzma Jun 30, 2023
e3296a5
wip: rm unused fixture + fix normalize test
korikuzma Jul 5, 2023
5a3443d
Merge branch 'main' into issue-332-kori
korikuzma Jul 13, 2023
716025d
wip: use local cool-seq-tool
korikuzma Jul 13, 2023
126709b
wip: classification.gene --> classification.gene_token
korikuzma Jul 13, 2023
ef1239e
wip: fixes to gene -> gene_token in classifiers
korikuzma Jul 13, 2023
1a9d375
wip: fix cool-seq-tool imports
korikuzma Jul 14, 2023
73eedcb
wip: fix setting gene in genomic translators
korikuzma Jul 14, 2023
e65d4d7
wip: fix case for BRAF V512E test in normalize
korikuzma Jul 14, 2023
d43f552
wip: update genomic sub for normalize
korikuzma Jul 14, 2023
511da9d
wip: rm support for some acs
korikuzma Jul 14, 2023
c043da3
wip: more progress for genomic
korikuzma Jul 16, 2023
8678516
wip: initial work for to_copy_number
korikuzma Jul 17, 2023
908f89d
wip: sort translation results
korikuzma Jul 17, 2023
66f2e55
wip: rm get_mane_valid_result
korikuzma Jul 17, 2023
8d53275
wip: sort translation result
korikuzma Jul 17, 2023
c5a1324
wip: fix gnomad vcf / genomic insertion
korikuzma Jul 17, 2023
4c36c8b
wip: fix classifier test
korikuzma Jul 18, 2023
8e31b68
wip: more fixes
korikuzma Jul 18, 2023
3866538
wip: fix gnomad vcf to protein
korikuzma Jul 19, 2023
21fcc1a
wip: fix cnv tests
korikuzma Jul 19, 2023
ff10b04
wip: make mappings an instance var
korikuzma Jul 20, 2023
2c6bc3c
wip: update initializing cst
korikuzma Jul 20, 2023
57dd0bf
wip: rm gene_tokens
korikuzma Jul 20, 2023
c8e689b
wip: clean up some todos
korikuzma Jul 20, 2023
ebf63a1
wip: more cleanup
korikuzma Jul 20, 2023
11c5672
wip: clean up some flake8 errors
korikuzma Jul 20, 2023
6a85a6c
wip: clean up del/dup ambiguous translate method
korikuzma Jul 21, 2023
299c1fd
wip: clean up genomic del/dup translator
korikuzma Jul 21, 2023
bdb8882
wip: flake8 + fix gnomad vcf to protein
korikuzma Jul 21, 2023
fcdd907
wip: more flake8
korikuzma Jul 21, 2023
9c4e929
wip: rm todos for issues
korikuzma Jul 24, 2023
1377c8d
wip: fix gnomad vcf deletions
korikuzma Jul 24, 2023
c5f0266
wip: update cool-seq-tool version
korikuzma Jul 24, 2023
b58cde3
wip: transcripts --> accessions
korikuzma Jul 24, 2023
8b1e4ef
wip: rm todos (made new issues / commented on existing)
korikuzma Jul 24, 2023
0110910
wip: coding dna --> cdna
korikuzma Jul 24, 2023
98e9629
wip: validate gene pos
korikuzma Jul 24, 2023
8a25ac4
wip: update validation checks
korikuzma Jul 24, 2023
8bba456
wip: rm todo check liftover note
korikuzma Jul 24, 2023
20fe592
wip: remove unused code
korikuzma Jul 25, 2023
88dd0f2
wip: rename instance vars + imports for vrs-python
korikuzma Jul 25, 2023
8d8e570
wip: clean up class instance vars
korikuzma Jul 25, 2023
06af434
wip: rm CodonTable class, create function in gnomad vcf to protein
korikuzma Jul 25, 2023
5219160
wip: more flake8
korikuzma Jul 25, 2023
5c4b71c
wip: HGVSDupDelModeEnum -> HGVSDupDelModeOption + fix CopyChange import
korikuzma Jul 25, 2023
d213851
wip: clean up validator tests
korikuzma Jul 25, 2023
1724266
wip: move validator fixtures
korikuzma Jul 26, 2023
fdcfe33
wip: rm duplications from to_vrs
korikuzma Jul 26, 2023
ead256d
wip: add amplification import + allow parentheses in hgvs
korikuzma Jul 26, 2023
3a14c63
wip: accidentally used wrong var name
korikuzma Jul 26, 2023
3a170aa
wip: update translator tests + fix translator bugs
korikuzma Jul 26, 2023
330f7b9
wip: add amplification validator test
korikuzma Jul 26, 2023
38354f0
wip: move classifier tests back to yaml
korikuzma Jul 26, 2023
c44575c
wip: move validator tests back to yaml
korikuzma Jul 26, 2023
55b6813
wip: refactor tokenizer tests
korikuzma Jul 26, 2023
858e2dc
wip: cleanup schemas (flake8/enum changes)
korikuzma Jul 27, 2023
802215a
wip: cleanup classifiers (flake8)
korikuzma Jul 27, 2023
187fe34
wip: flake8 for normalize + utils
korikuzma Jul 27, 2023
660fd42
wip: cleanup tests (flake8/vulture)
korikuzma Jul 27, 2023
51c8254
wip: resolve flake8 errors
korikuzma Jul 27, 2023
7a11575
wip: update gh actions
korikuzma Jul 27, 2023
c4672f9
add more comments
korikuzma Jul 27, 2023
031c36a
bump version
korikuzma Jul 27, 2023
d511da4
cleanup: replace flake8 with ruff
korikuzma Jul 29, 2023
18f7058
Merge branch 'main' into issue-332-kori
korikuzma Jul 29, 2023
c4e25eb
cleanup: add black
korikuzma Jul 29, 2023
99bc49e
style: add isort to ruff select
korikuzma Jul 29, 2023
934663d
fix: hgvs_to_copy_number_count requires baseline_copies
korikuzma Jul 29, 2023
0636f48
cicd: combine black + ruff into one job
korikuzma Jul 29, 2023
222778f
style: remove old noqa
korikuzma Jul 29, 2023
6c8c122
refactor: create methods for validating pos
korikuzma Jul 30, 2023
6859e93
refactor: add method for validating protein hgvs classification
korikuzma Jul 30, 2023
6e705d2
refactor: cdna + protein translators
korikuzma Jul 30, 2023
421ba5d
style: ignore ANN003 - missing-type-kwargs
korikuzma Jul 31, 2023
fd26395
fix: classifier import
korikuzma Jul 31, 2023
051d8f7
docs: cleanup readme
korikuzma Aug 1, 2023
60f68eb
refactor: ensure unique list of warnings in service response
korikuzma Aug 1, 2023
d6cbd64
fix: validate Allele in to_vrs_allele
korikuzma Aug 1, 2023
2e2351d
pr review changes
korikuzma Aug 1, 2023
6bcb1ed
fix: forgot to update return in to copy number variation
korikuzma Aug 1, 2023
bf3458b
update readme on why we return cdna when given gene genomic change
korikuzma Aug 1, 2023
1bb95ae
pr review changes
korikuzma Aug 1, 2023
ce7bbfb
update invalid tests for normalize + put in todo reminder
korikuzma Aug 1, 2023
ae7ef17
tests: add test for genomic delins change w gene
korikuzma Aug 2, 2023
4bac65d
tests: stop checking exact gene normalizer response
korikuzma Aug 2, 2023
f422b4d
pr review changes: update type hints + docstrings
korikuzma Aug 2, 2023
22 changes: 0 additions & 22 deletions .flake8

This file was deleted.

11 changes: 11 additions & 0 deletions .github/workflows/checks.yml
@@ -2,6 +2,7 @@ name: checks
on: [push, pull_request]
jobs:
deps:
name: deps py${{ matrix.python-version }}
runs-on: ubuntu-latest
strategy:
fail-fast: false
@@ -19,3 +20,13 @@ jobs:
run: |
python -m pip install pipenv
pipenv install --skip-lock # this is what Elastic beanstalk uses
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- name: black
uses: psf/black@stable

- name: ruff
uses: chartboost/ruff-action@v1
4 changes: 2 additions & 2 deletions .gitignore
@@ -25,8 +25,6 @@ Pipfile.lock

.python-version

pyproject.toml

# Jupyter Notebook
.ipynb_checkpoints/

@@ -38,3 +36,5 @@ pyproject.toml

build/
dynamodb_local_latest/

*.http
28 changes: 19 additions & 9 deletions .pre-commit-config.yaml
@@ -1,12 +1,22 @@
# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v1.4.0
hooks:
- id: flake8
additional_dependencies: [flake8-docstrings, flake8-quotes, flake8-import-order, flake8-annotations]
- id: check-added-large-files
- id: detect-private-key
- id: trailing-whitespace
- id: end-of-file-fixer
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v1.4.0
hooks:
- id: check-added-large-files
- id: detect-private-key
- id: trailing-whitespace
- id: end-of-file-fixer
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.0.280
hooks:
- id: ruff
args: [ --fix, --exit-non-zero-on-fix ]
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
args: ["--check"]
language_version: python3.10
3 changes: 1 addition & 2 deletions LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2018 VICC
Copyright (c) 2018-2023 VICC

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -19,4 +19,3 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

9 changes: 3 additions & 6 deletions Pipfile
@@ -7,16 +7,13 @@ verify_ssl = true
pytest = "*"
pytest-asyncio = "*"
pytest-cov = "*"
flake8 = "*"
flake8-docstrings = "*"
flake8-quotes = "*"
flake8-annotations = "*"
flake8-import-order = "*"
pre-commit = "*"
variation-normalizer = {editable = true, path = "."}
jupyter = "*"
ipykernel = "*"
psycopg2-binary = "*"
ruff = "*"
black = "*"

[packages]
"biocommons.seqrepo" = "*"
@@ -28,4 +25,4 @@ gene-normalizer = "~=0.1.36"
pyliftover = "*"
boto3 = "*"
"ga4gh.vrsatile.pydantic" = "~=0.0.13"
cool-seq-tool = ">=0.1.13"
cool-seq-tool = ">=0.1.14.dev0"
71 changes: 45 additions & 26 deletions README.md
@@ -1,28 +1,30 @@
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5894937.svg)](https://doi.org/10.5281/zenodo.5894937)

# Variation Normalization

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5894937.svg)](https://doi.org/10.5281/zenodo.5894937)

Services and guidelines for normalizing variation terms into [VRS](https://vrs.ga4gh.org/en/latest) and [VRSATILE](https://vrsatile.readthedocs.io/en/latest/) compatible representations.

Public OpenAPI endpoint: https://normalize.cancervariants.org/variation
Public OpenAPI endpoint: <https://normalize.cancervariants.org/variation>

Installing with pip:

```commandline
```shell
pip install variation-normalizer
```

The variation-normalization repo depends on VRS and VRSATILE models, and therefore each variation-normalizer package on PyPI uses a particular version of VRS and VRSATILE. The correspondences between packages may be summarized as:

| variation-normalization branch | variation-normalizer version | gene-normalizer version | ga4gh.vrsatile.pydantic version | VRS version | VRSATILE version |
| ---- | --- | ---- | --- | --- | --- |
| [main](https://github.com/cancervariants/variation-normalization/tree/main) | 0.5.X | 0.1.X | 0.0.X | [1.X.X](https://github.com/ga4gh/vrs) | [main](https://github.com/ga4gh/vrsatile/tree/main)
| [main](https://github.com/cancervariants/variation-normalization/tree/main) | 0.6.X | 0.1.X | 0.0.X | [1.X.X](https://github.com/ga4gh/vrs) | [main](https://github.com/ga4gh/vrsatile/tree/main)
| [staging](https://github.com/cancervariants/variation-normalization/tree/staging) | 0.7.X | 0.2.X | 0.1.X | [metaschema-update](https://github.com/ga4gh/vrs/tree/metaschema-update) | [metaschema-update](https://github.com/ga4gh/vrsatile/tree/metaschema-update)

## About

Variation Normalization works by using four main steps: tokenization, classification, validation, and translation. During tokenization, we split strings on whitespace and parse to determine the type of token. During classification, we specify the order of tokens a classification can have. We then do validation checks such as ensuring references for a nucleotide or amino acid matches the expected value and validating a position exists on the given transcript. During translation, we return a VRS Allele object.
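
The four steps above can be illustrated with a toy pipeline. This is only a sketch under stated assumptions: the `Token` class, the function names, and the `KNOWN_GENES` and `REFERENCE_RESIDUES` lookup tables are hypothetical stand-ins, not the classes or data sources used by variation-normalizer, and the final step returns a plain dict rather than a VRS Allele.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical token type; not the library's own token classes."""

    text: str
    kind: str  # "gene", "protein_substitution", or "unknown"


# Illustrative stand-ins for a gene lookup and a reference sequence lookup.
KNOWN_GENES = {"BRAF"}
REFERENCE_RESIDUES = {("BRAF", 600): "V"}

# Matches single-letter protein substitutions such as V600E.
PROTEIN_SUB = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def tokenize(query: str) -> List[Token]:
    """Split on whitespace and assign each piece a token kind."""
    tokens = []
    for part in query.split():
        if part.upper() in KNOWN_GENES:
            tokens.append(Token(part.upper(), "gene"))
        elif PROTEIN_SUB.match(part):
            tokens.append(Token(part, "protein_substitution"))
        else:
            tokens.append(Token(part, "unknown"))
    return tokens


def classify(tokens: List[Token]) -> Optional[str]:
    """A classification is defined by the order of token kinds."""
    if [t.kind for t in tokens] == ["gene", "protein_substitution"]:
        return "protein_substitution"
    return None


def validate(tokens: List[Token]) -> bool:
    """Check that the stated reference residue matches the expected one."""
    gene = tokens[0].text
    ref, pos, _alt = PROTEIN_SUB.match(tokens[1].text).groups()
    return REFERENCE_RESIDUES.get((gene, int(pos))) == ref


def translate(tokens: List[Token]) -> dict:
    """Build a result; the real service returns a VRS Allele instead."""
    ref, pos, alt = PROTEIN_SUB.match(tokens[1].text).groups()
    return {"gene": tokens[0].text, "position": int(pos), "ref": ref, "alt": alt}


def normalize_sketch(query: str) -> Optional[dict]:
    """Run tokenization, classification, validation, and translation in order."""
    tokens = tokenize(query)
    if classify(tokens) is None or not validate(tokens):
        return None
    return translate(tokens)
```

In this sketch, `normalize_sketch("BRAF V600E")` passes all four steps, while a query whose reference residue disagrees with the lookup table (for example `BRAF A600E`) is rejected at validation.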

Variation Normalization is limited to the following types of variants:

* HGVS expressions and text representations (ex: `BRAF V600E`):
* **protein (p.)**: substitution, deletion, insertion, deletion-insertion
* **coding DNA (c.)**: substitution, deletion, insertion, deletion-insertion
@@ -36,14 +38,21 @@ We are working towards adding more types of variations, coordinates, and represe

### Endpoints

The `/to_vrs` endpoint returns a list of validated VRS [Variations](https://vrs.ga4gh.org/en/1.2.0/terms_and_model.html#variation).
#### `/to_vrs`

Returns a list of validated VRS [Variations](https://vrs.ga4gh.org/en/stable/terms_and_model.html#variation).

#### `/normalize`

The `/normalize` endpoint returns a [Variation Descriptor](https://vrsatile.readthedocs.io/en/latest/value_object_descriptor/vod_index.html#variation-descriptor) containing the MANE Transcript, if one is found. If a genomic query is not given a gene, `normalize` will return its GRCh38 representation. Variation Normalizer relies on [**C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool (cool-seq-tool)](https://github.com/GenomicMedLab/cool-seq-tool) for retrieving MANE Transcript data. More information on the transcript selection algorithm can be found [here](https://github.com/GenomicMedLab/cool-seq-tool/blob/main/docs/TranscriptSelectionPriority.md).
Returns a [Variation Descriptor](https://vrsatile.readthedocs.io/en/latest/value_object_descriptor/vod_index.html#variation-descriptor) aligned to the prioritized transcript. The Variation Normalizer relies on [**C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool (cool-seq-tool)](https://github.com/GenomicMedLab/cool-seq-tool) for retrieving the prioritized transcript data. More information on the transcript selection algorithm can be found [here](https://github.com/GenomicMedLab/cool-seq-tool/blob/main/docs/TranscriptSelectionPriority.md).

If a genomic variation query _is_ given a gene (e.g. `BRAF g.140753336A>T`), the associated cDNA representation will be returned, because the gene provides additional strand context. If a genomic variation query is _not_ given a gene, the GRCh38 representation will be returned.
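
As a rough illustration of querying this endpoint, the sketch below only builds the request URL and makes no network call. It assumes the endpoint is served at `<base>/normalize` and takes the query string in a `q` parameter; both are assumptions to confirm against the OpenAPI docs, and `build_normalize_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Public base URL as given in this README.
BASE_URL = "https://normalize.cancervariants.org/variation"


def build_normalize_url(query: str, base_url: str = BASE_URL) -> str:
    """Build a URL for a normalize request.

    Assumption: the endpoint lives at <base_url>/normalize and reads the
    variation from a `q` query parameter.
    """
    # urlencode percent-encodes characters such as ">" and turns spaces into "+".
    return f"{base_url}/normalize?{urlencode({'q': query})}"


# A genomic query that includes a gene, as in the example above.
url = build_normalize_url("BRAF g.140753336A>T")
```

Encoding the query matters here because HGVS-style strings contain characters such as `>` and spaces that are not safe in a URL.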

## Developer Instructions

Clone the repo:
```

```shell
git clone https://github.com/cancervariants/variation-normalization.git
cd variation-normalization
```
@@ -54,10 +63,9 @@ for direction on installing pipenv in your compute environment.

Once installed, from the project root dir, just run:

```commandline
```shell
pipenv shell
pipenv lock && pipenv sync
pipenv install --dev
pipenv update && pipenv install --dev
```

### Backend Services
@@ -73,74 +81,83 @@ You must also have Gene Normalization's DynamoDB running in a separate terminal
For more information about the gene-normalizer and how to load the database, visit the [README](https://github.com/cancervariants/gene-normalization/blob/main/README.md).

#### SeqRepo

Variation Normalization relies on [seqrepo](https://github.com/biocommons/biocommons.seqrepo), which you must download yourself.

Variation Normalizer uses seqrepo to retrieve sequences at given positions on a transcript.

From the _root_ directory:
```

```shell
pip install seqrepo
sudo mkdir /usr/local/share/seqrepo
sudo chown $USER /usr/local/share/seqrepo
seqrepo pull -i 2021-01-29 # Replace with latest version using `seqrepo list-remote-instances` if outdated
```

If you get an error similar to the one below:
```

```shell
PermissionError: [Error 13] Permission denied: '/usr/local/share/seqrepo/2021-01-29._fkuefgd' -> '/usr/local/share/seqrepo/2021-01-29'
```

You will want to do the following:\
(*Might not be ._fkuefgd, so replace with your error message path*)
```console

```shell
sudo mv /usr/local/share/seqrepo/2021-01-29._fkuefgd /usr/local/share/seqrepo/2021-01-29
exit
```

Use the `SEQREPO_ROOT_DIR` environment variable to set the path of an already existing SeqRepo directory. The default is `/usr/local/share/seqrepo/latest`.
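
The lookup order described above (environment variable first, then the default path) can be sketched as follows. `resolve_seqrepo_root` is a hypothetical helper written for illustration and is not necessarily how the package reads the variable.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location stated in this README.
DEFAULT_SEQREPO_ROOT = "/usr/local/share/seqrepo/latest"


def resolve_seqrepo_root(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the SeqRepo directory, preferring SEQREPO_ROOT_DIR when set.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    return Path(env.get("SEQREPO_ROOT_DIR", DEFAULT_SEQREPO_ROOT))
```

Calling `resolve_seqrepo_root()` with no arguments reads the real environment, so exporting `SEQREPO_ROOT_DIR` in the shell before starting the service is enough to point at an existing SeqRepo directory.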

#### UTA

Variation Normalizer also uses [**C**ommon **O**perations **O**n **L**ots-of **Seq**uences Tool (cool-seq-tool)](https://github.com/GenomicMedLab/cool-seq-tool) which uses [UTA](https://github.com/biocommons/uta) as the underlying PostgreSQL database.

_The following commands will likely need modification appropriate for the installation environment._

1. Install [PostgreSQL](https://www.postgresql.org/)
2. Create user and database.

```
$ createuser -U postgres uta_admin
$ createuser -U postgres anonymous
$ createdb -U postgres -O uta_admin uta
```shell
createuser -U postgres uta_admin
createuser -U postgres anonymous
createdb -U postgres -O uta_admin uta
```

3. To install locally, from the _variation/data_ directory:
```

```shell
export UTA_VERSION=uta_20210129.pgd.gz
curl -O http://dl.biocommons.org/uta/$UTA_VERSION
gzip -cdq ${UTA_VERSION} | grep -v "^REFRESH MATERIALIZED VIEW" | psql -h localhost -U uta_admin --echo-errors --single-transaction -v ON_ERROR_STOP=1 -d uta -p 5433
```

##### UTA Installation Issues

If you have trouble installing UTA, you can visit [these two READMEs](https://github.com/ga4gh/vrs-python/tree/main/docs/setup_help).

##### Connecting to the UTA database
To connect to the UTA database, you can use the default url (`postgresql://uta_admin@localhost:5433/uta/uta_20210129`). If you use the default url, you must either set the password using environment variable `UTA_PASSWORD` or setting the parameter `db_pwd` in the UTA class.

If you do not wish to use the default, you must set the environment variable `UTA_DB_URL` which has the format of `driver://user:pass@host:port/database/schema`.
To connect to the UTA database, you can use the default url (`postgresql://uta_admin@localhost:5433/uta/uta_20210129`). If you do not wish to use the default, you must set the environment variable `UTA_DB_URL` which has the format of `driver://user:pass@host:port/database/schema`.
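
To make the `driver://user:pass@host:port/database/schema` format concrete, here is a minimal sketch that splits such a URL into its parts using only the standard library. `parse_uta_db_url` is a hypothetical helper, not a function provided by cool-seq-tool or this package.

```python
from urllib.parse import urlsplit


def parse_uta_db_url(url: str) -> dict:
    """Split a UTA_DB_URL-style string into its components.

    Expected shape: driver://user:pass@host:port/database/schema
    """
    parts = urlsplit(url)
    # The path must hold exactly two segments: the database and the schema.
    path_parts = parts.path.strip("/").split("/")
    if len(path_parts) != 2 or not all(path_parts):
        raise ValueError("expected a path of the form /database/schema")
    database, schema = path_parts
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "password": parts.password,  # None when the URL carries no password
        "host": parts.hostname,
        "port": parts.port,
        "database": database,
        "schema": schema,
    }


# The default URL from this README has no password component.
default = parse_uta_db_url("postgresql://uta_admin@localhost:5433/uta/uta_20210129")
```

Parsing the default URL this way yields the `uta` database and the `uta_20210129` schema on port 5433, with `password` left as `None`.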

## Starting the Variation Normalization Service Locally

`gene-normalizer`'s DynamoDB and the `uta` database must be running.

To start the service, run the following:

```commandline
```shell
uvicorn variation.main:app --reload
```

Next, view the OpenAPI docs on your local machine:
http://127.0.0.1:8000/variation
<http://127.0.0.1:8000/variation>

### Init coding style tests
Code style is managed by [flake8](https://github.com/PyCQA/flake8) and checked prior to commit.

Code style is managed by [Ruff](https://github.com/astral-sh/ruff) and checked prior to commit.

We use [pre-commit](https://pre-commit.com/#usage) to run conformance tests.

@@ -153,12 +170,14 @@ This ensures:

Before first commit run:

```
```shell
pre-commit install
```

### Testing

From the _root_ directory of the repository:
```

```shell
pytest tests/
```