Skip to content

Commit

Permalink
Merge pull request #221 from EGA-archive/EE-2610_autocorrect
Browse files Browse the repository at this point in the history
EE-2610 Infer or autocorrect format in the genomic range cli args
  • Loading branch information
Alegria Aclan authored Apr 3, 2024
2 parents 943f932 + 74f8c1d commit 7204409
Show file tree
Hide file tree
Showing 6 changed files with 172 additions and 13 deletions.
10 changes: 10 additions & 0 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ stages:
- test
- build
- functional_test
- pre-release
- release
# Change pip's cache directory to be inside the project directory since we can
# only cache local items.
Expand Down Expand Up @@ -82,6 +83,15 @@ functional_tests:
- pip install "dist/pyega3-$PYEGA3_VERSION.tar.gz"
- pytest tests/functional

check-version:
stage: pre-release
script:
- 'TAG_VERSION=$(git describe --tags --abbrev=0)'
- 'FILE_VERSION=$(cat pyega3/VERSION)'
- 'if [ "$TAG_VERSION" != "v$FILE_VERSION" ]; then echo "Version mismatch! Tag: $TAG_VERSION, VERSION file: $FILE_VERSION"; exit 1; fi'
only:
- tags

release:
stage: release
when: manual
Expand Down
28 changes: 21 additions & 7 deletions development.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,10 @@
# Development
The repo is maintained by the EGA development team. The client connects to the file-distribution download API which serves the files.

The repo is maintained by the EGA development team. The client connects to the file-distribution download API which
serves the files.

## Update the client

1. Clone the repo in a local workspace
2. Create a feature branch
3. Add changes & tests. To run pyega3 command locally:
Expand All @@ -12,12 +15,15 @@ The repo is maintained by the EGA development team. The client connects to the f
5. Update the VERSION file with the new target version
6. Create a PR against master
7. Merge PR
8. Push a tag. The tag must match the version in the VERSION file (there is an internal JIRA ticket EE-2809 filed to ensure this in the future). The tag push will then trigger the GitLab pipeline which will run tests and release to PyPI which is manually trigger at the moment to have more control on when to release.
8. Push a tag. The tag must match the version in the VERSION file (the check-version job in pre-release stage checks
this). The tag push will then trigger the GitLab pipeline which will run tests and release to PyPI which is manually
trigger at the moment to have more control on when to release.
9. Release to PyPI via manual trigger in GitLab https://gitlab.ebi.ac.uk/EGA/ega-download-client/-/pipelines
10. Verify the new version is published in PyPI https://pypi.org/project/pyega3/
11. Add a release note to describe the changes from last release

## Release a new version to PyPI

```mermaid
graph LR
A[Update VERSION file] --> B[Push tag] --> C[ Run Unit tests]
Expand All @@ -30,12 +36,20 @@ subgraph GitLab pipeline
E --> F["Release to PyPI <br> (manual trigger)"]
end
```
The GitLab pipeline pipeline covers running unit tests, functional tests to be run automatically after a commit is tagged for release. It also provides
the ability to download the python package (within a week) for manual testing. Manual testing can test other aspects of the download client like usability and the test cases which can’t be automated.

The test cases covered in functional tests are detailed in the [pyega3 functional test plan](https://docs.google.com/spreadsheets/d/1kMLWBGDrsf3f98DdArOc0HXG-vsvMLJ5MjKk-aJfKGw/edit#gid=0). This document is in a Shared Google drive and only the members of the EGA team have access to it.
The GitLab pipeline pipeline covers running unit tests, functional tests to be run automatically after a commit is
tagged for release. It also provides the ability to download the python package (within a week) for manual testing.
Manual testing can test other aspects of the download client like usability and the test cases which can’t be automated.

The test cases covered in functional tests are detailed in
the [pyega3 functional test plan](https://docs.google.com/spreadsheets/d/1kMLWBGDrsf3f98DdArOc0HXG-vsvMLJ5MjKk-aJfKGw/edit#gid=0).
This document is in a Shared Google drive and only the members of the EGA team have access to it.

After all testing is done, the team (or the main developer + tester) can decide if the tagged commit is good for release. The build allows the release stage to be manually triggered to proceed with the release to PyPI. There is no need to perform the tests after the release because it is the same package that was tested (automated and manual) and released. This flow ensures that the package is tested and functional before its release.
After all testing is done, the team (or the main developer + tester) can decide if the tagged commit is good for
release. The build allows the release stage to be manually triggered to proceed with the release to PyPI. There is no
need to perform the tests after the release because it is the same package that was tested (automated and manual) and
released. This flow ensures that the package is tested and functional before its release.

The functional tests stage also allows failure in case we want to override the process and release even if the functional tests in GitLab are failing.
The functional tests stage also allows failure in case we want to override the process and release even if the
functional tests in GitLab are failing.

2 changes: 1 addition & 1 deletion pyega3/VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
5.1.0
5.2.0
14 changes: 9 additions & 5 deletions pyega3/libs/data_file.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,12 @@
from pyega3.libs.error import DataFileError, SliceError, MD5MismatchError, MaxRetriesReachedError
from pyega3.libs.stats import Stats

from pyega3.libs.file_format import is_bam_or_cram_file, autocorrect_format_in_genomic_range_args

DOWNLOAD_FILE_MEMORY_BUFFER_SIZE = 32 * 1024

SUPPORTED_FILE_FORMATS = ["BAM", "CRAM", "VCF", "BCF"]


class DataFile:
DEFAULT_SLICE_SIZE = 100 * 1024 * 1024
Expand Down Expand Up @@ -292,7 +296,10 @@ def download_file_retry(self, num_connections, output_dir, genomic_range_args, m
if self.does_file_exist(output_file):
DataFile.print_local_file_info('Local file exists:', output_file, self.unencrypted_checksum)
elif DataFile.is_genomic_range(genomic_range_args):
self._download_htsget_slice(genomic_range_args, max_retries, output_file, retry_wait)
corrected_genomic_range_args = autocorrect_format_in_genomic_range_args(self.name,
genomic_range_args,
SUPPORTED_FILE_FORMATS)
self._download_htsget_slice(corrected_genomic_range_args, max_retries, output_file, retry_wait)
else:
stats_list = self._download_whole_file(max_retries, max_slice_size, num_connections, output_file,
retry_wait, temporary_directory)
Expand All @@ -314,7 +321,7 @@ def _download_htsget_slice(self, genomic_range_args, max_retries, output_file, r
if self.data_client.api_version == 1:
endpoint_type = "files"
else:
endpoint_type = "htsget/reads" if self.is_bam_or_cram_file(self.name) else "htsget/variants"
endpoint_type = "htsget/reads" if is_bam_or_cram_file(self.name) else "htsget/variants"
with open(output_file, 'wb') as output:
htsget.get(
f"{self.data_client.htsget_url}/{endpoint_type}/{self.id}",
Expand Down Expand Up @@ -386,9 +393,6 @@ def _format_stats_error_reason(self, e):
error_reason = e.message
return error_class_name, error_reason

def is_bam_or_cram_file(self, name: str):
return re.search("\.bam", name, re.IGNORECASE) or re.search("\.cram", name, re.IGNORECASE)

def delete_temporary_folder(self, temporary_directory):
try:
shutil.rmtree(temporary_directory)
Expand Down
43 changes: 43 additions & 0 deletions pyega3/libs/file_format.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
import logging
import re


def autocorrect_format_in_genomic_range_args(name: str, genomic_range_args: tuple, possible_format_list: list) -> tuple:
file_format_from_user = genomic_range_args[4] if len(genomic_range_args) == 5 else None
detected_file_format = detect_file_format(name, possible_format_list)
if file_format_from_user and detected_file_format and file_format_from_user != detected_file_format:
logging.warning(
f"Warning: The specified format {file_format_from_user} does not match the detected format in the "
f"file name, {name} , detected format: {detected_file_format}. The detected format will be used "
f"since transcoding is not yet supported by the file distribution service.")

if not file_format_from_user and not detected_file_format:
logging.warning(
"Warning: No file format was specified nor detected. The file distribution service will use 'BAM' as "
"the default format. If you require a different format, please specify it using the '--format' option,"
"followed by the desired format (e.g., '--format CRAM'). For a list of supported formats, "
"use the '--help' option.")

genomic_range_args_list = list(genomic_range_args)
genomic_range_args_list[4] = detected_file_format if detected_file_format else file_format_from_user
updated_genomic_range_args = tuple(genomic_range_args_list)
return updated_genomic_range_args


def is_bam_or_cram_file(filename: str):
return search_format_in_filename("bam", filename) or search_format_in_filename("cram", filename)


def search_format_in_filename(file_format: str, filename: str):
return re.search(f"\.{file_format}", filename, re.IGNORECASE)


def detect_file_format(filename, possible_format_list):
detected_file_format = None

for file_format in possible_format_list:
if search_format_in_filename(file_format.lower(), filename):
detected_file_format = file_format
break

return detected_file_format
88 changes: 88 additions & 0 deletions tests/unit/test_file_format.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,88 @@
from unittest.mock import patch

import pytest
from pyega3.libs.file_format import autocorrect_format_in_genomic_range_args, is_bam_or_cram_file


@pytest.fixture
def mock_warning():
with patch('logging.warning') as mock_warning:
yield mock_warning


def test_format_matches_detected(mock_warning):
name = "example.bam.cip"
genomic_range_args = ("chr1", 100, 200, "reference", "BAM")
format_list = ["BAM", "CRAM"]

result = autocorrect_format_in_genomic_range_args(name, genomic_range_args, format_list)

assert result == genomic_range_args
mock_warning.assert_not_called()


def test_format_does_not_match_detected(mock_warning):
name = "example.cram.cip"
genomic_range_args = ("chr1", 100, 200, "reference", "BAM")
format_list = ["BAM", "CRAM"]

result = autocorrect_format_in_genomic_range_args(name, genomic_range_args, format_list)

expected_result = ("chr1", 100, 200, "reference", "CRAM")
assert result == expected_result
mock_warning.assert_called_once_with("Warning: The specified format BAM does not match the detected format in "
"the file name, example.cram.cip , detected format: CRAM. The detected "
"format will be used since transcoding is not yet supported by the file "
"distribution service.")


def test_no_format_specified_and_not_detected(mock_warning):
name = "example.unknown.cip"
genomic_range_args = ("chr1", 100, 200, "reference", None)
format_list = ["BAM", "CRAM"]

result = autocorrect_format_in_genomic_range_args(name, genomic_range_args, format_list)

assert result == genomic_range_args
mock_warning.assert_called_once_with(
"Warning: No file format was specified nor detected. The file distribution service will use 'BAM' as the "
"default format. If you require a different format, please specify it using the '--format' option,"
"followed by the desired format (e.g., '--format CRAM'). For a list of supported formats, "
"use the '--help' option.")


def test_no_format_specified_but_detected(mock_warning):
name = "example.cram.cip"
genomic_range_args = ("chr1", 100, 200, "reference", None)
format_list = ["BAM", "CRAM"]

result = autocorrect_format_in_genomic_range_args(name, genomic_range_args, format_list)

expected_result = ("chr1", 100, 200, "reference", "CRAM")
assert result == expected_result
mock_warning.assert_not_called()


def test_is_bam_or_cram_file_returns_true():
name = "example.bam"
assert is_bam_or_cram_file(name)

name = "example.cram"
assert is_bam_or_cram_file(name)

name = "example.cram.cip"
assert is_bam_or_cram_file(name)

name = "example.2.bam.cip"
assert is_bam_or_cram_file(name)

name = "example.bam.cip"
assert is_bam_or_cram_file(name)


def test_is_bam_or_cram_file_returns_false():
name = "example.vcf.cip"
assert not is_bam_or_cram_file(name)

name = "example.txt"
assert not is_bam_or_cram_file(name)

0 comments on commit 7204409

Please sign in to comment.