First release review tweaks #34

Merged 2 commits on Apr 26, 2021
2 changes: 0 additions & 2 deletions .github/workflows/ci.yml
@@ -17,8 +17,6 @@ jobs:
  env:
  NXF_VER: ${{ matrix.nxf_ver }}
  NXF_ANSI_LOG: false
- COSMIC_USERNAME: ${{ secrets.COSMIC_USERNAME }}
- COSMIC_PASSWORD: ${{ secrets.COSMIC_PASSWORD }}

  strategy:
  matrix:
30 changes: 17 additions & 13 deletions docs/output.md
@@ -20,30 +20,34 @@ The main source of canonical protein sequence in pgdb is ENSEMBL. The user can t
  * [Decoy](#decoys) - Add decoy proteins to the final database.
  * [Output](#output) - Output results including clean databases and decoy generation

- ## Ensembl
+ ## Pipeline modes

- The pipeline will download the the ENSEMBL protein reference proteome, this will be added to the final protein database. The protein databae is downloaded from [ENSEMBL FTP](http://www.ensembl.org/info/data/ftp/index.html)
+ ### Ensembl

- ## Ensembl non canonical
+ The pipeline will download the the ENSEMBL protein reference proteome, this will be added to the final protein database. The protein database is downloaded from [ENSEMBL FTP](http://www.ensembl.org/info/data/ftp/index.html).

- The Ensembl non canonical includes the pseudogenes, lncRNAs, etc. The accessions of each type of kind of novel protein is predefined by the [pypgatk tool](https://github.com/bigbio/py-pgatk)
+ ### Ensembl non canonical

- * ncRNA_ENST00000456688 - non coding RNA transcript.
- * altorf_ENST00000310473 - alternative open reading frame
- * pseudo_ENST00000436135 - pseudo gene translation
+ The Ensembl non canonical includes the pseudogenes, lncRNAs, etc. The accessions of each type of kind of novel protein is predefined by the [pypgatk tool](https://github.com/bigbio/py-pgatk).

- ## Variants
+ * `ncRNA_ENST00000456688` - non coding RNA transcript.
+ * `altorf_ENST00000310473` - alternative open reading frame
+ * `pseudo_ENST00000436135` - pseudo gene translation

+ ### Variants

  The COSMIC or cBioPortal variants are downloaded automatically from these resources. The accessions of those proteins are:

- * COSMIC:ANXA3_ENST00000503570:p.A67T:Substitution-Missense - Accession of the protein includes the position of the aminoacid variant.
+ * `COSMIC:ANXA3_ENST00000503570:p.A67T:Substitution-Missense` - Accession of the protein includes the position of the aminoacid variant.

- ## Decoy
+ ### Decoy

  Decoy can be added to the final database. Decoys accessions are prefix with `DECOY_` by default, but they can be configured by the users.

- ## Output
+ ## Output files

  The nf-core/pgdb pipeline produces one single output file:

- /fasta_database.fa
+ * `/fasta_database.fa`

- The FASTA database including all the protein sequences including the reference proteomes, variants, pseudo-genes, etc.
+ This FASTA database includes all of the protein sequences including the reference proteomes, variants, pseudo-genes, etc.
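
Reviewer note: the accession conventions documented in `docs/output.md` could be consumed downstream roughly as sketched below. This is a minimal illustration, not code from pgdb or pypgatk. The function names, the category labels, and the assumption that the decoy prefix is placed in front of the full accession are all hypothetical; only the prefixes and the example accessions come from the documentation in this diff.

```python
import re

# Prefixes copied from the accession examples in docs/output.md.
# The labels on the right are illustrative names, not pgdb terminology.
PREFIX_LABELS = {
    "ncRNA_": "non-coding RNA",
    "altorf_": "alternative ORF",
    "pseudo_": "pseudogene",
    "COSMIC:": "COSMIC variant",
}


def classify_accession(accession, decoy_prefix="DECOY_"):
    """Return (is_decoy, label) for one accession string.

    Assumes the decoy prefix, when present, comes before any other prefix.
    """
    is_decoy = accession.startswith(decoy_prefix)
    core = accession[len(decoy_prefix):] if is_decoy else accession
    for prefix, label in PREFIX_LABELS.items():
        if core.startswith(prefix):
            return is_decoy, label
    return is_decoy, "other"


def parse_cosmic_accession(accession):
    """Split an accession shaped like the documented COSMIC example.

    Raises ValueError if the accession does not have four ':' separated
    fields. The position is None unless the change is a simple
    single-residue substitution such as 'p.A67T'.
    """
    source, gene_transcript, change, kind = accession.split(":", 3)
    gene, transcript = gene_transcript.rsplit("_", 1)
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", change)
    position = int(match.group(2)) if match else None
    return {
        "source": source,
        "gene": gene,
        "transcript": transcript,
        "change": change,
        "position": position,
        "kind": kind,
    }


if __name__ == "__main__":
    print(classify_accession("ncRNA_ENST00000456688"))
    print(parse_cosmic_accession(
        "COSMIC:ANXA3_ENST00000503570:p.A67T:Substitution-Missense"))
```

Since the decoy prefix is user-configurable, it is passed as a parameter rather than hard-coded.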