Update to latest nf-core v2 version; incorporating GTDB #370

Open · wants to merge 4 commits into dev

Conversation

@chasemc (Member) commented Dec 17, 2024

Old and new work on updating to the latest nf-core v2 standards, along with incorporating the GTDB code that hadn't yet been added to the Nextflow workflow.

@chasemc (Member Author) commented Dec 17, 2024

The following was used to test version d6706c6


# The combined output directory for both samples in both runs (NCBI and NCBI-GTDB) ends at ~293 GB
# The single_db_dir ends at 758 GB (could be reduced by 87 GB by updating the workflow to delete nr.gz after the Diamond database is created)
# The work directory ends at 147 GB for the NCBI-only run

git clone git@github.com:KwanLab/Autometa.git
cd Autometa
git switch new-nf

# make sure we build the container from the checked out branch for the workflow to use
docker build . -t jasonkwan/autometa:`git branch --show-current`

# create the output and database directories and the sample csv
example_dir="/home/chase/autometa_test"
mkdir -p $example_dir $example_dir/database_directory $example_dir/output

sample_sheet="$example_dir/autometa_test_samplesheet.csv"
echo "sample,assembly,fastq_1,fastq_2,coverage_tab,cov_from_assembly" > $sample_sheet
echo "78mbp,/media/bigdrive1/autometa_test_data/78Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/78Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/78Mbp/reverse_reads.fastq.gz,,0" >> $sample_sheet
echo "625Mbp,/media/bigdrive1/autometa_test_data/625Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/625Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/625Mbp/reverse_reads.fastq.gz,,0" >> $sample_sheet

# edit the resources for the workflow to use
echo '''
process {
  withLabel:process_low {
    cpus   = { 1 * task.attempt }
    memory = { 14.GB * task.attempt }
    time   = { 24.h  * task.attempt }
  }
  withLabel:process_medium {
    cpus   = { 12  * task.attempt }
    memory = { 42.GB * task.attempt }
    time   = { 24.h * task.attempt }
  }
  withLabel:process_high {
    cpus   = { 36 * task.attempt }
    memory = { 200.GB * task.attempt }
    time   = { 48.h * task.attempt }
  }
}
''' > $example_dir/nextflow.config

# run the full workflow + GTDB refinement 
nextflow run . \
    -profile docker  \
    --input $sample_sheet \
    --taxonomy_aware \
    --outdir ${example_dir}/output \
    --single_db_dir /media/BRIANDATA3/autometa_test \
    --autometa_image_tag 'new-nf' \
    --use_gtdb \
    --gtdb_version '220' \
    --large_downloads_permission \
    --max_memory '900.GB' \
    --max_cpus 90 \
    --max_time '20040.h' \
    -c $example_dir/nextflow.config \
    -w /media/BRIANDATA3/temp \
    -resume

# run the full workflow without GTDB refinement 
nextflow run . \
    -profile docker,slurm \
    --input $sample_sheet \
    --taxonomy_aware \
    --outdir ${example_dir}/output_ncbi_only \
    --single_db_dir /media/BRIANDATA3/autometa_test \
    --autometa_image_tag 'new-nf' \
    --large_downloads_permission \
    --max_memory '900.GB' \
    --max_cpus 90 \
    --max_time '20040.h' \
    -c $example_dir/nextflow.config \
    -w /media/BRIANDATA3/temp \
    -resume


# rm -rf $example_dir/output
# rm -rf /media/BRIANDATA3/autometa_test
# rm -rf /media/BRIANDATA3/temp
# rm -rf '/home/chase/tempauto'

@chasemc requested a review from jason-c-kwan on December 17, 2024 at 00:20
@jason-c-kwan (Collaborator)

I keep getting errors that it can't find the new-nf Docker image, even though I have built it locally on deep thought and even tried pushing it to Docker Hub. Have you any insight into why that is happening?

@chasemc (Member Author) commented Dec 23, 2024

To use the local version, you now have to set the registry to an empty string, e.g.

# edit the resources for the workflow to use
echo '''
process {
  withLabel:process_low {
    cpus   = { 1 * task.attempt }
    memory = { 14.GB * task.attempt }
    time   = { 24.h  * task.attempt }
  }
  withLabel:process_medium {
    cpus   = { 12  * task.attempt }
    memory = { 42.GB * task.attempt }
    time   = { 24.h * task.attempt }
  }
  withLabel:process_high {
    cpus   = { 36 * task.attempt }
    memory = { 200.GB * task.attempt }
    time   = { 48.h * task.attempt }
  }
}
docker.registry = ""
''' > $example_dir/nextflow.config

But either works right now on the server
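A quick way to sanity-check what Nextflow will resolve once the registry prefix is dropped (just a sketch; the tag is assumed to still be new-nf):

docker images jasonkwan/autometa
# the locally built image should show up with TAG new-nf; if it doesn't, rebuild it with
# docker build . -t jasonkwan/autometa:new-nf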

@jason-c-kwan (Collaborator)

I was running on the server before, so not having docker.registry = "" in the nextflow.config file doesn't seem to work for me. With that line it appears to be running, although I am getting these errors in the output:

ERROR ~ Error executing process > 'AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS (Preparing db cache for gtdb)'

Caused by:
  Process `AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS (Preparing db cache for gtdb)` terminated with an error exit status (1)


Command executed:

  # https://autometa.readthedocs.io/en/latest/scripts/taxonomy/lca.html
  autometa-taxonomy-lca \
      --blast . \
      --lca-output . \
      --dbdir . \
      --dbtype gtdb \
      --cache cache \
      --only-prepare-cache

  cat <<-END_VERSIONS > versions.yml
  "AUTOMETA:TAXONOMY_WORKFLOW:GTDB_REFINEMENT:TAXON_SPLIT:LCA:PREP_DBS":
      autometa: $(autometa --version | sed -e 's/autometa: //g')
  END_VERSIONS

Command exit status:
  1


  Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  Traceback (most recent call last):
    File "/opt/conda/bin/autometa-taxonomy-lca", line 8, in <module>
      sys.exit(main())
               ^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/lca.py", line 698, in main
      taxonomy_db = GTDB(args.dbdir)
                    ^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/gtdb.py", line 67, in __init__
      self.names = self.parse_names()
                   ^^^^^^^^^^^^^^^^^^
    File "/opt/conda/lib/python3.12/site-packages/autometa/taxonomy/gtdb.py", line 180, in parse_names
      fh = open(self.names_fpath)
           ^^^^^^^^^^^^^^^^^^^^^^
  PermissionError: [Errno 13] Permission denied: './names.dmp'

Work dir:
  /media/BRIANDATA3/temp/a9/624e869a6c5610e906b5e1e66413e0

Container:
  jasonkwan/autometa:new-nf

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

I tried to make the /media/BRIANDATA3/autometa_test directory readable/writable by all users, but I still got the same error messages. It does appear to carry on running despite this, though.

@jason-c-kwan (Collaborator)

Update: it ended after about an hour, so this is preventing it from running. I tried running it without pointing to the existing database files, and I got this:

ERROR ~ No such variable: out_ch

 -- Check script 'Autometa/./workflows/../subworkflows/local/./././prepare_nr.nf' at line: 133 or see '.nextflow.log' file for more details

@chasemc (Member Author) commented Dec 23, 2024

That drive has odd group permissions and those files were all assigned to the "storage" group. I chowned the directory just now to chase:chase, but if that doesn't work I would just try another drive.

@chasemc (Member Author) commented Dec 23, 2024

i.e. it seems to be a system-level file permission issue, not a workflow issue
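A rough sketch of how to check (paths taken from the run above; chase:chase is just the ownership mentioned here, adjust as needed):

# show who owns the database files that get bind-mounted into the container
ls -ld /media/BRIANDATA3/autometa_test
ls -lR /media/BRIANDATA3/autometa_test | head -n 20
# if they belong to another user/group, reassign them and make them world-readable
sudo chown -R chase:chase /media/BRIANDATA3/autometa_test
sudo chmod -R a+rX /media/BRIANDATA3/autometa_test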

@jason-c-kwan (Collaborator)

OK, I think I fixed the permissions issue, but I didn't realize that above the message about out_ch there was this message:

 Neither nr.dmnd or nr.gz were found and `--large_downloads_permission` is set to false.

Not totally sure why it is not using the stuff that is already there, but I would like to just try allowing it to download new databases. I tried adding --large_downloads_permission to the main.nf command in my submit script, but I got the same result. Is there something I am missing about how to pass this option to the workflow?

@chasemc (Member Author) commented Jan 6, 2025

Can you provide the full commands you are using?

@jason-c-kwan (Collaborator)

This is my current submit script:

#!/bin/bash
#SBATCH --partition=queue
#SBATCH -N 1 # Nodes
#SBATCH -n 1 # Tasks
#SBATCH --cpus-per-task=1
#SBATCH --error=autometa_test.%J.err
#SBATCH --output=autometa_test.%J.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=jason.kwan@wisc.edu

# Initialize conda/mamba for bash shell
source ~/.bashrc   # or your shell rc file
source ~/miniconda3/etc/profile.d/conda.sh
source ~/miniconda3/etc/profile.d/mamba.sh

mamba activate autometa-nf

example_dir="/media/bigdrive1/autometa_test"
sample_sheet="$example_dir/autometa_test_samplesheet.csv"

mkdir -p $example_dir $example_dir/database_directory $example_dir/output

echo "sample,assembly,fastq_1,fastq_2,coverage_tab,cov_from_assembly" > $sample_sheet
echo "78mbp,/media/bigdrive1/autometa_test_data/78Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/78Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/78Mbp/reverse_reads.fastq.gz,,0" >> $sample_sheet
echo "625Mbp,/media/bigdrive1/autometa_test_data/625Mbp/metagenome.fna.gz,/media/bigdrive1/autometa_test_data/625Mbp/forward_reads.fastq.gz,/media/bigdrive1/autometa_test_data/625Mbp/reverse_reads.fastq.gz,,0" >> $sample_sheet

# edit the resources for the workflow to use
echo '''
process {
  withLabel:process_low {
    cpus   = { 1 * task.attempt }
    memory = { 14.GB * task.attempt }
    time   = { 24.h  * task.attempt }
  }
  withLabel:process_medium {
    cpus   = { 12  * task.attempt }
    memory = { 42.GB * task.attempt }
    time   = { 24.h * task.attempt }
  }
  withLabel:process_high {
    cpus   = { 36 * task.attempt }
    memory = { 200.GB * task.attempt }
    time   = { 48.h * task.attempt }
  }
}
docker.registry = ""
''' > $example_dir/nextflow.config

# run the full workflow + GTDB refinement
nextflow run /home/jkwan/Autometa/main.nf \
    -profile docker  \
    --input $sample_sheet \
    --taxonomy_aware \
    --outdir ${example_dir}/output \
    --single_db_dir /media/BRIANDATA3/autometa_test \
    #--single_db_dir ${example_dir}
    --autometa_image_tag 'new-nf' \
    --use_gtdb \
    --gtdb_version '220' \
    --large_downloads_permission \
    --max_memory '900.GB' \
    --max_cpus 90 \
    --max_time '20040.h' \
    -c $example_dir/nextflow.config \
    -w /media/BRIANDATA3/temp \
    --large_downloads_permission \
    -resume

# run the full workflow without GTDB refinement
nextflow run /home/jkwan/Autometa/main.nf \
    -profile docker,slurm \
    --input $sample_sheet \
    --taxonomy_aware \
    --outdir ${example_dir}/output_ncbi_only \
    #--single_db_dir ${example_dir}
    --single_db_dir /media/BRIANDATA3/autometa_test \
    --autometa_image_tag 'new-nf' \
    --large_downloads_permission \
    --max_memory '900.GB' \
    --max_cpus 90 \
    --max_time '20040.h' \
    -c $example_dir/nextflow.config \
    -w /media/BRIANDATA3/temp \
    -resume

@chasemc (Member Author) commented Jan 6, 2025

Internet here is being worked on so I can't test it

My assumption would be that adding and then commenting out #--single_db_dir ${example_dir} is the problem, since that causes the remaining flags to never be passed to the command.
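As a minimal sketch of what happens (flags truncated here for illustration): the commented line has no trailing backslash, so it ends the continuation and the shell treats everything after it as separate commands.

# broken: the commented line ends the backslash continuation,
# so --autometa_image_tag and everything below it never reach nextflow
nextflow run /home/jkwan/Autometa/main.nf \
    --outdir ${example_dir}/output \
    #--single_db_dir ${example_dir}
    --autometa_image_tag 'new-nf' \
    --large_downloads_permission

# working: drop the commented flag entirely (or move the note above the command)
nextflow run /home/jkwan/Autometa/main.nf \
    --outdir ${example_dir}/output \
    --autometa_image_tag 'new-nf' \
    --large_downloads_permission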

@jason-c-kwan (Collaborator)

OK, I think that might have been it. I couldn't get it to use the existing databases, so it is currently downloading them.

@jason-c-kwan (Collaborator)

It did get further along the pipeline, but I am now getting another error in the output:

executor >  local (22)
[71/3459b9] AUT…meta_test_samplesheet.csv) | 1 of 1 ✔
[e5/2a5c1b] AUT…gs < 3000 bp, from 625Mbp) | 2 of 2 ✔
[2b/704af7] AUT…(Aligning reads to 625Mbp) | 2 of 2 ✔
[d3/f38305] AUT…OLS_VIEW_AND_SORT (625Mbp) | 2 of 2 ✔
[4d/cec06a] AUT…EDTOOLS_GENOMECOV (625Mbp) | 2 of 2 ✔
[7a/06f6a4] AUT…OVERAGE:PARSE_BED (625Mbp) | 2 of 2 ✔
[-        ] AUT…ERAGE:SPADES_KMER_COVERAGE -
[75/75a41d] AUTOMETA:PRODIGAL (625Mbp)     | 2 of 2 ✔
[16/0a01fe] AUT…in 625Mbp against nr.dmnd) | 1 of 2
[75/a55f03] AUT…eparing db cache for ncbi) | 1 of 1, cached: 1 ✔
[4c/489840] AUT…inding ncbi LCA for 78mbp) | 1 of 1
[7a/c86131] AUT…on majority vote on 78mbp) | 1 of 1
[09/0e1081] AUT…s into kingdoms for 78mbp) | 1 of 1
[skipped  ] AUT…GTDB database version 220) | 1 of 1, stored: 1 ✔
[skipped  ] AUT…reparing Diamond database) | 1 of 1, stored: 1 ✔
[-        ] AUT…DB_REFINEMENT:EXTRACT_ORFS -
[-        ] AUT…TAXON_SPLIT:DIAMOND_BLASTP -
[c1/3e860a] AUT…eparing db cache for gtdb) | 1 of 1, cached: 1 ✔
[-        ] AUT…ENT:TAXON_SPLIT:LCA:REDUCE -
[-        ] AUT…:TAXON_SPLIT:MAJORITY_VOTE -
[9a/c25c71] AUT…rchaea markers for 625Mbp) | 4 of 4 ✔
Plus 7 more processes waiting for tasks… 
ERROR ~ Negative array index [-2] too large for array size 1

 -- Check script 'Autometa/./workflows/../subworkflows/local/././taxon_split.nf' at line: 73 or see '.nextflow.log' file for more details

I did look in .nextflow.log but I couldn't find the part about this error. Perhaps I missed it. Anyway, do you have any idea how to troubleshoot this?

@chasemc (Member Author) commented Jan 17, 2025

I can take a look when I'm back in the US next week. Can you post the log here, or email it to my wisc email?

@chasemc (Member Author) commented Jan 22, 2025

Downloading files and running on a completely new Ubuntu instance.
