Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

yeat-auto - automatically fill in samples for the config file #76

Merged
merged 10 commits into from
Apr 9, 2024

Conversation

danejo3
Copy link
Collaborator

@danejo3 danejo3 commented Apr 3, 2024

The purpose of this PR is to automatically fill in the configuration file. When users have multiple samples, for example, 10+, it can be tedious to fill out all the paths in the sample section of the config. To ease the manual labor and to prevent potential type-os, the yeat-auto command was created. When using this command, users will need to supply either a 1) sample name, 2) list of sample names, or 3) a text document containing all the sample names separated by newlines. The second required input is either a list of fastq files or a directory containing the sample's fastq files (see the --seq-path and --files flags).

Auto-populating is done by taking a sample's name and looking for matching fastq files based on substrings. 

For example, the sample name is short_reads and the matching files are short_read_1.fastq.gz and short_read_2.fastq.gz.

Auto-populating is only available for paired-end reads. As a result, all samples are dumped into the default spades algorithm in the "assemblies" section of the config.

"assemblies": {
    "spades-default": {
        "algorithm": "spades",
        "extra_args": "",
        "samples": [
            "short_reads"
        ],
        "mode": "paired"
    }
}

Users can then run the normal YEAT command with the newly created configuration, or if there are any changes that need to be made, users can adjust the config file as needed before passing it into yeat.

Copy link
Collaborator Author

@danejo3 danejo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is ready for review! @standage

@@ -34,7 +34,8 @@
entry_points={
'console_scripts': [
'yeat=yeat.cli:main',
'just-yeat-it=yeat.cli.just_yeat_it:main'
'just-yeat-it=yeat.cli.just_yeat_it:main',
'yeat-auto=yeat.cli.yeat_auto:main'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New command: yeat-auto

Comment on lines 22 to 31
class AutoPop:
def __init__(self, args):
self.samples = self.get_samples(args.samples)
self.check_samples()
self.check_seq_path(args.seq_path)
self.files = self.get_files(args.files, args.seq_path)
self.check_files()
self.files_to_samples = self.organize_files_to_samples()
self.check_files_to_samples()
self.write_config_file(args.outfile)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created the AutoPop class to create the config file. Samples and files are parsed from the command-line arguments and checked for validity. If input data has no problems, we sort the fastq files to their samples and do another check to ensure that there are enough reads per sample. If sorted correctly, we then write the config file.

Comment on lines 71 to 79
def traverse(self, dirpath):
dirpath = Path(dirpath)
if not dirpath.is_dir():
return # pragma: no cover
for subpath in dirpath.iterdir():
if subpath.is_dir():
yield from self.traverse(subpath)
else:
yield subpath
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--seq-path can take in a nested directory to find fastq files.

Comment on lines 104 to 111
assemblies = {
"spades-default": {
"algorithm": "spades",
"extra_args": "",
"samples": self.samples,
"mode": "paired",
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All samples will be dumped into the spades-default assembly since the auto-populate feature is only for paired-end reads.

Comment on lines 41 to 56
def input_configuration(parser):
ingrp = parser.add_argument_group("input configuration")
meg = ingrp.add_mutually_exclusive_group(required=True)
meg.add_argument(
"--seq-path",
default=None,
help="path to a directory containing FASTQ files to use as input; incompatible with --files",
metavar="PATH",
)
meg.add_argument(
"--files",
default=None,
help="a list of FASTQ files to use as input; incompatible with --seq-path",
metavar="FQ",
nargs="+",
)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users will need to provide either one of these flags along with the sample or list of sample names.

@danejo3 danejo3 requested a review from standage April 5, 2024 13:38
Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new program works as expected in my hands, and the code looks great! I've made a few suggestions.

Comment on lines 23 to 26
def __init__(self, args):
self.samples = self.get_samples(args.samples)
self.check_samples()
self.check_seq_path(args.seq_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The args argument here hides all the data that the AutoPop object depends on to work properly. If we explicitly pass that data in discrete arguments, it will more clearly communicate what the object needs and what the code is doing.

It looks like there are four arguments: samples, files, seq_path, and outfile. "Clean Code" purists would argue that this is too many, and in some cases they may be right. But I'm not sure I see a way to implement this feature without passing at least the first three to the same object. You could definitely simplify by moving the write_config_file method outside of the constructor, requiring the user to call it explicitly, and passing outfile as an argument to that method.

Comment on lines 55 to 63
def get_files(self, files, seq_path):
if files:
return [Path(f) for f in files]
files = []
for file in self.traverse(seq_path):
if not file.name.endswith(EXTENSIONS):
continue
files.append(file)
return files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method violates the "do one thing" rule. Perhaps it would be better to remove the first two lines, drop the files argument, and call the method like this.

self.files = files if files is not None else self.get_files(seq_path)

Comment on lines 63 to 69
def test_sequence_with_no_files(tmp_path):
wd = str(tmp_path)
config = f"{wd}/config.cfg"
arglist = ["SAMPLE", "-o", config, "--seq-path", data_file("")]
message = "sample SAMPLE: expected 2 FASTQ files for paired-end data, found 0"
with pytest.raises(AutoPopError, match=message):
run_yeat_auto(arglist)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe change SAMPLE here to something that's more obviously bogus.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have super strong feelings about this, but I think one could argue that this class is out of scope for the CLI and could be moved up a level of organization to the main YEAT package or the config subpackage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argument groups become very helpful when a program has a large number of arguments and options. For a program this size with essentially four arguments, breaking things down that granularly isn't as effective in my opinion.

usage: yeat-auto [-h] [-o FILE] (--seq-path PATH | --files FQ [FQ ...]) samples [samples ...]

required arguments:
  samples               list of sample names or path to .txt file containing sample names

options:
  -h, --help            show this help message and exit

config configuration:
  -o FILE, --outfile FILE
                        output config file; by default, "config.cfg"

input configuration:
  --seq-path PATH       path to a directory containing FASTQ files to use as input; incompatible with --files
  --files FQ [FQ ...]   a list of FASTQ files to use as input; incompatible with --seq-path

I think the example below is more effective and aesthetically pleasing than the example above.

usage: yeat-auto [-h] [-o FILE] (--seq-path PATH | --files FQ [FQ ...]) samples [samples ...]

required arguments:
  samples               list of sample names or path to .txt file containing sample names

options:
  -h, --help            show this help message and exit
  --seq-path PATH       path to a directory containing FASTQ files to use as input; incompatible with --files
  --files FQ [FQ ...]   a list of FASTQ files to use as input; incompatible with --seq-path
  -o FILE, --outfile FILE
                        output config file; by default, "config.cfg"

I also think printing the config to the standard output is a more conventional default than a specific filename.

Copy link
Collaborator Author

@danejo3 danejo3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated code with suggestions! Ready for another review!

self.samples = self.get_samples(samples)
self.check_samples()
self.check_seq_path(seq_path)
self.files = [Path(f) for f in files] if files is not None else self.get_files(seq_path)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took one of your suggestions and added a slight change to it. Instead of setting self.files to a list of strings (from the command-line), if files is not None, we do a list comprehension of converting each string to Path objects. These objects are required for the next time of code self.check_files().

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more minor changes.

Comment on lines 47 to 52
def check_seq_path(self, seq_path):
if not seq_path:
return
if not Path(seq_path).exists():
message = f"path does not exist: '{seq_path}'"
raise AutoPopError(message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a built-in exception type for this use case.

    def check_seq_path(self, seq_path):
        if not seq_path:
            return
        if not Path(seq_path).exists():
            raise FileNotFoundError(seq_path)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch! I updated another function that did the same thing as well.

Comment on lines 78 to 82
def organize_files_to_samples(self):
files_to_samples = defaultdict(list)
for sample in self.samples:
files_to_samples[sample] = [file for file in self.files if sample in str(file)]
return files_to_samples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could get away with just declaring a dict here. The defaultdict only provides an advantage if you want to call dict_of_lists[key].append() without first calling dict_of_lists[key] = [].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point 👍

Comment on lines 44 to 51
parser.add_argument(
"-o",
"--outfile",
default="config.cfg",
help='output config file; by default, "config.cfg"',
metavar="FILE",
type=str,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reiterating my suggestion that stdout is used as default.

    parser.add_argument(
        "-o",
        "--outfile",
        default=sys.stdout,
        help='output config file; by default, output is written to terminal (standard output)',
        metavar="FILE",
        type=argparse.FileType("w"),
    )

The .write_config_file() method would need to be updated to take a file handle instead of a string, and the main method to do the directory handling, but that's probably more appropriate anyway given the scope of these units of code.

Copy link
Collaborator Author

@danejo3 danejo3 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely missed the previous comment. I agree that the stdout is the way to go. I decided to remove the -o flag. Users can pipe the output into a file, etc. Keeping it consistent with yeat --init.

Also didn't realize you could use the type=argparse.FileType("w") that way.

@danejo3
Copy link
Collaborator Author

danejo3 commented Apr 9, 2024

Comments addressed! Thanks!

@standage standage merged commit 0f3717f into main Apr 9, 2024
@standage standage deleted the autopop-config branch April 9, 2024 18:24
@danejo3 danejo3 mentioned this pull request Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants