-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
yeat-auto
- automatically fill in samples for the config file
#76
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is ready for review! @standage
@@ -34,7 +34,8 @@ | |||
entry_points={ | |||
'console_scripts': [ | |||
'yeat=yeat.cli:main', | |||
'just-yeat-it=yeat.cli.just_yeat_it:main' | |||
'just-yeat-it=yeat.cli.just_yeat_it:main', | |||
'yeat-auto=yeat.cli.yeat_auto:main' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New command: yeat-auto
yeat/cli/auto_pop.py
Outdated
class AutoPop: | ||
def __init__(self, args): | ||
self.samples = self.get_samples(args.samples) | ||
self.check_samples() | ||
self.check_seq_path(args.seq_path) | ||
self.files = self.get_files(args.files, args.seq_path) | ||
self.check_files() | ||
self.files_to_samples = self.organize_files_to_samples() | ||
self.check_files_to_samples() | ||
self.write_config_file(args.outfile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created the AutoPop
class to create the config file. Samples and files are parsed from the command-line arguments and checked for validity. If input data has no problems, we sort the fastq files to their samples and do another check to ensure that there are enough reads per sample. If sorted correctly, we then write the config file.
yeat/cli/auto_pop.py
Outdated
def traverse(self, dirpath): | ||
dirpath = Path(dirpath) | ||
if not dirpath.is_dir(): | ||
return # pragma: no cover | ||
for subpath in dirpath.iterdir(): | ||
if subpath.is_dir(): | ||
yield from self.traverse(subpath) | ||
else: | ||
yield subpath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--seq-path
can take in a nested directory to find fastq files.
yeat/cli/auto_pop.py
Outdated
assemblies = { | ||
"spades-default": { | ||
"algorithm": "spades", | ||
"extra_args": "", | ||
"samples": self.samples, | ||
"mode": "paired", | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All samples will be dumped into the spades-default
assembly since the auto-populate feature is only for paired-end reads.
yeat/cli/yeat_auto.py
Outdated
def input_configuration(parser): | ||
ingrp = parser.add_argument_group("input configuration") | ||
meg = ingrp.add_mutually_exclusive_group(required=True) | ||
meg.add_argument( | ||
"--seq-path", | ||
default=None, | ||
help="path to a directory containing FASTQ files to use as input; incompatible with --files", | ||
metavar="PATH", | ||
) | ||
meg.add_argument( | ||
"--files", | ||
default=None, | ||
help="a list of FASTQ files to use as input; incompatible with --seq-path", | ||
metavar="FQ", | ||
nargs="+", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users will need to provide either one of these flags along with the sample or list of sample names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new program works as expected in my hands, and the code looks great! I've made a few suggestions.
yeat/cli/auto_pop.py
Outdated
def __init__(self, args): | ||
self.samples = self.get_samples(args.samples) | ||
self.check_samples() | ||
self.check_seq_path(args.seq_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The args
argument here hides all the data that the AutoPop
object depends on to work properly. If we explicitly pass that data in discrete arguments, it will more clearly communicate what the object needs and what the code is doing.
It looks like there are four arguments: samples, files, seq_path, and outfile. "Clean Code" purists would argue that this is too many, and in some cases they may be right. But I'm not sure I see a way to implement this feature without passing at least the first three to the same object. You could definitely simplify by moving the write_config_file
method outside of the constructor, requiring the user to call it explicitly, and passing outfile
as an argument to that method.
yeat/cli/auto_pop.py
Outdated
def get_files(self, files, seq_path): | ||
if files: | ||
return [Path(f) for f in files] | ||
files = [] | ||
for file in self.traverse(seq_path): | ||
if not file.name.endswith(EXTENSIONS): | ||
continue | ||
files.append(file) | ||
return files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method violates the "do one thing" rule. Perhaps it would be better to remove the first two lines, drop the files
argument, and call the method like this.
self.files = files if files is not None else self.get_files(seq_path)
yeat/tests/test_yeat_auto.py
Outdated
def test_sequence_with_no_files(tmp_path): | ||
wd = str(tmp_path) | ||
config = f"{wd}/config.cfg" | ||
arglist = ["SAMPLE", "-o", config, "--seq-path", data_file("")] | ||
message = "sample SAMPLE: expected 2 FASTQ files for paired-end data, found 0" | ||
with pytest.raises(AutoPopError, match=message): | ||
run_yeat_auto(arglist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe change SAMPLE
here to something that's more obviously bogus.
yeat/cli/auto_pop.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have super strong feelings about this, but I think one could argue that this class is out of scope for the CLI and could be moved up a level of organization to the main YEAT package or the config
subpackage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argument groups become very helpful when a program has a large number of arguments and options. For a program this size with essentially four arguments, breaking things down that granularly isn't as effective in my opinion.
usage: yeat-auto [-h] [-o FILE] (--seq-path PATH | --files FQ [FQ ...]) samples [samples ...]
required arguments:
samples list of sample names or path to .txt file containing sample names
options:
-h, --help show this help message and exit
config configuration:
-o FILE, --outfile FILE
output config file; by default, "config.cfg"
input configuration:
--seq-path PATH path to a directory containing FASTQ files to use as input; incompatible with --files
--files FQ [FQ ...] a list of FASTQ files to use as input; incompatible with --seq-path
I think the example below is more effective and aesthetically pleasing than the example above.
usage: yeat-auto [-h] [-o FILE] (--seq-path PATH | --files FQ [FQ ...]) samples [samples ...]
required arguments:
samples list of sample names or path to .txt file containing sample names
options:
-h, --help show this help message and exit
--seq-path PATH path to a directory containing FASTQ files to use as input; incompatible with --files
--files FQ [FQ ...] a list of FASTQ files to use as input; incompatible with --seq-path
-o FILE, --outfile FILE
output config file; by default, "config.cfg"
I also think printing the config to the standard output is a more conventional default than a specific filename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated code with suggestions! Ready for another review!
self.samples = self.get_samples(samples) | ||
self.check_samples() | ||
self.check_seq_path(seq_path) | ||
self.files = [Path(f) for f in files] if files is not None else self.get_files(seq_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took one of your suggestions and added a slight change to it. Instead of setting self.files
to a list of strings (from the command-line), if files
is not None, we do a list comprehension of converting each string to Path
objects. These objects are required for the next time of code self.check_files()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more minor changes.
yeat/config/auto_pop.py
Outdated
def check_seq_path(self, seq_path): | ||
if not seq_path: | ||
return | ||
if not Path(seq_path).exists(): | ||
message = f"path does not exist: '{seq_path}'" | ||
raise AutoPopError(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a built-in exception type for this use case.
def check_seq_path(self, seq_path):
if not seq_path:
return
if not Path(seq_path).exists():
raise FileNotFoundError(seq_path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! I updated another function that did the same thing as well.
yeat/config/auto_pop.py
Outdated
def organize_files_to_samples(self): | ||
files_to_samples = defaultdict(list) | ||
for sample in self.samples: | ||
files_to_samples[sample] = [file for file in self.files if sample in str(file)] | ||
return files_to_samples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could get away with just declaring a dict
here. The defaultdict
only provides an advantage if you want to call dict_of_lists[key].append()
without first calling dict_of_lists[key] = []
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point 👍
yeat/cli/yeat_auto.py
Outdated
parser.add_argument( | ||
"-o", | ||
"--outfile", | ||
default="config.cfg", | ||
help='output config file; by default, "config.cfg"', | ||
metavar="FILE", | ||
type=str, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reiterating my suggestion that stdout is used as default.
parser.add_argument(
"-o",
"--outfile",
default=sys.stdout,
help='output config file; by default, output is written to terminal (standard output)',
metavar="FILE",
type=argparse.FileType("w"),
)
The .write_config_file()
method would need to be updated to take a file handle instead of a string, and the main method to do the directory handling, but that's probably more appropriate anyway given the scope of these units of code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely missed the previous comment. I agree that the stdout is the way to go. I decided to remove the -o
flag. Users can pipe the output into a file, etc. Keeping it consistent with yeat --init
.
Also didn't realize you could use the type=argparse.FileType("w")
that way.
Comments addressed! Thanks! |
The purpose of this PR is to automatically fill in the configuration file. When users have multiple samples, for example, 10+, it can be tedious to fill out all the paths in the sample section of the config. To ease the manual labor and to prevent potential type-os, the
yeat-auto
command was created. When using this command, users will need to supply either a 1) sample name, 2) list of sample names, or 3) a text document containing all the sample names separated by newlines. The second required input is either a list of fastq files or a directory containing the sample's fastq files (see the--seq-path
and--files
flags).Auto-populating is done by taking a sample's name and looking for matching fastq files based on substrings.
For example, the sample name is
short_reads
and the matching files areshort_read_1.fastq.gz
andshort_read_2.fastq.gz
.Auto-populating is only available for paired-end reads. As a result, all samples are dumped into the default spades algorithm in the "assemblies" section of the config.
Users can then run the normal YEAT command with the newly created configuration, or if there are any changes that need to be made, users can adjust the config file as needed before passing it into
yeat
.