-
Notifications
You must be signed in to change notification settings - Fork 244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ported GATK's FastaReferenceWriter #1172
Conversation
@alecw will this work as well for you? |
Codecov Report
@@ Coverage Diff @@
## master #1172 +/- ##
===============================================
+ Coverage 68.343% 70.306% +1.963%
- Complexity 8014 9891 +1877
===============================================
Files 541 611 +70
Lines 32716 38378 +5662
Branches 5531 6717 +1186
===============================================
+ Hits 22359 26982 +4623
- Misses 8136 8894 +758
- Partials 2221 2502 +281
|
@yfarjoun I'll try this when I get a chance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some overall comments about the design (I did not check the implementation yet).
* THE SOFTWARE. | ||
*/ | ||
|
||
package htsjdk; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can better live in htsjdk.utils
/** | ||
* Simple functions that streamline the checking of values. | ||
*/ | ||
public class ValidationUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally a class like this in HTSJDK!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:-)
* <ul> | ||
* <li>Sequence names are valid (non-empty, without space/blank, control characters),</li> | ||
* <li>sequence description are valid (without control characters),</li> | ||
* <li>bases are valid nucleotides ore IUPAC redundancy codes and X [ACGTNX...] (lower or uppercase are accepted),</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: nucleotides ore IUPAC
-> nucleotides or IUPAC
* @throws IllegalArgumentException if {@code fastaFile} is {@code null}. | ||
* @throws IOException if such exception is thrown when accessing the output path resources. | ||
*/ | ||
public FastaReferenceWriter(final Path fastaFile, final boolean makeFaiOutput, final boolean makeDictOutput) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is better to have a factory instead of too many constructors...
// for the sake of avoiding creating output if basesPerLine is invalid. | ||
this.defaultBasePerLine = checkBasesPerLine(basesPerLine); | ||
|
||
this.fastaStream = new CountingOutputStream(Files.newOutputStream(fastaFile)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Support for compressed FASTA files (with gzi/fai index) can be also added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
later perhaps...I wanted to get something in that works for the simple case...feel free to modify it afterwards...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please open an issue and mention it here (or in the factory when the stream is open)? It is easier to track that way
} | ||
} | ||
|
||
private void writeIndexEntry() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this functionality somewhere already? Otherwise, it will be nice to have an index-codec for decoding/encoding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or a FastaIndexing
class to index on the fly streams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently FastaSequenceIndex and FastaSequenceIndexCreator are geared towards making an index from a fasta-file, not for generating the index on the fly....so without some serious refactoring that's not going to be easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I open #1173 for the refactor
* @param bases the sequence bases, cannot be {@code null}. | ||
* @throws IOException if such exception is thrown when writing in the output resources. | ||
*/ | ||
public static void writeSingleSequenceReference(final Path whereTo, final boolean makeIndex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a factory/builder for the writer, the static methods can be moved there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I would expect a static method like this to be in the class itself, not a factory method.
/** | ||
* Implementation of a writer that does nothing | ||
*/ | ||
public class NullWriter extends Writer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be in the main or the test sources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main....it's used in the production code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it belongs in htsjdk and not in your production code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until you have more than one use case in htsjdk, it could live inside the writer class as a static private inner class.
import java.util.stream.IntStream; | ||
import java.util.stream.Stream; | ||
|
||
public class FastaReferenceWriterTest extends HtsjdkTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be nice to have a test that reads an input FASTA, writes it down and checks if the files are the same (either MD5 or re-reading line by line).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can review FastaReferenceWriter
once the changes @magicDGS suggested about a factory instead of so many public constructors.
*/ | ||
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){ | ||
nonNull(collection, "The collection is null: " + message); | ||
if(collection.isEmpty()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
){
-> ) {
here and throughout. Generally, the spacing in if/else
and so on is not consistent here and throughout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final
s can be used here and throughout.
*/ | ||
public static <T> T nonNull(final T object, final Supplier<String> message) { | ||
if (object == null) { | ||
throw new IllegalArgumentException(message.get()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this and below is ripe for a method like
public static T require(final T actual, final T expected, final Supplier<String> message) {
if (actual == expected) return actual;
else throw new IllegalArgumentException(message.get());
}
I'd just use require
in scala :P
return nonEmpty(collection, "collection must not be null or empty."); | ||
} | ||
|
||
public static void validateArg(final boolean condition, final String msg){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs here and blow
for (int i = 0; i < length; ++i) { | ||
bases[i] = BASES[this.random.nextInt(BASES.length)]; | ||
} | ||
final byte[] bases = SequenceUtil.getRandomBases(random, length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can get rid of BASES
now.
} | ||
|
||
return bases; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one test please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. thanks for the reminder.
@@ -0,0 +1,32 @@ | |||
package htsjdk.utils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license
/** | ||
* Implementation of a writer that does nothing | ||
*/ | ||
public class NullWriter extends Writer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure it belongs in htsjdk and not in your production code.
|
||
/** | ||
* Writes a FASTA formatted reference file. | ||
* <p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this javadoc look correct to you when it's generated? I would change it in a few ways:
- omit
</p>
as it will result in (unnecessary) extra vertical space - for a code block, wrap with
<pre>..</pre>
with proper indenting
also, </li>
is unnecessary although it does no harm to include it
this.defaultBasePerLine = checkBasesPerLine(basesPerLine); | ||
this.fastaStream = new CountingOutputStream(fastaOutput); | ||
this.indexWriter = indexOutput == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(indexOutput, CHARSET); | ||
final BufferedWriter dictWriter = new BufferedWriter(dictOutput == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(dictOutput, CHARSET)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you have a local variable for this when you can assign to the field directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason is that this.dictWriter is a Writer while SAMSequenceDictionaryCodec constructor takes a BufferedWriter...I could cast...but that might be uglier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't notice that
|
||
this.fastaStream = new CountingOutputStream(Files.newOutputStream(fastaFile)); | ||
this.indexWriter = indexFile == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(Files.newOutputStream(indexFile), CHARSET); | ||
final BufferedWriter dictWriter = new BufferedWriter(dictFile == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(Files.newOutputStream(dictFile), CHARSET)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to avoid the code duplication with the other constructor, if possible. Using a factory method or builder pattern is one way to do this. Or you could move the shared code to a private method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the issue is that we want to check the basesPerLine before the OutputFile is created...and as the comment (in the other constructor says) this is hard to do without code duplication....
Everything else has to be in the constructor since it's final.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nm. changed the signature and I think it works now.
this.dictWriter = dictWriter; | ||
this.dictCodec = new SAMSequenceDictionaryCodec(dictWriter); | ||
this.dictCodec.encodeHeaderLine(false); | ||
this.sequenceNamesAndSizes = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is always initialized the same way. If so, moving the initialization assignment to the field declaration would make this clearer to the reader.
// checks that a sequence name is valid. | ||
private static void checkSequenceName(final String name) { | ||
ValidationUtils.nonNull(name ,"Sequence name should not be null"); | ||
ValidationUtils.nonEmpty(name,"Sequence name should not be empty"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space after ,
*/ | ||
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){ | ||
nonNull(collection, "The collection is null: " + message); | ||
if(collection.isEmpty()){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add spaces before/after ()
* @return the original collection | ||
* @throws IllegalArgumentException if collection is null or empty | ||
*/ | ||
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add space before {
. many missing spaces in this file, would be easier to re-format in the IDE to fix them all
* @return the original collection | ||
* @throws IllegalArgumentException if collection is null or empty | ||
*/ | ||
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to parameterize on I
here? Is <T extends Collection<?>>
not sufficient?
nonNull(collection, "The collection is null: " + message); | ||
if(collection.isEmpty()){ | ||
throw new IllegalArgumentException("The collection is empty: " + message); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary else
after throw
* @param collection any Collection | ||
* @return true if the collection exists and has elements | ||
*/ | ||
public static boolean isNonEmpty(Collection<?> collection){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems confusing to me to have two different styles of validation here that do pretty much the same thing.
} | ||
currentSequenceName = sequenceName; | ||
currentBasesPerLine = basesPerLine; | ||
final StringBuilder builder = new StringBuilder(sequenceName.length() + nonNullDescription.length() + 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like what you really want is sequenceName.length() + nonNullDescription.length() + 2
. In practice does it make much difference (regarding code performance) to preallocate here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about the performance benefit. I think that the idea is to preallocate if can.. but you are correct about the +2 (I think)
ok @nh13 it's using a builder now. :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun I have stopped reviewing the FastaReferenceWriter
class as I think the API needs a lot of work to leverage the various htsjdk classes and make it a lot friendlier. It really feels like a quick port.
I'd like to see the public methods accept htsjdk classes rather than byte arrays, strings, and other things. If there is a reason to bass those directly, for example, because a contig is too big, I can understand that, but I feel like then the API can be improved.
protected FastaReferenceWriter(final int basesPerLine, | ||
final OutputStream fastaOutput, | ||
final OutputStream indexOutput, | ||
final OutputStream dictOutput) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ident is wrong
* @param dictOutput the output stream to the dictFile, if requested, {@code null} if none should be generated. | ||
* @throws IllegalArgumentException if {@code fastaFile} is {@code null} or {@code basesPerLine} is 0 or negative. | ||
*/ | ||
protected FastaReferenceWriter(final int basesPerLine, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cant this be defaulted to DEFAULT_BASES_PER_LINE
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the factory it does that.
} else if (Character.isISOControl(ch)) { | ||
throw new IllegalArgumentException("the input name contains control characters: '" + name + "'"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you stored SAMSequenceRecord
s this would take care of itself!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, but I can't use that and still get things like description (which SAMSequenceRecord doesn't support) and also incremental buildup of a sequence..
* The value is the sequence length in bases. | ||
* </p> | ||
*/ | ||
private final Map<String, Long> sequenceNamesAndSizes = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you use a Map<String, SAMSequenceRecord>
?
protected FastaReferenceWriter(final int basesPerLine, | ||
final OutputStream fastaOutput, | ||
final OutputStream indexOutput, | ||
final OutputStream dictOutput) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be great if index/dict were optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are (in builder)
return appendBases(bases, 0, bases.length); | ||
} | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've stopped reviewing, because I just don't understand why this class doesn't leverage the various htsjdk classes for it's API. For example, ReferenceSequence
and SAMSequenceRecord
. I get wanting to feed sub-parts of the reference FASTA at a time, but can't that be handled by giving it an iterator over sub-sequences that's evaluated lazily?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds nice...but not a simple port. I was looking for something so that folks could write FASTA files with htsjdk.....however, I've realized that the length isn't used...so this is a set of names...I'll make that change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not review the code (again) because I do not have too much time, but I am concern about users passing compressed streams if they do not work.
I agree with @nh13 about using also HTSJDK objects for the writer for better integration.
* You can specify a specific output stream to each file: the main fasta output, its index and its dictionary. | ||
* </p> | ||
* | ||
* @param fastaOutput the output fasta file path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this stream be passed as a bgzip compressed one, and in that case will it work properly? That should be specify here to help developers to refactor to include bgzip-support.
* | ||
* @param fastaOutput the output fasta file path. | ||
* @param indexOutput the output stream to the index file, if requested, {@code null} if none should be generated. | ||
* @param dictOutput the output stream to the dictFile, if requested, {@code null} if none should be generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify that both index/dict streams should not be compressed (the same in the builder).
Hi @yfarjoun, |
- Implemented some basic validation functionality rather than stripping it out of the implementation of FastaReferenceWriter) - Implemented a simple version of a RandomDNA creator (and modified SAMRecordSetBuilder to use it) - Ported the tests (for FastaReferenceWriter) from GATK.
a77e99e
to
98ada8d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems pretty good to me. Some minor comments and then I think 👍.
* @param fastaFile a {@link File} to the output fasta file. | ||
* @return this builder | ||
*/ | ||
public FastaReferenceWriterBuilder setFastaFile(File fastaFile) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just not add methods that take File
|
||
/** | ||
* Sets whether to automatically generate an dictionary file from the name of the fasta-file (assuming it is given | ||
* as a file). This can only happen if both the index file and output stream are null. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably say dict here.
if (indexFile == null && indexOutput == null) { | ||
indexFile = defaultFaiFile(makeFaiOutput, fastaFile); | ||
} | ||
if (dictFile == null && dictOutput == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should detect and error in the case where more than one of these is specified / if one is specified but it's set to not output.
- adding a test - fixed a bug - removed final from TWR, since its unneeded
@lbergelson back to you. |
src/main/java/htsjdk/samtools/reference/FastaReferenceWriter.java
Outdated
Show resolved
Hide resolved
*/ | ||
public FastaReferenceWriter appendString(final String basesString) | ||
throws IOException { | ||
return appendBases(basesString.getBytes()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specify encoding ascii
src/main/java/htsjdk/samtools/reference/FastaReferenceWriter.java
Outdated
Show resolved
Hide resolved
@droazen, please take another look. |
merging due to thumb from louis. |
Description
Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?
Checklist