Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ported GATK's FastaReferenceWriter #1172

Merged
merged 14 commits into from
Feb 1, 2019
Merged

Ported GATK's FastaReferenceWriter #1172

merged 14 commits into from
Feb 1, 2019

Conversation

yfarjoun
Copy link
Contributor

@yfarjoun yfarjoun commented Aug 8, 2018

  • Implemented some basic validation functionality rather than stripping it out of the implementation of FastaReferenceWriter)
  • Implemented a simple version of a RandomDNA creator (and modified SAMRecordSetBuilder to use it)
  • Ported the tests (for FastaReferenceWriter) from GATK.
  • Replaces Class for writing fasta format file. #1058

Description

Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Aug 8, 2018

@alecw will this work as well for you?

@codecov-io
Copy link

codecov-io commented Aug 8, 2018

Codecov Report

Merging #1172 into master will increase coverage by 1.963%.
The diff coverage is 79.111%.

@@               Coverage Diff               @@
##              master     #1172       +/-   ##
===============================================
+ Coverage     68.343%   70.306%   +1.963%     
- Complexity      8014      9891     +1877     
===============================================
  Files            541       611       +70     
  Lines          32716     38378     +5662     
  Branches        5531      6717     +1186     
===============================================
+ Hits           22359     26982     +4623     
- Misses          8136      8894      +758     
- Partials        2221      2502      +281
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/samtools/SAMRecordSetBuilder.java 89.256% <100%> (-0.175%) 70 <1> (-1)
...mtools/reference/ReferenceSequenceFileFactory.java 78.723% <100%> (ø) 22 <1> (ø) ⬇️
src/main/java/htsjdk/utils/ValidationUtils.java 40% <40%> (ø) 6 <6> (?)
...c/main/java/htsjdk/samtools/util/SequenceUtil.java 70.714% <40.909%> (-1.384%) 129 <9> (+3)
...amtools/reference/FastaReferenceWriterBuilder.java 84.314% <84.314%> (ø) 24 <24> (?)
...tsjdk/samtools/reference/FastaReferenceWriter.java 91.2% <91.2%> (ø) 41 <41> (?)
src/main/java/htsjdk/samtools/SAMTagUtil.java 54.444% <0%> (-41.634%) 4% <0%> (-2%)
.../variant/variantcontext/StructuralVariantType.java 80% <0%> (-20%) 2% <0%> (+1%)
...samtools/cram/encoding/readfeatures/Insertion.java 34.375% <0%> (-4.514%) 8% <0%> (+4%)
src/main/java/htsjdk/samtools/LinearIndex.java 56.818% <0%> (-3.182%) 20% <0%> (+10%)
... and 164 more

@alecw
Copy link
Contributor

alecw commented Aug 8, 2018

@yfarjoun I'll try this when I get a chance.

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some overall comments about the design (I did not check the implementation yet).

* THE SOFTWARE.
*/

package htsjdk;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can better live in htsjdk.utils

/**
* Simple functions that streamline the checking of values.
*/
public class ValidationUtils {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally a class like this in HTSJDK!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-)

* <ul>
* <li>Sequence names are valid (non-empty, without space/blank, control characters),</li>
* <li>sequence description are valid (without control characters),</li>
* <li>bases are valid nucleotides ore IUPAC redundancy codes and X [ACGTNX...] (lower or uppercase are accepted),</li>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: nucleotides ore IUPAC -> nucleotides or IUPAC

* @throws IllegalArgumentException if {@code fastaFile} is {@code null}.
* @throws IOException if such exception is thrown when accessing the output path resources.
*/
public FastaReferenceWriter(final Path fastaFile, final boolean makeFaiOutput, final boolean makeDictOutput)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it is better to have a factory instead of too many constructors...

// for the sake of avoiding creating output if basesPerLine is invalid.
this.defaultBasePerLine = checkBasesPerLine(basesPerLine);

this.fastaStream = new CountingOutputStream(Files.newOutputStream(fastaFile));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support for compressed FASTA files (with gzi/fai index) can be also added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

later perhaps...I wanted to get something in that works for the simple case...feel free to modify it afterwards...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please open an issue and mention it here (or in the factory when the stream is open)? It is easier to track that way

}
}

private void writeIndexEntry()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this functionality somewhere already? Otherwise, it will be nice to have an index-codec for decoding/encoding

Copy link
Member

@magicDGS magicDGS Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or a FastaIndexing class to index on the fly streams

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently FastaSequenceIndex and FastaSequenceIndexCreator are geared towards making an index from a fasta-file, not for generating the index on the fly....so without some serious refactoring that's not going to be easy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I open #1173 for the refactor

* @param bases the sequence bases, cannot be {@code null}.
* @throws IOException if such exception is thrown when writing in the output resources.
*/
public static void writeSingleSequenceReference(final Path whereTo, final boolean makeIndex,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a factory/builder for the writer, the static methods can be moved there

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would expect a static method like this to be in the class itself, not a factory method.

/**
* Implementation of a writer that does nothing
*/
public class NullWriter extends Writer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the main or the test sources?

Copy link
Contributor Author

@yfarjoun yfarjoun Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main....it's used in the production code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it belongs in htsjdk and not in your production code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until you have more than one use case in htsjdk, it could live inside the writer class as a static private inner class.

import java.util.stream.IntStream;
import java.util.stream.Stream;

public class FastaReferenceWriterTest extends HtsjdkTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be nice to have a test that reads an input FASTA, writes it down and checks if the files are the same (either MD5 or re-reading line by line).

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can review FastaReferenceWriter once the changes @magicDGS suggested about a factory instead of so many public constructors.

*/
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){
nonNull(collection, "The collection is null: " + message);
if(collection.isEmpty()){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

){ -> ) { here and throughout. Generally, the spacing in if/else and so on is not consistent here and throughout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finals can be used here and throughout.

*/
public static <T> T nonNull(final T object, final Supplier<String> message) {
if (object == null) {
throw new IllegalArgumentException(message.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and below is ripe for a method like

public static T require(final T actual, final T expected, final Supplier<String> message) {
    if (actual == expected) return actual;
    else throw new IllegalArgumentException(message.get());
}

I'd just use require in scala :P

return nonEmpty(collection, "collection must not be null or empty.");
}

public static void validateArg(final boolean condition, final String msg){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs here and blow

for (int i = 0; i < length; ++i) {
bases[i] = BASES[this.random.nextInt(BASES.length)];
}
final byte[] bases = SequenceUtil.getRandomBases(random, length);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get rid of BASES now.

}

return bases;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one test please

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. thanks for the reminder.

@@ -0,0 +1,32 @@
package htsjdk.utils;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license

/**
* Implementation of a writer that does nothing
*/
public class NullWriter extends Writer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it belongs in htsjdk and not in your production code.


/**
* Writes a FASTA formatted reference file.
* <p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this javadoc look correct to you when it's generated? I would change it in a few ways:

  • omit </p> as it will result in (unnecessary) extra vertical space
  • for a code block, wrap with <pre>..</pre> with proper indenting

also, </li> is unnecessary although it does no harm to include it

this.defaultBasePerLine = checkBasesPerLine(basesPerLine);
this.fastaStream = new CountingOutputStream(fastaOutput);
this.indexWriter = indexOutput == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(indexOutput, CHARSET);
final BufferedWriter dictWriter = new BufferedWriter(dictOutput == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(dictOutput, CHARSET));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why you have a local variable for this when you can assign to the field directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the reason is that this.dictWriter is a Writer while SAMSequenceDictionaryCodec constructor takes a BufferedWriter...I could cast...but that might be uglier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't notice that


this.fastaStream = new CountingOutputStream(Files.newOutputStream(fastaFile));
this.indexWriter = indexFile == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(Files.newOutputStream(indexFile), CHARSET);
final BufferedWriter dictWriter = new BufferedWriter(dictFile == null ? NullWriter.NULL_WRITER : new OutputStreamWriter(Files.newOutputStream(dictFile), CHARSET));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to avoid the code duplication with the other constructor, if possible. Using a factory method or builder pattern is one way to do this. Or you could move the shared code to a private method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue is that we want to check the basesPerLine before the OutputFile is created...and as the comment (in the other constructor says) this is hard to do without code duplication....

Everything else has to be in the constructor since it's final.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nm. changed the signature and I think it works now.

this.dictWriter = dictWriter;
this.dictCodec = new SAMSequenceDictionaryCodec(dictWriter);
this.dictCodec.encodeHeaderLine(false);
this.sequenceNamesAndSizes = new LinkedHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is always initialized the same way. If so, moving the initialization assignment to the field declaration would make this clearer to the reader.

// checks that a sequence name is valid.
private static void checkSequenceName(final String name) {
ValidationUtils.nonNull(name ,"Sequence name should not be null");
ValidationUtils.nonEmpty(name,"Sequence name should not be empty");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after ,

*/
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){
nonNull(collection, "The collection is null: " + message);
if(collection.isEmpty()){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add spaces before/after ()

* @return the original collection
* @throws IllegalArgumentException if collection is null or empty
*/
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add space before {. many missing spaces in this file, would be easier to re-format in the IDE to fix them all

* @return the original collection
* @throws IllegalArgumentException if collection is null or empty
*/
public static <I, T extends Collection<I>> T nonEmpty(T collection, String message){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to parameterize on I here? Is <T extends Collection<?>> not sufficient?

nonNull(collection, "The collection is null: " + message);
if(collection.isEmpty()){
throw new IllegalArgumentException("The collection is empty: " + message);
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary else after throw

* @param collection any Collection
* @return true if the collection exists and has elements
*/
public static boolean isNonEmpty(Collection<?> collection){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems confusing to me to have two different styles of validation here that do pretty much the same thing.

}
currentSequenceName = sequenceName;
currentBasesPerLine = basesPerLine;
final StringBuilder builder = new StringBuilder(sequenceName.length() + nonNullDescription.length() + 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like what you really want is sequenceName.length() + nonNullDescription.length() + 2. In practice does it make much difference (regarding code performance) to preallocate here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know about the performance benefit. I think that the idea is to preallocate if can.. but you are correct about the +2 (I think)

@yfarjoun
Copy link
Contributor Author

ok @nh13 it's using a builder now. :-)

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun I have stopped reviewing the FastaReferenceWriter class as I think the API needs a lot of work to leverage the various htsjdk classes and make it a lot friendlier. It really feels like a quick port.

I'd like to see the public methods accept htsjdk classes rather than byte arrays, strings, and other things. If there is a reason to bass those directly, for example, because a contig is too big, I can understand that, but I feel like then the API can be improved.

protected FastaReferenceWriter(final int basesPerLine,
final OutputStream fastaOutput,
final OutputStream indexOutput,
final OutputStream dictOutput) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ident is wrong

* @param dictOutput the output stream to the dictFile, if requested, {@code null} if none should be generated.
* @throws IllegalArgumentException if {@code fastaFile} is {@code null} or {@code basesPerLine} is 0 or negative.
*/
protected FastaReferenceWriter(final int basesPerLine,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cant this be defaulted to DEFAULT_BASES_PER_LINE?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the factory it does that.

} else if (Character.isISOControl(ch)) {
throw new IllegalArgumentException("the input name contains control characters: '" + name + "'");
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you stored SAMSequenceRecords this would take care of itself!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but I can't use that and still get things like description (which SAMSequenceRecord doesn't support) and also incremental buildup of a sequence..

* The value is the sequence length in bases.
* </p>
*/
private final Map<String, Long> sequenceNamesAndSizes = new LinkedHashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you use a Map<String, SAMSequenceRecord>?

protected FastaReferenceWriter(final int basesPerLine,
final OutputStream fastaOutput,
final OutputStream indexOutput,
final OutputStream dictOutput) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be great if index/dict were optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are (in builder)

return appendBases(bases, 0, bases.length);
}

/**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've stopped reviewing, because I just don't understand why this class doesn't leverage the various htsjdk classes for it's API. For example, ReferenceSequence and SAMSequenceRecord. I get wanting to feed sub-parts of the reference FASTA at a time, but can't that be handled by giving it an iterator over sub-sequences that's evaluated lazily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds nice...but not a simple port. I was looking for something so that folks could write FASTA files with htsjdk.....however, I've realized that the length isn't used...so this is a set of names...I'll make that change.

Copy link
Member

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not review the code (again) because I do not have too much time, but I am concern about users passing compressed streams if they do not work.

I agree with @nh13 about using also HTSJDK objects for the writer for better integration.

* You can specify a specific output stream to each file: the main fasta output, its index and its dictionary.
* </p>
*
* @param fastaOutput the output fasta file path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this stream be passed as a bgzip compressed one, and in that case will it work properly? That should be specify here to help developers to refactor to include bgzip-support.

*
* @param fastaOutput the output fasta file path.
* @param indexOutput the output stream to the index file, if requested, {@code null} if none should be generated.
* @param dictOutput the output stream to the dictFile, if requested, {@code null} if none should be generated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specify that both index/dict streams should not be compressed (the same in the builder).

@alecw
Copy link
Contributor

alecw commented Aug 21, 2018

Hi @yfarjoun,
I did a quick test. Works for me.
Thanks!

Yossi Farjoun added 8 commits September 5, 2018 13:38
- Implemented some basic validation functionality rather than stripping it out of the implementation of FastaReferenceWriter)
- Implemented a simple version of a RandomDNA creator (and modified SAMRecordSetBuilder to use it)
- Ported the tests (for FastaReferenceWriter) from GATK.
@yfarjoun yfarjoun force-pushed the yf_FastaWriter_fromGATK branch from a77e99e to 98ada8d Compare September 5, 2018 17:39
@yfarjoun yfarjoun added the Waiting for Review This PR is waiting for a reviewer to respond label Sep 10, 2018
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems pretty good to me. Some minor comments and then I think 👍.

* @param fastaFile a {@link File} to the output fasta file.
* @return this builder
*/
public FastaReferenceWriterBuilder setFastaFile(File fastaFile) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just not add methods that take File


/**
* Sets whether to automatically generate an dictionary file from the name of the fasta-file (assuming it is given
* as a file). This can only happen if both the index file and output stream are null.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably say dict here.

if (indexFile == null && indexOutput == null) {
indexFile = defaultFaiFile(makeFaiOutput, fastaFile);
}
if (dictFile == null && dictOutput == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should detect and error in the case where more than one of these is specified / if one is specified but it's set to not output.

@yfarjoun yfarjoun self-assigned this Oct 1, 2018
@yfarjoun yfarjoun added Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond and removed Waiting for Review This PR is waiting for a reviewer to respond labels Oct 1, 2018
Yossi Farjoun added 3 commits October 23, 2018 11:51
- adding a test
- fixed a bug
- removed final from TWR, since its unneeded
@yfarjoun yfarjoun added Waiting for Review This PR is waiting for a reviewer to respond and removed Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond labels Oct 23, 2018
@yfarjoun
Copy link
Contributor Author

@lbergelson back to you.

*/
public FastaReferenceWriter appendString(final String basesString)
throws IOException {
return appendBases(basesString.getBytes());
Copy link
Contributor Author

@yfarjoun yfarjoun Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

specify encoding ascii

@yfarjoun
Copy link
Contributor Author

@droazen, please take another look.

@yfarjoun
Copy link
Contributor Author

yfarjoun commented Feb 1, 2019

merging due to thumb from louis.

@yfarjoun yfarjoun merged commit 942e3d6 into master Feb 1, 2019
@yfarjoun yfarjoun deleted the yf_FastaWriter_fromGATK branch February 1, 2019 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for Review This PR is waiting for a reviewer to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants