Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: handling UCSC 2bit sequences as ReferenceFile #1417

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

lindenb
Copy link
Contributor

@lindenb lindenb commented Aug 31, 2019

Description

This is just a proposal, please don't commit.

This PR is a proposal to add a support for the 2bit fasta sequences : https://genome.ucsc.edu/goldenpath/help/twoBit.html

I wrote

TwoBitSequenceFile: https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/TwoBitSequenceFile.java implements ReferenceSequenceFile and reads 2bit files.

I modified https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/ReferenceSequenceFileFactory.java#L133 to handle the new file extension

Tell me if you're interested with this commit and I'll add more tests and fix the formatting.

PS: Another way to handle those kind custom references sequences would be to move statics methods from ReferenceSequenceFileFactory to non-static and give a chance to set a default instance.

ReferenceSequenceFileFactory.setInstance(my_instance_handling_2bit);
(...)
ReferenceSequenceFileFactory.getInstance().getReferenceSequenceFile(path);

@yfarjoun
Copy link
Contributor

yfarjoun commented Sep 9, 2019

should the 2bit fasta be added as a hts-spec as well?

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's great to be able to read more formats. if you have a refactoring that would make doing that easier, feel free to refactor.

Regarding the format itself, perhaps it would make sense to see if it should be moved into hts-specs?

Assert.assertEquals(seq.getName(), "chr20");
Assert.assertEquals(seq.length(), 1_000_000);

final String chrM_100_120="GGAGCCGGAGCACCCTATGTC";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to have a test that also checks the Ns and the masked regions


@Test(dataProvider="homosapiens")
public void testOpenFile(final Path sequenceFile) throws IOException {
TwoBitSequenceFile tbf = new TwoBitSequenceFile(sequenceFile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use try-with-resources


public class TwoBitSequenceFile implements ReferenceSequenceFile {
/** standard suffix of 2bit files */
public static final String SUFFIX = ".2bit";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this get indexed?

@lbergelson
Copy link
Member

@lindenb Sorry for the very slow response. I think this would be a great addition to hstjdk. We were actually reading 2-bit in gatk using the ADAM project implementation, so it would be nice to have an htsjdk built in.

I did see you have an error due to a missing file at the moment:

- testBit2File *** FAILED ***
  htsjdk.samtools.SAMException: Error opening .2bit : src/test/resources/htsjdk/samtools/reference/Homo_sapiens_assembly18.trimmed.2bit
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:134)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:83)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactoryTests.testBit2File(ReferenceSequenceFileFactoryTests.java:49)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:124)
  at org.testng.internal.Invoker.invokeMethod(Invoker.java:583)
  ...
  Cause: java.io.FileNotFoundException: src/test/resources/htsjdk/samtools/reference/Homo_sapiens_assembly18.trimmed.2bit (No such file or directory)
  at java.io.RandomAccessFile.open0(Native Method)
  at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
  at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
  at htsjdk.samtools.seekablestream.SeekableFileStream.<init>(SeekableFileStream.java:47)
  at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:103)
  at htsjdk.samtools.seekablestream.SeekableStreamFactory$DefaultSeekableStreamFactory.getStreamFor(SeekableStreamFactory.java:77)
  at htsjdk.samtools.reference.TwoBitSequenceFile.<init>(TwoBitSequenceFile.java:151)
  at htsjdk.samtools.reference.TwoBitSequenceFile.<init>(TwoBitSequenceFile.java:146)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:132)
  at htsjdk.samtools.reference.ReferenceSequenceFileFactory.getReferenceSequenceFile(ReferenceSequenceFileFactory.java:95)
  ...

@lindenb
Copy link
Contributor Author

lindenb commented Sep 17, 2019

@yfarjoun @lbergelson thank you for having a look at this. I'm still thinking of really integrating this in htsjdk: Reading '.2bit' files remains slow compared to plain fasta files :

e.g: I tested reading 1E6 random sequences using 3 referencesequencefile (using the same random seed).


t=23731 secondes class=class htsjdk.samtools.reference.TwoBitSequenceFile
t=253126 secondes class=class htsjdk.samtools.reference.BlockCompressedIndexedFastaSequenceFile
t=5705 secondes class=class htsjdk.samtools.reference.IndexedFastaSequenceFile 

I'm don't know well the APIs for decoding bits, may be my implementation could be improved.

Furthermore '2bit' are not standard in the HTS workflows, and it doesn't fit well with classes.

But they can be (slowly) accessed over http.
Nevertheless, I will try to fix the current errors.

@yfarjoun yfarjoun added the Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond label Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Waiting for revisions This PR has received comments from reviewers and is waiting for the Author to respond
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants