Proposal: handling UCSC 2bit sequences as ReferenceFile #1417
Should the 2bit FASTA format be added to hts-specs as well?
I think it's great to be able to read more formats. If you have a refactoring that would make doing that easier, feel free to refactor.
Regarding the format itself, perhaps it would make sense to see whether it should be moved into hts-specs?
Assert.assertEquals(seq.getName(), "chr20");
Assert.assertEquals(seq.length(), 1_000_000);

final String chrM_100_120="GGAGCCGGAGCACCCTATGTC";
It would be good to have a test that also checks the Ns and the masked regions.
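To make the Ns and masked regions concrete: per the UCSC description, a 2bit record stores N runs and soft-masked (lowercase) runs as separate block lists next to the packed bases. The sketch below overlays both on an already decoded string, which is the behaviour such a test would pin down. The class and method names are hypothetical and not taken from the PR, and applying N blocks before masking (so a masked N comes out as 'n') is an assumption.

```java
/** Hypothetical helper, not part of the PR: overlays 2bit N blocks and mask blocks. */
final class TwoBitBlocks {
    /**
     * Block coordinates are 0-based sequence positions;
     * 'offset' is the sequence position of bases.charAt(0).
     */
    static String apply(final String bases, final int offset,
                        final int[] nStarts, final int[] nSizes,
                        final int[] maskStarts, final int[] maskSizes) {
        final char[] out = bases.toCharArray();
        fill(out, offset, nStarts, nSizes, true);       // N runs first
        fill(out, offset, maskStarts, maskSizes, false); // then soft-masking
        return new String(out);
    }

    private static void fill(final char[] out, final int offset,
                             final int[] starts, final int[] sizes, final boolean toN) {
        for (int b = 0; b < starts.length; b++) {
            // Clip each block to the window covered by 'out'.
            final int from = Math.max(starts[b] - offset, 0);
            final int to = Math.min(starts[b] + sizes[b] - offset, out.length);
            for (int i = from; i < to; i++) {
                out[i] = toN ? 'N' : Character.toLowerCase(out[i]);
            }
        }
    }
}
```

A test could then compare a slice that spans an N run and a masked run against the same slice read from the equivalent FASTA file.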
@Test(dataProvider="homosapiens")
public void testOpenFile(final Path sequenceFile) throws IOException {
    TwoBitSequenceFile tbf = new TwoBitSequenceFile(sequenceFile);
use try-with-resources
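A minimal sketch of what that looks like. `StubSequenceFile` is a hypothetical stand-in for `TwoBitSequenceFile` so the example runs on its own; the only thing it relies on is that the reader is `Closeable`, as `ReferenceSequenceFile` is in htsjdk.

```java
import java.io.Closeable;

final class TryWithResourcesSketch {
    /** Hypothetical stand-in for TwoBitSequenceFile; only close() matters here. */
    static final class StubSequenceFile implements Closeable {
        boolean closed = false;
        String firstName() { return "chr20"; }
        @Override public void close() { closed = true; }
    }

    // Kept only so the caller can check that close() was invoked.
    static StubSequenceFile lastOpened;

    static String readFirstName() {
        try (StubSequenceFile tbf = new StubSequenceFile()) {
            lastOpened = tbf;
            return tbf.firstName();
        } // close() runs here, even if the body throws
    }
}
```

In the test above, the same pattern would wrap the `new TwoBitSequenceFile(sequenceFile)` call so the file handle is released when an assertion fails.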
public class TwoBitSequenceFile implements ReferenceSequenceFile {
    /** standard suffix of 2bit files */
    public static final String SUFFIX = ".2bit";
How does this get indexed?
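For context on the indexing question: according to the UCSC format description, a 2bit file needs no separate index file, because its header is followed by a table of sequence names and byte offsets. The sketch below reads that table from a `ByteBuffer`. The class name is hypothetical, and it assumes the original version 0 layout with 32-bit offsets; it is not the code from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: reads the name-to-offset table at the start of a 2bit file. */
final class TwoBitIndex {
    static final int SIGNATURE = 0x1A412743;

    static Map<String, Long> read(final ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        final int sig = buf.getInt();
        if (sig != SIGNATURE) {
            // A byte-swapped signature means the file was written big-endian.
            if (Integer.reverseBytes(sig) != SIGNATURE) {
                throw new IllegalArgumentException("Not a 2bit file");
            }
            buf.order(ByteOrder.BIG_ENDIAN);
        }
        final int version = buf.getInt();
        if (version != 0) {
            // Assumption: this sketch only handles version 0 (32-bit offsets).
            throw new IllegalArgumentException("Unsupported 2bit version " + version);
        }
        final int sequenceCount = buf.getInt();
        buf.getInt(); // reserved field

        final Map<String, Long> index = new LinkedHashMap<>();
        for (int i = 0; i < sequenceCount; i++) {
            final int nameSize = buf.get() & 0xFF;          // one unsigned byte
            final byte[] name = new byte[nameSize];
            buf.get(name);
            final long offset = buf.getInt() & 0xFFFFFFFFL; // unsigned 32-bit
            index.put(new String(name, StandardCharsets.US_ASCII), offset);
        }
        return index;
    }
}
```

Each offset points at the record for that sequence, so a reader can seek straight to a contig without scanning the file.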
@lindenb Sorry for the very slow response. I think this would be a great addition to htsjdk. We were actually reading 2bit in GATK using the ADAM project implementation, so it would be nice to have it built into htsjdk. I did see you have an error due to a missing file at the moment:
@yfarjoun @lbergelson thank you for having a look at this. I'm still thinking about whether to really integrate this into htsjdk: reading '.2bit' files remains slow compared to plain FASTA files. For example, I tested reading 1E6 random sequences using 3 ReferenceSequenceFile implementations (using the same random seed).
I don't know the APIs for decoding bits well, so my implementation could probably be improved. Furthermore, '2bit' files are not standard in HTS workflows, and the format doesn't fit well with classes. But they can be (slowly) accessed over HTTP.
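On the bit decoding: one common way to speed up unpacking is a 256-entry lookup table that maps each packed byte to its four bases, so the inner loop does no per-base shifting. This is a sketch under the UCSC encoding (T=00, C=01, A=10, G=11, first base in the high-order bits). The class name is hypothetical, and whether decoding is actually the bottleneck in this PR is an assumption, since I/O could dominate.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of htsjdk: table-driven decoding of packed 2bit DNA. */
final class TwoBitDecoder {
    private static final byte[] BASES = {'T', 'C', 'A', 'G'};
    // TABLE[b] holds the four bases encoded by the unsigned byte value b.
    private static final byte[][] TABLE = new byte[256][4];
    static {
        for (int b = 0; b < 256; b++) {
            for (int i = 0; i < 4; i++) {
                TABLE[b][i] = BASES[(b >> (6 - 2 * i)) & 3];
            }
        }
    }

    /** Decodes bases [start, end) (0-based, half-open) from the packed array. */
    static String decode(final byte[] packed, final int start, final int end) {
        final byte[] out = new byte[end - start];
        int pos = start;
        int o = 0;
        while (pos < end) {
            final byte[] four = TABLE[packed[pos >> 2] & 0xFF];
            int i = pos & 3; // position inside the current byte
            while (i < 4 && pos < end) {
                out[o++] = four[i++];
                pos++;
            }
        }
        return new String(out, StandardCharsets.US_ASCII);
    }
}
```

For example, the byte 0x1B is 00 01 10 11 in binary, so `TwoBitDecoder.decode(new byte[]{0x1B}, 0, 4)` yields "TCAG". N blocks and mask blocks would still have to be applied afterwards.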
Description
This is just a proposal, please don't commit.
This PR is a proposal to add support for 2bit FASTA sequences: https://genome.ucsc.edu/goldenpath/help/twoBit.html
I wrote TwoBitSequenceFile ( https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/TwoBitSequenceFile.java ), which implements ReferenceSequenceFile and reads 2bit files.
I modified https://github.com/lindenb/htsjdk/blob/pl_2bit/src/main/java/htsjdk/samtools/reference/ReferenceSequenceFileFactory.java#L133 to handle the new file extension.
Tell me if you're interested in this commit and I'll add more tests and fix the formatting.
PS: Another way to handle this kind of custom reference sequence would be to make the static methods of ReferenceSequenceFileFactory non-static and provide a way to set a default instance.
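A rough sketch of that PS idea, with hypothetical, simplified classes rather than the real ReferenceSequenceFileFactory: lookups become instance methods, a static default instance can be swapped, and a subclass adds the new extension. To stay self-contained, the method here returns only a format label instead of opening a file.

```java
import java.nio.file.Path;

/** Hypothetical, simplified stand-in for ReferenceSequenceFileFactory. */
class SequenceFileFactorySketch {
    private static volatile SequenceFileFactorySketch defaultInstance =
            new SequenceFileFactorySketch();

    static SequenceFileFactorySketch getDefault() { return defaultInstance; }

    static void setDefault(final SequenceFileFactorySketch factory) {
        defaultInstance = factory;
    }

    /** Instance method, so subclasses can recognise additional formats. */
    String formatOf(final Path path) {
        final String name = path.getFileName().toString();
        if (name.endsWith(".fa") || name.endsWith(".fasta")) return "fasta";
        throw new IllegalArgumentException("Unsupported reference: " + name);
    }
}

/** Hypothetical subclass that adds the .2bit suffix and defers everything else. */
class TwoBitAwareFactorySketch extends SequenceFileFactorySketch {
    @Override
    String formatOf(final Path path) {
        if (path.getFileName().toString().endsWith(".2bit")) return "2bit";
        return super.formatOf(path);
    }
}
```

A caller would register the extended factory once with `SequenceFileFactorySketch.setDefault(new TwoBitAwareFactorySketch())`, and existing code that goes through `getDefault()` would pick up the new format without further changes.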