Compression - Decompression : Problem with large file (output bigger) #4

Open
rick-heig opened this issue Jan 21, 2022 · 1 comment

Comments

@rick-heig

Hello.

I tried compressing a large BCF file that I use as a reference, and something seems to go wrong: the decompressed file differs from the input file (it is much larger).

I use the following command to compress:

gtshark compress-db mybcf.bcf compressed

and the following command to decompress:

gtshark decompress-db -b compressed decompressed_bcf.bcf

My BCF file has 1,000,000 diploid phased samples and 2,271,035 variant entries.
The file is roughly 10 GB.

Compressing it with gtshark produced a 5.8 MB _db file and a 26 MB _gt file.
This seemed a bit suspicious because the files are so small.
The compression finished in about 12 hours with no error message or error code:

meta size: 60
header size: 516
samples size: 3068156
chrom size: 796
pos size: 1201908
id size: 796
ref size: 796
alt size: 796
qual size: 796
filter size: 1788
info size: 1736548

Processing time: 43270.2 seconds.

I then launched the decompression. It ran for more than a day, and the output BCF is more than 26 GB in size.
This seems off because the input file was about 10 GB. I checked that the output file is BCF (internally gzip-compressed).
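A format check of this kind could be sketched as follows. This is only an illustration, not the check that was actually run: the function name `looks_like_bcf` is hypothetical, and it assumes that a compressed BCF file is readable as gzip (BGZF is a series of gzip members) and that its payload begins with the bytes `BCF` followed by the major version 2.

```python
import gzip

# "BCF" followed by the major version byte 2 (the minor version byte comes next).
BCF_MAGIC_PREFIX = b"BCF\x02"


def looks_like_bcf(path):
    """Return True if `path` is gzip/BGZF data whose payload starts with the BCF2 magic."""
    try:
        with gzip.open(path, "rb") as handle:
            head = handle.read(5)  # "BCF" + major version + minor version
    except (OSError, EOFError):
        # Not gzip data, or the stream ended before the header could be read.
        return False
    return len(head) == 5 and head.startswith(BCF_MAGIC_PREFIX)
```

A `True` result only says that the first bytes look right. It says nothing about whether the records that follow are intact.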

The output of the software was:

Opening file of size: 6013013
Opening file of size: 27086378
2271035
Processing time: 129542 seconds.

No error messages or anything.
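The figures reported above can be cross-checked with some simple arithmetic. The sketch below uses only numbers from this report. It assumes that the first "Opening file" line refers to the _db file and the second to the _gt file, which the logs do not state explicitly.

```python
# Component sizes in bytes, copied from the compression log.
component_sizes = {
    "meta": 60, "header": 516, "samples": 3068156, "chrom": 796,
    "pos": 1201908, "id": 796, "ref": 796, "alt": 796,
    "qual": 796, "filter": 1788, "info": 1736548,
}

# File sizes in bytes from the decompression log
# (assumed order: _db first, then _gt).
db_file_bytes = 6013013
gt_file_bytes = 27086378

# Do the logged components account for the _db file?
components_total = sum(component_sizes.values())
unaccounted_bytes = db_file_bytes - components_total

# Number of haplotype alleles in the input: samples x variants x ploidy.
samples, variants, ploidy = 1_000_000, 2_271_035, 2
alleles = samples * variants * ploidy

# Average archive bits spent per allele (both files together).
archive_bytes = db_file_bytes + gt_file_bytes
bits_per_allele = archive_bytes * 8 / alleles

# Run times converted from seconds to hours.
compress_hours = 43270.2 / 3600
decompress_hours = 129542 / 3600

print(f"components total: {components_total} bytes")
print(f"unaccounted in _db: {unaccounted_bytes} bytes")
print(f"alleles: {alleles}")
print(f"bits per allele: {bits_per_allele:.3e}")
print(f"compression: {compress_hours:.1f} h, decompression: {decompress_hours:.1f} h")
```

The logged components sum to 6,012,956 bytes, which is 57 bytes less than the 6,013,013-byte file. Under the assumption above, the _db file is therefore consistent with the compression log. The arithmetic cannot show whether the _gt file is complete.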

Have you run gtshark on large BCF files (millions of samples × millions of variants)?

I am sorry that I cannot share the BCF file because of its size. I'll run some tests on the output file and keep you up to date on what I find.

Regards.
Rick

@rick-heig
Author

If I open the output file with bcftools view | less, I only get the first variant site and nothing after it.
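One further test that could be run on the output file is to check whether it ends with the 28-byte empty BGZF block that htslib-based writers append as an end-of-file marker. The sketch below is an assumption about a possibly useful diagnostic, not a known cause of this problem, and the function name `ends_with_bgzf_eof` is hypothetical.

```python
import os

# The 28-byte empty BGZF block used as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 42430200 1b00 0300 00000000 00000000"
)


def ends_with_bgzf_eof(path):
    """Return True if the last 28 bytes of `path` are the BGZF EOF marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to 28 bytes before the end and compare the tail with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

A missing marker would suggest that the file was truncated or not closed cleanly. A present marker does not rule out damaged records earlier in the file.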
