refresh-bio/DSRC

DSRC


DSRC is a toolkit for efficient, high-performance compression of sequencing reads stored in FASTQ format. Its main features are:

  • Effective multithreaded compression of FASTQ files.
  • Full support for Illumina, ABI SOLiD, and 454/Ion Torrent dataset formats, including non-standard IUPAC base values (symbols other than AGCTN).
  • Support for lossy compression of quality values using the Illumina binning scheme.
  • Support for lossy compression of IDs, keeping only the key fields selected by the user.
  • Pipe support for easy integration with existing pipelines.
  • Python and C++ libraries for integrating DSRC archives into your own applications.
  • Availability for Linux, Mac OSX and Windows 64-bit operating systems.
  • Open source C++ code under the GNU GPL 2 license.

Building

Build prerequisites

Linux

The DSRC binaries and C++ library can be compiled in two ways, depending on which multithreading library is used; a separate makefile is provided for each. The first uses the boost::threads library, which must be present on the build system. The second uses C++11 threads and requires a g++ compiler with C++11 support (version >= 4.8).

By default, the binaries and libraries are compiled with g++; however, compiling with Clang or Intel icpc should also succeed without problems.

Mac OSX

On Mac OSX, the Clang compiler is used with C++11 support, so make sure Clang version >= 3.3 is installed.

Windows

To compile DSRC under Windows, Microsoft Visual Studio 2010 or 2012 is required. The DSRC binaries and C++ library can be compiled in two ways, depending on which multithreading library is used; a separate VS solution file is provided for each. With VS2010, the boost::threads library provides multithreading support, so make sure boost::threads is installed and the boost library paths are properly configured in Visual Studio. With VS2012, the C++11 standard implementation provides threading support.

Compiling DSRC with MinGW-32-x64 using the provided makefiles should also work without problems.

Python library

To build the DSRC Python library, the development version of the boost::python library and the boost::build tool bjam must be present on the system. Then, specify the local boost installation directory in the Jamroot configuration file in the py directory:

# To compile DSRC Python module please specify your boost installation directory below
#
use-project boost 
	: /absolute/path/to/boost/directory/ ;

The Python library is built with the default compilation toolset available on the build platform (selected automatically by bjam). To specify a different one, append

<toolset>name

to the compilation flags, as explained in the Jamroot file:

# Specify toolset according to your platform manually in case of compilation problems in form: '<toolset>gcc'
# Available toolsets:
#	- Windows: msvc-*
#	- Linux: gcc, clang
#	- Mac OSX: darwin, gcc
	: <variant>release <address-model>64 <link>shared <runtime-link>shared <debug-symbols>off <inlining>full <optimization>speed <warnings>on <cxxflags>"-O2 -m64 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DUSE_BOOST_THREAD" ;

Building on Linux

Binary

To compile DSRC using boost::threads with static linking, in the main directory type:

make bin

To compile DSRC using g++ >= 4.8 with c++11 standard and dynamic linking:

make -f Makefile.c++11 bin

The resulting dsrc binary will be placed in bin subdirectory.

C++ library

To compile the DSRC C++ library using boost::threads:

make lib

To compile the DSRC C++ library using g++ >= 4.8 with the C++11 standard:

make -f Makefile.c++11 lib

The resulting libdsrc.a library will be placed in lib subdirectory.

Python library

To compile DSRC Python library:

make pylib

The resulting pydsrc.so library will be available in py subdirectory.

Building on Mac OSX

Binary

To compile DSRC binary, in the main directory type:

make -f Makefile.osx bin

The resulting dsrc binary will be placed in bin subdirectory.

C++ library

To compile DSRC C++ library:

make -f Makefile.osx lib

The resulting libdsrc.a library will be placed in lib subdirectory.

Python library

To compile DSRC Python library:

make -f Makefile.osx pylib

The resulting pydsrc.so library will be available in py subdirectory.
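
The make invocations listed for Linux and Mac OSX follow a simple pattern, which the sketch below collects in one place. The helper name `make_command` and its parameters are hypothetical and not part of DSRC; it only assembles the command lines documented above and does not run them. The README documents no C++11 makefile variant for the Python library on Linux, so the sketch assumes the default makefile for that target.

```python
import platform


def make_command(target, system=None, use_cxx11=False):
    """Return the documented make command for a DSRC build target.

    target    -- "bin", "lib" or "pylib"
    system    -- "Linux" or "Darwin"; defaults to the current platform
    use_cxx11 -- on Linux, prefer Makefile.c++11 (g++ >= 4.8) over boost::threads
    """
    if target not in ("bin", "lib", "pylib"):
        raise ValueError("target must be 'bin', 'lib' or 'pylib'")
    system = system or platform.system()
    if system == "Darwin":
        return ["make", "-f", "Makefile.osx", target]
    if system == "Linux":
        # Assumption: only "make pylib" is documented for the Python library,
        # so the C++11 makefile is used for "bin" and "lib" only.
        if use_cxx11 and target != "pylib":
            return ["make", "-f", "Makefile.c++11", target]
        return ["make", target]
    raise ValueError("no makefile-based build is documented for " + system)


print(" ".join(make_command("bin", system="Linux", use_cxx11=True)))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the main DSRC directory.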

Building on Windows 64-bit

Binary

To compile DSRC using Visual Studio 2010 with boost::threads for multithreading support, use the dsrc20-vs2k10.sln solution file. To compile DSRC using Visual Studio 2012 with C++11 threads, use dsrc20-vs2k12.sln instead.

To compile DSRC executable, select Release|x64 configuration and build.

The resulting dsrc.exe executable will be placed in bin subdirectory.

C++ library

To compile DSRC library, select Release Lib|x64 configuration and build.

The resulting dsrc.lib library will be placed in lib subdirectory.

Python library

To compile the DSRC Python library, type the following in the py subdirectory:

bjam

The resulting pydsrc.pyd library will be available in py subdirectory.

Usage

DSRC can be run from the command prompt:

dsrc <c|d> [options] <input_file_name> <output_file_name>

in one of two modes:

  • c — compression,
  • d — decompression.
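
When DSRC is called from a script, the command line above can be assembled programmatically. The following is a minimal sketch, not part of DSRC: the function name `build_dsrc_command` is hypothetical, and it only produces the argument list in the documented order (mode, options, input file, output file).

```python
def build_dsrc_command(mode, input_file, output_file, options=()):
    """Return the argv list for: dsrc <c|d> [options] <input> <output>."""
    if mode not in ("c", "d"):
        raise ValueError("mode must be 'c' (compression) or 'd' (decompression)")
    # Options come right after the mode and before the file names.
    return ["dsrc", mode, *options, input_file, output_file]


cmd = build_dsrc_command("c", "SRR001471.fastq", "SRR001471.dsrc", ["-t4"])
print(" ".join(cmd))
```

Assuming the dsrc binary is on the PATH, the resulting list can be handed to `subprocess.run(cmd, check=True)`.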

Available options

Compression options

  • -d<n> — DNA compression mode: 0–3, default: 0
  • -q<n> — Quality compression mode: 0–2, default: 0
  • -f<1,...> — keep only the listed field numbers from the ID string, default: (keep all)
  • -b<n> — FASTQ input buffer size in MB, default: 8
  • -m<n> — Automated compression mode (one of three preset combinations of the other parameters): 0–2
  • -o<n> — Quality offset, 0 for auto selection, default: 0
  • -l — use Quality lossy mode (Illumina binning scheme), default: false
  • -c — calculate and check a CRC32 checksum per block (slows compression down by about a factor of two), default: false
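
To give a feel for what the lossy quality mode (-l) implies, the sketch below applies Illumina's published 8-level quality binning to a quality string. This is an illustration only, under several assumptions: the bin boundaries are taken from Illumina's general description of the scheme rather than from the DSRC source, scores 0 and 1 are left unchanged, and a Phred+33 offset is assumed. The exact mapping used by DSRC may differ.

```python
# Illumina 8-level binning: (lowest score, highest score, replacement score).
# Assumption: these boundaries come from Illumina's published scheme,
# not from the DSRC implementation.
ILLUMINA_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def bin_quality(score):
    """Map a single Phred score to its bin representative."""
    if score >= 40:
        return 40
    for low, high, representative in ILLUMINA_BINS:
        if low <= score <= high:
            return representative
    return score  # scores 0 and 1 are left unchanged (assumption)


def bin_quality_string(qualities, offset=33):
    """Apply the binning to a FASTQ quality line (Phred+33 assumed by default)."""
    return "".join(chr(bin_quality(ord(ch) - offset) + offset) for ch in qualities)


print(bin_quality_string("5I"))
```

Because many distinct scores collapse onto a few representatives, the quality stream becomes more repetitive and therefore easier to compress, at the cost of losing the original values.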

Automated compression modes

  • -m0 — fast mode, equivalent to: -d0 -q0 -b8
  • -m1 — medium mode, equivalent to: -d2 -q2 -b64
  • -m2 — best mode, equivalent to: -d3 -q2 -b256
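
The equivalences above can be written down as a small lookup table, which is handy when a wrapper script needs to log the effective settings. This is a sketch; the names `PRESETS` and `expand_preset` are hypothetical and not part of DSRC.

```python
# Preset number -> equivalent explicit options, as listed in this README.
PRESETS = {
    0: ["-d0", "-q0", "-b8"],     # fast mode
    1: ["-d2", "-q2", "-b64"],    # medium mode
    2: ["-d3", "-q2", "-b256"],   # best mode
}


def expand_preset(option):
    """Translate '-m<n>' into its explicit options; return other options unchanged."""
    if option.startswith("-m") and option[2:].isdigit():
        level = int(option[2:])
        if level not in PRESETS:
            raise ValueError("automated compression mode must be 0, 1 or 2")
        return list(PRESETS[level])
    return [option]


print(expand_preset("-m1"))
```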

Options for both compression and decompression

  • -t<n> — number of processing threads, default: max available hardware threads
  • -s — use stdin/stdout for reading/writing raw FASTQ data (stderr is used for info/warning messages)

Usage examples

Compress SRR001471.fastq file saving DSRC archive to SRR001471.dsrc:

dsrc c SRR001471.fastq SRR001471.dsrc

Compress file in the fast mode with CRC32 checking and using 4 threads:

dsrc c -m0 -c -t4 SRR001471.fastq SRR001471.dsrc

Compress file using DNA and Quality compression level 2 and a 512 MB buffer:

dsrc c -d2 -q2 -b512 SRR001471.fastq SRR001471.dsrc

Compress file in the best mode with lossy Quality mode, preserving only fields 1–4 from the record IDs:

dsrc c -m2 -l -f1,2,3,4 SRR001471.fastq SRR001471.dsrc

Compress in the best mode reading raw FASTQ data from stdin:

cat SRR001471.fastq | dsrc c -m2 -s SRR001471.dsrc

Decompress SRR001471.dsrc archive saving output FASTQ file to SRR001471.out.fastq:

dsrc d SRR001471.dsrc SRR001471.out.fastq

Decompress archive using 4 threads and streaming raw FASTQ data to stdout:

dsrc d -t4 -s SRR001471.dsrc > SRR001471.out.fastq
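
The two streaming examples can also be driven from Python through the standard subprocess module, which may be convenient inside an existing pipeline. The sketch below assumes the dsrc binary is on the PATH; all function names are hypothetical, and only the command-line options documented above are used. It calls the command-line tool and does not use the DSRC Python library.

```python
import subprocess


def compress_command(archive_path, options=("-m2",)):
    """Command that reads raw FASTQ from stdin and writes a DSRC archive."""
    return ["dsrc", "c", *options, "-s", archive_path]


def decompress_command(archive_path, threads=4):
    """Command that reads a DSRC archive and writes raw FASTQ to stdout."""
    return ["dsrc", "d", "-t{}".format(threads), "-s", archive_path]


def compress_bytes(fastq_data, archive_path):
    """Feed FASTQ bytes to dsrc on stdin (requires dsrc to be installed)."""
    subprocess.run(compress_command(archive_path), input=fastq_data, check=True)


def decompress_bytes(archive_path, threads=4):
    """Return the decompressed FASTQ bytes read from dsrc's stdout."""
    result = subprocess.run(
        decompress_command(archive_path, threads),
        stdout=subprocess.PIPE,  # info/warning messages stay on stderr
        check=True,
    )
    return result.stdout


print(" ".join(decompress_command("SRR001471.dsrc")))
```

For very large files, holding the whole FASTQ in memory is impractical; in that case passing an open file object as `stdin` or `stdout` to `subprocess.run` avoids buffering the data in Python.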

Citing

Roguski, L., Deorowicz, S. (2014) DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 30(15):2213–2215.

Deorowicz, S., Grabowski, Sz. (2011) Compression of DNA sequences in FASTQ format, Bioinformatics, 27(6):860–862.