With high-throughput sequencing generating large amounts of genomic data, archiving can be a critical part of an analysis toolkit. The starch and unstarch tools provide methods for efficient compression and extraction of UCSC BED-formatted data.
Both starch and unstarch provide large file support (http://en.wikipedia.org/wiki/Large_file_support) on 64-bit operating systems, enabling compression and extraction of more than 2 GB of data, a size limit commonly imposed on 32-bit systems.
Data can be stored with one of two backend compression methods, either bzip2 or gzip, providing a reasonable tradeoff between speed and storage performance that can be useful for very large archives. In general, bzip2 produces smaller archives, while gzip compresses and extracts faster.
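As a rough sketch of choosing a backend, the Python helper below builds and runs a starch command line. It assumes that starch is on the PATH, that the backend is selected with the --bzip2 or --gzip option, that the archive is written to standard output, and that the input BED file is already sorted. The function names and file paths are hypothetical, so check the usage information for the installed version before relying on them.

```python
import subprocess


def starch_command(bed_path, backend="bzip2"):
    """Build a starch command line for the chosen backend (flags are assumed)."""
    if backend not in ("bzip2", "gzip"):
        raise ValueError(f"unsupported backend: {backend}")
    # Assumption: starch selects its backend with --bzip2 or --gzip.
    return ["starch", f"--{backend}", bed_path]


def compress(bed_path, archive_path, backend="bzip2"):
    """Run starch and save the archive it writes to stdout at archive_path."""
    with open(archive_path, "wb") as out:
        subprocess.run(starch_command(bed_path, backend), stdout=out, check=True)
```

Keeping the command construction in its own function makes it easy to inspect the exact invocation before any data is compressed.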
The unstarch utility can extract data from the entire archive or directly retrieve a specified chromosome from within the archive. This ability to jump directly to the data of interest makes it well suited to analysis pipelines where work can be parallelized: the end user can extract each chromosome on a different node of a computational cluster, which can give significant improvements in extraction time over common compression alternatives.
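The per-chromosome extraction described here might be sketched as follows, using local worker threads to stand in for cluster nodes. The sketch assumes that unstarch is on the PATH and that a single chromosome is extracted to standard output with "unstarch <chromosome> <archive>". The helper names, output layout, and worker count are hypothetical choices made for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def unstarch_command(archive, chromosome):
    """Build the command that extracts one chromosome (assumed argument order)."""
    return ["unstarch", chromosome, str(archive)]


def extract_one(archive, chromosome, out_dir):
    """Extract a single chromosome to <out_dir>/<chromosome>.bed."""
    out_path = Path(out_dir) / f"{chromosome}.bed"
    with open(out_path, "wb") as out:
        subprocess.run(unstarch_command(archive, chromosome), stdout=out, check=True)
    return out_path


def extract_all(archive, chromosomes, out_dir, workers=4):
    """Extract several chromosomes concurrently, one unstarch process each."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # Threads suffice here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: extract_one(archive, c, out_dir), chromosomes))
```

On a real cluster, each node would more likely run the single-chromosome command directly as its own job, with the scheduler taking the place of the thread pool.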
The unstarch utility can also output a text- or JSON-formatted summary of the archive's metadata, which can be useful for downstream analysis and bookkeeping.
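For bookkeeping, the JSON summary could be loaded into a script along these lines. This is a minimal sketch that assumes unstarch is on the PATH and prints the JSON summary to standard output when given a --list-json option. The function names are hypothetical, and no particular layout of the JSON document is assumed.

```python
import json
import subprocess


def metadata_command(archive):
    """Build the command that prints archive metadata as JSON (assumed flag)."""
    return ["unstarch", "--list-json", archive]


def parse_metadata(text):
    """Parse the JSON summary into Python objects without assuming its schema."""
    return json.loads(text)


def read_metadata(archive):
    """Run unstarch and return the parsed metadata for the given archive."""
    result = subprocess.run(
        metadata_command(archive), capture_output=True, text=True, check=True
    )
    return parse_metadata(result.stdout)
```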
Visit the Google Code site for usage information and examples:
http://code.google.com/p/bedops/wiki/starchAndUnstarch