
AlexReynolds

About starch and unstarch
« on: July 28, 2011, 05:29:47 PM »
With high-throughput sequencing generating large amounts of genomic data, archiving can be a critical part of an analysis toolkit. The starch and unstarch tools provide methods for efficient compression and extraction of UCSC BED-formatted data.

Both starch and unstarch provide large file support (http://en.wikipedia.org/wiki/Large_file_support) on 64-bit operating systems, enabling compression and extraction of more than 2 GB of data. A 2 GB file size limit is a common restriction on 32-bit systems.

Data can be stored with one of two backend compression methods, bzip2 or gzip. bzip2 typically yields smaller archives, while gzip is typically faster, so the user can trade speed against storage size, which can be useful for very large archives.
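To get a rough feel for the size side of that tradeoff, here is a minimal sketch that compresses the same synthetic BED-like text with Python's standard bz2 and gzip modules. It does not use starch itself, and the helper names (make_bed_lines, compare_backends) and the generated intervals are made up for illustration, so real archives will behave differently.

```python
import bz2
import gzip
import random


def make_bed_lines(n, seed=0):
    """Generate n sorted, synthetic BED-like lines on chr1 (illustrative data only)."""
    rng = random.Random(seed)
    lines = []
    start = 0
    for i in range(n):
        start += rng.randint(1, 500)
        end = start + rng.randint(1, 200)
        lines.append(f"chr1\t{start}\t{end}\tid-{i}\n")
    return "".join(lines).encode("ascii")


def compare_backends(raw):
    """Return the byte counts of the raw input and of each compressed form."""
    return {
        "raw": len(raw),
        "bzip2": len(bz2.compress(raw)),
        "gzip": len(gzip.compress(raw)),
    }


if __name__ == "__main__":
    data = make_bed_lines(50_000)
    for name, size in compare_backends(data).items():
        print(f"{name}\t{size}")
```

Timing each call (for example with time.perf_counter) on your own data would show the speed side of the comparison.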

The unstarch utility can extract data from the entire archive or directly retrieve a specified chromosome from within the archive. This ability to jump directly to the data of interest makes it well-suited to analysis pipelines where work can be parallelized: the end user can extract each chromosome on a different node of a computational cluster, which can substantially reduce extraction time compared with common compression alternatives that typically have to be read through from the start of the file.
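The idea behind per-chromosome retrieval can be illustrated with a toy sketch. This is not the actual starch archive format or the unstarch implementation; it only assumes that each chromosome is compressed as an independent stream and that an index records where each stream lives. The function names are hypothetical, and a thread pool stands in for the separate cluster nodes.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def build_archive(bed_text):
    """Compress each chromosome's lines as its own bzip2 stream.

    Returns (blob, index), where index maps chromosome -> (offset, length)
    of that chromosome's stream inside blob. Toy layout, not the starch format.
    """
    per_chrom = {}
    for line in bed_text.splitlines(keepends=True):
        if not line.strip():
            continue
        chrom = line.split("\t", 1)[0]
        per_chrom.setdefault(chrom, []).append(line)

    blob = bytearray()
    index = {}
    for chrom, lines in per_chrom.items():
        stream = bz2.compress("".join(lines).encode("utf-8"))
        index[chrom] = (len(blob), len(stream))
        blob += stream
    return bytes(blob), index


def extract_chromosome(blob, index, chrom):
    """Decompress only the requested chromosome's stream."""
    offset, length = index[chrom]
    return bz2.decompress(blob[offset:offset + length]).decode("utf-8")


def extract_all_parallel(blob, index, workers=4):
    """Extract every chromosome concurrently; each task touches one stream."""
    chroms = list(index)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda c: extract_chromosome(blob, index, c), chroms)
        return dict(zip(chroms, results))
```

Because each stream is independent, extracting one chromosome never requires decompressing the others, which is what lets the work be split across workers.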

The unstarch utility can also output a text- or JSON-formatted summary of the archive's metadata, which can be useful for downstream analysis and bookkeeping.

Visit the Google Code site for usage information and examples: http://code.google.com/p/bedops/wiki/starchAndUnstarch
« Last Edit: July 28, 2011, 05:53:18 PM by AlexReynolds »