If you use the Starch compression format (
http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starch.html) then you can use starchcat to merge Starch files. I explain below why that might be useful for you.
First, here's a link to the starchcat documentation for an overview:
http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starchcat.html
Starchcat operates most efficiently when it merges records from Starch files that each contain a separate chromosome.
As an example, let's say we start with ten Starch files covering chromosomes 1 through 10, where each file contains records for exactly one chromosome: file 1A has records only for chr1, file 2A only for chr2, file 3A only for chr3, and so on. As a reminder, starch and unstarch compress and extract BED data, respectively:
$ starch 1A.bed > 1A.starch
$ unstarch 1A.starch > 1A.bed
You want to merge these ten Starch files into one Starch file, to make a final product. You can use starchcat to do this:
$ starchcat 1A.starch 2A.starch ... 10A.starch > 1A_through_10A.starch
Because each file contains records specific to a chromosome, starchcat knows that it can merge the compressed data as-is, without any costly extraction and re-compression. It just copies over the compressed chromosomes to the output.
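If you want to confirm which chromosomes ended up in the merged archive, unstarch can report an archive's contents; to the best of my recollection this is done with the --list option (the file name here is from the example above):

```shell
# List the chromosomes (and per-chromosome metadata) stored in the archive
$ unstarch --list 1A_through_10A.starch
```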
That's the simple case. Let's say you do some sequencing work and only need to update data in chromosome 4. You have a starch file called 4B.starch. So you use starchcat again.
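Assuming 4B.starch is passed alongside the original ten archives (file names here follow the example above), the invocation might look like this:

```shell
# Both 4A and 4B are given as inputs; starchcat will merge-sort their
# chr4 elements, while all other chromosomes are copied over untouched.
$ starchcat 1A.starch 2A.starch 3A.starch 4A.starch 4B.starch \
            5A.starch 6A.starch 7A.starch 8A.starch 9A.starch 10A.starch \
            > updated.starch
```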
This time, starchcat copies compressed data as-is -- just the raw bytes -- from the unchanged chromosomes 1-3 and 5-10. For chromosome 4, however, the archives 4A and 4B are extracted, a merge sort is performed on their BED elements, and the merged data are written out as a newly-compressed chromosome 4. The only extra work done is on chromosome 4.
By segregating work by chromosome, and using starchcat to recompress only the records from a changed chromosome, you can build and maintain an updated final product in much less time than recompressing the entire BED file.
Also, if you have a computational cluster (such as Grid Engine or a set of AWS instances), you can easily parallelize this per-chromosome work.
You might farm out 10 compression tasks to 10 computational nodes, each node working on compressing records for one chromosome. When all 10 tasks are complete, you trigger a final collation task with starchcat to merge the 10 starch files into one final product. If you have disk space, you could trigger another task after that, to make a separate tabix file for convenience.
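As a rough sketch of that workflow on a single multi-core host (using plain background jobs instead of a cluster scheduler; the per-chromosome input names chr1.bed through chr10.bed are hypothetical):

```shell
#!/usr/bin/env bash
# Compress each per-chromosome BED file as a separate background job.
for i in $(seq 1 10); do
    starch "chr${i}.bed" > "chr${i}.starch" &
done
wait  # block until all ten compression jobs finish

# Final collation step: merge the per-chromosome archives into one product.
starchcat chr1.starch chr2.starch chr3.starch chr4.starch chr5.starch \
          chr6.starch chr7.starch chr8.starch chr9.starch chr10.starch \
          > final.starch
```

On a real cluster, each loop body would instead become a submitted job, with the starchcat step run as a dependent collation task once all ten jobs complete.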
As a format, Starch is also generally more efficient than gzip or bzip2. It preprocesses BED data to reduce redundancy, before a second compression step. Given 28 GB of raw bedGraph, for example, bzip2 reduces this to about 24% of the original file size, while Starch gets to about 5% of the original file size.
As a second alternative to sort-bed, you could also use "bedops -u" (long form: "bedops --everything") on as many BED files as you specify, which I believe internally performs a merge sort across all the specified files:
$ bedops -u A.bed B.bed C.bed ... N.bed > A_through_N.bed
I think this will run a little slower than sort-bed, when sort-bed operates within system memory. Sort-bed has a mode where, if you specify a memory allocation, it will use a merge sort on temporary files, which will run slower but allow sorting of BED files larger than system memory:
$ sort-bed --max-mem 2G A.bed B.bed C.bed ... N.bed > A_through_N.bed
So, to recap, here are some approaches you could explore:
- starchcat on multiple Starch archives, no parallelization (copy of unchanged chromosomes, merge sort on changed chromosomes)
- starchcat on multiple Starch archives, with parallelization
- bedops -u/--everything on multiple BED files (merge sort)
- sort-bed on multiple BED files (quicksort, within system memory)
- sort-bed --max-mem on multiple BED files (merge sort)
Hopefully this helps. If you try out Starch and like it, let us know!