Author Topic: combine sort multiple compressed files?  (Read 5645 times)

parlar

  • Newbie
  • *
  • Posts: 2
combine sort multiple compressed files?
« on: October 01, 2015, 10:00:30 AM »
I have an application where I need to store sequencing coverage information for a large and continually growing number of experiments. I need to merge the new coverage info with the old.

I have been thinking that using a proper database might be the best option for scalability (as opposed to relying on tabix-indexed files). However, I want to keep it simple, and sort-bed is quite efficient. What I would like to do is keep the coverage info for the different samples/experiments as compressed bed files, and then use sort-bed to merge/sort them into one big file every time new data comes in.

Q1. Is it possible to sort/merge multiple compressed (to save space) bed files somehow using sort-bed?
Q2. Is anyone aware of a more clever strategy? In principle, one might think that combining/joining a small sorted bed file with a big sorted bed file could be done much faster than just "catting" the two files together and sorting the whole lot?


AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: combine sort multiple compressed files?
« Reply #1 on: October 01, 2015, 11:00:26 AM »
If you use the Starch compression format (http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starch.html), then you can use starchcat to merge Starch files. I explain below why that might be useful for you.

First, here's a link to the starchcat documentation for an overview: http://bedops.readthedocs.org/en/latest/content/reference/file-management/compression/starchcat.html

Starchcat operates most efficiently when records are merged from starch files for separate chromosomes.

As an example, let's say we start with 10 starch files, each containing records for exactly one of chromosomes 1 through 10. In other words, each starch file only contains records for its own chromosome: file 1A has records only for chr1, file 2A for chr2, file 3A for chr3, and so on. The starch and unstarch tools compress and extract BED data:

$ starch 1A.bed > 1A.starch
$ unstarch 1A.starch > 1A.bed

You want to merge these ten starch files into one starch file, to make a final product. You can use starchcat to do this.

$ starchcat 1A.starch 2A.starch ... 10A.starch > 1A_through_10A.starch

Because each file contains records specific to a chromosome, starchcat knows that it can merge the compressed data as-is, without any costly extraction and re-compression. It just copies over the compressed chromosomes to the output.

That's the simple case. Let's say you do some sequencing work and only need to update data in chromosome 4. You have a starch file called 4B.starch. So you use starchcat again.
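A minimal sketch of that call (4B.starch holds the new chromosome 4 data, as above; the output filename is just illustrative):

$ starchcat 1A.starch 2A.starch 3A.starch 4A.starch 4B.starch 5A.starch ... 10A.starch > updated_1A_through_10A.starch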

This time, this operation will copy compressed data as-is -- just the raw bytes -- from unchanged chromosomes 1-3 and 5-10. However, on chromosome 4, starch files 4A and 4B are extracted, a merge sort is performed on BED elements from 4A and 4B, and the merged data are written to a newly-compressed chromosome 4. The only extra work done is on chromosome 4.

By segregating work by chromosome, and using starchcat to recompress only the updated records from a changed chromosome, you can build and manage an updated final product with much less work (time) than recompressing the entire BED file.

Also, if you have a computational cluster (such as Grid Engine or AWS instances), you can easily parallelize the work by segregating it by chromosome.

You might farm out 10 compression tasks to 10 computational nodes, each node working on compressing records for one chromosome. When all 10 tasks are complete, you trigger a final collation task with starchcat to merge the 10 starch files into one final product. If you have disk space, you could trigger another task after that, to make a separate tabix file for convenience.
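As a rough local-shell sketch of that per-chromosome pattern (filenames are hypothetical, and on a real cluster each loop body would be a submitted job rather than a backgrounded process):

$ for chr in $(bedextract --list-chr all_samples.bed); do
    bedextract $chr all_samples.bed | starch - > $chr.starch &   # one compression task per chromosome
  done; wait
$ starchcat chr*.starch > all_samples.starch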

As a format, Starch is also generally more efficient than gzip or bzip2. It preprocesses BED data to reduce redundancy, before a second compression step. Given 28 GB of raw bedGraph, for example, bzip2 reduces this to about 24% of the original file size, while Starch gets to about 5% of the original file size.

As a second alternative to sort-bed, you could also use "bedops -u" (or "bedops --everything") on as many BED files as you specify, which I believe internally performs a merge sort across all the specified files:

$ bedops -u A.bed B.bed C.bed ... N.bed > A_through_N.bed

I think this will run a little slower than sort-bed when sort-bed operates within system memory. sort-bed also has a mode where, if you specify a memory allocation, it uses a merge sort on temporary files; this runs slower but allows sorting of BED files larger than system memory:

$ sort-bed --max-mem 2G A.bed B.bed C.bed ... N.bed > A_through_N.bed
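For completeness, the in-memory form is the same command without the flag (same hypothetical filenames):

$ sort-bed A.bed B.bed C.bed ... N.bed > A_through_N.bed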

So, to recap, here are some approaches you could explore:

- starchcat on multiple Starch archives, no parallelization (copy of unchanged chromosomes, merge sort on changed chromosomes)
- starchcat on multiple Starch archives, with parallelization
- bedops -u/--everything on multiple BED files (merge sort)
- sort-bed on multiple BED files (quicksort, within system memory)
- sort-bed --max-mem on multiple BED files (merge sort)

Hopefully this helps. If you try out Starch and like it, let us know!
« Last Edit: October 01, 2015, 11:03:19 AM by AlexReynolds »

parlar

  • Newbie
  • *
  • Posts: 2
Re: combine sort multiple compressed files?
« Reply #2 on: October 02, 2015, 12:05:26 AM »
Brilliant, thank you for your elaborate answer!

Just another question: maybe I will not need to convert from starch -> bgzip + tabix index?
Since I will have nested features, I will need to use bedops --element-of for extraction. Do you know how fast this is for (compressed) bed files with millions of rows, as compared to using tabix chr:xxxx-yyyy? If the speed is comparable, there is really no need to convert to bgzip/tabix index. I could not find any benchmarks on this.

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: combine sort multiple compressed files?
« Reply #3 on: October 02, 2015, 04:51:23 PM »
The trade-off between speed and size is pretty straightforward within BEDOPS.  If query speed and/or file-merging speed is the main goal, then using raw BED files is the best approach.  If space saving is the main goal, then Starch is the best route.

BED:
When looking up some relatively small number of regions (say 30,000 or fewer), the fastest query speed by far happens with bedextract, and there is a way to use this tool even when there are fully-nested elements (see the note under "Retrieving elements which overlap target elements" at https://bedops.readthedocs.org/en/latest/content/reference/set-operations/bedextract.html#retrieving-elements-which-overlap-target-elements).
> bedextract a.bed b.bed

Just using bedops -e performs well too, and it's useful to use the --chrom flag when you can in order to speed up the retrieval of elements from a specific chromosome:
> bedops --chrom chr7 -e 10 a.bed b.bed

The bedops program (for the -e operation) actually merges elements in b.bed above.  You could use bedmap instead if this is an issue, though it will be a bit slower:
> bedmap --echo --chrom chr7 --bp-ovr 10 --skip-unmapped a.bed b.bed

(--faster makes bedmap much faster when it's applicable, but it is not applicable if you have fully-nested elements).

When unionizing bed files, it's best to sort things up front once and then use bedops -u each time you get new files to merge:
> bedops -u current-repository.bed new-file1.sorted.bed new-file2.sorted.bed ... > new-repository.bed

Using sort-bed does work for unionizing information across files, but it will not scale as well as bedops -u.

Starch:
The bedextract option is not applicable here.

The bedops and bedmap options still apply exactly as above, and you may continue to use the --chrom option to help speed things up.  The files are compressed, so there is a bit more work, and these options are necessarily slower here than when used with raw bed files.

For merging information, Alex already showed the main applications for starchcat.  You may also use bedops -u with Starch file inputs.  Using starchcat is much faster here if all inputs contain info from separate chromosomes as he described.  This is probably most applicable when you are gluing together results after operating on things per chromosome on a cluster or multiple processors.
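As a sketch of that last option (filenames hypothetical; bedops writes plain BED to standard output, so pipe it back through starch if you want a compressed result):

> bedops -u old-repository.starch new-file1.starch new-file2.starch | starch - > new-repository.starch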

I don't use SAMTools so I can't speak to relative speeds.  If speed is your main interest, then I recommend the bedextract trick for queries along with bedops -u when merging new+old files.  If space savings is the main goal, Starch is a no-brainer, and query speeds are still pretty decent with the --chrom option in bedops/bedmap.
« Last Edit: October 05, 2015, 04:28:52 PM by sjn »