Author Topic: total coverage  (Read 4969 times)

melissa

  • Newbie
  • *
  • Posts: 7
total coverage
« on: December 10, 2013, 07:08:21 AM »
Hi,
I have a very simple feature request. Quite often when I use bedops I am after a single number- the total number of unique bases covered by a bed file. If there was an option to bedops to return this (or perhaps a separate utility "bedcoverage", if you are opposed to bedops returning something other than a bed file), it would be very useful. I have my own script to do this, but making it part of the package would be great for instructional purposes.
Thanks,
Melissa

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: total coverage
« Reply #1 on: December 10, 2013, 04:46:11 PM »
Hi Melissa,
Thank you for the feature request.  The starch format that we support (a highly-compressed BED format) does have this and a number of other features that are returned immediately.  They are precomputed when making a starch archive from a BED file.  Since several programs (bedops, bedmap, closest-features) recognize the starch format, it doesn't change their normal usage.


unstarch --bases your-file.starch                          # total bases
unstarch chr10 --bases your-file.starch              # total bases on chr10
unstarch --elements your-file.starch                    # number of rows in the file
unstarch chr10 --element your-file.starch          # number of rows for chr10 in the file
unstarch --bases-uniq your-file.starch                # number of unique bases in the file
unstarch chr10 --bases-uniq your-file.starch     # number of unique bases on chr10 in the file
unstarch --list-chromosomes your-file.starch    # give back all unique chromosomes found in the file


I realize this isn't what you're after exactly, but thought I'd mention it.  Our tests of a BED file compressed to starch format is that it's ~33% more compressed as compared to an equivalent bam file (that is, we used bam2starch on a bam file to create a starch file for this comparison).

A bedcoverage bash or tcsh script might look like:
bedops --ec -m $1 | awk 'BEGIN {s=0} ; { s+=$3-$2; } END { print s; }'


where the user passes in the file of interest (or modify for any number of files).

or, if by chromosome
bedops --ec --chrom $1 -m $2 | awk 'BEGIN {s=0} ; { s+=$3-$2; } END { print s; }'


where the user passes in some chromosome, like chr10, and the file of interest (or modify for any number of files).

Again, not built-in like you request, but thought I'd mention how I'd do it now.  I'll talk with Alex and others about your request.

It's worth noting that while this sort of calculation can be performed (by chrom) using bedmap and a sorted chromInfo file from UCSC's table browser, it gives the worst case memory performance possible using bedmap (it will read in an entire chromosome's worth of data at a time).  This was a point we made in the supplement of our publication under worst case memory performance.

Thanks,
Shane
« Last Edit: December 21, 2013, 11:52:02 PM by sjn »

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: total coverage
« Reply #2 on: December 13, 2013, 01:22:35 AM »
I'd second taking a look at trying out compressing files to Starch format. Access to base and element statistics is instantaneous and bedops and bedmap operate on Starch files (as well as BED).