Author Topic: concordance btw. bedtools and bedops  (Read 5208 times)

mcgreevj

  • Newbie
  • *
  • Posts: 1
concordance btw. bedtools and bedops
« on: October 16, 2013, 04:51:03 AM »
I'm getting different answers from BedTools and bedops with what I think should be the same operations.

The bedtools command is

intersectBed -a $ref -b $db -wa -f 0.75 -u > $db.Huang.S1gt.99.bed

the bedops command is

bedops -e -75%  $ref $db  > $db.Huang.S1gt.99.bed


The Bedtools command returns what looks like the correct intersection when viewed manually in IGV, bedops appears to be missing a few. Any ideas?

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: concordance btw. bedtools and bedops
« Reply #1 on: October 16, 2013, 08:10:17 AM »
Thanks for using BEDOPS.  I think you're looking for what the 'bedmap' program does.

I'm not very familiar with BedTools, but looking quickly at the documentation, the intersectBed command appears to give the same region from $ref as many times as there are qualified overlapping elements in $db.  The -u option skips elements from $ref that have no overlapping elements.

The bedops -e (--element-of) is a set operation that gives back the elements of $ref that have qualified overlapping elements in $db.  That is, the subset of elements in $ref that have overlapping elements in $db. Each element is reported at most 1 time (it is an element of or not an element of the regions found in $db).

The bedmap program in BEDOPS seems to be more analogous to the output of intersectBed in this case.

bedmap --fraction-ref 0.75 --echo --echo-map --skip-unmapped $ref $db

--skip-unmapped causes no output in cases where $ref has no qualifyied overlapping element in $db (this option is available in BEDOPS 2.3).  --echo will spit out elements from $ref, while --echo-map spits out all qualified overlapping elements in $db.

The default output format will be different from that in BedTools, and might look something like:
chr1  1  100  ref-file  0.567  +|chr1  2  400  id-A  10;chr1 20  100  id-B  20

in contrast to BedTools
chr1  1  100  ref-file  0.567  +  chr1  2 400 id-A 10
chr1  1  100  ref-file  0.567  +  chr1 20 100 id-B 20

By default, bedmap separates output columns using the '|' symbol.  You can change this with the --delim option.  Elements from --echo-map are separated further by the ';' symbol, which you can change with the --multidelim option.

The program accepts options in any order and respects the order given on output, so that:
'bedmap --skip-unmapped --echo-map --echo $ref $db'
swaps the output columns.  bedmap offers a rich set of overlap options.  Above, I use --fraction-ref, but --fraction-map, --fraction-either, and --fraction-both are all popular alternative options.

If you are just interested in the IDs or the scores in $db, you could use --echo-map-id or --echo-map-score.  Many other options are available.  See docs and bedmap --help for details.

By default, bedmap does no input error checking so you might consider using --ec while you're getting used to the program (slows it down by half).

Let us know if I've read the BedTools' documentation incorrectly or if you have any more questions.
« Last Edit: October 17, 2013, 12:39:56 PM by sjn »

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: concordance btw. bedtools and bedops
« Reply #2 on: October 16, 2013, 11:48:21 AM »
If you are defining the criteria for missing elements on the basis of fewer lines in the BEDOPS/bedops result set as compared with the BedTools result set, then you should definitely review the BEDOPS/bedmap result set (as suggested in Shane's answer) to see if there are multiple semi-colon-delimited elements that are all contained on one line.

If Shane's answer does not resolve the discrepancy, we would like to offer to review this further. If you can provide a copy of affected datasets used as inputs, the versions of BedTools and BEDOPS in use, and the specific commands used (after switching from bedops to bedmap), we can then do tests on our end.
« Last Edit: October 16, 2013, 11:55:26 AM by AlexReynolds »

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: concordance btw. bedtools and bedops
« Reply #3 on: October 18, 2013, 09:01:05 AM »
After re-reading, I may not have answered your question completely.  bedops -e -75% A B should give you the correct answer.  Remember that end coordinates mark 1 bp off the end of the region of interest - [start, end) in typical notation.  One side-effect with bedops is that it merges all regions in file B before calculating the overlap percentage.  It's a more efficient version of:

bedops -m B | bedops -e -75% A -

but that doesn't sound like an issue as you claim that it gives back fewer regions than you think it should.  Other than the 1 bp thing mentioned, I cannot think of anything that might cause an incorrect output, unless your data are not sorted correctly.  You could throw on the --ec flag to ensure things are sorted as expected by BEDOPS.  Please do send some files or snippets from your files where you think things are incorrect if these ideas do not help.  Also include the version you're using, please: bedops --version.

One thing to keep in mind is that sorted order is of utmost importance with BEDOPS, and the sort utility with BedTools does not conform to these expectations.  We have additional constraints to remove ambiguities.  Our sort-bed utility has the additional benefit of being faster with better in-memory performance.  Again, using --ec with bedops/bedmap will ensure that everything is sorted and will work properly with those utilities.
« Last Edit: October 18, 2013, 03:18:54 PM by sjn »