Author Topic: Merging a BED file by % overlap  (Read 5592 times)

russell

  • Newbie
  • *
  • Posts: 2
Merging a BED file by % overlap
« on: July 16, 2015, 05:41:19 AM »
I have a bed file containing some redundant, but not identical features. I'd like to merge features overlapping by 90% to produce a cleaned bed file. I'd like to keep all features with a low level of overlap as overlapping, as well as all the non overlapping features.

bedmap  --echo-map-range --echo-map-id --fraction-both 0.75 sorted.bed almost does what I want, but doesn't produce single entries for merged features.

Any help would be much appreciated

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Merging a BED file by % overlap
« Reply #1 on: July 16, 2015, 06:33:55 AM »
I don't completely understand what you are looking for here - feel free to add a bit more detail.  It looks like the main issue is that you want a single output for features that have been merged together.

One thing about --echo-map-range is that it is not guaranteed to produce an output that is in sorted order, so you should pipe your results to a call to sort-bed.

How about:

bedmap --echo-map-range --fraction-both 0.75 sort1.bed sort2.bed \
  | sort-bed - \
  | uniq \
  | bedmap --echo --echo-map-id --fraction-map 1 - sort2.bed

You mention 90% but show 0.75 in your bedmap statement.  You might change what I have above to 0.9 if that's what you're after.

Something that might not be obvious is that a single input element can contribute to more than one output row even when uniq is used.  Consider the scenario where you have 3 elements in your input, labeled A, B, C.  If A and B overlap by 0.9, B and C overlap by 0.9, but A and C do not overlap by 0.9, then the output will have 3 distinct rows:  A+B merged, A+B+C merged, and B+C merged.

I sometimes use an awk statement to remove A+B merged and B+C merged (I call it a subsumption filter).  I can hunt down some code for that if it interests you.

russell

  • Newbie
  • *
  • Posts: 2
Re: Merging a BED file by % overlap
« Reply #2 on: July 21, 2015, 02:25:21 AM »
Thanks sjn, your answer was very helpful. If you could dig out the awk script you mentioned that would be great. I think with this final step it now does exactly what I need.

I was trying different overlap proportions of 75 and 90%, so thats why I had .75 in my example (but asked about .9)

Many thanks

Russell

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Merging a BED file by % overlap
« Reply #3 on: July 21, 2015, 06:13:10 AM »
Add in this awk statement after the uniq statement.  in between the first bedmap call and sort-bed.

    | awk 'BEGIN {OFS="\t"; c=""; s=-1; e=-1} ; { \
                a = int($2); \
                b = int($3); \
                if ( $1 == c ) { \
                  if ( a == s ) { \
                    # previous row subsumed \
                    e = b; \
                  } else if ( e < b ) { \
                    print c, s, e; \
                    s = a; \
                    e = b; \
                  } \
                  # else, current row subsumed \
                } else { \
                  if ( NR > 1 ) { \
                    print c, s, e; \
                  } \
                  c = $1; \
                  s = a; \
                  e = b; \
                } \
              } END { \
                if ( NR > 1 ) { \
                  print c, s, e; \
                } \
              }' \
« Last Edit: July 21, 2015, 10:25:03 AM by sjn »