Author Topic: Maximum overlap for intersection?  (Read 6225 times)

j-andrews7

  • Newbie
  • *
  • Posts: 9
Maximum overlap for intersection?
« on: November 30, 2015, 10:38:18 AM »
Here's a rather specific scenario I'm facing at the moment. I've called super enhancers using K27AC data from various tissue samples, each with several biological replicates. I plan on identifying overlapping super enhancers for each tissue type with the following code as explained by you (Alex) in this forum post.

Code: [Select]
$ bedops --everything file1.bed file2.bed ... fileN.bed \
    | bedmap --echo-map --fraction-both 0.6 - \
    | awk '(split($0, a, ";") > 1)' - \
    | sed 's/\;/\n/g' - \
    | sort-bed - \
    | uniq - \
    > answer.bed

My ultimate goal is to identify super enhancers unique to each tissue type. However, I don't want to define "unique" as having no overlap between the features, as they are often 100's of kbs in length and random overlap may occur to some extent. So my question is if there's a way to identify regions that only overlap by a specified percentage (or not at all). So rather than a minimum overlap, you use a maximum overlap.

I'm also having trouble getting the above code to output in a format that's amenable to what I'm doing.
As an example, given that my input samples look like:
Code: [Select]
chr1    1000    1500    sample1    nearest_gene
chr1    1450    1700    sample1    nearest_gene2
chr1    2000    2500    sample1    nearest_gene3

and

Code: [Select]
chr1    1400    1500    sample2    nearest_gene
chr1    1450    1700    sample2    nearest_gene2
chr1    3000    3500    sample2    nearest_gene3

I'd like an output like:
Code: [Select]
chr1    1000    1500    sample1;sample2    nearest_gene_s1;nearest_gene_s2
chr1    1450    1700    sample1;sample2    nearest_gene_s1;nearest_gene_s2
chr1    3000    3500    sample2    nearest_gene3

As it stands now, the elements are repeated as many times as they are found in the sample files (twice each for the elements that overlap in my example). I am admittedly pretty poor with awk/sed, and while whipping up a script to give the output I'd like would be a simple task, I was wondering if there was a better way.

I apologize if this is well documented somewhere. The multi-input methodology of this suite made me jump ship from bedtools to bedops, so you have a new fan, at the least. Any help would be much appreciated. Thanks!
« Last Edit: November 30, 2015, 03:59:18 PM by j-andrews7 »

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Maximum overlap for intersection?
« Reply #1 on: November 30, 2015, 05:32:42 PM »
I have to think on your first question a little. I can't think of an easy way to do it, other than maybe preprocessing the original N inputs, and postprocessing the answer file to feed per-input answer subsets to "bedops -n", to effectively apply a maximum overlap threshold. Shane might have a smarter approach, I'd bet.

To answer your second question, you could preprocess your input to merge the fourth and fifth columns into one ID field with a unique delimiter that you can split on afterwards.

So something like this:

chr1    1000    1500    sample1    nearest_gene
chr1    1450    1700    sample1    nearest_gene2
chr1    2000    2500    sample1    nearest_gene3

Becomes:

chr1    1000    1500    sample1-1
chr1    1450    1700    sample1-2
chr1    2000    2500    sample1-3

Likewise:

chr1    1400    1500    sample2    nearest_gene
chr1    1450    1700    sample2    nearest_gene2
chr1    3000    3500    sample2    nearest_gene3

Becomes:

chr1    1400    1500    sample2-1
chr1    1450    1700    sample2-2
chr1    3000    3500    sample2-3

To do this, you could use something like:

$ awk '{ print $0"-"NR; }' input1.bed > input1.mergedID.bed
$ awk '{ print $0"-"NR; }' input2.bed > input2.mergedID.bed
...
$ awk '{ print $0"-"NR; }' inputN.bed > inputN.mergedID.bed

This will depend on "nearest_gene" and the number in its suffix. You might need something more complicated than printing the row number ("NR"), instead parsing the string "nearest_geneX" to get the value of X, and then printing that at the end of the sample ID.

In any case, when you do your set operations on preprocessed inputs, you would get something like:

chr1    1000    1500    sample1-1;sample2-2
chr1    1450    1700    sample1-1;sample2-2
chr1    3000    3500    sample2-3

Then post-process with awk (or Perl, Python, etc.), splitting once on ";" to get separate sample IDs, and then split the sample ID on the "-" character to get the original sample's nearest_gene index. Once you have split on these delimiters, you can print output in your desired format.

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Maximum overlap for intersection?
« Reply #2 on: December 01, 2015, 08:21:23 AM »
You're not looking for regions that are specific to one input, but, if you were:
> bedops --symdiff file1.bed file2.bed ... fileN.bed > regions-of-interest.bed

would give you those regions.  Then, you'd map on the one ID from column 4 from the correct file.

> bedops --everything file1.bed file2.bed ... fileN.bed \
        | bedmap --echo --echo-map-id-uniq regions-of-interest.bed -

If I understand correctly, you are interested in regions that have minimal overlaps with other elements.  How about something like:
> bedops --everything file1.bed file2.bed ... fileN.bed \
      | bedmap --fraction-both 0.6 --echo --count - \
      | awk -F"|" '($2 == 1)'

That should return the list of regions (including all columns found in your original input files) that have no other overlapping elements that meet your overlap criterion.

« Last Edit: December 01, 2015, 08:38:26 AM by sjn »

j-andrews7

  • Newbie
  • *
  • Posts: 9
Re: Maximum overlap for intersection?
« Reply #3 on: December 01, 2015, 10:33:53 AM »
Thanks for the quick responses. I've answered my second question by tweaking my original command and doing a bit of postprocessing. I now have several files, each containing samples for a specific cell type. For example:

chr1   1757937   1803240   CB021314   ['GNB1', 'NADK', 'TMEM52', 'CALML6']
chr1   2572802   2630150   CB011514;CB012214   ['TTC34', 'MMEL1']
chr1   3565069   3596796   CB021314   ['WRAP73', 'TP73', 'RP5-1092A11.5', 'WDR8', 'MEGF6', 'TPRG1L']

I know that I'm cheating with the 5th column not being numeric, but I don't plan on doing operations with the column - I should have specified in my original example that they are actual gene symbols. In any case, I can drop them and re-annotate after processing if needed. I may have to do that regardless, we'll see.

Anyway, Shane, could you elaborate on your last sentence:

"That should return the list of regions (including all columns found in your original input files) that have no other overlapping elements that meet your overlap criterion."

I think what you've given me is what I want. To be certain, I want to take files (like my example above) and compare them to other such files to identify elements that are "unique", which is defined as either no overlap with any of the elements in the other files or a specified maximum overlap (like 0.25, etc). In the code you've specified, it's returning elements for all specified files that do not meet the minimum overlap criteria (60% in your example), correct?

Thanks again for the help.

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Maximum overlap for intersection?
« Reply #4 on: December 01, 2015, 11:31:39 AM »
BEDOPS doesn't require that you have a measurement in field 5 unless you use a numerical calculation with bedmap, so what you're doing is valid and common.

I think what I showed gives you what you're after.  When you give bedmap one input file, it maps the file onto itself.  Every element overlaps itself by 100% so you are guaranteed to get at least 1 with --count no matter what overlap criterion you choose.  I specifically filtered for those elements with an overlap --count of 1.  Depending upon your overlap specification, a --count may be greater than 1.  In such a case, it means that two elements (or more) overlap too much and you do not consider them as unique.  These will not be found in your output.

j-andrews7

  • Newbie
  • *
  • Posts: 9
Re: Maximum overlap for intersection?
« Reply #5 on: December 01, 2015, 01:23:28 PM »
Perfect, exactly what I needed. Thanks again.

j-andrews7

  • Newbie
  • *
  • Posts: 9
Re: Maximum overlap for intersection?
« Reply #6 on: December 01, 2015, 02:28:06 PM »
So upon trying that, altering the minimum-fraction for overlap doesn't result in different results. I always get the same results, no matter if I use 0.01 or 0.99 for the overlap.

Thoughts?

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Maximum overlap for intersection?
« Reply #7 on: December 01, 2015, 02:59:35 PM »
We should be able to help.  I assumed that all of your inputs are sorted.  You might throw a --ec option onto your bedops --everything call to double check that.  If things run through and you're not getting what you expect, could you post an excerpt of the problem?


j-andrews7

  • Newbie
  • *
  • Posts: 9
Re: Maximum overlap for intersection?
« Reply #8 on: December 01, 2015, 03:49:18 PM »
Actually, after diving deeper, I think it's a result of the way the features were called upstream. They either overlap completely or not at all. So much effort trying to do something that was moot anyway!

Well, I'm a dolt but still learned a useful trick for the future. Sorry for wasting your time!

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Maximum overlap for intersection?
« Reply #9 on: December 01, 2015, 05:06:24 PM »
It's no waste.  I'm glad you're trying out BEDOPS - feel free to ask questions anytime.

Shane