Here's a rather specific scenario I'm facing at the moment. I've called super enhancers using K27AC data from various tissue samples, each with several biological replicates. I plan to identify the super enhancers that overlap across replicates within each tissue type using the following code, which you (Alex) explained in this forum post:
$ bedops --everything file1.bed file2.bed ... fileN.bed \
| bedmap --echo-map --fraction-both 0.6 - \
| awk '(split($0, a, ";") > 1)' - \
| sed 's/\;/\n/g' - \
| sort-bed - \
| uniq - \
> answer.bed
My ultimate goal is to identify super enhancers unique to each tissue type. However, I don't want to define "unique" as having no overlap at all between features, since they are often hundreds of kb long and some overlap will occur by chance. So my question is whether there's a way to identify regions that overlap by no more than a specified percentage (or not at all); in other words, to set a maximum overlap rather than a minimum.
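To pin down what I mean by a maximum overlap, here is a minimal Python sketch of the criterion. It is not bedops and it is not meant as a solution: the function names, the toy coordinates and the 0.2 threshold are placeholders I made up. It also measures overlap as a fraction of the query element only (not reciprocally, as --fraction-both does), and it checks each pair of elements separately rather than summing coverage from several elements.

```python
def overlap_fraction(a, b):
    """Fraction of interval a covered by interval b.

    Intervals are (chrom, start, end) tuples with start < end (assumed).
    """
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return max(shared, 0) / (a[2] - a[1])


def unique_elements(query, others, max_fraction=0.2):
    """Keep query elements that no element in `others` covers by more than max_fraction."""
    return [
        q for q in query
        if all(overlap_fraction(q, o) <= max_fraction for o in others)
    ]


# Toy data (hypothetical): two elements in one tissue, two in another.
tissue_a = [("chr1", 1000, 1500), ("chr1", 2000, 2500)]
tissue_b = [("chr1", 1450, 1500), ("chr1", 2100, 2600)]

# The first element is covered by only 10%, so it counts as unique;
# the second is covered by 80%, so it does not.
print(unique_elements(tissue_a, tissue_b, max_fraction=0.2))
# → [('chr1', 1000, 1500)]
```

This is brute force (every element against every other), which is why I'd much rather do it with bedops on the real files.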
I'm also having trouble getting the bedops pipeline above to produce output in a format that's amenable to what I'm doing.
As an example, given that my input samples look like:
chr1 1000 1500 sample1 nearest_gene
chr1 1450 1700 sample1 nearest_gene2
chr1 2000 2500 sample1 nearest_gene3
and
chr1 1400 1500 sample2 nearest_gene
chr1 1450 1700 sample2 nearest_gene2
chr1 3000 3500 sample2 nearest_gene3
I'd like an output like:
chr1 1000 1500 sample1;sample2 nearest_gene_s1;nearest_gene_s2
chr1 1450 1700 sample1;sample2 nearest_gene_s1;nearest_gene_s2
chr1 3000 3500 sample2 nearest_gene3
As it stands now, the elements are repeated as many times as they are found in the sample files (twice each for the elements that overlap in my example). I am admittedly pretty poor with awk/sed, and while whipping up a script to give the output I'd like would be a simple task, I was wondering if there was a better way.
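For reference, the kind of script I have in mind is roughly the following Python sketch. It only handles the simplest case: rows with identical chrom/start/end are collapsed into one row, with the sample and gene columns joined by ";". The function name is made up, and it assumes five whitespace-separated columns as in my example. Elements that overlap only partially (such as 1000-1500 and 1400-1500 above) are left as separate rows, so it does not reproduce my desired output exactly.

```python
def collapse_rows(lines):
    """Collapse BED rows sharing chrom/start/end, joining columns 4 and 5 with ';'."""
    grouped = {}
    for line in lines:
        chrom, start, end, sample, gene = line.split()[:5]
        key = (chrom, int(start), int(end))
        samples, genes = grouped.setdefault(key, ([], []))
        samples.append(sample)
        genes.append(gene)

    collapsed = []
    # Sort by chromosome name, then start, then end.
    for (chrom, start, end), (samples, genes) in sorted(grouped.items()):
        collapsed.append("\t".join(
            [chrom, str(start), str(end), ";".join(samples), ";".join(genes)]
        ))
    return collapsed


# The two example inputs from above, concatenated.
rows = [
    "chr1 1000 1500 sample1 nearest_gene",
    "chr1 1450 1700 sample1 nearest_gene2",
    "chr1 2000 2500 sample1 nearest_gene3",
    "chr1 1400 1500 sample2 nearest_gene",
    "chr1 1450 1700 sample2 nearest_gene2",
    "chr1 3000 3500 sample2 nearest_gene3",
]

collapsed = collapse_rows(rows)
for row in collapsed:
    print(row)
```

With these rows, the two chr1 1450-1700 entries become a single row labelled sample1;sample2, and everything else passes through unchanged.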
I apologize if this is well documented somewhere. The multi-input methodology of this suite made me jump ship from bedtools to bedops, so you have a new fan, at the least. Any help would be much appreciated. Thanks!