Author Topic: Bedops merge on feature ID  (Read 4772 times)

IrsanKooi

  • Newbie
  • *
  • Posts: 1
Bedops merge on feature ID
« on: January 30, 2013, 03:50:22 AM »
Hi Bedops masters :-)

Suppose I have a bed-file with that looks like this:

chr1    100     200     tag1
chr1    101     201     tag1
chr3    102     202     tag1
chr4    300     400     tag4

Can I merge each entry based on feature ID (so in this case tag1 and tag2) and count the amount of times each feature is present:

tag1   3    (list of all positions)
tag2   1    (list of all positions)

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Bedops merge on feature ID
« Reply #1 on: January 30, 2013, 08:22:45 AM »
Hi IrsanKooi,

Thanks for joining our forum.
I'd recommend using awk for something like this.  If the problem was a little more involved, I'd go with a unix 'join' command (combined with awk).

Here is a preferred solution without BEDOPS:

Code: [Select]
awk '{ \
      if ( $4 in arr ) { arr[$4] = arr[$4]";"$1"\t"$2"\t"$3; ++cnts[$4]; } \
      else { arr[$4] = $1"\t"$2"\t"$3; cnts[$4]=1; } \
    } END { \
      for ( i in cnts ) \
        print i"\t"cnts[i]"|"arr[i]; \
    }' bed-file

The output may not be sorted in a way that you like.  You can pipe results to 'sort -k1,1' to fix that.  This kind of problem doesn't require any sophisticated windowing or set-like logic, which are the types of things that BEDOPS is best used for.  You can do some acrobatics and use BEDOPS in answering this kind of question, yet such an approach still requires a number of standard unix tools (like cut, awk, uniq).  I recommend using the above approach, understanding its mechanics and why BEDOPS is probably not the best tool for this job.  Because you are just matching on a single field, and, specifically, because you are not using BED chromosome or coordinate information at all, BEDOPS is of limited utility without forcing a solution.  Yet, simple tools provided in linux and macosx are ideal for this sort of thing (the awk code shown above).

Here is a somewhat awkward solution that includes bedmap and sort-bed calls.

Code: [Select]
awk 'BEGIN {OFS="\t"} ; { print $4, 0, 1, $1"-"$2"-"$3; }' bed-file \
  | sort-bed - \
  | tee all-entries.bed \
  | cut -f1-3 \
  | uniq \
  > uniq-entries.bed

bedmap --exact --echo --count --echo-map-id --delim "\t" uniq-entries.bed all-entries.bed \
  | cut -f1,4- \
  > answer.txt

I've taken some liberties with delimiters between the output fields you want.  These can be changed with the 'tr' command (or with '--delim' option in the bedmap call).

We try to work well with standard unix tools rather than replicate what they offer, either directly or with a couple lines of code.   I mentioned unix join above.  This is a powerful little utility that is already on your system that, with a small amount of awk, can do everything a 'groupby' gives you in other tool suites.
« Last Edit: January 30, 2013, 08:36:48 PM by sjn »