Author Topic: Tips for working with --complement and left and right edges of chromosome  (Read 7652 times)

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
There is one complication to the usage case for -c (or --complement) with a 0-based coordinate system (and only with this coordinate system), where it is impossible to get back any result that produces base 0 on the output, for any chromosome.

One might think of this situation as wanting to find the region between the "left" edge of a chromosome and the first input element. Consider the following three elements:

chr1   1   4
chr1   100 110
chr2   50  75


Running bedops -c on this input will produce the following result:

chr1   4   100

The optional -L argument may be given with -c, so that bedops -c -L with the elements above produces, instead:

chr1   0   1
chr1   4   100
chr2   0   50


This handles the "left" edge case. What about the "right" edge — the extent of the chromosome?

Hard-coded bounds for various genome assemblies are not part of BEDOPS, but one can easily specify any maximum coordinate bound as an input region, in order to retrieve coordinates as far out to the right edge (i.e., chromosome length, or "maximum bound") as desired.  It often makes sense to download the chromInfo table for the genome of interest from the UCSC table browser (http://genome.ucsc.edu/), do some small processing and use that as input to bedops -c.  For example, say you're working in 0-based coordinates and you want complemented regions to both ends of a chromosome where applicable:

awk '(NR > 1) { print $1, $2, $2+1; }' chromInfo | sort-bed - | bedops -c -L myfile.bed - > answer.bed

will give you what you're after though the downloaded chromInfo file may have 'extra' chromosome information (such as chrUn_gl000212).  If you're not interested in those extras, but only the chromosomes in your myfile.bed file, then:

bedexract --listchr myfile.bed > tmp
grep -w -F -f tmp chromInfo | awk '(NR > 1) { print $1, $2, $2+1; }' | sort-bed - | bedops -c -L myfile.bed - > answer.bed
rm tmp

will clean things up for you.  We grab the unique chromosomes in your myfile.bed file using bedextract.  We then grep out rows in chromInfo that have the same chromosome names as in your file, mark the end of the chromosomes, sort, and then apply the complement operation.
« Last Edit: August 04, 2012, 01:16:16 PM by sjn »