Author Topic: error wig2bed tool

Assa

error wig2bed tool
« on: February 19, 2015, 02:34:35 AM »
Hi,
I am using BEDOPS version 2.4.9 and am trying to convert a bigWig file to BED.
I downloaded the bigWig file from the ENCODE website - http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/wgEncodeSydhTfbsK562Pol3StdSig.bigWig.

I then converted it to wig using the UCSC-provided software - http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bigWigToWig.

After that I tried to run the wig2bed script, but it stops with an error:
Code: [Select]
$ wig2bed < Pol3StdSig.wig
Start coordinate is too large.  Max decimal digits allowed is 13 in BEDOPS.Constants.hpp.  See line 15844175 in -.

When checking this position, I couldn't find anything suspicious about it.
Code: [Select]
$ sed -n '15844170, 15844180'p Pol3StdSig.wig
chr16 89591258 89591268 4.7
chr16 89591268 89591276 4.6
chr16 89591276 89591624 0
chr16 89591624 89591625 5.9
chr16 89591625 89591628 5.8
chr16 89591628 89591629 5.7
chr16 89591629 89591631 5.6
chr16 89591631 89591633 5.5
chr16 89591633 89591640 5.4
chr16 89591640 89591644 5.3
chr16 89591644 89591651 5.2

I found another message in the forum about a similar problem with a different error message. In it, you recommended downloading an older version of bedops (https://bedops.googlecode.com/files/bedops_linux_x64-v1.2.5b.tgz).
I then ran the older wig2bed and it worked.

I'm not sure why the older version worked and the newer one didn't. What also bothers me is that the wig and bed files don't have the same number of lines.
Code: [Select]
$ wc -l Pol3StdSig.bed
44605785 Pol3StdSig.bed
$ wc -l Pol3StdSig.wig
44649357 Pol3StdSig.wig

Is there a reason for this difference in the line counts?

I am writing this message just to let you know that there might be a bug in the script.
I would appreciate it if you could let me know why I get this error message and whether I did something wrong.

Thanks
Assa

AlexReynolds

  • Administrator
Re: error wig2bed tool
« Reply #1 on: February 19, 2015, 08:42:42 AM »
An overflow appears to occur when a wig entry has a start position of 0, which affects downstream elements. I'll likely push out a new version of wig2bed today with a fix for this condition.

It seems odd that 1-based wig elements should have a start position of 0 (see: https://www.biostars.org/p/84686/ for an explanation), but perhaps this is just an oddity of that wig file. It should be easy enough to fix, in any case.

Converted BED data are passed internally through sort-bed, which strips out headers or comment lines. The old wig2bed tool from Google Code does not use sort-bed, and may leave headers in the output. I expect that the difference in the number of lines is due to the removal of those lines in the newer version, but I'll also double-check that when I look at the overflow issue.

Thanks for the report!

AlexReynolds

Re: error wig2bed tool
« Reply #2 on: February 19, 2015, 01:28:56 PM »
There are two elements in the wig file which have start coordinates of 0:

Code: [Select]
balsamea:Github alexpreynolds$ sed -n '15859650, 15859660'p wgEncodeSydhTfbsK562Pol3StdSig.wig
chr16 90289051 90289078 0
chr16 90292542 90292837 0
#bedGraph section chr17:0-70149
chr17 0 256 0
chr17 256 331 4.3
chr17 331 507 0
chr17 507 510 4.4
chr17 510 558 4.3
chr17 558 582 4.2
chr17 582 834 0
chr17 834 839 4.8
balsamea:Github alexpreynolds$ sed -n '43067090, 43067100'p wgEncodeSydhTfbsK562Pol3StdSig.wig
chr9 141146429 141146462 0
chr9 141146472 141146505 0
#bedGraph section chrM:0-1297
chrM 0 1 2146.8
chrM 1 2 2202.4
chrM 2 3 2200.8
chrM 3 4 2217.7
chrM 4 5 2243.6
chrM 5 6 2259.8
chrM 6 7 2257.8
chrM 7 8 2264.7
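
For anyone who wants to find such elements without knowing the line numbers in advance, here is a minimal sketch. It assumes the wig file uses the bedGraph-style layout shown above (chromosome, start, end, value, with '#' header lines); sample.wig and its contents are made-up stand-in data, not the real ENCODE file.

```shell
# Build a tiny stand-in for the converted wig file (sample data only).
cat > sample.wig <<'EOF'
#bedGraph section chr16:90289051-90292837
chr16 90289051 90289078 0
chr16 90292542 90292837 0
#bedGraph section chr17:0-70149
chr17 0 256 0
chr17 256 331 4.3
#bedGraph section chrM:0-1297
chrM 0 1 2146.8
chrM 1 2 2202.4
EOF

# Skip header lines, then print the line number and content of every
# element whose start coordinate (second field) is 0.
zero_starts=$(awk '!/^#/ && $2 == 0 { print NR ": " $0 }' sample.wig)
echo "$zero_starts"
# 5: chr17 0 256 0
# 8: chrM 0 1 2146.8
```

Pointing the same awk filter at the real wig file should list the two elements above, although I have only checked it against the sample data.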

I'm contacting the UCSC folks to see if the bigWigToWig tool is not applying the correct indexing, or if the wig data itself is not indexed as expected, or if the indexing problem is specific to this particular bigWig file.

There will likely be one of the following two fixes applied:

1. Skip over 0-indexed elements and print a warning message to standard error; or,
2. Remove adjustments by wig2bed to the wig coordinates

We'll see what UCSC has to say and then I'll probably post an update later today or tomorrow.

Regarding headers, the wig2bed tool strips headers before reaching sort-bed, unless the --keep-headers option is specified. However, --keep-headers rewrites header lines as pseudo-BED elements, so that they can be sorted with sort-bed. So this may not be exactly what you were expecting, although this would resolve line count questions.

The disparity in line numbers between input wig and output bed files is specifically due to the removal of header lines from the wig input, during conversion. We can compare "before" and "after" line counts to see how this works:

Code: [Select]
balsamea:Github alexpreynolds$ grep '^#' wgEncodeSydhTfbsK562Pol3StdSig.wig | wc -l
   43572
balsamea:Github alexpreynolds$ grep -v '^#' wgEncodeSydhTfbsK562Pol3StdSig.wig | wc -l
 44605785
balsamea:Github alexpreynolds$ grep '^#' wgEncodeSydhTfbsK562Pol3StdSig.bed | wc -l
       0
balsamea:Github alexpreynolds$ wc -l wgEncodeSydhTfbsK562Pol3StdSig.bed
 44605783 wgEncodeSydhTfbsK562Pol3StdSig.bed

In other words, there are 43572 header lines in the original wig file, and 44605785 non-header lines. The bed file produced by a modified version of wig2bed, which skips the two 0-based elements, has 44605783 lines. (If the two skipped elements were included, we would then have all 44605785 elements from the original wig file.)
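
The same bookkeeping can be checked on any wig file: header lines plus non-header lines should add up to the total from wc -l. With the numbers above, 43572 + 44605785 = 44649357, which matches the wc -l count reported for the wig file in the first post. A small sketch follows; counts.wig is made-up sample data standing in for the real file.

```shell
# Tiny stand-in wig file (sample data only): 3 header lines, 6 elements.
cat > counts.wig <<'EOF'
#bedGraph section chr16:90289051-90292837
chr16 90289051 90289078 0
chr16 90292542 90292837 0
#bedGraph section chr17:0-70149
chr17 0 256 0
chr17 256 331 4.3
#bedGraph section chrM:0-1297
chrM 0 1 2146.8
chrM 1 2 2202.4
EOF

headers=$(grep -c '^#' counts.wig)        # lines starting with '#'
elements=$(grep -c -v '^#' counts.wig)    # everything else
total=$(wc -l < counts.wig | tr -d ' ')   # strip padding some wc versions add

echo "headers=$headers elements=$elements total=$total"

# The two partial counts should account for every line in the file.
if [ $((headers + elements)) -eq "$total" ]; then
    echo "counts agree"
fi
```

If the converted bed file then has fewer lines than the element count, the difference is the number of elements dropped during conversion.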

AlexReynolds

Re: error wig2bed tool
« Reply #3 on: February 21, 2015, 09:54:00 PM »
In the current beta (v2.4.10) I have added a --zero-indexed option to deal with data sourced from bigWigToWig, which does not always output WIG coordinates in 1-based indexing, but instead leaves the coordinates untouched.

According to UCSC, if the bigWig file was derived from BAM- or bedGraph-formatted data, for instance, which have a 0-based index, then the resulting WIG data will also have a 0-based index, even though the WIG format specifies that its coordinates are 1-based.

Thus, adding --zero-indexed to the wig2bed, wig2starch or convert2bed statement will leave the WIG data coordinates unadjusted, which will prevent overflow (and sorting problems downstream).

In the default case, if the WIG data contain an element with a start coordinate of 0 and the --zero-indexed option is not specified, then the conversion tool will exit early with an EINVAL error. The end user can then add --zero-indexed to the conversion step and run it again.
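
To avoid the failed first run, one could check the input before converting. The sketch below is only an illustration: wig_index_flag is a hypothetical helper (not part of BEDOPS), the sample files are made up, and it assumes bedGraph-style wig lines like the ones shown earlier in this thread. It prints the wig2bed command rather than running it, so it works without BEDOPS installed.

```shell
# Hypothetical helper (not part of BEDOPS): prints "--zero-indexed" when any
# non-header element in the given wig file has a start coordinate of 0.
wig_index_flag() {
    if awk '!/^#/ && $2 == 0 { found = 1; exit } END { exit !found }' "$1"; then
        echo "--zero-indexed"
    fi
}

# Made-up sample inputs: one with a start of 0, one without.
printf '#bedGraph section chrM:0-1297\nchrM 0 1 2146.8\nchrM 1 2 2202.4\n' > zero_based.wig
printf '#bedGraph section chr1:1-20\nchr1 1 10 0.5\nchr1 10 20 0.7\n' > no_zero.wig

flag_zero=$(wig_index_flag zero_based.wig)
flag_none=$(wig_index_flag no_zero.wig)

# Show the command that would be run for each file.
echo "wig2bed ${flag_zero:+$flag_zero }< zero_based.wig > zero_based.bed"
echo "wig2bed ${flag_none:+$flag_none }< no_zero.wig > no_zero.bed"
```

Note that this only catches the case that triggers the error. A file with no start coordinate of 0 can still be 0-based, so knowing how the bigWig file was generated remains the more reliable guide.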

This feature will likely be pushed out to the current release on Monday or Tuesday. I'll post a quick note here when that is done. Thanks for the bug report.