Author Topic: Problem with starch --header  (Read 6549 times)

melissa

  • Newbie
  • *
  • Posts: 7
Problem with starch --header
« on: February 11, 2015, 11:02:39 AM »
Hi bedops team,
I've been having problems with starched files that have headers. After some debugging I can reproduce the error with a very simple bed file with a header + one line. The problem is demonstrated with the following bash history (the problem occurs with the final command):

melissa@virgil:~$ cat tiny.bed
# header
chr1   1   10
melissa@virgil:~$ starch --header tiny.bed > tiny.bed.starch
melissa@virgil:~$ unstarch tiny.bed.starch
# header
chr1   1   10
melissa@virgil:~$ unstarch tiny.bed.starch | bedops --header --partition -
chr1   1   10
melissa@virgil:~$ bedops --header --partition tiny.bed.starch
chr1   11   20

As you can see, the last output is wrong. I observe similar problems if I use this starch file with bedmap. Everything works fine without the header, but I thought that using headers in starch files is permitted, and that starch files can be sent directly to bedops/bedmap. I am using the newest version of bedops (2.4.8, downloaded today), on an x86_64 linux machine.
Thanks,
Melissa

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #1 on: February 11, 2015, 03:37:34 PM »
Hi Melissa,
Thanks for the post.

The behavior you show is as expected.  While bedops/bedmap/closest-features all accept headers with the --header option, they will never output that same information.  Consider the case of:
bedops -u f1.bed f2.bed f3.bed

where each input file has header information.  The only reasonable way to output the information and use the downstream result with BEDOPS utilities is to put everything at the top of the output.  Other cases are more complex (such as with closest-features).  All 3 of these utils simply accept and ignore header information.

If you run your file through sort-bed, the header will be stripped.  bedextract can accept header information, which will be ignored.  starch/unstarch are a bit different in that they will store/extract header information.  You can keep the header information in a starch file and get it back with unstarch.  But, if you pass it to another BEDOPS utility (with --header), the result is stripped as you say.

We used to support no header information as most folks with headers have not called sort-bed to ensure everything is sorted as expected.  In truth --header is just an alias to --ec.  I can't speak for everyone, but I don't like headers in my BED-like files so our support is minimal.

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #2 on: February 11, 2015, 06:09:42 PM »
Shane is correct that this is expected behavior. The use of --header with starch is intended to help make the compression format "non-lossy", so that "unstarch" will be able to recover that data.

However, starch/unstarch do not place any requirements on how other tools will process headers from extracted data, and a consequence of generating sorted input for use with other BEDOPS tools will generally strip this header data. Also, the starchcat tool will not concatenate starch files with header data.

One exception is the use of --header with data acquired from other genomic formats, using the convert2bed tool (gff2bed, etc.). In this case, metadata elements at the start of non-BED data are given a pseudo-chromosome name and genomic interval when transformed, which allows them to be sorted with sort-bed as if they were normal BED elements. This allows them to be preserved in downstream BEDOPS operations.

So one option is to mimic this approach, such that given the example input:

#some_metadata
#some_more_metadata
chrN    X    Y


...one might instead make a fake BED element with a pseudo-chromosome name of "_header" or similar, e.g.:

_header    0    1    some_metadata
_header    1    2    some_more_metadata
chrN       X    Y


Unless there are chunks of metadata strewn throughout the BED file, the final result of any BED operations on this kind of data is easy to revert to the original header format.

melissa

  • Newbie
  • *
  • Posts: 7
Re: Problem with starch --header
« Reply #3 on: February 11, 2015, 06:32:15 PM »
Thank you for the response. However, the problem I was reporting was not the stripping of the header... I understand that the header is lost. The problem is that the coordinates have inexplicably changed- from position 1-10 to 11-20. This seems like it must be a bug. Can you replicate this behavior?

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #4 on: February 11, 2015, 06:50:35 PM »
Ah, well then...
yep - it looks like a bug, thank you.

> cat tiny.bed | bedops -p --header -

works

> unstarch tiny.starch | bedops -p --header -

works

> bedops -p --header tiny.starch

does not.

I'll dig in, thanks.
Shane
« Last Edit: February 11, 2015, 06:54:34 PM by sjn »

melissa

  • Newbie
  • *
  • Posts: 7
Re: Problem with starch --header
« Reply #5 on: February 11, 2015, 06:54:54 PM »
Thanks! I'll refrain from headers in starched files in the meantime.

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #6 on: February 12, 2015, 01:06:27 AM »
There is indeed a bug with how headers are handled. I'm working on a patch.

melissa

  • Newbie
  • *
  • Posts: 7
Re: Problem with starch --header
« Reply #7 on: February 12, 2015, 07:04:45 AM »
Hi,
Maybe this is a related problem, but I found another issue with starch --header. Usually if you do "starch --header file.bed > file.bed.starch", and then unstarch the file, the header is preserved. However, if any line in the header has a tab character in it, then only the header up to the tab is preserved in that line. For example:

> cat test.bed
#chrom   chromStart   chromEnd (separated by tabs)
#chrom chromStart chromEnd (separated by spaces)
chr   1   10
> starch --header test.bed > test.bed.starch
> unstarch test.bed.starch
#chrom
#chrom chromStart chromEnd (separated by spaces)
chr   1   10

Note that the first header is truncated, but the second is not.
Thanks,
Melissa

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #8 on: February 15, 2015, 06:06:39 PM »
Thanks for the report - we're considering things!

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #9 on: February 16, 2015, 06:18:40 PM »
I have a patch for starch in the works which adds support for this. I expect to release v2.4.9 in a day or two which will fix this. Thanks for the bug report!

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #10 on: February 17, 2015, 03:48:54 PM »
A new release of BEDOPS is available which includes patched Starch tools:

https://github.com/bedops/bedops/releases/tag/v2.4.9

Please give this a try and let us know if you still run into difficulties.

melissa

  • Newbie
  • *
  • Posts: 7
Re: Problem with starch --header
« Reply #11 on: February 19, 2015, 07:41:00 PM »
I gave it a try and it seems to be working. Thanks for your help and quick fix!
-Melissa

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: Problem with starch --header
« Reply #12 on: February 21, 2015, 09:46:20 PM »
Thanks for the feedback!