Author Topic: --echo-map without --echo: effect of non-matches on output  (Read 5724 times)

erynes

  • Newbie
  • *
  • Posts: 1
--echo-map without --echo: effect of non-matches on output
« on: October 10, 2012, 12:57:47 PM »
With bedmap, any row in the reference file that doesn't map to anything in the mapping file yields the empty string for the map, naturally.  When this happens, if the user has specified both --echo and --echo-map, this works fine as-is; s/he can grep -v '|$' to get only the rows which map, or grep '|$' to get only the rows which don't map.  More to my point, the output is still in BED format.

But if the user doesn't specify --echo, then she gets empty lines in the middle of the output, which makes the output non-BED-format, which can silently cause problems downstream if she doesn't realize she needs to explicitly address this.

Example:

$ cat aa.bed
chr3   500   600   line1   123
chr3   1300   1400   line2   12345
chr3   1600   1800   line3   1234
$ cat bb.bed
chr3   55   56   rs123456
chr3   1701   1702   rs1000
$ cat cc.bed
chr3   1000   2000

With --echo and --echo-map both specified:

$ bedmap --echo --echo-map aa.bed bb.bed
chr3   500   600   line1   123|
chr3   1300   1400   line2   12345|
chr3   1600   1800   line3   1234|chr3   1701   1702   rs1000

With only --echo-map specified:

$ bedmap --echo-map aa.bed bb.bed


chr3   1701   1702   rs1000

Piping this to another bedmap call, with no error-checking, in this example yields what I'd expect:
$ bedmap --echo-map aa.bed bb.bed | bedmap --echo cc.bed -
chr3   1000   2000

But doing the same and specifying --ec, although it gives a helpful error message, yields no output, which would mess up a series of additional pipes in a script parsing large files:

$ bedmap --echo-map aa.bed bb.bed | bedmap --ec --echo cc.bed -
May use bedmap --help for more help.

Error: in -
Empty line found.
See row: 1

(I got a different error message in the actual script I was running on actual data that didn't mention an empty line and just said "improper BED format," along with the first offending line number.)

Of course I can just take care to prevent empty lines from passing through the pipe, like this:

$ bedmap --echo-map aa.bed bb.bed | awk '{if ($0 != ""){print;}}' | bedmap --ec --echo cc.bed - | bedmap --range 450 --echo-map aa.bed -
chr3   1000   2000
chr3   1000   2000
chr3   1000   2000

but I thought I'd post this example to the forum in case it's helpful and/or spurs a discussion.  The output I expect from --echo-map depends on whether --echo is also specified, so I need some re-edjumication.  But if other users make the same assumption I did, then this might be good food for thought and/or documentation.

Thanks!
Eric
« Last Edit: October 10, 2012, 01:01:23 PM by erynes »

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: --echo-map without --echo: effect of non-matches on output
« Reply #1 on: October 10, 2012, 02:24:49 PM »
A couple of comments:

bedmap is a program that produces a single line of output for each line of input from the first (Reference) file.  You should not use the results given via echo-map or echo-map-range directly as input to a downstream BEDOPS program other than sort-bed, as the data are not guaranteed to be sorted (or even exist as in your example).  The only safe operation to pass downstream to BEDOPS utils without further processing is --echo (when given as the first output option - i.e.; bedmap --echo --echo-map-id --echo-map-score --mean).  That isn't to say that it isn't possible to receive a correctly sorted output with just --echo-map-range or even --echo-map under very stringent and special circumstances, but these options are not designed to ensure sorted order.

An example of a generally unsafe set of commands that you show is:
Quote from: erynes
bedmap --echo-map aa.bed bb.bed | bedmap --echo cc.bed -
(this command set does nothing more than `cat cc.bed` though it is unsafe as written)

The best bet is just to reverse the file ordering since you're really interested in elements from bb.bed.  The --indicator operation can be handy here too.  If you're interested in bb.bed elements that overlap something in aa.bed, use --indicator and --echo with bb.bed as the first (Reference) file instead.

bedmap --indicator --echo bb.bed aa.bed | awk -F"|" '{ if (int($1) == 1) { print $2; } }'

This approach guarantees that your output is ordered properly just like the input, and it can be used immediately with downstream BEDOPS utils (no sorting needed).

Quote from: erynes
(I got a different error message in the actual script I was running on actual data that didn't mention an empty line and just said "improper BED format," along with the first offending line number.)
This is an error message from sort-bed.  In the next release (version 2.0) of BEDOPS this will no longer be an error in sort-bed as all empty lines will be stripped automatically.  However, other BEDOPS utils still won't work with empty input lines.  While it becomes possible to use sort-bed (version 2.0) to clean up these kinds of problems, such an approach would not be as efficient as the approach I show above with --indicator and swapping the files around.

Finally, --echo-map returns a list of any number of full rows from the second file, separated by ';' (by default).  Using the result of --echo-map as a BED file (even a sorted one) for downstream analyses may be full of unwanted surprises.  In general, you often do not receive a 1:1 correspondence between elements in the two input files, but may instead receive 5 elements from the second (Mapping) file that map onto 1 line of the first (Reference) file.
« Last Edit: October 19, 2012, 07:18:49 PM by sjn »