Author Topic: bed format question  (Read 7318 times)

zhilianghu

  • Newbie
  • *
  • Posts: 11
bed format question
« on: November 23, 2013, 07:52:30 PM »
I used your vcf2bed tool to convert VCF files to .bed successfully. Here is one line from a converted file:

Chr1    570     571     .       106.56  T       C       PASS    AC=1;AF=0.50;AN=2;DP=6;Dels=0.00;HRun=0;HaplotypeScore=0.0000;MQ=44.01;MQ0=0;QD=17.76;SB=12.54 sumGLbyD=22.76       GT:AD:DP:GQ:PL  0/1:1,5:5:29.68:137,0,30

However, when I checked this against the BED format descriptions at
http://useast.ensembl.org/info/website/upload/bed.html and
http://genome.ucsc.edu/FAQ/FAQformat#format1 , I could not match the column definitions.
Could you advise?

Thanks,
Zhiliang

sjn

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: bed format question
« Reply #1 on: November 24, 2013, 07:37:45 PM »
Hi Zhilianghu,

BEDOPS supports a relaxed/more-general form of the standard, strict BED format.  The bedops, sort-bed, closest-features, starch, unstarch, and bedextract programs require only the first 3 fields defined in BED (chromosome, start-coord, end-coord).  The bedmap program requires either 3, 4, or 5 fields depending upon the options selected.  After the required minimum number of columns, all additional fields may be anything you desire.  For cases of conversions between formats, such as VCF -> BED, all information is maintained (often in the extra fields) such that you could convert back to VCF if you really wanted to.  In this way, our simple relaxation of the BED format can be used to losslessly represent the dozen or so other formats commonly found in our field.  Most often, this can be performed using a small bit of awk code.
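To make the relaxed layout concrete, here is a minimal Python sketch. It is not part of BEDOPS, and the function name `split_relaxed_bed` is made up for illustration. It treats the first three tab-separated fields as the required BED columns and passes everything after them through untouched.

```python
def split_relaxed_bed(line):
    """Split one tab-delimited line into (chrom, start, end, extra_fields).

    Only the first three fields are interpreted; the rest are kept as-is.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom, start, end, fields[3:]


# First eight columns of the example row posted above, joined with tabs.
row = "Chr1\t570\t571\t.\t106.56\tT\tC\tPASS"
chrom, start, end, extra = split_relaxed_bed(row)
print(chrom, start, end, extra)
```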

In the row you show, the 5th column represents something numerical which might prove useful for the various bedmap calculations, such as mean, stdev, trimmed-mean, etc.

If you want strict, standard BED format, then awk could also be used to do this, but you will be throwing out information that VCF gives which strict BED does not support.
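The reply suggests awk for this; the same idea is sketched here in Python instead. This is only one possible way to reduce a converted row to a strict five-column BED line, and the function name `to_strict_bed5` is hypothetical. It assumes column 4 is used as the name and column 5 as the score, and it rounds and clamps the score because the UCSC description defines the score as an integer between 0 and 1000. Everything after column 5 is discarded, which is exactly the information loss mentioned above.

```python
def to_strict_bed5(line):
    """Reduce a relaxed BED line to chrom, start, end, name, score."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[:3]
    # Assumption: column 4 serves as the BED name; "." if it is absent.
    name = fields[3] if len(fields) > 3 else "."
    try:
        score = float(fields[4])
    except (IndexError, ValueError):
        score = 0.0  # no usable numeric value in column 5
    # Round to an integer and clamp into the 0..1000 range.
    score = max(0, min(1000, int(round(score))))
    return "\t".join([chrom, start, end, name, str(score)])


row = "Chr1\t570\t571\t.\t106.56\tT\tC\tPASS"
print(to_strict_bed5(row))
```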

Shane

zhilianghu

Re: bed format question
« Reply #2 on: November 25, 2013, 12:12:45 PM »
Shane,

Thanks for taking the time to reply.

Since I lack (or perhaps cannot find) documentation that maps the columns in your BED files, let me attempt to lay them out:

Col 1: Chr.# (default, required)
Col 2: Chr_start (default, required)
Col 3: Chr_end (default, required)
Col 4: (??)
Col 5: "bedmap parameters/statistics" (from your reply)
Col 6: Genotype_allele_A (??)
Col 7: Genotype_allele_B (??) alt??
Col 8: Quality label
Col 9: Various defined parameters (in "name=value" pairs).
Col 10: Format/name of data in column 11 (??)
Col 11: Values defined in column 10 (??)

Could you fill in the gaps, correct me, or confirm my guesses on columns 4-11?

Thank you,

Zhiliang

sjn

Re: bed format question
« Reply #3 on: November 25, 2013, 01:20:29 PM »
Have you taken a look at the documentation/example here?

https://bedops.readthedocs.org/en/latest/content/reference/file-management/conversion/vcf2bed.html#example

If this doesn't resolve your questions, let us know.

Wikipedia has a blurb on VCF with a link to a pdf that describes all of its fields: http://en.wikipedia.org/wiki/Variant_Call_Format

Shane
« Last Edit: November 25, 2013, 02:19:02 PM by sjn »

AlexReynolds

  • Administrator
  • Jr. Member
  • *****
  • Posts: 72
Re: bed format question
« Reply #4 on: November 25, 2013, 05:31:45 PM »
In addition to reading the BEDOPS documentation, you can read the vcf2bed script itself; the comments at the top of the script document how VCF columns are mapped to BED columns.

zhilianghu

Re: bed format question
« Reply #5 on: November 25, 2013, 08:29:35 PM »
The Python code header helps.

Finally, could you confirm columns 10 and 11?

Zhiliang

AlexReynolds

Re: bed format question
« Reply #6 on: November 26, 2013, 01:19:08 AM »
"If present, genotype data in FORMAT and subsequence sample IDs are placed into tenth and subsequent columns"
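Put together with the example row from the first post, the layout appears to be: VCF CHROM, POS-1, POS, ID, QUAL, REF, ALT, FILTER, INFO, FORMAT, then one column per sample. The Python sketch below labels a row under that assumption. The label names and the function `label_vcf2bed_row` are made up for illustration and are not taken from the vcf2bed script, and the INFO field is abbreviated.

```python
# Assumed column order of a vcf2bed row, inferred from the example in this thread.
VCF2BED_COLUMNS = [
    "chrom", "start", "end", "id", "qual",
    "ref", "alt", "filter", "info", "format",
]


def label_vcf2bed_row(line):
    """Return a dict of labeled columns, with per-sample FORMAT values split out."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF2BED_COLUMNS, fields))
    # Column 10 names the colon-separated values found in columns 11 onward.
    keys = record.get("format", "").split(":")
    record["samples"] = [dict(zip(keys, s.split(":"))) for s in fields[10:]]
    return record


row = ("Chr1\t570\t571\t.\t106.56\tT\tC\tPASS\t"
       "AC=1;AF=0.50;AN=2;DP=6\tGT:AD:DP:GQ:PL\t0/1:1,5:5:29.68:137,0,30")
record = label_vcf2bed_row(row)
print(record["ref"], record["alt"], record["samples"][0]["GT"])
```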

zhilianghu

Re: bed format question
« Reply #7 on: December 12, 2013, 11:27:59 AM »
I have a new question on the same "bed format" thread:

When a VCF file contains data for multiple individuals, I fail to see how the individual IDs are preserved in the converted .bed file. Or am I missing something?

Thanks,
Zhiliang

sjn

Re: bed format question
« Reply #8 on: December 12, 2013, 12:38:03 PM »
Hi Zhiliang,
Can you post the first 100 lines of the file you are trying to convert, or attach the file itself if it is not very big?

Shane

AlexReynolds

Re: bed format question
« Reply #9 on: December 12, 2013, 09:36:00 PM »
You're probably not missing anything. Mixed variants and ID names for alternate alleles are handled better in vcf2bed in the 2.4 release of BEDOPS, which we hope to release very soon. Though you may still want to post a snippet of your input data just in case I am misunderstanding your issue.
« Last Edit: December 13, 2013, 10:39:08 AM by AlexReynolds »

zhilianghu

Re: bed format question
« Reply #10 on: December 19, 2013, 06:38:54 PM »
Thanks Alex -- You are right: I am not losing information, only missing the individual IDs.  For example, column 10 holds "format" info like "GT:AD:DP:GQ:PL", followed by multiple columns of "data", one for each individual, as in:
column 11: 1/1:0,8:8:23.98:210,24,0
column 12: 0/1:7,1:8:5.57:6,0,198
column 13: /1:0,11:11:29.98:267,30,0
column 14: . . . .
I fail to find the individual "ID"s in the converted BED file, although I can match them up manually from the original VCF file.
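The manual matching described above could be scripted along these lines. This is a rough Python sketch, assuming the original VCF is still available and that the sample columns appear in the BED file in the same order as in the VCF `#CHROM` header line. The function names and the sample IDs `animal_A` and `animal_B` are invented for the example, and the data row is made up from values quoted in this thread.

```python
def sample_ids_from_vcf(vcf_lines):
    """Return the sample IDs listed after FORMAT on the '#CHROM' header line."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            # The first nine header fields are the fixed VCF columns up to FORMAT.
            return line.rstrip("\n").split("\t")[9:]
    return []


def attach_sample_ids(bed_line, sample_ids):
    """Pair each sample ID with its genotype column (BED columns 11 onward)."""
    fields = bed_line.rstrip("\n").split("\t")
    return dict(zip(sample_ids, fields[10:]))


# Hypothetical VCF header and converted row.
vcf_header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tanimal_A\tanimal_B\n",
]
bed_row = ("Chr1\t570\t571\t.\t106.56\tT\tC\tPASS\tAC=1\tGT:AD:DP:GQ:PL\t"
           "1/1:0,8:8:23.98:210,24,0\t0/1:7,1:8:5.57:6,0,198")

ids = sample_ids_from_vcf(vcf_header)
print(attach_sample_ids(bed_row, ids))
```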

Thanks,
Zhiliang

AlexReynolds

Re: bed format question
« Reply #11 on: December 23, 2013, 02:43:13 PM »
The FORMAT column itself will map into the 10th column of the BED file. However, header data are stripped because they interfere with BED operations, so vcf2bed will lose descriptive metadata for FORMAT fields. If you keep the original VCF file, you can recover the original headers and thus rebuild the VCF file or otherwise retrieve metadata from the headers.

sjn

Re: bed format question
« Reply #12 on: December 23, 2013, 05:09:09 PM »
It's hacky, and some may not like it, but have you considered adding a dummy chrom/start/end to header lines to keep them around in a bed file?  Maybe add a --keep-header option to your scripts?

header_info  0  1  # actual header information?

Shane
« Last Edit: December 24, 2013, 11:16:23 AM by sjn »

AlexReynolds

Re: bed format question
« Reply #13 on: December 25, 2013, 10:44:45 PM »
This seems like a workable idea, so long as the choice of chromosome name doesn't collide with an existing name. Maybe header_info for the default --keep-header and some other name for --keep-header <custom-name>? Let me stew on it.

sjn

Re: bed format question
« Reply #14 on: December 26, 2013, 08:37:33 AM »
It's ugly, but it gets back to supporting the 20 fully convertible formats out there.  If you do it, just remember to give every header line different coordinates so that the header lines stay in their original order.  The biggest problem is that the header info might not be at the top of the file after sorting.

header_info 0 1 # first line of header
header_info 1 2 # second line of header
...

You could try to keep header info at the top.  Actually, ASCII '#' sorts before any letter or digit, so you could use:
# 0 1 # first line of header
# 1 2 # second line of header.
...

where # is the 'chromosome' name.  The one thing that might be odd though is that we have the silly --header options in the other utils that look for this.  Sillier still is that it's just an alias to --ec.  But, I have no qualms about removing the checks for header info crud.  sort-bed strips such things currently so we'd probably remove that silliness altogether if you go this route.
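The numbering proposal can be sketched in a few lines of Python. This only illustrates the idea discussed here; it is not how vcf2bed or sort-bed behave, and `headers_to_bed_rows` and `bed_sort_key` are hypothetical names. The sort is a plain lexicographic comparison on the chromosome name followed by numeric start and end, which is an assumption about the sorter.

```python
def headers_to_bed_rows(header_lines, chrom="#"):
    """Turn header lines into pseudo-BED rows: (chrom, i, i + 1, text)."""
    rows = []
    for i, text in enumerate(header_lines):
        # Consecutive coordinates preserve the original header order after sorting.
        rows.append((chrom, i, i + 1, text.rstrip("\n")))
    return rows


def bed_sort_key(row):
    """Sort by chromosome name (as text), then by start and end (as numbers)."""
    return (row[0], row[1], row[2])


headers = ["##fileformat=VCFv4.1", "#CHROM\tPOS\tID"]  # example header lines
data = [("Chr1", 570, 571, ".")]

rows = sorted(data + headers_to_bed_rows(headers), key=bed_sort_key)
for row in rows:
    print("\t".join(str(field) for field in row))
```

With '#' as the chromosome name the header rows sort ahead of "Chr1"; with a name such as header_info they would sort after it, which is the ordering problem raised above.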

Shane
« Last Edit: December 26, 2013, 08:39:31 AM by sjn »