Topic: Conversion errors with gff2bed

AlexReynolds

Conversion errors with gff2bed
« on: December 04, 2012, 03:24:21 PM »
TAIR9 and TAIR10 GFF3 data from Arabidopsis.org appear to fail validation because of trailing semicolons in the attributes field (the ninth column). For example:

Chr1    TAIR9    five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1
Chr1    TAIR9    CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;


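Unlike the first line, the second line ends with a trailing semicolon that shouldn't be there. The traceback suggests that the empty entry after that final semicolon has no key=value pair to unpack, so if you run this through gff2bed.py, you'll get the following error: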

Traceback (most recent call last):
 File "gff2bed.py", line 103, in <module>
   sys.exit(main(*sys.argv))
 File "gff2bed.py", line 67, in main
   attrd[attr[0]] = attr[1]
IndexError: list index out of range
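
For anyone curious about the failure itself, here is a minimal Python sketch of attribute parsing that tolerates the trailing semicolon. It is not the gff2bed.py source: the function name parse_attributes is hypothetical, and the sketch assumes the attributes column is split on ';' and then on '='. Under that assumption, the empty string left after the final ';' splits into a one-element list, which would explain the IndexError on attr[1].

```python
def parse_attributes(field):
    """Parse a GFF3 attributes column into a dict, skipping empty entries.

    Hypothetical helper for illustration; not taken from gff2bed.py.
    """
    attrs = {}
    for pair in field.strip().split(";"):
        if not pair:
            # A trailing ';' leaves an empty string after the split; skip it.
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            # No '=' present, so this is not a key=value pair; ignore it.
            continue
        attrs[key] = value
    return attrs


# The two attribute strings from the TAIR9 example lines.
clean = parse_attributes("Parent=AT1G01010.1")
trailing = parse_attributes("Parent=AT1G01010.1,AT1G01010.1-Protein;")
print(clean)
print(trailing)
```

Both calls return a dictionary with a single Parent key, so the trailing semicolon no longer matters to the parse.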


The easiest fix is to strip the trailing semicolon from each line and pipe the corrected output to the conversion step, e.g.:

$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | gff2bed.py | sort-bed - > TAIR9_GFF3_genes.bed
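
If awk is not available, the same cleanup can be sketched in Python. This is only an illustration of what the awk step does (remove one ';' at the very end of each line); the function names strip_trailing_semicolon and filter_lines are made up for this example, and the sample lines assume tab-delimited columns.

```python
def strip_trailing_semicolon(line):
    """Remove a single ';' at the end of a line, keeping its line ending."""
    body = line.rstrip("\r\n")
    newline = line[len(body):]
    if body.endswith(";"):
        body = body[:-1]
    return body + newline


def filter_lines(lines):
    """Yield each input line with any trailing ';' removed."""
    for line in lines:
        yield strip_trailing_semicolon(line)


# The two TAIR9 example lines, written with tab-separated columns.
sample = [
    "Chr1\tTAIR9\tfive_prime_UTR\t3631\t3759\t.\t+\t.\tParent=AT1G01010.1\n",
    "Chr1\tTAIR9\tCDS\t3760\t3913\t.\t+\t0\tParent=AT1G01010.1,AT1G01010.1-Protein;\n",
]
fixed = list(filter_lines(sample))
for line in fixed:
    print(line, end="")
```

To use it as a filter in a pipeline, pass sys.stdin to filter_lines and write each result to sys.stdout instead of using the in-memory sample.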

If you have difficulties converting GFF files, please run your data through the modENCODE GFF3 Validator at: http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

If your data pass validation but still fail to convert, please follow up with us on the forum and we will be happy to assist.