Conversion errors with gff2bed


« on: December 04, 2012, 03:24:21 PM »
TAIR9 and TAIR10 GFF3 data seem to have some validation problems related to trailing semi-colons in the attributes field. For example:

Chr1    TAIR9    five_prime_UTR    3631    3759    .    +    .    Parent=AT1G01010.1
Chr1    TAIR9    CDS    3760    3913    .    +    0    Parent=AT1G01010.1,AT1G01010.1-Protein;

The second line, unlike the first, ends with a trailing semi-colon that shouldn't be there. If you run this through gff2bed, you'll get the following error:

Traceback (most recent call last):
 File "", line 103, in <module>
 File "", line 67, in main
   attrd[attr[0]] = attr[1]
IndexError: list index out of range
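
The traceback is consistent with an attribute parser that splits the ninth column on ';' and then each item on '='. The sketch below is an assumption about that failure mode, not the actual gff2bed source; both function names are hypothetical. A trailing semi-colon leaves an empty last item, so there is no second element to index:

```python
def parse_attributes_naive(field):
    """Hypothetical parser: split on ';' then '=', with no guard for empty items."""
    attrd = {}
    for item in field.split(";"):
        attr = item.split("=")
        # A trailing ';' yields item == "", so attr == [""] and attr[1] fails.
        attrd[attr[0]] = attr[1]
    return attrd


def parse_attributes_tolerant(field):
    """Same idea, but skip the empty item left behind by a trailing ';'."""
    attrd = {}
    for item in field.split(";"):
        if not item.strip():
            continue
        key, _, value = item.partition("=")
        attrd[key] = value
    return attrd
```

With the CDS line above, the naive version raises IndexError on the attributes field, while the tolerant version returns a single Parent entry.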

The easiest way to address this is to strip the trailing semi-colon and pipe the fixed results to the conversion step, e.g.:

$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | gff2bed | sort-bed - > TAIR9_GFF3_genes.bed
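
If awk is not available, the same cleanup can be sketched in Python. This is a minimal equivalent of the awk substitution above, assuming plain-text input with Unix line endings; the function name and the script name in the comment are hypothetical:

```python
import sys


def strip_trailing_semicolons(lines):
    """Yield each line with one trailing ';' removed, like awk's gsub(/;$/, "")."""
    for line in lines:
        text = line.rstrip("\n")
        if text.endswith(";"):
            text = text[:-1]
        yield text + "\n"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python strip_semicolons.py TAIR9_GFF3_genes.gff
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(strip_trailing_semicolons(handle))
```

The cleaned lines go to standard output, so they can be piped onward in place of the awk step.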

If you have difficulties converting GFF files, please run your data through the modENCODE GFF3 Validator at:

If your data are valid and yet still have conversion problems, please feel free to follow up with us on the forum and we will be happy to assist.