TAIR9 and TAIR10 GFF3 data from Arabidopsis.org seem to have some validation problems related to trailing semi-colons in the attributes field. For example:
Chr1 TAIR9 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
Chr1 TAIR9 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein;
The second line (as compared with the first) has a trailing semi-colon that shouldn't be there. If you run this through gff2bed.py, then you'll get the following error:
Traceback (most recent call last):
  File "gff2bed.py", line 103, in <module>
    sys.exit(main(*sys.argv))
  File "gff2bed.py", line 67, in main
    attrd[attr[0]] = attr[1]
IndexError: list index out of range
The easiest way to address this is to strip the trailing semi-colon from each line and pipe the fixed results to the conversion step, e.g.:
$ awk '{gsub(/;$/,"");print}' TAIR9_GFF3_genes.gff | gff2bed.py | sort-bed - > TAIR9_GFF3_genes.bed
If you have difficulties converting GFF files, please run your data through the modENCODE GFF3 Validator at:
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online
If your data are valid and yet still have conversion problems, please feel free to follow up with us on the forum and we will be happy to assist.
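For readers who prefer to do the clean-up in Python rather than awk, the following is a minimal sketch of the same fix. It also illustrates one plausible reason for the IndexError: if the attributes field is split on ";" and each piece is then split on "=", a trailing semi-colon leaves an empty piece with no value at index 1. The internals of gff2bed.py are not shown here beyond the traceback, so that explanation is an assumption, and the function names below (strip_trailing_semicolon, fixed_lines, parse_attributes) are hypothetical and not part of gff2bed.py.

```python
def strip_trailing_semicolon(line):
    """Remove one trailing semi-colon from a line, preserving its line ending."""
    body = line.rstrip("\r\n")
    ending = line[len(body):]
    if body.endswith(";"):
        body = body[:-1]
    return body + ending


def fixed_lines(lines):
    """Yield each input line with any trailing semi-colon removed."""
    for line in lines:
        yield strip_trailing_semicolon(line)


def parse_attributes(field):
    """Parse a GFF3 attributes field into a dict, skipping empty entries.

    Assumption: entries are separated by ';' and written as key=value.
    """
    attrs = {}
    for pair in field.split(";"):
        if not pair:
            # An empty entry is what a trailing semi-colon leaves behind.
            continue
        key, _, value = pair.partition("=")
        attrs[key] = value
    return attrs
```

To use this as a filter in place of the awk step, one option is to import sys and call sys.stdout.writelines(fixed_lines(sys.stdin)). The parse_attributes function is only there to show a parser that tolerates the trailing semi-colon; it does not change how gff2bed.py itself behaves.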