Towards improving the genome annotation of the honey bee (Apis mellifera) (#206)
Honey bees (Apis mellifera) are important pollinators in managed agriculture as well as natural ecosystems. The sequencing and annotation of the honey bee genome have allowed proteomics to become a powerful technique for probing honey bee biology; however, one troubling trend to emerge from these studies is that honey bee samples consistently yield lower peptide identification rates than samples from other organisms. This suggests that either the genome annotation can be substantially improved, or some atypical biological process is interfering with the mass spectrometry (MS) workflow.
We used a publicly available MS dataset (PeptideAtlas; 1,472 raw files) in a proteogenomic approach to search for missing genes and new exons and to revive discarded annotations. To do this, we searched the data against a six-frame genome translation, a three-frame RefSeq RNA translation, and a database of sequences that were removed from previous annotations. We also considered unexpected post-translational modifications (PTMs), high genetic diversity, and endogenous proteolysis as alternative explanations for low peptide identification rates.
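As an illustration of the six-frame translation step, the following is a minimal sketch in Python, not the pipeline used in this work. It assumes the standard genetic code, and the function names (`six_frame_translation`, `open_segments`) and the minimum segment length of 7 residues are hypothetical choices made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation table is needed for the nuclear genome).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a DNA sequence in three forward and three reverse frames."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


def open_segments(protein, min_len=7):
    """Split a frame at stop codons and keep stretches long enough to search."""
    return [segment for segment in protein.split("*") if len(segment) >= min_len]


frames = six_frame_translation("ATGGCCTAA")
```

A three-frame translation of RNA sequences would use only the forward frames (`+1` to `+3`) of the same routine. The stop-to-stop segments from each frame would then be written to a FASTA file for the search engine.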
While we found no significant effects of PTMs, sequence diversity or proteolysis in tissues other than the gut, we did discover 1,454 new coding regions matched by two or more peptides (1% FDR), including twelve sequences that were previously annotated as non-coding RNAs. In a separate search, we matched 748 previously annotated proteins that were not retained in the current official gene set (OGS). Importantly, adding these sequences to the OGS protein database (increasing the database size by 13.5%) improved MS identification rates across tissues.
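The acceptance criterion above (two or more peptides at 1% FDR) can be sketched as follows. This is an illustration only: it assumes a target-decoy estimate of the false discovery rate, which the abstract does not specify, and the tuple layout and the function names `filter_at_fdr` and `regions_with_min_peptides` are hypothetical.

```python
from collections import defaultdict


def filter_at_fdr(psms, max_fdr=0.01):
    """Keep target matches above the lowest score cut-off meeting max_fdr.

    psms: iterable of (score, is_decoy, peptide, region) tuples, where a
    higher score means a better match (assumed layout for this sketch).
    FDR at each rank is estimated as decoys / targets (target-decoy approach).
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked matches to accept
    for rank, (_, is_decoy, _, _) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [psm for psm in ranked[:cutoff] if not psm[1]]


def regions_with_min_peptides(accepted, min_peptides=2):
    """Return regions supported by at least min_peptides distinct peptides."""
    peptides_by_region = defaultdict(set)
    for _, _, peptide, region in accepted:
        peptides_by_region[region].add(peptide)
    return {
        region
        for region, peptides in peptides_by_region.items()
        if len(peptides) >= min_peptides
    }


# Toy data: region r1 has two confident peptides, r2 only one.
example = [
    (10.0, False, "PEPA", "r1"),
    (9.0, False, "PEPB", "r1"),
    (8.0, False, "PEPC", "r2"),
    (7.0, True, "DECOY", "d1"),
    (6.0, False, "PEPD", "r2"),
]
accepted = filter_at_fdr(example)
supported = regions_with_min_peptides(accepted)
```

Requiring two distinct peptides per region guards against a single chance match being reported as a new coding region.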
Using this proteogenomic strategy, we have improved the completeness of the honey bee genome annotation. The information we present here can facilitate further research on this important insect and this workflow template can be used to aid the genome annotation of other under-studied species.