Large genome model: Open source AI trained on trillions of bases
Airfind news item
By John Timmer
Published on March 4, 2026.
Evo 2, an open source AI system, has been developed by the team behind Evo, an earlier model trained on massive numbers of bacterial genomes. When prompted with sequence from a specific gene, Evo could identify other genes from the same cluster of related genes or suggest a new protein. Extending this beyond bacteria is a challenge: bacteria tend to cluster related genes together, which is not true in organisms with complex genome structures.
Evo 2 has been trained on genomes from all three domains of life (bacteria, archaea, and eukaryotes) and has developed internal representations of key features in complex genomes, such as regulatory DNA and splice sites. Training took place in two stages. The first focused on teaching the system to identify important genome features by feeding it sequences rich in those features, in chunks about 8,000 bases long. The second fed it sequences of up to a million bases at a time so it could identify large-scale genome features.
The researchers trained two versions of the system on a dataset called OpenGenome2, which contains 8.8 trillion bases drawn from the various domains of life. Even with evolutionary comparisons and conserved sequences to draw on, identifying these features required training on massive amounts of data, along with the computing time to process it, and creating an effective AI training program remained a challenge.
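To make the two context lengths concrete, here is a minimal sketch of cutting a sequence into fixed-size windows, first at about 8,000 bases and then at up to a million. It is an illustration only, not the Evo 2 data pipeline: the function name `chunk_sequence`, the two constants, the non-overlapping windows, and the toy genome are all assumptions made for this example.

```python
from typing import Iterator

# Context lengths taken from the article's description of the two stages.
STAGE_1_CONTEXT = 8_000      # "about 8,000 bases"
STAGE_2_CONTEXT = 1_000_000  # "a million bases at a time"


def chunk_sequence(sequence: str, context_length: int) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of at most context_length bases.

    Hypothetical helper: the real training code may sample or overlap windows.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    for start in range(0, len(sequence), context_length):
        yield sequence[start:start + context_length]


# Toy stand-in for a genome: 20,000 bases of a repeating motif.
genome = "ACGT" * 5_000

stage_1_chunks = list(chunk_sequence(genome, STAGE_1_CONTEXT))
stage_2_chunks = list(chunk_sequence(genome, STAGE_2_CONTEXT))

print([len(chunk) for chunk in stage_1_chunks])
print([len(chunk) for chunk in stage_2_chunks])
```

With the short window, the toy genome is split into several pieces, so any feature spanning more than 8,000 bases is never seen whole. With the long window it fits in a single piece, which is the kind of view needed for large-scale genome features.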