Nucleic Acids Res. 2025 Jan 11;53(2):gkae1310.
doi: 10.1093/nar/gkae1310.

GENA-LM: a family of open-source foundational DNA language models for long sequences

Veniamin Fishman et al. Nucleic Acids Res.

Abstract

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs. Notably, integrating the newly developed recurrent memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, including multispecies and taxon-specific models, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and on HuggingFace (https://huggingface.co/AIRI-Institute). In addition, we provide a web service (https://dnalm.airi.net/) allowing user-friendly DNA annotation with GENA-LM models.

Figures

Graphical Abstract
Figure 1.
The GENA-LM family of foundational DNA language models. (A) The GENA-LM transformer-based architecture is pre-trained on DNA sequences using an MLM objective. GENA-LMs encompass a variety of models that differ in their pre-training data and architecture, as detailed in Table 2. All models adhere to the same workflow: DNA sequences are tokenized using a BPE algorithm before being processed through transformer layers, which generate representations of the input sequences that are suitable for downstream applications. Post pre-training, this foundational DNA model incorporates a downstream task-specific head, which utilizes DNA representations to address specific genomic tasks during the fine-tuning process. (B) GENA’s evaluation tasks include predictions related to promoter and enhancer activities, splicing sites, chromatin profiles and polyadenylation site strength (not all shown). (C) Task-specific fine-tuned models can be queried via web service (https://dnalm.airi.net/). (D) Post-BPE tokenization, the median token length stands at nine bp, as reflected in the token length distribution. (E) Illustration of repetitive element representation for the 100 longest tokens. (F) GENA’s model accuracies for pre-training on the MLM task demonstrate that models with a higher parameter count achieve superior performance.
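The BPE tokenization step in panel (A) can be illustrated with a toy sketch. This is not the GENA-LM tokenizer: the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) and the tiny training corpus are hypothetical, and the real vocabulary is learned from genome-scale data. The sketch only shows the general idea of repeatedly merging the most frequent adjacent pair, which is why frequent DNA substrings end up as multi-nucleotide tokens.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically (an arbitrary choice).
        best_pair = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(seq)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical toy corpus, for illustration only.
merges = learn_bpe_merges(["ACACACGT", "ACACTT"], num_merges=2)
print(tokenize("ACACACGT", merges))
```

Because merges only concatenate adjacent tokens, joining the output tokens always reproduces the input sequence.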
Figure 2.
GENA-LM identifies DNA motifs essential for TF binding. In panels (A)–(D), each row pertains to a distinct factor, labeled to the left. (A) Logo representation of motifs for the three TFs considered in our analysis. (B) Profile of average token importance scores over the sequence length. Vertical dashed lines demarcate the 200-bp prediction region. (C) Bars represent the frequency of token occurrences in the ‘highly important’ category (tokens with scores in the top 5th percentile). The X-axis shows the proportion of these occurrences relative to all occurrences for that token. A vertical reference line marks the 0.05 fraction threshold; only tokens exceeding this fraction are displayed. (D) Boxplots detail the distribution of importance scores for tokens, categorized by different FIMO q-values. They display the median, interquartile range as well as the 5th and 95th percentiles.
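The selection rule described for panel (C) can be sketched as follows. This is a minimal illustration under stated assumptions: the caption does not say how importance scores are computed, so the scores here are made-up inputs, and the helper names (`percentile_threshold`, `enriched_tokens`) are hypothetical. The sketch marks token occurrences whose score falls in the top 5% as "highly important", computes for each token the fraction of its occurrences that are highly important, and keeps tokens whose fraction exceeds 0.05.

```python
from collections import Counter


def percentile_threshold(scores, top_fraction=0.05):
    """Return the score cutoff for the top `top_fraction` of all scores."""
    ranked = sorted(scores, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[k - 1]


def enriched_tokens(tokens, scores, top_fraction=0.05, min_fraction=0.05):
    """Map each token to the fraction of its occurrences that are highly important.

    Only tokens whose fraction exceeds `min_fraction` are returned.
    Scores tied with the cutoff are all counted as highly important.
    """
    cutoff = percentile_threshold(scores, top_fraction)
    total = Counter(tokens)
    important = Counter(t for t, s in zip(tokens, scores) if s >= cutoff)
    fractions = {t: important[t] / total[t] for t in important}
    return {t: f for t, f in fractions.items() if f > min_fraction}


# Made-up example: 40 token occurrences, two of which carry high scores.
example_tokens = ["CCCTC"] * 2 + ["AAAA"] * 38
example_scores = [0.9, 0.8] + [0.1] * 38
print(enriched_tokens(example_tokens, example_scores))
```

In this toy input, both occurrences of the hypothetical token "CCCTC" land in the top 5% of scores, so it is the only token reported.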
Figure 3.
GENA-LMs demonstrate generalization across species. (A–C) GENA-LM fine-tuned on human promoters (A), CTCF (B) or H3K27 (C) binding sites evaluated on different species. (D) Effect of multispecies versus species-specific pre-training on promoter activity prediction.
Figure 4.
Sequence embeddings from pre-trained GENA-LMs facilitate species classification. t-Distributed Stochastic Neighbor Embedding (tSNE) projections (A) of sequences sampled from 27 species (B), representing a spectrum across the tree of life. (C) Classification performance for different sequence lengths plotted against divergence time. (D) Classification performance of embeddings taken from different layers of three models. Data are presented for sequence lengths of 5 kbp (for ‘gena-lm-bert-base-lastln-t2t’ and ‘gena-lm-bert-large-t2t’) and 30 kbp (for ‘gena-lm-bigbird-base-t2t’).
Figure 5.
Leveraging recurrent memory to enhance the input capacity of GENA-LM models yields improved performance in downstream tasks. (A) The RMT architecture. The model's vocabulary is augmented with a memory token, denoted as ‘mem’ in the figure. The memory-augmented model is fine-tuned to write relevant information into memory tokens and pass it to subsequent segments. (B) The augmentation of GENA-LM with RMT with 3× (left), 8× (center) and 50× (right) larger sequence lengths. Models with memory achieve superior results in splice site annotation and promoter prediction tasks when compared with all other GENA-LMs, including those utilizing sparse attention (Wilcoxon test P-value ≤0.043 in all comparisons). On the species classification task, RMT with GENA-LM outperforms the HyenaDNA model designed for long sequences. RMT+P refers to models that have not only been fine-tuned with RMT, but also pre-trained with it.
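The segment-by-segment flow in panel (A) can be sketched with a toy loop. This is only an illustration of the control flow, not the RMT implementation: `toy_segment_model` is a hypothetical stand-in for the transformer, and its "memory" is a plain integer rather than learned memory-token states. What the sketch shows is that a long input is split into segments, each segment is processed together with the memory produced by the previous one, and the updated memory is passed forward.

```python
def split_into_segments(tokens, segment_len):
    """Split a token list into consecutive segments of at most `segment_len` tokens."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_with_recurrent_memory(tokens, segment_len, segment_model, initial_memory):
    """Process segments in order, threading the memory from one segment to the next."""
    memory = initial_memory
    outputs = []
    for segment in split_into_segments(tokens, segment_len):
        # The model reads the current memory and segment, then writes a new memory.
        memory, segment_output = segment_model(memory, segment)
        outputs.append(segment_output)
    return memory, outputs


def toy_segment_model(memory, segment):
    """Hypothetical stand-in for a transformer: memory counts 'GT' tokens seen so far."""
    seen = memory + sum(1 for token in segment if token == "GT")
    return seen, seen


# Made-up token sequence, processed two tokens at a time.
final_memory, per_segment = run_with_recurrent_memory(
    ["GT", "AC", "GT", "TT", "GT"],
    segment_len=2,
    segment_model=toy_segment_model,
    initial_memory=0,
)
print(final_memory, per_segment)
```

Each segment's output depends on tokens from earlier segments only through the memory, which is what lets a fixed-length model handle inputs several times longer than one segment.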
