Nucleic Acids Res. 2025 Jan 11;53(2):gkae1310.

doi: 10.1093/nar/gkae1310.

GENA-LM: a family of open-source foundational DNA language models for long sequences

Veniamin Fishman¹,², Yuri Kuratov¹,³, Aleksei Shmelev¹,⁴, Maxim Petrov¹, Dmitry Penzar¹, Denis Shepelin¹, Nikolay Chekanov¹, Olga Kardymon¹, Mikhail Burtsev⁵

Affiliations

¹ AIRI, Presnenskaya embankment, 6 st22, Moscow, 123112, Russia.
² Institute of Cytology and Genetics, Prospekt Akademika Lavrent'yeva, 10, Novosibirsk, 630090, Russia.
³ Moscow Institute of Physics and Technology, 9 Institutskiy per., Dolgoprudny, Moscow, 141701, Russia.
⁴ HSE University, International laboratory of statistical and computational genomics, Moscow, 109028, Russia.
⁵ London Institute for Mathematical Sciences Royal Institution, 21 Albemarle St, London W1S 4BS, UK.

PMID: 39817513
PMCID: PMC11734698
DOI: 10.1093/nar/gkae1310

Veniamin Fishman et al. Nucleic Acids Res. 2025.

Abstract

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs. Notably, integrating the newly developed recurrent memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, including multispecies and taxon-specific models, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and on HuggingFace (https://huggingface.co/AIRI-Institute). In addition, we provide a web service (https://dnalm.airi.net/) allowing user-friendly DNA annotation with GENA-LM models.

Figures

**Figure 1.**
The GENA-LM family of foundational DNA language models. (A) The GENA-LM transformer-based architecture is pre-trained on DNA sequences using an MLM objective. GENA-LMs encompass a variety of models that differ in their pre-training data and architecture, as detailed in Table 2. All models adhere to the same workflow: DNA sequences are tokenized using a BPE algorithm before being processed through transformer layers, which generate representations of the input sequences that are suitable for downstream applications. Post pre-training, this foundational DNA model incorporates a downstream task-specific head, which utilizes DNA representations to address specific genomic tasks during the fine-tuning process. (B) GENA’s evaluation tasks include predictions related to promoter and enhancer activities, splicing sites, chromatin profiles and polyadenylation site strength (not all shown). (C) Task-specific fine-tuned models can be queried via web service (https://dnalm.airi.net/). (D) Post-BPE tokenization, the median token length stands at nine bp, as reflected in the token length distribution. (E) Illustration of repetitive element representation for the 100 longest tokens. (F) GENA’s model accuracies for pre-training on the MLM task demonstrate that models with a higher parameter count achieve superior performance.
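
The BPE step named in panel (A) can be illustrated with a toy trainer. This is only a minimal sketch: it starts from single nucleotides and greedily merges the most frequent adjacent pair, which is the general BPE idea, not the GENA-LM tokenizer itself. The function names (`learn_bpe_merges`, `apply_merge`, `tokenize`), the two-sequence corpus and the merge count are all hypothetical and chosen for illustration.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (an assumption for illustration, not real training data).
merges = learn_bpe_merges(["ATATATAT", "ATATGCGC"], num_merges=2)
tokens = tokenize("ATATATGC", merges)
```

Because frequent substrings are merged first, repetitive DNA tends to end up in long tokens, which is consistent with the repetitive-element content of the longest tokens shown in panel (E).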

**Figure 2.**
GENA-LM identifies DNA motifs essential for TF binding. In panels (A)–(D), each row pertains to a distinct factor, labeled to the left. (A) Logo representation of motifs for the three TFs considered in our analysis. (B) Profile of average token importance scores over the sequence length. Vertical dashed lines demarcate the 200-bp prediction region. (C) Bars represent the frequency of token occurrences in the ‘highly important’ category (tokens with scores in the top 5th percentile). The X-axis shows the proportion of these occurrences relative to all occurrences for that token. A vertical reference line marks the 0.05 fraction threshold; only tokens exceeding this fraction are displayed. (D) Boxplots detail the distribution of importance scores for tokens, categorized by different FIMO q-values. They display the median, interquartile range as well as the 5th and 95th percentiles.
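
The filtering described for panel (C) can be sketched as follows, assuming per-occurrence importance scores are already available as `(token, score)` pairs. How those scores are computed is not stated here, so the input format, the function name `highly_important_fraction` and the toy data are assumptions. Reading "top 5th percentile" as the highest-scoring 5% of occurrences is also an interpretation.

```python
from collections import Counter


def highly_important_fraction(scored_tokens, top_fraction=0.05, min_fraction=0.05):
    """Return {token: share of its occurrences that are 'highly important'}.

    An occurrence is 'highly important' if its score is among the top
    `top_fraction` of all scores. Only tokens whose share exceeds
    `min_fraction` are returned.
    """
    if not scored_tokens:
        return {}
    scores = sorted((score for _, score in scored_tokens), reverse=True)
    n_top = max(1, int(len(scores) * top_fraction))
    cutoff = scores[n_top - 1]  # ties at the cutoff are all counted as important
    total = Counter(token for token, _ in scored_tokens)
    important = Counter(token for token, score in scored_tokens if score >= cutoff)
    fractions = {token: important[token] / total[token] for token in important}
    return {token: frac for token, frac in fractions.items() if frac > min_fraction}


# Toy input: 40 occurrences, so the top 5% corresponds to the 2 highest scores.
scored = (
    [("CCCTC", 0.9), ("CCCTC", 0.8)]
    + [("CCCTC", 0.1)] * 8
    + [("ATATA", 0.05)] * 30
)
result = highly_important_fraction(scored)
```

In this toy input only `CCCTC` has any occurrence above the cutoff, and 2 of its 10 occurrences qualify, so it passes the 0.05 threshold while `ATATA` is dropped.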

**Figure 3.**
GENA-LMs demonstrate generalization across species. (A–C) GENA-LM fine-tuned on human promoters (A), CTCF (B) or H3K27 (C) binding sites, evaluated on different species. (D) Effect of multispecies versus species-specific pre-training on promoter activity prediction.

**Figure 4.**
Sequence embeddings from pre-trained GENA-LMs facilitate species classification. t-Distributed Stochastic Neighbor Embedding (tSNE) projections (A) of sequences sampled from 27 species (B), representing a spectrum across the tree of life. (C) Classification performance for different sequence lengths plotted against divergence time. (D) Classification performance of embeddings taken from different layers of three models. Data are presented for sequence lengths of 5 kbp (for ‘gena-lm-bert-base-lastln-t2t’ and ‘gena-lm-bert-large-t2t’) and 30 kbp (for ‘gena-lm-bigbird-base-t2t’).

**Figure 5.**
Leveraging recurrent memory to enhance the input capacity of GENA-LM models yields improved performance in downstream tasks. (A) The Recurrent Memory Transformer (RMT) architecture. The model's vocabulary is augmented with a memory token, denoted ‘mem’ in the figure. The memory-augmented model is fine-tuned to write relevant information into memory tokens and pass it to subsequent segments. (B) Augmentation of GENA-LM with RMT for 3× (left), 8× (center) and 50× (right) larger sequence lengths. Models with memory achieve superior results in splice site annotation and promoter prediction tasks when compared with all other GENA-LMs, including those utilizing sparse attention (Wilcoxon test P-value ≤0.043 in all comparisons). On the species classification task, RMT with GENA-LM outperforms the HyenaDNA model designed for long sequences. RMT+P refers to models that have been not only fine-tuned with RMT but also pre-trained with it.
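
The segment-level recurrence in panel (A) can be sketched as a simple loop. This shows only the control flow, under strong assumptions: in the actual architecture the memory is a set of learned token representations processed by the transformer together with each segment. Here `toy_segment_model` is a hypothetical stand-in that keeps a running count in an integer, and all function names are illustrative.

```python
def split_into_segments(tokens, segment_len):
    """Cut a long token list into consecutive segments of at most `segment_len`."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def run_with_recurrent_memory(tokens, segment_len, segment_model, init_memory):
    """Process segments in order, passing the memory state from one to the next."""
    memory = init_memory
    outputs = []
    for segment in split_into_segments(tokens, segment_len):
        # The model reads the incoming memory and the segment, then writes new memory.
        memory, output = segment_model(memory, segment)
        outputs.append(output)
    return memory, outputs


def toy_segment_model(memory, segment):
    """Hypothetical stand-in for a transformer: memory counts 'GT' tokens seen so far."""
    new_memory = memory + sum(1 for token in segment if token == "GT")
    return new_memory, new_memory


final_memory, outputs = run_with_recurrent_memory(
    ["GT", "A", "GT", "C", "GT"],
    segment_len=2,
    segment_model=toy_segment_model,
    init_memory=0,
)
```

Each segment stays within the base model's input limit, while the memory lets information from earlier segments influence later ones. This is what allows the total input to grow by a multiple of the segment length.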

Grants and funding

AIRI
