Article
Open access
Published: 11 March 2026

Retentive Network promotes efficient RNA language modeling of long sequences

Communications Biology volume 9, Article number: 575 (2026)

Abstract

The latent features of RNA sequences are crucial for our understanding of their functions. Thus, Transformer-based nucleotide language models have received widespread attention; however, the O(n²) complexity of the Transformer limits their ability to process long sequences. In this work, we propose RNAret, an RNA language model based on the Retentive Network, which achieves training parallelism, low computational overhead, and long-sequence processing through a retention mechanism with O(n) complexity. We pretrain RNAret using a self-supervised masked language modeling approach on 29.8 million RNA sequences. Experiments demonstrate the merit of RNAret as an RNA language model, achieving superior performance on a range of tasks, including RNA-RNA interaction prediction, RNA secondary structure prediction, and mRNA/lncRNA classification. RNAret shows strong potential for extracting latent features from RNA sequences and advancing our understanding of RNA biology.

Introduction

RNA is a fundamental component of the central dogma, playing a vital role in gene expression, protein synthesis, and various regulatory processes¹. The ability to accurately predict RNA structure and function from its sequence holds profound biological significance and broad applicability. Unlike DNA, which is typically double-stranded, RNA is predominantly single-stranded and has various types. This diversity leads to more intricate and varied patterns and dependencies within RNA sequences compared to DNA, making their efficient and accurate analysis a formidable challenge². Traditional experimental methods for determining the features of massive RNA sequences are often costly and time-intensive, which has drawn attention to the development of machine learning methods to analyze RNA sequences.

Conventional machine learning methods for nucleotide sequences typically rely on manual feature engineering to capture essential information for specific tasks. However, this approach requires the construction of task-specific feature sets, and the feature space is difficult to generalize across different tasks³. Consequently, each RNA-related task requires tailored feature design, limiting the scalability and transferability of the models.

Transformer-based language models⁴ have achieved remarkable success in natural language processing and autoregressive tasks⁵. Transformer architectures include encoder-only (e.g., BERT), decoder-only (e.g., GPT), and encoder-decoder (e.g., T5), each employing different attention masking mechanisms. While decoder-only models are suitable for autoregressive text generation and encoder-decoder models are designed for sequence-to-sequence mapping tasks, encoder-only models are particularly suited for representation learning, thus attracting significant interest from researchers in the life sciences. Nucleotides, represented by “A, T/U, C, G”, form a unique language system that encodes the genetic information of organisms. Language models are well-suited to capture the conditional distributions of patterns within nucleotide sequences, thereby modeling their intricate dependencies. RNA-FM⁶, RNA-MSM⁷, RNAErnie⁸, and RhoFold+⁹, as encoder-only RNA language models, have demonstrated advanced capabilities in various tasks. However, the O(n²) time and space complexity of the Transformer architecture poses challenges when dealing with long sequences. Some models, such as uni-RNA¹⁰ and RiNALMo¹¹, employ FlashAttention¹²,¹³ to optimize memory usage and computational pipelines, improving efficiency to some extent. Genomic models Enformer¹⁴ and Evo¹⁵ use convolutional layers to compress data and expand the receptive field, enabling longer context lengths. However, these approaches do not fundamentally address the computational cost of the Transformer when processing long sequences.

To overcome these limitations, we propose RNAret, a pretrained RNA language model based on the Retentive Network (RetNet) architecture¹⁶, which can be fine-tuned for various downstream tasks. RetNet employs a retention mechanism that enables parallel training, low-cost inference, and strong performance, making it particularly effective for modeling long sequences. RNAret is a lightweight and efficient model with 12 million parameters, making it accessible to academic teams that work with consumer-grade GPUs and have fewer computational resources than commercial entities.

To assess the interpretability and biological relevance of RNAret, we evaluated its performance across multiple sub-tasks, with topics related to structure, function, and type. Our analysis of both pretraining and downstream tasks demonstrates that RNAret effectively captures the features of RNA sequences. As an RNA language model, RNAret directly extracts high-dimensional features and generates task-general embeddings, thereby eliminating the need for manually designing specific features in new downstream tasks. The results confirm the feasibility of using an encoder-only architecture based on Retentive Network for RNA language modeling and highlight its efficacy in addressing biological challenges. This work underscores the potential of leveraging advancements in large language models for biological applications, revealing the complex features embedded within RNA sequences of varying types and functions through the application of advanced algorithmic design.

Results

Self-supervised pretraining and feature extraction for RNA sequences

We developed an RNA language model based on the Retentive Network with an encoder representation architecture, in three settings of $k\in \{1,3,5\}$. Here, $k$ refers to the k-value in the K-mer tokenizer and vocabulary. During pretraining, we employed the masked language modeling (MLM) self-supervised approach on the RNAcentral database¹⁷, where the input consisted solely of RNA sequences without RNA type annotations (Fig. 1). For an RNA sequence of length $L$, the RNAret pretraining model generates an embedding matrix of dimensions $L\times \mathrm{HiddenDim}$. The initial vocabulary sizes for $k\in \{1,3,5\}$ differ, resulting in significant variations in their initial loss values. As training progresses, the pretraining models with different k-values gradually converge to a loss of approximately 0.40 (Fig. 2b), which indicates that RNAret gradually learns statistical patterns of sequence structures.

**Fig. 1: Overview of the design of RNAret model and its pretraining and application pipelines.**

**Fig. 2: RNAret captures RNA features and patterns after pretraining.**

Our analysis compares the RNA embeddings generated by the pretrained RNAret model against those from a randomly initialized model, using 5-mer statistics (feature dimension: 1024)¹⁸ as a reference benchmark. For model embeddings, we average the RNA embedding features (feature dimension: 384). We subsample up to 10,000 for each RNA type from the RNAcentral dataset.
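The two feature types compared here can be illustrated with a minimal sketch. It assumes that the 5-mer statistics are normalized counts over all $4^{5}=1024$ possible 5-mers and that the model embedding is averaged over the length axis; the function names `kmer_frequencies` and `mean_pool` are hypothetical, and whether the reference statistics are raw or normalized counts is not stated in the text.

```python
from itertools import product


def kmer_frequencies(seq: str, k: int = 5) -> list[float]:
    """Normalized counts of all 4**k k-mers over the alphabet ACGU.

    Assumption: counts are normalized to frequencies; windows that
    contain a non-standard base are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def mean_pool(embeddings: list[list[float]]) -> list[float]:
    """Average an L x d embedding matrix over the length axis."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]
```

For k = 5 the frequency vector has 1024 entries, which matches the feature dimension quoted above, while mean pooling maps an $L\times 384$ embedding matrix to a single 384-dimensional vector.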

The embedding features extracted by the pretrained 5-mer RNAret cluster ncRNAs with similar types and functions in the dimensionality-reduced space, including atlases of all abundant RNA types in the pretraining dataset (Fig. 2c), 6 types of long RNA (Fig. 2d), and 5 types of small regulatory RNA (Fig. 2e). Although RNAret only learns from the masked representations of RNA sequences, without being exposed to RNA type annotations, it still successfully captures the distinctions among RNA features, with its extracted embeddings showing a more organized distribution in the low-dimensional space.

Quantitative evaluation of the t-SNE dimensionality reduction¹⁹ includes three metrics: Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index (Supplementary Table 1). Bold metrics indicate the best performance within the group. The pretrained RNAret consistently outperforms the randomly initialized model on these clustering metrics, showing that pretraining lets it distinguish the features of different RNA types. The 5-mer statistics achieve the best Davies-Bouldin Index for long RNAs, which may be attributed to their prior extraction of RNA length information. Another pretrained RNA language model, RNA-FM, has 640-dimensional embedding features and a larger number of parameters. It performs well on small RNAs but shows lower embedding performance than RNAret on long sequences and the global RNA atlas. These results demonstrate that RNAret effectively uncovers nucleotide (and motif) correlations and conditional probabilities, especially in long RNAs, capturing RNA types and their functional characteristics through unsupervised learning. Therefore, RNAret is well positioned to perform well in downstream tasks.

miRNA-mRNA interaction prediction

To assess the performance of RNAret in RNA-RNA interaction tasks, we employed the MirTarRAW dataset²⁰,²¹,²², which comprises 13,860 positive pairs and 13,860 negative pairs. Each pair contains a miRNA sequence and an mRNA 3’UTR sequence, as most miRNA target sites are located in the mRNA 3’UTR region. We use 72% of the dataset as the training set, 8% as the validation set, and 20% as the test set. We further use the DeepMirTarLeft dataset as an independent dataset, which comprises 443 positive pairs and 385 negative pairs.

We benchmarked our model against baseline methods, including DeepMirTar²⁰ and the RNA language models RNABERT²³, RNA-MSM⁷, RNA-FM⁶ and RNAErnie⁸, which also use embeddings of RNA sequences as features without incorporating structural or type information (RNA-MSM additionally introduces multiple sequence alignment information) (Table 1). RNAret, particularly the 5-mer model, outperformed other approaches across various metrics without feature engineering or the design of complex classifiers. In particular, 5-mer RNAret achieved an F1 score of 0.962 and an accuracy of 0.962 on the MirTarRAW dataset. The [CLS] pooled features extracted by the fine-tuned 5-mer RNAret effectively distinguish positive and negative samples in the Principal Component Analysis subspace (Fig. 3a). This observation highlights the ability of the RNAret language model to represent RNA sequences.

**Fig. 3: Interpretability of RNAret in miRNA-mRNA interactions prediction.**

Table 1 Performance of RNAret on miRNA–mRNA interaction prediction task

Full size table

We calculated the average retention score map across all samples for 5-mer RNAret (Fig. 3b). The high retention scores of neighboring tokens likely reflect the need for continuous base pairing between miRNA and mRNA sequences, and vice versa. The retention scores of the [CLS] token across different positions reveal which regions contribute more significantly to the [CLS] feature representation (Fig. 3c). Higher retention scores are detected at positions 2-7, which correspond to the seed region of the miRNA, an essential segment for the binding of miRNA to mRNA²⁴. After examining the dataset by browsing the starBase²⁵, we confirm that the peak around position 55 is close to the complementary pairing site of the seed region.

RNA secondary structure prediction

We employed two widely used benchmarks for RNA secondary structure prediction. Benchmark 1 includes the RNAStrAlign²⁶ and ArchiveII²⁷ datasets. We identified redundancy within and between both datasets, which led to data leakage. To address this issue, we removed duplicate sequences within and across the datasets, and only retained samples with fewer than 600 nucleotides. Afterwards, we still had 20,527 samples from RNAStrAlign, which we split into training and validation sets at a 9:1 ratio, and 1574 samples from ArchiveII, which we reserved as the test set. Benchmark 2 includes the bpRNA-1m dataset²⁸, which consists of three distinct subsets: TR0 (10,814 structures) for training, TV0 (1300 structures) for validation, and TS0 (1305 structures) for testing. By employing these deduplicated datasets, we were able to conduct a fairer and more accurate comparison.

We compared RNAret and baseline models on the benchmarks mentioned above. Unlike our other experimental results, in this particular task, 1-mer RNAret achieved better performance. This may be explained by the fact that RNA secondary structure prediction requires assessment of RNA base-pair interactions. Compared to base pairs (16 possible combinations, excluding [UNK]), K-mer pairs ($4^{2k}$ possibilities) have significantly lower abundance in the dataset, making it more challenging to train the model effectively. The 1-mer RNAret performs well across all metrics, particularly in F1 score and precision, implying that it generates fewer incorrect base pairings (Table 2).

Table 2 Performance of RNAret on RNA secondary structure prediction task

Full size table

For 1-mer RNAret, the average F1 scores vary across different RNA families in the ArchiveII dataset. Similar to other models, 1-mer RNAret has the weakest performance in the 23S rRNA family, which may be due to the absence of this family in the training set (Fig. 4a). The difficulty in achieving cross-family generalization for RNA secondary structure prediction through deep learning is widely observed²⁹. Nevertheless, RNAErnie demonstrated a higher F1 score compared to the widely used baseline UFold, both in the overall evaluation and across multiple RNA families (Supplementary Fig. 1). We also illustrate the relation between the F1 scores and sequence lengths, indicating that RNAret performs better on shorter sequences, a trend consistent with other models (Fig. 4b).

**Fig. 4: Detailed performance evaluation of RNAret in secondary structure prediction on the ArchiveII dataset.**

We display the diagonalized logits output of 1-mer RNAret for two samples (16S rRNA H. volcanii domain 4 and RNase P RNA R. norvegicus), as well as the probability maps obtained through the sigmoid activation, and the contact maps after post-processing (Fig. 4c, e). These results demonstrate that RNAret produces robust probability maps with less noise. We compare the predictions of 1-mer RNAret with those from the UFold web server³⁰ and the RNAfold web server³¹, along with the ground truths, visualized with Forna³² (Fig. 4d, f). RNAret’s predictions are closer to the real structures, especially in the case of RNase P RNA R. norvegicus. While both UFold and RNAfold struggle with this challenging structure, RNAret still manages to reconstruct its secondary structure to a good extent.

These results show the ability of the RNAret language model in structural modeling. Additionally, the post-processing module we employed, particularly the idea of solving an assignment problem, although it differs from the commonly used dynamic programming algorithms³³, has shown good feasibility.

mRNA/lncRNA classification

Predicting the coding potential of transcripts is a fundamental and crucial problem in biology. The innovative architecture of RNAret enables it to process long sequences efficiently without truncation or segmentation. Here, we utilize the lncRNA_H and lncRNA_M datasets from RNAErnie, which are derived from protein-coding transcripts and lncRNA sequences of human and mouse in GENCODE³⁴. The human-derived lncRNA_H dataset contains 77,778 training samples, 8641 validation samples, and 21,605 test samples. The mouse-derived lncRNA_M dataset contains 37,765 training samples, 4196 validation samples, and 10,491 test samples. Notably, both datasets contain partial-length transcripts with incomplete CDS regions. As described in the Methods section, we evaluate the sequence-level classification performance of RNAret against the baseline models.

We perform the comparison between CPC2³⁵, CPAT³⁶ and RNA language models on the lncRNA_H and lncRNA_M datasets (Table 3). The absence of start or stop codons in partial-length transcripts limits the performance of conventional machine learning approaches that depend on identifying the longest open reading frame. RNABERT, constrained by its 440-nucleotide input limitation, fails in this context. In contrast, RNA-FM and RNA-MSM, with their 1024-nucleotide capacity, successfully capture substantial sequence patterns. RNAErnie addresses the long-sequence challenge through segmentation of sequences and aggregation of logits. Meanwhile, RNAret’s architecture naturally supports long-sequence processing, enabling it to achieve superior performance, particularly on the lncRNA_H dataset, with an accuracy of 0.948.

Table 3 Performance of RNAret on mRNA/lncRNA classification task

Full size table

The RNAret 5-mer fine-tuned model reveals band-like patterns in retention scores (Fig. 5a). This observation prompted us to calculate column-wise averages of retention scores for deeper analysis. We found that the positions with the 1% highest retention scores exhibited non-random, biologically meaningful patterns. Among these, stop codons (UAG, UAA, UGA) stood out with significantly high retention scores, despite their relatively low frequency in CDS. However, the start codon (AUG) receives only moderately high scores, likely because most AUG codons are not the actual initiation signals of CDS due to their high frequency³⁷ (Fig. 5b).

**Fig. 5: Analysis of high retention score motifs and codons in mRNA/lncRNA classification.**

We further examined the 9-mers with high retention scores and discovered a striking presence of low-complexity regions. Humans and mice exhibited similar motif distributions, with relatively lower variations in base changes within ±2 positions (Fig. 5c). Motifs such as “GCCGCCGCC”, “AAAAACAAA”, and “AAAUAAAAA” were particularly prominent, especially those characterized by extended poly-A stretches. These regions may play an important role in differentiating coding from non-coding regions in RNA sequences. Unbiased analysis of sequence and context preferences has shown that human RNA-binding proteins (RBPs) tend to bind low-complexity RNA motifs³⁸.

The computational efficiency of RNAret

To highlight the computational efficiency of RNAret, we assessed its data processing speed during the training and inference phase of our experiments. In this section, we collected the time cost of the 5-mer model. All experiments were conducted on a single A800 80GB GPU with mixed-precision autocast enabled. In the pretraining and mRNA/lncRNA classification tasks, we fully utilized the GPU memory to its maximum capacity.

During the pretraining phase, our model achieved a throughput on the order of 10⁵ tokens per second (TPS), suggesting that it can handle roughly $1.5\times 10^{5}$ bases or K-mers per second under experimental conditions. In downstream tasks, computational efficiency drops due to the additional classifiers. In RNA secondary structure prediction in particular, the classifier involves feature concatenation and 2D convolution, which notably reduces throughput (Supplementary Table 2).

During inference, the absence of gradient backpropagation allows the process to run approximately twice as fast as training. However, in RNA secondary structure prediction, the additional post-processing module, which requires solving an assignment problem, becomes the primary computational bottleneck due to the O(n³) time complexity of the Hungarian algorithm (Supplementary Table 3).

We report the runtime efficiency of RNA-FM and RNA-MSM under the same environment (Supplementary Table 4). Both models exhibit lower TPS than RNAret during training and inference, and their capability to handle long sequences is also limited. In summary, RNAret manifests significant computational efficiency while maintaining good performance in biological tasks, enabling cost-effective processing of large-scale biological data.

Discussion

Retentive Network is regarded as an innovative and promising large language architecture, and we see its potential for biological applications, inspired by the idea that the retention mechanism may share similarities with the interaction between nucleotides/motifs. In this study, we develop RNAret, an innovative RNA language model built upon the RetNet. Our work introduces the successful implementation of a bidirectional Retentive Network as an Encoder representation for constructing RNA language models.

We explored different K-mer tokenization strategies and observed that larger k-values generally performed better (except in our RNA secondary structure prediction task), aligning with the findings of DNABERT³⁹. This phenomenon likely stems from the ability of longer K-mers to suitably capture interactions between adjacent bases.

RNAret offers multi-purpose applications, serving as a tool for extracting RNA embeddings and a foundation for task-specific adaptation through downstream classifiers. RNAret boasts advantageous characteristics: it has a lightweight structure, ensuring efficient training and inference processes, while maintaining state-of-the-art performance across diverse downstream tasks. We employed several evaluation metrics like F1 score, Precision, Recall, accuracy, and AUC, and are convinced that RNAret exhibits remarkable interpretability and robust capabilities, including in long-sequence modeling tasks.

Our research primarily concentrated on establishing the feasibility and assessing the performance of RNAret. However, we did not explore the hyperparameter configurations of RNAret, since the parameter space is too large and complex. In our code repository, we have provided training and evaluation scripts and welcome interested researchers to examine the influence of varying hyperparameters on the model’s efficacy. Also, our pretraining approach includes only the MLM task, without complex pretraining objectives based on type or structure. Moreover, RNAret is currently confined to RNA modeling and has not been expanded to the domains of DNA and protein sequences. We will further improve RNAret to make it a competitive tool for the language modeling—and potentially generation—of biomolecular sequences.

Methods

Bidirectional encoder representation from Retentive Network

The Retentive Network is an emerging language model architecture that introduces a retention mechanism in its retention layer. This layer shares structural similarities with the Transformer layer, comprising two main components: multi-scale retention (replacing multi-head self-attention) and a feedforward neural network (FFN), along with layer normalization⁴⁰ and residual connections⁴¹. The structure of the ${l}^{th}$ retention layer can be described as follows:

$$Y_{l}=\mathrm{MultiScaleRetention}\left(\mathrm{LayerNorm}\left(X_{l}\right)\right)+X_{l}$$

(1)

$$X_{l+1}=\mathrm{FeedForwardNetwork}\left(\mathrm{LayerNorm}\left(Y_{l}\right)\right)+Y_{l}$$

(2)

Here, $X_{l}$ is the input to the layer, and $X_{l+1}$ is the output of the layer as well as the input to the next layer. The feedforward neural network is computed as $\mathrm{FeedForwardNetwork}(X)=\mathrm{GELU}(XW_{1})W_{2}$.
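The layer structure in Eqs. (1) and (2) can be sketched in NumPy as follows. This is only an illustration under stated simplifications: LayerNorm is written without learnable scale and shift, GELU uses the tanh approximation, and the retention sub-layer is passed in as a generic callable named `mixer` (a hypothetical name), so the sketch is not the torchscale implementation.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm over the feature axis (no learnable affine; a simplification)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x: np.ndarray) -> np.ndarray:
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def feed_forward(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """FeedForwardNetwork(X) = GELU(X W1) W2."""
    return gelu(x @ w1) @ w2


def retention_layer(x, mixer, w1, w2):
    """One pre-norm residual layer: Eq. (1) followed by Eq. (2)."""
    y = mixer(layer_norm(x)) + x                     # Eq. (1)
    return feed_forward(layer_norm(y), w1, w2) + y   # Eq. (2)
```

With an input of shape $L\times 384$, `w1` of shape $384\times 512$ and `w2` of shape $512\times 384$, the output keeps the shape $L\times 384$, so layers can be stacked.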

In the multi-scale retention mechanism, each retention head operates in parallel during both training and single-step inference. The calculation formulas are as follows:

$$Q=(XW_{Q})\odot \Theta,\quad K=(XW_{K})\odot \overline{\Theta},\quad V=XW_{V}$$

(3)

$$\Theta_{n}=e^{in\theta},\quad D_{nm}=\begin{cases}\gamma^{n-m}, & n\ge m\\ 0, & n<m\end{cases}$$

(4)

$$\mathrm{Retention}(X)=\left(QK^{\top}\odot D\right)V$$

(5)

Here, $Q$ and $K$ represent the feature projections derived from the input sequence $X$, while $D_{nm}$ denotes the causal masking decay matrix. The retention mechanism substantially lowers the computational cost in language models, with O(n) time and space complexity in parallel, which enables the effective modeling of long sequences. The decay factor $\gamma$ and rotation matrix $\Theta_{n}$ are integrated into the xPos rotational position encoding⁴², allowing RNAret to extrapolate to longer sequences.

A multi-scale retention layer contains multiple retention heads, each assigned a different $\gamma$. The outputs from these heads are concatenated and subsequently projected to form the output of the multi-scale retention layer.

$$\gamma=1-2^{-5-\mathrm{arange}(0,h)}\in \mathbb{R}^{h},\quad \mathrm{head}_{i}=\mathrm{Retention}(X,\gamma_{i})$$

(6)

$$Y=\mathrm{GroupNorm}_{h}(\mathrm{Concat}(\mathrm{head}_{1},\ldots ,\mathrm{head}_{h}))$$

(7)

$$\mathrm{MultiScaleRetention}(X)=\left(\mathrm{swish}\left(XW_{G}\right)\odot Y\right)W_{O}$$

(8)

However, the original implementation of the Retentive Network is designed as a decoder-only autoregressive model, where each token can only see preceding tokens, resulting in unidirectional information retention. While this architecture is well-suited for generation tasks, it has limitations in encoder representation tasks that demand access to global context. Inspired by the idea of RMT⁴³, we adapt the decay matrix to

$$D_{nm}=\gamma^{|n-m|}$$

(9)

This symmetric decay matrix makes the retention mechanism bidirectional, yielding a bidirectional Retentive Network for sequence encoder representation.
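The difference between the causal decay matrix of Eq. (4) and the bidirectional one of Eq. (9) can be made concrete with a small NumPy sketch. It is a simplified illustration, not the torchscale code: the rotation $\Theta$ is omitted so that $Q$ and $K$ are real-valued, a single head is shown, and the function names are hypothetical.

```python
import numpy as np


def head_decays(h: int) -> np.ndarray:
    """Per-head decay factors, Eq. (6): gamma = 1 - 2^(-5 - arange(0, h))."""
    return 1.0 - 2.0 ** (-5.0 - np.arange(h))


def decay_matrix(n: int, gamma: float, bidirectional: bool = True) -> np.ndarray:
    """Decay matrix D for a sequence of n tokens."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]  # diff[n, m] = n - m
    if bidirectional:
        return gamma ** np.abs(diff)    # Eq. (9): gamma^|n-m|
    # Eq. (4): gamma^(n-m) for n >= m, zero above the diagonal
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)


def retention(q: np.ndarray, k: np.ndarray, v: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Single-head retention, Eq. (5), with the rotation Theta left out."""
    return ((q @ k.T) * d) @ v
```

For the 4 retention heads used in RNAret, `head_decays(4)` gives 0.96875, 0.984375, 0.9921875 and 0.99609375, so each head weights distant tokens on a different scale. In the bidirectional case the matrix is symmetric, which lets every token draw on both upstream and downstream context.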

RNAret model architecture

RNAret is built on the torchscale implementation and modified as described above. RNAret consists of 8 retentive layers with a feature embedding dimension of 384, an FFN embedding dimension of 512, and a value embedding dimension of 512. Other hyperparameters include: 4 retention heads, GeLU gated activation⁴⁴, 0.2 dropout rate for both the FFN and GeLU activation, and $1\times {10}^{-6}$ LayerNorm epsilon value. With roughly 12 million trainable parameters, RNAret is designed as a lightweight language model.

RNA tokenization

We implemented a K-mer tokenization strategy to segment RNA sequences into continuous, overlapping tokens, following a similar approach to DNABERT³⁹. Unlike simplified splitting methods, this approach captures local contextual dependencies between adjacent nucleotides. Specifically, for a given RNA sequence of length $L$, a sliding window of size $k$ moves across the sequence with a stride of one nucleotide. This process generates a sequence of $L-k+1$ K-mers from the vocabulary, which encompasses all $4^{k}$ possible nucleotide strings of length $k$.

To ensure that the final number of tokens corresponds exactly to the original sequence length $L$, which is crucial for base-level prediction tasks, especially for RNA secondary structure prediction, we applied a specific padding strategy using a special filler token [FIL]. We add $\frac{k-1}{2}$ [FIL] tokens at both the beginning and the end of each sequence. Consequently, the input sequence is transformed into a token sequence of length $L$.

For instance, considering the RNA sequence ‘AUGGCU’ with length $L=6$: For $k=1$, it is tokenized directly as {A, U, G, G, C, U}, equivalent to base-level tokenization. For $k=3$, the tokenization yields {[FIL], AUG, UGG, GGC, GCU, [FIL]}. For $k=5$, the sequence is tokenized as {[FIL], [FIL], AUGGC, UGGCU, [FIL], [FIL]}. During the experimental phase, we trained and evaluated the model using different tokenization methods with $k\in \{1,3,5\}$.
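The tokenization and padding rules above can be reproduced with a short Python sketch. The function name `kmer_tokenize` is hypothetical, and two behaviours are assumptions rather than details given in the text: windows containing a non-standard base are mapped to [UNK] as a whole, and sequences shorter than $k$ are rejected.

```python
FIL, UNK = "[FIL]", "[UNK]"


def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Overlapping K-mer tokens (stride 1) padded with [FIL] to length L."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    seq = seq.upper()
    if len(seq) < k:
        # Assumption: the source does not say how such sequences are handled.
        raise ValueError("sequence is shorter than k")
    pad = (k - 1) // 2
    kmers = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: any non-standard base turns the whole K-mer into [UNK].
        kmers.append(window if set(window) <= set("ACGU") else UNK)
    return [FIL] * pad + kmers + [FIL] * pad


print(kmer_tokenize("AUGGCU", 3))
# ['[FIL]', 'AUG', 'UGG', 'GGC', 'GCU', '[FIL]']
```

Because $(L-k+1)+2\cdot \frac{k-1}{2}=L$, the output always has one token per nucleotide, which is what base-level tasks rely on.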

In addition to ${4}^{k}$ K-mer tokens and [FIL] token, our vocabulary also includes several special tokens: [PAD] for aligning batch sequences to a uniform length; [UNK] for representing K-mers containing non-standard bases; [SEQ] as a separator between concatenated sequences; and [MASK], which is exclusively used during the pretraining phase to denote masked positions. The [CLS] token is introduced during fine-tuning to aggregate global sequence representations for classification tasks.

Pretraining strategies

We utilized the RNA sequences from RNAcentral release 21.0¹⁷ as our pretraining dataset, which comprises approximately 29.8 million non-coding RNA (ncRNA) sequences from diverse families. Though the dataset excludes some RNA types such as mRNA, downstream tasks demonstrate that RNAret can generalize to RNA types not present in the pretraining dataset.

During the pretraining phase, we followed the Masked Language Model (MLM) self-supervised task, which is well-suited for encoder-only architectures. We randomly mask regions of continuous $k$ tokens, with 15% of the tokens being masked in total (Fig. 2a). For masked tokens, 80% are replaced with [MASK] tokens, 10% are substituted with random K-mers from the vocabulary, and the remaining 10% are left unchanged⁴⁵. The model takes the partially masked sequence as input and reconstructs the original sequences through an output projection. The optimization objective is to minimize the cross-entropy loss between the real tokens and the predicted probabilities:

$$\mathrm{Loss}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C}y_{ij}\ln \left(\hat{y}_{ij}\right)$$

(10)
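The masking procedure and the loss in Eq. (10) can be sketched in plain Python. This is a minimal illustration under assumptions that the text does not settle: spans of $k$ consecutive positions are drawn uniformly at random until 15% of the positions are covered, the 80/10/10 replacement rule is applied per token, and the names `mask_tokens` and `masked_cross_entropy` are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, k, mask_rate=0.15, rng=None):
    """Return (inputs, labels); labels are None at unmasked positions."""
    rng = rng or random.Random()
    n = len(tokens)
    inputs, labels = list(tokens), [None] * n
    if n == 0:
        return inputs, labels
    target = max(1, round(mask_rate * n))
    selected = set()
    while len(selected) < target:
        start = rng.randrange(n)
        # Select a span of up to k consecutive positions.
        for pos in range(start, min(start + k, n)):
            if len(selected) >= target:
                break
            selected.add(pos)
    for pos in sorted(selected):
        labels[pos] = tokens[pos]
        r = rng.random()
        if r < 0.8:
            inputs[pos] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[pos] = rng.choice(vocab)  # 10%: random K-mer
        # remaining 10%: token left unchanged
    return inputs, labels


def masked_cross_entropy(true_ids, probs):
    """Eq. (10) with one-hot targets: mean of -ln(p_true) over N positions."""
    return -sum(math.log(p[t]) for t, p in zip(true_ids, probs)) / len(true_ids)
```

With one-hot targets $y_{ij}$, the inner sum over the $C$ vocabulary entries in Eq. (10) reduces to the log-probability of the true token, which is what `masked_cross_entropy` computes.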

Each RNAret model was trained on a single A800 80GB GPU. The training configuration included a batch size of 100, a maximum sequence length of 2000, and automatic mixed precision. Training spanned approximately 2 epochs, or 600,000 steps, taking roughly 9 days to complete. By decreasing the batch size or sequence length, the training process can be replicated on consumer-grade GPUs. We employed the AdamW optimizer⁴⁶, and the initial learning rate was set at $1\times {10}^{-4}$. We employed a cosine annealing schedule to adjust the learning rate, with a cycle length of 50,000 steps and a minimum learning rate of $1\times {10}^{-5}$.
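The learning-rate schedule described above can be written as a small function. The sketch assumes that the schedule restarts from the initial learning rate at the start of every 50,000-step cycle; the text gives the cycle length and the two learning-rate bounds but does not say whether restarts are used, so this is one possible reading, and `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(step: int, base_lr: float = 1e-4, min_lr: float = 1e-5,
              cycle: int = 50_000) -> float:
    """Cosine annealing between base_lr and min_lr.

    Assumption: the schedule restarts at the beginning of each cycle.
    """
    t = (step % cycle) / cycle  # position within the current cycle, in [0, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

Under this reading, the learning rate starts each cycle at $1\times 10^{-4}$, passes through $5.5\times 10^{-5}$ at the midpoint, and approaches $1\times 10^{-5}$ at the end of the cycle.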

Fine-tuning strategies for downstream tasks

After the pretraining phase, RNAret has developed a foundational understanding of the distribution of bases and K-mers within RNA sequences. In this study, we evaluated RNAret on three types of downstream tasks: interaction, structure, and classification. The features used in these tasks include the pooled features extracted from the [CLS] token, the global embeddings of RNA sequences, and the fusion of embedding features. During the fine-tuning phase, we did not freeze the parameters of RNAret.

We compared RNAret with 4 RNA language models based on Bidirectional Encoder Representations from Transformers (BERT). Their numbers of trainable parameters are as follows: RNABERT (480 thousand)²³, RNA-MSM (95.9 million)⁷, RNA-FM (99.5 million)⁶, and RNAErnie (105 million)⁸. For RNABERT, RNA-MSM, and RNA-FM, we trained the models using the implementation of BERT-like baselines provided by RNAErnie⁴⁷, with the language model parameters unfrozen. For RNAErnie, we used the official model weights provided.

RNA-RNA interaction prediction

RNA-RNA interaction prediction focuses on determining the likelihood of interaction between two RNA sequences. The two sequences are concatenated with a separator token [SEQ], and a classification token [CLS] is prepended at the beginning of the combined sequence. The embedding features of the [CLS] token are delivered to a dense classifier for the binary interaction label.

To interpret the predictions of 5-mer RNAret, we visualized the retention scores across the test set. The retention scores are calculated and normalized as follows:

$$R=QK^{\top}\odot D,\quad \widetilde{R}_{nm}=\frac{R_{nm}}{\max \left(\left|\sum_{i=1}^{n}R_{ni}\right|,1\right)}$$

(11)

We extracted the normalized retention scores from the last retention layer and averaged them across 4 heads. To identify the most influential sequence elements that contribute to the [CLS] token representation, we specifically focus on the [CLS] row of the retention scores. Here, higher retention scores indicate that the information from the corresponding position in the sequence is more likely to be retained in the pooled features extracted from the [CLS] token.
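A minimal sketch of the normalization in Eq. (11), written with plain Python lists so that each term is visible. The decay matrix `D` is taken as an input, and the row sum is assumed to run over every column of a row; both are assumptions about details not restated here, and the function name is hypothetical.

```python
def normalized_retention(Q, K, D):
    """Return the normalized retention scores for one head.

    Q, K: lists of n vectors (queries and keys); D: n x n decay matrix.
    Computes R = (Q K^T) * D element-wise, then divides each row by
    max(|row sum|, 1).
    """
    n = len(Q)
    R = [
        [sum(q * k for q, k in zip(Q[i], K[j])) * D[i][j] for j in range(n)]
        for i in range(n)
    ]
    R_tilde = []
    for row in R:
        denom = max(abs(sum(row)), 1.0)
        R_tilde.append([value / denom for value in row])
    return R_tilde
```

If the [CLS] token is at index 0, the row `normalized_retention(Q, K, D)[0]` corresponds to the [CLS] row discussed above; averaging such rows over heads gives one score per sequence position.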

RNA secondary structure prediction

The aim of RNA secondary structure prediction is to identify which base pairs within an RNA molecule engage in hydrogen bonding interactions, or in other words, to generate an $L\times L$ binary contact map $X$ that satisfies the following fundamental physical constraints⁴⁸:

(i) Base pairings are restricted to canonical pairs: G-C, A-U, and G-U;
(ii) Sharp loops are not permitted;
(iii) Duplicate pairings are not allowed: each row and column contains no more than one “1”;
(iv) The matrix must be symmetric, reflecting the bidirectional nature of base pairing interactions.

We apply outer concatenation to the RNAret embedding features to obtain a 2D feature map, where the pairwise feature between tokens $s_i$ and $s_j$ is represented as $[s_i\ \mathrm{Concat}\ s_j]$ (feature dimension: 768). The feature map is passed to 16 residual blocks⁴² to obtain a probability map $A$. Inspired by E2Efold⁴⁹, we employ a post-processing module to derive the contact map $X$ from $A$.

To make effective use of prior physical constraints, for each RNA sequence we construct a masking matrix $M$ based on rules (i) and (ii), defined as $M_{ij}=\begin{cases}1, & \text{if } (s_i,s_j) \text{ satisfies (i) and (ii)}\\ 0, & \text{otherwise}\end{cases}$. To satisfy rules (iii) and (iv), we set $A$ as a lower triangular matrix, discarding base pairs with scores below the threshold of 0.5, which yields a filtered matrix $A'$, where $A'_{ij}=\begin{cases}A_{ij}, & \text{if } i>j \text{ and } A_{ij}\ge 0.5\\ 0, & \text{otherwise}\end{cases}$. We then frame this as an assignment problem and employ the Hungarian algorithm⁵⁰ to solve for the lower triangular matrix $X'$ that maximizes the objective $\sum_{i=1}^{n}\sum_{j=1}^{n}A'_{ij}\cdot X'_{ij}$, subject to $\sum_{j=1}^{n}X'_{ij}\le 1\ \forall i$, $\sum_{i=1}^{n}X'_{ij}\le 1\ \forall j$, and $X'_{ij}\in \{0,1\}\ \forall i,j$. The contact map $X$ is then computed as:

$$X=\left(X'\odot M\right)+\left(X'\odot M\right)^{\top}$$

(12)

Any conflicting pairs introduced in this step are then eliminated by a simple greedy approach.
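The post-processing steps above can be sketched in a few lines. This is a simplified stand-in rather than the authors' module: the Hungarian algorithm is replaced here by a greedy highest-score-first selection (which is not guaranteed to maximize the objective), the minimum loop length `MIN_LOOP = 4` is an assumed definition of a "sharp loop", and all function names are hypothetical.

```python
CANONICAL = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}
MIN_LOOP = 4  # assumption: pairs with |i - j| < 4 are treated as sharp loops


def constraint_mask(seq):
    """Mask M: 1 where (s_i, s_j) is canonical (rule i) and not a sharp loop (rule ii)."""
    n = len(seq)
    return [
        [1 if (seq[i], seq[j]) in CANONICAL and abs(i - j) >= MIN_LOOP else 0
         for j in range(n)]
        for i in range(n)
    ]


def contact_map(seq, A, threshold=0.5):
    """Derive a symmetric binary contact map X from the probability map A."""
    n = len(seq)
    M = constraint_mask(seq)
    # A': keep lower-triangular entries (i > j) scoring at least the threshold.
    candidates = sorted(
        ((A[i][j], i, j) for i in range(n) for j in range(i) if A[i][j] >= threshold),
        reverse=True,
    )
    # X': at most one entry per row and per column.
    # Greedy stand-in for the Hungarian algorithm (an assumption of this sketch).
    used_rows, used_cols, selected = set(), set(), []
    for score, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            used_rows.add(i)
            used_cols.add(j)
            selected.append((i, j))
    # Apply M, symmetrize, and greedily drop pairs that reuse an already paired base.
    paired = set()
    X = [[0] * n for _ in range(n)]
    for i, j in selected:  # still in descending score order
        if M[i][j] and i not in paired and j not in paired:
            paired.update((i, j))
            X[i][j] = X[j][i] = 1
    return X
```

Swapping the greedy selection for an exact assignment solver would bring the sketch closer to the procedure described in the text; the masking, symmetrization, and conflict removal would stay the same.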

mRNA/lncRNA classification

Current methodologies mostly attempt to distinguish mRNA from lncRNA by capturing features of whole RNA sequences, such as K-mer statistics, the longest open reading frame, or pooled features⁵¹. By extracting these features, they determine whether a transcript is more similar to mRNA or lncRNA. In our work, we treated this task as a base-level analysis rather than a sequence-level classification. More specifically, our strategy aims to determine the probability that each base belongs to a CDS region. For an RNA sequence of length $L$, the model outputs a length-$L$ sequence of CDS probabilities $\{p_1,p_2,\ldots ,p_L\}$ through 16 residual blocks.

Compared to sequence-level classification methods, this approach is more fine-grained and extensible. For comparison with baseline models, we employed a sliding window strategy. A window of size $\mathrm{WS}=30$, which is approximately the minimum length of an effective CDS, is moved along the sequence of length $L$, and the average CDS probability $\bar{p}_n=\frac{1}{\mathrm{WS}}\sum_{k=n}^{n+\mathrm{WS}-1}p_k,\ 1\le n\le L-\mathrm{WS}+1$, is computed within each window. The maximum over all windows, $p=\max \left\{\bar{p}_1,\bar{p}_2,\ldots ,\bar{p}_{L-\mathrm{WS}+1}\right\}$, represents the coding potential of the sequence. Sequences with $p\ge 0.5$ are classified as mRNA, while those with $p < 0.5$ are classified as lncRNA.
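The sliding-window rule can be illustrated with a short sketch that takes per-base CDS probabilities and returns the coding potential and the resulting label. The function names are hypothetical, and the handling of sequences shorter than the window (averaging the whole sequence) is an assumption, since the text does not cover that case.

```python
def coding_potential(probs, window_size=30):
    """Maximum windowed mean of per-base CDS probabilities."""
    if not probs:
        raise ValueError("probs must contain at least one value")
    # Assumption: a sequence shorter than the window is treated as one window.
    ws = min(window_size, len(probs))
    window_sum = sum(probs[:ws])
    best = window_sum
    for k in range(ws, len(probs)):
        # Slide the window one base to the right.
        window_sum += probs[k] - probs[k - ws]
        best = max(best, window_sum)
    return best / ws


def classify_transcript(probs, window_size=30, cutoff=0.5):
    """Label a transcript as mRNA if its coding potential reaches the cutoff."""
    return "mRNA" if coding_potential(probs, window_size) >= cutoff else "lncRNA"
```

Because the score is a maximum over windows, a single confident 30-base stretch is enough to call a transcript coding, regardless of how long the surrounding untranslated regions are.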

We used the retention scores of the 5-mer RNAret described in the RNA-RNA Interaction Prediction section to identify the most significant promoters and motifs for CDS recognition. Specifically, for each sample in the test set, we first averaged the normalized retention scores across the 4 heads of the last retention layer and then calculated the column-wise mean. Next, we pinpointed the top 1% of high-scoring positions in each sample, and extracted the codon as well as 9-mer centered on the corresponding base. Finally, we derived comprehensive statistical insights from the lncRNA_H and lncRNA_M test sets.
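As a hedged illustration of the selection step, the sketch below takes a head-averaged score matrix, computes column-wise means, picks the top 1% of positions, and extracts the 9-mer centered on each. It assumes one score per base and skips positions too close to the sequence ends; the mapping from 5-mer tokens to bases and the codon extraction are not reproduced, and all names are hypothetical.

```python
import math


def column_means(matrix):
    """Column-wise mean of a rectangular score matrix."""
    rows = len(matrix)
    return [sum(row[j] for row in matrix) / rows for j in range(len(matrix[0]))]


def top_positions(scores, fraction=0.01):
    """Indices of the highest-scoring positions (at least one, by assumption)."""
    k = max(1, math.ceil(len(scores) * fraction))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def centered_kmer(seq, pos, k=9):
    """k-mer centered on pos, or None if it would run past either end."""
    half = k // 2
    start, end = pos - half, pos + half + 1
    if start < 0 or end > len(seq):
        return None  # assumption: incomplete windows are skipped
    return seq[start:end]
```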

Evaluation of computational efficiency

In this section, we evaluate the token processing capability of 5-mer RNAret on a single A800 GPU. Specifically, we measure the total sequence length ($\mbox{Batch Size}\times \mbox{Maximum Length}$) that the model can input and process per second while returning results.

In the pretraining phase, the model optimized one step per batch, and we recorded the time cost per step. Tokens per Second (TPS) was calculated as follows:

$$\mbox{TPS}=\frac{\mbox{Maximum Sequence Length}\times \mbox{Batch Size}}{\mbox{Time per Step}\times 60}$$

(13)

In the downstream fine-tuning phase, we measured the average time per epoch for both training and validation sets. TPS was calculated as follows:

$$\mbox{TPS}=\frac{\mbox{Maximum Sequence Length}\times \mbox{Sample Number}}{\mbox{Time per Epoch}\times 60}$$

(14)
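Equations (13) and (14) share one form, which the sketch below captures in a single helper. The factor of 60 in the denominator suggests that times are recorded in minutes; that reading, the function name, and the numbers in the usage note are assumptions for illustration, not measurements.

```python
def tokens_per_second(max_seq_len, count, minutes):
    """Tokens processed per second.

    max_seq_len: maximum sequence length
    count: batch size (Eq. 13) or number of samples in an epoch (Eq. 14)
    minutes: time per step (Eq. 13) or per epoch (Eq. 14), assumed in minutes
    """
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return max_seq_len * count / (minutes * 60)
```

With a made-up step time of 0.02 minutes, a batch of 100 sequences of length 2000 gives `tokens_per_second(2000, 100, 0.02)`, roughly 1.67 × 10⁵ tokens per second.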

Statistics and reproducibility

Data processing and statistical analyses were performed on the Python platform. Specifically, we utilized biopython (1.78)⁵² for processing biological data; fairscale (0.4.0), torch (2.4.0), and torchscale (0.3.0) for model construction and training; and scikit-learn (1.6.1)⁵³ for calculating statistical metrics (including F1 score, Accuracy, Precision, Recall, and AUC). Compatible versions of these packages are also permissible.

Sample sizes were determined based on the availability of high-quality sequences in public databases. For the ArchiveII test dataset, we excluded samples that overlapped with RNAStrAlign to prevent data leakage and ensure fair comparison. In the RNA secondary structure prediction task, sequences longer than 600 nucleotides were not considered.

To ensure reproducibility, we provide the complete code repository and pre-split datasets, facilitating the replication of the entire workflow. With the default hyperparameter settings provided in our scripts, the model training converges consistently.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All the datasets used for analyses in this work are publicly available online. The RNAcentral dataset is available at https://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/21.0/. Datasets for downstream fine-tuning tasks are available at https://bis.zju.edu.cn/rnaret/download/ and have been deposited in Zenodo (https://doi.org/10.5281/zenodo.18313475)⁵⁴. Source data for figures are provided in Supplementary Data 1-3.

Code availability

The RNAret source code, including scripts for pretraining, training, and inference, is available on GitHub (https://github.com/DrBlackZJU/RNAret/) and archived on Zenodo (https://doi.org/10.5281/zenodo.18271233)⁵⁵. The RNAret web server is accessible at https://bis.zju.edu.cn/rnaret/. Model weights are available at the project website (https://bis.zju.edu.cn/rnaret/download/) and have also been deposited on Zenodo (https://doi.org/10.5281/zenodo.18313475)⁵⁴.

References

Caprara, M. G. & Nilsen, T. W. RNA: versatility in form and function. Nat. Struct. Biol. 7, 831–833 (2000).
Holbrook, S. R. RNA structure: the long and the short of it. Curr. Opin. Struct. Biol. 15, 302–308 (2005).
Chen, Z., Ain, N. U., Zhao, Q. & Zhang, X. From tradition to innovation: conventional and deep learning frameworks in genome annotation. Brief. Bioinform. 25, bbae138 (2024).
Vaswani, A. et al. Attention is all you need. In Adv. Neural Inf. Process. Syst. 30, 5998–6008 (NIPS, 2017).
Min, B. et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput. Surv. 56, 30 (2023).
Chen, J. et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. Preprint at https://arxiv.org/abs/2204.00300 (2022).
Zhang, Y. et al. Multiple sequence alignment-based RNA language model and its application to structural inference. Nucleic Acids Res. 52, e3 (2024).
Wang, N. et al. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nat. Mach. Intell. 6, 548–557 (2024).
Shen, T. et al. Accurate RNA 3D structure prediction using a language model-based deep learning approach. Nat. Methods 21, 2287–2298 (2024).
Wang, X. et al. Uni-RNA: universal pre-trained models revolutionize RNA research. Preprint at https://www.biorxiv.org/content/10.1101/2023.07.11.548588v1 (2023).
Penić, R. J., Vlašić, T., Huber, R. G., Wan, Y. & Šikić, M. RiNALMo: general-purpose RNA language models can generalize well on structure prediction tasks. Nat. Commun. 16, 5671 (2025).
Dao, T., Fu, D., Ermon, S., Rudra, A. & Ré, C. FlashAttention: fast and memory-efficient exact attention with IO-Awareness. In Adv. Neural Inf. Process. Syst. 35, 16344–16359 (NIPS, 2022).
Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR, 2024).
Avsec, Ž et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat. Methods 18, 1196–1203 (2021).
Nguyen, E. et al. Sequence modeling and design from molecular to genome scale with Evo. Science 386, eado9336 (2024).
Sun, Y. et al. Retentive Network: a successor to Transformer for large language models. Preprint at https://arxiv.org/abs/2307.08621 (2023).
Sweeney, B. et al. RNAcentral 2021: secondary structure integration, improved sequence search and new member databases. Nucleic Acids Res. 49, D212–D220 (2021).
Kirk, J. M. et al. Functional classification of long non-coding RNAs by k-mer content. Nat. Genet. 50, 1474–1482 (2018).
van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Wen, M., Cong, P., Zhang, Z., Lu, H. & Li, T. DeepMirTar: a deep-learning approach for predicting human miRNA targets. Bioinformatics 34, 3781–3787 (2018).
Pla, A., Zhong, X. & Rayner, S. miRAW: A deep learning-based approach to predict microRNA targets by analyzing whole microRNA transcripts. PLoS Comput. Biol. 14, e1006185 (2018).
Gu, T., Zhao, X., Barbazuk, W. B. & Lee, J. miTAR: a hybrid deep learning-based approach for predicting miRNA targets. BMC Bioinform. 22, 96 (2021).
Akiyama, M. & Sakakibara, Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genom. Bioinform. 4, lqac12 (2022).
Bartel, D. P. MicroRNAs: target recognition and regulatory functions. Cell 136, 215–233 (2009).
Li, J., Liu, S., Zhou, H., Qu, L. & Yang, J. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 42, D92–D97 (2014).
Tan, Z., Fu, Y., Sharma, G. & Mathews, D. H. TurboFold II: RNA structural alignment and secondary structure prediction informed by multiple homologs. Nucleic Acids Res. 45, 11570–11581 (2017).
Sloma, M. F. & Mathews, D. H. Exact calculation of loop formation probability identifies folding motifs in RNA secondary structures. RNA 22, 1808–1818 (2016).
Danaee, P. et al. bpRNA: large-scale automated annotation and analysis of RNA secondary structure. Nucleic Acids Res. 46, 5381–5394 (2018).
Szikszai, M., Wise, M., Datta, A., Ward, M. & Mathews, D. H. Deep learning models for RNA secondary structure prediction (probably) do not generalize across families. Bioinformatics 38, 3892–3899 (2022).
Fu, L. et al. UFold: fast and accurate RNA secondary structure prediction with deep learning. Nucleic Acids Res. 50, e14 (2022).
Gruber, A. R., Lorenz, R., Bernhart, S. H., Neuböck, R. & Hofacker, I. L. The Vienna RNA websuite. Nucleic Acids Res. 36, W70–W74 (2008).
Kerpedjiev, P., Hammer, S. & Hofacker, I. L. Forna (force-directed RNA): Simple and effective online RNA secondary structure diagrams. Bioinformatics 31, 3377–3379 (2015).
Zuker, M. & Stiegler, P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9, 133–148 (1981).
Frankish, A. et al. GENCODE 2021. Nucleic Acids Res. 49, D916–D923 (2021).
Kang, Y. et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 45, W12–W16 (2017).
Wang, L. et al. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res. 41, e74 (2013).
Subramanian, K., Payne, B., Feyertag, F. & Alvarez-Ponce, D. The codon statistics database: a database of codon usage bias. Mol. Biol. Evol. 39, msac157 (2022).
Dominguez, D. et al. Sequence, structure, and context preferences of human RNA binding proteins. Mol. Cell. 70, 854–867 (2018).
Ji, Y., Zhou, Z., Liu, H. & Davuluri, R. V. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120 (2021).
Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. Preprint at https://arxiv.org/abs/1607.06450 (2016).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
Sun, Y. et al. A length-extrapolatable Transformer. In 61st Annual Meeting of the Association for Computational Linguistics 14590–14604 (ACL, 2023).
Fan, Q., Huang, H., Chen, M., Liu, H. & He, R. RMT: Retentive Networks meet Vision Transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR, 2024).
Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). Preprint at https://arxiv.org/abs/1606.08415 (2016).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT 2019 Vol. 1 (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR, 2019).
Ning, W. CatIIIIIIII/RNAErnie: v.1.0. Zenodo https://doi.org/10.5281/zenodo.10847621 (2024).
Nowakowski, J. & Tinoco, I. RNA structure and stability. Semin. Virol. 8, 153–165 (1997).
Chen, X., Li, Y., Umarov, R., Gao, X. & Song, L. RNA secondary structure prediction by learning unrolled algorithms. In International Conference on Learning Representations (ICLR, 2020).
Kuhn, H. The Hungarian Method for the assignment problem. Nav. Res. Logist. 52, 7–21 (2005).
Ventola, G. M. M. et al. Identification of long non-coding transcripts with feature selection: a comparative study. BMC Bioinform. 18, 187 (2017).
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Shen, Y. RNAret - Datasets and Model Weights [Data set]. Zenodo https://doi.org/10.5281/zenodo.18313475 (2026).
Shen, Y. DrBlackZJU/RNAret: Retentive Network promotes efficient RNA language modeling of long sequences (v1.0). Zenodo https://doi.org/10.5281/zenodo.18271233 (2026).

Acknowledgements

This work was partially supported by the National Key Research and Development Program of China [2023YFE0112300]; National Science Foundation of China [32261133526; 32570787]; the Science and Technology Innovation Leading Scientist [2022R52035], the 151 talent project of Zhejiang Province (first level); and Collaborative Innovation Center for Modern Crop Production co-sponsored by the province and the ministry. The authors are grateful to the members of Ming Chen’s laboratory for helpful discussions and valuable comments, and to Jianghong Wu for assistance with computational resources.

Author information

Authors and Affiliations

Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
Yi Shen, Yueming Hu, Shilong Zhang & Ming Chen
State Key Laboratory of Pharmaceutical Biotechnology, School of Life Sciences, Nanjing University, Nanjing, China
Guangshuo Cao & Dijun Chen
College of Animal Science and Technology, Inner Mongolia Minzu University, Tongliao, China
Jianghong Wu
State Key Laboratory of Vegetation Structure, Function and Construction, College of Life Sciences, Zhejiang University, Hangzhou, China
Ming Chen

Contributions

M.C. and D.C. supervised and designed the study. Y.S. designed the study, implemented the model, and performed data analysis with support from J.W. and Y.H. Y.S. wrote the manuscript with input from G.C. and Y.H. S.Z. helped to build up the web server. All authors reviewed and approved the submitted manuscript.

Corresponding authors

Correspondence to Dijun Chen or Ming Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editors: Professor Maria Anisimova and Dr. Nilanjan Banerjee, Dr. Aylin Bircan, Dr. Kaliya Georgieva. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Transparent Peer Review file (PDF)

Supplementary Information (PDF)

Description of Additional Supplementary files (PDF)

Supplementary Data 1 (TXT)

Supplementary Data 2 (TXT)

Supplementary Data 3 (XLSX)

Reporting Summary (PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Shen, Y., Cao, G., Hu, Y. et al. Retentive Network promotes efficient RNA language modeling of long sequences. Commun Biol 9, 575 (2026). https://doi.org/10.1038/s42003-026-09757-x

Received: 18 April 2025
Accepted: 16 February 2026
Published: 11 March 2026
Version of record: 27 April 2026
DOI: https://doi.org/10.1038/s42003-026-09757-x
