Article
Open access
Published: 09 March 2026

Predicting circRNA subcellular localization by fusing circRNA sequence and network information

Scientific Reports volume 16, Article number: 12775 (2026)

Abstract

CircRNAs have attracted increasing attention in recent years because they play important roles in many biological processes, which makes determining their functions essential. The subcellular localizations of circRNAs are considered to be related to their functions, so it is necessary to determine them. Traditional biochemical experiments for this purpose are expensive and time-consuming, and computational models offer an alternative. In this study, a new computational model, namely CircLoc, was designed to predict the subcellular localizations of circRNAs. The model employs circRNA sequences and networks, from which circRNA features were extracted through traditional methods (e.g., k-mer), a large language model (RNAErnie), and network representation learning algorithms (e.g., node2vec and graph attention auto-encoder). All features were processed by a self-attention layer and fed into a fully connected layer to make predictions. The model was evaluated by ten-fold cross-validation, yielding average AUC and AUPR of 0.7856 and 0.4055, respectively. This performance was better than that of models using traditional multi-label classification algorithms and of miRNA subcellular localization prediction models. The rationality of the CircLoc design was further examined using ablation tests. CircLoc was effective in predicting circRNA subcellular localizations and can be a potentially useful tool in circRNA research.

Introduction

Non-coding RNAs (ncRNAs) are now recognized as functionally important molecules, rather than mere “junk sequences”. They have been confirmed to play regulatory roles in neural processes¹. Many studies have demonstrated that ncRNAs are crucial participants in a wide range of biological processes, including cell differentiation, genomic imprinting², and gene expression regulation³. Circular RNAs (circRNAs) are an important type of ncRNA and were first discovered in the 1970s⁴. Early on, circRNAs were regarded as low-abundance byproducts of RNA splicing^5,6 and received little attention from investigators. This situation has changed with the advancement of biomedical research and the development of high-throughput RNA deep sequencing technologies. Many studies have since reported the biological significance of circRNAs^7,8.

CircRNAs can act as sponges for RNA-binding proteins (RBPs) or miRNAs, thereby suppressing the biological effects of miRNA-targeted molecules^9,10,11. Furthermore, aberrant expression of circRNAs has been closely associated with various pathological processes^12,13, and their stability has helped establish them as important biomarkers and targets for various diseases^14,15,16,17. CircRNAs have thus become an important field in RNA research. As with messenger RNA (mRNA), a widely investigated RNA type that carries the information for translating the genetic message from genomic DNA into proteins, it is essential to determine the functions of circRNAs. The subcellular localizations of circRNAs are known to be important clues for inferring their functions, as is the case for mRNAs and proteins. At present, traditional biochemical experiments are commonly used to determine the subcellular localizations of circRNAs. However, these methods are expensive and time-consuming. Designing efficient computational models is a feasible alternative that can accelerate the determination of circRNA subcellular localizations at low cost. To date, few computational models have been designed for this purpose.

The prediction of subcellular localizations of proteins and mRNAs is a hot topic in bioinformatics, and many computational models have been proposed^{18,19,20,21,22,23,24,25,26}. Early methods extracted features from the essential properties of proteins and mRNAs and then employed a classification algorithm to construct the model. More recent models add a procedure to generate high-level features, such as a convolutional neural network (CNN) or graph neural network, thereby improving prediction performance. In recent years, some computational models for predicting subcellular localizations of ncRNAs have been proposed, especially for miRNAs^{27,28,29,30,31,32,33,34,35,36}. Although these models exhibit high performance, they cannot be directly used for predicting subcellular localizations of circRNAs because miRNAs and circRNAs differ in several respects. To our knowledge, models for predicting subcellular localizations of circRNAs are limited. Asim et al. proposed the first circRNA subcellular localization model, Circ-LocNet³⁷. It extracted features from RNA sequences and adopted five traditional classification algorithms for prediction. Zeng et al. designed cell line-specific models, CellCircLoc, to detect circRNA subcellular localizations³⁸. These models were constructed using deep learning technologies such as CNNs and transformer blocks. Both Circ-LocNet and CellCircLoc considered only circRNAs with exactly one subcellular localization. In fact, several circRNAs have multiple subcellular localizations, which reduces the practical value of these models. Thus, it is still necessary to design efficient computational models for predicting subcellular localizations of circRNAs, especially for circRNAs with multiple subcellular localizations.

Inspired by the successful applications of deep learning algorithms and large language models (LLMs) in tackling various problems^{39,40,41,42,43,44,45,46}, especially classification problems, it is expected that efficient computational models for predicting subcellular localizations of circRNAs can be constructed based on deep learning algorithms and LLMs.

In this study, we designed a deep learning- and LLM-based computational model, namely CircLoc, for the prediction of subcellular localizations of circRNAs. To obtain more complete representations of circRNAs, several feature types were extracted from circRNA sequences and networks. A deep learning algorithm, graph attention auto-encoder (GATE)⁴⁷, was employed to further generate high-level circRNA features. All features were processed by a self-attention layer and then fed into a fully connected layer to make predictions. Ten-fold cross-validation showed that the average AUC and AUPR were 0.7856 and 0.4055, respectively. This performance was better than that of models constructed with traditional multi-label classification algorithms and of miRNA subcellular localization prediction models that were slightly adapted for predicting circRNA subcellular localizations. Ablation tests were conducted to justify the design of CircLoc. Finally, the results on two test datasets indicated that CircLoc was able to identify latent localizations of new circRNAs.

Materials and methods

Benchmark dataset

A well-defined dataset is essential for constructing efficient classification models. In this study, we employed the human circRNAs and their subcellular localizations reported in RNALocate (version 3.0, http://www.rnalocate.org/)⁴⁸. The original dataset contains 131,208 human circRNAs involving 16 distinct subcellular localizations. Seven of these localizations have too few human circRNAs to construct reliable models and were therefore excluded. Of the remaining nine subcellular localizations, two (Extracellular exosome and Extracellular vesicle) contained too many circRNAs (more than 50,000), far more than the other localizations. Including them would have yielded an extremely imbalanced dataset, so they were also removed. Furthermore, we removed the circRNAs without sequence information. The final dataset, denoted as $S(3.0)$, contained 1486 human circRNAs covering seven subcellular localizations: Chromatin, Cytoplasm, Cytosol, Membrane, Nucleolus, Nucleus, and Nucleoplasm. The circRNA distribution across these seven localizations is shown in Table 1. For formulation, let us denote the circRNA sets for the seven localizations as $S_{(\text{Chromatin})}$, $S_{(\text{Cytoplasm})}$, $S_{(\text{Cytosol})}$, $S_{(\text{Membrane})}$, $S_{(\text{Nucleolus})}$, $S_{(\text{Nucleus})}$, and $S_{(\text{Nucleoplasm})}$. Then, the dataset $S(3.0)$ can be represented as in Eq. 1.

Table 1 Breakdown of circRNA datasets.

$$S(3.0)=S_{(\text{Chromatin})}\cup S_{(\text{Cytoplasm})}\cup S_{(\text{Cytosol})}\cup S_{(\text{Membrane})}\cup S_{(\text{Nucleolus})}\cup S_{(\text{Nucleus})}\cup S_{(\text{Nucleoplasm})}$$

(1)

It can be observed from Table 1 that the sum of the numbers of circRNAs in the seven localizations was larger than the number of distinct circRNAs, meaning that some circRNAs belong to more than one localization. To illustrate this phenomenon, an UpSet plot was drawn, as shown in Fig. 1(A). Several circRNAs belong to two or more localizations. Thus, assigning localizations to circRNAs is a multi-label classification problem in which localizations are treated as classes and circRNAs as samples.
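The multi-label formulation can be illustrated with a small sketch that turns per-circRNA localization sets into a binary indicator matrix with one column per localization. The function name `encode_labels` and the circRNA identifiers used in the usage note are hypothetical; only the seven localization names come from the text.

```python
# The seven localizations retained in S(3.0), in a fixed column order.
LOCALIZATIONS = [
    "Chromatin", "Cytoplasm", "Cytosol", "Membrane",
    "Nucleolus", "Nucleus", "Nucleoplasm",
]

def encode_labels(annotations):
    """Build a multi-label indicator matrix.

    annotations: dict mapping a circRNA id to a set of localization names.
    Returns (ids, Y) where Y[i][j] is 1 if circRNA ids[i] is annotated
    with LOCALIZATIONS[j], and 0 otherwise.
    """
    ids = sorted(annotations)
    Y = [[1 if loc in annotations[c] else 0 for loc in LOCALIZATIONS]
         for c in ids]
    return ids, Y
```

For example, a circRNA annotated with both Cytoplasm and Nucleus gets a row with two ones, which is exactly the case single-label models cannot represent.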

To further test the model, we also employed the human circRNAs and their subcellular localizations collected in RNALocate (version 2.0)⁴⁹. The same data cleaning process yielded 880 human circRNAs, which were labelled with the same seven subcellular localizations. The number of circRNAs in each localization is also listed in Table 1, and an UpSet plot, shown in Fig. 1(B), illustrates the intersections of circRNAs across the seven localizations. This dataset was denoted as $S(2.0)$. Both $S(2.0)$ and $S(3.0)$ served as training datasets to build and evaluate the model. Furthermore, based on the differences between $S(3.0)$ and $S(2.0)$, we constructed one test dataset, denoted as TD₁. This dataset consisted of the 606 human circRNAs newly added in $S(3.0)$, i.e., $TD_1=S(3.0)-S(2.0)$. The model built on $S(2.0)$ was applied to this test dataset to evaluate its generalization ability. The numbers of circRNAs in TD₁ for the seven subcellular localizations are listed in Table 1. Three localizations (Chromatin, Membrane, and Nucleoplasm) had no circRNAs, two localizations (Cytosol and Nucleolus) had few circRNAs (fewer than 10), and only two localizations (Cytoplasm and Nucleus) had a considerable number of circRNAs.

Because TD₁ was not sufficiently complete, we searched another comprehensive database, CSCD2 (http://geneyun.net/CSCD2/)⁵⁰, which contains a large number of circRNAs (~2.9 million). We extracted the circRNAs with subcellular localizations and compared them with the circRNAs in $S(3.0)$, resulting in 732 additional circRNAs. These circRNAs constituted another test dataset, denoted as TD₂. The numbers of circRNAs in this dataset for the seven subcellular localizations are listed in Table 1. Each localization was assigned to some circRNAs, suggesting that TD₂ is more complete than TD₁.

In this study, we constructed and evaluated the model mainly on dataset $S(3.0)$, including parameter optimization, the ablation study, and comparison with other models. The dataset pairs $S(2.0)$ with TD₁ and $S(3.0)$ with TD₂ were used to evaluate the generalization ability of the proposed model.

Construction of circRNA sequence features

The RNA sequences are always the first-hand material to investigate RNA-related problems. An RNA sequence with length L can be represented by

$$S={R}_{1},{R}_{2},{R}_{3},\cdots,{R}_{L}$$

(2)

where ${R}_{i}$ denotes the i-th residue in the sequence and L represents the length of the sequence. Some essential properties of RNAs are contained in this sequence. To date, several methods have been proposed to encode the properties of RNAs contained in the sequences into fixed-length vectors^51,52,53,54. In this study, we first downloaded the sequences of human circRNAs from circBase (http://www.circbase.org/)⁵⁵. Then, three methods or models were employed to extract circRNA sequence features, yielding three feature types.

Features yielded by k-mer

The k-mer method is widely used for encoding DNA, RNA, and protein sequences⁵⁶. The features yielded by this method partly reflect the main components of the sequence. Given a circRNA sequence of length L, as formulated in Eq. 2, k-mer subsequences are extracted with a sliding window, giving L − k + 1 subsequences. The frequencies of all possible k-mer subsequences are counted and constitute the k-mer features. There are 4^k possible k-mer subsequences because circRNA sequences consist of four bases (A, U, C, G). Here, we set k = 2, 3, 4, 5, thereby generating 1360 (4² + 4³ + 4⁴ + 4⁵) sequence features. For convenience, these features were called k-mer features.
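The k-mer encoding described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function name `kmer_features` is hypothetical, and dividing each count by the number of windows (L − k + 1) is an assumption, since the text does not say whether raw counts or normalized frequencies were used.

```python
from itertools import product

BASES = "AUCG"

def kmer_features(seq, ks=(2, 3, 4, 5)):
    """Concatenate k-mer frequency vectors for each k in ks."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features = []
    for k in ks:
        # All 4^k possible k-mers in a fixed, reproducible order.
        vocab = ["".join(p) for p in product(BASES, repeat=k)]
        counts = dict.fromkeys(vocab, 0)
        n_windows = max(len(seq) - k + 1, 0)
        for i in range(n_windows):
            sub = seq[i:i + k]
            if sub in counts:  # skip windows containing ambiguous bases
                counts[sub] += 1
        total = max(n_windows, 1)  # assumption: normalize by window count
        features.extend(counts[w] / total for w in vocab)
    return features
```

With the default `ks`, the returned vector has 16 + 64 + 256 + 1024 = 1360 entries, matching the feature count stated in the text.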

Features yielded by reverse complement k-mer

Reverse complement k-mer (k-RevcKmer) is a variant of k-mer^57,58 that takes the reverse complement of RNA sequences into account. A k-mer and its reverse complement are treated as the same subsequence, so only one of the two is kept. For example, there are 16 possible 2-mers, whereas 2-RevcKmer keeps only 10 subsequences. The numbers of distinct k-RevcKmer subsequences are reported in Liu et al.’s study⁵¹. In this study, we again set k to 2, 3, 4, and 5, obtaining 10, 32, 136, and 512 features, respectively. Accordingly, 690 (10 + 32 + 136 + 512) features were obtained. These features were called k-RevcKmer features.
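A short sketch can make the k-RevcKmer vocabulary sizes concrete. It assumes that a k-mer and its reverse complement are merged under one canonical representative (here, the lexicographically smaller string) and that counts are normalized by the number of windows. Both choices, and all function names, are illustrative assumptions.

```python
from itertools import product

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def reverse_complement(kmer):
    """Reverse the k-mer and complement every base."""
    return "".join(COMPLEMENT[b] for b in reversed(kmer))

def canonical(kmer):
    """Representative shared by a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

def revc_vocab(k):
    """Sorted list of distinct canonical k-mers."""
    return sorted({canonical("".join(p)) for p in product("ACGU", repeat=k)})

def revc_kmer_features(seq, ks=(2, 3, 4, 5)):
    """Concatenate reverse complement k-mer frequency vectors."""
    seq = seq.upper().replace("T", "U")
    features = []
    for k in ks:
        vocab = revc_vocab(k)
        counts = dict.fromkeys(vocab, 0)
        n_windows = max(len(seq) - k + 1, 0)
        for i in range(n_windows):
            sub = seq[i:i + k]
            if all(b in COMPLEMENT for b in sub):  # skip ambiguous bases
                counts[canonical(sub)] += 1
        total = max(n_windows, 1)  # assumption: normalize by window count
        features.extend(counts[w] / total for w in vocab)
    return features
```

Calling `revc_vocab(k)` for k = 2, 3, 4, 5 gives vocabularies of 10, 32, 136, and 512 entries, so the concatenated vector has the 690 dimensions reported above.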

Features yielded by RNAErnie model

In recent years, LLMs have been successfully applied to learn from massive amounts of unlabeled natural language data. They can mine large-scale data in depth and generally yield informative embeddings. Several LLMs have been proposed for RNA^59,60,61. Here, we employed a recently proposed pretrained RNA LLM, namely RNAErnie⁶⁰. It is built upon the Enhanced Representation through Knowledge Integration (ERNIE) framework and incorporates multi-layer and multi-head transformer blocks. The circRNA sequences were fed into this LLM to generate informative features. The feature dimension was set to 768. For convenience, these features were called RNAErnie features.

The detailed information of circRNA sequence features is listed in Table 2.

Table 2 Information on circRNA features.

Construction of circRNA network features

The circRNA sequence features mainly capture the essential properties of a single circRNA and cannot reflect all aspects of circRNAs. In recent years, networks have been regarded as an efficient way to organize research objects and an effective framework for investigating them^62,63,64. In this study, we extracted circRNA features from various networks containing circRNAs.

Network construction

Five networks were constructed in this section. Each network included circRNAs as nodes and some networks further contained other objects, such as drugs, diseases, miRNAs, and RBPs.

circRNA sequence similarity network

As mentioned above, RNA sequences are the first-hand material for investigating RNA-related problems. In this study, circRNA sequences were adopted to measure the similarity of circRNAs. The Smith-Waterman algorithm⁶⁵ was employed to calculate the similarity score of two circRNA sequences. For circRNAs c₁ and c₂, the similarity score yielded by the Smith-Waterman algorithm was denoted as $\text{sp}(c_1,c_2)$. Let the sequences of c₁ and c₂ be $a_1a_2\cdots a_n$ and $b_1b_2\cdots b_m$. The Smith-Waterman algorithm constructs a matrix $H\in\mathbb{R}^{(n+1)\times(m+1)}$ with $H(k,0)=H(0,l)=0$ ($0\le k\le n$ and $0\le l\le m$). The value $H(i,j)$ indicates the maximum similarity of two segments ending in $a_i$ and $b_j$. The matrix $H$ is filled using a dynamic programming recurrence defined as

$$H(i,j)=\max\left\{H(i-1,j-1)+s(a_i,b_j),\ \max_{k\ge 1}\left(H(i-k,j)-W_k\right),\ \max_{l\ge 1}\left(H(i,j-l)-W_l\right),\ 0\right\}$$

(3)

where $s(a_i,b_j)$ is the match or mismatch score of two bases and $W_k$ is the penalty for a gap of length k. After the matrix $H$ is produced, $\text{sp}(c_1,c_2)$ is defined as the maximum value in $H$. This similarity score was normalized using the following equation:

$$S(c_1,c_2)=\frac{\text{sp}(c_1,c_2)}{\sqrt{\text{sp}(c_1,c_1)\cdot\text{sp}(c_2,c_2)}}$$

(4)

where $S(c_1,c_2)$ represents the final sequence similarity score between c₁ and c₂, which lies between 0 and 1. After calculating the sequence similarity score of every pair of circRNAs, the circRNA sequence similarity network was built. The 1486 circRNAs were defined as nodes in this network, and two circRNAs were connected if and only if their sequence similarity score was larger than zero, yielding 1,103,355 edges. This equals the number of all possible circRNA pairs (1486 × 1485 / 2), so every pair of circRNAs was connected. Furthermore, the score was assigned to the corresponding edge as its weight. The resulting network was denoted as ${N}_{s}$.
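Equations 3 and 4 can be sketched as follows. The sketch assumes a linear gap penalty, $W_k = k \cdot g$, under which the inner maxima over gap lengths in Eq. 3 reduce to single-step comparisons. The scoring values (match 2, mismatch −1, gap 1) and the function names are illustrative assumptions, because the source does not report the scoring scheme.

```python
def smith_waterman(a, b, match=2.0, mismatch=-1.0, gap=1.0):
    """Best local alignment score sp(a, b), i.e. the maximum entry of H.

    Assumes a linear gap penalty W_k = k * gap, so that
    max_k (H(i-k, j) - W_k) equals H(i-1, j) - gap.
    Only two rows of H are kept in memory.
    """
    n, m = len(a), len(b)
    prev = [0.0] * (m + 1)  # row i-1 of H; row 0 is all zeros
    best = 0.0
    for i in range(1, n + 1):
        cur = [0.0] * (m + 1)  # H(i, 0) = 0
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,  # align a_i with b_j
                         prev[j] - gap,    # gap in b
                         cur[j - 1] - gap, # gap in a
                         0.0)              # start a new local alignment
            best = max(best, cur[j])
        prev = cur
    return best

def normalized_similarity(c1, c2, **scoring):
    """Eq. 4: sp(c1, c2) / sqrt(sp(c1, c1) * sp(c2, c2))."""
    denom = (smith_waterman(c1, c1, **scoring)
             * smith_waterman(c2, c2, **scoring)) ** 0.5
    return smith_waterman(c1, c2, **scoring) / denom if denom > 0 else 0.0
```

A sequence compared with itself scores 1, and two sequences sharing no base score 0. For full-length circRNAs a quadratic pure-Python loop is slow, so an optimized alignment library would be preferable in practice.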

circRNA-disease association network

In recent years, an increasing number of studies have demonstrated that circRNAs play crucial roles in the initiation and progression of various diseases^66,67,68,69, and many models have been built to predict circRNA-disease associations^{70,71,72,73,74,75}. Thus, circRNA-disease associations can partly describe the functions of circRNAs and provide additional material for encoding circRNAs. In this study, we retrieved circRNA-disease associations from the circR2Disease database (version 2.0, http://bioinfo.snnu.edu.cn/CircR2Disease_v2.0)⁷⁶. After restricting them to the 1486 circRNAs, we obtained 1006 circRNA-disease associations (not all circRNAs have associated diseases), covering 177 diseases. Then, the circRNA-disease association network was constructed, with the 1486 circRNAs and 177 diseases as nodes. The edges connecting two circRNAs were the same as those in ${N}_{s}$, and the edges connecting circRNAs and diseases were determined by the above-mentioned circRNA-disease associations. This network was denoted as ${N}_{ci-di}$.

circRNA-drug association network

Similar to diseases, recent studies have confirmed that circRNAs also play critical roles in various drug response mechanisms. For instance, circ-PVT1 has been shown to promote paclitaxel resistance in gastric cancer cells⁷⁷, and high expression of circCELSR1 significantly reduces the sensitivity of ovarian cancer cells to paclitaxel⁷⁸. These findings suggest that circRNA-drug associations not only provide insights into the potential biological functions of circRNAs but may also serve as valuable features for predicting circRNA subcellular localizations. In view of this, we downloaded the circRNA-drug sensitivity associations from Deng et al.’s study⁷⁹. Restricting these associations to the 1486 circRNAs yielded 850 circRNA-drug sensitivity associations covering 217 drugs. Then, the circRNA-drug association network was built, with the 1486 circRNAs and 217 drugs as nodes. The edges in ${N}_{s}$ were also included in this network, and the obtained circRNA-drug sensitivity associations formed the edges connecting circRNAs and drugs. This network was denoted by ${N}_{ci-dr}$.

circRNA-miRNA association network

MicroRNAs (miRNAs) are a well-studied class of ncRNAs whose subcellular localization patterns have been systematically characterized. In recent years, regulatory relationships between circRNAs and miRNAs have also been increasingly elucidated. CircRNAs can act as “miRNA sponges” by adsorbing miRNAs, thereby attenuating their regulatory effects on target genes and blocking miRNA-mediated repression⁸⁰. This mechanism highlights the close relationship between circRNAs and miRNAs, so miRNA information may help predict the subcellular localizations of circRNAs. In view of this, we sourced the circRNA-miRNA associations from the circBank database (www.circbank.cn)⁸¹. These associations were also restricted to the 1486 circRNAs, resulting in 6893 circRNA-miRNA associations covering 1781 miRNAs. Then, a circRNA-miRNA association network was constructed, with the 1486 circRNAs and 1781 miRNAs as nodes. This network also included the edges in ${N}_{s}$ to represent the relations between circRNAs. The circRNA-miRNA associations obtained above defined the edges connecting circRNAs and miRNAs. We denoted this network as ${N}_{ci-mi}$.

circRNA-protein association network

RBPs participate in many biological processes and can be useful indicators of the properties of RNAs. This study employed the RBPs of circRNAs collected in CircInteractome (https://circinteractome.irp.nia.nih.gov/)⁸², which were used as circRNA-protein associations. Likewise, these associations were restricted to the 1486 circRNAs, yielding 2827 circRNA-protein associations covering 37 RBPs. Accordingly, a circRNA-protein association network was constructed, with the 1486 circRNAs and 37 RBPs as nodes. The restricted circRNA-protein associations determined the edges connecting circRNA and RBP nodes, and the edges in ${N}_{s}$ were also included in this network. Let us denote this network as ${N}_{ci-RBP}$.

This section constructed five circRNA networks. Their detailed information is listed in Table 3.

Table 3 Information of five networks.

Features yielded by node2vec from networks

The networks constructed above contain abundant association information between circRNAs or between circRNAs and other objects. It is essential to convert this information into numeric vectors, and network embedding algorithms provide effective ways to do so. Here, the widely used network embedding algorithm node2vec⁸³ was adopted. This algorithm samples many paths in a given network and then applies word2vec with SkipGram to yield node embeddings. The path sampling procedure is the core of this algorithm. Suppose a path starts at node ${u}_{0}=y$ and has been extended to the i-th node ${u}_{i-1}=v$. The next node of this path is selected according to the probability defined below

$$P\left(u_i=x\mid u_{i-1}=v\right)=\begin{cases}\dfrac{\pi_{vx}}{Z} & \text{if } v \text{ and } x \text{ are adjacent}\\ 0 & \text{otherwise}\end{cases}$$

(5)

where ${\pi}_{vx}$ is the unnormalized transition probability between v and x, and Z is the normalization constant, defined as the sum of the unnormalized transition probabilities from v to all of its neighbors. ${\pi}_{vx}$ is defined as ${\pi}_{vx}={\alpha}_{pq}(t,x)\cdot{w}_{vx}$, where ${w}_{vx}$ is the weight of edge $(v,x)$ and ${\alpha}_{pq}(t,x)$ is computed by

$$\alpha_{pq}(t,x)=\begin{cases}\dfrac{1}{p} & \text{if } d_{tx}=0\\ 1 & \text{if } d_{tx}=1\\ \dfrac{1}{q} & \text{if } d_{tx}=2\end{cases}$$

(6)

where t is the node prior to v, ${d}_{tx}$ represents the distance from t to x, and p and q control the probabilities of returning to t and moving away from t, respectively. They are generally called the return and in-out parameters. The path stops extending once its length reaches the predefined length parameter (l). After the predefined number (m) of paths starting from each node have been sampled, paths are treated as sentences and nodes as words, which together constitute the corpus. Word2vec with SkipGram is then applied to this corpus to generate the node embeddings.
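The biased sampling step in Eqs. 5 and 6 can be sketched as below. The graph is represented as a dictionary of weighted adjacency dictionaries, which is an assumption made for illustration, as are the function names. The first step of a walk has no previous node, so the sketch falls back to edge-weight sampling there. Training word2vec on the sampled walks is not shown.

```python
import random

def transition_probs(graph, t, v, p=1.0, q=1.0):
    """Normalized probabilities of moving from v to each neighbor x,
    given that the walk arrived at v from t (Eqs. 5 and 6).

    graph: dict mapping node -> dict of neighbor -> edge weight.
    """
    unnorm = {}
    for x, w in graph[v].items():
        if x == t:
            alpha = 1.0 / p   # d_tx = 0: return to the previous node
        elif x in graph[t]:
            alpha = 1.0       # d_tx = 1: x is also a neighbor of t
        else:
            alpha = 1.0 / q   # d_tx = 2: move away from t
        unnorm[x] = alpha * w  # pi_vx = alpha_pq(t, x) * w_vx
    z = sum(unnorm.values())   # normalization constant Z
    return {x: pi / z for x, pi in unnorm.items()}

def sample_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one walk with at most `length` nodes starting at `start`."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        v = walk[-1]
        nbrs = list(graph[v])
        if not nbrs:
            break
        if len(walk) == 1:
            weights = [graph[v][x] for x in nbrs]  # no previous node yet
        else:
            probs = transition_probs(graph, walk[-2], v, p, q)
            weights = [probs[x] for x in nbrs]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

A small p makes the walk more likely to step back to t, while a small q pushes it toward nodes farther from t, which is how the two parameters trade off local and exploratory sampling.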

In this study, we applied node2vec to the five networks described above. From the circRNA sequence similarity network, we obtained the circRNA similarity features. From the other four association networks, the circRNA disease, drug, miRNA, and RBP network features were obtained. The dimension of each of these feature types was set to 128. The information on these circRNA features is listed in Table 2.

Features improved by GATE

Features yielded by node2vec inevitably contain some noise and should therefore be refined. GATE⁴⁷ is a graph learning algorithm. Unlike network embedding algorithms, it can fuse the raw node embeddings with the topology of the given network to yield more informative embeddings. GATE consists of an encoder and a decoder. The encoder updates the raw embedding of each node by considering the embeddings of its neighbors. The decoder tries to recover the raw embeddings of the nodes as accurately as possible.

Generally, several layers are contained in the encoder procedure. Let ${x}_{i}={h}_{i}^{\left(0\right)}$ be the raw embeddings of the i-th node and ${h}_{i}^{\left(k\right)}$ be the embeddings after the k-th layer. In the k-th layer, GATE calculates the relevance between the i-th node and its neighbor (j-th node for example) by

$$e_{ij}^{(k)}=\text{Sigmoid}\left({v_{s}^{(k)}}^{T}\sigma\left(W^{(k)}h_{i}^{(k-1)}\right)+{v_{r}^{(k)}}^{T}\sigma\left(W^{(k)}h_{j}^{(k-1)}\right)\right)$$

(7)

where ${W}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{\left(k\right)}\times{d}^{(k-1)}}$, ${v}_{s}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{\left(k\right)}}$, and ${v}_{r}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{\left(k\right)}}$ are the trainable parameters. $\sigma$ is the activation function. Then, the softmax is adopted to normalize the relevance, formulated by

$$\alpha_{ij}^{(k)}=\frac{\exp\left(e_{ij}^{(k)}\right)}{\sum_{l\in\mathcal{N}_{i}}\exp\left(e_{il}^{(k)}\right)}$$

(8)

where ${\alpha}_{ij}^{\left(k\right)}$ represents the attention coefficient of the j-th node relative to i-th node in the k-th encoder layer, and ${\mathcal{N}}_{i}$ denotes the closed neighborhood of the i-th node. Then, the new embedding of the i-th node is updated by

$${h}_{i}^{\left(k\right)}=\sum_{j\in{\mathcal{N}}_{i}}{\alpha}_{ij}^{\left(k\right)}\sigma\left({W}^{\left(k\right)}{h}_{j}^{(k-1)}\right)$$

(9)

If there are L layers in the encoder procedure, the output of the L-th layer is deemed as the learnt embeddings through GATE, which is denoted as ${h}_{i}={h}_{i}^{\left(L\right)}$ for the i-th node.
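One encoder layer (Eqs. 7 to 9) can be written out with plain Python lists to make the data flow explicit. This is a forward-pass sketch only. The activation $\sigma$ is not specified in the text, so `math.tanh` is used as a placeholder assumption, and the function and argument names are hypothetical.

```python
import math

def matvec(W, h):
    """Matrix-vector product W h, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_encoder_layer(H, neighbors, W, v_s, v_r, act=math.tanh):
    """One GATE encoder layer.

    H: list of input embeddings h_i^{(k-1)}.
    neighbors[i]: closed neighborhood of node i (should include i itself).
    W, v_s, v_r: layer parameters W^{(k)}, v_s^{(k)}, v_r^{(k)}.
    act: activation sigma (assumed; not stated in the text).
    """
    # sigma(W h_j) is shared by Eq. 7 and Eq. 9, so compute it once.
    Z = [[act(z) for z in matvec(W, h)] for h in H]
    out = []
    for i, nbrs in enumerate(neighbors):
        # Eq. 7: relevance between node i and each neighbor j.
        e = [sigmoid(dot(v_s, Z[i]) + dot(v_r, Z[j])) for j in nbrs]
        # Eq. 8: softmax over the neighborhood.
        exp_e = [math.exp(x) for x in e]
        total = sum(exp_e)
        alpha = [x / total for x in exp_e]
        # Eq. 9: attention-weighted sum of transformed neighbor embeddings.
        out.append([sum(a * Z[j][d] for a, j in zip(alpha, nbrs))
                    for d in range(len(W))])
    return out
```

Stacking L such layers, each with its own parameters, gives the learnt embeddings $h_i = h_i^{(L)}$. In a real model the parameters would be trained by gradient descent in a deep learning framework.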

The decoder contains the same number of layers as the encoder. It evaluates the quality of ${h}_{i}$ by recovering ${x}_{i}$ using a similar structure. ${h}_{i}$ is the input of the decoder and is denoted by ${\widehat{h}}_{i}^{\left(L\right)}$. Likewise, the attention coefficient of the j-th node relative to the i-th node in the k-th layer is computed by

$$\widehat{e}_{ij}^{(k)}=\text{Sigmoid}\left({\widehat{v}_{s}^{(k)}}^{T}\sigma\left(\widehat{W}^{(k)}\widehat{h}_{i}^{(k)}\right)+{\widehat{v}_{r}^{(k)}}^{T}\sigma\left(\widehat{W}^{(k)}\widehat{h}_{j}^{(k)}\right)\right)$$

(10)

$$\widehat{\alpha}_{ij}^{(k)}=\frac{\exp\left(\widehat{e}_{ij}^{(k)}\right)}{\sum_{l\in\mathcal{N}_{i}}\exp\left(\widehat{e}_{il}^{(k)}\right)}$$

(11)

where ${\widehat{W}}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{(k-1)}\times{d}^{\left(k\right)}}$, ${\widehat{v}}_{s}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{(k-1)}}$, and ${\widehat{v}}_{r}^{\left(k\right)}\in{\mathbb{R}}^{{d}^{(k-1)}}$ are the trainable parameters in the k-th decoder layer, $\sigma$ and Sigmoid are the same as those in Eq. 7, and ${\mathcal{N}}_{i}$ is the same as that in Eq. 8. Accordingly, the embedding of the i-th node after the k-th decoder layer is updated by

$${\widehat{h}}_{i}^{(k-1)}=\sum_{j\in{\mathcal{N}}_{i}}{\widehat{\alpha}}_{ij}^{\left(k\right)}\sigma\left({\widehat{W}}^{\left(k\right)}{\widehat{h}}_{j}^{\left(k\right)}\right)$$

(12)

The output of the last layer in decoder procedure is collected as the reconstructed node embeddings. This embedding of the i-th node is represented by ${\widehat{x}}_{i}$.

As a self-supervised algorithm, GATE adopts the following loss function to assess the quality of the output of encoder procedure

$$\text{Loss}=\sum_{i=1}^{N}\left\|x_{i}-\widehat{x}_{i}\right\|_{2}-\lambda\sum_{j\in\mathcal{N}_{i}}\log\left(\frac{1}{1+\exp\left(-h_{i}^{T}h_{j}\right)}\right)$$

(13)

where N is the number of nodes and $\lambda$ is a parameter that balances the two parts of the loss.
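The loss in Eq. 13 can be sketched as follows, under the reading that the neighborhood sum is nested inside the sum over nodes, so each node contributes a reconstruction term and a structure term. That reading is an interpretation of the formula as printed, and the function name `gate_loss` is hypothetical.

```python
import math

def gate_loss(X, X_hat, H, neighbors, lam=1.0):
    """GATE loss: reconstruction error minus lam times a structure term.

    X, X_hat: raw and reconstructed node embeddings.
    H: encoder outputs h_i.
    neighbors[i]: indices j in the neighborhood of node i.
    """
    loss = 0.0
    for i, nbrs in enumerate(neighbors):
        # Reconstruction part: Euclidean norm of x_i - x_hat_i.
        loss += math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X_hat[i])))
        # Structure part: neighboring nodes should have similar embeddings.
        for j in nbrs:
            score = sum(a * b for a, b in zip(H[i], H[j]))
            loss -= lam * math.log(1.0 / (1.0 + math.exp(-score)))
    return loss
```

Because the logarithm of a sigmoid is negative, subtracting it adds a positive penalty that shrinks as the inner product between neighboring embeddings grows.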

In this study, GATE was used to improve the circRNA disease, drug, miRNA, and RBP network features. In this procedure, the circRNA-circRNA edges taken from the sequence similarity network were first filtered with a binarization threshold, denoted by T, so that edges representing weak circRNA-circRNA associations were discarded. The filtered network was then fed into GATE. In this way, we obtained the high-level circRNA disease, drug, miRNA, and RBP features.

Self-attention layer

Several features were extracted for each circRNA, as listed in Table 2. They were concatenated to constitute a representation of each circRNA. Then, a self-attention layer was adopted to learn the weights between features so as to better represent the internal structure of the combined features. This operation can help the model capture complex dependencies between features, thereby enhancing the model’s performance.

The self-attention layer employs three weight matrices ${W}_{Q}$, ${W}_{K}$, and ${W}_{V}$. They can conduct linear transformations on the input features $X$, as follows:

$$\left\{\begin{array}{c}Q=X\cdot{W}_{Q}\\K=X\cdot{W}_{K}\\V=X\cdot{W}_{V}\end{array}\right.$$

(14)

The outputs of Eq. 14 are used to calculate the attention score matrix $A$

$$A=\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}$$

(15)

where ${d}_{k}$ stands for the dimension of $Q$ or $K$. A softmax operation is applied to $A$ to produce the attention weight matrix $M$, i.e., $M=\text{softmax}\left(A\right)$. Finally, the output $Y$ of the self-attention layer is obtained from $V$ and $M$

$$Y=MV$$

(16)

In this study, the representations of circRNAs were refined by a self-attention layer, yielding better representations of circRNAs.
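Eqs. 14 to 16 can be traced step by step with the short sketch below. It uses plain Python lists instead of the TensorFlow implementation reported by the authors, applies the softmax row-wise (the usual convention, assumed here), and uses identity weight matrices purely as a toy input. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(a):
    """Row-wise softmax with the usual max-subtraction for stability."""
    out = []
    for row in a:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)       # Eq. 14
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]  # Eq. 15
    m = softmax_rows(scores)
    return matmul(m, v), m                                          # Eq. 16


identity = [[1.0, 0.0], [0.0, 1.0]]
y, m = self_attention(identity, identity, identity, identity)
print(m)
print(y)
```

Each row of `m` sums to one, so every output row of `y` is a weighted average of the rows of `V`.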

Prediction and optimization

The circRNA representations processed by the self-attention layer were fed into a fully connected layer for making predictions. The hidden layers adopted ReLU as the activation function and sigmoid was employed as the activation function of the output layer. To evaluate the predicted results, a loss function was necessary. In this study, the loss function was set as the binary cross-entropy, which is defined as

$$L=-\frac{1}{N\times{C}}\sum_{i=1}^{N}\sum_{j=1}^{C}\left[{y}_{ij}\log\left({\widehat{y}}_{ij}\right)+\left(1-{y}_{ij}\right)\log\left(1-{\widehat{y}}_{ij}\right)\right]$$

(17)

where ${y}_{ij}$ and ${\widehat{y}}_{ij}$ represent the observed label and predicted probability for the i-th sample on the j-th label, respectively, and N and C denote the number of samples and labels, respectively. The trainable parameters of the model were optimized with the Adam optimizer⁸⁴.
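As a small worked instance of Eq. 17, the sketch below computes the binary cross-entropy for a toy multi-label batch. The function name and the clipping constant `eps` (added to avoid taking the logarithm of zero) are assumptions of this sketch, and the labels and probabilities are made up.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Eq. 17: mean binary cross-entropy over N samples and C labels."""
    n, c = len(y_true), len(y_true[0])
    total = 0.0
    for true_row, prob_row in zip(y_true, y_prob):
        for y, p in zip(true_row, prob_row):
            p = min(max(p, eps), 1.0 - eps)  # clipping is an assumption, not in Eq. 17
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / (n * c)


y_true = [[1, 0], [0, 1]]
uninformative = binary_cross_entropy(y_true, [[0.5, 0.5], [0.5, 0.5]])
confident = binary_cross_entropy(y_true, [[0.9, 0.1], [0.1, 0.9]])
print(uninformative, confident)
```

Predicting 0.5 everywhere gives a loss of log 2, while confident and correct probabilities push the loss towards zero.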

Evaluation metrics

Cross-validation is widely used to assess the performance of classification models^{85,86,87,88,89}. In this method, training samples are randomly and equally divided into multiple parts. Each part is singled out as the test set in turn, whereas the other parts constitute the training set. The average performance over all parts is generally used to evaluate the overall performance of the model. In this study, we adopted ten-fold cross-validation to measure the model’s performance.
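The splitting scheme described above can be sketched in a few lines. The generator name `k_fold_indices`, the fixed seed, and the sample count of 25 are illustrative assumptions. The authors' own splitting code may differ.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k parts whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so averaging the per-fold scores uses each sample once for testing.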

In classification, receiver operating characteristic (ROC) and precision-recall (PR) curves are commonly used to display a model’s performance^{90,91,92,93,94}. Unlike some classic metrics (e.g., accuracy), they show the performance of a model under various thresholds and thus give a more complete evaluation. Furthermore, the areas under these two curves, denoted as AUC and AUPR, are the key quantitative metrics. They lie between 0 and 1, and higher values indicate better performance. In this study, we adopted ROC and PR curves to display the model’s performance on each localization and computed the corresponding AUC and AUPR values. The average AUC and AUPR were further computed to show the overall performance of models.
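For one localization treated as a binary problem, the two areas can be estimated as in the sketch below. The AUC is computed through its pairwise-ranking interpretation, and the AUPR is approximated by average precision. The text does not state which estimator was used, so this choice, like the function names and the toy data, is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the PR curve (average precision)."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Samples sharing a score form a single threshold.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            tp += pairs[j][1]
            fp += 1 - pairs[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc_value = roc_auc(labels, scores)
aupr_value = average_precision(labels, scores)
print(auc_value, aupr_value)
```

Averaging such per-localization values over all localizations gives the average AUC and AUPR used as overall metrics.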

In addition to the average AUC and AUPR for evaluating the overall performance of models, we further employed four metrics: hamming loss, ranking loss, macro F1, and micro F1. These metrics give a more complete picture of the model’s performance.
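The four multi-label metrics can be illustrated on a toy example with two samples and three labels. The definitions below follow the common conventions. Setting an undefined F1 to zero, counting tied scores as half an error in the ranking loss, and skipping samples that have no relevant or no irrelevant labels are assumptions of this sketch, and may differ from the library routines used by the authors.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def _counts(y_true, y_pred, label):
    tp = sum(1 for tr, pr in zip(y_true, y_pred) if tr[label] == 1 and pr[label] == 1)
    fp = sum(1 for tr, pr in zip(y_true, y_pred) if tr[label] == 0 and pr[label] == 1)
    fn = sum(1 for tr, pr in zip(y_true, y_pred) if tr[label] == 1 and pr[label] == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # undefined F1 set to 0 (assumption)


def micro_macro_f1(y_true, y_pred):
    counts = [_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    micro = _f1(*(sum(c[k] for c in counts) for k in range(3)))  # pooled counts
    macro = sum(_f1(*c) for c in counts) / len(counts)           # mean of per-label F1
    return micro, macro


def ranking_loss(y_true, y_score):
    """Mean fraction of (relevant, irrelevant) label pairs ordered wrongly."""
    losses = []
    for true_row, score_row in zip(y_true, y_score):
        rel = [s for t, s in zip(true_row, score_row) if t == 1]
        irr = [s for t, s in zip(true_row, score_row) if t == 0]
        if not rel or not irr:
            continue  # undefined for this sample, skipped here (assumption)
        bad = sum(1.0 if r < i else 0.5 if r == i else 0.0 for r in rel for i in irr)
        losses.append(bad / (len(rel) * len(irr)))
    return sum(losses) / len(losses) if losses else 0.0


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
y_score = [[0.9, 0.5, 0.4], [0.1, 0.8, 0.7]]
h_loss = hamming_loss(y_true, y_pred)
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
r_loss = ranking_loss(y_true, y_score)
print(h_loss, micro_f1, macro_f1, r_loss)
```

Lower values are better for the two losses and higher values are better for the two F1 scores, which is why they are reported side by side.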

Outline of CircLoc

In this study, a new computational model was designed for predicting subcellular localization of circRNAs. The entire procedure is illustrated in Fig. 2. The model first extracted circRNA features from two aspects: circRNA sequences and networks. The k-mer, k-RevcKmer, and RNAErnie methods were applied to the sequences to extract sequence features. Node2vec was adopted to extract circRNA features from the networks and GATE was used to improve these features. All features were concatenated to represent circRNAs. After the features were processed by a self-attention layer, a fully connected layer was employed to make predictions. For convenience, we call this model CircLoc. The model was implemented with TensorFlow 2.11.0 and Scikit-learn⁹⁵.

Results and discussion

Parameter setting of CircLoc

Generally, tuning the main hyperparameters is necessary to build efficient computational models. For CircLoc, the parameters in node2vec, GATE, self-attention layer, and fully connected layer may play key roles in determining its efficiency. They were tuned based on the cross-validation results on dataset S(3.0).

In node2vec, several parameters were important. The hyperparameters determining the path length (l) and number (m) were tuned in a small range due to our limited computing resources. The candidate values of l were 80 and 90, whereas those of m were 150 and 160. The cross-validation results suggested that the model provided the best performance when l and m were set to 80 and 150, respectively. The return and in-out parameters (p and q, respectively) were set to their default values, which were both 1.
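To show how l, m, p, and q enter node2vec, the following sketch generates biased random walks on a small unweighted toy graph. It reads m as the number of walks started from each node, which is an interpretation of the text. The function names and the graph are hypothetical, and the subsequent skip-gram training that turns walks into embeddings is omitted.

```python
import random


def node2vec_walk(adj, start, length, p, q, rng):
    """One second-order random walk on an unweighted graph."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


def generate_walks(adj, walk_length, num_walks, p=1.0, q=1.0, seed=0):
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(adj, node, walk_length, p, q, rng))
    return walks


# Toy graph; the study reports l = 80 and m = 150, small values are used here.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = generate_walks(adj, walk_length=5, num_walks=3)
print(walks[0])
```

With p = q = 1, all three weights are equal, so the walk reduces to a uniform random walk over neighbors.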

In GATE, the numbers of encoder and decoder layers were both set to two, as suggested in the original research on GATE⁴⁷ and some studies^{79,96}. The numbers of neurons in these layers were selected from 32, 64, 128, 256, and 512. The cross-validation results showed that the model yielded the best performance when the first and second encoder layers contained 256 and 128 neurons, respectively. The learning rate was set to $10^{-2}$ and the parameter $\lambda$ (Eq. 13) for computing the loss was set to 1. Finally, the binarization threshold (T) for obtaining a reliable circRNA sequence similarity network was set to 0.7.

In the self-attention layer, the sizes of the three matrices ${W}_{Q}$, ${W}_{K}$, and ${W}_{V}$ were all set to 324×324. As for the fully connected layer, we first tried 2, 3, and 4 hidden layers and found that four hidden layers yielded better performance. The numbers of neurons in the four hidden layers were selected from 32, 64, 128, 256, 512, and 1024. The test results showed that the model provided the best performance when the numbers of neurons in the first, second, third, and fourth layers were set to 512, 512, 256, and 128, respectively. The fully connected layer also contained an output layer with seven neurons representing the probabilities of the seven subcellular localizations.
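The shape of the prediction head can be sketched as a forward pass through four ReLU hidden layers and a sigmoid output layer. The hidden sizes and the seven outputs come from the text. The input dimension of 16, the random untrained weights, the helper names, and the 0.5 cut-off are assumptions made only to keep the example small and runnable.

```python
import math
import random

HIDDEN_SIZES = [512, 512, 256, 128]  # reported in the text
N_LOCALIZATIONS = 7                  # reported in the text


def init_layers(n_in, hidden_sizes, n_out, seed=0):
    """Random, untrained weights; for illustrating shapes only."""
    rng = random.Random(seed)
    sizes = [n_in] + list(hidden_sizes) + [n_out]
    layers = []
    for a, b in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.05, 0.05) for _ in range(b)] for _ in range(a)]
        layers.append((weights, [0.0] * b))
    return layers


def dense(x, weights, bias):
    # weights has shape (inputs, outputs); zip(*weights) iterates over output columns.
    return [sum(xi * w for xi, w in zip(x, col)) + b for col, b in zip(zip(*weights), bias)]


def forward(x, layers):
    for weights, bias in layers[:-1]:
        x = [max(0.0, v) for v in dense(x, weights, bias)]                # ReLU
    weights, bias = layers[-1]
    return [1.0 / (1.0 + math.exp(-v)) for v in dense(x, weights, bias)]  # sigmoid


layers = init_layers(n_in=16, hidden_sizes=HIDDEN_SIZES, n_out=N_LOCALIZATIONS)
probs = forward([0.1 * i for i in range(16)], layers)
predicted = [int(p >= 0.5) for p in probs]  # 0.5 cut-off is an assumption
print(probs, predicted)
```

Because each output neuron has its own sigmoid, the seven probabilities are independent and need not sum to one, which suits a multi-label problem.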

The hyperparameters with multiple candidate values above were optimized by grid search. The detailed hyperparameter setting of CircLoc is listed in Table 4. The number of epochs was set to 100, and an early stopping strategy was adopted to control overfitting.
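A generic grid search loop looks like the sketch below. The function `grid_search` and the toy objective are hypothetical stand-ins for the cross-validation score used in the study. Only the node2vec candidate values (80 and 90 for l, 150 and 160 for m) are taken from the text, and they are used here solely to count the settings to be evaluated.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and keep the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. a cross-validation metric
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Made-up objective, not a result from the study."""
    return -((params["x"] - 2) ** 2) - ((params["y"] - 20) ** 2) / 100.0


best_params, best_score = grid_search({"x": [1, 2, 3], "y": [10, 20]}, toy_score)

# Number of node2vec settings implied by the candidate values in the text.
node2vec_grid = {"walk_length": [80, 90], "num_walks": [150, 160]}
n_node2vec_settings = len(list(product(*node2vec_grid.values())))
print(best_params, best_score, n_node2vec_settings)
```

The cost grows multiplicatively with the number of candidate values, which is consistent with tuning only a small range under limited computing resources.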

Table 4 Hyperparameter setting of CircLoc.


Performance of CircLoc

CircLoc adopted the hyperparameter setting listed in Table 4 and was evaluated on S(3.0) by ten-fold cross-validation. The ROC and PR curves are displayed in Fig. 3(A) and 3(B), respectively. The AUC values for the seven subcellular localizations were 0.8103, 0.8188, 0.7161, 0.7791, 0.8326, 0.7294, and 0.8127, giving an average AUC of 0.7856. The seven AUPR values were 0.2182, 0.7656, 0.4856, 0.2980, 0.2534, 0.6538, and 0.1637, giving an average AUPR of 0.4055. Clearly, the AUPR values were smaller than the AUC values. When the ROC and PR curve analysis was conducted on a subcellular localization, the circRNAs in this localization were deemed positive samples, whereas the rest were treated as negative samples. In this case, the negative samples far outnumbered the positive samples, resulting in an imbalanced dataset. As the PR curve is more sensitive to class imbalance than the ROC curve, it was expected that the AUPR values were smaller than the AUC values. Besides average AUC and AUPR, we also employed four metrics to show the overall performance of CircLoc, which are listed in Table 5. The micro F1, macro F1, hamming loss, and ranking loss were 0.5779, 0.4494, 0.1720, and 0.1698, respectively.
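The reported averages can be reproduced directly from the seven per-localization values quoted above. The snippet is only an arithmetic check, and the variable names are arbitrary.

```python
from statistics import mean

# Per-localization values quoted in the text.
auc = [0.8103, 0.8188, 0.7161, 0.7791, 0.8326, 0.7294, 0.8127]
aupr = [0.2182, 0.7656, 0.4856, 0.2980, 0.2534, 0.6538, 0.1637]

avg_auc = round(mean(auc), 4)
avg_aupr = round(mean(aupr), 4)
print(avg_auc, avg_aupr)
```

Both averages are unweighted means, so a localization with few circRNAs counts as much as one with many.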

Table 5 Overall performance of CircLoc on two training datasets under ten-fold cross-validation and one test dataset.


Ablation tests on features

Eight feature types were extracted for representing circRNAs, as displayed in Table 2. As some feature types were similar, we clustered the eight feature types into four groups. The first group contained the k-mer and k-RevcKmer features, denoted as α feature. RNAErnie features constituted the second group, denoted as β feature. The third group contained the circRNA similarity features, denoted as γ feature. The last group included the circRNA disease, drug, miRNA, and RBP features, denoted as η feature. A model was built for each combination of the above feature groups and evaluated on S(3.0) by ten-fold cross-validation. The average AUC, average AUPR, micro F1, macro F1, hamming loss, and ranking loss for each model are listed in Table 6. All fourteen models using only part of the feature groups yielded lower average AUC, average AUPR, micro F1, and macro F1 and higher ranking loss than CircLoc, which used all feature groups. The hamming loss of CircLoc was higher than that of only one model, namely the model using γ and η features. Evidently, CircLoc was better than the models using part of the feature groups. The detailed performance of the models using different feature groups on the seven subcellular localizations is shown in Table S1. CircLoc provided the highest AUC and AUPR values on most localizations. Because every model using only part of the four feature groups performed worse, we can conclude that all feature groups contributed positively to CircLoc. To further confirm this, we averaged the AUC and AUPR values of the models using the same number of feature groups on the seven localizations, as illustrated in Fig. 4. The models using more feature groups generally yielded higher AUC and AUPR, suggesting that using more feature groups helped to improve the model’s performance and again indicating that each feature group contributed positively to CircLoc.
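The count of fourteen ablation models follows from enumerating every non-empty combination of the four feature groups except the full set. The short sketch below lists them and groups them by size. The spelled-out group names are placeholders for α, β, γ, and η.

```python
from itertools import combinations

GROUPS = ["alpha", "beta", "gamma", "eta"]

# All non-empty combinations that leave out at least one group.
partial_models = [
    combo
    for size in range(1, len(GROUPS))
    for combo in combinations(GROUPS, size)
]

by_size = {}
for combo in partial_models:
    by_size.setdefault(len(combo), []).append(combo)

print(len(partial_models))
print({size: len(combos) for size, combos in by_size.items()})
```

Grouping the models by the number of feature groups they use corresponds to the per-size averaging described for Fig. 4.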

Table 6 Overall performance of the models using different feature groups.


Although all four feature groups played essential roles in CircLoc, their importance was not identical. Among the models using a single feature group, those using α or β features yielded similar performance, which was evidently lower than that of the models using γ or η features. Thus, we can conclude that γ and η features were more important than α and β features. The γ and η features were extracted from five networks containing the relations among circRNAs and other objects (drugs, diseases, miRNAs, and RBPs), showing informative associations of circRNAs. In protein subcellular localization prediction, it is widely accepted that interacting proteins tend to share similar functions (subcellular localizations are highly related to protein functions)^{97,98,99}. This may also hold for circRNA subcellular localization prediction. Accordingly, network information, which captures the relationships between circRNAs, was essential for subcellular localization prediction. The α and β features were derived from circRNA sequences and describe isolated properties of circRNAs. As the relations between sequence information and subcellular localizations are not clear at present, they played minor roles in determining circRNA subcellular localizations. Furthermore, γ features were more important than η features, as the model using γ feature was slightly superior to the model using η feature. As the γ feature contained direct relationships between circRNAs, it can provide more useful clues for correct prediction of subcellular localizations. Therefore, the importance of the four feature groups for the prediction of circRNA subcellular localization, from highest to lowest, was γ > η >> α ≈ β.

In addition, we tested models using a single feature type to uncover the importance of each feature type. These models were also evaluated on S(3.0) by ten-fold cross-validation. Their overall metrics are listed in Table 7. The circRNA similarity feature yielded the best AUC, AUPR, and macro F1, whereas the circRNA miRNA feature provided the best micro F1, hamming loss, and ranking loss. As the AUC and AUPR values yielded by the circRNA similarity feature had evident advantages over those of the circRNA miRNA feature, the circRNA similarity feature was considered more essential than the circRNA miRNA feature. Among the remaining six feature types, the circRNA drug, disease, and RBP features yielded evidently higher performance than the k-mer, k-RevcKmer, and RNAErnie features. These results were consistent with the above-mentioned importance of the feature groups, as γ represented the circRNA similarity feature, η contained the circRNA disease, drug, miRNA, and RBP features, α consisted of the k-mer and k-RevcKmer features, and β indicated the RNAErnie features.

Table 7 Overall performance of the models using a single feature type.


Ablation tests on modules

There were several modules in CircLoc, such as GATE and the self-attention layer. Here, some ablation tests were conducted to assess their importance. GATE was used to refine the features extracted from four networks. A model was constructed by removing this module, so that the features extracted from the four networks via node2vec were directly fed into the following procedures. This model was called CircLoc-GATE. On the other hand, the self-attention layer was applied to all features to yield more informative features. Another model was built by removing this module, so that all features were directly fed into the fully connected layer. This model was termed CircLoc-SAL. The two models were also assessed on S(3.0) by ten-fold cross-validation. Their overall performance is listed in Table 8 and their performance on the seven localizations is shown in Fig. 5. From Table 8, the average AUC and AUPR of CircLoc-GATE were 0.6577 and 0.2747, respectively, and its micro F1, macro F1, hamming loss, and ranking loss were 0.3796, 0.2535, 0.2832, and 0.2661, respectively. The same six metrics for CircLoc-SAL were 0.7349, 0.3569, 0.5622, 0.4453, 0.1965, and 0.1697. Compared with the metrics of CircLoc (Table 5), CircLoc-GATE and CircLoc-SAL were inferior to CircLoc on nearly all metrics (the ranking loss of CircLoc-SAL, 0.1697, was on par with the 0.1698 of CircLoc). As for the AUC and AUPR values on the seven localizations, CircLoc yielded the highest AUC and AUPR on most localizations, as shown in Fig. 5. Thus, we can conclude that CircLoc outperformed CircLoc-GATE and CircLoc-SAL, suggesting that GATE and the self-attention layer played key roles in CircLoc for determining circRNA subcellular localizations. Furthermore, the importance of GATE and the self-attention layer was not the same. As the performance of CircLoc-GATE was lower than that of CircLoc-SAL, the removal of GATE induced a larger decrease in performance than the removal of the self-attention layer. Thus, GATE was more important than the self-attention layer for CircLoc.
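The size of each effect can be read off by subtracting the CircLoc metrics from those of the two variants. The snippet below does this with the values quoted in the text (CircLoc values as reported for S(3.0) in Table 5). The dictionary layout and metric keys are illustrative choices.

```python
METRICS = ["avg_auc", "avg_aupr", "micro_f1", "macro_f1", "hamming_loss", "ranking_loss"]

# Values quoted in the text for ten-fold cross-validation on S(3.0).
circloc = dict(zip(METRICS, [0.7856, 0.4055, 0.5779, 0.4494, 0.1720, 0.1698]))
variants = {
    "CircLoc-GATE": dict(zip(METRICS, [0.6577, 0.2747, 0.3796, 0.2535, 0.2832, 0.2661])),
    "CircLoc-SAL": dict(zip(METRICS, [0.7349, 0.3569, 0.5622, 0.4453, 0.1965, 0.1697])),
}

# Variant minus CircLoc: negative is worse for AUC, AUPR and F1,
# positive is worse for the two losses.
delta = {
    name: {m: round(values[m] - circloc[m], 4) for m in METRICS}
    for name, values in variants.items()
}

for name, changes in delta.items():
    print(name, changes)
```

By these numbers, removing GATE lowers the average AUC by about 0.128 and removing the self-attention layer by about 0.051, while the ranking loss of CircLoc-SAL differs from that of CircLoc by only 0.0001.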

Table 8 Performance of the variants of CircLoc.


Based on the above arguments, GATE provided key contributions to CircLoc. To uncover its specific contributions, we analyzed the attention coefficients of GATE in the two encoder layers. Three circRNAs (hsa_circ_0061259, hsa_circ_0077855, and hsa_circ_0013126) were selected. Their subcellular localizations are provided in Table S2. The circRNAs with the top five attention coefficients to the above three circRNAs under four different feature types and two encoder layers were extracted. Table S2 also lists these circRNAs and their subcellular localizations. It can be found that the circRNAs with high attention coefficients had at least one subcellular localization in common with the selected circRNA. For example, hsa_circ_0001235 was assigned the top attention coefficient to hsa_circ_0061259 under the circRNA miRNA feature in the first encoder layer. Its subcellular localization (Cytoplasm) is one of the localizations of hsa_circ_0061259 (Nucleolus, Membrane, Nucleus, Cytosol, and Cytoplasm). According to the principle of GATE, the circRNAs with high attention coefficients have a strong influence on the representation of a given circRNA. The representation of hsa_circ_0061259 thus contained the features of hsa_circ_0001235, which was helpful for predicting the localization Cytoplasm. This partly uncovered the essential contributions of GATE in CircLoc.

Comparisons with other methods

To date, there are limited computational models for the prediction of circRNA subcellular localizations. Existing models can only deal with circRNAs with exactly one subcellular localization^{37,38}. Thus, we employed three multi-label miRNA subcellular localization prediction models for comparison with CircLoc: MiRLoc³², MirLocPredictor²⁸, and PMLocMSCAM³⁶. Furthermore, we employed two classic multi-label classification algorithms, Binary Relevance (BR)¹⁰⁰ and RAndom k-labELsets (RAkEL)¹⁰¹, to construct traditional machine learning-based models, which were implemented with MEKA (http://waikato.github.io/meka/)¹⁰². The eight feature types were fed into these algorithms to train the models. For convenience, the models based on BR and RAkEL are also called BR and RAkEL. After trying different base classification algorithms, including decision tree, random forest, Bayes, support vector machine, and k-nearest neighbor, we selected the best one to construct the BR and RAkEL models. The performance of the above models on S(3.0) under ten-fold cross-validation is listed in Tables 9 and 10. For easy comparison, the performance of CircLoc is also listed in these two tables. CircLoc yielded the highest average AUC and AUPR. For AUC, the advantage was at least 0.05, whereas the advantage on AUPR was at least 0.03. Furthermore, CircLoc provided the highest AUC and AUPR values on five localizations. Clearly, CircLoc was superior to the above five models. This result supported the effectiveness of CircLoc in predicting circRNA subcellular localizations.

Table 9 AUC values of different models in determining circRNA subcellular localizations.


Table 10 AUPR values of different models in determining circRNA subcellular localizations.


Generalization ability of CircLoc

As mentioned in Sect. 2.1, we employed the human circRNA subcellular localization data in RNALocate 2.0 and constructed one training dataset S(2.0) and one test dataset TD₁. The CircLoc was trained on S(2.0) and applied to TD₁.

We first tested CircLoc on S(2.0) by ten-fold cross-validation. The ROC and PR curves on the seven subcellular localizations are displayed in Fig. 3(C) and 3(D), respectively. The overall metrics are listed in Table 5. The performance on S(2.0) was similar to that on S(3.0). On some metrics (e.g., micro F1), CircLoc on S(2.0) even had evident advantages. Then, CircLoc was applied to TD₁. As only four localizations (Cytoplasm, Cytosol, Nucleolus, and Nucleus) contained circRNAs, we only computed the AUC and AUPR on these localizations, and the overall metrics were not computed. The AUC and AUPR values on the above four localizations are listed in Table 11. The four AUC values were 0.6887, 0.6074, 0.7541, and 0.7974, whereas the four AUPR values were 0.9672, 0.0097, 0.0867, and 0.2283. The AUC values were more stable than the AUPR values. As there were only seven and two circRNAs on Cytosol and Nucleolus, respectively, the AUPR values on these two localizations were very low. The AUPR values on the other two localizations (Cytoplasm and Nucleus) were much higher as there were enough circRNAs on these two localizations.

Table 11 Performance of CircLoc on two test datasets.


Further, we constructed another test dataset TD₂ using the human circRNA subcellular localization data in CSCD2. CircLoc was trained on S(3.0) and applied to TD₂. The test results are listed in Tables 5 and 11. The six overall metrics (average AUC, average AUPR, micro F1, macro F1, hamming loss, and ranking loss) were 0.7097, 0.3619, 0.3377, 0.2134, 0.2947, and 0.2805. Compared with the cross-validation results on the training dataset S(3.0), the average AUC, average AUPR, micro F1, and macro F1 decreased and the hamming loss and ranking loss increased, suggesting that the performance of the model on TD₂ was lower than that on the training dataset. In detail, the average AUC and average AUPR decreased by less than 0.1, whereas the other four metrics decreased or increased by more than 0.1. Two factors can explain the lower performance. On one hand, as the circRNAs in TD₂ and their localization information were retrieved from CSCD2 rather than RNALocate, this information was strictly isolated from the circRNAs in the training dataset. On the other hand, during the test procedure, we found that most associated disease and drug information of the test circRNAs was not available, which influenced the predicted results. As for the AUC and AUPR values on the seven localizations (Table 11), some values were higher than those on the training dataset and others were lower. However, the differences were not very large, suggesting that the results of this test were reasonable.

According to the results on the two test datasets, CircLoc has the ability to predict subcellular localizations of new circRNAs. However, this ability was not very strong and leaves room for improvement.

Limitations of this study

In this study, a computational model was built for predicting circRNA subcellular localizations. Although the model exhibited good performance, some limitations remain. First, as a deep learning-based model, its interpretability is poor. The model can be a tool to identify latent subcellular localizations of circRNAs. However, it cannot provide clear insights into the essential differences between circRNA subcellular localizations. Second, only seven subcellular localizations were considered in this study. Some biologically important localizations (e.g., Extracellular vesicle) were not included because of the extreme class imbalance and the lack of adequate computational methods for tackling this problem. Third, although the localizations with few or numerous circRNAs were removed, the dataset was still imbalanced. The current model did not employ any computational method to deal with this problem, which reduced its performance. Fourth, various circRNA properties collected from multiple public databases were used for building the model. This operation enriched the representations of circRNAs. However, it also reduced the practical value of the model because not all circRNAs have all of the properties used. The model may produce unreliable results for circRNAs with only some of these properties. Fifth, the performance of CircLoc leaves room for improvement, especially its generalization ability. Finally, this study did not report latent subcellular localizations of circRNAs or validate them using wet experiments. The model still needs solid tests to confirm its ability in predicting circRNA subcellular localizations. In future, we will continue this work to overcome the above limitations.

Conclusion

This study developed a circRNA subcellular localization prediction model that fused circRNA sequence and network information. Node2vec, GATE, and a self-attention layer were integrated in the model for extracting informative circRNA features. The model performed well and outperformed the models for predicting miRNA subcellular localizations and those using traditional multi-label classification algorithms. Each circRNA feature and some modules in CircLoc provided positive contributions in predicting circRNA subcellular localizations. As computational models for predicting circRNA subcellular localizations are at an early stage, we hope that this study can attract investigators to design efficient models for this problem. The data and codes used in this study are available at https://github.com/CNHanFei/CircLoc.

Data availability

All codes and data are available at https://github.com/CNHanFei/CircLoc.

References

Oddo, J. C., Saxena, T., McConnell, O. L., Berglund, J. A. & Wang, E. T. Conservation of context-dependent splicing activity in distant Muscleblind homologs. Nucleic Acids Res. 44, 8352–8362. https://doi.org/10.1093/nar/gkw735 (2016).
Cheng, L. & Leung, K. S. Quantification of non-coding RNA target localization diversity and its application in cancers. J. Mol. Cell. Biol. 10, 130–138. https://doi.org/10.1093/jmcb/mjy006 (2018).
Frias-Lasserre, D. & Villagra, C. A. The importance of ncRNAs as epigenetic mechanisms in phenotypic variation and organic evolution. Front. Microbiol. 8, 2483. https://doi.org/10.3389/fmicb.2017.02483 (2017).
Patop, I. L., Wüst, S. & Kadener, S. Past, present, and future of circRNAs. EMBO J. 38, e100836 (2019).
Sanger, H. L., Klotz, G., Riesner, D., Gross, H. J. & Kleinschmidt, A. K. Viroids are single-stranded covalently closed circular RNA molecules existing as highly base-paired rod-like structures. Proc. Natl. Acad. Sci. U. S. A. 73, 3852–3856. https://doi.org/10.1073/pnas.73.11.3852 (1976).
Cocquerelle, C., Mascrez, B., Hetuin, D. & Bailleul, B. Mis-splicing yields circular RNA molecules. FASEB J. 7, 155–160. https://doi.org/10.1096/fasebj.7.1.7678559 (1993).
Li, Z. et al. Exon-intron circular RNAs regulate transcription in the nucleus. Nat. Struct. Mol. Biol. 22, 256–264. https://doi.org/10.1038/nsmb.2959 (2015).
Salzman, J., Gawad, C., Wang, P. L., Lacayo, N. & Brown, P. O. Circular RNAs are the predominant transcript isoform from hundreds of human genes in diverse cell types. PLoS One 7, e30733. https://doi.org/10.1371/journal.pone.0030733 (2012).
Geng, X. et al. Circular RNA: Biogenesis, degradation, functions and potential roles in mediating resistance to anticarcinogens. Epigenomics 12, 267–283. https://doi.org/10.2217/epi-2019-0295 (2020).
Liu, J., Yang, L., Fu, Q. & Liu, S. Emerging roles and potential biological value of CircRNA in osteosarcoma. Front. Oncol. 10, 552236. https://doi.org/10.3389/fonc.2020.552236 (2020).
Holdt, L. M., Kohlmaier, A. & Teupser, D. Molecular roles and function of circular RNAs in eukaryotic cells. Cell. Mol. Life Sci. 75, 1071–1098. https://doi.org/10.1007/s00018-017-2688-5 (2018).
Bachmayr-Heyda, A. et al. Correlation of circular RNA abundance with proliferation–Exemplified with colorectal and ovarian cancer, idiopathic lung fibrosis, and normal human tissues. Sci. Rep. 5, 8057. https://doi.org/10.1038/srep08057 (2015).
Li, F. et al. Circular RNA ITCH has inhibitory effect on ESCC by suppressing the Wnt/β-catenin pathway. Oncotarget 6, 6001–6013. https://doi.org/10.18632/oncotarget.3469 (2015).
Li, P. et al. Using circular RNA as a novel type of biomarker in the screening of gastric cancer. Clin. Chim. Acta 444, 132–136. https://doi.org/10.1016/j.cca.2015.02.018 (2015).
Lei, B., Tian, Z., Fan, W. & Ni, B. Circular RNA: A novel biomarker and therapeutic target for human cancers. Int. J. Med. Sci. 16, 292 (2019).
Zhang, Z., Yang, T. & Xiao, J. Circular RNAs: Promising biomarkers for human diseases. EBioMedicine 34, 267–274 (2018).
Chen, L. & Shan, G. CircRNA in cancer: Fundamental mechanism and clinical potential. Cancer Lett. 505, 49–57 (2021).
Zhang, T. et al. Protein subcellular localization prediction model based on graph convolutional network. Interdiscip. Sci. Comput. Life Sci. 14, 937–946. https://doi.org/10.1007/s12539-022-00529-9 (2022).
Ullah, M. et al. PScL-HDeep: Image-based prediction of protein subcellular location in human tissue using ensemble learning of handcrafted and deep learned features with two-layer feature selection. Brief. Bioinform. 22, bbab278. https://doi.org/10.1093/bib/bbab278 (2021).
Cheng, X., Zhao, S. G., Lin, W. Z., Xiao, X. & Chou, K. C. pLoc-mAnimal: Predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 33, 3524–3531. https://doi.org/10.1093/bioinformatics/btx476 (2017).
Chen, L., Qu, R. & Liu, X. Improved multi-label classifiers for predicting protein subcellular localization. Math. Biosci. Eng. 21, 214–236. https://doi.org/10.3934/mbe.2024010 (2024).
Pan, X. et al. Identifying protein subcellular locations with embeddings-based node2loc. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 666–675. https://doi.org/10.1109/tcbb.2021.3080386 (2021).
Wang, R. & Chen, L. Identification of human protein subcellular location with multiple networks. Curr. Proteomics. 19, 344–356 (2022).
Musleh, S., Islam, M. T., Qureshi, R., Alajez, N. M. & Alam, T. MSLP: mRNA subcellular localization predictor based on machine learning techniques. BMC Bioinformatics 24, 109 (2023).
Yan, Z., Lécuyer, E. & Blanchette, M. Prediction of mRNA subcellular localization using deep recurrent neural networks. Bioinformatics 35, i333–i342 (2019).
Wang, D. et al. DM3Loc: Multi-label mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res. 49, e46–e46 (2021).
Xiao, Y., Cai, J., Yang, Y., Zhao, H. & Shen, H. in IEEE International Conference on Data Mining (ICDM). 1332–1337 (IEEE, 2018).
Asim, M. N. et al. MirLocPredictor: A ConvNet-based multi-label microRNA subcellular localization predictor by incorporating k-Mer positional information. Genes (Basel) 11, 1475. https://doi.org/10.3390/genes11121475 (2020).
Asim, M. N. et al. in 2021 International Joint Conference on Neural Networks (IJCNN). 1–8 (IEEE).
Meher, P. K., Satpathy, S. & Rao, A. R. miRNALoc: Predicting miRNA subcellular localizations based on principal component scores of physico-chemical properties and pseudo compositions of di-nucleotides. Sci. Rep. 10, 14557. https://doi.org/10.1038/s41598-020-71381-4 (2020).
Yang, Y., Fu, X., Qu, W., Xiao, Y. & Shen, H. B. MiRGOFS: A GO-based functional similarity measurement for miRNAs, with applications to the prediction of miRNA subcellular localization and miRNA-disease association. Bioinformatics 34, 3547–3556. https://doi.org/10.1093/bioinformatics/bty343 (2018).
Xu, M. et al. MiRLoc: Predicting miRNA subcellular localization by incorporating miRNA-mRNA interactions and mRNA subcellular localization. Brief. Bioinform. 23, bbac044. https://doi.org/10.1093/bib/bbac044 (2022).
Bai, T., Yan, K. & Liu, B. DAmiRLocGNet: miRNA subcellular localization prediction by combining miRNA-disease associations and graph convolutional networks. Brief. Bioinform. 24, bbad212. https://doi.org/10.1093/bib/bbad212 (2023).
Chen, L., Gu, J. & Zhou, B. PMiSLocMF: Predicting miRNA subcellular localizations by incorporating multi-source features of miRNAs. Brief. Bioinform. 25, bbae386 (2024).
Liang, Y. et al. MGFmiRNAloc: Predicting miRNA subcellular localization using molecular graph feature and convolutional block attention module. IEEE/ACM Trans. Comput. Biol. Bioinform. 21, 1348–1357. https://doi.org/10.1109/tcbb.2024.3383438 (2024).
Jiang, J. & Yan, C. PMLocMSCAM: Predicting miRNA subcellular localisations by miRNA similarities and cross-attention mechanism. IET Syst. Biol. 19, e70023. https://doi.org/10.1049/syb2.70023 (2025).
Asim, M. N., Ibrahim, M. A., Imran Malik, M., Dengel, A. & Ahmed, S. Circ-LocNet: A computational framework for circular RNA sub-cellular localization prediction. Int. J. Mol. Sci. 23, 8221. https://doi.org/10.3390/ijms23158221 (2022).
Zeng, M. et al. CellCircLoc: Deep neural network for predicting and explaining cell line-specific CircRNA subcellular localization. IEEE J. Biomed. Health Inform. 29, 1494–1503. https://doi.org/10.1109/jbhi.2024.3491732 (2025).
Wei, M.-M. et al. Integrating transformer and graph attention network for circRNA-miRNA interaction prediction. IEEE J. Biomed. Health Inform. https://doi.org/10.1109/JBHI.2025.3561197 (2025).
Yuan, S. et al. FDENet: Frequency-guided dual-encoder network for building footprint extraction from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. https://doi.org/10.1109/JSTARS.2025.3601023 (2025).
Li, D. et al. DeepHIV: A sequence-based deep learning model for predicting HIV-1 protease cleavage sites. IEEE Trans. Comput. Biol. Bioinform. https://doi.org/10.1109/TCBBIO.2025.3610881 (2025).
Lan, W. et al. The large language models on biomedical data analysis: A survey. IEEE J. Biomed. Health Inform. 29, 4486–4497. https://doi.org/10.1109/JBHI.2025.3530794 (2025).
Noor, S. et al. Deep-m5U: A deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration. BMC Bioinformatics 25, 360. https://doi.org/10.1186/s12859-024-05978-1 (2024).
Khan, S., AlQahtani, S. A., Noor, S. & Ahmad, N. PSSM-Sumo: Deep learning based intelligent model for prediction of sumoylation sites using discriminative features. BMC Bioinformatics 25, 284. https://doi.org/10.1186/s12859-024-05917-0 (2024).
Khan, S. et al. XGBoost-enhanced ensemble model using discriminative hybrid features for the prediction of sumoylation sites. Biodata Min. 18, 12. https://doi.org/10.1186/s13040-024-00415-8 (2025).
Khan, S., Dilshad, N., Ahmad, N., Noor, S. & AlQahtani, S. A. Integrating AI in security information and event management for real time cyber defense. Sci. Rep. 15, 35872 (2025).
Salehi, A. & Davulcu, H. Graph attention auto-encoders. arXiv preprint arXiv:1905.10715 (2019).
Wu, L. et al. RNALocate v3.0: Advancing the repository of RNA subcellular localization with dynamic analysis and prediction. Nucleic Acids Res. 53, D284–D292. https://doi.org/10.1093/nar/gkae872 (2025).
Cui, T. et al. RNALocate v2.0: An updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res. 50, D333–D339. https://doi.org/10.1093/nar/gkab825 (2022).
Feng, J. et al. CSCD2: An integrated interactional database of cancer-specific circular RNAs. Nucleic Acids Res. 50, D1179–D1183. https://doi.org/10.1093/nar/gkab830 (2022).
Liu, B., Liu, F., Fang, L., Wang, X. & Chou, K. C. repDNA: A Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 31, 1307–1309. https://doi.org/10.1093/bioinformatics/btu820 (2015).
Liu, B. et al. Pse-in-One: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71. https://doi.org/10.1093/nar/gkv458 (2015).
Liu, B., Liu, F., Fang, L., Wang, X. & Chou, K. C. repRNA: A web server for generating various feature vectors of RNA sequences. Mol. Genet. Genomics 291, 473–481. https://doi.org/10.1007/s00438-015-1078-7 (2016).
Liu, B. BioSeq-Analysis: A platform for DNA, RNA and protein sequence analysis based on machine learning approaches. Brief. Bioinform. 20, 1280–1294. https://doi.org/10.1093/bib/bbx165 (2019).
Glazar, P., Papavasileiou, P. & Rajewsky, N. circBase: A database for circular RNAs. RNA 20, 1666–1670. https://doi.org/10.1261/rna.043687.113 (2014).
Lv, H. et al. Evaluation of different computational methods on 5-methylcytosine sites identification. Brief. Bioinform. 21, 982–995. https://doi.org/10.1093/bib/bbz048 (2020).
Tang, G., Shi, J., Wu, W., Yue, X. & Zhang, W. Sequence-based bacterial small RNAs prediction using ensemble learning strategies. BMC Bioinform. 19, 503. https://doi.org/10.1186/s12859-018-2535-1 (2018).
Gupta, S. et al. Predicting human nucleosome occupancy from primary sequence. PLoS Comput. Biol. 4, e1000134. https://doi.org/10.1371/journal.pcbi.1000134 (2008).
Akiyama, M. & Sakakibara, Y. Informative RNA base embedding for RNA structural alignment and clustering by deep representation learning. NAR Genom. Bioinform. 4, lqac012. https://doi.org/10.1093/nargab/lqac012 (2022).
Wang, N. et al. Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning. Nat. Mach. Intell. 6, 548–557 (2024).
Chen, J. et al. Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions. arXiv preprint arXiv:2204.00300 (2022).
Wei, M., Wang, L., Su, X., Zhao, B. & You, Z. Multi-hop graph structural modeling for cancer-related circRNA-miRNA interaction prediction. Pattern Recogn. 170, 112078 (2026).
Wang, S., Lee, H. C. & Lee, S. Predicting herb-disease associations using network-based measures in human protein interactome. BMC Complement. Med. Ther. 24, 218. https://doi.org/10.1186/s12906-024-04503-4 (2024).
Yin, A., Chen, L., Zhou, B. & Cai, Y. D. CMAGN: circRNA–miRNA association prediction based on graph attention auto-encoder and network consistency projection. BMC Bioinform. 25, 336. https://doi.org/10.1186/s12859-024-05959-4 (2024).
Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
Meng, S. et al. CircRNA: Functions and properties of a novel potential biomarker for cancer. Mol. Cancer 16, 1–8 (2017).
Gao, J.-L., Chen, G., He, H.-Q. & Wang, J. CircRNA as a new field in human disease research. Zhongguo Zhong Yao Za Zhi 43, 457–462 (2018).
Guarnerio, J. et al. Oncogenic role of fusion-circRNAs derived from cancer-associated chromosomal translocations. Cell 165, 289–302 (2016).
Shang, Q., Yang, Z., Jia, R. & Ge, S. The novel roles of circRNAs in human cancer. Mol. Cancer 18, 1–10 (2019).
Lan, W. et al. LGCDA: Predicting circRNA-disease association based on fusion of local and global features. IEEE/ACM Trans. Comput. Biol. Bioinform. 21, 1413–1422. https://doi.org/10.1109/TCBB.2024.3387913 (2024).
Lan, W. et al. IGNSCDA: Predicting circRNA-disease associations based on improved graph convolutional network and negative sampling. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 3530–3538. https://doi.org/10.1109/TCBB.2021.3111607 (2022).
Lan, W. et al. KGANCDA: Predicting circRNA-disease associations based on knowledge graph attention network. Brief. Bioinform. 23, bbab494. https://doi.org/10.1093/bib/bbab494 (2022).
Lan, W. et al. Predicting CircRNA-disease associations based on heterogeneous graph neural network and knowledge graph attribute mining attention. Interdiscip. Sci. Comput. Life Sci. 17, 586–597. https://doi.org/10.1007/s12539-025-00706-6 (2025).
Lan, W. et al. Benchmarking of computational methods for predicting circRNA-disease associations. Brief. Bioinform. 24, bbac613. https://doi.org/10.1093/bib/bbac613 (2023).
Chen, L. & Zhao, X. PCDA-HNMP: Predicting circRNA-disease association using heterogeneous network and meta-path. Math. Biosci. Eng. 20, 20553–20575 (2023).
Fan, C. et al. CircR2Disease v2.0: An updated web server for experimentally validated circRNA-disease associations and its application. Genomics Proteomics Bioinform. 20, 435–445. https://doi.org/10.1016/j.gpb.2021.10.002 (2022).
Wei, L. et al. Noncoding RNAs in gastric cancer: Implications for drug resistance. Mol. Cancer 19, 62. https://doi.org/10.1186/s12943-020-01185-7 (2020).
Cui, C. et al. Functions and mechanisms of circular RNAs in cancer radiotherapy and chemotherapy resistance. Mol. Cancer 19, 58. https://doi.org/10.1186/s12943-020-01180-y (2020).
Deng, L., Liu, Z., Qian, Y. & Zhang, J. Predicting circRNA-drug sensitivity associations via graph attention auto-encoder. BMC Bioinform. 23, 160. https://doi.org/10.1186/s12859-022-04694-y (2022).
Chen, Y., Wang, Y., Ding, Y., Su, X. & Wang, C. RGCNCDA: Relational graph convolutional network improves circRNA-disease association prediction by incorporating microRNAs. Comput. Biol. Med. 143, 105322. https://doi.org/10.1016/j.compbiomed.2022.105322 (2022).
Liu, M., Wang, Q., Shen, J., Yang, B. B. & Ding, X. Circbank: A comprehensive database for circRNA with standard nomenclature. RNA Biol. 16, 899–905. https://doi.org/10.1080/15476286.2019.1600395 (2019).
Dudekula, D. B. et al. CircInteractome: A web tool for exploring circular RNAs and their interacting proteins and microRNAs. RNA Biol. 13, 34–42. https://doi.org/10.1080/15476286.2015.1128065 (2016).
Grover, A. & Leskovec, J. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 855–864 (ACM, San Francisco, California, USA, 2016).
Kingma, D. P. & Ba, J. in 3rd International Conference on Learning Representations (Louisiana, USA, 2019).
Kohavi, R. in International Joint Conference on Artificial Intelligence 1137–1145 (Lawrence Erlbaum Associates Ltd).
Yuan, F. et al. Integrative multi-omics machine learning reveals novel driver genes associations in lung adenocarcinoma. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics 1874, 141113. https://doi.org/10.1016/j.bbapap.2025.141113 (2026).
Ma, Q. et al. Identifying transcriptional signatures of leukocytes in tissue and blood for multicancer diagnosis by using machine learning methods. Cancer Genet. 302–303, 13–26. https://doi.org/10.1016/j.cancergen.2026.01.003 (2026).
Ren, J. et al. Identification of gene signatures associated with multisystem inflammatory syndrome in children after SARS-CoV-2 infection. Curr. Bioinform. (2026).
Chen, L., Yang, J., Zhou, B. & Cai, Y.-D. PLysPTM-HGNN: Predicting lysine PTM sites of proteins using hybrid graph neural networks. BMC Bioinformatics 27, 32 (2026).
Powers, D. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2011).
Chen, L., Xu, L., Zhou, B. & Chen, Y. AntiCanNet: A graph convolution and chemical LLM framework for predicting anti-cancer small molecules. Curr. Bioinform. (2026).
Chen, L., Lu, Y., Xu, J. & Zhou, B. Prediction of drug’s anatomical therapeutic chemical (ATC) code by constructing biological profiles of ATC codes. BMC Bioinformatics 26, 86 (2025).
Chen, L., Zhang, S. & Zhou, B. Herb-disease association prediction model based on network consistency projection. Sci. Rep. 15, 3328 (2025).
Chen, L., Xun, X. & Zhou, B. Root-associated protein prediction using a protein large language model and hypergraph convolutional networks. Sci. Rep. 16, 4876 (2026).
Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Yang, B. & Chen, H. Predicting circRNA-drug sensitivity associations by learning multimodal networks using graph auto-encoders and attention mechanism. Brief. Bioinform. 24, bbac596. https://doi.org/10.1093/bib/bbac596 (2023).
Chen, L. et al. Predicting human protein subcellular locations by using a combination of network and function features. Front. Genet. 12, 783128. https://doi.org/10.3389/fgene.2021.783128 (2021).
Pan, X. et al. Identifying protein subcellular locations with embeddings-based node2loc. IEEE/ACM Trans. Comput. Biol. Bioinform. 19, 666–675. https://doi.org/10.1109/tcbb.2021.3080386 (2022).
Garapati, H. S., Male, G. & Mishra, K. Predicting subcellular localization of proteins using protein-protein interaction data. Genomics 112, 2361–2368. https://doi.org/10.1016/j.ygeno.2020.01.007 (2020).
Zhang, M.-L., Li, Y.-K., Liu, X.-Y. & Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 12, 191–202. https://doi.org/10.1007/s11704-017-7031-7 (2018).
Tsoumakas, G., Katakis, I. & Vlahavas, I. Random k-Labelsets for multilabel classification. IEEE Trans. Knowl. Data Eng. 23, 1079–1089. https://doi.org/10.1109/TKDE.2010.164 (2011).
Read, J., Reutemann, P., Pfahringer, B. & Holmes, G. MEKA: A multi-label/multi-target extension to WEKA. J. Mach. Learn. Res. 17, 1–5 (2016).

Author information

Authors and Affiliations

College of Information Engineering, Shanghai Maritime University, Shanghai, 201306, People’s Republic of China
Lei Chen & Jinghai Hu
School of Basic Medical Sciences, Shanghai University of Medicine and Health Sciences, Shanghai, 201318, China
Bo Zhou

Contributions

L.C. designed the research; L.C., J.H. and B.Z. conducted the experiments; J.H. and B.Z. analyzed the results; L.C. and J.H. wrote the manuscript. All authors have read and approved the manuscript.

Corresponding author

Correspondence to Lei Chen.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (download PDF)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Chen, L., Hu, J. & Zhou, B. Predicting circRNA subcellular localization by fusing circRNA sequence and network information. Sci Rep 16, 12775 (2026). https://doi.org/10.1038/s41598-026-43808-x

Received: 06 October 2025
Accepted: 06 March 2026
Published: 09 March 2026
Version of record: 20 April 2026
DOI: https://doi.org/10.1038/s41598-026-43808-x