CN119252330B

CN119252330B - RNA N4-acetylcytidine modification site prediction method and system based on RNAErnie pre-training model

Info

Publication number: CN119252330B
Application number: CN202411508778.3A
Authority: CN
Inventors: 张子龙; 贾艳娜; 崔菲菲; 耿奥运; 付修豪
Original assignee: Hainan University
Current assignee: Hainan University
Priority date: 2024-10-28
Filing date: 2024-10-28
Publication date: 2025-08-22
Anticipated expiration: 2044-10-28
Also published as: CN119252330A

Abstract

The present invention relates to a method and system for predicting RNA N4-acetylcytidine modification sites based on an RNAErnie pre-training model. The method comprises: collecting an RNA sequence data set and inputting it into the RNAErnie pre-training model to capture context dependencies and extract global features; performing feature encoding by combining the RNAErnie pre-training model with six traditional feature encoding methods; and performing feature dimensionality reduction in a deep neural network model and inputting the result into a soft voting integration model to integrate the prediction results of different classifiers, thereby obtaining RNA N4-acetylcytidine modification site prediction results. Multi-level masking is performed by the RNAErnie pre-training model, and the six traditional feature encoding methods are combined to capture the details and physicochemical properties of the sequence. Feature dimensionality reduction is performed automatically by a deep neural network, which reduces computational complexity while retaining key information. The dimension-reduced features are input into the soft voting integration model, and the final prediction result is obtained by integrating multiple classifiers, which significantly improves the accuracy and robustness of the prediction.

Description

RNA N4-acetylcytidine modification site prediction method and system based on RNAErnie pre-training model

Technical Field

The invention relates to the technical field of bioinformatics, and in particular to an RNA N4-acetylcytidine modification site prediction method and system based on the RNAErnie pre-training model.

Background

To date, over 170 modified nucleosides have been found in RNA. Post-transcriptional chemical modifications of RNA, collectively referred to as the "epitranscriptome", have a substantial impact on gene expression and cellular processes and play an important role in molecular interactions and intermolecular relationships. N4-acetylcytidine (ac4C) is a common modification catalyzed by the enzyme NAT10, in which an acetyl group is added to the nitrogen at the fourth position of the cytidine base. ac4C was originally found in tRNA and rRNA in eukaryotes and prokaryotes, and current research has also established that ac4C is present in human mRNA, where it can increase translation efficiency, enhance mRNA stability, and regulate gene expression. Furthermore, there is growing evidence that ac4C is associated with a variety of human diseases, including inflammation, metabolic disorders, autoimmune diseases, and cancer. In summary, as a key post-transcriptional modification of RNA, ac4C plays an important role in cellular function and disease processes. Investigating the function and mechanism of RNA ac4C modification sites is therefore important for elucidating their biological significance and for advancing therapeutic strategies for related diseases.

Traditional RNA ac4C modification site detection includes biological experimental methods, high-throughput sequencing technologies, and computer-aided analysis methods. Traditional biological experimental methods mainly comprise chemical analysis and immunodetection; they are widely used in laboratories, offer relatively high sensitivity and specificity, and are suitable for qualitative and quantitative analysis of specific modifications. In recent years, high-throughput sequencing technologies, represented by MeRIP-seq (an RNA immunoprecipitation sequencing technique), have been widely used for genome-wide detection of ac4C modification. By enriching modification sites in RNA samples and then sequencing them, these technologies enable large-scale exploration and analysis of ac4C modification across the whole genome.

However, traditional RNA ac4C modification site detection has several drawbacks. Most wet-lab experiments are expensive and time-consuming, with high detection costs, complex operation, and low sensitivity and specificity. High-throughput sequencing technologies often depend on antibody enrichment and have low resolution, so detection accuracy is limited by antibody quality and background noise. Computer-aided analysis methods rely heavily on traditional feature encoding techniques, require complex feature engineering steps, and lack a comprehensive understanding of contextual semantic relationships. As a result, traditional RNA ac4C modification site detection methods often suffer from high cost and insufficient information mining, which leads to low detection accuracy.

Disclosure of Invention

In view of this, and to solve the above technical problems, a method and a system for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model are provided, which can detect RNA N4-acetylcytidine modification sites quickly and at low cost while improving detection accuracy.

A method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model, the method comprising:

collecting an RNA sequence data set, wherein the RNA sequence data set comprises positive and negative samples;

inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, capturing context dependencies and extracting global features, and performing feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features;

inputting the encoded high-dimensional features into a deep neural network model for feature dimension reduction to obtain dimension reduced features;

inputting the dimension-reduced features into a soft voting integration model, and integrating the prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result;

wherein the soft voting integration model is constructed from XGBoost, MLP, and CatBoost classifiers.

In one embodiment, the method further comprises:

determining evaluation indexes, and evaluating the performance of the soft voting integration model by ten-fold cross-validation according to the evaluation indexes to obtain an evaluation result;

wherein the evaluation indexes comprise sensitivity, specificity, accuracy, the Matthews correlation coefficient, and the area under the curve.

In one embodiment, the method further comprises:

displaying a user interaction interface, and acquiring an RNA sequence to be predicted through the user interaction interface;

inputting the RNA sequence to be predicted into the soft voting integration model, and outputting an RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted;

and displaying, in the user interaction interface, the RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted.

In one embodiment, after collecting the RNA sequence dataset, the method further comprises:

determining a data set dividing ratio;

and performing stratified sampling on the RNA sequence data set based on the data set dividing ratio to obtain a training data set and a test data set.

In one embodiment, the RNAErnie pre-training model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and combines Transformer layers with a multi-head self-attention mechanism, wherein:

each RNA sequence in the RNA sequence data set is input into the RNAErnie pre-training model, and the attention score corresponding to each RNA sequence is calculated by each attention head based on the multi-head self-attention mechanism;

the obtained attention scores are concatenated, and each RNA sequence is mapped to query, key, and value matrices through linear transformation matrices.

In one embodiment, inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking comprises:

the RNAErnie pre-training model adopts a motif-level masking strategy, a subsequence-level masking strategy, and a motif-level random masking strategy, and incorporates coarse-grained RNA types as vocabulary tokens;

the RNAErnie pre-training model appends the vocabulary tokens to the last segment of each of the RNA sequences, enhancing the RNA sequence representation.

In one embodiment, inputting the encoded high-dimensional features into a deep neural network model for feature dimension reduction to obtain dimension-reduced features comprises:

inputting the encoded high-dimensional features into the deep neural network model, and mapping the high-dimensional features from a high-dimensional space to a low-dimensional space through the multi-layer nonlinear mapping of the deep neural network model to obtain the dimension-reduced features.

In one embodiment, integrating the prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result comprises:

determining the prediction probabilities of all classifiers through the soft voting integration model;

performing a weighted average calculation on the prediction probabilities to obtain the maximum value of the weighted average probability;

and taking the maximum value as the RNA N4-acetylcytidine modification site prediction result.

An RNA N4-acetylcytidine modification site prediction system based on the RNAErnie pre-training model, the system comprising:

a data set acquisition module, configured to collect an RNA sequence data set, wherein the RNA sequence data set comprises positive and negative samples;

a feature encoding module, configured to input each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, capture context dependencies and extract global features, and perform feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features;

a feature dimension reduction module, configured to input the encoded high-dimensional features into a deep neural network model for feature dimension reduction to obtain dimension-reduced features;

a result prediction module, configured to input the dimension-reduced features into a soft voting integration model and integrate the prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result.

According to the above RNA N4-acetylcytidine modification site prediction method and system based on the RNAErnie pre-training model, multi-level masking performed by the RNAErnie pre-training model accurately captures context dependencies and extracts global features, so that more comprehensive RNA sequence information is captured. Combining six traditional feature encoding methods captures the details and physicochemical properties of the sequences. Feature dimension reduction is performed automatically by a deep neural network, which learns and selects the most relevant features, reducing computational complexity while retaining key information. The dimension-reduced features are input into a soft voting integration model, and the final prediction result is obtained by integrating multiple classifiers, which significantly improves prediction accuracy and robustness. RNA N4-acetylcytidine modification sites can thus be detected quickly and at low cost, with improved detection accuracy.

Drawings

FIG. 1 is a diagram of an application environment for an RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model in one embodiment;

FIG. 2 is a flow chart of a method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model in one embodiment;

FIG. 3 is a schematic diagram of the model framework of Voting-ac4C in one embodiment;

FIG. 4 is a schematic diagram of the parameters of an ablation experiment verifying the reliability of the hybrid feature encoding;

FIG. 5 is a schematic diagram of the parameters of an ablation experiment verifying the reliability of the selected machine learning classifiers;

FIG. 6 is a schematic diagram of the parameters of an ablation experiment verifying the reliability of the soft voting method;

FIG. 7 is a block diagram of an RNA N4-acetylcytidine modification site prediction system based on the RNAErnie pre-training model in one embodiment;

FIG. 8 is a block diagram of an RNA N4-acetylcytidine modification site prediction system based on the RNAErnie pre-training model in another embodiment;

FIG. 9 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

The RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model provided by the embodiments of the application can be applied to the application environment shown in FIG. 1. As shown in FIG. 1, the application environment includes a computer device 110. The computer device 110 can collect an RNA sequence data set containing positive and negative samples. The computer device 110 can input each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, so as to capture context dependencies and extract global features, and can perform feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features. The computer device 110 can input the encoded high-dimensional features into a deep neural network model for feature dimension reduction to obtain dimension-reduced features. The computer device 110 can then input the dimension-reduced features into a soft voting integration model, which integrates the prediction results of different classifiers to obtain an RNA N4-acetylcytidine modification site prediction result; the soft voting integration model is constructed from XGBoost, MLP, and CatBoost classifiers. The computer device 110 may be, but is not limited to, a personal computer, notebook computer, smart phone, robot, unmanned aerial vehicle, tablet computer, or the like.

In one embodiment, as shown in FIG. 2, a method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model is provided, comprising the following steps:

Step 202: collecting an RNA sequence data set, wherein the RNA sequence data set comprises positive and negative samples.

Establishing a reliable benchmark data set is the basis for developing a powerful method for predicting ac4C modification sites. In this embodiment, a previously constructed RNA sequence data set may be used, which specifically includes 2758 balanced positive and negative samples. The data set is divided into a training set (2206 samples) and a test set (552 samples), and positive and negative samples are equally distributed in the training and test sets, which helps ensure the stability and generalization capability of the model.

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model further comprises a data set division process, which specifically comprises: determining a data set dividing ratio; and performing stratified sampling on the RNA sequence data set based on the data set dividing ratio to obtain a training data set and a test data set.

Specifically, in this embodiment, all samples were divided by stratified sampling in a 4:1 ratio into a training data set and a test data set. The training data set included 2206 positive and 2206 negative samples, while the independent test data set consisted of 552 samples from each category (positive and negative).
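The 4:1 stratified division described above can be sketched as follows. This is a minimal illustration rather than the embodiment's actual code: the function name `stratified_split` and the fixed random seed are hypothetical, and the toy labels assume that the stated counts are per class (2758 positive and 2758 negative sequences).

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that every class keeps the same test fraction."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class before cutting
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (assumption): 1 = ac4C site, 0 = non-site, 2758 of each.
labels = [1] * 2758 + [0] * 2758
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
```

Splitting each class separately is what keeps the positive-to-negative balance identical in the training and test sets.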

Step 204: inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, capturing context dependencies and extracting global features, and performing feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features.

Global context features are obtained for each RNA sequence in the RNA sequence data set through the RNAErnie pre-training model, which is used to produce feature representations of the RNA sequences. RNAErnie is a pre-training model based on the Transformer architecture and designed specifically for RNA sequences. It builds on the Enhanced Representation through Knowledge Integration (ERNIE) framework and incorporates multiple Transformer layers and a multi-head self-attention mechanism, with a hidden state dimension of 768 for each Transformer block. These design choices enable the RNAErnie pre-training model to capture complex patterns and deep biological information in RNA sequences.

In one embodiment, the RNAErnie pre-training model is built on the Enhanced Representation through Knowledge Integration (ERNIE) framework and combines Transformer layers with a multi-head self-attention mechanism. The RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model may further comprise a feature processing procedure using the RNAErnie pre-training model, which specifically comprises: inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model; calculating the attention score corresponding to each RNA sequence through each attention head based on the multi-head self-attention mechanism; concatenating the obtained attention scores; and mapping each RNA sequence to query, key, and value matrices through linear transformation matrices.

In this embodiment, the RNAErnie pre-training model is based on the Transformer architecture, one of whose core components is the multi-head attention mechanism. This mechanism captures different aspects of the input sequence by computing multiple attention heads in parallel, thereby enhancing the ability to understand the RNA sequence. For each attention head, W denotes a set of linear transformation matrices; specifically, $W^Q$, $W^K$ and $W^V$ are the linear transformation matrices that map the input sequence $X$ to the query ($Q$), key ($K$) and value ($V$) matrices, respectively:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Each head calculates attention scores through the self-attention mechanism to evaluate the interrelationships between the elements of the input sequence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where the dimension $d_k$ of the key vectors is used to scale the dot product of the query matrix $Q$ and the key matrix $K$ to prevent gradient problems. The scaled dot product yields the raw attention scores, which are then converted into a probability distribution by the softmax function; this distribution is used to weight the vectors in the value matrix $V$ to produce a contextual representation of the sequence. The outputs of the multiple attention heads are concatenated, and the final result is obtained by a linear transformation:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^O$$

where the concatenation operator Concat merges the individual head outputs and $W^O$ is a trainable transformation matrix that maps the combined features to the desired output space.
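To make the attention computation concrete, the following is a small pure-Python sketch of scaled dot-product attention and multi-head concatenation. It is illustrative only and is not RNAErnie's implementation: the helper names, the tiny dimensions (3 tokens, model width 4, two heads with $d_k = 2$) and the random weights are all assumptions chosen for readability.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, w_q, w_k, w_v):
    """One head: softmax(Q K^T / sqrt(d_k)) V, returning (output, weights)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def multi_head(x, heads, w_o):
    """Concatenate the per-head outputs token by token, then apply W^O."""
    outputs = [attention(x, *head)[0] for head in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

x = rand_matrix(3, 4)  # 3 tokens, model width 4 (toy values)
heads = [(rand_matrix(4, 2), rand_matrix(4, 2), rand_matrix(4, 2)) for _ in range(2)]
w_o = rand_matrix(4, 4)
context = multi_head(x, heads, w_o)
_, head_weights = attention(x, *heads[0])
```

Each row of `head_weights` is a probability distribution over the input positions, which is the weighting applied to the value vectors.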

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model may further comprise a multi-level masking process, which specifically comprises: the RNAErnie pre-training model adopts motif-level masking, subsequence-level masking, and motif-level random masking strategies and incorporates coarse-grained RNA types as vocabulary tokens; and the RNAErnie pre-training model appends the vocabulary tokens to the last segment of each RNA sequence to enhance the RNA sequence representation.

The RNAErnie pre-training model adopts three masking strategies to enhance the RNA sequence representation. Motif-level masking masks single nucleotides to capture local features. Subsequence-level masking targets contiguous fragments of the RNA sequence to capture long-range dependencies and global features. Motif-level random masking introduces randomness by masking nucleotides without regard to their position or contiguity, which improves the robustness and generalization capability of the model.
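The three behaviours described above (masking single nucleotides, masking a contiguous fragment, and masking positions chosen without regard to position or contiguity) can be sketched as follows. This is a hedged illustration only: the function names, the `[MASK]` placeholder, and the rate, length and count parameters are hypothetical and are not taken from RNAErnie itself.

```python
import random

MASK = "[MASK]"  # placeholder token (assumption)

def mask_single_nucleotides(tokens, rate, rng):
    """Mask each nucleotide independently with probability `rate`."""
    return [MASK if rng.random() < rate else t for t in tokens]

def mask_subsequence(tokens, length, rng):
    """Mask one contiguous fragment of `length` nucleotides."""
    start = rng.randrange(0, len(tokens) - length + 1)
    return tokens[:start] + [MASK] * length + tokens[start + length:]

def mask_random_positions(tokens, count, rng):
    """Mask `count` positions chosen without regard to position or contiguity."""
    chosen = set(rng.sample(range(len(tokens)), count))
    return [MASK if i in chosen else t for i, t in enumerate(tokens)]

rng = random.Random(0)
tokens = list("ACGUACGUACGU")
single = mask_single_nucleotides(tokens, rate=0.15, rng=rng)
fragment = mask_subsequence(tokens, length=4, rng=rng)
scattered = mask_random_positions(tokens, count=3, rng=rng)
```

During pre-training the model would be asked to recover the original nucleotides at the masked positions; that training loop is omitted here.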

In addition, the RNAErnie pre-training model incorporates coarse-grained RNA types (e.g., mRNA, miRNA, lncRNA) as special vocabulary tokens, which are appended to the last segment of each RNA sequence during pre-training. This strategy enables the model to identify and exploit RNA type-specific features when processing various downstream tasks, thereby enhancing the domain adaptability of the model and its ability to generalize across multiple tasks.

The feature encoding also combines six traditional feature encoding methods: One-hot, ENAC, C2, ND, TPCP, and Ksnpf. Specifically:

One-hot encoding is used in bioinformatics to represent nucleotide sequences. The four RNA bases in an RNA molecule, adenine (A), cytosine (C), guanine (G) and uracil (U), are represented as binary vectors consisting of 0 and 1; that is, nucleotides A, C, G and U are represented by the four vectors (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0) and (0, 0, 0, 1), respectively.

ENAC encoding calculates the nucleic acid composition within fixed-length windows to generate a feature vector for each window, capturing local structure information in the sequence and providing a useful feature representation for subsequent analysis and modeling.

C2 encoding converts the elements of a biological sequence into specific values from the perspective of the global sequence, converting each RNA base in the nucleotide sequence of the RNA molecule into a 2-bit binary value: for example, adenine (A) to (0, 0), cytosine (C) to (1, 1), guanine (G) to (1, 0), and uracil (U) to (0, 1).

ND and TPCP encoding: 11 physicochemical properties are used to generate a 1375-dimensional vector (i.e., 125 trinucleotide combinations × 11 physicochemical properties) for each sequence window, and any trinucleotide containing the nucleotide N is set to zero.

Ksnpf encoding quantifies the occurrence of the 16 nucleotide pairs separated by k arbitrary nucleotides in the sequence. By setting k to 0, 1, 2, 3 and 4, the sequence is converted into feature representations that reflect the frequency of these nucleotide pairs at different spacings.
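Four of these encodings are simple enough to sketch directly. The snippet below is a minimal illustration under stated assumptions: the function names are hypothetical, the ENAC window size of 5 is an arbitrary choice, the Ksnpf features are normalized by the number of pairs at each spacing (the source does not specify the normalization), and sequences are assumed to contain only A, C, G and U.

```python
from itertools import product

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "U": (0, 0, 0, 1)}
C2 = {"A": (0, 0), "C": (1, 1), "G": (1, 0), "U": (0, 1)}
PAIRS = ["".join(p) for p in product("ACGU", repeat=2)]  # the 16 nucleotide pairs

def one_hot(seq):
    """Four bits per nucleotide."""
    return [bit for base in seq for bit in ONE_HOT[base]]

def c2(seq):
    """Two bits per nucleotide, using the mapping given in the text."""
    return [bit for base in seq for bit in C2[base]]

def enac(seq, window=5):
    """Nucleotide composition (A, C, G, U frequencies) of every sliding window."""
    feats = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        feats.extend(chunk.count(base) / window for base in "ACGU")
    return feats

def ksnpf(seq, k_values=(0, 1, 2, 3, 4)):
    """Frequency of each nucleotide pair separated by k arbitrary nucleotides."""
    feats = []
    for k in k_values:
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = len(seq) - k - 1
        for i in range(n_pairs):
            counts[seq[i] + seq[i + k + 1]] += 1
        feats.extend(counts[p] / n_pairs if n_pairs > 0 else 0.0 for p in PAIRS)
    return feats

seq = "ACGUACGUAC"
hybrid = one_hot(seq) + c2(seq) + enac(seq) + ksnpf(seq)
```

Concatenating the per-method vectors, as in `hybrid`, is one simple way to form a combined traditional feature vector; how the embodiment fuses them with the RNAErnie features is not detailed in the text.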

In this embodiment, a multidimensional feature encoding approach is adopted for each RNA sequence: global context features are obtained through the RNAErnie pre-training model and combined with the six traditional encoding methods (One-hot, ENAC, C2, ND, TPCP, and Ksnpf), so that rich features of the RNA sequence are extracted from multiple perspectives, such as physicochemical properties and position specificity.

Step 206: inputting the encoded high-dimensional features into a deep neural network model for feature dimension reduction to obtain dimension-reduced features.

The generated high-dimensional features are input into the deep neural network model for feature dimension reduction. Through automatic learning and feature selection, the deep neural network retains the features with the greatest predictive power, thereby reducing redundant information and computational complexity.

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model further comprises a feature dimension reduction process, which specifically comprises: inputting the encoded high-dimensional features into the deep neural network model, and mapping the high-dimensional features from a high-dimensional space to a low-dimensional space through the multi-layer nonlinear mapping of the deep neural network model to obtain the dimension-reduced features.

In the model construction process of this embodiment, the hybrid features that fuse the RNAErnie pre-training model with the six traditional feature encoding methods are input into a deep neural network (DNN) for feature dimension reduction. Through the multi-layer nonlinear mapping of the DNN, the features are mapped from a high-dimensional space to a low-dimensional space, which effectively reduces computational complexity and storage requirements while retaining the main features that are vital to prediction. In addition, the DNN can capture complex interactions among different features and produce higher-level feature representations. This feature dimension reduction method not only avoids interference from redundant information but also improves the model's grasp of feature importance, thereby significantly improving prediction accuracy.
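The multi-layer nonlinear mapping can be pictured as a stack of dense layers with nonlinear activations whose widths shrink from layer to layer. The sketch below shows only the forward pass with untrained random weights; the layer sizes (64 to 16 to 4), the ReLU activation and all function names are assumptions for illustration, since the text does not specify the DNN architecture or its training procedure.

```python
import random

def dense(vec, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]

def relu(vec):
    return [max(0.0, v) for v in vec]

def make_layer(n_in, n_out, rng):
    """Randomly initialized layer (untrained, for shape illustration only)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def reduce_features(vec, layers):
    """Map a high-dimensional vector to a low-dimensional one, layer by layer."""
    for weights, biases in layers:
        vec = relu(dense(vec, weights, biases))
    return vec

rng = random.Random(0)
layers = [make_layer(64, 16, rng), make_layer(16, 4, rng)]  # hypothetical sizes
high_dim = [rng.random() for _ in range(64)]
low_dim = reduce_features(high_dim, layers)
```

In practice the weights would be learned from the labeled training set so that the low-dimensional output keeps the information most useful for classification.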

Step 208: inputting the dimension-reduced features into a soft voting integration model, and integrating the prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result.

The soft voting integration model is constructed from XGBoost, MLP, and CatBoost classifiers. The dimension-reduced features are passed into this soft voting integration model, and the prediction performance and stability of the model are improved by integrating the prediction results of the different classifiers.

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model may further comprise a process of predicting through the soft voting integration model, which specifically comprises: determining the prediction probabilities of all classifiers through the soft voting integration model; performing a weighted average calculation on the prediction probabilities to obtain the maximum value of the weighted average probability; and taking the maximum value as the RNA N4-acetylcytidine modification site prediction result.

Soft voting is an ensemble learning method that calculates the final prediction probability as a weighted average of the prediction probabilities of all the classifiers. The core formula of the soft voting model is as follows:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{n} w_i \, P_i(y = c \mid x)$$

where $\hat{y}$ represents the final prediction, $c$ represents the class, $n$ is the number of base models, $w_i$ is the weight of the $i$-th model, and $P_i(y = c \mid x)$ is the probability, predicted by the $i$-th model, that the sample belongs to class $c$. The final class prediction is based on the maximum value of the weighted average probability; that is, the class with the highest weighted average probability is selected as the final prediction result.
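The soft voting rule can be written in a few lines. The sketch below is illustrative only: the function name `soft_vote`, the equal default weights and the three probability vectors (standing in for the outputs of the XGBoost, MLP and CatBoost classifiers) are made-up values, not results from the embodiment.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-classifier class probabilities, then argmax.

    probabilities: one list of class probabilities per classifier.
    weights: one weight per classifier (equal weights if omitted).
    """
    n_models = len(probabilities)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged

# Hypothetical outputs [P(non-site), P(ac4C site)] from three classifiers.
probs = [[0.30, 0.70], [0.55, 0.45], [0.20, 0.80]]
label, avg = soft_vote(probs)
```

Here the second classifier alone would vote for the negative class, but the averaged probability of the positive class is higher, so the ensemble predicts class 1. Dividing by the total weight does not change the argmax; it only keeps the averaged values interpretable as probabilities.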

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model further comprises a process of evaluating the performance of the soft voting integration model, which specifically comprises: determining evaluation indexes, and evaluating the performance of the soft voting integration model by ten-fold cross-validation according to the evaluation indexes to obtain an evaluation result, wherein the evaluation indexes comprise sensitivity, specificity, accuracy, the Matthews correlation coefficient, and the area under the curve.

The performance of the model is comprehensively evaluated using ten-fold cross-validation and an independent test set. The evaluation indexes comprise sensitivity (SN), specificity (SP), accuracy (ACC), the Matthews correlation coefficient (MCC) and the area under the curve (AUC). This evaluation approach helps ensure that the model has good generalization performance and robust prediction capability.
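Ten-fold cross-validation partitions the training data into ten folds and uses each fold once for validation while training on the other nine. A minimal index-level sketch is given below; the function name `stratified_k_folds` is hypothetical, and keeping the class balance within each fold (stratification) is an assumption, since the text only states that ten-fold cross-validation is used.

```python
import random
from collections import defaultdict

def stratified_k_folds(labels, k=10, seed=0):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal each class round-robin
    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
        yield training, validation

# Toy labels (assumption): 50 positives and 50 negatives.
cv_labels = [1] * 50 + [0] * 50
splits = list(stratified_k_folds(cv_labels, k=10))
```

The evaluation indexes would be computed on each validation fold and then averaged over the ten folds.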

To evaluate the performance of the model, this embodiment uses five commonly used evaluation indexes: sensitivity (SN), specificity (SP), accuracy (ACC), the Matthews correlation coefficient (MCC) and the area under the curve (AUC). The first four are defined as:

$$\mathrm{SN} = \frac{TP}{TP + FN}$$

$$\mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FN and FP represent the numbers of true positives, true negatives, false negatives and false positives, respectively. SN is the proportion of positive samples that are correctly identified, SP is the proportion of negative samples that are correctly identified, ACC is the proportion of all samples that are correctly classified, and MCC measures the correlation between the true values and the predicted values, with a range of [-1, 1]. In addition, to compare the performance of different models comprehensively, the AUC is calculated as the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. The AUC ranges from 0 to 1, and a higher AUC indicates better predictive performance of the model. For all five indexes, higher values indicate better model performance.
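These indexes follow directly from the confusion-matrix counts. The sketch below uses made-up counts and scores purely for illustration; the function names are hypothetical, and the AUC is computed with the rank-based formulation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one), which equals the area under the ROC curve.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

def auc(scores, labels):
    """Rank-based AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, not results from the embodiment.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

With these toy inputs, SN is 0.90, SP is 0.80, ACC is 0.85, and three of the four positive-negative score pairs are ordered correctly, giving an AUC of 0.75.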

In one embodiment, the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model further comprises establishing a website for user interaction, which specifically comprises: displaying a user interaction interface, and acquiring an RNA sequence to be predicted through the user interaction interface; inputting the RNA sequence to be predicted into the soft voting integration model, and outputting an RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted; and displaying, in the user interaction interface, the RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted.

A user-friendly online platform with an intuitive interface can be established on the computer device, so that a user can conveniently input an RNA sequence and quickly obtain the predicted ac4C modification sites. The platform makes the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model more convenient to use and more widely accessible.

In one embodiment, as shown in FIG. 3, an application model framework of the RNA N4-acetylcytidine modification site prediction method based on the RNAErnie pre-training model is provided, which mainly comprises five parts: (A) data acquisition, (B) feature encoding, (C) feature dimension reduction, (D) model evaluation, and (E) website service.

Specifically, as shown in FIG. 3, an RNA sequence data set is first collected from a database and divided into a training set and a test set. The RNAErnie pre-training model and the six traditional feature encoding methods are then used to perform feature encoding on the RNA sequences to obtain high-dimensional features. The high-dimensional features are input into the deep neural network model for feature dimension reduction, and the dimension-reduced features are input into the soft voting integration model, which integrates the prediction results of the different classifiers to obtain the RNA N4-acetylcytidine modification site prediction result.

In one embodiment, to verify the superiority of the soft voting integration model for ac4C prediction provided in the present application, namely the Voting-ac4C model, it is compared with existing models under ten-fold cross-validation as follows:

The Voting-ac4C model was compared in depth with several typical existing models, including PACES, XG-ac4C, iRNA-ac4C, auto-ac4C and ac4C-AFL, and the comparison results are shown in the table:

From the ten-fold cross-validation results in the table, Voting-ac4C is superior to all existing models on several key performance indicators: sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC), and area under the curve (AUC). This indicates that Voting-ac4C not only significantly improves accuracy but also balances predictive power across classes while maintaining high sensitivity and specificity. In addition, Voting-ac4C combines a pre-trained large language model with several traditional feature encoding methods and thereby captures more comprehensive RNA sequence information, which achieves a breakthrough in performance and further verifies the effectiveness and reliability of Voting-ac4C as an ac4C modification site prediction tool.

Next, the experimental results of the Voting-ac4C model and the existing model on the independent dataset test were compared as follows:

The test results on the independent data set show that each performance index of the Voting-ac4C model is excellent, and the comparison results are shown in the following table:

As can be seen from the table, the sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC) and area under the curve (AUC) of the Voting-ac4C model reached 85.14%, 81.15%, 83.15%, 66.35% and 88.73%, respectively. These results show that the Voting-ac4C model has high accuracy, high sensitivity and high specificity in the task of predicting ac4C modification sites and can effectively distinguish positive from negative samples. Compared with the existing models, Voting-ac4C has clearly superior overall performance, particularly in accuracy and MCC, which shows that its predictions are balanced across classes. This further verifies the robustness and practicality of the Voting-ac4C model for ac4C modification site prediction.

Next, the performance of the RNAErnie pre-training model in combination with other encoding methods is compared as follows:

The model performance of RNAErnie combined with each conventional feature encoding method was first evaluated separately and the results are shown in the following table:

These results reveal the strengths and weaknesses of each single encoding method. Ksnpf and ENAC perform best, reaching accuracies (ACC) of 78.07% and 75.54%, respectively, but neither reaches a higher level of predictive performance on its own.

After evaluating the individual methods, the present embodiment explores the impact of combining feature encodings on model performance. When different encoding methods are combined step by step, the model performance improves significantly, as shown in the following table:

As shown in the above table, combining the Ksnpf and ENAC encodings raised the accuracy (ACC) to 80.34%, a significant improvement over either encoding used alone (78.07% and 75.54%). Adding the C2 encoding raised ACC further to 80.71%, indicating that more diverse features help capture more sequence information. With the One-hot, ND and TPCP encodings also added, the final combination reached an accuracy of 83.15%, clearly higher than RNAErnie combined with any single conventional encoding method (best result 78.07%). In addition, the Matthews correlation coefficient (MCC) increased from a best single-encoding value of 56.16% to 66.35%, and the AUC increased from 86.47% to 88.73%. These results show the clear advantage of feature combination: by fusing different encoding methods, the model makes better use of the multidimensional information in RNA sequences, which improves prediction performance and achieves a "1 + 1 > 2" effect.
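The fusion of encodings described above amounts to concatenating the feature vectors produced by each encoder. The following is a minimal sketch of that idea, not the implementation of the present application. It assumes the definitions of One-hot and ND (nucleotide density) commonly used in the literature, since they are not spelled out in this passage, and it omits ENAC, C2, TPCP, Ksnpf and the RNAErnie embedding. The function names `one_hot`, `nucleotide_density` and `combine_encodings` are hypothetical.

```python
ALPHABET = "ACGU"  # assumed RNA alphabet


def one_hot(seq):
    """One-hot encoding: 4 binary values per nucleotide (assumed definition)."""
    return [1.0 if base == letter else 0.0 for base in seq for letter in ALPHABET]


def nucleotide_density(seq):
    """ND: for position i, the frequency of that position's nucleotide
    within the prefix seq[0..i] (assumed definition)."""
    counts = {}
    density = []
    for position, base in enumerate(seq, start=1):
        counts[base] = counts.get(base, 0) + 1
        density.append(counts[base] / position)
    return density


def combine_encodings(seq, encoders):
    """Concatenate the outputs of several encoders into one feature vector."""
    features = []
    for encoder in encoders:
        features.extend(encoder(seq))
    return features


combined = combine_encodings("ACGU", [one_hot, nucleotide_density])
print(len(combined))  # 4 * 4 one-hot values + 4 density values
```

In a full pipeline the remaining encoders and the pre-trained embedding would be appended to the same list of encoders in the same way.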

In this embodiment, FIG. 4 shows that the model performance after multi-feature combination is significantly better than that of any single encoding method. This result demonstrates that the hybrid strategy provided in the present application, which combines the RNAErnie pre-training model with six traditional feature encoding methods, characterizes RNA sequences more comprehensively and thereby improves the accuracy and stability of prediction, reflecting the innovativeness and advantages of the present application.

Next, the soft voting integration model in the present application is compared with other machine learning classifiers as follows:

When constructing the RNA ac4C modification site prediction model, XGBoost, CatBoost and MLP are selected as the base classifiers of the soft voting ensemble learning model. To further verify the effectiveness of these classifiers, the performance of six different machine learning classifiers was compared. As shown in FIG. 5, XGBoost, CatBoost and MLP outperform the other classifiers on multiple indicators, such as sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC), and area under the curve (AUC). These three classifiers are therefore well suited to constructing the soft voting ensemble learning model, and they effectively improve the overall performance of the prediction model.

The soft voting integration model is compared with other ensemble learning methods as follows:

To further verify the stability and reliability of the model, several common ensemble learning methods were compared, including Blending, Stacking, Bagging, Hard Voting and Soft Voting. As shown in FIG. 6, Soft Voting performs better than the other methods on every evaluation index, demonstrating a significant advantage in improving the predictive performance of the model.

It should be understood that, although the steps in the above-described flowcharts are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of the steps is not strictly limited to this order, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 7, there is provided an RNA N4-acetylcytidine modification site prediction system based on the RNAErnie pre-training model, comprising a data set acquisition module 710, a feature encoding module 720, a feature dimension reduction module 730, and a result prediction module 740, wherein:

The data set acquisition module 710 is used for collecting an RNA sequence data set comprising positive and negative samples;

The feature encoding module 720 is used for inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multistage masking, capturing the context dependency relationship and extracting the global feature, and performing feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features;

The feature dimension reduction module 730 is configured to input the encoded high-dimension feature into the deep neural network model for feature dimension reduction, so as to obtain a feature after dimension reduction;

The result prediction module 740 is configured to input the features after dimension reduction into a soft voting integration model and to integrate the prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result, where the soft voting integration model is constructed from XGBoost, MLP and CatBoost classifiers.

As shown in FIG. 8, in one embodiment, the RNA N4-acetylcytidine modification site prediction system based on the RNAErnie pre-training model further comprises a model evaluation module 750 for determining evaluation indexes and evaluating the performance of the soft voting integration model by ten-fold cross-validation according to the evaluation indexes to obtain an evaluation result, wherein the evaluation indexes comprise sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve.
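The five evaluation indexes named above have standard definitions for binary classification. The sketch below computes them from confusion-matrix counts and from predicted scores. It illustrates the standard formulas and is not the evaluation code of the present application; the function names `binary_metrics` and `auc` and the example counts are hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity (recall on positives)
    sp = tn / (tn + fp)                      # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)    # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


def auc(labels, scores):
    """Area under the ROC curve as the fraction of positive/negative pairs
    ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


metrics = binary_metrics(tp=8, fn=2, tn=7, fp=3)  # hypothetical counts
print(metrics)
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Under ten-fold cross-validation these values would be computed on each held-out fold and then averaged.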

As shown in FIG. 8, in one embodiment, the RNA N4-acetylcytidine modification site prediction system based on RNAErnie pre-training model further comprises a user interaction module 760 for displaying a user interaction interface and obtaining an RNA sequence to be predicted through the user interaction interface, inputting the RNA sequence to be predicted into the soft voting integration model, outputting an RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted, and displaying the RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted in the user interaction interface.

In one embodiment, the data set acquisition module 710 is further configured to determine a data set division ratio and to perform stratified sampling on the RNA sequence data set based on the data set division ratio, to obtain a divided training data set and test data set.
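Stratified sampling splits each class separately so that the training and test sets keep the same ratio of positive to negative samples. A minimal standard-library sketch follows. The function name `stratified_split`, the 20% test ratio and the toy sequences are assumptions made for illustration and are not taken from the present application.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.2, seed=0):
    """Split samples into train/test while preserving the class proportions."""
    rng = random.Random(seed)
    indices_by_label = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in indices_by_label.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(samples[i], labels[i]) for i in train_idx]
    test = [(samples[i], labels[i]) for i in test_idx]
    return train, test


# Toy data: 10 positive and 10 negative placeholder sequences.
sequences = ["ACGU" * 3 for _ in range(20)]
classes = [1] * 10 + [0] * 10
train_set, test_set = stratified_split(sequences, classes, test_ratio=0.2)
print(len(train_set), len(test_set))
```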

In one embodiment, the RNAErnie pre-training model is built on a knowledge-integration-enhanced representation framework and combines Transformer layers with a multi-head self-attention mechanism. The feature encoding module 720 is further configured to input each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model, calculate, through each attention head of the multi-head self-attention mechanism, an attention score corresponding to each RNA sequence, concatenate the attention scores obtained, and map each RNA sequence to query, key and value matrices through linear transformation matrices.
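For reference, the multi-head self-attention computation is usually written as follows in the standard Transformer formulation. This passage does not give the formulas, so the notation is an assumption: X is the matrix of token embeddings of one RNA sequence, W_i^Q, W_i^K and W_i^V are the linear transformation matrices of head i, d_k is the key dimension, h is the number of heads, and W^O is the output projection.

```latex
\begin{aligned}
Q_i &= X W_i^{Q}, \qquad K_i = X W_i^{K}, \qquad V_i = X W_i^{V} \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\end{aligned}
```

Each head attends to the sequence independently, and the concatenation step corresponds to connecting the attention outputs of all heads.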

In one embodiment, the feature encoding module 720 is further configured to use the RNAErnie pre-training model, which adopts motif masking, subsequence masking and motif random masking strategies and incorporates coarse-grained RNA types as lexical markers; the RNAErnie pre-training model appends the lexical markers to the end of each RNA sequence to enhance the RNA sequence representation.

In one embodiment, the feature dimension reduction module 730 is further configured to input the encoded high-dimensional feature into a deep neural network model, and map the high-dimensional feature from the high-dimensional space to the low-dimensional space through a multi-layer nonlinear mapping of the deep neural network model, to obtain the feature after dimension reduction.
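The mapping from a high-dimensional to a low-dimensional space through stacked nonlinear layers can be sketched as a small feed-forward network. The layer sizes (64 to 32 to 8), the ReLU activation and the random, untrained weights below are assumptions for illustration only; the network of the present application would be trained and its architecture is not specified in this passage. The names `make_layer` and `forward` are hypothetical.

```python
import random


def make_layer(n_in, n_out, rng):
    """Create one dense layer with random (untrained) weights and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(features, layers):
    """Apply each dense layer followed by a ReLU nonlinearity."""
    for weights, biases in layers:
        features = [
            max(0.0, sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(weights, biases)
        ]
    return features


rng = random.Random(0)
layers = [make_layer(64, 32, rng), make_layer(32, 8, rng)]  # assumed sizes
high_dim = [rng.random() for _ in range(64)]                # placeholder input
low_dim = forward(high_dim, layers)
print(len(high_dim), "->", len(low_dim))
```

In practice the output of an intermediate layer of the trained network would serve as the reduced feature vector passed to the classifiers.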

In one embodiment, the result prediction module 740 is further configured to determine the prediction probabilities of all classifiers through the soft voting integration model, perform a weighted average over the prediction probabilities, determine the maximum of the weighted average probabilities, and take the class corresponding to that maximum as the RNA N4-acetylcytidine modification site prediction result.
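The soft voting rule described above can be sketched in a few lines. This is an illustrative sketch rather than the implementation of the present application: the function name `soft_vote`, the equal default weights and the example probabilities are assumptions, and the three probability vectors merely stand in for the outputs of the XGBoost, CatBoost and MLP base classifiers.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-classifier class probabilities.

    probabilities: one list per classifier, e.g. [p_negative, p_positive].
    Returns the index of the class with the highest averaged probability
    together with the averaged probabilities.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)  # assumed equal weights
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical outputs of three base classifiers for one sequence.
label, averaged = soft_vote([[0.45, 0.55], [0.45, 0.55], [0.90, 0.10]])
print(label, averaged)
```

In this example two of the three classifiers lean slightly towards the positive class, so hard voting would predict class 1, whereas soft voting predicts class 0 because the third classifier is far more confident. This sensitivity to confidence is the practical difference between the two voting schemes.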

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, may be keys, a trackball or a touch pad arranged on the housing of the computer device, or may be an external keyboard, touch pad, mouse or the like.

It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, and the processor, when executing the computer program, implementing the steps of a method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model.

In one embodiment, a computer-readable storage medium is provided having stored thereon a computer program which, when executed by a processor, performs the steps of a method for predicting RNA N4-acetylcytidine modification sites based on the RNAErnie pre-training model.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this description.

The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for predicting an RNA N4-acetylcytidine modification site based on RNAErnie pre-training model, the method comprising:

Inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, capturing context dependencies and extracting global features, and performing feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features, wherein the six feature encoding methods are the One-hot, ENAC, C2, ND, TPCP and Ksnpf feature encoding methods, respectively;

Inputting the feature after dimension reduction into a soft voting integration model, and integrating prediction results of different classifiers through the soft voting integration model to obtain an RNA N4-acetylcytidine modification site prediction result;

2. The method for predicting RNA N4-acetylcytidine modification sites based on RNAErnie pre-training models according to claim 1, wherein the method further comprises:

3. The method for predicting RNA N4-acetylcytidine modification sites based on RNAErnie pre-training models according to claim 1, wherein the method further comprises:

Inputting the RNA sequence to be predicted into the soft voting integrated model, and outputting an RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted;

and displaying the RNA N4-acetylcytidine modification site prediction result corresponding to the RNA sequence to be predicted in the user interaction interface.

4. The method for predicting RNA N4-acetylcytidine modification sites based on RNAErnie pre-training models according to claim 1, wherein after collecting the RNA sequence dataset, the method further comprises:

determining a data set dividing ratio;

5. The method for predicting an RNA N4-acetylcytidine modification site based on RNAErnie pre-training models according to claim 1, wherein the RNAErnie pre-training model is built on a knowledge-integration-enhanced representation framework and combines Transformer layers with a multi-head self-attention mechanism, wherein:

6. The method for predicting RNA N4-acetylcytidine modification sites based on RNAErnie pre-training models according to claim 1, wherein each RNA sequence in the RNA sequence dataset is input into RNAErnie pre-training models for multistage masking, respectively, comprising:

7. The method for predicting the RNA N4-acetylcytidine modification site based on RNAErnie pre-training models according to claim 1, wherein the encoded high-dimensional features are input into a deep neural network model for feature dimension reduction, and the feature after dimension reduction is obtained, comprising:

8. The method for predicting the RNA N4-acetylcytidine modification site based on RNAErnie pre-training models according to claim 1, wherein the method for integrating the prediction results of different classifiers through the soft voting integration model to obtain the RNA N4-acetylcytidine modification site prediction results comprises the following steps:

The maximum value is taken as a predicted result of the RNA N4-acetylcytidine modification site.

9. An RNA N4-acetylcytidine modification site prediction system based on RNAErnie pre-training model, the system comprising:

The feature encoding module is used for inputting each RNA sequence in the RNA sequence data set into the RNAErnie pre-training model for multi-level masking, capturing context dependencies and extracting global features, and performing feature encoding on each RNA sequence by combining the RNAErnie pre-training model with six traditional feature encoding methods to obtain encoded high-dimensional features, wherein the six feature encoding methods are the One-hot, ENAC, C2, ND, TPCP and Ksnpf feature encoding methods, respectively;

The feature dimension reduction module is used for inputting the high-dimension features into the deep neural network model to perform feature dimension reduction to obtain dimension reduced features;

The result prediction module is used for inputting the feature subjected to dimension reduction into a soft voting integration model, integrating prediction results of different classifiers through the soft voting integration model, and obtaining an RNA N4-acetylcytidine modification site prediction result;

10. The system for predicting the RNA N4-acetylcytidine modification site based on RNAErnie pre-training models according to claim 9, further comprising a model evaluation module for determining evaluation indexes and evaluating the performance of the soft voting integration model by ten-fold cross-validation according to the evaluation indexes to obtain an evaluation result, wherein the evaluation indexes comprise sensitivity, specificity, accuracy, Matthews correlation coefficient and area under the curve.