mRNABERT: advancing mRNA sequence design with a universal language model and comprehensive dataset - Nature Communications

In light of these observations, we developed mRNABERT, a robust language model pre-trained on a diverse, high-quality dataset of over 18 million non-redundant mRNA sequences (curated as described in the Methods section). To overcome the limitations of previous models, we incorporated several advanced techniques. Built on the well-established BERT architecture [25], mRNABERT replaces traditional positional embeddings with Attention with Linear Biases (ALiBi) [48] to handle long input sequences and integrates Flash Attention [49] to improve computational efficiency. Furthermore, mRNABERT features a dual tokenization strategy, treating individual nucleotides as tokens in the UTRs and codons as tokens in the coding sequence (CDS). This tokenization not only matches the biological structure of mRNA but also lays a strong foundation for a wide range of downstream tasks (Fig. 1A). Additionally, we introduced a customized contrastive learning scheme to align mRNA and protein sequences in latent space (Fig. 1B), allowing mRNABERT to improve predictions of protein function and mRNA-protein interactions. By capturing the complex relationships between genetic and protein sequences, this scheme deepens the model's grasp of biological processes and broadens its range of applications.

To fully leverage the vast integrated dataset of full-length mRNA sequences, mRNABERT introduces a novel dual tokenization scheme for encoding entire sequences. Tokenization, a critical first step in language modeling, determines the kinds of semantic information a model can capture. RNA sequences are composed of four nucleotide bases, and traditional language models typically use character-based tokenizers that encode each nucleotide as an independent token, letting attention weights capture nucleotide interactions within a sequence. When encoding full-length mRNA sequences, however, the maximum token input length compromises such a model's representation capacity. Given the triplet nature of mRNA codons, CDS-based models instead treat each codon as an independent token, which discards individual nucleotide information entirely; consequently, these models are suitable only for coding sequences. To address this, we segment each region of the mRNA at the granularity appropriate to it, refining local features at multiple granularities while integrating global features of the entire sequence, as sketched below. This yields a comprehensive embedding suitable for a wide range of downstream tasks.
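To make the scheme concrete, here is a minimal sketch of how such a dual tokenizer might segment a full-length mRNA. The function name, special tokens, and pre-split region boundaries are illustrative assumptions, not the authors' implementation:

```python
# Illustrative dual tokenizer: per-nucleotide tokens for the UTRs,
# per-codon tokens for the CDS. Names and special tokens are assumptions.
def dual_tokenize(utr5: str, cds: str, utr3: str) -> list:
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    utr5_tokens = list(utr5)                                    # one token per nucleotide
    cds_tokens = [cds[i:i + 3] for i in range(0, len(cds), 3)]  # one token per codon
    utr3_tokens = list(utr3)                                    # one token per nucleotide
    return ["[CLS]"] + utr5_tokens + cds_tokens + utr3_tokens + ["[SEP]"]

print(dual_tokenize("GGCA", "AUGGCUUAA", "UUAC"))
# ['[CLS]', 'G', 'G', 'C', 'A', 'AUG', 'GCU', 'UAA', 'U', 'U', 'A', 'C', '[SEP]']
```

Splitting the CDS into triplets preserves codon-level semantics while the UTRs retain full nucleotide resolution, which also shortens the token sequence relative to a purely nucleotide-level encoding.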

Regarding the model architecture, mRNABERT is built on 12 bidirectional Transformer encoder blocks. To overcome the input length limitations of existing models, we introduced ALiBi, an alternative method for encoding positional information. By adding linear biases directly to the attention scores, ALiBi enhances the model's ability to handle long sequences and improves overall performance. Additionally, we used IO-aware Flash Attention, which computes exact standard attention in a more time- and memory-efficient way, thereby accelerating mRNABERT's training.
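For intuition, the sketch below (assuming PyTorch and a symmetric-distance variant suited to a bidirectional encoder) shows how ALiBi injects position information as a fixed linear penalty on the attention logits; the slope schedule follows the geometric sequence proposed in the original ALiBi paper:

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Fixed, head-specific linear penalty on attention scores, proportional
    to query-key distance; replaces learned positional embeddings."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()        # |i - j|, symmetric for an encoder
    return -slopes[:, None, None] * dist[None, :, :]  # shape: (heads, seq, seq)

# Usage inside attention (sketch): add the bias to the raw logits before softmax.
# scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5 + alibi_bias(n_heads, seq_len)
```

Because the penalty is a fixed function of distance rather than a learned embedding table, it extrapolates naturally to sequences longer than those seen in training.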

Considering the tight functional interplay between mRNA and protein sequences, we further incorporated contrastive learning to align codon and amino acid sequences after the masked language modeling (MLM) phase, a staple of BERT training. This step enriches the model's understanding of the underlying biology, and mRNABERT's performance improved markedly once contrastive learning was applied. Further details on the model architecture and training process are provided in the Methods section.
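The paper's contrastive scheme is customized; as a hedged illustration, a common formulation for aligning two modalities is a symmetric InfoNCE (CLIP-style) objective over pooled mRNA and protein embeddings, sketched here under that assumption:

```python
import torch
import torch.nn.functional as F

def alignment_loss(mrna_emb: torch.Tensor, prot_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: each mRNA embedding is pulled toward the embedding
    of the protein it encodes; other pairs in the batch act as negatives."""
    mrna = F.normalize(mrna_emb, dim=-1)
    prot = F.normalize(prot_emb, dim=-1)
    logits = mrna @ prot.T / temperature    # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Random embeddings stand in for pooled codon / amino-acid representations:
print(alignment_loss(torch.randn(8, 256), torch.randn(8, 256)))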

We conducted a comparative analysis of mRNABERT against leading models on a variety of tasks. For the eight 5'UTR ribosome load prediction tasks, the neural network baselines were Optimus, FramePool, and MTtrans, together with the state-of-the-art UTR-LM. In the six CDS-related prediction tasks, mRNABERT was evaluated against methods such as Codon2vec, TextCNN, and the pre-trained CDS model CodonBERT. For the 3'UTR tasks, we focused on predicting binding sites of 22 RBPs using three baselines: the CNN-based iDeepE and DeepCLIP, and the language model BERT-RBP. In the m6A site prediction task, we collected data from nine different cell lines and compared mRNABERT against machine learning methods such as SRAMP and WHISTLE. Both 3'UTR tasks also included the purpose-built 3UTRBERT, and we further evaluated all of the aforementioned ncRNA models on these tasks. Protein-related tasks include melting point and solubility prediction, as well as transcript abundance prediction across seven species; here, mRNABERT was benchmarked against high-performing protein models, including ESM2, ProtTrans, Ankh, and the pre-trained codon language model CaLM. Finally, we benchmarked all available RNA-related pre-trained models on eight full-length mRNA property prediction tasks. This comprehensive comparison demonstrates mRNABERT's exceptional performance across all tasks.

To illustrate that mRNABERT can learn more biological knowledge from sequences than most baseline models, we performed an analysis of its embeddings, characterizing how it extracts functional and evolutionary knowledge from biological sequences.

The first aspect we investigated was the model's vocabulary representation, focusing on its ability to discern fundamental principles of the genetic code. Ideally, an mRNA model should identify similarities among synonymous codons, but this is challenging because the pre-training data are unannotated and the model represents biological sequences as tokens carrying no explicit nucleotide or codon information. We also conducted ablation experiments to validate the effectiveness of contrastive learning. As illustrated in Fig. 2A, B, the mRNABERT model without contrastive learning exhibited disorganized clustering at the amino acid level. However, after applying the same t-SNE dimensionality reduction to mRNABERT's vocabulary embeddings and projecting them onto a two-dimensional space, we observed that synonymous codons encoding the same amino acid tended to cluster together (Fig. 2C). This clustering suggests that the model successfully learned the genetic code from the extensive data it was trained on. Furthermore, coloring amino acids by their distinct chemical properties (Fig. 2D) showed that the model effectively groups amino acids with similar properties, with the ARI increasing from 0.166 to 0.498 and the FMI from 0.325 to 0.596 (Methods). This clearly indicates that contrastive learning enables the model to capture additional semantic information about amino acids.
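A plausible reconstruction of this analysis is sketched below with scikit-learn; the random arrays are placeholders for the model's 61 sense-codon token embeddings and their amino acid labels, and the k-means step used to score ARI/FMI is our assumption about the clustering procedure:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

rng = np.random.default_rng(0)
codon_embeddings = rng.normal(size=(61, 768))     # placeholder: 61 sense-codon embeddings
amino_acid_labels = rng.integers(0, 20, size=61)  # placeholder: encoded amino acid per codon

# Project the vocabulary embeddings to 2D, cluster, and compare clusters
# against the true amino acid labels with ARI and FMI.
coords = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(codon_embeddings)
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(coords)
print("ARI:", adjusted_rand_score(amino_acid_labels, clusters))
print("FMI:", fowlkes_mallows_score(amino_acid_labels, clusters))
```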

Next, we evaluated mRNABERT's ability to classify various types of RNA data. As depicted in Fig. 2E, mRNABERT successfully discriminated between distinct mRNA regions, including the 5'UTR and 3'UTR sequences. Impressively, it also showed an ability to differentiate between long non-coding RNA (lncRNA) and mRNA sequences, despite not being explicitly trained on ncRNA data. This highlights mRNABERT's capacity to encapsulate sufficient biological characteristics, enabling it not only to differentiate among various mRNA regions but also to distinguish mRNA from other RNA sequence types. By extracting profound semantic information from the entire mRNA sequence, it identifies sequence similarities that extend beyond mere length.

Subsequently, our analysis concentrated on the embeddings of sequences from six different species, chosen to represent a broad range of biological classifications across diverse holdout datasets. The scatter plot in Fig. 2F reveals tight clustering of homologous sequences, with clear-cut boundaries delineating the species. This highlights mRNABERT's ability to recognize and retain evolutionary information embedded within biological sequences, emphasizing its capacity to capture biological detail at the sequence level.

The 5'UTR sequence plays a critical role in controlling translation efficiency. Ribosome load, defined as the number of ribosomes bound to an mRNA molecule at a given moment, is a pivotal marker of protein synthesis efficiency. Accurately predicting ribosome load from 5'UTR sequences is therefore paramount for optimizing mRNA designs to maximize protein expression, particularly when creating new sequences beyond existing 5'UTR templates.

To address this challenge, we leveraged a benchmark dataset from previous studies that used massively parallel reporter assays (MPRA) to curate a library of 280,000 5'UTR sequences with their respective ribosome loads. We fine-tuned mRNABERT to predict ribosome load from 5'UTR sequences (detailed in the Methods section). Alongside mRNABERT, we benchmarked several machine-learning models tailored to this task, including Optimus, FramePool, and MTtrans, as well as pre-trained language models such as UTR-LM, RNABERT, and RNA-FM. mRNABERT's performance was evaluated against these benchmark methods across eight synthetic libraries.
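A minimal sketch of such a fine-tuning setup follows, assuming PyTorch and an encoder that returns per-token hidden states; the head architecture and mean-pooling choice are illustrative, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class RibosomeLoadRegressor(nn.Module):
    """Illustrative fine-tuning head: pool the encoder's token representations
    and regress a scalar ribosome load for each 5'UTR."""
    def __init__(self, encoder: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.encoder = encoder  # pre-trained mRNA encoder (assumed interface)
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(token_ids)      # (batch, seq_len, hidden_dim) assumed
        pooled = hidden.mean(dim=1)           # mean-pool over tokens
        return self.head(pooled).squeeze(-1)  # (batch,) predicted ribosome load

# Trained end to end with a standard regression loss, e.g.
# loss = nn.functional.mse_loss(model(batch_ids), measured_loads)
```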

The results, shown in Fig. 3 and Supplementary Tables 4 and 5, highlight the exceptional performance of mRNABERT, which was comparable to the top-performing specialized model, UTR-LM. Notably, on the two largest MPRA datasets (both fixed-length random 5'UTR libraries), our model achieved state-of-the-art results (Spearman R = 0.962 and 0.924). Across the remaining six datasets, our model led on three tasks (the Ψ, m1Ψ, and m5C-U libraries), matching UTR-LM in the number of tasks with top performance (each best on 4 of 8).

We collected multiple datasets to evaluate the performance of our model on CDS prediction tasks. These datasets include the mRFP Expression, Fungal Expression and Escherichia coli Proteins datasets, comprising thousands of data points on protein expression in fungi and E. coli; the mRNA Stability and SARS-CoV-2 Vaccine Degradation datasets, containing mRNA stability data; and the Tc-Riboswitches dataset, highlighting tetracycline riboswitch dimer sequences. These datasets cover various downstream tasks related to mRNA translation, stability, and regulation, incorporating data ranging from newly published recombinant proteins to bio-computation for SARS-CoV-2 vaccine design (Supplementary Table 6 contains detailed information about the datasets).

After fine-tuning mRNABERT on these datasets, we compared its performance with several state-of-the-art CDS prediction methods, including TF-IDF, TextCNN, Codon2vec, RNABERT, RNA-FM, and CodonBERT. mRNABERT outperformed or matched all other methods across all six CDS-related prediction tasks, with especially strong performance on the SARS-CoV-2 Vaccine Degradation dataset (Table 1).

Furthermore, our analysis revealed that codon-based models such as CodonBERT excel in protein expression tasks but exhibit subpar performance in stability-related tasks. This discrepancy may be attributed to the pivotal role codons play in protein expression, whereas mRNA stability is closely tied to its secondary structure. Notably, the performance of codon-based models declined in datasets where the local and global secondary structure patterns of RNA sequences are crucial, such as the SARS-CoV-2 vaccine degradation and Tc-riboswitch datasets. In contrast, mRNABERT effectively integrates nucleotide and codon information, encoding the structurally relevant 5'UTR and 3'UTR regions. Consequently, it demonstrates superior performance in tasks where CodonBERT struggles, as it can learn co-evolutionary and structural characteristics from millions of mRNA sequences. This capability aids in designing highly expressive and stable mRNA sequences.

RNA-binding proteins (RBPs) bind specifically to RNA molecules, and this binding depends on both RNA sequence and spatial structure. We downloaded and processed protein-RNA crosslinking sites for 22 RBPs and fine-tuned mRNABERT to predict RBP binding sites from these experimentally determined data. We benchmarked our model against several computational methods: the neural network models iDeepE, DeepCLIP, RPI-Net, GraphProt2, and BERT-RBP; the pre-trained RNA models RNABERT and RNA-FM; and the previously best model designed for 3'UTR tasks, 3UTRBERT.

To assess the effectiveness of each model, we employed five-fold cross-validation and evaluated predictions using three metrics: accuracy (ACC), F1-score, and Matthews correlation coefficient (MCC); metric definitions are given in Supplementary Table 7. Across all 22 RBPs, mRNABERT demonstrated superior performance with an average ACC of 0.786, F1-score of 0.751, and MCC of 0.501, comparable to the best specialized model, 3UTRBERT, with an average ACC of 0.785, F1-score of 0.751, and MCC of 0.503. Remarkably, mRNABERT outperformed all other methods on 13 of the 22 RBPs and exceeded 3UTRBERT's performance on 9 RBPs. Apart from 3UTRBERT, mRNABERT significantly outperformed all other models; the next best, iDeepE, achieved an ACC of 0.758, an F1-score of 0.565, and an MCC of 0.413, on average 20% lower than mRNABERT (Fig. 4A and Supplementary Table 8). It is worth noting that BERT-RBP lagged due to the lack of pre-training, while other deep learning methods underperformed due to insufficient model capacity. These comparative results suggest that mRNABERT is a highly effective method for accurately identifying RBP binding sites in the 3'UTR.
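For reference, the evaluation protocol can be reproduced generically as below; the logistic-regression probe and random features are placeholders for the fine-tuned model and the real RBP data, while the metric calls match the three metrics reported here:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))       # placeholder per-sequence features/embeddings
y = rng.integers(0, 2, size=200)     # placeholder bound/unbound labels for one RBP

accs, f1s, mccs = [], [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pred = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred))
    mccs.append(matthews_corrcoef(y[test_idx], pred))
print(f"ACC={np.mean(accs):.3f}  F1={np.mean(f1s):.3f}  MCC={np.mean(mccs):.3f}")
```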

N6-methyladenosine (m6A) is the most common covalent modification in cells, involved in numerous critical developmental processes and human diseases. We downloaded experimentally determined m6A modification sites from the m6A-Atlas database and fine-tuned mRNABERT to predict potential m6A modification sites (see Methods for details).

We conducted a comparative analysis of mRNABERT's predictive performance with various models found in the literature, such as the most effective model 3UTRBERT, as well as different machine learning-based methods (SRAMP, WHISTLE, iMRM) and deep learning-based methods (DeepM6ASeq). The results displayed in Fig. 4B and Supplementary Table 9 indicated that mRNABERT achieved the second-best performance consistently across all nine cell lines, closely trailing the leading 3UTRBERT model while surpassing all other models. These findings demonstrate that mRNABERT possesses the ability to capture and utilize structural and functional information from the 3'UTR, exhibiting comparable performance to models extensively pre-trained exclusively on 3'UTR data.

RNA splicing is a fundamental regulatory mechanism in eukaryotic gene expression, orchestrating the precise removal of non-coding intronic sequences from precursor mRNAs (pre-mRNAs) and the ligation of coding exons to generate mature transcripts. This process critically depends on the accurate recognition of splice sites that demarcate exon-intron boundaries. At the 5' end of introns, donor sites initiate splicing, while acceptor sites at the 3' termini facilitate exon ligation.

Accurate identification of these splice sites is a critical prerequisite for determining gene architecture and transcript isoforms. Computational approaches typically frame this challenge as a sequence-based binary classification task in which models discriminate authentic splice signals from decoy sequences within pre-mRNA molecules. To this end, we used a widely adopted dataset of positive and negative splice site sequences, including donor and acceptor site data from four distinct species, and fine-tuned all RNA baseline models with the same dataset and testing protocol. mRNABERT exhibited the second-highest overall performance, outperformed only by ERNIE-RNA and surpassing both RiNALMo and UNI-RNA (Supplementary Table 10).

Alternative polyadenylation (APA) is a widespread post-transcriptional regulatory mechanism that diversifies transcriptomes through selective 3'UTR processing, thereby generating mRNA isoforms with distinct stability, localization, and protein-coding potential. This dynamic process fine-tunes gene expression networks and is indispensable for cellular differentiation, stress responses, and developmental patterning.

To systematically quantify APA dynamics, we integrated isoform-level predictions derived from the BEACON dataset into our analytical framework. Our approach specifically models the relative usage of proximal versus distal polyadenylation sites (PAS) within annotated 3'UTR regions, enabling precise resolution of APA-mediated regulatory outcomes. In this task, mRNABERT exhibited a significant advantage over all other RNA baseline models (Supplementary Table 11).

mRNABERT's superior performance on these tasks provides compelling evidence of its understanding of post-transcriptional mRNA processing, significantly expanding its analytical reach within the broader landscape of mRNA research.

We next evaluated mRNABERT on protein-related tasks, noting that codon-based language models have previously shown superior results on certain amino acid sequence annotation tasks. Specifically, we assessed the prediction of protein melting points and solubility, and we additionally compiled transcript abundance data from seven organisms to evaluate the model on key codon usage tasks. All datasets were mapped back to the original codon sequences, with further details provided in the Methods.

We fed amino acid sequences into advanced protein language models (pLMs) such as ESM2, ProtTrans, and the Ankh series, and fed the corresponding codon sequences into the mRNA models CaLM and mRNABERT. We also tested mRNABERT without contrastive learning to gauge the impact of amino acid semantic integration. The resulting embeddings were then used as input to the downstream task model, with performance evaluated via five-fold cross-validation (see Methods for details).
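A sketch of this frozen-embedding protocol is shown below; the random features stand in for pooled model embeddings, the ridge regressor stands in for the downstream task model, and the R² scorer is a placeholder for the correlation metrics reported in the paper:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))  # stand-in for pooled pLM / mRNA-model embeddings
targets = rng.normal(size=500)            # stand-in for melting points, solubility, etc.

# Five-fold cross-validation of a lightweight downstream model on frozen embeddings.
scores = cross_val_score(Ridge(alpha=1.0), embeddings, targets, cv=5, scoring="r2")
print("mean score:", scores.mean())
```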

Figure 5 and the Supplementary Information show that mRNABERT with contrastive learning improved significantly across all tasks. In melting point prediction (Fig. 5A), contrastive learning raised mRNABERT's R from 0.60 to 0.77, slightly below CaLM's 0.78 but above all other large-scale protein models (the best being ProtT5-XL with R = 0.73). In solubility prediction (Fig. 5B), mRNABERT achieved an R of 0.63, surpassing both its pre-contrastive performance and CaLM's 0.61. Furthermore, mRNABERT's performance on this task was comparable to most protein models, falling just behind larger-scale models such as ProtT5-XL and Ankh-large (R = 0.66).

Additionally, in transcript abundance prediction across seven species, mRNABERT outperformed CaLM in five species (all except E. coli and Haloferax volcanii). Remarkably, in several species mRNABERT outperformed all other models. For Homo sapiens, mRNABERT achieved an R of 0.38, clearly higher than CaLM's 0.35 and above all other protein models (highest R = 0.36). For Pichia pastoris and Saccharomyces cerevisiae, mRNABERT was likewise substantially superior to all other models, with best R values of 0.56 and 0.53, versus 0.53 and 0.52 for the best protein models (Fig. 5C).

The success of the CaLM model underscores the potential of codon-based pre-training to enhance the quality of protein models, and our mRNA model exhibited superior performance on several protein-related tasks. Based on these results and the ablation studies, integrating amino acid information with coding sequences emerges as a cost-effective way to substantially enhance overall model performance. This highlights the potential of leveraging extensive biological data to strengthen machine learning models, addressing their limitations and broadening their applicability.

Redesigning complete mRNA sequences to maximize their stability and expression can significantly improve the overall performance of therapeutic mRNA. However, designing such sequences faces challenges due to a limited understanding of how mRNA sequences and structures affect their expression and stability in solution and cells. Therefore, accurately predicting the structural and functional properties of complete mRNA will aid in understanding mRNA design rules, greatly advancing mRNA vaccine development.

Rapidly synthesizing large quantities of full-length mRNAs with different UTRs and CDSs is challenging, making direct comparison of their stability and expression through high-throughput experiments impossible. To address this, we compiled a dataset of hundreds of reporter gene constructs spanning a wide range of UTR and CDS sequences, yielding 233 usable mRNA sequences with 112 distinct 5' and/or 3'UTRs and 121 CDSs. The dataset included labels for translation efficiency in four cellular contexts and two stability-related properties that directly affect protein expression levels. To further explore the potential of mRNA models, we fine-tuned mRNABERT on these data and evaluated its performance on these real-world mRNA tasks. We also assessed all currently available RNA baseline models, including UTR-LM for 5'UTRs, the codon-based CaLM and mRNA-FM, 3UTRBERT for 3'UTRs, and the RNA pre-trained models RNABERT, RNA-FM, RNA-MSM, ERNIE-RNA, RNAErnie, and RiNALMo.

The results in Fig. 6 and Supplementary Table 12 show that mRNABERT significantly outperformed the other models across all tasks. Models pre-trained on ncRNA data struggled to generalize to full-length mRNA, and models excelling on specific mRNA region tasks performed poorly on complete mRNA tasks. This discrepancy likely arises because previous models used nucleotide-based tokenizers constrained by maximum input lengths, causing truncation and information loss on full-length mRNAs, while codon-based tokenizers misinterpret segments of non-triplet regions, confusing the extracted information. Our model instead adopts a dual tokenizer for the UTR and CDS regions and incorporates ALiBi to let a BERT architecture accept longer input sequences, improving its practical applicability. Moreover, previous mRNA models were trained and evaluated on specific fragments, limiting their efficacy on full-length mRNA tasks, and ncRNA models focus primarily on RNA structure prediction, making it difficult for them to surpass mRNA-trained models.

To rigorously evaluate mRNABERT's predictive capabilities on ultra-long mRNA sequences, we conducted additional benchmarks predicting the translation efficiency of full-length mRNAs in mammalian cells. The analysis leveraged a comprehensive dataset derived from thousands of ribosome profiling experiments paired with matched RNA-seq data across more than 140 human and mouse cell types. Notably, in the human dataset (mean length 4040 nt), 94.9% of sequences exceeded 1024 nt and 82.2% exceeded 1022 tokens after encoding; the mouse dataset (mean length 3645 nt) showed comparable proportions (94.6% and 80.8%, respectively). These lengths substantially exceed the maximum input capacities of existing RNA models (typically limited to 1024 nt) and of our training dataset (Supplementary Table 13). Crucially, however, ALiBi enables mRNABERT to handle sequences longer than 1022 tokens; sequences exceeding the model's max_length parameter were truncated to ensure computational feasibility.

To assess generalization to sequences exceeding the training length, we evaluated mRNABERT with maximum sequence lengths of 1022, 2044, and 3066 tokens. mRNABERT consistently outperformed all existing RNA models, achieving a mean R² of 0.66 across cell types (Table 2), a 1.6- to 10.4-fold improvement over previous RNA models, whose best R² was 0.42 (range 0.06-0.42). Moreover, the performance gains observed with increasing input length indicate that mRNABERT remains robust and applicable on longer sequences. This underscores the benefits of our design: dual tokenization captures comprehensive mRNA information, while the ALiBi mechanism enables generalization to extended sequence lengths, giving mRNABERT a clear advantage in predicting the properties of longer mRNA sequences.
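Schematically, this length-generalization check can be expressed as below; `predict_translation_efficiency` is a hypothetical interface for the fine-tuned model, not a released API, and the three truncation settings are the ones evaluated above:

```python
from sklearn.metrics import r2_score

def r2_at_max_length(model, token_seqs, labels, max_len):
    """Truncate each tokenized mRNA to max_len tokens, predict translation
    efficiency with the (assumed) model interface, and score the fit with R^2."""
    preds = [model.predict_translation_efficiency(seq[:max_len]) for seq in token_seqs]
    return r2_score(labels, preds)

# for max_len in (1022, 2044, 3066):  # the three settings evaluated
#     print(max_len, r2_at_max_length(mrnabert, test_tokens, test_te, max_len))
```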

Overall, the success of mRNABERT in these challenging tasks fully illustrates the efficacy of our model design strategy and its tremendous potential in real-world application scenarios.
