Ijraset Journal For Research in Applied Science and Engineering Technology
Authors: Jyothi G C, Kruthika B M, Sushma K M, Apoorva B R, Vijayalaxmi
DOI Link: https://doi.org/10.22214/ijraset.2022.45866
The mRNA molecules expressed in cow’s milk are important molecular biomarkers for different physiological and pathological conditions in cattle. Predicting the quantity in which a specific mRNA type is expressed in cow’s milk is a challenging theoretical task. The current study presents for the first time several different Machine Learning models that predict mRNA expression from mRNA secondary structure fragments.
I. INTRODUCTION
The mRNA expression in cow’s milk is an important biomarker of cattle condition. The current study proposes a method to predict low or high mRNA expression levels using mRNA secondary structure fragments and Machine Learning classifiers. The terms “classifier” and “model” are synonymous in some contexts; however, “classifier” sometimes refers to the learning algorithm that learns the model from the training data.
Model: In the machine learning field, the terms hypothesis and model are often used interchangeably. In other sciences they can have different meanings: the hypothesis is the scientist’s “educated guess”, and the model is the manifestation of this guess that can be used to test the hypothesis.
Classifier: A classifier is a special case of a hypothesis (nowadays often learned by a machine learning algorithm). It is a discrete-valued function that assigns categorical class labels to particular data points. In an email classification example, the classifier could be a hypothesis for labeling emails as spam or non-spam. However, a hypothesis is not necessarily a classifier.
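To make the definition concrete, a classifier can be sketched as a function that maps a feature vector to one of a fixed set of labels. The threshold rule, the feature values and the names `classify_expression` and `threshold` below are illustrative assumptions, not the models used in this study.

```python
from typing import Sequence


def classify_expression(features: Sequence[float], threshold: float = 0.5) -> str:
    """Toy classifier: assign a categorical label to one data point.

    The rule (mean feature value compared with a threshold) is a hypothetical
    stand-in for a hypothesis that would normally be learned from training data.
    """
    if not features:
        raise ValueError("features must not be empty")
    score = sum(features) / len(features)
    return "high" if score >= threshold else "low"


print(classify_expression([0.9, 0.7, 0.8]))
print(classify_expression([0.1, 0.2, 0.3]))
```

The output is always one of two discrete labels, which is what distinguishes a classifier from a model that predicts a continuous value.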
II. PROBLEM STATEMENT
The risk of an infectious state of the DNA may reduce the accuracy of predicting milk quality. Prediction with other therapeutic tools is less efficient, and prediction based on different proteins costs more than prediction based on mRNA.
III. PROPOSED SYSTEM
The proposed system predicts the secondary structure of mRNA in cow’s milk from a final dataset of 30 selected features, which is the input to a machine learning technique called a recurrent neural network (RNN), run in Google Colab.
IV. OBJECTIVES
V. LITERATURE SURVEY
Soon Hwai Ing, Azian Azamimi Abdullah, Nor Hazlyna Harun and Shigehiko Kanaya, in “COVID-19 mRNA Vaccine Degradation Prediction Using LR and LGBM Algorithms”, note that the coronavirus pandemic affected not only public health but also society, the economy and every walk of life. Measures have been taken to limit the spread, and one of the most effective is taking precautions to prevent transmission of the SARS-CoV-2 virus to uninfected populations.
Vaccination is one such precaution. Among all vaccines, the authors consider mRNA vaccines, which they describe as highly effective and free of side effects, the most preferable candidates. However, degradation is the biggest obstacle to their deployment. The study therefore develops models to predict the degradation rate of COVID-19 mRNA vaccines. Two machine learning algorithms, Linear Regression (LR) and Light Gradient Boosting Machine (LGBM), are used for model development in Python. A dataset from the Eterna platform, comprising thousands of RNA molecules with degradation rates at each position, is extracted, pre-processed and label-encoded before being loaded into the algorithms. The results show that LGBM (0.2447) performs better than LR (0.3957) when evaluated with the RMSE metric.
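The RMSE metric used in that comparison is the square root of the mean squared difference between measured and predicted values, so a lower value is better. A minimal sketch follows; the function name `rmse` and the sample values are illustrative assumptions and do not come from the cited paper.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical degradation values for three positions of one molecule.
measured = [0.50, 0.20, 0.90]
predicted = [0.40, 0.30, 0.70]
print(round(rmse(measured, predicted), 4))
```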
Rodrigo Martín, Yong Liu, Omar Landaeta, Luis Felipe Llamas, Chuanshe Zhou, Zhiliang Tan, Haibo Zhang and Cristian R Munteanu, in “Prediction of mRNA expression in cow’s milk using mRNA secondary structures and Machine Learning classifiers”, observe that predicting the quantity in which a specific mRNA type is expressed in cow’s milk is a challenging theoretical task. Their study presents for the first time several different Machine Learning models that predict mRNA expression from mRNA secondary structure fragments. The methodology is based on a dataset of experimental mRNA expression data. The best classification model was obtained with the Bayes net method and is based on 24 features and 4067 cases. The model has a true positive rate of 0.78 for the low mRNA expression class (average true positive rate of 0.66). Further studies are needed to improve these results, using datasets with different feature sets and more advanced Machine Learning methods.
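The true positive rate of a class is the fraction of cases that truly belong to that class and are also predicted as that class. The sketch below computes it per class; the function name and the toy labels are illustrative assumptions, and the cited paper does not state here how its average over classes is weighted.

```python
from typing import Sequence


def true_positive_rate(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """Fraction of cases with true class `label` that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions_for_label:
        raise ValueError(f"no cases with true label {label!r}")
    hits = sum(1 for p in predictions_for_label if p == label)
    return hits / len(predictions_for_label)


# Toy labels: four truly "low" cases and two truly "high" cases.
actual = ["low", "low", "low", "low", "high", "high"]
predicted = ["low", "low", "low", "high", "high", "low"]
print(true_positive_rate(actual, predicted, "low"))
print(true_positive_rate(actual, predicted, "high"))
```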
Zichao Yan, Eric Lécuyer and Mathieu Blanchette, in “Prediction of mRNA subcellular localization using deep recurrent neural networks”, note that messenger RNA subcellular localization mechanisms play a crucial role in post-transcriptional gene regulation. This trafficking is mediated by trans-acting RNA-binding proteins interacting with cis-regulatory elements called zipcodes. While new sequencing-based technologies allow the high-throughput identification of RNAs localized to specific subcellular compartments, the precise mechanisms at play, and their dependency on specific sequence elements, remain poorly understood. The authors introduce RNATracker, a novel deep neural network built to predict, from their sequence alone, the distributions of mRNA transcripts over a predefined set of subcellular compartments. RNATracker integrates several state-of-the-art deep learning techniques (e.g., CNN, LSTM and attention layers) and can make use of both sequence and secondary structure information. The authors report a variety of evaluations showing RNATracker’s strong predictive power, which is significantly superior to a variety of baseline predictors. Despite its complexity, several aspects of the model can be isolated to yield valuable, testable mechanistic hypotheses and to locate candidate zipcode sequences within transcripts.
VI. SYSTEM DESIGN
System design can be thought of as the application of systems theory to the development of the project. It defines the architecture, data flow, use case, class, sequence and activity diagrams of the project. The architecture diagram illustrates how the system is built and is the basic construction of the software. Creating and documenting such structures is the main responsibility of software architecture.
A. Collection of Dataset
The dataset was collected from the Kaggle website. The sample mRNA secondary structure sequences have undergone a high-throughput screening process. The dataset has features such as reactivity, the degradation values (deg_Mg_pH10, deg_Mg_50C, deg_50C) and the id of the mRNA secondary structure. Submission.csv is the original dataset. From this data file we split the dataset into train and test sets. The bpps folder contains .npy files identified by mRNA structure id.
a. deg_Mg_pH10 - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating-point numbers that should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and are used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high pH (pH 10).
b. deg_50C - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating-point numbers that should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and are used to determine the likelihood of degradation at the base/linkage after incubating without magnesium at high temperature (50 degrees Celsius).
c. deg_Mg_50C - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating-point numbers that should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and are used to determine the likelihood of degradation at the base/linkage after incubating with magnesium at high temperature (50 degrees Celsius).
d. *_error_* - An array of floating-point numbers that should have the same length as the corresponding reactivity or deg_* columns; these are the calculated errors in the experimental values of the reactivity and deg_* columns.
e. predicted_loop_type - (1x107 string) Describes the structural context (also referred to as 'loop type') of each character in sequence. Loop types are assigned by bpRNA from the Vienna RNAfold 2 structure. From the bpRNA documentation: S: paired "Stem"; M: Multiloop; I: Internal loop; B: Bulge; H: Hairpin loop; E: dangling End; X: eXternal loop.
f. id - An arbitrary identifier for each sample.
g. seq_scored - (68 in Train and Public Test, 91 in Private Test) Integer value denoting the number of positions used in scoring with predicted values. This should match the length of the reactivity, deg_* and *_error_* columns. Note that molecules used for the Private Test are longer than those in the Train and Public Test data, so the size of this vector is different.
h. seq_length - (107 in Train and Public Test, 130 in Private Test) Integer values, denotes the length of sequence. Note that molecules used for the Private Test will be longer than those in the Train and Public Test data, so the size of this vector will be different.
i. sequence - (1x107 string in Train and Public Test, 130 in Private Test) Describes the RNA sequence, a combination of A, G, U, and C for each sample. Should be 107 characters long, and the first 68 bases should correspond to the 68 positions specified in seq_scored (note: indexed starting at 0).
j. structure - (1x107 string in Train and Public Test, 130 in Private Test) An array of (, ) and . characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses, e.g. (....) means that base 0 is paired to base 5, and bases 1-4 are unpaired.
k. reactivity - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating-point numbers that should have the same length as seq_scored. These numbers are reactivity values for the first 68 bases as denoted in sequence, and are used to determine the likely secondary structure of the RNA sample.
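l. deg_pH10 - (1x68 vector in Train and Public Test, 1x91 in Private Test) An array of floating-point numbers that should have the same length as seq_scored. By analogy with the other deg_* columns, these values describe the likelihood of degradation at the base/linkage after incubating without magnesium at high pH (pH 10).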
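The structure field uses dot-bracket notation, which can be decoded into explicit base pairs with a stack: each opening parenthesis is pushed, and each closing parenthesis is paired with the most recent unmatched opening one. The sketch below illustrates this; the function name `paired_bases` is an assumption and is not part of the dataset or of any library.

```python
from typing import Dict, List


def paired_bases(structure: str) -> Dict[int, int]:
    """Map each paired position (0-indexed) to the position of its partner."""
    open_positions: List[int] = []
    pairs: Dict[int, int] = {}
    for position, symbol in enumerate(structure):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position}")
            partner = open_positions.pop()
            pairs[partner] = position
            pairs[position] = partner
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {position}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return pairs


# The example from the field description: base 0 pairs with base 5.
print(paired_bases("(....)"))
```

Positions that do not appear in the returned mapping are unpaired, which matches bases 1-4 in the example.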
B. Data Preprocessing
C. GRU
D. LSTM
VII. DATA FLOW DIAGRAM
A. Collecting of mRNA Dataset:
As described in Section VI-A, the dataset was collected from the Kaggle website and split into train and test sets, and the bpps folder contains .npy files identified by mRNA structure id.
B. Data Preprocessing
After getting the data we start with data pre-processing. Data preprocessing includes:
C. Training LSTM/GRU Model
VIII. IMPLEMENTATION
The algorithms used are LSTM and GRU.
A. LSTM
An LSTM has a control flow similar to that of a standard recurrent neural network: it processes data sequentially, passing information on as it propagates forward. The difference lies in the operations within the LSTM’s cells, whose gates (in particular the forget gate) decide how much past information to forget.
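The gating described above can be illustrated with one time step of a single-unit LSTM cell using the standard LSTM update equations. This is a minimal sketch under stated assumptions: the scalar weights, the dictionary layout of `weights` and the function names are hypothetical and do not reproduce the model trained in this project.

```python
import math
from typing import Dict, Tuple

Gate = Tuple[float, float, float]  # (input weight, recurrent weight, bias)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x: float, h_prev: float, c_prev: float,
              weights: Dict[str, Gate]) -> Tuple[float, float]:
    """One time step of a single-unit LSTM cell; returns (hidden, cell) state."""
    def pre_activation(gate: str) -> float:
        w_x, w_h, bias = weights[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))        # share of old memory kept
    write = sigmoid(pre_activation("input"))          # share of new candidate stored
    expose = sigmoid(pre_activation("output"))        # share of memory exposed
    candidate = math.tanh(pre_activation("candidate"))

    c_new = forget * c_prev + write * candidate
    h_new = expose * math.tanh(c_new)
    return h_new, c_new


# Hypothetical weights: a large forget bias keeps the previous cell state.
keep_memory = {"forget": (0.0, 0.0, 50.0), "input": (0.0, 0.0, -50.0),
               "output": (0.0, 0.0, 0.0), "candidate": (0.0, 0.0, 0.0)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, weights=keep_memory)
print(round(c, 6))
```

With a forget gate close to 1 the previous cell state passes through almost unchanged; a strongly negative forget bias would instead drive the retained memory toward zero.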
B. Algorithm Applied
a. Data augmentation.
b. Noise removal using the SN filter.
c. Tokenization.
Step 5: Apply the dataset to the LSTM and GRU models.
Step 6: Build the model.
Step 7: Prediction.
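The noise-removal and tokenization steps listed above can be sketched as follows. The field name `SN_filter`, the character vocabulary and all function names are assumptions made for illustration; data augmentation and model training are not shown.

```python
from typing import Dict, List

# Assumed vocabulary: bases, dot-bracket symbols and bpRNA loop-type codes.
VOCAB = "ACGU().BEHIMSX"
TOKEN_TO_ID: Dict[str, int] = {symbol: index for index, symbol in enumerate(VOCAB)}


def keep_clean_samples(samples: List[dict]) -> List[dict]:
    """Noise removal: keep samples whose (assumed) SN_filter flag equals 1."""
    return [sample for sample in samples if sample.get("SN_filter") == 1]


def tokenize(text: str) -> List[int]:
    """Map each character to its integer id; unknown characters raise KeyError."""
    return [TOKEN_TO_ID[symbol] for symbol in text]


def encode_sample(sample: dict) -> List[List[int]]:
    """Return one [base, structure, loop_type] token triple per position."""
    columns = (sample["sequence"], sample["structure"], sample["predicted_loop_type"])
    return [list(triple) for triple in zip(*(tokenize(column) for column in columns))]


# Two toy samples; only the first passes the filter.
samples = [
    {"id": "toy_1", "sequence": "GAAAAC", "structure": "(....)",
     "predicted_loop_type": "SHHHHS", "SN_filter": 1},
    {"id": "toy_2", "sequence": "AAAAAA", "structure": "......",
     "predicted_loop_type": "EEEEEE", "SN_filter": 0},
]
clean = keep_clean_samples(samples)
encoded = [encode_sample(sample) for sample in clean]
print(len(clean), encoded[0][0])
```

Each encoded sample is a list of per-position integer triples, a shape that a recurrent model such as an LSTM or GRU can consume after an embedding layer.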
IX. EXPERIMENTAL RESULTS
A. Outputs
After the GRU and LSTM models are developed, they are used to predict the mRNA secondary structure sequence and the output is displayed. Both the LSTM and the GRU model reach an accuracy of 86%.
B. Results Snapshots
This project predicts mRNA secondary structure using the mRNA secondary structure dataset. The prediction is carried out with GRU and LSTM models, which are trained on the dataset and then predict the mRNA secondary structure sequences for the test data. Both models achieved the same accuracy, so either one can be used.
Copyright © 2022 Jyothi G C, Kruthika B M, Sushma K M, Apoorva B R, Vijayalaxmi . This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Paper Id : IJRASET45866
Publish Date : 2022-07-21
ISSN : 2321-9653
Publisher Name : IJRASET