Title: | Medulloblastoma Subgroups Prediction |
---|---|
Description: | Utilizing a combination of machine learning models (Random Forest, Naive Bayes, K-Nearest Neighbor, Support Vector Machines, Extreme Gradient Boosting, and Linear Discriminant Analysis) and a deep Artificial Neural Network model, 'MBMethPred' can predict medulloblastoma subgroups, including wingless (WNT), sonic hedgehog (SHH), Group 3, and Group 4 from DNA methylation beta values. See Sharif Rahmani E, Lawarde A, Lingasamy P, Moreno SV, Salumets A and Modhukur V (2023), MBMethPred: a computational framework for the accurate classification of childhood medulloblastoma subgroups using data integration and AI-based approaches. Front. Genet. 14:1233657. <doi: 10.3389/fgene.2023.1233657> for more details. |
Authors: | Edris Sharif Rahmani [aut, ctb, cre] , Ankita Sunil Lawarde [aut, ctb] , Vijayachitra Modhukur [aut, ctb] |
Maintainer: | Edris Sharif Rahmani <[email protected]> |
License: | GPL |
Version: | 0.1.4.2 |
Built: | 2024-11-21 04:21:48 UTC |
Source: | https://github.com/sharifrahmanie/mbmethpred |
A function to draw a box plot for the DNA methylation dataset.
BoxPlot(File, Projname = NULL)
File | The output of the ReadMethylFile function. |
Projname | A name used to name the plot. The default is NULL. |
A ggplot2 object
data <- Data2[1:10,]
data <- cbind(rownames(data), data)
colnames(data)[1] <- "ID"
BoxPlot(File = data)
A function to calculate the confusion matrix of the machine and deep learning models. It outputs Accuracy, Precision, Sensitivity, F1-Score, Specificity, and AUC_average.
ConfusionMatrix(y_true, y_pred)
y_true | True labels |
y_pred | Predicted labels |
A data frame
set.seed(1234)
data <- Data1[1:10,]
data$subgroup <- factor(data$subgroup)
fac <- ncol(data) # the last column holds the labels
# Split the rows into training and test sets
split <- caTools::sample.split(data[, fac], SplitRatio = 0.8)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)
# Train a small random forest and predict the test-set labels
rf <- randomForest::randomForest(x = training_set[-fac], y = training_set[, fac], ntree = 10)
y_pred <- predict(rf, newdata = test_set[-fac])
ConfusionMatrix(y_true = test_set[, fac], y_pred = y_pred)
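The metrics listed above can be derived from a confusion table in a one-vs-rest fashion. The following is a minimal base R sketch of that idea, not the package's implementation: `class_metrics` is a hypothetical helper, the labels are made up, and AUC is left out.

```r
# Hypothetical helper (not part of MBMethPred): per-class metrics
# computed one-vs-rest from true and predicted labels.
class_metrics <- function(y_true, y_pred) {
  lev <- union(levels(factor(y_true)), levels(factor(y_pred)))
  y_true <- factor(y_true, levels = lev)
  y_pred <- factor(y_pred, levels = lev)
  cm <- table(Predicted = y_pred, Actual = y_true)

  tp <- diag(cm)                 # correctly predicted per class
  fp <- rowSums(cm) - tp         # predicted as the class, but another class
  fn <- colSums(cm) - tp         # the class, but predicted as another
  tn <- sum(cm) - tp - fp - fn   # everything else

  precision   <- tp / (tp + fp)
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)

  data.frame(Class = lev, Precision = precision, Sensitivity = sensitivity,
             Specificity = specificity, F1 = f1, row.names = NULL)
}

# Made-up labels for illustration
y_true <- c("WNT", "SHH", "SHH", "Group3", "Group4", "Group4")
y_pred <- c("WNT", "SHH", "Group3", "Group3", "Group4", "SHH")

metrics <- class_metrics(y_true, y_pred)
accuracy <- mean(y_true == y_pred)
print(metrics)
print(accuracy)
```

A class that is never predicted gives a zero denominator for precision, so the corresponding entries are NaN in this sketch.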
Data1 contains medulloblastoma DNA methylation beta values from a GEO series (GSE85212), restricted to the 399 most important probes. This dataset is used to train and test the machine and deep learning models.
A data frame
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85212
data(Data1)
Data2 contains medulloblastoma DNA methylation beta values (GSE85212, 50 samples) for the 10000 most variable probes, used for similarity network fusion.
A data frame
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85212
Cavalli FMG, Remke M, Rampasek L, Peacock J et al. Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell 2017 Jun 12;31(6):737-754.e6. PMID: 28609654
data(Data2)
Data3 is a gene expression dataset from primary medulloblastoma samples (GSE85217, 50 samples) used for similarity network fusion.
A data frame
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85217
Cavalli FMG, Remke M, Rampasek L, Peacock J et al. Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell 2017 Jun 12;31(6):737-754.e6. PMID: 28609654
data(Data3)
A function to train a K-nearest neighbor model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
K | The number of nearest neighbors. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(111)
knn <- KNearestNeighborModel(SplitRatio = 0.8, CV = 3, K = 3, NCores = 1, NewData = NULL)
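The SplitRatio and CV arguments describe a train/test split and a number of cross-validation folds. How the model functions build these internally is not stated here; the following is a minimal base R sketch of one common approach (a stratified split followed by random fold assignment). The labels are simulated, and `split_ratio`, `n_folds`, `train_idx`, `test_idx`, and `fold_id` are illustrative names only.

```r
set.seed(42)

# Simulated labels: 20 samples per subgroup (illustration only)
labels <- factor(rep(c("WNT", "SHH", "Group3", "Group4"), each = 20))
split_ratio <- 0.8
n_folds <- 3

# Stratified split: draw split_ratio of the indices within each subgroup
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(split_ratio * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Assign every training sample to one of n_folds cross-validation folds
fold_id <- sample(rep_len(seq_len(n_folds), length(train_idx)))

print(table(labels[train_idx]))
print(table(fold_id))
```

Stratifying by subgroup keeps the class proportions of the training and test sets close to those of the full data, which matters when some subgroups are small.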
A function to train a linear discriminant analysis model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(123)
lda <- LinearDiscriminantAnalysisModel(SplitRatio = 0.8, CV = 2, NCores = 1, NewData = NULL)
A function to extract the confusion matrix information.
ModelMetrics(Model)
Model | A trained model. |
A list
xgboost <- XGBoostModel(SplitRatio = 0.2, CV = 2, NCores = 1, NewData = NULL)
ModelMetrics(Model = xgboost)
A function to train a Naive Bayes model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
Threshold | The threshold for deciding class probability. A value greater than or equal to zero and less than one. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(123)
nb <- NaiveBayesModel(SplitRatio = 0.8, CV = 2, Threshold = 0.8, NCores = 1, NewData = NULL)
A function to train an artificial neural network model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
Epochs | The number of epochs. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
InstallTensorFlow | Logical. The first time this function is run, the TensorFlow library (V 2.10-cpu) needs to be installed. The default is TRUE. |
A list
## Not run:
set.seed(1234)
ann <- NeuralNetworkModel(Epochs = 100, NewData = NULL, InstallTensorFlow = TRUE)
## End(Not run)
A function to output the predicted medulloblastoma subgroups by trained models.
NewDataPredictionResult(Model)
Model | A trained model. |
A data frame
set.seed(10)
fac <- ncol(Data1)
# Drop the last column, transpose, and keep 10 randomly chosen columns
NewData <- sample(data.frame(t(Data1[,-fac])),10)
# Add the row names as a leading "ID" column
NewData <- cbind(rownames(NewData), NewData)
colnames(NewData)[1] <- "ID"
xgboost <- XGBoostModel(SplitRatio = 0.2, CV = 2, NCores = 1, NewData = NewData)
NewDataPredictionResult(Model = xgboost)
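The reshaping steps in the example above can be followed on a small table without loading the package. The sketch below uses a simulated stand-in for the training data (`train`, with made-up probe and sample names and a trailing subgroup column); it assumes, as the examples suggest, that samples are in rows and the label is the last column.

```r
set.seed(10)

# Simulated stand-in: 6 samples x 4 probes plus a subgroup column
train <- data.frame(
  cg00000001 = runif(6), cg00000002 = runif(6),
  cg00000003 = runif(6), cg00000004 = runif(6),
  subgroup = c("WNT", "SHH", "SHH", "Group3", "Group4", "Group4"),
  row.names = paste0("Sample", 1:6)
)
fac <- ncol(train)

# Drop the label column and transpose: probes become rows, samples columns
new_data <- data.frame(t(train[, -fac]))

# sample() on a data frame draws columns, so this keeps 3 random samples
new_data <- sample(new_data, 3)

# Move the probe names into a leading "ID" column
new_data <- cbind(rownames(new_data), new_data)
colnames(new_data)[1] <- "ID"

print(dim(new_data))
print(colnames(new_data))
```

The result has one row per probe, an "ID" column with the probe names, and one column per selected sample.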
A function to train a random forest model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
NTree | The number of trees to be grown. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(21)
rf <- RandomForestModel(SplitRatio = 0.8, CV = 3, NTree = 10, NCores = 1, NewData = NULL)
A function to read DNA methylation files. Its output can be used as the new data for prediction by every model.
ReadMethylFile(File)
File | A data frame stored in a file with a tsv or csv extension. The first column of the data frame holds the CpG methylation probes, each starting with the characters cg followed by a number (e.g., cg100091). The other columns are samples with methylation beta values. All columns in the data frame should have a name. |
A data frame
## Not run:
methyl <- ReadMethylFile(File = "file.csv")
## End(Not run)
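The file layout described above can be illustrated by writing a small csv file and checking it with base R. This is a sketch under stated assumptions: the probe IDs and beta values are simulated, the checks are illustrative rather than those performed by ReadMethylFile, and the range check relies on beta values lying between 0 and 1.

```r
set.seed(1)

# Simulated beta values: 5 probes x 3 samples (probe IDs are made up)
methyl <- data.frame(
  ID = paste0("cg", sprintf("%08d", 1:5)),
  Sample1 = runif(5),
  Sample2 = runif(5),
  Sample3 = runif(5)
)

# Write the table to a temporary csv file
path <- tempfile(fileext = ".csv")
write.csv(methyl, path, row.names = FALSE)

# Read it back and check the layout
input <- read.csv(path, stringsAsFactors = FALSE)
probe_ok <- all(grepl("^cg[0-9]+$", input[[1]]))      # first column: cg + number
named_ok <- !anyNA(colnames(input)) && all(nzchar(colnames(input)))
beta_ok <- all(vapply(input[-1], function(x) {
  is.numeric(x) && all(x >= 0 & x <= 1)               # beta values lie in [0, 1]
}, logical(1)))

print(c(probe_ok = probe_ok, named_ok = named_ok, beta_ok = beta_ok))
```

A file that passes these checks could then be supplied through the File argument, for example `ReadMethylFile(File = path)`.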
A function to read a user-provided file to be fed into the SNF function (from the SNFtool package).
ReadSNFData(File)
File | A data frame stored in a file with a tsv or csv extension. The first column of the data frame holds the CpG methylation probes, each starting with the characters cg followed by a number (e.g., cg100091). The other columns are samples with methylation beta values. All columns in the data frame should have a name. |
A data frame
## Not run:
data <- ReadSNFData(File = "file.csv")
## End(Not run)
The actual labels from the medulloblastoma DNA methylation beta values (GSE85212, 50 samples) that were used for similarity network fusion.
Factor
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85212
Cavalli FMG, Remke M, Rampasek L, Peacock J et al. Intertumoral Heterogeneity within Medulloblastoma Subgroups. Cancer Cell 2017 Jun 12;31(6):737-754.e6. PMID: 28609654
data(RLabels)
A function to run the SNF function (from the SNFtool package) and output clusters.
SimilarityNetworkFusion(Files = NULL, NNeighbors, Sigma, NClusters, CLabels = NULL, RLabels = NULL, Niterations)
Files | A list of data frames created using the ReadSNFData function, or a list of matrices. |
NNeighbors | The number of nearest neighbors. |
Sigma | The variance for the local model. |
NClusters | The number of clusters. |
CLabels | A string vector to name the clusters. Optional. |
RLabels | The actual labels of the samples, used to calculate the Normalized Mutual Information (NMI) score. Optional. |
Niterations | The number of iterations for the diffusion process. |
Factor
data(RLabels) # Real labels
data(Data2) # Methylation
data(Data3) # Gene expression
snf <- SimilarityNetworkFusion(Files = list(Data2, Data3), NNeighbors = 13, Sigma = 0.75, NClusters = 4, CLabels = c("Group4", "SHH", "WNT", "Group3"), RLabels = RLabels, Niterations = 10)
snf
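The RLabels argument is used for a Normalized Mutual Information (NMI) score between the actual labels and the clusters. NMI has several normalizations; the base R sketch below uses one common definition, the mutual information divided by the geometric mean of the two entropies. The helpers `entropy`, `mutual_information`, and `nmi` are hypothetical and are not claimed to match the computation used by this function.

```r
# Shannon entropy of a label vector (natural logarithm)
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Mutual information between two label vectors of equal length
mutual_information <- function(x, y) {
  joint <- table(x, y) / length(x)          # joint distribution
  expected <- outer(rowSums(joint), colSums(joint))  # product of marginals
  nz <- joint > 0                           # skip empty cells (0 * log 0 = 0)
  sum(joint[nz] * log(joint[nz] / expected[nz]))
}

# NMI with geometric-mean normalization (one of several conventions);
# undefined when either labeling has a single class (zero entropy)
nmi <- function(x, y) {
  mutual_information(x, y) / sqrt(entropy(x) * entropy(y))
}

# Made-up labels for illustration
truth    <- c("WNT", "WNT", "SHH", "SHH", "Group3", "Group3", "Group4", "Group4")
relabel  <- c("a", "a", "b", "b", "c", "c", "d", "d")   # same partition, new names
shuffled <- c("a", "b", "a", "b", "a", "b", "a", "b")   # independent of truth

print(nmi(truth, relabel))
print(nmi(truth, shuffled))
```

Because NMI compares partitions rather than label names, a clustering that matches the actual labels up to renaming scores 1, while a clustering independent of them scores 0.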
A function to train a support vector machine model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(56)
svm <- SupportVectorMachineModel(SplitRatio = 0.8, CV = 3, NCores = 1, NewData = NULL)
A function to draw a 3D t-SNE plot for DNA methylation beta values using the K-means clustering technique.
TSNEPlot(File, NCluster = 4)
File | The output of the ReadMethylFile function. |
NCluster | The number of clusters. |
Objects of rgl
set.seed(123)
# Transpose the first 100 rows of Data2 and add the row names as an "ID" column
data <- Data2[1:100,]
data <- data.frame(t(data))
data <- cbind(rownames(data), data)
colnames(data)[1] <- "ID"
TSNEPlot(File = data, NCluster = 4)
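The clustering step behind a plot of this kind can be sketched with `stats::kmeans` on three-dimensional coordinates. In the sketch below the coordinates are simulated rather than computed by t-SNE, so it only illustrates how an NCluster-style argument maps to k-means groups that could then be used to color points; `embedding` and `centers` are illustrative names.

```r
set.seed(123)

# Simulated 3D coordinates: 4 well-separated groups of 10 points each
# (a stand-in for a t-SNE embedding, which is not computed here)
centers <- rbind(c(0, 0, 0), c(10, 0, 0), c(0, 10, 0), c(0, 0, 10))
embedding <- do.call(rbind, lapply(1:4, function(i) {
  noise <- matrix(rnorm(30, sd = 0.5), ncol = 3)
  noise + matrix(centers[i, ], nrow = 10, ncol = 3, byrow = TRUE)
}))
colnames(embedding) <- c("Dim1", "Dim2", "Dim3")

# K-means with 4 clusters; several random starts reduce the risk of a poor fit
km <- kmeans(embedding, centers = 4, nstart = 10)
cluster <- factor(km$cluster)

print(table(cluster))
```

The resulting `cluster` factor has one entry per point and can be passed to any plotting function as a grouping or color variable.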
A function to train an XGBoost model to classify medulloblastoma subgroups using DNA methylation beta values (Illumina Infinium HumanMethylation450). If new data is provided, training is followed by prediction on that data.
SplitRatio | Train and test split ratio. A value greater than or equal to zero and less than one. |
CV | The number of folds for cross-validation. It should be greater than one. |
NCores | The number of cores for parallel computing. |
NewData | Methylation beta values, as returned by the ReadMethylFile function. |
A list
set.seed(123)
xgboost <- XGBoostModel(SplitRatio = 0.2, CV = 2, NCores = 1, NewData = NULL)