Random Forest is a common ensemble method used to make predictions from a variety of dataset types. It works by combining a number of diverse but simpler decision tree models into an ensemble that can be both more accurate and more versatile. In creating a random forest, each decision tree is trained on a randomized subset of the data and a random subset of the features, and the outputs of the individual trees are combined into the output of the forest. The benefits of building a model this way include the ability to handle missing data natively, a reduced need to perform dimensionality reduction prior to modeling, and a forest structure that can be used to estimate variable importance, which can in turn inform feature engineering or the training of other models. Given these benefits, and the ability to use complex datasets natively, Random Forests have gained popularity over time and are now widely used.
Literature Review
The thesis (Louppe 2014) dives deep into a detailed analysis of random forests, which is an important machine learning algorithm. It aims to understand how they learn, how they work internally, and how easy they are to interpret. In the first part, it explores the creation of decision trees and their assembly into random forests, including their design and purpose. It also presents a study on the computational efficiency and scalability of random forests, with specific implementation insights from Scikit-Learn. The second part focuses on understanding how interpretable random forests are by examining the Mean Decrease Impurity measure - a key method for determining variable importance, especially in the context of multiway totally randomized trees under extensive or asymptotic conditions.
Random Forest is a widely used machine learning method for analyzing high-dimensional data, valued for its flexibility and ability to identify important features. However, it often overlooks the intricate connections among features and how they collectively influence outcomes. The paper (Lucas F. Voges 2023) introduces two approaches that tackle this issue: Mutual Forest Impact (MFI) and Mutual Impurity Reduction (MIR). MFI assesses the combined effect of features on outcomes, providing a more detailed understanding than correlation analysis. MIR goes further by integrating the relationship parameter with individual feature importance and incorporating testing procedures for selecting features based on statistical significance. Evaluations on simulated datasets and comparisons with existing feature selection methods demonstrate the potential of MFI and MIR in uncovering complex relationships between features and outcomes without bias towards features with many splits or high minor allele frequencies.
The paper (Zardad Khan 2024) introduces a new feature selection method called the Robust Weighted Score for Unbalanced data (ROWSU). It is specifically designed to deal with class imbalance in high-dimensional gene expression data for binary classification tasks. To tackle the challenge of imbalanced classes, ROWSU first balances the dataset by creating synthetic data points for the minority class. It then uses a greedy search to find a small set of important genes and introduces a weighted robust score that calculates how useful each gene is using support vector weights. The final set of genes combines both high-scoring genes and those found through the greedy search, ensuring that the selected genes can discriminate between the classes even when they are imbalanced. The authors evaluated ROWSU on six gene expression datasets and compared its performance to other state-of-the-art feature selection techniques using accuracy and sensitivity metrics, visualizing the results with boxplots and stability plots. Their findings show that ROWSU improves classifier effectiveness relative to the other methods, as shown by its results with k nearest neighbors (kNN) and random forest (RF) classifiers.
The main focus of this article (Rabah Ouali 2024) is ensuring that the electrical grid works properly by making energy providers follow the rules and specifications set by Transmission System Operators (TSOs). Since many different energy sources are connected through power electronic inverters, it is crucial for them to choose between Grid Forming (GFM) and Grid Following (GFL) operating modes in order to maintain grid stability, and energy suppliers need to comply with these requirements. The study compares various machine learning algorithms for classifying converter control modes (GFL or GFM) using frequency-domain admittance from external measurements. While most algorithms can accurately identify known control structures, they struggle when faced with new modifications. However, the random forest algorithm stands out as it consistently performs well across different control configurations.
In the paper “Analysis of a Random Forests Model”, Biau (Biau 2012) provides an in-depth analysis of the Random Forests model. The author delves into the statistical components and mathematical support, as well as investigates various aspects of the Random Forests, such as its consistency, convergence rates, and the effect of the number of trees on performance. He touches on the importance of variable selection and the increasing adaptability found within the random forest algorithm. Biau provides theoretical insights into the behavior of Random Forests, shedding light on its robustness and effectiveness as a machine learning algorithm.
In “A Random Forest Guided Tour”, Biau and Scornet (Biau and Scornet 2016) present an extensive exploration of Random Forests, a popular machine learning algorithm. The paper serves as a comprehensive guide, covering aspects of theoretical foundations, methodology, model evaluation, and empirical performance. They discuss the inner workings of Random Forests, including topics such as tree construction, feature selection, and ensemble learning, which aid in predictive analysis. Overall, the paper serves as a valuable resource for those interested in understanding and utilizing Random Forests in their machine learning tasks.
In “Classification and Regression by randomForest,” Liaw and Wiener (Liaw, Wiener, et al. 2002) discuss the introduction of the randomForest package for classification and regression tasks in the R programming language. An overview of the Random Forest algorithm is provided, including how it draws bootstrap samples and estimates rates of error. This technique is referred to as ensemble learning called bagging and uses decision trees as base learners. Implementation of the randomForest package addresses parameters for tuning the model and handling missing values. It is noted that the production of multiple trees is crucial in obtaining variable importance and measures of proximity. Overall, the paper serves as a practical guide for R users and highlights the advantages of using random forests for both classification and regression problems, such as robustness to overfitting and high prediction accuracy.
In the paper “Random Forest as a Predictive Analytics Alternative to Regression in Institutional Research”, the authors (Lingjun et al. 2019) explore the application of Random Forests as a predictive analytics tool in institutional research. They discuss the limitations of traditional regression models in handling complex datasets and introduce Random Forest as an alternative approach. To do this, they highlight the key benefits of the Random Forest model, as it has the ability to handle non-linear relationships, interactions, and high-dimensional data effectively. In addition, they expound upon predictive capabilities, improved accuracy, and robustness as a strategy for data-based decision making.
In “Machine Learning Benchmarks and Random Forest Regression”, Segal (Segal 2004) explores the application and effectiveness of Random Forest Regression in the context of machine learning. Machine learning algorithms may face challenges, which emphasize the need for standardized datasets and evaluation metrics. Segal (2004) examines how Random Forest Regression performs across various datasets, comparing its performance to other regression techniques. The study highlights the strengths of Random Forest Regression in handling non-linear relationships and noisy data, showcasing its versatility and robustness. Additionally, the paper provides insights into the impact of different parameters on the performance of Random Forest Regression, offering practical guidance for its implementation. Overall, the paper contributes to the understanding of Random Forest Regression and its potential as a powerful tool in the field of machine learning.
To tune or not to tune the number of trees in random forest. This article (Probst and Boulesteix 2018) intends to demonstrate whether to keep the number of trees in a random forest at the maximum feasible number, or to reduce the number of trees and the effects that might have. Previous research has shown that past a point, increasing the number of trees yields a small gain in the area under the curve, but this research did not demonstrate whether there was a possibility of a smaller number of trees leading to a better result. It was found that while there were specific situations and datasets where a lower tree count was beneficial, these were in the minority and selecting a larger number of trees is often beneficial with the possible exception of median squared/median absolute error rather than the more common mean squared/mean absolute error.
Improved random forest for classification. The goal of (Paul et al. 2018) is to create a random forest model that improves feature selection and chooses the optimal number of trees simultaneously. This is done iteratively, identifying important and unimportant features and then adding a set number of trees per pass until convergence. The feature selection method appears to be similar to stepwise regression, and it is not clear in the paper whether it may have drawbacks similar to those of stepwise regression / feature selection. The optimal number of trees to add is based on the probability of a good split for a given tree, which is calculated using the strength and correlation of the forest. The idea appears to be to limit the number of trees added in order to reduce computational overhead without a classification accuracy penalty. The result was a fast and accurate classifier with more automation for tuning than common random forest algorithms.
Random Forest Based Feature Induction. The goal of (Vens and Costa 2011) is to demonstrate a method to create induced features using a random forest, which has the benefit of handling sparse datasets and missing values by default. This can assist with automated feature selection by reducing the dimensionality of the full dataset. When working with a very large number of features, some reduction is necessary to be able to conceptualize or easily work with the data, and this method could also help with preparation for a method like PCA. A random forest is built much as it would be for prediction, but instead of predicting, the nodes of the trees are combined into a feature space. The resulting feature space, used with a support vector machine, was able to show high predictive performance even when the original dataset was not usable for SVM, presumably due to missing values. On real datasets, this method often produced better accuracy, and in the cases where it did not, there was not a large penalty for using it, with one exception. There may be limitations to its use due to computational requirements, but there may also be datasets for which this type of feature induction would decrease total computational requirements with some models.
Application of random forest algorithm on feature subset selection and classification and regression. The goal of (Jaiswal and Samikannu 2017) is to use a random forest to select a subset of high-value features to be analyzed, reducing dimensionality, computational overhead, and the probability of overfitting, and possibly improving the modeler's ability to visualize and understand the dataset by removing unnecessary features. The method is to generate large, unpruned trees and then use out-of-bag estimation, with the importance of each variable calculated from the number of correct votes minus the votes obtained with the permuted out-of-bag variable values. This can be an iterative process for large datasets. Gini values are used to identify variable interactions across trees in the forest. The results of this paper are not entirely clear, but there does appear to have been a reduction in dimensionality using this method. One limitation is that if engineered features could be useful, they are best created prior to using the random forest feature selection.
A guided random forest based feature selection approach for activity recognition. The purpose of (Uddin and Uddiny 2015) is to use a random forest during feature selection for a human activity recognition problem. This is performed by training a standard random forest, then using the generated feature importance scores to select the most important features for a second random forest used for prediction. In general, feature selection of this type can improve accuracy and reduce computational complexity. The results were that the guided random forest has accuracy comparable to more common methods such as Relief-F and Lasso Logistic Regression, but has a lower computational complexity than Relief-F, making it a possible choice when a tradeoff between complexity and accuracy is required.
A permutation importance-based feature selection method for short-term electricity load forecasting using random forest. The goal of (Huang, Lu, and Xu 2016) is to demonstrate a new feature selection method in which the permutation importance values of a trained random forest are used to select the best features for training a second random forest that predicts load forecasts for a power grid. This reduced feature set is then used to train the short-term load forecasting model, which improves the required computation time and may improve accuracy as well. The feature selection method worked best for a random forest model, but also provided better results for an artificial neural network and support vector regression, indicating it may generalize well.
Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. (Boulesteix 2016) Random forest (RF) methodology is used to address two main classes of problems: constructing a prediction rule in a supervised learning problem, and assessing and ranking variables with respect to their ability to predict the response. RF is a classification and regression method based on the aggregation of a large number of decision trees. It has become a popular analysis tool in many application fields, including bioinformatics, and will most probably remain relevant in the future due to its high flexibility. However, RF approaches still face a number of challenges: they can produce unexpected results in some specific cases, such as a bias that depends on the distribution of the predictors.
Comparison of Random Forest and SVM for Raw Data in Drug Discovery: Prediction of Radiation Protection and Toxicity Case Study. (Matsumoto and Ohwada 2016) Random forest and SVM were compared on raw data in drug discovery. There were two types of problems based on the target protein: predicting the radiation protection (cancer) function and the toxicity of radioprotectors targeting p53 as a case study. Two experiments were performed for each of the 84 compounds. First, the compounds were administered to normal cells to measure toxicity. Then, the compounds were administered to gamma-irradiated cells to measure the radiation-protection function. The experiments measured the cell death rate at each concentration. For predicting the radiation-protection function, SVM was found to perform better than random forest as determined by the AUC score. In contrast, for predicting toxicity, random forest performed better than SVM.
Research on machine learning framework based on random forest algorithm. (Ren and Han 2017) This article describes research on a machine learning framework based on a random forest algorithm. It introduces the filtering method for feature selection: each feature is given a weight, features are ranked according to that weight, and rules are then applied to set a threshold so that features whose weight exceeds the threshold are retained while the rest are deleted. The steps for optimizing the algorithm are to carry out feature selection to remove noise features, carry out feature selection to delete redundant features, and apply voting strategies to optimize the random forest algorithm.
Random Forest. (Rigatti 2017) Colon cancer data from the SEER database was used to construct both a Cox model and a random forest model to determine how well the models perform on the same data. The dataset is sampled with bootstrap sampling. The article explains what a random forest is and gives many examples. In the sample problem evaluated, the Cox and random forest models performed similarly, so the Cox model, being easier to interpret, was the preferred method. The problem had only a few predictors and no obvious interactions or nonlinear effects, so the random forest model would not be the best option in this case. However, both models performed well, with an error rate of approximately 18%.
Methods
Decision trees
Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The tree can be explained by two entities: decision (internal) nodes and leaves. The decision nodes are where the data is split, and the leaves are the decisions or the final outcomes. A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (e.g., whether a coin flip comes up heads or tails), each leaf node represents a class label (decision taken after computing all features), and branches represent conjunctions of features that lead to those class labels. The paths from root to leaf represent classification rules. The algorithm for predicting the class of a given dataset starts from the root node of the tree. It compares the values of the root attribute with the record attribute, follows the branch based on the comparison, and jumps to the next node. This process continues until it reaches the leaf node of the tree.
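As a concrete illustration of this root-to-leaf prediction process, a single classification tree can be fit and inspected in R. The sketch below uses the rpart package and the built-in iris data purely as an example; neither is part of the analysis described later.

library(rpart)

# Fit one classification tree: internal nodes test features, leaves hold class labels
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)   # text view of the splits, branches, and leaf classes

# Prediction walks each row from the root, following the branch chosen at every test
predict(tree, iris[1:5, ], type = "class")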
Decision trees in a random forest work together to create a robust and accurate model by leveraging their diversity and combining the output of multiple decision trees to make predictions.
When evaluating the quality of the splits in decision trees, several metrics are considered.
1. Gini impurity:
The Gini impurity measures the probability that a randomly chosen element of a node would be incorrectly classified if it were randomly labeled according to the distribution of classes in that node. \[I_G=1-\sum\limits_{i=1}^{J}p_i^2\] where \(p_i\) is the proportion of class \(i\) among the elements in the node. (The Gini impurity should not be confused with the Gini coefficient, a measure of inequality between 0 and 1 used for income and wealth, \[G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i - x_j|}{2n^2\bar{x}}\] where a value closer to 1 indicates more inequality.)
2. Entropy:
Entropy is another impurity measure that can be used in multiclass problems and is often used for finding where to split the features. \[H=\sum\limits_{i=1}^{n}-p_i \log_2 p_i\] where \(p_i\) is the percentage of each class present in the node resulting from a tree split.
3. Information gain:
Information gain is the reduction in impurity (measured with either the Gini impurity or entropy) produced by a split, so splits that yield purer child nodes are preferred. The tradeoff between the two measures is a faster calculation using the Gini impurity versus a possibly higher accuracy from entropy. A small numeric illustration of both measures, together with the out-of-bag error described next, is given in the sketch after this list.
4. Out of bag error estimation:
Each decision tree is trained on a bootstrapped sample of the original dataset.
The data points that are not included in the bootstrapped sample for a particular tree are called out-of-bag instances.
OOB error provides a measure of how well the random forest model is likely to perform on new data.
It is useful for assessing model performance and tuning hyperparameters.
\[\text{OOB Error} = \frac{1}{N} \sum_{i=1}^{N} I\left(y_i \neq \hat{y}_{i, \text{OOB}}\right)\] where \(N\) is the number of rows in the dataset and \(I\) is an indicator function that returns 1 if \(y_i \neq \hat{y}_{i,\text{OOB}}\) and 0 otherwise.
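The sketch below (referenced from item 3 above) illustrates these measures numerically: small helper functions for Gini impurity and entropy, and the out-of-bag error reported for a forest fit with the randomForest package. The iris data and the choice of 500 trees are illustrative assumptions, not part of the later analysis.

library(randomForest)

# Impurity measures for a vector of class proportions p in one node
gini_impurity <- function(p) 1 - sum(p^2)                       # I_G = 1 - sum(p_i^2)
entropy       <- function(p) -sum(p[p > 0] * log2(p[p > 0]))    # H = -sum(p_i * log2(p_i))

p <- c(0.50, 0.25, 0.25)    # example class proportions after a split
gini_impurity(p)            # 0.625
entropy(p)                  # 1.5

# Out-of-bag error: predict() without newdata returns the OOB prediction for each row
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
mean(predict(rf) != iris$Species)   # fraction of OOB predictions that miss y_i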
Random Forest involves the following basic concepts:
Bootstrap Sampling (Bagging): Perform random sampling with replacement on the original dataset to form a new dataset for training a decision tree. In each round of bootstrap sampling, about 36.8% of the samples will be missed, not appearing in the new dataset, and these data are referred to as out-of-bag (OOB) data.
Random Feature Selection: At each node of the decision tree training, randomly select a subset of features, and then use information gain (or other criteria) to choose the best split. Repeat the above steps, generating multiple decision trees, to form a “forest”.
Prediction: When predicting new data samples, each tree produces its own prediction result. Random Forest combines these results and uses majority voting to determine the final prediction outcome.
Ensemble Learning via Hard Voting Classifier: By voting, the results of five models are integrated to obtain the combined outcome, and the category that appears most frequently is selected as the final prediction. This is a strategy of ensemble learning known as the Hard Voting Classifier. A small conceptual sketch of the bootstrap-sampling, feature-selection, and voting steps above follows.
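The following is a minimal conceptual sketch of bootstrap sampling, random feature selection, and majority voting, written with rpart trees on the built-in iris data (both are assumptions made only for illustration). Note that a real random forest re-draws the candidate feature subset at every node rather than once per tree, as is done here for brevity.

library(rpart)

set.seed(42)
n_trees  <- 25
features <- setdiff(names(iris), "Species")
trees    <- vector("list", n_trees)

for (b in seq_len(n_trees)) {
  boot_idx   <- sample(nrow(iris), replace = TRUE)   # bootstrap sample (bagging)
  feats      <- sample(features, size = 2)           # random feature subset
  trees[[b]] <- rpart(reformulate(feats, response = "Species"),
                      data = iris[boot_idx, ])
}

# Each tree votes; the majority class is the forest's prediction
votes <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
pred  <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == iris$Species)   # agreement on the training rows, for illustration only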
Five normalization methods
The data needs to be normalized using 5 different methods, due to significant differences in feature values initially:
The idea is to combine five models using a voting method and pick the category with the most votes as the final prediction, a strategy of ensemble learning known as the Hard Voting Classifier. For a sample point \(x\), the five models make predictions respectively:
\(y_1 = \text{MinMax}(x)\)
\(y_2 = \text{ZScore}(x)\)
\(y_3 = \text{MaxAbsoluteValue}(x)\)
\(y_4 = \text{L1Norm}(x)\)
\(y_5 = \text{L2Norm}(x)\)
Place these five prediction results into a set \(Y = \{y_1,y_2,y_3,y_4,y_5\}\). The final prediction result \(y_{final}\) is the element that appears most frequently in the set \(Y\), mathematically represented as \(y_{final} = \text{mode}(Y)\).
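A sketch of this five-normalization hard-voting scheme is given below. The normalization functions follow the names above; the randomForest learner, the iris placeholder data, and the number of trees are assumptions made only to keep the example self-contained.

library(randomForest)

min_max <- function(x) (x - min(x)) / (max(x) - min(x))
z_score <- function(x) (x - mean(x)) / sd(x)
max_abs <- function(x) x / max(abs(x))
l1_norm <- function(x) x / sum(abs(x))
l2_norm <- function(x) x / sqrt(sum(x^2))
scalers <- list(min_max, z_score, max_abs, l1_norm, l2_norm)

X <- iris[, 1:4]; y <- iris$Species   # placeholder features and label

# One model per normalization of the feature columns
models <- lapply(scalers, function(f) {
  randomForest(x = as.data.frame(lapply(X, f)), y = y, ntree = 300)
})

# Hard voting: y_final = mode(y_1, ..., y_5)
preds <- sapply(seq_along(models), function(i) {
  as.character(predict(models[[i]], as.data.frame(lapply(X, scalers[[i]]))))
})
y_final <- apply(preds, 1, function(v) names(which.max(table(v))))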
Benefits/Advantages
Capable of performing both classification and regression tasks.
Capable of handling large datasets with high dimensionality.
Training can be parallelized across trees, often giving decreased training time compared to other ensemble algorithms.
Promotes predictive ability with high accuracy, even with large datasets.
Can still maintain accuracy, even when a large proportion of data is missing.
Enhances the robustness of the model and prevents issues in overfitting.
Can be used as a feature selection tool using its variable importance plot.
Limitations/Challenges
Decision trees are prone to problems, such as bias and overfitting.
However, when multiple decision trees form an ensemble in the random forest algorithm, they predict results with greater accuracy, particularly when the individual trees are uncorrelated with each other.
Time-consuming.
Since random forest algorithms can handle large data sets, they can provide more accurate predictions. However, the process can be slow as they are computing data for each individual decision tree.
Requires more resources.
More complex.
The decision logic of a single decision tree is easier to interpret when compared to a forest of them.
Assumptions
Independence of trees: Each tree is built using a random subset of features and a bootstrapped sample of the data, aiming to reduce correlation between trees.
Randomness: Random forests assume that randomly selected subsets of features (and observations) are used when building each tree; this randomness reduces the probability of overfitting and encourages tree diversity, while the feature values themselves must still carry enough signal for the classifier to produce accurate results.
Description
Data was extracted from the South African Red List database (“Red List of South African Plants,” n.d.) to compile a profile for plant extinctions. South Africa offers a wide array of biodiversity and is estimated to contain over 22,000 plant taxa. The International Union for Conservation of Nature’s (IUCN) (“The IUCN Red List of Threatened Species. Version 2023-1” 2023) Red List of Threatened Species provides a standardized method to document and assess extinctions (IUCN 2023). Species are classified into one of the following statuses: Extinct (EX), Extinct in the wild (EW), Critically endangered possibly extinct (CR PE), Critically endangered (CR), Endangered (EN), Vulnerable (VU), Near threatened (NT), Conservation dependent (CD), Least concern (LC), and Data deficient (DD).
Plants are an essential component to an ecosystem’s functionality, so it is critical to evaluate drivers of extinction and determine methods of prevention. To examine potential indicators for extinctions, extinct, threatened, and non-threatened taxa are compared to identify and/or distinguish traits that may be associated with risk or vulnerability. The final dataset comprises 842 extant taxa, 33 Extinct taxa, and 69 Possibly Extinct (CR PE) taxa, to total 944 species.
The table below organizes and summarizes the explanatory variables.
| Type | Variable | Description | Values |
|------|----------|-------------|--------|
| Binary | Habitat degradation | Alteration of natural habitats necessary for species survival resulting in reduced functionality, e.g., fire suppression, droughts. | (0, 1) |
| Binary | Invasive species | Impacts of alien species on natives through different mechanisms, e.g., alteration of soil chemistry, resource competition. | (0, 1) |
| Binary | Pollution | Pollutants entering the natural environment, e.g., air-borne pollutants, waste. | (0, 1) |
| Binary | Over-exploitation | Excessive use of species causing decreases in viable populations, e.g., overharvesting. | (0, 1) |
| Binary | Other | Intrinsic factors, changes in native taxa dynamics, human disturbance, natural disasters. | (0, 1) |
| Binary | Unknown | N/A | (0, 1) |
| Categorical | Life form (LF) | Annual or perennial. | (0, 1) |
| Categorical | Growth form (GF) | One of 14 distinct forms: Parasitic plant, Tree, Shrub, Suffrutex, Herb, Lithophyte, Succulent, Graminoid, Geophyte, Climber, Carnivorous, Cyperoid, Creeper, Epiphyte. | (1, 14) |
| Categorical | Biomes | One of nine biomes present in South Africa: Fynbos, Grassland, Succulent Karoo, Albany Thicket, Savanna, Forest, Nama Karoo, Desert, Indian Ocean Coastal Belt. A taxon found in multiple biomes was marked as generalist. | (1, 10) |
| Continuous | Range size | All range sizes are based on the standard measure of Extent of Occurrence (EOO), the shortest continuous imaginary boundary that can be drawn to encompass all the known, inferred, or projected sites of present occurrence of a taxon. | km² |
library(readr)
library(ggplot2)

data <- read_csv("All_threat_data.csv")

ggplot(data, aes(x = factor(Status), fill = factor(Status))) +
  geom_bar(show.legend = FALSE) +
  scale_fill_brewer(palette = "Paired") +
  labs(title = "Barplot of Status", x = "Status", y = "Frequency") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    plot.title = element_text(hjust = 0.5),
    axis.title = element_text(size = 14, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.major = element_line(color = "grey80"),
    panel.grid.minor = element_blank()
  )
The bar plot “Barplot of Status” shows how often the different statuses occur, with each bar representing a specific status category. The plot clearly shows that the dataset is seriously unbalanced: one category, ‘LC’, drastically outnumbers the others with a frequency above 500. This imbalance could bias any analysis or predictive modeling using these data, as models will likely favor the ‘LC’ status simply because it occurs so often. To tackle the imbalance, it may be necessary to use resampling techniques that boost the representation of underrepresented classes or reduce the dominance of the overrepresented class. Applying these strategies helps ensure that any insights or models derived from the analysis reflect a more balanced dataset, avoiding bias caused by imbalanced data. For this purpose, the ROSE package was chosen, a tool specifically designed for addressing imbalances in datasets. With its methods for generating synthetic data and resampling, ROSE makes it possible to achieve a fairer distribution of classes and thereby enhance the reliability and validity of the analytical outcomes.
The image is a ggcorrplot, a way of visualizing how different ecological factors such as “Pollution” and “Habitat_loss” are associated with each other. Each square in the heatmap shows the correlation between the variables on the x and y axes: a correlation of 1 means a perfect positive relationship, -1 a perfect negative relationship, and 0 no relationship at all. The colors represent the strength and direction of the correlations, with red shades indicating positive correlation, blue shades indicating negative correlation, and the intensity of the color showing how strong the correlation is; the diagonal is 1 because each variable is perfectly correlated with itself. The variables are ecological and biological, such as “Pollution”, “Over_exploitation”, “Habitat_loss”, and “Biomes”.
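The correlation heatmap described here was presumably produced with something like the following, where the ggcorrplot package and the exact column names (taken from the variables mentioned in the text) are assumptions that would need to match the real dataset.

library(ggcorrplot)

# Numeric 0/1 threat indicators plus the encoded Biomes column (assumed names)
threat_vars <- data[, c("Habitat_loss", "Pollution", "Over_exploitation", "Biomes")]
corr <- cor(threat_vars, use = "pairwise.complete.obs")

ggcorrplot(corr, lab = TRUE)   # red = positive, blue = negative, diagonal = 1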
Distribution of Range by Conservation Status
Code
ggplot(data = data, aes(x = Status, y = Range, fill = Status)) +
  geom_boxplot() +
  theme_bw() +
  ylim(0, 100000)
This boxplot shows the sizes of species’ ranges by conservation status, such as Critically Endangered (CR), Endangered (EN), and Vulnerable (VU). The vertical axis is the range size in square kilometers, while the horizontal axis is the conservation status of the observation. Each box shows the median, the range between the 25th and 75th percentiles, and any unusual values within each category. Among the categories, Endangered (EN) species show a wide variety of range sizes, while Extinct (EX) species have very limited ranges.
This bar chart shows how often different environmental factors are reported, with the vertical axis representing frequency and the horizontal axis listing the factors: Degradation, Exploitation, IAS (Invasive Alien Species), Loss, Other, Pollution, and Unknown. The colors indicate whether each factor is present (darker shade for Yes) or absent (lighter shade for No). This visualization allows a quick way to interpret which factors are most commonly reported.
Statistical Modeling
Packages
The R packages utilized for the statistical modeling with the Random Forest algorithm were readr, plyr, ipred, caret, caTools, randomForest, and ROSE.
readr: Part of the tidyverse, readr is designed for reading rectangular data, particularly CSVs (comma-separated values) and other delimited types of text files. It’s known for its speed and for providing more informative error messages compared to base R functions like read.csv. It also converts data into tibbles, which are a modern take on data frames.
plyr: This package is used for splitting, applying, and combining data. plyr is known for its capability to handle different data types (arrays, lists, data frames, etc.) and apply functions to each element of the split data, then combine the results. Note that plyr is largely superseded by dplyr (also part of tidyverse), which is more efficient especially for large datasets.
ipred: Standing for “Improved Predictors”, ipred provides functions for predictive modeling. It includes methods for bagging (Bootstrap Aggregating), which helps improve the stability and accuracy of machine learning algorithms, particularly for decision trees.
caret: The caret package (short for Classification And REgression Training) is a comprehensive framework for building machine learning models in R. It simplifies the process of model training, tuning, and predicting by providing a unified interface for various machine learning algorithms.
caTools: This package contains several tools for handling data, including functions for reading/writing Binary Large Objects (BLOBs), moving window statistics, and splitting data into training/testing sets. It’s often used for its simple and effective method for creating reproducible train/test splits.
randomForest: As the name suggests, this package is used for implementing the Random Forest algorithm for classification and regression tasks. Random Forest is an ensemble learning method that operates by constructing a multitude of decision trees and outputting the mode of the classes (classification) or mean prediction (regression) of the individual trees.
ROSE: Standing for Random OverSampling Examples, the ROSE package is used to deal with imbalanced datasets in binary classification problems. It generates synthetic samples in a two-class problem to balance the class distribution, using smoothed bootstrapping. This helps improve the performance of classification models on imbalanced datasets.
Data Preparation
Encode the 15 columns, with ‘group’ categorized into 1, 2, and 3. ‘Yes’ should be encoded as 1, ‘No’ as 0, and other categories should be numbered starting from 1. In the first column, replace the “-” with a “.”.
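One way the encoding just described might be carried out is sketched below; the specific column names (group, Biomes, and the Yes/No threat columns) are assumptions standing in for the real ones, and the handling of the first column reflects one reading of the instruction above.

library(readr)

data <- read_csv("All_threat_data.csv")

# Replace "-" with "." in the values of the first column
data[[1]] <- gsub("-", ".", data[[1]])

# 'group' -> 1, 2, 3; other categorical columns numbered from 1
data$group  <- as.integer(factor(data$group))
data$Biomes <- as.integer(factor(data$Biomes))

# Yes/No indicator columns -> 1/0 (assumed names)
yes_no_cols <- c("Habitat_loss", "Pollution", "Over_exploitation")
data[yes_no_cols] <- lapply(data[yes_no_cols], function(x) ifelse(x == "Yes", 1L, 0L))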
Split the dataset into a training set and a test set with a ratio of 7:3. Perform class imbalance handling on the training set using the ROSE library, aiming for an equal data quantity among classes, represented as 1:1:1.
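A 70/30 split of this kind can be made with caTools (listed above); the call that balances the training classes with ROSE is left out of the sketch because ROSE itself targets two-class problems and the exact handling of the three classes here is not specified.

library(caTools)

set.seed(123)
in_train <- sample.split(data$group, SplitRatio = 0.7)   # stratified 70/30 split on the label
train <- data[in_train, ]
test  <- data[!in_train, ]
# Class balancing of `train` (here done with the ROSE package) would follow at this point.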
Given a minority class sample point \(x\), we find its \(k\) nearest neighbors. Then, one neighbor is randomly selected, denoted as \(z\), and a new data point \(y\) is constructed which lies on the line segment between \(x\) and \(z\):
\(y = x + \lambda * (z - x)\)
Here, \(\lambda\) is a random number between 0 and 1.
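A direct transcription of this interpolation step is sketched below; it is meant to illustrate the formula, not to reproduce the internal sampling scheme of the ROSE package. Here `minority` is assumed to be a numeric matrix whose rows are the minority-class samples, with `x` one of those rows.

# y = x + lambda * (z - x), with z one of the k nearest minority-class neighbors of x
synth_point <- function(x, minority, k = 5) {
  d      <- sqrt(colSums((t(minority) - x)^2))   # Euclidean distance from x to each minority row
  nn     <- order(d)[2:(k + 1)]                  # k nearest neighbors, skipping x itself
  z      <- minority[sample(nn, 1), ]            # pick one neighbor at random
  lambda <- runif(1)                             # lambda ~ Uniform(0, 1)
  x + lambda * (z - x)
}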
Data Processing
Process the data by setting the first 14 columns as the features and the last column as the label.
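A sketch of the modeling step itself is shown below: the first 14 columns as features, the 15th as the label, a random forest fit on the balanced training set, and an evaluation on the held-out test set with caret's confusion matrix. The number of trees and the reliance on column positions are assumptions.

library(randomForest)
library(caret)

x_train <- train[, 1:14]
y_train <- as.factor(train[[15]])   # last column as the class label

set.seed(2024)
rf_fit <- randomForest(x = x_train, y = y_train, ntree = 500, importance = TRUE)

# Evaluate on the held-out test set
test_pred <- predict(rf_fit, newdata = test[, 1:14])
confusionMatrix(test_pred, as.factor(test[[15]]))   # accuracy, sensitivity, specificity per class

varImpPlot(rf_fit)   # variable importance, usable for feature selection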
Random Forest is a powerful and flexible machine learning algorithm that can be used for a wide range of tasks. It is particularly useful when dealing with complex data composed of a large number of features and when the goal is to achieve high predictive accuracy while avoiding overfitting. The algorithm incorporates versatility in its capabilities for classification and regression tasks, handling missing data, and displaying robustness when faced with outliers and noisy data.
We produced a predictive model with 93% accuracy, indicating that our explanatory variables were able to differentiate between non-threatened, threatened, and extinct taxa. Extinct species were classified with 100% specificity and 80% sensitivity. Most extinctions were perennial shrubs found in the Cape Floristic Region, a global biodiversity hotspot. As range was the strongest predictor of extinction, many of the recorded taxa deemed susceptible were range-restricted. Habitat loss was the second strongest variable of importance in predicting plant extinctions. Predictions were based on a quantitative, evidence-based approach, though gaps in knowledge highlighted areas for further study. Improved species monitoring and documentation of threat factors will aid in a deeper understanding of the ecological role and value of South African plant species.
References
Biau, Gérard. 2012. “Analysis of a Random Forests Model.” The Journal of Machine Learning Research 13: 1063–95.
Biau, Gérard, and Erwan Scornet. 2016. “A Random Forest Guided Tour.” Test 25: 197–227.
Huang, Nantian, Guobo Lu, and Dianguo Xu. 2016. “A Permutation Importance-Based Feature Selection Method for Short-Term Electricity Load Forecasting Using Random Forest.” Energies 9 (10): 767.
Jaiswal, Jitendra Kumar, and Rita Samikannu. 2017. “Application of Random Forest Algorithm on Feature Subset Selection and Classification and Regression.” In 2017 World Congress on Computing and Communication Technologies (WCCCT), 65–68. IEEE.
Khan, Zardad, Saeed Aldahmani, and Amjad Ali. 2024. “Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data.” arXiv. https://arxiv.org/abs/2401.12667.
Liaw, Andy, Matthew Wiener, et al. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22.
Lingjun, He, Richard A. Levine, Juanjuan Fan, Joshua Beemer, and Jeanne Stronach. 2019. “Random Forest as a Predictive Analytics Alternative to Regression in Institutional Research.” Practical Assessment, Research, and Evaluation 23 (1): 1.
Matsumoto, Aoki, and Ohwada. 2016. “Comparison of Random Forest and SVM for Raw Data in Drug Discovery: Prediction of Radiation Protection and Toxicity Case Study.” International Journal of Machine Learning and Computing 6 (2): 1–4. https://www.ijmlc.org/vol6/589-L031.pdf.
Ouali, Rabah, Pascal Yim, and Jean-Yves Dieulot. 2024. “Machine Learning Classification of Power Converter Control Mode.” https://arxiv.org/abs/2401.10959.
Paul, Angshuman, Dipti Prasad Mukherjee, Prasun Das, Abhinandan Gangopadhyay, Appa Rao Chintha, and Saurabh Kundu. 2018. “Improved Random Forest for Classification.” IEEE Transactions on Image Processing 27 (8): 4012–24.
Probst, Philipp, and Anne-Laure Boulesteix. 2018. “To Tune or Not to Tune the Number of Trees in Random Forest.” Journal of Machine Learning Research 18 (181): 1–18.
Segal, Mark R. 2004. “Machine Learning Benchmarks and Random Forest Regression.”
“The IUCN Red List of Threatened Species. Version 2023-1.” 2023.
Uddin, Md Taufeeq, and Md Azher Uddiny. 2015. “A Guided Random Forest Based Feature Selection Approach for Activity Recognition.” In 2015 International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), 1–6. IEEE.
Vens, Celine, and Fabrizio Costa. 2011. “Random Forest Based Feature Induction.” In 2011 IEEE 11th International Conference on Data Mining, 744–53. IEEE.
Voges, Lucas F., Stephan Seifert, and Lukas C. Jarren. 2023. “Opening the Random Forest Black Box by the Analysis of the Mutual Impact of Features.” https://arxiv.org/abs/2304.02490.