How Does Random Forest Work?

Do you get depression over regression? Are you lost in some random forest of decision trees, to the extent that you cannot even see the forest for the trees? If a decision tree falls in a random forest with nobody around to hear it, what kind of sound does it make?

If these and other questions are keeping you up at night, worry no further. Your regression problems are our classification problems and we will not leave you in a logistic lurch. So, let us start by answering the age-old question, how does random forest work?

Random Forest Model


The random forest algorithm is a supervised machine learning algorithm. It builds an ensemble of decision trees and combines their predictions, which keeps variance low without driving up bias. The idea behind a random forest is that a single decision tree is not reliable on its own. A decision tree is a simple supervised learning algorithm for regression or classification and consists of three parts:

  • Root node
  • Predictor
  • Leaf node

The root node represents the full dataset and the problem the algorithm is trying to solve. A predictor node is a split within the tree into different “decisions,” based on the value of one feature. A leaf node is the end of a branch, where there are no more splits and the tree outputs its prediction.
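To make the three parts concrete, here is a minimal sketch using scikit-learn (an assumption on my part; the article names no library). The root node asks the first question, the predictor nodes are the splits below it, and the leaf nodes return the final class:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load a small toy dataset: 150 iris flowers, 4 features, 3 species.
X, y = load_iris(return_X_y=True)

# A shallow tree: one root split, a few predictor splits, then leaves.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each sample travels root -> predictor splits -> leaf, and the leaf's
# majority class becomes the prediction.
print(tree.predict(X[:1]))
```

With `max_depth=2`, the whole root-predictor-leaf structure fits on one screen; printing `sklearn.tree.export_text(tree)` shows it as nested if/else questions.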

However, the problem with using one individual tree is that it can have high variance, high bias, or even both at once, which leads to prediction error on data the tree has never seen. This problem can be eased by training a large number of decision trees and combining them into a more accurate prediction than you would get from any individual tree. Each tree is trained on its own sample of the training data, and the forest predicts by majority vote (for classification) or by averaging (for regression). Thanks to the power of the majority vote, the combined prediction has less variance than a single tree's.

The trees only help each other if they disagree sometimes, so each tree is trained on a different random sample of the training set (and, at each split, typically considers only a random subset of the features). Thus, the name random forest.
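The whole forest-of-voting-trees idea can be sketched in a few lines with scikit-learn's `RandomForestClassifier` (dataset and parameters here are illustrative, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# A synthetic classification problem stands in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each fit on a different random sample of the training set;
# the final prediction is the majority vote across all of them.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # accuracy on held-out data
```

Comparing this score against a single `DecisionTreeClassifier` fit on the same split is a quick way to see the variance reduction for yourself.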

Bootstrap Aggregation


For even greater accuracy and lower variance in your models, random forests lean on an algorithm known as bootstrap aggregation, a.k.a. bagging, or the bagged trees algorithm. Bagging is the sampling scheme at the heart of the random forest; the forest then makes things even more random by also choosing a random subset of features at each split.

Bootstrap aggregation involves drawing many bootstrap samples, random samples taken with replacement, from the same training set and fitting one decision tree to each sample. Because every tree sees a slightly different version of the data, their individual errors are less correlated, and aggregating their predictions cancels out much of the noise.
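To show what "sampling with replacement" actually means, here is a hand-rolled bagging sketch (numbers of trees and samples are arbitrary choices for the demo, not anything the article prescribes):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)

trees = []
for _ in range(25):
    # Bootstrap sample: n indices drawn WITH replacement, so some rows
    # appear several times and others not at all.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate: majority vote across the 25 bagged trees.
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_pred == y).mean())  # ensemble accuracy on the training set
```

In practice you would reach for `sklearn.ensemble.BaggingClassifier`, which wraps this exact loop; spelling it out just makes the with-replacement sampling visible.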

The best advice? Clean your data! Your dataset is only as good as your test set and your test set is only as good as your training set. Your training set is only as good as the data you feed it. If you live a life of dirty data, sooner or later you and your datasets will pay the price.

Artificial Intelligence Today


As we enter the third decade of the 21st century with a global pandemic and a new motivation for digitization, we need predictive models, data analysis, machine learning, and deep learning more than ever.

Machine learning is still considered weak artificial intelligence, but we will be able to create strong artificial intelligence soon enough if SpaceX founder and Tesla entrepreneur Elon Musk is correct. While regular machine learning may not be as popular as deep learning, and supervised learning may not be as romantic as unsupervised learning and reinforcement learning, the basics are important. Right now, decision trees and supervised learning are the foundation of the artificial intelligence we have, and mastering the random forest model is a solid first step toward mastering them.