Random Forest Models: A Summary of Ensemble Learning and Decision Trees
A powerful technique that organizations are adopting is the Random Forest methodology for classifying and predicting their data. Let’s walk through what it means when you hear about a random forest model.
This approach is a form of ensemble modeling, which means it takes multiple models and combines them. That approach has been shown to create more accurate results. In a random forest model, the ensemble is created by running multiple decision trees and combining their results.
Today we’re going to cover what it means to classify your data, how decision trees and random forest models work, and finally, why your model will sometimes get it wrong.
Classifying Your Data
Random forest models and the decision trees that underpin them exist to classify large amounts of data into subgroups that can be acted upon. Put another way, classification makes big groups smaller so that you can treat those smaller groups differently.
Imagine you’re a bank with millions of customers. Half of those customers only have a savings account and the other half only have a checking account. You want everyone to have both, so you need to market to them. You could send everyone an offer about checking (or savings) accounts, but for half your customers this offer is pointless. You could have one marketing message that covers both checking and savings, but now your message is less clear, which is also not ideal.
In this case a basic classification model would divide your customers into two groups based on account type – Checking Account Customers and Savings Account Customers. Now you can market savings accounts specifically to the customers who don’t already have one and checking accounts specifically to the customers who don’t already have one of those.
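As a sketch, that basic classification is nothing more than splitting records on one field. The customer names and records below are made up purely for illustration:

```python
# Hypothetical customer records: each customer has exactly one account type.
customers = [
    {"name": "Ana",   "account_type": "checking"},
    {"name": "Ben",   "account_type": "savings"},
    {"name": "Carla", "account_type": "savings"},
    {"name": "Dev",   "account_type": "checking"},
]

# Split customers by the account they already have, so marketing can
# offer each group the account they are missing.
checking_only = [c["name"] for c in customers if c["account_type"] == "checking"]
savings_only  = [c["name"] for c in customers if c["account_type"] == "savings"]

print(checking_only)  # these customers get the savings-account offer
print(savings_only)   # these customers get the checking-account offer
```

Real models get interesting when the split can’t be read straight off a single field, which is where the next example comes in.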
That’s pretty easy, so now imagine a more complex example. You run a retail chain that sells outdoor gear for fishing, camping and hunting. In this scenario your customers might overlap or might not: just because someone fishes doesn’t mean they also camp, but some will. A more complex classification model could create groups of customers based on their activities. This helps your business serve them better, because a camper who doesn’t fish won’t be interested in fishing rods but will be open to hearing about camp stoves.
Decision Trees Are Just Classification Models
A decision tree is a model that takes a set of data and creates a series of questions that can be asked, in order, to try to classify the data. Once the model is built, you can run new data through the decision tree and figure out how to classify that data.
Each question is a decision point and as you have decision points leading to more decision points an image of the model branches out wider and wider…like a tree. (For what it’s worth, most visuals of decision trees go top-down, meaning they are wider at the bottom, so it’s more like the roots of the tree, but that’s just being picky!)
A Hypothetical Decision Tree
Let’s design a decision tree that answers the question: ‘Does a student on a college campus play on the men’s basketball team?’. The result we want is to be able to pick a random student and decide ‘Yes’ or ‘No’ as to whether the student is on the team.
Let’s also imagine we have data for each student’s age, height, gender, weight and class schedule, and whether or not they are on the team. Without worrying about how to code something like this, just think about the decisions you could make to get more accurate.
- You can start with gender. There are no women on the men’s team, so any student whose gender is ‘Female’ can be classified as ‘N’, and we can process the remaining data.
- Perhaps class schedule is next. All the basketball players have to be free between 3-5pm for practice, so if a student has a class at 4pm you can likely put him into the ‘N’ classification.
- But just being free between 3-5 doesn’t make you a basketball player. Basketball players are generally tall, so if the person is tall you’d assume they’re more likely to be on the team.
- Lastly, let’s use weight, or more specifically we’ll calculate Body Mass Index, which is weight divided by height squared. You know that basketball players are physically fit, so any students with a high BMI are likely not on the team.
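The four steps above can be sketched as one function. Every threshold here (the 6’3″ height cutoff, the BMI cutoff of 27) is a made-up value for illustration, not something a real model produced:

```python
def plays_mens_basketball(gender, has_class_3_to_5, height_in, weight_lb):
    """Hypothetical hand-built decision tree; all thresholds are illustrative guesses."""
    if gender == "Female":
        return "N"                # no women on the men's team
    if has_class_3_to_5:
        return "N"                # can't make the 3-5pm practice
    if height_in < 75:            # under 6'3" -- assumed "not tall"
        return "N"
    bmi = 703 * weight_lb / height_in ** 2   # standard BMI formula in imperial units
    if bmi > 27:                  # assumed cutoff for "physically fit"
        return "N"
    return "Y"

# A 6'7", 200 lb student with a free afternoon:
print(plays_mens_basketball("Male", False, 79, 200))
```

Notice how each `if` is one decision point, and the order of the questions matters: cheap, decisive splits (gender, schedule) come before fuzzier ones (height, BMI).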
Will That Work?
If you started using those four questions to decide if someone plays basketball, will you get the right answer? The short answer is…sometimes, but that’s not quite the right question. A better question is ‘How well does that work’?
No model is perfect and there are always going to be false positives and false negatives. Instead, the model’s goal is to get close enough that the results are useful. How close you need to be depends on your specific work and the risk of getting it wrong. Medicine is more strict than retailing and you can imagine why.
This is also where the random forest methodology comes in. Look back at the third level of our basketball player decision tree. We said if someone were ‘tall’ they are more likely to be a basketball player. But what height makes someone ‘tall’? 7’1″? 6’5″?
You can pick a height based on your own judgement and plug that into the model, but will you pick the value for height that is most predictive? This is the problem that random forest solves. Specifically, the random forest method will run a decision tree for lots of different heights – 5’11”, 6’0″, 6’1″, etc – and decide which height is the most predictive.
It will do this for your other variables too. What’s the right BMI? Is a schedule that’s open from 3-5 the right block of time? Or is 2:30-5:15 a better predictor? The random forest method will try different possibilities and decide which combination of values is most likely to predict that someone is on the basketball team.
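Here’s a minimal sketch of that threshold search in plain Python, using a handful of made-up student heights. Tree-building libraries do this search automatically, and true random forests additionally randomize which data and which variables each tree sees:

```python
# Toy data: (height in inches, on_team) for a few hypothetical students.
students = [(70, False), (72, False), (74, False), (76, True),
            (77, True), (78, False), (80, True), (82, True)]

def accuracy(threshold):
    """How often does 'height >= threshold means on the team' get it right?"""
    correct = sum((h >= threshold) == on_team for h, on_team in students)
    return correct / len(students)

# Try many candidate heights and keep the most predictive one.
best = max(range(68, 85), key=accuracy)
print(best, accuracy(best))
```

Even the best threshold misclassifies the 6’6″ non-player in this toy data, which previews the next point: no single split, and no model, is perfect.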
It’s still not perfect
When your model is done it’s going to give you a list of values for your four variables that, when true, make it likely that someone is on the men’s basketball team. If you measure those values for a random student, you can plug them into the model and get a prediction. But even now the prediction won’t always be right!
In my experience this is the nuance where data science most quickly loses support among general users. There’s a belief that a prediction model needs to be almost always right to be valuable. If a model might mis-classify some data…then the model is not good and probably shouldn’t be used (their thinking goes).
Rather than asking if your model is perfect, ask whether it’s better than what you’re currently doing, and what the risks are when it gets it wrong.
To see what I mean, think back to the example above of a retail store that sells camping, hunting and fishing gear. Your company uses no models, so when you market to a potential customer you show them everything – the tents, the camouflage jackets and the fishing poles. All of it. When your customer sees it, they only care about roughly a third of it.
Now your data scientist builds a model that can correctly predict whether a potential customer camps, hunts, fishes (or engages in some combination). Using a random forest model, the data scientist looks at past purchases and determines how much money a customer must spend in a category to be classified into it.
It’s correct 80% of the time, which means it’s wrong 20% of the time. So is it better to make advertising that’s specific to each activity, knowing that 20% of the time you’ll show the wrong one? Or is it better to show everything, knowing that 2/3 of your ad will be irrelevant?
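A back-of-the-envelope comparison, using the made-up numbers above and the simplifying assumption that a wrong targeted ad is entirely irrelevant:

```python
# Generic ad: always shown, but only about 1/3 of its content is relevant.
generic_relevance = 1 / 3

# Targeted ad: fully relevant 80% of the time, irrelevant the other 20%.
targeted_relevance = 0.80 * 1.0 + 0.20 * 0.0

print(generic_relevance, targeted_relevance)
```

On average the targeted ad is far more relevant, but the averages hide the risk side: the generic ad is never *completely* wrong, while the targeted ad sometimes is. Which trade-off you prefer depends on the cost of that worst case.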
That’s a tricky question and it can’t really be answered in this article. Generally having more information will lead to better results than treating all groups the same. In another article we will talk about testing and how you prove your predictions have value in the real world.
But that will come another day. For now, let’s review what you’ve learned today.
Summing it all up
- Classification models are just ways to take big groups and put them into smaller groups based on similarities in the data.
- Decision trees are a series of steps that are followed to create a classification.
- Random forests are a bunch of decision trees that are collectively used to create a better classification.
- After your random forest creates the rules, you can use it to predict and classify new data as you get it.
- It won’t always be right. You’ll need to use your judgement and domain expertise to decide if it’s better than what you already do, and what the risks are for the cases where it’s wrong.