One of the most common mistakes data scientists make when training machine learning models is splitting data incorrectly for training and testing. The train/test split is the step in the model training and evaluation process where the available data is divided into two parts:

  1. Training data set – The data used to train the model
  2. Testing data set – The held-out data used to evaluate the performance of the model

Typically we reserve 70% of the data for training and 30% for testing. This ratio can vary and should be adjusted depending on the volume of data, the kinds of models under consideration and the purpose of modeling.
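As a minimal sketch of the 70/30 split described above, here is how it is commonly done with scikit-learn's train_test_split (the toy arrays are made up for illustration):

```python
# A 70/30 random split using scikit-learn. With test_size=0.3,
# 30% of the rows are held out for testing.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 rows of toy feature data
y = np.array([0, 1] * 5)          # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```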

Splitting Data With Care

If the train/test split is not done with care, it’s easy to over-fit the model. A full explanation of overfitting is beyond the scope of this post, but it essentially means tuning the model so tightly to the training data that it fails to generalize to new data.

One scenario to look out for when splitting data into training and test sets is predicting user behavior. A common example is when the same static record shows up in multiple rows, each representing a different scenario to model.

To illustrate this, we use obfuscated data from users streaming through a website checkout process. UserId denotes a unique user with some demographic information. WeekId is the week of the year, which changes over time. The features we use are denoted with the Feature_* prefix.

Example of train-test split.

It should be obvious that the demographic data (shown in red) is fixed, while the features we engineer with Teraport vary on a week-by-week basis. The target column to predict, in this case whether a purchase was made or not, is Week4_Outcome.

Many machine learning models, especially nonlinear ones like random forests, pick up on static features in the data, memorize them and produce great training results. This is especially true if the same static value appears in both the training and test sets.

The key to overcoming this challenge: when splitting the data, each unique value (UserId in our case) should appear only in the training set or the test set, but never in both.
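A grouped split like this can be sketched with scikit-learn's GroupShuffleSplit. The column names mirror the checkout example above; the values are made up for illustration:

```python
# Keep all rows for a given UserId on exactly one side of the split
# by passing UserId as the grouping key.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "UserId":        [1, 1, 2, 2, 3, 3, 4, 4],
    "WeekId":        [1, 2, 1, 2, 1, 2, 1, 2],
    "Feature_A":     [0.2, 0.4, 0.1, 0.3, 0.9, 0.8, 0.5, 0.6],
    "Week4_Outcome": [0, 0, 1, 1, 0, 0, 1, 1],
})

splitter = GroupShuffleSplit(test_size=0.25, random_state=7)
train_idx, test_idx = next(splitter.split(df, groups=df["UserId"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# No user appears on both sides of the split.
assert set(train["UserId"]).isdisjoint(test["UserId"])
```

A plain random split over these rows would almost certainly place the same user's static values in both sets; grouping by UserId prevents that leak.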

Learner Makes Data Training Simple

Learner makes this simple with a single parameter selection during the model building process. It’s also simple to set the percentage split between training and testing data for each model trained. 

In the example below, we split data from an edtech data set which has week-by-week student academic performance. The goal is to learn and predict the binary classification, label_student_failed.

The unique student is identified by the student_id column. We use the “Group by columns for train/test split” parameter within the Learner model training screen to specify one or more columns to group the records by. Automagically, each group lands in either the training or test set – but not both.

Training a model in Learner.

After training the model with the grouped student_id, Learner splits the target column labels nearly equally between the training and test sets. This can be explored in Learner’s Model Diagnostics and Model Compare summaries:

Model diagnostics in Learner.

Learner also allows for time series splitting and stratified splitting to handle other train/test split scenarios with care. We’ll cover those in future posts.
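For reference, both strategies have direct analogues in scikit-learn; this sketch (with made-up data) shows a stratified split that preserves label proportions and an ordered time-series split:

```python
# Two other splitting strategies: stratified (preserves the label ratio
# on both sides) and time series (test data always comes after training data).
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)  # imbalanced labels: 80% / 20%

# Stratified: both halves keep the 80/20 label ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())  # 0.2 0.2

# Time series: each test fold comes strictly after its training fold.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()
```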

Happy modeling 🙂

About Loominus

Loominus is an end-to-end platform that helps teams ingest and stage data, build advanced machine learning models with no code and deploy them into production. Loominus makes it easy for individuals and teams without experience building machine learning pipelines to take advantage of machine learning faster. Loominus is equally great for experienced data scientists who need to focus on model selection and tuning.

Get early access to Loominus

Help your business achieve machine learning success