One of the most common mistakes data scientists make when training machine learning models is splitting data incorrectly for training and testing. A train/test split sets aside part of the data before training so that the model can be evaluated on examples it has never seen. Usually the data is divided into two parts:

  1. Training data set – The data used to train the model
  2. Testing data set – The hold-out data used to test the performance of the model

Typically we reserve 70% of the data for training and 30% for testing. This ratio can vary and should be adjusted depending on the volume of data, the kinds of models under consideration, and the purpose of the modeling.
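
To make the 70/30 split concrete, here is a minimal sketch in plain Python. The helper name `split_train_test`, its parameters, and the toy data are assumptions made for illustration, not part of any particular library. In practice a library routine such as scikit-learn's `train_test_split` is commonly used for the same purpose.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle rows and split them into (train, test) lists.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")

    # Copy so the caller's data is left untouched, then shuffle with a
    # seeded generator so the split is reproducible.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    # The first n_test shuffled rows become the hold-out set.
    n_test = round(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    train = shuffled[n_test:]
    return train, test


# Toy data: 100 rows, assumed here purely for demonstration.
data = list(range(100))
train, test = split_train_test(data)
print(len(train), len(test))  # 70 30
```

Shuffling before splitting matters when the rows are stored in a meaningful order, and fixing the seed keeps the same rows in the test set from run to run. For time-ordered data a random shuffle may not be appropriate, so treat this as a starting point and not a general recipe.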
