If you’re in traditional machine learning, you are likely facing an ongoing battle with missing values (NaN) in your data.
This is the case in just about every Kaggle competition as well as in most of the tabular datasets out there.
Here I will go over all the viable approaches, provide the code in python and add a cherry on top at the end of this paper.
If you’re eager to jump into coding as you read, don’t. Finish reading first.
Having read this article you will have dealt with this problem once and for all.
Spoiler: in this article…
The purpose of this article is to fix the sometimes incorrect stratification of train / valid splits so that multiclass classification models can be evaluated without an error.
Sklearn has great inbuilt functions to either preform a single stratified split
from sklearn.model_selection import train_test_split as splittrain, valid = split(df, test_size = 0.3, stratify=df[‘target’])
This will produce two splits where train will be 70% of the df and validation set will be 30% respectively with almost the same distribution of the ‘terget’ variable in both splits
In case one needs to evaluate a result of some function or a model…
Suppose you have only one validation set available for model validation. This is not the usual case, but it happens now and then.
In this case your objective would be to train the model on the train data to provide the best possible result (metric) on the only validation set.
We will leave data preprocessing / feature engineering & selection out of the scope of this article and focus only on model parameter tuning scheme.
So you’ve got your train & validation sets prepared.
With validation strategy development out of the way (we’ve got a finite dataset for validation), next…
In this short article I would like to elaborate on the practical aspect of splitting a dataset into train and test sets stratified by continuous (numeric) target variable with implementation example in python
Additionally I will post a solution how to go about the inability of sklearn.model_selection.train_test_split to create stratified splits based on categoric variable, if there are not enough examples of certain class for the stratified split
As usual below is a spoiler to a one line solution for those who are in a hurry to split some continuous data, otherwise read along for details and some more cool…
Applied Machine Learning Engineer