Continuous data stratification

Danil Zherebtsov
6 min read · Jan 30, 2019
Ben Miners/Getty Images

In this short article I would like to elaborate on the practical aspects of splitting a dataset into train and test sets stratified by a continuous (numeric) target variable, with an implementation example in Python.

Additionally, I will show how to work around the inability of sklearn.model_selection.train_test_split to create a stratified split on a categorical variable when some classes have too few examples.
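To make the rare-class problem concrete: as far as I know, train_test_split raises a ValueError when the array passed to stratify contains a class with a single member. The snippet below is a minimal sketch of one possible workaround, not necessarily the approach used later in this article. It pools under-populated classes into a shared placeholder label that is used for stratification only. The helper name stratify_labels, the min_count threshold and the "__rare__" label are assumptions made for illustration.

```python
from collections import Counter


def stratify_labels(labels, min_count=2, other="__rare__"):
    """Return a copy of `labels` in which every class with fewer than
    `min_count` rows is replaced by the placeholder `other`.

    The result is meant to be used only as the stratification key;
    the original labels stay untouched for modeling.
    """
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


# Hypothetical labels: classes "c" and "d" have a single example each.
labels = ["a"] * 5 + ["b"] * 4 + ["c"] + ["d"]
strat_key = stratify_labels(labels)

# The pooled key has no singleton classes left. Note that if only one
# rare row existed in total, the placeholder class would itself be a
# singleton and would still need separate handling.
key_counts = Counter(strat_key)
```

The pooled key could then be passed as the stratify argument while the original labels remain the target, so the rare rows are still distributed between train and test instead of breaking the split.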

As usual, below is a spoiler with a one-line solution for those in a hurry to split some continuous data; otherwise, read on for the details and some more cool stuff at the end of this article.

Why is this interesting:

  • there are multiple ready-to-use methods for splitting a dataset into train and test sets for model validation, and they offer a way to stratify by a categorical target variable, but none of them can stratify a split by a continuous variable
  • the provided code, apart from being useful within a modeling pipeline, shows how to store multiple train/test splits for other purposes, for example to export them to other analytics tools
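To illustrate what stratifying by a continuous target means, here is a minimal standard-library sketch, assuming equal-frequency binning: rows are ranked by the target, cut into bins of similar values, and the test share is sampled from every bin. It is not the one-line solution promised above, and the function name stratified_continuous_split and its parameters are hypothetical.

```python
import random


def stratified_continuous_split(y, test_size=0.2, n_bins=10, seed=0):
    """Return (train_idx, test_idx) so that the test rows are spread
    evenly across the whole range of the numeric target `y`."""
    rng = random.Random(seed)
    # Rank row indices by target value so neighbours have similar targets.
    order = sorted(range(len(y)), key=lambda i: y[i])
    n_bins = max(1, min(n_bins, len(order)))
    train_idx, test_idx = [], []
    for b in range(n_bins):
        # Equal-frequency bin: a contiguous slice of the ranked indices.
        start = b * len(order) // n_bins
        stop = (b + 1) * len(order) // n_bins
        bin_idx = order[start:stop]
        rng.shuffle(bin_idx)
        # Take the same share of every bin for the test set.
        n_test = round(len(bin_idx) * test_size)
        test_idx.extend(bin_idx[:n_test])
        train_idx.extend(bin_idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical skewed target: 1000 draws from an exponential distribution.
data_rng = random.Random(42)
y = [data_rng.expovariate(1.0) for _ in range(1000)]
train_idx, test_idx = stratified_continuous_split(y, test_size=0.2, n_bins=10)
```

Because every bin contributes the same share of rows, the long tail of a skewed target is represented in both sets. With pandas and scikit-learn available, a similar effect can be had by binning the target with pandas.qcut and passing the bin labels as the stratify argument of train_test_split.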
