Continuous data stratification
In this short article I elaborate on the practical side of splitting a dataset into train and test sets stratified by a continuous (numeric) target variable, with an implementation example in Python.
I will also show a workaround for the inability of sklearn.model_selection.train_test_split to create a stratified split on a categorical variable when some class has too few examples.
As usual, a one-line solution follows below as a spoiler for those in a hurry to split some continuous data; otherwise, read on for the details and some more cool stuff at the end of the article.
Why is this interesting:
- there are multiple ready-to-use methods for splitting a dataset into train and test sets for model validation that can stratify by a categorical target variable, but none of them stratifies a split by a continuous variable directly
- the provided code, apart from being useful within a modeling pipeline, shows how to store multiple train/test splits for other purposes, for example to export them to other analytics tools
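To make the first point concrete, here is a minimal sketch of one common way to approximate stratification on a continuous target: cut the target into quantile bins and pass the bin labels to `stratify`. This is an illustration under stated assumptions, not necessarily the solution developed in this article. The synthetic data, the choice of 10 bins, and the variable names (`y_bins`, `bins_train`, `bins_test`) are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data, purely illustrative: 3 features and a skewed continuous target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.exponential(scale=10.0, size=1000)

# Assumption: 10 quantile bins is a reasonable granularity for this sample size.
# labels=False returns integer bin codes instead of interval labels.
y_bins = pd.qcut(y, q=10, labels=False, duplicates="drop")

# Stratify on the bin codes, not on the raw continuous values.
# The bin codes are split alongside X and y so the result can be inspected.
X_train, X_test, y_train, y_test, bins_train, bins_test = train_test_split(
    X, y, y_bins, test_size=0.2, stratify=y_bins, random_state=0
)

# Each bin should be represented in the test set in roughly equal proportion.
print(np.bincount(bins_test))
```

The number of bins is a trade-off: more bins follow the target distribution more closely, but every bin still has to contain enough rows to be divided between train and test.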