Advanced categoric encoding like a pro: OneHot, MeanTarget, WOE, Frequency, Factorization

Danil Zherebtsov
6 min read · Dec 9, 2021

A common task in most traditional data science projects is transforming categorical variables.

Categorical variables are usually stored as text values, e.g. [‘red’, ‘green’, ‘blue’], which need to be converted to a meaningful numeric equivalent before most models can use them.

There are several ways to do this, and the right choice often depends on the type of model you plan to train.

In this article I will describe five widely used options, with explanations and code in Python. We will be working with the Titanic dataset from Kaggle. Download it to follow along with the examples.

import pandas as pd
train = pd.read_csv('titanic/train.csv')
test = pd.read_csv('titanic/test.csv')
train.head()

Okay. All the examples that follow encode the categorical column ‘Embarked’, which has 3 distinct categories: ‘S’, ‘C’ and ‘Q’. It also contains missing values.
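Before encoding anything, it helps to confirm what the column actually contains. Here is a minimal sketch of that check. To keep it runnable without the CSV, it uses a small made-up Series named `embarked` as a stand-in (an assumption for illustration, not the real data). With the dataset loaded you would run the same calls on `train['Embarked']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train['Embarked']; the values are made up for illustration.
embarked = pd.Series(['S', 'C', 'Q', 'S', np.nan, 'S', 'C'], name='Embarked')

# dropna=False keeps missing values as their own row in the counts.
counts = embarked.value_counts(dropna=False)

# Count missing values and distinct categories separately.
n_missing = int(embarked.isna().sum())
n_categories = embarked.nunique()  # nunique() ignores NaN by default

print(counts)
print('missing:', n_missing, '| distinct categories:', n_categories)
```

Note that `value_counts()` silently drops missing values unless you pass `dropna=False`, so it is easy to overlook them when inspecting a column this way.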

Let’s go over the options one by one.

OneHotEncoding
