Automatic memory optimization when reading data into pandas
5 min read · Dec 19, 2022
This post starts with a short problem statement, followed by a robust online solution.
The problem is as old as the machine learning world: memory constraints eventually become a bottleneck in Data Science research.
Consider this:
- you are reading a 1 GB file into a pandas.DataFrame
- then you create several copies of the data with different kinds of preprocessing… load on RAM
- then you might do some feature engineering… load on RAM
- finally you might train different models on the data with multiprocessing, which passes a copy of the unoptimized DataFrame object into each process… load on RAM multiplied by the number of processes
- eventually you run out of memory, and if your machine cannot swap to an SSD (which is in any case much slower than having enough RAM), your program fails (a quick way to inspect a DataFrame's footprint follows this list)
The underlying root cause is simple: by default, pandas (via NumPy) loads every numeric column with the widest available dtype, allocating the maximum possible memory slot for each value. So even though your binary column only holds values from the (0, 1) range, pandas will reserve the int64
data type for it, whereas the minimum and maximum values for int64
are:
- minimum: -9,223,372,036,854,775,808
- maximum: 9,223,372,036,854,775,807
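You can check those bounds yourself, and preview the downcasting idea that motivates the automated solution in this post. The sketch below is illustrative (the column name is made up); pd.to_numeric with the downcast argument picks the smallest dtype that still fits the data:

```python
import numpy as np
import pandas as pd

# int64 can hold values far beyond what a (0, 1) column needs
print(np.iinfo(np.int64))  # min = -9223372036854775808, max = 9223372036854775807
print(np.iinfo(np.int8))   # min = -128, max = 127: plenty for binary data

df = pd.DataFrame({"binary_flag": np.random.randint(0, 2, size=1_000_000)})
before = df.memory_usage(deep=True).sum()

# downcast="integer" selects the smallest integer dtype that fits the values
df["binary_flag"] = pd.to_numeric(df["binary_flag"], downcast="integer")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1024 ** 2:.1f} MB -> {after / 1024 ** 2:.1f} MB")  # roughly 8x smaller
```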