Automatic memory optimization when reading data into pandas

Danil Zherebtsov
5 min read · Dec 19, 2022

This post starts with a short problem statement, followed by a robust one-line solution.

The problem is as old as the machine learning world itself: memory constraints eventually become a bottleneck in data science research.

Consider this:

  • you are reading a 1GB file into a pandas.DataFrame
  • then you are creating different copies of the data with different kinds of preprocessing… load on RAM
  • then you might be doing some feature engineering… more load on RAM
  • finally you might be training different models on the data with multiprocessing, which will pass a copy of the unoptimized dataframe object into each process… load on RAM multiplied by the number of processes
  • eventually you run out of memory, and if your machine cannot swap to an SSD (which is in any case much slower than having enough RAM), your program fails; the sketch below shows how to check this footprint
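
To make this concrete, here is a minimal sketch (the column names and sizes are invented for illustration) that inspects a frame's real memory footprint using pandas' built-in accounting:

```python
import numpy as np
import pandas as pd

# A hypothetical 1M-row frame, built the way pandas reads files by default:
# every numeric column lands in a 64-bit dtype, whatever its value range.
df = pd.DataFrame({
    'binary_flag': np.random.randint(0, 2, 1_000_000, dtype='int64'),
    'small_count': np.random.randint(0, 100, 1_000_000, dtype='int64'),
    'ratio': np.random.rand(1_000_000),  # float64 by default
})

# memory_usage(deep=True) reports the true per-column footprint in bytes
print(df.memory_usage(deep=True))
print(f'total: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB')
```

Every copy made during preprocessing, feature engineering, or multiprocessing duplicates that total.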

The underlying root cause is simple: by default, pandas loads every numeric column into the widest available data type, allocating the maximum possible memory slot for each value. So even though your binary column only contains the values 0 and 1, pandas will reserve the int64 data type for it, whereas the minimum and maximum values for int64 are:

  • minimum: -9,223,372,036,854,775,808
  • maximum: 9,223,372,036,854,775,807
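
This is where downcasting comes in: pandas can shrink a column to the smallest dtype that still holds all of its values. A minimal sketch of doing this manually with pd.to_numeric, just to illustrate the scale of the savings:

```python
import numpy as np
import pandas as pd

# A binary column: values are only 0 and 1, yet it is stored as int64
s = pd.Series(np.random.randint(0, 2, 1_000_000, dtype='int64'))
print(s.dtype, s.memory_usage(deep=True))  # int64, ~8 MB

# downcast='integer' picks the smallest integer dtype that fits the values
s_small = pd.to_numeric(s, downcast='integer')
print(s_small.dtype, s_small.memory_usage(deep=True))  # int8, ~1 MB
```

An eight-fold reduction for a single column; multiplied across a wide frame and several working copies, this is often the difference between fitting in RAM and crashing.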
