python: cooking pandas
Reading big data
A very useful feature of pandas is iterating over a DataFrame in chunks when it doesn't fit into memory (which happens often in real work).
import pandas as pd

# Read the CSV lazily, 1000 rows at a time
data_iter = pd.read_csv('babynames.csv', chunksize=1000)

def data_prep(chunks, func):
    """Apply a filtering function to each chunk and append the result to disk."""
    for i, chunk in enumerate(chunks):
        tmp = func(chunk)
        # Append instead of overwriting, and write the header only once
        tmp.to_csv('dataset.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)

def func_filter(data):
    return data.head(5)

data_prep(data_iter, func_filter)
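Each chunk is an ordinary DataFrame, so the same pattern also works for incremental aggregation rather than filtering. Below is a minimal sketch, assuming the same babynames.csv with Year and Count columns as above:

import pandas as pd

# Sum Count per Year without ever holding the whole file in memory
totals = None
for chunk in pd.read_csv('babynames.csv', chunksize=100_000):
    part = chunk.groupby('Year')['Count'].sum()
    # Merge the partial sums, treating years missing from a chunk as zero
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.sort_index().head())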
Reducing the size of a data frame
When reading a table, pandas lets you control how much memory a column occupies by choosing a smaller dtype. Be careful with this, though: if a column holds very large numbers, downcasting from 64 to 32 bits can overflow and silently produce incorrect data. A small benchmark on your own data is better here.
# Load data
data = pd.read_csv('babynames.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   Name    1690784 non-null  object
 1   Gender  1690784 non-null  object
 2   Count   1690784 non-null  int64
 3   Year    1690784 non-null  int64
dtypes: int64(2), object(2)
memory usage: 51.6+ MB   <<<<<< HERE
# Load the data with smaller dtypes for the Gender, Count and Year columns
data = pd.read_csv('babynames.csv', dtype={'Name': object, 'Gender': 'category', 'Count': 'int32', 'Year': 'int16'})
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   Name    1690784 non-null  object
 1   Gender  1690784 non-null  category
 2   Count   1690784 non-null  int32
 3   Year    1690784 non-null  int16
dtypes: category(1), int16(1), int32(1), object(1)
memory usage: 24.2+ MB   <<<<<< HERE
So what have we done? By default pandas stores every integer in 64 bits; we lowered Count and Year to 32 and 16 bits respectively. We also changed Gender to the category type, which is very useful when a column has low cardinality: each value is stored as a small integer code, and the actual strings live once in a separate dictionary, so the column can shrink dramatically.
In the end we cut the memory usage in half, a good result!
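The same downcasting works on a DataFrame that is already in memory, which makes it easy to benchmark before committing to smaller dtypes. Here is a minimal sketch with the column names from the example above; pd.to_numeric with downcast picks the smallest integer type that fits every value, avoiding the overflow risk of a hand-picked int16:

import pandas as pd

data = pd.read_csv('babynames.csv')
# deep=True also counts the strings stored inside object columns
print(data.memory_usage(deep=True).sum())

data['Gender'] = data['Gender'].astype('category')
# downcast='integer' chooses the smallest integer dtype that holds every value
data['Count'] = pd.to_numeric(data['Count'], downcast='integer')
data['Year'] = pd.to_numeric(data['Year'], downcast='integer')
print(data.memory_usage(deep=True).sum())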
Parsing dates while reading
Parse dates explicitly into a proper datetime column type; besides correctness, it reduces the size of your DataFrame. (In my practice there was a real case with financial data exported from 1C, God forgive me, where the date arrived in a very strange format and was parsed so that the month became the day and everything got mixed up.) In short, specify the format explicitly: reading may take a little longer, but the result is unambiguous and it saves space on disk.
%%time
# Let dateutil figure out how to parse the date on its own
data = pd.read_csv('babynames_time_test.csv', parse_dates=['Year'])
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
%%time
# Tell pandas explicitly how to parse the date
# (pd.datetime was removed from pandas; use the standard library datetime instead)
from datetime import datetime
date_parse_func = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
# In pandas >= 2.0, pass date_format="%Y-%m-%d %H:%M:%S" instead of date_parser
data = pd.read_csv('babynames_time_test.csv', parse_dates=['Year'], date_parser=date_parse_func)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs
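If the file is already loaded with Year as plain strings, you can convert after the fact with pd.to_datetime and an explicit format, then check the resulting dtype. A minimal sketch using the same babynames_time_test.csv as above:

import pandas as pd

data = pd.read_csv('babynames_time_test.csv')
# An explicit format is unambiguous and much faster than letting dateutil guess
data['Year'] = pd.to_datetime(data['Year'], format='%Y-%m-%d %H:%M:%S')
print(data['Year'].dtype)  # datetime64[ns]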
Link to the Jupyter notebook I used:
https://github.com/datanlnja/helpful_notebook/blob/main/pandas_optimisation.ipynb