python: cooking pandas
Reading big data
A very useful feature of pandas is iterating over a DataFrame in chunks when it doesn't fit into memory (which happens often in real work).
import pandas as pd

# Read the CSV lazily, 1000 rows at a time
data_iter = pd.read_csv('babynames.csv', chunksize=1000)

def data_prep(chunks, func):
    """Apply a filtering function to each chunk and append the result to disk."""
    for i, chunk in enumerate(chunks):
        tmp = func(chunk)
        # Append instead of overwriting, and write the header only once
        tmp.to_csv('dataset.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)

def func_filter(data):
    return data.head(5)

data_prep(data_iter, func_filter)
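Each chunk is an ordinary DataFrame, so the same pattern also works for incremental aggregation rather than filtering. Below is a minimal sketch, assuming the same babynames.csv with Year and Count columns as above:

import pandas as pd

# Sum Count per Year without ever holding the whole file in memory
totals = None
for chunk in pd.read_csv('babynames.csv', chunksize=100_000):
    part = chunk.groupby('Year')['Count'].sum()
    # Merge the partial sums, treating years missing from a chunk as zero
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals.sort_index().head())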
Reducing the size of a data frame
When reading a table, pandas lets you control how much memory a column occupies by choosing a smaller dtype. Be careful with this, though: if a column holds very large numbers, downcasting from 64 to 32 bits can overflow and silently produce incorrect data. A small benchmark on your own data is better here.
# Load data
data = pd.read_csv('babynames.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   Name    1690784 non-null  object
 1   Gender  1690784 non-null  object
 2   Count   1690784 non-null  int64
 3   Year    1690784 non-null  int64
dtypes: int64(2), object(2)
memory usage: 51.6+ MB   <<<<<< HERE
# Load the data with smaller dtypes for the Gender, Count and Year columns
data = pd.read_csv('babynames.csv', dtype={'Name': object, 'Gender': 'category', 'Count': 'int32', 'Year': 'int16'})
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1690784 entries, 0 to 1690783
Data columns (total 4 columns):
 #   Column  Non-Null Count    Dtype
---  ------  --------------    -----
 0   Name    1690784 non-null  object
 1   Gender  1690784 non-null  category
 2   Count   1690784 non-null  int32
 3   Year    1690784 non-null  int16
dtypes: category(1), int16(1), int32(1), object(1)
memory usage: 24.2+ MB   <<<<<< HERE
So what have we done? By default pandas stores every integer in 64 bits; we lowered Count and Year to 32 and 16 bits respectively. We also changed Gender to the category type, which is very useful when a column has low cardinality: each value is stored as a small integer code, and the actual strings live once in a separate dictionary, so the column can shrink dramatically.
In the end we cut the memory usage in half, a good result!
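The same downcasting works on a DataFrame that is already in memory, which makes it easy to benchmark before committing to smaller dtypes. Here is a minimal sketch with the column names from the example above; pd.to_numeric with downcast picks the smallest integer type that fits every value, avoiding the overflow risk of a hand-picked int16:

import pandas as pd

data = pd.read_csv('babynames.csv')
# deep=True also counts the strings stored inside object columns
print(data.memory_usage(deep=True).sum())

data['Gender'] = data['Gender'].astype('category')
# downcast='integer' chooses the smallest integer dtype that holds every value
data['Count'] = pd.to_numeric(data['Count'], downcast='integer')
data['Year'] = pd.to_numeric(data['Year'], downcast='integer')
print(data.memory_usage(deep=True).sum())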
Parsing dates while reading
Parse dates explicitly into a proper datetime column type; besides correctness, it reduces the size of your DataFrame. (In my practice there was a real case with financial data exported from 1C, God forgive me, where the date arrived in a very strange format and was parsed so that the month became the day and everything got mixed up.) In short, specify the format explicitly: reading may take a little longer, but the result is unambiguous and it saves space on disk.
%%time
# Let dateutil figure out how to parse the date on its own
data = pd.read_csv('babynames_time_test.csv', parse_dates=['Year'])
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 7.15 µs
%%time
# Tell pandas explicitly how to parse the date
# (pd.datetime was removed from pandas; use the standard library datetime instead)
from datetime import datetime
date_parse_func = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
# In pandas >= 2.0, pass date_format="%Y-%m-%d %H:%M:%S" instead of date_parser
data = pd.read_csv('babynames_time_test.csv', parse_dates=['Year'], date_parser=date_parse_func)
CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.68 µs
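If the file is already loaded with Year as plain strings, you can convert after the fact with pd.to_datetime and an explicit format, then check the resulting dtype. A minimal sketch using the same babynames_time_test.csv as above:

import pandas as pd

data = pd.read_csv('babynames_time_test.csv')
# An explicit format is unambiguous and much faster than letting dateutil guess
data['Year'] = pd.to_datetime(data['Year'], format='%Y-%m-%d %H:%M:%S')
print(data['Year'].dtype)  # datetime64[ns]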
Link to the Jupyter notebook I used:
https://github.com/datanlnja/helpful_notebook/blob/main/pandas_optimisation.ipynb