1 of 48

Nathan Cheever

Data Scientist, AdvancedMD

Faster Data Manipulation

vectorizing with Pandas and NumPy

2 of 48

A little about me

3 of 48

Is that speed gain possible?

Actual results may vary!

[Chart: Time vs. Data size — the For Loop curve climbs steeply as data grows, while the Vectorization curve stays nearly flat.]

4 of 48

What is vectorization?

  • Strategy for avoiding looping over your dataset
  • Operating on an array simultaneously
  • Using NumPy ufuncs to push your operation into C-level for loops
  • Allows for huge speed increases
  • groupby
  • filter
  • transform
  • Column arithmetic operations (+, -, *, /)
  • .str methods
  • pd.to_datetime()
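As a tiny illustration of the contrast (toy data, not from the talk): the loop touches each row in Python, while the vectorized version issues one C-level operation over whole columns.

```python
import pandas as pd

# Toy frame, just to contrast the two styles
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Loop: one Python-level iteration per row
loop_result = [row.a + row.b for row in df.itertuples()]

# Vectorized: a single operation over whole columns, run in C
vec_result = df["a"] + df["b"]

print(vec_result.tolist())  # [11, 22, 33]
```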

5 of 48

It’s catching on

6 of 48

Sofia Heisler’s PyCon Talk

Python for loop over rows

Looping with .iterrows()

.apply() method

Pandas series vectorization

NumPy array vectorization

youtu.be/HN5d490_KKk

7 of 48

def set_lead_status(row):
    if row['current_status'] == 'None':
        return row['status_at_time_of_lead']
    else:
        return row['current_status']

>>> df['status'] = df.apply(set_lead_status, axis=1)
# 8.1 s

8 of 48

def set_lead_status(col1, col2):
    if col1 == 'None':
        return col2
    else:
        return col1

def set_lead_status(row):
    if row['cs'] == 'None':
        return row['st']
    else:
        return row['cs']

>>> df['status'] = set_lead_status(df['cs'], df['st'])
# ValueError: The truth value of a Series is ambiguous.

9 of 48

Vectorizing logical statements?

10 of 48

numpy.where()

vectorizing if/else

11 of 48

Syntax

np.where(
    conditional statement → bool array,
    series/array/function()/scalar if True,
    series/array/function()/scalar if False
)
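A toy example of the pattern (made-up values): the condition produces a boolean array, and each element picks the second or third argument.

```python
import numpy as np
import pandas as pd

s = pd.Series(["None", "Active", "None", "Churned"])

# Condition → bool array; True takes the second arg, False the third
result = np.where(s == "None", "replaced", s)
print(list(result))  # ['replaced', 'Active', 'replaced', 'Churned']
```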

12 of 48

>>> df['status'] = np.where(
...     df['status_at_time_of_lead'] == 'None',
...     df['current_status'],
...     df['status_at_time_of_lead']
... )

13 of 48

>>> df['status'] = np.where(
...     df['status_at_time_of_lead'] == 'None',
...     df['current_status'],
...     'NA'
... )

14 of 48

>>> df['status'] = np.where(
...     df['status_at_time_of_lead'].values == 'None',
...     df['current_status'].values,
...     df['status_at_time_of_lead'].values
... )

15 of 48

df.apply method: 8.1 s
NumPy Vectorized: 8.8 ms

920x faster! (n = 273,399)
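Those timings come from the speaker's dataset; a rough sketch of how to run the same comparison yourself on synthetic data (all column values below are made up):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the talk's lead data
df = pd.DataFrame({
    "current_status": np.random.choice(["None", "Active"], size=5_000),
    "status_at_time_of_lead": "Open",
})

apply_time = timeit.timeit(
    lambda: df.apply(
        lambda r: r["status_at_time_of_lead"]
        if r["current_status"] == "None"
        else r["current_status"],
        axis=1,
    ),
    number=3,
)
where_time = timeit.timeit(
    lambda: np.where(
        df["current_status"] == "None",
        df["status_at_time_of_lead"],
        df["current_status"],
    ),
    number=3,
)
print(f"apply: {apply_time:.3f}s  np.where: {where_time:.4f}s")
```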

16 of 48

What about

numpy.vectorize()?

17 of 48

def set_lead_status(col1, col2):
    if col1 == '- None -':
        return col2
    else:
        return col1

>>> vectfunc = np.vectorize(set_lead_status)
>>> df['status'] = vectfunc(
...     df['current_status'],
...     df['status_at_time_of_lead']
... )
# 137 ms
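Worth noting: np.vectorize is a convenience wrapper around a Python-level loop rather than true C-level vectorization, which is consistent with its timing landing between .apply() and np.where. A self-contained sketch of the call on literal inputs:

```python
import numpy as np

def set_lead_status(col1, col2):
    # Same scalar logic as the slide; np.vectorize broadcasts it element-wise
    if col1 == "- None -":
        return col2
    else:
        return col1

vectfunc = np.vectorize(set_lead_status)
out = vectfunc(["- None -", "Active"], ["Fallback", "Fallback"])
print(list(out))  # ['Fallback', 'Active']
```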

18 of 48

What about multiple conditions?

if...elif...elif...else

19 of 48

def lead_category(row):
    if row['date_col1'] == row['date_col2']:
        return 'New Lead'
    elif row['norm_status'].startswith('CLI'):
        return 'Client Lead'
    elif row['norm_status'] in list2:
        return 'MTouch Lead'
    elif row['norm_status'] in list1:
        return 'EMTouch Lead'
    else:
        return 'NA'

20 of 48


df['lead_category'] = \
    np.where(df['date_col1'] == df['date_col2'], 'New Lead',
    np.where(df['norm_status'].str.startswith('CLI'), 'Client Lead',
    np.where(df['norm_status'].isin(list2), 'MTouch Lead',
    np.where(df['norm_status'].isin(list1), 'EMTouch Lead',
             'NA'))))

21 of 48

numpy.select()

vectorizing if...elif...else

22 of 48

conditions = [
    df['date_col1'] == df['date_col2'],
    df['norm_status'].str.startswith('CLI'),
    df['norm_status'].isin(multi_touch_leads),
    df['norm_status'].isin(eng_multit_leads)
]

choices = [
    'New Lead',
    'Client Lead',
    'MTouch Lead',
    'EMTouch Lead'
]

>>> df['leadcat'] = np.select(conditions, choices, default='NA')
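A runnable miniature of the same pattern (hypothetical statuses and lists):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"norm_status": ["CLI-042", "mt_lead", "other"]})
multi_touch_leads = ["mt_lead"]  # hypothetical status list

conditions = [
    df["norm_status"].str.startswith("CLI"),
    df["norm_status"].isin(multi_touch_leads),
]
choices = ["Client Lead", "MTouch Lead"]

df["leadcat"] = np.select(conditions, choices, default="NA")
print(df["leadcat"].tolist())  # ['Client Lead', 'MTouch Lead', 'NA']
```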

23 of 48

df.apply method: 12.5 s
NumPy Vectorized: 140 ms

89x faster! (n = 273,399)

24 of 48

def sub_conditional(row):
    if row['Inactive'] == 'No':
        if row['Providers'] == 0:
            return 'active_no_providers'
        elif row['Providers'] < 5:
            return 'active_small'
        else:
            return 'active_normal'
    elif row['duplicate_leads']:
        return 'is_dup'
    else:
        if row['bad_leads']:
            return 'active_bad'
        else:
            return 'active_good'

25 of 48

conditions = [
    (df['Inactive'] == 'No') & (df['Providers'] == 0),
    (df['Inactive'] == 'No') & (df['Providers'] < 5),
    df['Inactive'] == 'No',
    df['duplicate_leads'],
    df['bad_leads'],
]

choices = [
    'active_no_providers',
    'active_small',
    'active_normal',
    'is_dup',
    'active_bad',
]

>>> df['lead_type'] = np.select(conditions, choices, default='NA')
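np.select evaluates the conditions top-down and takes the first match, which is how the nested if/elif structure flattens into an ordered list. A toy version (invented rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Inactive": ["No", "No", "Yes"],
    "Providers": [0, 3, 9],
    "duplicate_leads": [False, False, True],
})

# Conditions are checked in order; the first True one wins,
# so nesting becomes list ordering
conditions = [
    (df["Inactive"] == "No") & (df["Providers"] == 0),
    (df["Inactive"] == "No") & (df["Providers"] < 5),
    df["Inactive"] == "No",
    df["duplicate_leads"],
]
choices = ["active_no_providers", "active_small", "active_normal", "is_dup"]

out = np.select(conditions, choices, default="NA")
print(list(out))  # ['active_no_providers', 'active_small', 'is_dup']
```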

26 of 48

df.apply method: 9.4 s
NumPy Vectorized: 24.8 ms

380x faster! (n = 273,399)

27 of 48

What about more complicated things?

working with

  • strings
  • dictionaries
  • dates
  • other rows

28 of 48

Strings

import re

def find_paid_nonpaid(s):
    if re.search(r'non.*?paid', s, re.I):
        return 'non-paid'
    elif re.search(r'Buyerzone|^paid\s+', s, re.I):
        return 'paid'
    else:
        return 'NA'

>>> df['is_paid'] = df['Lead Source'].apply(find_paid_nonpaid)

29 of 48

Using pandas .str methods

conditions = [
    df['Lead Source'].str.contains(r'non.*?paid', case=False, na=False),
    df['Lead Source'].str.contains(r'Buyerzone|^paid\s+', case=False, na=False)
]

choices = ['non-paid', 'paid']

>>> df['is_paid'] = np.select(conditions, choices, default='NA')
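.str.contains runs the same regexes across the whole Series at once, with case=False standing in for re.I and na=False handling missing values. A miniature with made-up lead sources:

```python
import numpy as np
import pandas as pd

src = pd.Series(["Non-Paid Search", "paid  ads", "Organic", None])

conditions = [
    src.str.contains(r"non.*?paid", case=False, na=False),
    src.str.contains(r"Buyerzone|^paid\s+", case=False, na=False),
]
out = np.select(conditions, ["non-paid", "paid"], default="NA")
print(list(out))  # ['non-paid', 'paid', 'NA', 'NA']
```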

30 of 48

Using np.vectorize

>>> vect_str = np.vectorize(find_paid_nonpaid)

>>> df['is_paid'] = vect_str(df['Lead Source'])

31 of 48

df.apply method: 540 ms
Using np.vectorize: 563 ms
pandas .str methods: 752 ms

32 of 48

Dictionary lookups

def a_dict_lookup(row):
    if row['Providers'] > 7:
        return 'Upmarket'
    else:
        channel = channel_dict.get(row['Category'])
        return channel

>>> df['dict_lookup'] = df.apply(a_dict_lookup, axis=1)

33 of 48

Using pandas .map method

>>> df['dict_lookup1'] = np.where(
...     df['Providers'] > 7,
...     'Upmarket',
...     df['Category'].map(channel_dict)
... )
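A self-contained miniature of the lookup (the channel_dict contents here are invented for illustration):

```python
import numpy as np
import pandas as pd

channel_dict = {"EHR": "Direct", "Billing": "Partner"}  # invented mapping
df = pd.DataFrame({"Providers": [10, 3], "Category": ["EHR", "Billing"]})

# True branch takes 'Upmarket'; everything else falls through to .map()
df["dict_lookup"] = np.where(
    df["Providers"] > 7,
    "Upmarket",
    df["Category"].map(channel_dict),
)
print(df["dict_lookup"].tolist())  # ['Upmarket', 'Partner']
```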

34 of 48

df.apply method: 7.84 s
NumPy Vectorized: 26.6 ms

295x faster! (n = 273,399)

35 of 48

Dates

def weeks_to_complete(row):
    if pd.isnull(row['Start Date']):
        return (row['Original Date Created'] - row['Date Created']).days / 7
    else:
        return (row['Date Created'] - row['Start Date']).days / 7

>>> df['weeks_to_complete'] = df.apply(weeks_to_complete, axis=1)

36 of 48

Using .dt accessor

x = np.where(df['Start Date'].isnull().values,
             (df['Original Date Created'] - df['Date Created']).dt.days / 7,
             (df['Date Created'] - df['Start Date']).dt.days / 7)

>>> df['weeks_to_complete1'] = x

37 of 48


ndarray type casting

y = np.where(df['Start Date'].isnull().values,
             ((df['Original Date Created'].values - df['Date Created'].values)
                  .astype('timedelta64[D]') / np.timedelta64(1, 'D')) / 7,
             ((df['Date Created'].values - df['Start Date'].values)
                  .astype('timedelta64[D]') / np.timedelta64(1, 'D')) / 7)

>>> df['weeks_to_complete2'] = y
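Both routes on two made-up rows, one with a missing start date, to show that NaT propagates cleanly through either calculation:

```python
import numpy as np
import pandas as pd

start = pd.to_datetime(pd.Series(["2020-01-01", None]))
end = pd.to_datetime(pd.Series(["2020-01-15", "2020-02-01"]))

# Pandas route: .dt accessor on the timedelta Series
weeks_dt = (end - start).dt.days / 7

# NumPy route: raw datetime64 arrays divided by a one-day timedelta
weeks_np = (end.values - start.values) / np.timedelta64(1, "D") / 7

print(weeks_dt.tolist())  # [2.0, nan]
```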

38 of 48

df.apply method: 20.5 s
Pandas .dt accessor: 24 ms
NumPy type casting: 12.8 ms

1,601x faster!

39 of 48

40 of 48

Using values from other rows

def really_slow(df):
    output = []
    for i, row in df.iterrows():
        if i > 0:
            if df.iloc[i]['ID'] == df.iloc[i-1]['ID']:
                if (df.iloc[i]['Date'] - df.iloc[i-1]['Date']).days < 5:
                    output.append(0)
                else:
                    output.append(1)
            else:
                output.append(1)
        else:
            output.append(np.nan)
    return output

# Excel equivalent: =IF(A2=A1, IF(L2-L1 < 5, 0, 1), 1)

41 of 48

prev_id = df['ID'].shift(1).fillna(0).astype(int)
prev_date = df['Date'].shift(1).fillna(pd.Timestamp('1900'))

conditions = [
    (df['ID'].values == prev_id) & ((df['Date'] - prev_date).dt.days < 5),
    df['ID'].values == prev_id
]

choices = [0, 1]

>>> df['time_col1'] = np.select(conditions, choices, default=1)
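.shift(1) aligns each row with its predecessor, so the row-to-row comparison becomes plain column arithmetic. A miniature with three hypothetical rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-10"]),
})

# Previous row's values, aligned to the current row
prev_id = df["ID"].shift(1).fillna(0).astype(int)
prev_date = df["Date"].shift(1).fillna(pd.Timestamp("1900"))

conditions = [
    (df["ID"] == prev_id) & ((df["Date"] - prev_date).dt.days < 5),
    df["ID"] == prev_id,
]
out = np.select(conditions, [0, 1], default=1)
print(list(out))  # [1, 0, 1]
```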

42 of 48

df.iterrows method: 3min 17s
NumPy Vectorized: 17.9 ms

n = 273,399

n = 273,399

43 of 48

that’s 11,006x faster!

44 of 48

What about more complicated...er things?

45 of 48

Parallel .apply()

Dask

from multiprocessing import Pool

def p_apply(df, func, cores=4):
    df_split = np.array_split(df, cores)
    pool = Pool(cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

>>> df = p_apply(df, func=some_big_function)

https://docs.dask.org

Scaling beyond a single machine OR

more effectively parallelizing on your machine

46 of 48

Takeaways

Vectorization can speed things up immensely!

np.where → one logical condition

np.select → 2+ logical conditions

If your logic is too complex for vectorization, there are alternatives

Really know your logic and write test cases before optimizing.

Actual results may vary!

47 of 48

Thanks!

Nathan Cheever

nathan.cheever12@gmail.com

@data_cheeves

Repo: https://bit.ly/2AG9M3i

48 of 48

Resources