Nathan Cheever
Data Scientist, AdvancedMD
1000x Faster Data Manipulation
vectorizing with Pandas and NumPy
A little about me
Data Scientist at AdvancedMD
I 💖 NLP
utahdatascientist.org
Have two kids
Claims to fame
What we’re going to talk about
Strategies for manipulating data in pandas
Follow along notebook on the repo here:
bit.ly/37IM2Kg
Is that speed gain possible?
Actual results may vary!
Data size
Time
For Loop
Vectorization
What is vectorization?
It’s catching on
Sofia Heisler’s PyCon Talk
Python for loop over rows
Looping with .iterrows()
.apply() method
Pandas series vectorization
NumPy array vectorization
bit.ly/35yZajt
>>> df['status'] = df.apply(set_lead_status, axis=1)
# 4.8 s
def set_lead_status(row):
    """Resolve one row's lead status.

    Falls back to the status recorded at lead time when the current
    status column holds the literal string 'None'.
    """
    current = row['current_status']
    return row['status_at_time_of_lead'] if current == 'None' else current
>>> df['status'] = set_lead_status(df['cs'], df['st'])
def set_lead_status(col1, col2):
    """Return col2 when col1 is the literal string 'None'; otherwise col1."""
    if col1 != 'None':
        return col1
    return col2
def set_lead_status(row):
    """Row-wise status pick: use the 'st' column when 'cs' holds the
    literal string 'None', otherwise keep 'cs'."""
    current = row['cs']
    if current != 'None':
        return current
    return row['st']
# ValueError!: The truth value of a Series is ambiguous.
Vectorizing logical statements?
numpy.where()
vectorizing if/else
Syntax
np.where(
conditional statement → bool array,
series/array/function()/scalar if True,
series/array/function()/scalar if False
)
np.where(
df['status_at_time_of_lead'] == 'None',
df['current_status'],
df['status_at_time_of_lead']
)
>>> df['status'] = np.where(^^^)
np.where(
df['status_at_time_of_lead'] == 'None',
df['current_status'],
'NA'
)
>>> df['status'] = np.where(^^^)
>>> df['status'] = np.where(^^^)
np.where(
df['status_at_time_of_lead'].values == 'None',
df['current_status'].values,
df['status_at_time_of_lead'].values
)
df.apply method
NumPy Vectorized
4.8 s
5.8 ms
830x faster!
n = 273,399
What about
numpy.vectorize()?
>>> vectfunc = np.vectorize(set_lead_status)
>>> df['status'] = vectfunc(
df['current_status'],
df['status_at_time_of_lead']
)
# 96 ms
def set_lead_status(col1, col2):
    """Return col2 when col1 is the sentinel string '- None -'; otherwise col1.

    Scalar version used with np.vectorize in the slides.
    """
    return col2 if col1 == '- None -' else col1
What about multiple conditions?
if...elif...elif...else
def lead_category(row):
    """Classify a lead row into one of four buckets, or 'NA'.

    NOTE(review): list1 and list2 are module-level globals defined
    elsewhere — confirm they are in scope before calling.
    """
    if row['date_col1'] == row['date_col2']:
        return 'New Lead'
    status = row['norm_status']
    if status.startswith('CLI'):
        return 'Client Lead'
    if status in list2:
        return 'MTouch Lead'
    if status in list1:
        return 'EMTouch Lead'
    return 'NA'
21
def lead_category(row):
    """Map a lead row to its category label ('NA' when nothing matches).

    NOTE(review): list1 and list2 are module-level globals defined
    elsewhere — confirm they are in scope before calling.
    """
    if row['date_col1'] == row['date_col2']:
        return 'New Lead'
    checks = (
        (lambda s: s.startswith('CLI'), 'Client Lead'),
        (lambda s: s in list2, 'MTouch Lead'),
        (lambda s: s in list1, 'EMTouch Lead'),
    )
    for predicate, label in checks:
        if predicate(row['norm_status']):
            return label
    return 'NA'
df['lead_category'] = \
np.where(df['date_col1'] == df['date_col2'], 'New Lead',
np.where(df['norm_status'].str.startswith('CLI'), 'Client Lead',
np.where(df['norm_status'].isin(list2), 'MTouch Lead',
np.where(df['norm_status'].isin(list1), 'EMTouch Lead',
'NA'))))
numpy.select()
vectorizing if...elif...else
conditions = [
df['date_col1'] == df['date_col2'],
df['norm_status'].str.startswith('CLI'),
df['norm_status'].isin(multi_touch_leads),
df['norm_status'].isin(eng_multit_leads)
]
choices = [
'New Lead',
'Client Lead',
'MTouch Lead',
'EMTouch Lead'
]
>>> df['leadcat'] = np.select(conditions, choices, default='NA')
df.apply method
NumPy Vectorized
6.9 s
82 ms
84x faster!
n = 273,399
def sub_conditional(row):
    """Bucket a lead row by activity flag, provider count, and data quality.

    Active rows ('Inactive' == 'No') are sized by provider count;
    inactive rows are first checked for duplicates, then bad-lead status.
    """
    if row['Inactive'] == 'No':
        providers = row['Providers']
        if providers == 0:
            return 'active_no_providers'
        return 'active_small' if providers < 5 else 'active_normal'
    if row['duplicate_leads']:
        return 'is_dup'
    return 'active_bad' if row['bad_leads'] else 'active_good'
conditions = [
((df['Inactive'] == 'No') & (df['Providers'] == 0)),
((df['Inactive'] == 'No') & (df['Providers'] < 5)),
df['Inactive'] == 'No',
df['duplicate_leads'],
df['bad_leads'],
]
choices = [
'active_no_providers',
'active_small',
'active_normal',
'is_dup',
'active_bad',
]
>>> df['lead_type'] = np.select(conditions, choices, default='NA')
df.apply method
NumPy Vectorized
5.7 s
35.8 ms
160x faster!
n = 273,399
What about more complicated things?
working with
Strings
# Patterns hoisted to module level; identical to the inline originals.
_NONPAID_RE = re.compile(r'non.*?paid', re.I)
_PAID_RE = re.compile(r'Buyerzone|^paid\s+', re.I)


def find_paid_nonpaid(s):
    """Label a lead-source string as 'non-paid', 'paid', or 'NA'.

    'non...paid' (case-insensitive) wins over the paid patterns.
    """
    if _NONPAID_RE.search(s):
        return 'non-paid'
    if _PAID_RE.search(s):
        return 'paid'
    return 'NA'
>>> df['is_paid'] = df['Lead Source'].apply(find_paid_nonpaid)
Using pandas .str methods
conditions = [
df['Lead Source'].str.contains(r'non.*?paid', na=False),
df['Lead Source'].str.contains(r'Buyerzone|^paid\s+', na=False)
]
choices = ['non-paid', 'paid']
>>> df['is_paid'] = np.select(conditions, choices, default='NA')
Using np.vectorize
>>> vect_str = np.vectorize(find_paid_nonpaid)
>>> df['is_paid'] = vect_str(df['Lead Source'])
df.apply method
Using np.vectorize
pandas .str methods
480 ms
535 ms
493 ms
Dictionary lookups
def a_dict_lookup(row):
    """Label large accounts (more than 7 providers) 'Upmarket'; otherwise
    look the row's 'Category' up in channel_dict (None when missing).

    NOTE(review): channel_dict is a module-level global defined elsewhere
    — confirm it is in scope before calling.
    """
    if row['Providers'] > 7:
        return 'Upmarket'
    return channel_dict.get(row['Category'])
>>> df['dict_lookup'] = df.apply(a_dict_lookup, axis=1)
Using pandas .map method
>>> df['dict_lookup1'] = np.where(
df['Providers'] > 7,
'Upmarket',
df['Category'].map(channel_dict)
)
df.apply method
NumPy Vectorized
5.2 s
17.5 ms
294x faster!
n = 273,399
Dates
def weeks_to_complete(row):
    """Elapsed weeks for one row.

    Uses Date Created minus Start Date; when 'Start Date' is missing
    (NaT/None), falls back to Original Date Created minus Date Created.
    """
    if pd.isnull(row['Start Date']):
        delta = row['Original Date Created'] - row['Date Created']
    else:
        delta = row['Date Created'] - row['Start Date']
    return delta.days / 7
>>> df['weeks_to_complete'] = df.apply(weeks_to_complete, axis=1)
Using .dt accessor
x = np.where(
df['Start Date'].isnull().values,
(df['Original Date Created'].values - df['Date Created']).dt.days / 7,
(df['Date Created'].values - df['Start Date']).dt.days / 7
)
>>> df['weeks_to_complete1'] = x
Using .dt accessor
x = np.where(df['Start Date'].isnull().values,
(df['Original Date Created'].values - df['Date Created']).dt.days / 7,
(df['Date Created'].values - df['Start Date']).dt.days / 7)
>>> df['weeks_to_complete1'] = x
ndarray type casting
y = np.where(df['Start Date'].isnull().values,
((df['Original Date Created'].values - df['Date Created'].values) \ .astype('timedelta64[D]') / np.timedelta64(1, 'D')) / 7,
((df['Date Created'].values - df['Start Date'].values) \ .astype('timedelta64[D]') / np.timedelta64(1, 'D')) / 7)
>>> df['weeks_to_complete2'] = y
df.apply method
Pandas .dt
accessor
NumPy type casting
12.7 s
19 ms
7.9 ms
1,606x faster!
Using values from other rows
def really_slow(df):
    """Flag each row: 1, unless the previous row has the same 'ID' and is
    more than 5 days older (then 0). The first row gets NaN.

    Deliberately slow iterrows/iloc demo from the talk.
    NOTE(review): the iterrows label is reused as an .iloc position, so
    this assumes a default RangeIndex (0..n-1) — confirm callers
    reset_index first.
    """
    flags = []
    for pos, _ in df.iterrows():
        if not pos > 0:
            flags.append(np.nan)
            continue
        same_id = df.iloc[pos]['ID'] == df.iloc[pos - 1]['ID']
        if same_id and (df.iloc[pos]['Date'] - df.iloc[pos - 1]['Date']).days > 5:
            flags.append(0)
        else:
            flags.append(1)
    return flags
=IF(A2=A1, IF(L2-L1 > 5, 0, 1), 1)
prev_id = df['ID'].shift(1).fillna(0).astype(int)
prev_date = df['Date'].shift(1).fillna(pd.Timestamp('1900'))
conditions = [
((df['Internal ID'].values == prev_id) &
(df['Date Created'] - prev_date).astype('timedelta64[D]') < 5),
df['Internal ID'].values == prev_id
]
choices = [0, 1]
>>> df['time_col1'] = np.select(conditions, choices, default=1)
df.iterrows method
NumPy Vectorized
2min 19s
11.1 ms
n = 273,399
that’s 12,523x faster!
What about more complicated...er things?
Dask
Parallel .apply()
Scaling beyond a single machine OR
more effectively parallelizing on your machine
from multiprocessing import Pool
def p_apply(df, func, cores=4):
    """Apply *func* to *df* in parallel: split the frame into `cores`
    chunks, run *func* on each chunk in a worker process, and return
    the re-concatenated result.

    *func* must be picklable (defined at module top level).
    Fix: the pool was created with the undefined name `n_cores`
    (NameError at runtime); it now uses the `cores` parameter.
    The pool is also closed/joined in a finally block so workers are
    reaped even if func raises.
    """
    df_split = np.array_split(df, cores)
    pool = Pool(cores)
    try:
        df = pd.concat(pool.map(func, df_split))
    finally:
        pool.close()
        pool.join()
    return df
>>> df = p_apply(df, func=some_big_function)
https://bit.ly/2RnlEit
Takeaways
Vectorization can speed things up immensely!
np.where → one logical condition
np.select → 2+ logical conditions
If your logic is too much for vectorization, there are other alternatives
Really know your logic and write test cases before optimizing.
Actual results may vary!
Thanks!