Both pandas apply and NumPy vectorize are useful for manipulating a pandas DataFrame, but they have different uses and performance characteristics.
The pandas apply function applies a built-in or custom function along an axis of the DataFrame. It is very flexible and can handle complex manipulations, such as calculations with conditional logic. However, apply calls the function once per element (or once per row or column) in Python, which makes it slow on large DataFrames.
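As a minimal sketch of the conditional-logic case, the snippet below applies a custom function row by row with axis=1. The small DataFrame and the helper name label_row are hypothetical and used only for illustration.

```python
import pandas as pd

# Small illustrative DataFrame (hypothetical data, not the one used in the examples)
demo = pd.DataFrame({"col1": [0.2, 0.8, 0.5], "col2": [0.9, 0.1, 0.5]})


def label_row(row):
    # Conditional logic evaluated once per row
    if row["col1"] > row["col2"]:
        return "col1 larger"
    elif row["col1"] < row["col2"]:
        return "col2 larger"
    return "equal"


# axis=1 passes each row to the function as a Series
demo["label"] = demo.apply(label_row, axis=1)
print(demo)
```

Because label_row runs once per row, this pattern is convenient for small data but becomes the bottleneck as the number of rows grows.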
NumPy vectorize can also be used on a pandas DataFrame, but it was designed for NumPy arrays with uniform data types. It is provided mainly for convenience: internally it still calls the Python function once per element, so although it can be faster than pandas apply, it usually does not perform as well as pandas built-in vectorized operations.
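A short sketch of how NumPy vectorize behaves: it wraps a scalar Python function so that it accepts arrays and is called once per element. The function name clip_small, the threshold, and the sample values are hypothetical.

```python
import numpy as np


def clip_small(x, threshold):
    # Scalar function with a branch; the if/else would fail on a multi-element array
    return 0.0 if x < threshold else x


values = np.array([0.05, 0.4, 0.75])

# otypes fixes the output dtype instead of inferring it from the first call
vec_clip = np.vectorize(clip_small, otypes=[float])

# The scalar threshold is broadcast against every element of values
result = vec_clip(values, 0.1)
print(result)
```

The wrapper makes the call site look like an array operation, but the work is still done element by element in Python.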
The following examples demonstrate how to use the pandas apply and NumPy vectorize functions, along with their performance, for pandas DataFrame manipulations.
Create DataFrame
Create a random pandas DataFrame:
# import packages
import pandas as pd
import numpy as np
# set random seed for reproducibility
np.random.seed(42)
df = pd.DataFrame({
    'col1': np.random.rand(1000000),
    'col2': np.random.rand(1000000)
})
# view first few rows of DataFrame
df.head(4)
       col1      col2
0  0.374540  0.595156
1  0.950714  0.364717
2  0.731994  0.005376
3  0.598658  0.561088
We will use this DataFrame to perform manipulations using pandas apply, NumPy vectorize, and pandas vectorization.
pandas apply
We will use the pandas apply function to calculate the square of col2:
df['col3'] = df['col2'].apply(lambda x: x ** 2)
# view DataFrame
df.head(4)
       col1      col2      col3
0  0.374540  0.595156  0.354210
1  0.950714  0.364717  0.133019
2  0.731994  0.005376  0.000029
3  0.598658  0.561088  0.314819
The col3 column contains the squared values of col2.
NumPy vectorize
Now, use the NumPy vectorize function to calculate the square of col2. Here, we create a custom function that calculates the square and pass it to the vectorize function.
# import package
import numpy as np
# create a function
def square(a):
    return a ** 2
# NumPy vectorize
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])
# view DataFrame
df.head(4)
       col1      col2      col3      col4
0  0.374540  0.595156  0.354210  0.354210
1  0.950714  0.364717  0.133019  0.133019
2  0.731994  0.005376  0.000029  0.000029
3  0.598658  0.561088  0.314819  0.314819
The col4 column contains the squared values of col2 calculated using the vectorize function.
pandas vectorization
Now, use pandas vectorization, that is, a built-in operation applied to the whole column at once, to calculate the square of col2.
# import package
import pandas as pd
# square the whole column with a built-in vectorized operation
df['col5'] = df['col2'] ** 2
# view DataFrame
df.head(4)
       col1      col2      col3      col4      col5
0  0.374540  0.595156  0.354210  0.354210  0.354210
1  0.950714  0.364717  0.133019  0.133019  0.133019
2  0.731994  0.005376  0.000029  0.000029  0.000029
3  0.598658  0.561088  0.314819  0.314819  0.314819
The col5 column contains the squared values of col2 calculated using pandas vectorization.
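Before comparing speed, it can be worth confirming that the three approaches produce the same values. The following is a minimal sketch of such a check; it assumes a smaller, separately generated DataFrame (the name check_df and the size of 1,000 rows are illustrative choices, not part of the examples above).

```python
import numpy as np
import pandas as pd

# Smaller DataFrame than the one used in the article, to keep the check quick
rng = np.random.default_rng(42)
check_df = pd.DataFrame({"col2": rng.random(1000)})

# The same squaring operation, computed three ways
via_apply = check_df["col2"].apply(lambda x: x ** 2)
via_np_vectorize = np.vectorize(lambda a: a ** 2)(check_df["col2"])
via_pandas = check_df["col2"] ** 2

# np.allclose compares element-wise with a floating-point tolerance
print(np.allclose(via_apply, via_pandas))
print(np.allclose(via_np_vectorize, via_pandas))
```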
We will now compare the performance of pandas apply, NumPy vectorize, and pandas vectorization.
Calculate performance for pandas apply:
import time
start_time = time.time()
df['col3'] = df['col2'].apply(lambda x: x ** 2)
print(f"time for pandas apply: {time.time() - start_time} seconds")
# output
time for pandas apply: 0.3888976573944092 seconds
The time required for the pandas apply function to calculate the square of all values in the column is about 0.39 s.
Calculate performance for NumPy vectorize:
start_time = time.time()
vectorized_sq = np.vectorize(square)
df['col4'] = vectorized_sq(df['col2'])
print(f"time for NumPy vectorize: {time.time() - start_time} seconds")
# output
time for NumPy vectorize: 0.26853299140930176 seconds
The time required for the NumPy vectorize function to calculate the square of all values in the column is about 0.27 s. In this run, that is roughly 1.4 times faster than the pandas apply function.
Calculate performance for pandas vectorization:
start_time = time.time()
df['col5'] = df['col2'] ** 2
print(f"time for pandas vectorization: {time.time() - start_time} seconds")
# output
time for pandas vectorization: 0.010679006576538086 seconds
The time required for pandas vectorization to calculate the square of all values in the column is about 0.01 s. In this run, that is roughly 25 times faster than NumPy vectorize and roughly 36 times faster than pandas apply.
In summary, the NumPy vectorize function performed better than the pandas apply function in this example, but pandas vectorization outperformed both. Hence, it is essential to choose the right function for pandas data manipulation, especially for large datasets.
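A single time.time() difference can vary noticeably from run to run. As a hedged sketch of a more repeatable measurement, the snippet below uses the standard-library timeit module and keeps the fastest of several runs. The DataFrame size of 100,000 rows, the name bench_df, and the repeat count are assumptions chosen to keep the example quick; the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the article's 1,000,000 rows so the benchmark finishes quickly
rng = np.random.default_rng(42)
bench_df = pd.DataFrame({"col2": rng.random(100_000)})


def square(a):
    return a ** 2


vectorized_sq = np.vectorize(square)

# Each candidate computes the same result with a different approach
candidates = {
    "pandas apply": lambda: bench_df["col2"].apply(square),
    "NumPy vectorize": lambda: vectorized_sq(bench_df["col2"]),
    "pandas vectorization": lambda: bench_df["col2"] ** 2,
}

timings = {}
for name, func in candidates.items():
    # Run each candidate three times (one call per run) and keep the fastest run
    timings[name] = min(timeit.repeat(func, repeat=3, number=1))
    print(f"{name}: {timings[name]:.4f} seconds")
```

Taking the minimum of repeated runs reduces the influence of background activity on the machine, which makes the comparison between the three approaches easier to interpret.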