Parallelize NumPy Array Operations for Increased Speed
Speed up your array operations with methods you may not have known before.

The larger our dataset, the longer it takes to process. We can always wait for the whole process to finish, but sometimes it takes too long to meet business requirements. That’s why there are cases where we need to speed things up.
While NumPy is already fast, parallelization can push the speed further, especially when processing large arrays.
This article will discuss various methods for parallelizing the NumPy array operations process.
Curious about them? Let’s get into it!
Parallelization in NumPy
Parallelization means dividing a task into smaller chunks and executing them simultaneously across multiple cores or processes. NumPy already provides parallelization at several levels under the hood (for example, through the multithreaded BLAS library it links against), but we can also parallelize explicitly to improve speed.
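To make the idea concrete, here is a minimal sketch in plain NumPy (no parallel libraries yet): splitting an array into chunks that could each be processed independently, then reassembling the results.
import numpy as np

arr = np.arange(12)

# Split the work into 4 independent chunks.
chunks = np.array_split(arr, 4)

# Each chunk could run on its own core; here we square them
# sequentially just to illustrate the idea.
partial_results = [chunk ** 2 for chunk in chunks]

# Reassembling the chunk results matches operating on the whole array.
result = np.concatenate(partial_results)
assert np.array_equal(result, arr ** 2)
The parallel libraries below automate exactly this split-compute-combine pattern.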
Let’s try parallelizing NumPy array operations. First, let’s install the packages we will use in this tutorial.
pip install joblib numexpr dask
With the packages installed, we should first establish a baseline. The code below times a plain, single-threaded NumPy array operation.
import numpy as np
import time

# Two large random arrays, 100 million elements each.
arr1 = np.random.rand(100_000_000)
arr2 = np.random.rand(100_000_000)

# Time a plain element-wise multiplication as the baseline.
start = time.time()
result = arr1 * arr2
end = time.time()
print(f"NumPy Process Time: {end - start:.4f} seconds")
Output:
NumPy Process Time: 0.2972 seconds
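As a side note, a single time.time() measurement can be noisy. A minimal sketch using the standard library’s timeit module gives more stable numbers by repeating the run and keeping the best result:
import timeit

# Run the multiplication 5 times and keep the fastest run to
# reduce noise from other processes on the machine.
best = min(timeit.repeat(lambda: arr1 * arr2, number=1, repeat=5))
print(f"Best of 5 runs: {best:.4f} seconds")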
The element-wise multiplication above is already fast on a single thread, but we can still improve it with parallelization.
Let’s start with OpenMP (Open Multi-Processing). OpenMP is a parallel programming interface for multi-threading. Python code doesn’t usually call OpenMP directly, but the native libraries NumPy builds on use it under the hood.
We can control the number of OpenMP threads with the following code.
import os

# Thread-count environment variables are read when the OpenMP runtime
# initializes, so set this before NumPy is first imported.
os.environ["OMP_NUM_THREADS"] = "4"

import numpy as np
import time

start = time.time()
result = arr1 * arr2
end = time.time()
print(f"OpenMP processing time: {end - start:.4f} seconds")
Output:
OpenMP processing time: 0.2956 seconds
There is only a slight improvement in processing speed here; a simple element-wise multiplication is memory-bound, so adding threads rarely helps it much.
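One caveat: since OMP_NUM_THREADS is read when the OpenMP runtime initializes, changing it after NumPy has been imported may have no effect. If you need to adjust thread counts at runtime, the threadpoolctl package (a separate pip install threadpoolctl, not among the packages installed above) is one option; a minimal sketch:
from threadpoolctl import threadpool_limits

# Temporarily cap the native OpenMP/BLAS thread pools at 4 threads
# for everything executed inside this block.
with threadpool_limits(limits=4):
    result = arr1 * arr2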
Next, let’s try parallelization with NumExpr. NumExpr is an expression evaluator that accelerates NumPy array operations using multithreading and better cache utilization. This often makes it faster than plain NumPy while also using less memory.
Let’s try it out with the following code.
import numexpr as ne
import time

start = time.time()
# NumExpr compiles the expression string and evaluates it in chunks
# across multiple threads; arr1 and arr2 are looked up by name
# from the calling frame.
result = ne.evaluate("arr1 * arr2")
end = time.time()
print(f"NumExpr Process Time: {end - start:.4f} seconds")
Output:
NumExpr Process Time: 0.0696 seconds
The processing time is noticeably faster than the NumPy baseline, and NumExpr’s advantage tends to grow as the arrays get larger.
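NumExpr chooses a thread count automatically, but you can also set it explicitly through its API; a short sketch:
import numexpr as ne

# Use 4 worker threads; set_num_threads returns the previous
# setting so it can be restored afterwards.
previous = ne.set_num_threads(4)
result = ne.evaluate("arr1 * arr2")
ne.set_num_threads(previous)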
There are other methods you can use for parallelization. They may not be faster than the previous example, but they can be helpful for managing memory or handling more complex workloads.
For example, we can use Joblib.
from joblib import Parallel, delayed
import numpy as np
import time

# Split each array into 4 chunks, one per worker.
chunks1 = np.array_split(arr1, 4)
chunks2 = np.array_split(arr2, 4)

start = time.time()
# Multiply corresponding chunks in 4 parallel worker processes.
results = Parallel(n_jobs=4)(
    delayed(lambda x, y: x * y)(chunk1, chunk2)
    for chunk1, chunk2 in zip(chunks1, chunks2)
)
# Reassemble the chunk results into a single array.
final_result = np.concatenate(results)
end = time.time()
print(f"Joblib Process Time: {end - start:.4f} seconds")
Output:
Joblib Process Time: 3.5725 seconds
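Much of that time is overhead: with Joblib’s default process-based backend, each chunk has to be serialized and copied to a worker process. For a cheap element-wise operation, a thread-based backend avoids those copies, and since NumPy releases the GIL inside its element-wise loops, threads can still run in parallel. A sketch reusing the chunks from above:
# Reuse chunks1 and chunks2 from the previous snippet.
results = Parallel(n_jobs=4, backend="threading")(
    delayed(np.multiply)(c1, c2) for c1, c2 in zip(chunks1, chunks2)
)
final_result = np.concatenate(results)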
We can also use Dask as an alternative.
import dask.array as da
import time

# Create the arrays as Dask arrays split into chunks of 25 million
# elements; operations then run on the chunks in parallel.
arr1 = da.random.random(100_000_000, chunks=(25_000_000,))
arr2 = da.random.random(100_000_000, chunks=(25_000_000,))

start = time.time()
# Dask builds a lazy task graph; .compute() triggers the parallel run.
result = (arr1 * arr2).compute()
end = time.time()
print(f"Dask Process Time: {end - start:.4f} seconds")
Output:
Dask Process Time: 0.8063 seconds
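The snippet above generates the data directly as Dask arrays. If you already have NumPy arrays in memory, you can wrap them instead; a short sketch:
import dask.array as da
import numpy as np

arr_np = np.random.rand(100_000_000)

# Wrap an existing NumPy array, telling Dask how to chunk it.
darr = da.from_array(arr_np, chunks=25_000_000)
result = (darr * darr).compute()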
Joblib or Dask may be the better choice when the processing is more complex or the data doesn’t fit in memory, but NumExpr showed the best speed in our testing.
Conclusion
NumPy operations are already fast under the hood, but we can make them even faster with parallelization. In this article, we discussed several parallelization methods; the biggest speed-up in our testing came from NumExpr.
I hope this has helped!
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.