Polars vs. Pandas: A Performance Showdown for Data Processing

Introduction

This report compares the performance of two libraries in the Python ecosystem, Polars and Pandas. The experiments measure the time taken to read data files in several formats: CSV, JSON, Parquet and Excel.

The experiments also measure the time taken for column selection, row filtering and sorting of dataframes.

Each experiment was repeated on the same dataset, and the resulting timings were compared statistically.
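A minimal sketch of how repeated timings might be collected is shown below. The helper name `time_repeated`, the repeat count and the stand-in workload are all assumptions; the report does not describe its actual timing harness.

```python
import statistics
import time


def time_repeated(fn, repeats=10):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; in practice fn would wrap a read, write, select,
# filter or sort call from the library under test.
timings = time_repeated(lambda: sorted(range(100_000), reverse=True), repeats=5)
mean_seconds = statistics.mean(timings)
print(f"mean over {len(timings)} runs: {mean_seconds:.4f} s")
```

Keeping every individual duration, rather than only the mean, is what makes a later significance test on the two libraries' timings possible.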

Key Takeaways

1. A comparison of Polars and Pandas performance

Polars outperformed Pandas in execution time for reading data, writing data and other dataframe operations. The difference was validated by a t-test on repeated experiments over the same dataset.

2. Enhanced reading and writing speed

Data reading - Polars took an average of 0.0234 seconds to read data from CSV files, whereas Pandas took 0.3506 seconds. Similarly, Polars took an average of 1.5653 seconds to read data from an Excel file, whereas Pandas took 23.2387 seconds.

Data writing - For CSV data, Polars took an average of 0.0327 seconds, while Pandas took 0.0472 seconds. With p-values below 0.05, the statistical tests confirmed Polars' advantage.

3. Trustworthy testing and methodology

The dataset consists of 100,000 customer records in 12 columns. Performance was compared on a machine with 16.0 GB of RAM (15.7 GB usable) and a 12th Gen Intel(R) Core(TM) i5-1235U 1.30 GHz processor. A robust statistical approach, based on repeated experiments and significance testing, made the results reliable.
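The takeaways above rely on a t-test over repeated timings. The report does not say which variant was used, so the sketch below assumes Welch's unequal-variance form and computes only the test statistic and degrees of freedom with the standard library. The timing values are placeholders for illustration, not measurements from the report.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(
        se2_a + se2_b
    )
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
    return t_stat, df


# Placeholder timings in seconds (illustrative only).
polars_times = [0.021, 0.024, 0.023, 0.025, 0.022]
pandas_times = [0.34, 0.36, 0.35, 0.33, 0.37]

t_stat, df = welch_t(polars_times, pandas_times)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Converting the statistic into a p-value requires the t-distribution; if SciPy is available, `scipy.stats.ttest_ind(polars_times, pandas_times, equal_var=False)` returns both the statistic and the p-value directly.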

Conclusion

The comparison of performance between Polars and Pandas indicates that Polars consistently outperformed Pandas in data reading, writing, selecting, filtering and sorting operations on a dataset of 100,000 customer records. Polars can be recommended over Pandas for tasks where execution time is crucial.

About the Authors

Kumara has 6 years of industry experience with Python, data engineering and highly scalable system design, and works on designing ETL pipelines, data warehouses and highly scalable APIs.

Kalinga has been working as an intern for the past 4 months and has a strong understanding of Python and SQL. Kalinga helps tackle day-to-day problems and implement solutions for data systems.
