Pandas is an awesome Python library for wrangling your data, but it can only get you so far. When you start moving into the Big Data space, PySpark is much more effective at accomplishing what you want. This post aims to help you migrate what you know about Pandas over to PySpark.

If you are new to Spark, check out this post about Databricks, and go spin up a cluster to play around.

Apache Spark and PySpark

Before we get going, let’s take a step back and talk about Apache Spark. Spark is a fast, general-purpose engine for large-scale data processing. Spark uses distributed computing to achieve higher speeds on large datasets. When you submit a request to Spark, the driver node distributes the workload to a number of worker nodes, which process parts of the request in parallel. Think of it as an improvement on the original MapReduce model.
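
If you are working in Databricks or another notebook attached to a cluster, a SparkSession named spark is already created for you. If you want to experiment locally instead, a minimal sketch of creating one yourself looks something like this (the master setting and app name here are just placeholders):

# Minimal local SparkSession; on Databricks the `spark` object already exists
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('local[*]')            # run on all local cores; a real cluster sets this for you
    .appName('pandas_to_pyspark')  # placeholder app name
    .getOrCreate()
)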

The Spark DataFrame is a key component of Spark: it is a distributed dataset of tabular data. The DataFrame is also immutable, meaning that any change you make to it creates a new object and leaves the old version unchanged. Finally, the DataFrame is lazy, meaning that any transformations or calculations you apply to it are not actually executed until output is requested.
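
To see what that laziness means in practice, here is a small sketch (assuming df is a Spark DataFrame with an mpg column, like the examples later in this post): transformations such as filter() only build up an execution plan, and nothing runs until you call an action such as count() or show().

# Transformation: returns a new DataFrame, but Spark does not compute anything yet
high_mpg = df.filter(df.mpg > 25)

# Action: this is the point where Spark actually executes the plan
high_mpg.count()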

PySpark vs. Pandas

Despite being similar, the two libraries have their own nuances when it comes to accomplishing the same tasks. Here we will look at some of the differences between them, and how we can use them in similar ways.

| Action | Pandas | PySpark |
| --- | --- | --- |
| Load CSV | df = pd.read_csv('csv_file.csv') | df = spark.read.options(header=True).csv('csv_file.csv') |
| View DataFrame | df or df.head(10) | df.show() or df.show(10) |
| View columns and data types | df.columns and df.dtypes | df.columns and df.dtypes |
| Rename columns | df.columns = ['a', 'b', 'c'] or df.rename(columns={'old': 'new'}) | df.toDF('a', 'b', 'c') or df.withColumnRenamed('old', 'new') |
| Drop column | df.drop('column A', axis=1) | df.drop('column A') |
| Add column | df['inverse'] = 1 / df.acol | df.withColumn('inverse', 1 / df.acol) |
| Fill nulls | df.fillna(0) | df.fillna(0) |
| Aggregation | df.groupby(['col_a', 'col_b']).agg({'miles': 'sum', 'cars': 'mean'}) | df.groupBy(['col_a', 'col_b']).agg({'miles': 'sum', 'cars': 'mean'}) |

These are the most common functions you will use when first getting your data into your environment. Now we are going to take a look at some more in-depth wrangling methods. But first, we need to import PySpark's SQL functions module.

import pyspark.sql.functions as F

Now that that is done, we can continue with our quest.

Standard Transformations

# Pandas
import numpy as np

df['log_mpg'] = np.log(df.mpg)

# PySpark
df = df.withColumn('log_mpg', F.log(df.mpg))

Row Conditional Statements

# Pandas
df['cond'] = df.apply(lambda x:
    1 if x.mpg > 25 else 2 if x.cylinders == 6 else 3,
    axis=1)

# PySpark
df = df.withColumn('cond',
    F.when(df.mpg > 25, 1)
     .when(df.cylinders == 6, 2)
     .otherwise(3))

Python Functions

# Pandas
df['plus1'] = df.cars.apply(lambda x: x+1)

# PySpark
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

fn = F.udf(lambda x: x + 1, DoubleType())
df = df.withColumn('plus1', fn(df.cars))

If we want to use Python to apply our own logic to a column, we need to define a user-defined function (or UDF). This is different from the approach you would take in Pandas, because Spark needs a defined function that it can serialize and send out to the worker nodes.
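
You do not have to use a lambda, either. A regular named Python function works just as well, which can be easier to read once the logic grows. Here is a minimal sketch using the same cars column as above:

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

def add_one(x):
    # plain Python logic that Spark will serialize and ship to the worker nodes
    return x + 1.0

add_one_udf = F.udf(add_one, DoubleType())
df = df.withColumn('plus1', add_one_udf(df.cars))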

Joining DataFrames

Joining two DataFrames together is very similar in both libraries.

# Pandas
left.merge(right, on='key')
left.merge(right, left_on='a', right_on='b')

# PySpark
left.join(right, on='key')
left.join(right, left.a == right.b)
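
Both libraries also let you pick the join type with the how argument; for example, a left join looks like this:

# Pandas
left.merge(right, on='key', how='left')

# PySpark
left.join(right, on='key', how='left')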

Summary Statistics

One of the most useful methods when you first start looking at a dataset is the Pandas df.describe() method. The PySpark equivalent is very similar; since its result is itself a DataFrame, you call .show() to display it.

# Pandas
df.describe()

# PySpark
df.describe().show()
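
If you only care about a few columns, the PySpark describe() also accepts column names; for example, using the mpg and cylinders columns from the earlier examples:

# PySpark
df.describe('mpg', 'cylinders').show()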

SQL

Writing SQL queries directly against your data is something you can only do in PySpark. I have found this to be one of my favorite things about PySpark, as there are many operations that are easier to express in SQL.

# PySpark

# First we need to register the DataFrame as a temporary view
df.createOrReplaceTempView('my_data')

# Now you can query the data and, for example, save the results to a new DataFrame
df2 = spark.sql('SELECT * FROM my_data')
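
Because the temp view behaves like any other SQL table, you can push more involved logic into the query itself. Here is a sketch that reuses the cylinders and mpg columns from the earlier examples to aggregate directly in SQL:

# Group and filter directly in SQL, then get the results back as a DataFrame
df2 = spark.sql('''
    SELECT cylinders, AVG(mpg) AS avg_mpg
    FROM my_data
    GROUP BY cylinders
    HAVING AVG(mpg) > 20
''')
df2.show()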

Conclusion

Now that you are “proficient” in PySpark, go out and have fun.

Make sure you take advantage of pyspark.sql.functions and the other modules that are packaged along with PySpark. If you are into data science, you should take some time to look into the Spark MLlib library, as it contains the most commonly used machine learning algorithms, optimized to run on a distributed cluster.
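
As a small taste of what that looks like, here is a minimal sketch of fitting a linear regression with the DataFrame-based MLlib API. The feature columns (cylinders, weight) and label column (mpg) are just placeholders for whatever is in your own data:

# Minimal MLlib sketch; column names are placeholders for your own data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=['cylinders', 'weight'], outputCol='features')
train = assembler.transform(df).select('features', 'mpg')

lr = LinearRegression(featuresCol='features', labelCol='mpg')
model = lr.fit(train)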

Thank you for taking the time to read this. Let me know in the comment section below if you have any questions or concerns. Have fun playing around with PySpark!

I have half a decade of experience working in data science and data engineering across a variety of fields, both professionally and in academia. I have demonstrated advanced skills in developing machine learning algorithms, econometric models, intuitive visualizations, and reporting dashboards to communicate data and technical terminology in an easy-to-understand manner for clients of varying backgrounds.
