Tag

Databricks

Browsing

Pandas in Python is an awesome library to help you wrangle with your data, but it can only get you so far. When you start moving into the Big Data space, PySpark is much more effective in accomplishing what you want. This post aims at helping you migrate what you know about Pandas to PySpark. If you are new to Spark, checkout this post about Databricks, and go spin up a cluster to play around. Apache Spark and PySpark Before we get going, let’s take a step back and talk about Apache Spark. Spark is a fast and general engine for large-scale data processing. Spark uses distributed computing to accomplish higher speeds on large datasets. When you submit a request to Spark, the driver node distributes the workload to a number of worker nodes who processes parts of the request in parallel. Think of it as an improvement to original…