Pandas in Python is an awesome library to help you wrangle your data, but it can only get you so far. When you move into the Big Data space, PySpark is much more effective at accomplishing what you want. This post aims to help you migrate what you know about Pandas to PySpark. If you are new to Spark, check out this post about Databricks and go spin up a cluster to play around.

Apache Spark and PySpark

Before we get going, let's take a step back and talk about Apache Spark. Spark is a fast, general engine for large-scale data processing. Spark uses distributed computing to achieve higher speeds on large datasets. When you submit a request to Spark, the driver node distributes the workload to a number of worker nodes, which process parts of the request in parallel. Think of it as an improvement to original…
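To make the migration concrete, here is a sketch of the same filter-and-aggregate step in both libraries. The pandas half below is runnable; the PySpark equivalent is shown in comments and assumes an already-created SparkSession named `spark` (a hypothetical setup, not from the original post).

```python
import pandas as pd

# Toy data for illustration (made up for this example)
df = pd.DataFrame({"city": ["Denver", "Denver", "Boulder"],
                   "sales": [100, 250, 75]})

# pandas: filter rows, then aggregate per city
totals = df[df["sales"] > 80].groupby("city")["sales"].sum()

# PySpark equivalent (sketch, assuming an existing SparkSession `spark`):
# sdf = spark.createDataFrame(df)
# totals = (sdf.filter(sdf.sales > 80)
#              .groupBy("city")
#              .agg({"sales": "sum"}))

print(totals.to_dict())  # {'Denver': 350}
```

The API shapes are deliberately similar, but PySpark builds a lazy plan that the driver distributes across the worker nodes, whereas pandas executes eagerly on a single machine.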
With the rise of cloud computing and big data, columnar databases have increased in popularity. One of the main reasons is their efficiency for analytical queries, which makes them a natural fit for business intelligence tools. This post aims to identify the key differences between these two database types and point you in the right direction for your future data warehouse.
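As a toy illustration of the difference (not tied to any specific database engine), the same table can be laid out row-wise or column-wise; an analytical query like a sum over one column only needs to scan that one column in the columnar layout:

```python
# Row store: each record's fields live together. A SUM(revenue)
# query must walk every row and pick out one field from each.
rows = [
    {"id": 1, "region": "West", "revenue": 100},
    {"id": 2, "region": "East", "revenue": 250},
    {"id": 3, "region": "West", "revenue": 75},
]
row_total = sum(r["revenue"] for r in rows)

# Column store: each column's values live together, so the same
# query scans only the single column it needs.
columns = {
    "id": [1, 2, 3],
    "region": ["West", "East", "West"],
    "revenue": [100, 250, 75],
}
col_total = sum(columns["revenue"])

print(row_total, col_total)  # 425 425
```

Both layouts return the same answer; the columnar one simply touches far less data per analytical query, which is where the efficiency comes from.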
Databricks launched a new open source product at the Spark AI Summit 2019 called Delta Lake. Databricks touts Delta Lake as bringing ACID transactions to Apache Spark and big data workloads.
I recently had to connect my Azure Databricks instance to our Azure Data Lake Storage (Generation 1) and ran into some problems getting everything set up. I am sure I am not the only one out there having these problems, so if you do as well, here is a little guide to get you connected.
The purpose of a pie chart is to tell a story about the parts-to-whole aspect of your data. In other words, it describes how big one part is compared to other parts. This all sounds like a visualization with a good purpose, so why shouldn’t you use it?
Join Kara Annanie and me at the Denver SQL User Group meeting at Microsoft’s offices in Denver on Thursday. We will be presenting on Azure Databricks and how you can get started using it. https://www.meetup.com/Denver-SQL-Server-User-Group/events/jnhjcqyzgbxb/
This post describes the purpose of SQL window functions and how to use them. If you have looked them up in the official documentation, you might have noticed they can be quite difficult to understand, but I am here to clarify a few things and help you make sense of them.
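As a quick taste of what a window function does, here is a minimal example run through Python's built-in sqlite3 module (SQLite 3.25+ supports window functions); the table and data are made up for illustration. Unlike GROUP BY, the window function keeps every row while attaching the per-group total alongside it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("West", 100), ("West", 250), ("East", 75)])

# SUM(...) OVER (PARTITION BY region) computes the region total
# per partition without collapsing the rows, unlike GROUP BY.
rows = conn.execute("""
    SELECT region,
           amount,
           SUM(amount) OVER (PARTITION BY region) AS region_total
    FROM sales
""").fetchall()

for region, amount, region_total in rows:
    print(region, amount, region_total)
```

Notice that all three rows survive: each West row carries the West total of 350, and the East row carries its own total of 75.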