A logistic regression is a model that is appropriate to use when the dependent variable is binary, i.e. 0 or 1, True or False, Yes or No. Logistic regression belongs to the regression analysis family and can therefore be used as a predictive analytics model.
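As a rough illustration of a model with a binary outcome, here is a minimal sketch of a one-feature logistic regression fitted with plain gradient descent. The data (hours studied vs. passed or failed), the function names, the learning rate, and the epoch count are all made up for this example; in practice you would more likely reach for a library such as scikit-learn or statsmodels.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and intercept b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error (predicted probability minus label) per sample
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy data: hours studied and whether the exam was passed (1) or not (0)
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)


def predict(x):
    """Return the predicted class (0 or 1) using a 0.5 probability cutoff."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


print(predict(0.5), predict(4.0))
```

The model outputs a probability, and the 0.5 cutoff turns that probability into one of the two classes.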
Pandas is an awesome Python library for wrangling your data, but it can only get you so far. When you start moving into the Big Data space, PySpark is much more effective at accomplishing what you want. This post aims to help you migrate what you know about Pandas to PySpark. If you are new to Spark, check out this post about Databricks, and go spin up a cluster to play around.

Apache Spark and PySpark

Before we get going, let’s take a step back and talk about Apache Spark. Spark is a fast and general engine for large-scale data processing. Spark uses distributed computing to achieve higher speeds on large datasets. When you submit a request to Spark, the driver node distributes the workload to a number of worker nodes, which process parts of the request in parallel. Think of it as an improvement to original…
I recently had to connect my Azure Databricks instance to our Azure Data Lake Storage (Generation 1) and was running into some problems getting everything set up. I am sure I am not the only one out there having these problems so if you do as well, here is a little guide to get you connected.
This post describes what SQL window functions are for and how to use them. If you have looked up window functions in the official documentation, you might have noticed that they can be quite difficult to understand, but I am here to clarify a few things and help you make sense of them.
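To give a flavor of what a window function does, here is a small sketch that computes a running total per group using Python's built-in `sqlite3` module. The `sales` table and its columns are invented for illustration, and the example assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# In-memory database with a hypothetical sales table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 100), ("east", 2, 150), ("west", 1, 200), ("west", 2, 50)],
)

# SUM(...) OVER (...) keeps every row and adds a running total per region,
# unlike GROUP BY, which would collapse each region into a single row.
rows = conn.execute(
    """
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
    """
).fetchall()

for row in rows:
    print(row)
```

`PARTITION BY` defines the group each row belongs to, and `ORDER BY` inside the `OVER` clause defines the order in which the total accumulates.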
Jupyter Notebook is by far my all-time favorite tool. It is the go-to tool for data exploration for any data scientist or data analyst out there.
Before we get started, let’s answer the question “what is
The goal of the platform is to make collaboration easier for teams on large projects. Git increases speed, integrity, and workflow efficiency. Git is used by more than 1,700 companies around the world (through different git platforms such as GitHub and GitLab).
K-Means clustering is one of the most popular unsupervised machine learning algorithms. It is a clustering algorithm, meaning its purpose is to group unlabeled data by shared qualities and characteristics.
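The core idea can be sketched in a few lines of plain Python: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. The points, the starting centroids, and the fixed iteration count below are assumptions chosen for illustration; real implementations (for example in scikit-learn) add smarter initialization and convergence checks.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to the point (squared distance)."""
    distances = [
        (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2 for c in centroids
    ]
    return distances.index(min(distances))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl
            else c
            for cl, c in zip(clusters, centroids)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical unlabeled 2D data with two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
initial_centroids = [points[0], points[3]]

centroids, labels = kmeans(points, initial_centroids)
print(labels)
```

Note that no labels are supplied up front: the group numbers come entirely from how close the points are to each other.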
Word clouds are a popular way of “summarizing” a large repository or corpus of text. In this tutorial we will take Winston Churchill’s speech “We Shall Fight on the Beaches” and create the word cloud pictured below.
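A word cloud is driven by word frequencies: the more often a word appears, the larger it is drawn. As a small sketch of that counting step, the snippet below tallies words in one well-known passage from the speech using only the standard library. The stop-word list is a tiny hand-picked assumption, and the actual drawing is left to a plotting library.

```python
import re
from collections import Counter

# One passage from the speech, used here as a tiny sample corpus
text = (
    "we shall fight on the beaches, we shall fight on the landing grounds, "
    "we shall fight in the fields and in the streets, we shall fight in the "
    "hills; we shall never surrender"
)

# Hypothetical minimal stop-word list; real projects use a much longer one
stop_words = {"we", "shall", "on", "the", "in", "and"}

# Lowercase the text, split it into words, and drop the stop words
words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
frequencies = Counter(words)

print(frequencies.most_common(3))
```

The resulting counts are what decide how large each word is drawn in the cloud.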