In this blog post I will lay out five reasons you should consider Databricks before starting your next data science project.
First of all, what is Databricks? Databricks was founded in 2013 by the creators of Apache Spark, a distributed computing framework originally developed as a faster alternative to Hadoop's MapReduce model. Databricks itself is a web-based platform for working with Spark that very much mimics Jupyter-style notebooks.
1. Spark and PySpark
Databricks is built on Spark, which is engineered for performance and scale and enables you to dive into extremely large datasets for your AI and ML models. Spark also comes with built-in libraries for machine learning (MLlib), Spark SQL and DataFrames, and easy integrations with many data sources and common tools and frameworks such as Keras and TensorFlow.
2. In-Memory Computing
Spark was engineered for performance, and with its in-memory computing, Spark can be orders of magnitude faster than Hadoop MapReduce, which writes intermediate results to disk between steps.
3. Collaborative Workspace
Databricks provides a collaborative workspace where your data scientists, engineers, and analysts can work together to accomplish more in less time. Databricks can transform your streaming or static datasets from a variety of data sources, let your data scientists develop machine learning models against the transformed data, and turn their findings into beautiful visualizations in third-party tools such as Power BI or Tableau.
4. Multiple Languages
Databricks, unlike most notebook-style platforms, lets you mix multiple languages in the same notebook, so you can choose the optimal language for each step of your process. The supported notebook languages are Python, SQL, Scala, and R. Keep in mind that if raw performance is your main priority, the JVM language Scala is generally your best bet, as non-JVM languages can pay a serialization penalty, particularly when using custom user-defined functions.
5. Scaling in the Cloud
Major cloud providers such as AWS and Azure offer Databricks as a service and let you easily scale your Databricks clusters, adding or removing worker nodes to match and distribute your workload.
When I wrote this, my favorite cloud provider for Databricks was Azure because of its first-party managed offering, Azure Databricks.