I recently had to connect my Azure Databricks instance to our Azure Data Lake Storage (Generation 1) and ran into some problems getting everything set up. I am sure I am not the only one running into these problems, so if you are too, here is a short guide to get you connected.

If you are new to Databricks, check out this post on 5 of its benefits!

In this post we will be covering the following:

  • Azure Data Lake Storage (ADLS)
  • What You Need
  • Azure Databricks Connection

Azure Data Lake Storage (ADLS)

If you are not familiar with Azure Data Lake Storage (gen 1), here is a little snippet from Microsoft’s website:

Azure Data Lake Storage Gen 1 (formerly Azure Data Lake Store, also known as ADLS) is an enterprise-wide hyper-scale repository for big data analytic workloads. Azure Data Lake Storage Gen1 enables you to capture data of any size, type, and ingestion speed in a single place for operational and exploratory analytics. Azure Data Lake Storage Gen1 is specifically designed to enable analytics on the stored data and is tuned for performance for data analytics scenarios.

What You Need

For the purposes of this post, we will be using Python in Databricks, and we will look at two different ways to access our data in the Data Lake: mounting the lake and accessing a specific file directly.

The first piece of information we need is the name of your data lake. You can find it on the landing page for your data lake; it is highlighted in the screenshot below.

Notice the name of my datalake: my-datalake.

In order to connect these two systems, we need to create an Identity to access the ADLS. The general steps to accomplish this are as follows:

  1. Create an Azure AD web application
  2. Retrieve the client ID, client secret, and token endpoint for the web application
  3. Configure access for the Azure AD web application on the Data Lake resources.

Follow this link for an in-depth guide on how to accomplish the above steps.

You should now have the following pieces:

  • Client ID
  • Client Secret
  • Token Endpoint
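As a minimal sketch of how these three pieces fit together, the snippet below groups them in one place. It assumes the token endpoint has the same shape as the URLs used later in this post; the names `AdlsCredentials` and `token_endpoint` are hypothetical helpers, not part of any Azure or Databricks API.

```python
from typing import NamedTuple


class AdlsCredentials(NamedTuple):
    """The three pieces gathered from the Azure AD web application."""
    client_id: str
    client_secret: str
    token_endpoint: str


def token_endpoint(directory_id: str) -> str:
    """Build the OAuth token endpoint URL for a directory (tenant) ID."""
    return f"https://login.microsoftonline.com/{directory_id}/oauth2/token"


# Placeholder values only; substitute your own.
creds = AdlsCredentials(
    client_id="<your-service-client-id>",
    client_secret="<client-secret>",
    token_endpoint=token_endpoint("<your-directory-id>"),
)
print(creds.token_endpoint)
```

Keeping the pieces together like this makes it easier to spot a missing value before you try to connect.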

Azure Databricks Connection

Now that we have all the required pieces of information, we can go ahead and dive into Databricks.

Mounting Data Lake

The first option we mentioned is to mount the data lake as a drive in Databricks. To accomplish this, create a new notebook in Databricks and copy the following code into it.

# OAuth settings for the Azure AD web application (fill in your own values)
configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
           "dfs.adls.oauth2.client.id": "<your-service-client-id>",
           "dfs.adls.oauth2.credential": "<client-secret>",
           "dfs.adls.oauth2.refresh.url": "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"}

# Mount the Data Lake directory under /mnt/<mount-name>
dbutils.fs.mount(
  source = "adl://<your-data-lake-store-account-name>.azuredatalakestore.net/<your-directory-name>",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
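Mounting a path that is already mounted raises an error, so re-running the notebook can fail. One way around this, sketched here under the assumption that you pass the notebook's `dbutils` object in yourself, is to check the existing mounts first. The helper names `adl_source` and `mount_if_missing` are hypothetical.

```python
def adl_source(account_name: str, directory: str = "") -> str:
    """Build the adl:// URI for a Gen1 account and an optional directory."""
    return f"adl://{account_name}.azuredatalakestore.net/{directory.lstrip('/')}"


def mount_if_missing(dbutils, source: str, mount_point: str, configs: dict) -> bool:
    """Mount `source` at `mount_point` unless something is already mounted there.

    Returns True if a new mount was created, False if one already existed.
    """
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_point in existing:
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)
    return True


# In a notebook (the "raw" directory and mount name are illustrative):
# mount_if_missing(dbutils, adl_source("my-datalake", "raw"), "/mnt/raw", configs)
```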

Here you will need to fill in the pieces of information we gathered in their respective places in the code. Alternatively, to keep the client secret out of the notebook, you can store it in Databricks Secrets and retrieve the value with the following:

dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")
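To show where that call would slot in, here is a small sketch that builds the same settings dictionary but pulls the client secret from a secret scope. It assumes `dbutils` is passed in from the notebook; `configs_with_secret` is a hypothetical helper name, and the scope and key are whatever you created.

```python
def configs_with_secret(dbutils, client_id: str, directory_id: str,
                        scope: str, key: str) -> dict:
    """Build the ADLS OAuth settings, reading the client secret from Databricks Secrets."""
    return {
        "dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
        "dfs.adls.oauth2.client.id": client_id,
        # The secret value never appears in the notebook source.
        "dfs.adls.oauth2.credential": dbutils.secrets.get(scope=scope, key=key),
        "dfs.adls.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }


# In a notebook:
# configs = configs_with_secret(dbutils, "<your-service-client-id>",
#                               "<your-directory-id>", "<scope-name>", "<key-name>")
```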

After you run this code, you will be able to access your Data Lake just like your regular Databricks file system.
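In practice, the same mounted file can be addressed in a few different ways depending on which API you use. The sketch below is based on my understanding of how DBFS paths usually work (your cluster setup may differ, particularly for the local `/dbfs` path); `mount_paths` is a hypothetical helper and the file name is made up.

```python
def mount_paths(mount_name: str, relative_path: str = "") -> dict:
    """Return the common path forms for a file under a DBFS mount."""
    rel = relative_path.strip("/")
    base = f"/mnt/{mount_name}" + (f"/{rel}" if rel else "")
    return {
        "dbutils": base,          # e.g. dbutils.fs.ls(...)
        "spark": f"dbfs:{base}",  # e.g. spark.read.csv(...)
        "local": f"/dbfs{base}",  # e.g. Python's open(...) on the driver
    }


paths = mount_paths("<mount-name>", "events.csv")
print(paths["spark"])
```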

A “Data Lake” – Photo by Kyle Roxas

Access File Directly using Spark APIs

The other option is to skip the mount and access the files in your data lake directly using the Spark API in Databricks.

Just like above, we still need to set the tokens, IDs, and endpoints:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "<your-service-client-id>")
spark.conf.set("dfs.adls.oauth2.credential", "<client-secret>")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")

After you have provided these credentials, you can now easily read from Azure Data Lake Storage Gen1 using Spark APIs:
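A minimal sketch of such a read, assuming a CSV file with a header row and the notebook's `spark` session passed in. The helper name `read_csv_from_adls` and the example file path are hypothetical.

```python
def read_csv_from_adls(spark, account_name: str, file_path: str):
    """Read a CSV file directly from ADLS Gen1 into a Spark DataFrame."""
    uri = f"adl://{account_name}.azuredatalakestore.net/{file_path.lstrip('/')}"
    # header/inferSchema are standard DataFrameReader.csv options.
    return spark.read.csv(uri, header=True, inferSchema=True)


# In a notebook (the file path is illustrative):
# df = read_csv_from_adls(spark, "my-datalake", "raw/events.csv")
# df.show(5)
```

Other formats work the same way: swap `spark.read.csv` for `spark.read.parquet` or `spark.read.json` and keep the `adl://` URI.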



Connecting your Azure Databricks instance to your Azure Data Lake Store opens up the door to do so much more with Databricks. I wrote this because I was running into permissions issues with the Data Lake and Identities, so if you are for some reason unable to access your data lake, go back a few steps and make sure you followed the instructions to set up your identity correctly.

Thank you for reading. Drop me a comment below if you have any questions or anything to add.

I have half a decade of experience working with data science and data engineering in a variety of fields, both professionally and in academia. I have demonstrated advanced skills in developing machine learning algorithms, econometric models, intuitive visualizations, and reporting dashboards in order to communicate data and technical terminology in an easy-to-understand manner for clients of varying backgrounds.

