FileStore: Easier Data Access in Azure Databricks

The Tech Platform
Apr 8, 2024

Updated: Apr 15, 2024

Azure Databricks is a cloud-based platform that empowers you to perform large-scale data analytics using Apache Spark. It provides a collaborative workspace with built-in data processing, exploration, and visualization tools.

A crucial component of Azure Databricks is the Databricks File System (DBFS). Imagine DBFS as a secure, distributed file system that integrates with your workspace. It is central storage for data in your notebooks, making it readily accessible for processing and analysis.

However, DBFS is designed to be used from within the Azure Databricks environment. What if you need to open a specific file directly in a web browser? This is where FileStore comes in. FileStore is a special folder within DBFS whose files can be reached from a web browser, so you can view or download them without extra copy steps.

Let's learn how FileStore simplifies data accessibility within Azure Databricks workflows.

The Challenge: Limited Data Accessibility in DBFS

The Databricks File System (DBFS) is the primary way to store and access data from notebooks in Azure Databricks. Notebooks can read and write files stored in DBFS with a few lines of code.

A. Uploading Data to DBFS

You can copy a file from the driver's local filesystem into DBFS using Scala code:

// Copy a file from the driver's local filesystem to DBFS.
// Note: dbutils.fs.put(file, contents, overwrite) writes a string to a file; it does not upload a local file.
dbutils.fs.cp("file:/path/to/local/file", "dbfs:/path/to/store/file")

// Example: copying a CSV file named "data.csv" to dbfs:/data
dbutils.fs.cp("file:/databricks/driver/data.csv", "dbfs:/data/data.csv")
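The same DBFS file is spelled differently depending on the tool. Spark and dbutils take `dbfs:/...` paths, while local file APIs on the driver (for example Python's `open`) see DBFS under the `/dbfs/...` mount. The sketch below converts between the two spellings. The helper names `to_dbfs_uri` and `to_fuse_path` are hypothetical and are not part of any Databricks API.

```python
def to_dbfs_uri(path: str) -> str:
    """Return the dbfs:/ form used by Spark and dbutils."""
    if path.startswith("dbfs:/"):
        return path
    # Assumption: a leading /dbfs/ always means the local mount,
    # not a DBFS folder that happens to be named "dbfs".
    if path == "/dbfs" or path.startswith("/dbfs/"):
        path = path[len("/dbfs"):]
    return "dbfs:/" + path.lstrip("/")


def to_fuse_path(path: str) -> str:
    """Return the /dbfs/ local-mount form used by local file APIs."""
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return path
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    return "/dbfs/" + path.lstrip("/")


print(to_dbfs_uri("/dbfs/data/data.csv"))
print(to_fuse_path("dbfs:/FileStore/images/chart.png"))
```

Keeping the conversion in one place helps avoid a common mistake: passing a `/dbfs/...` path to Spark, which would look for a DBFS folder literally named `dbfs`.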

B. Accessing Data in DBFS from Notebooks

Once your data resides in DBFS, you can access it within your notebooks using two primary methods:

1. Using dbutils.fs.head and spark.read (Scala)

// Read the beginning of a text file from DBFS (up to 64 KB by default)
val fileData = dbutils.fs.head("dbfs:/data/data.txt")

// Read a CSV file from DBFS and create a DataFrame
val dataDF = spark.read
  .format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("dbfs:/data/data.csv")

2. Using Spark DataFrames

Spark DataFrames are a powerful tool for data processing in Azure Databricks. You can directly specify the DBFS path while loading data into a DataFrame:

# Read a CSV file from DBFS and create a DataFrame (Python)
# Spark expects the dbfs:/ form; the /dbfs/ prefix is only for local file APIs
dataDF = spark.read.csv("dbfs:/data/data.csv", header=True, inferSchema=True)

# Read a JSON file from DBFS and create a DataFrame
jsonDataDF = spark.read.json("dbfs:/data/data.json")

C. Processing and Analyzing Data

With your data loaded into a DataFrame, you can leverage Spark's transformations and functions to process, analyze, and visualize your data.

Key Points:

DBFS offers a secure and scalable storage solution for your data within Azure Databricks.
Uploading data to DBFS can be done through the UI or programmatically using code.
Both Scala and Python offer functionalities to access data from DBFS within notebooks.
Spark DataFrames provide a powerful way to manipulate and analyze data stored in DBFS.

Azure Databricks relies on the Databricks File System (DBFS) as the workhorse for storing and accessing data within notebooks. It's a capable system, but it has one gap: direct web browser accessibility. While DBFS excels at managing data used in notebooks, you can't open individual files in general DBFS folders directly through a web browser.

This limitation caused a few problems for data scientists:

Download-Upload Issue: To view or analyze a data file outside the Azure Databricks environment, you had to download it to your local machine, which could be time-consuming for large datasets. If you changed the file, you then had to upload it to DBFS again. This back-and-forth was repetitive and made it easy to lose track of which copy was current.
Collaboration Hurdle: Sharing data files with colleagues was difficult. You couldn't simply send them a link; they had to retrieve the file from DBFS themselves. This extra step slowed down teamwork.
Security Considerations: Every downloaded copy of a file sits outside the workspace's access controls, so passing sensitive data around this way added a risk of unauthorized access and made data security harder to manage.

To address these limitations and improve data accessibility, FileStore was introduced. It acts as a special folder within DBFS that bridges the gap by allowing you to open specific data files directly in your web browser through a URL, provided you are signed in to the workspace.

This means:

No more Download-Upload: You can open or download a file in your browser from its URL, without a separate export step.
Seamless Collaboration: You can send colleagues who have access to the workspace a link to the file, so everyone works from the same copy.
Maintains Security: FileStore operates within the Azure Databricks workspace, so files are served only to signed-in workspace users. Anything you place there is visible to those users, so avoid storing sensitive data in FileStore.

What is FileStore in Azure Databricks?

FileStore is a special folder within DBFS that bridges this gap by allowing you to access specific files directly through a web browser.

FileStore vs. DBFS

While both FileStore and DBFS serve data storage purposes within Azure Databricks, a key distinction lies in their accessibility:

Feature	DBFS	FileStore
Primary Purpose	Data storage for notebooks	Making selected files reachable from a web browser
Accessibility	Accessible through the Databricks UI or code only.	Accessible through a web browser, the Databricks UI, and code
Ideal Use Case	Storing data used for processing and analysis.	Displaying images/libraries with displayHTML. Downloading specific data files for further analysis
Location	Centralized storage system	A special folder within DBFS (dbfs:/FileStore)
Access from Notebooks	Read and written from code, for example with Spark DataFrames or dbutils.	Also readable and writable from code, like any other DBFS folder.
Permissions	Governed by workspace and DBFS access controls, which apply to access from code.	Inherits DBFS access controls; files are also served in the browser to signed-in workspace users. Consider potential security implications before placing sensitive data here.

Here's how FileStore works:

STEP 1. Uploading Files to FileStore

There are two main ways to upload files to FileStore:

Using the Databricks UI: Open your Azure Databricks workspace. Click the "Files" tab and select "FileStore" from the left-hand menu. Click the "Upload" button and choose the files you want to add to FileStore.
Using Scala Code:

// Copy a file from the driver's local filesystem to FileStore
dbutils.fs.cp("file:/path/to/local/file", "dbfs:/FileStore/path/to/store/file")

// Example: copying a file named "image.jpg" to /FileStore/images
dbutils.fs.cp("file:/databricks/driver/image.jpg", "dbfs:/FileStore/images/image.jpg")

STEP 2. Accessing Files in FileStore through Web Browser

Once uploaded, you can access files in FileStore directly through a web browser while signed in to the workspace. The /FileStore prefix of the DBFS path becomes /files in the URL. Here's the format:

https://<databricks-instance>/files/<path/to/file>

Example:

Imagine you uploaded an image named "chart.png" to the /FileStore/images folder. You can access and display this image within your notebook using the following HTML code:

<img src="https://<databricks-instance>/files/images/chart.png" alt="Chart" width="400" height="300">
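The mapping from a FileStore path to its browser URL can be sketched as a small function. This assumes the URL form given in the Azure Databricks documentation, `https://<databricks-instance>/files/<path>`. The function name `filestore_url` and the instance name used in the example are hypothetical.

```python
from urllib.parse import quote


def filestore_url(instance: str, dbfs_path: str) -> str:
    """Map a path under /FileStore to its browser URL under /files/."""
    # Accept "dbfs:/FileStore/..." as well as "/FileStore/..."
    path = dbfs_path[len("dbfs:"):] if dbfs_path.startswith("dbfs:") else dbfs_path
    prefix = "/FileStore/"
    if not path.startswith(prefix):
        raise ValueError(f"not a FileStore path: {dbfs_path!r}")
    relative = path[len(prefix):]
    # Percent-encode spaces and other unsafe characters, keeping "/" separators
    return f"https://{instance}/files/{quote(relative)}"


# Hypothetical instance name, for illustration only
print(filestore_url("adb-1234567890.1.azuredatabricks.net",
                    "/FileStore/images/chart.png"))
```

Rejecting paths outside /FileStore makes the limitation explicit: files elsewhere in DBFS have no browser URL of this form.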

Benefits of FileStore

Simplified Visualization: FileStore allows you to embed images, libraries, or other web-accessible files within your notebooks using HTML, enhancing data visualization capabilities.
Easy Downloading: You can download files from FileStore to your local machine by opening their URL in a browser, without going through an export step in the Databricks UI.
Improved Collaboration: Sharing data files with colleagues becomes easier. You can share the web browser link to the FileStore location with anyone who has access to the workspace.

Important Note:

While FileStore offers web browser accessibility, it's intended for specific use cases like visualization or downloading files. For core data processing tasks within notebooks, accessing data directly from DBFS using libraries like Spark DataFrames remains the preferred approach.

Use Cases for FileStore

You can use FileStore in two ways:

Rendering Visuals and Libraries with displayHTML
Simplifying Downloading Output Files

1. Rendering Visuals and Libraries with displayHTML

Imagine you've created a custom visualization library or have an insightful image that would perfectly complement your analysis. Once such a file is saved in FileStore, you can reference it directly from the HTML you pass to the displayHTML function, with no need to host it anywhere else.

Example: Displaying an Image from FileStore

# Upload the image "chart.png" to /FileStore/images beforehand

# Reference the image by its FileStore URL (/FileStore maps to /files) using displayHTML
displayHTML("""
<img src="https://<databricks-instance>/files/images/chart.png" alt="Analysis Chart" width="500" height="300">
""")

This code snippet displays the uploaded image "chart.png" directly within your notebook, enhancing clarity and enriching your data exploration. Similarly, you can reference JavaScript libraries stored in FileStore using the <script> tag within displayHTML.
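When the image URL or alt text comes from a variable, it is safer to build the tag with escaping rather than by pasting strings together. The following is a minimal sketch. The helper name `img_tag` is hypothetical, and `displayHTML` itself is only available inside a Databricks notebook, so the call is shown as a comment.

```python
from html import escape


def img_tag(src: str, alt: str, width: int, height: int) -> str:
    """Build an <img> tag with attribute values escaped for HTML."""
    return (
        f'<img src="{escape(src, quote=True)}" '
        f'alt="{escape(alt, quote=True)}" '
        f'width="{int(width)}" height="{int(height)}">'
    )


# Assumed relative URL of a file saved at /FileStore/images/chart.png
html = img_tag("files/images/chart.png", "Analysis Chart", 500, 300)
print(html)

# In a Databricks notebook:
# displayHTML(html)
```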

2. Simplifying Downloading Output Files

Data analysis often involves generating reports or intermediate results for further exploration outside Azure Databricks. Traditionally, you'd need to use the Databricks UI to download these files to your local machine. FileStore offers a more streamlined approach.

Downloading a File from FileStore

Access the desired file in the FileStore UI within your Azure Databricks workspace.
Right-click on the file and choose "Download."

Alternatively, you can use dbutils within your notebooks to copy files from FileStore to the driver's local disk:

# Copy a file from FileStore to the driver's local directory
dbutils.fs.cp("dbfs:/FileStore/path/to/file", "file:/databricks/driver/downloaded_file.csv")

# Example: copying a CSV file named "results.csv"
dbutils.fs.cp("dbfs:/FileStore/output/results.csv", "file:/databricks/driver/analysis_results.csv")

This code copies the "results.csv" file from /FileStore/output and saves it as "analysis_results.csv" in the driver's local directory within the Databricks environment, not on your own machine. You can then work with this copy using your preferred tools on the cluster. Commonly downloaded file formats include CSV (Comma-Separated Values), JSON (JavaScript Object Notation), and Parquet, depending on the nature of your analysis.
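The opposite direction, producing an output file that can later be downloaded, can be sketched with plain Python file APIs. This assumes the cluster exposes DBFS under the `/dbfs` local mount, so that a file written to `/dbfs/FileStore/output` appears at `dbfs:/FileStore/output`. The function name `save_report` and its parameters are hypothetical.

```python
import csv
import os


def save_report(rows, header, name, root="/dbfs/FileStore/output"):
    """Write rows as CSV to <root>/<name> and return the path written."""
    os.makedirs(root, exist_ok=True)
    local_path = os.path.join(root, name)
    # newline="" lets the csv module control line endings itself
    with open(local_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return local_path


# On a Databricks cluster (assumed /dbfs mount):
# save_report([["a", 1], ["b", 2]], ["name", "value"], "results.csv")
```

With the default `root`, the resulting file would be reachable in a browser at `https://<databricks-instance>/files/output/results.csv`. Passing a different `root` makes it possible to try the function outside Databricks.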

Benefits of FileStore for Data Analysis

Enhanced Visualization: FileStore empowers you to seamlessly integrate custom visualizations and libraries, enriching your data exploration capabilities.
Streamlined Downloading: Downloading results for further analysis becomes a breeze, eliminating the need to rely solely on the Databricks UI.
Improved Collaboration: Sharing data files with colleagues is simplified as you can directly share the web browser link to the FileStore location.

Important Considerations When Using FileStore

While FileStore offers a convenient way to access specific data through web browsers, it's essential to understand some additional points to ensure optimal utilization:

As you use certain features, Azure Databricks places files in specific folders within FileStore.

Some common examples include:

/FileStore/jars: This folder contains the library files (such as JARs) that you upload as workspace libraries.
/FileStore/tables: This folder contains the files that you import through the UI, for example when creating a table from an uploaded file.

Warning: Avoid deleting files within these folders. Libraries that reference files in /FileStore/jars may stop working, and tables created from files in /FileStore/tables may no longer be accessible. If you're unsure about the purpose of a file, consult the Azure Databricks documentation or reach out to support for clarification.
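One way to act on this warning in cleanup scripts is to check each path before deleting it. The sketch below is a hypothetical guard, not a Databricks feature: the name `is_protected` and the `PROTECTED` list are assumptions based on the two folders named above.

```python
import posixpath

# Folders named in the warning above; extend as needed
PROTECTED = ("/FileStore/jars", "/FileStore/tables")


def is_protected(dbfs_path: str) -> bool:
    """Return True if the path is one of the protected folders or inside one."""
    path = dbfs_path[len("dbfs:"):] if dbfs_path.startswith("dbfs:") else dbfs_path
    # Normalize so that ".." segments cannot slip past the check
    path = posixpath.normpath("/" + path.lstrip("/"))
    return any(path == p or path.startswith(p + "/") for p in PROTECTED)


for candidate in ["dbfs:/FileStore/jars/lib.jar", "/FileStore/images/chart.png"]:
    print(candidate, is_protected(candidate))

# In a Databricks notebook, a cleanup step could then look like:
# if not is_protected(path):
#     dbutils.fs.rm(path)
```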

Conclusion

FileStore emerges as a valuable companion to DBFS within Azure Databricks. By bridging the gap between web browser accessibility and secure data storage, it empowers data scientists to:

Visualize Effectively: Incorporate custom libraries and images directly within notebooks for richer data exploration.
Simplify Downloading: Download specific data files for further analysis outside Azure Databricks.
Enhance Collaboration: Share data files with colleagues using straightforward web browser links, fostering streamlined teamwork.

FileStore is best suited for specific use cases that benefit from web browser access. Core data processing tasks within notebooks still leverage data directly from DBFS.