In data management, ample storage space is crucial for handling large volumes of data. While the Databricks File System (DBFS) provides a reliable storage solution within the Databricks workspace, there may arise scenarios where additional storage capacity is required. Fortunately, DBFS offers the flexibility to mount extra storage locations, providing users with expanded storage capabilities tailored to their needs.
In this article, we'll look at mounting in DBFS, its significance, and its practical applications. We'll start with what mounting means and why you might mount extra storage in DBFS. Then we'll walk through the steps for mounting additional storage locations, address the challenges that can arise along with strategies for handling them, and close with real-world use cases that illustrate the benefits of mounting storage in DBFS.
What is Mounting in DBFS?
Mounting in DBFS refers to creating a link between a Databricks workspace and cloud object storage. This link lets you interact with cloud object storage using familiar file paths relative to the Databricks file system.
Mounts work by creating a local alias under the /mnt directory that stores the following information:
Location of the cloud object storage.
Driver specifications to connect to the storage account or container.
Security credentials required to access the data.
Why Mount an Extra Storage Location?
Here are some reasons why you might want to mount extra storage locations in DBFS:
Simplify Data Access: Mounting cloud object storage to DBFS simplifies data access patterns for users who are unfamiliar with cloud concepts.
Familiar File Paths: Azure Databricks mounts create a link between a workspace and cloud object storage, enabling you to interact with cloud object storage using familiar file paths relative to the Databricks file system.
Security: Mounts work by creating a local alias under the /mnt directory that stores the location of the cloud object storage, driver specifications to connect to the storage account or container, and security credentials required to access the data.
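Data Governance: Mounts are a legacy access pattern. Databricks recommends migrating away from mounts and managing data governance with Unity Catalog instead, so treat mounts as an option mainly for workspaces that have not yet made that move.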
How to Mount Extra Storage Locations?
STEP 1: Create Storage Container
First, create a storage account using the Azure portal, Azure PowerShell, or the Azure CLI. Then create a container in that account.
A container organizes a set of blobs, similar to a directory in a file system. To create a container in the Azure portal, follow these steps:
In the portal navigation pane on the left side of the screen, select Storage Accounts and choose a storage account.
In the navigation pane for the storage account, scroll to the Data storage section and select Containers.
Within the Containers pane, select the + Container button to open the New Container pane.
Within the New Container pane, provide a Name for your new container. The container name must be lowercase, must start with a letter or number, and can include only letters, numbers, and the dash (-) character. The name must also be between 3 and 63 characters long.
Set the Anonymous access level for the container. The recommended level is Private (no anonymous access).
Select Create to create the container.
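The naming rules in the steps above can be checked before you open the portal. The following is a minimal sketch that encodes only the rules stated in this article; the function name is hypothetical, and Azure applies further checks (for example, on consecutive dashes) that this sketch does not cover.

```python
import re

# Encodes only the rules stated above: 3-63 characters, lowercase letters,
# digits, and dashes, starting with a letter or digit.
_CONTAINER_NAME = re.compile(r"[a-z0-9][a-z0-9-]{2,62}")


def is_valid_container_name(name: str) -> bool:
    """Return True if name satisfies the naming rules listed above."""
    return _CONTAINER_NAME.fullmatch(name) is not None


print(is_valid_container_name("raw-data"))  # True
print(is_valid_container_name("Raw_Data"))  # False
```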
STEP 2: Upload Blobs to the Container
Blob storage supports block blobs, append blobs, and page blobs. You can use Azure Storage Explorer or Azure Portal to upload blobs to the container. In Azure Storage Explorer, you can right-click on the container and select Upload to upload the blob.
STEP 3: Mount with dbutils.fs.mount()
The dbutils.fs.mount() command is used to mount a storage location in Databricks. This command will create a link between a workspace and cloud object storage, enabling you to interact with cloud object storage using familiar file paths relative to the Databricks file system.
Here’s the syntax for the dbutils.fs.mount() command:
dbutils.fs.mount(
    source="<source>",
    mount_point="/mnt/<mount-point>",
    extra_configs={
        "<config-key>": "<config-value>"
    }
)
In this syntax:
<source> is the location of the cloud object storage.
/mnt/<mount-point> is the local alias under the /mnt directory.
<config-key> and <config-value> are the driver specifications and security credentials required to access the data.
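To make the placeholders concrete, here is a sketch of how the arguments might be filled in, assuming an Azure Blob Storage container accessed over wasbs with a storage account key. The helper name, container, account, scope, and key names are all hypothetical; other authentication methods use different configuration keys.

```python
def build_blob_mount_args(container: str, account: str, mount_name: str, account_key: str) -> dict:
    """Assemble keyword arguments for dbutils.fs.mount() (Blob Storage, account-key auth)."""
    host = f"{account}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": f"/mnt/{mount_name}",
        # The config key names the storage account; the value is the account key.
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


# In a Databricks notebook, where dbutils is provided (placeholder names below):
# key = dbutils.secrets.get(scope="my-scope", key="storage-key")
# dbutils.fs.mount(**build_blob_mount_args("raw-data", "mystorageacct", "my-data", key))
```

Reading the key from a secret scope, as in the commented usage, keeps the credential out of the notebook source.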
STEP 4: Verify the Mount Point
After mounting a storage location, you can verify the mount point using the dbutils.fs.mounts() command. This command returns a list of all mount points and their corresponding source URIs.
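As a sketch of that check, the helper below scans the result of dbutils.fs.mounts() for a given mount point and returns its source URI. The helper name is hypothetical, and dbutils is passed in as a parameter so the function can also be exercised outside a notebook.

```python
def find_mount(dbutils, mount_point: str):
    """Return the source URI for mount_point, or None if it is not mounted."""
    target = mount_point.rstrip("/")
    for mount in dbutils.fs.mounts():
        # Each entry exposes mountPoint and source attributes.
        if mount.mountPoint.rstrip("/") == target:
            return mount.source
    return None


# In a notebook:
# print(find_mount(dbutils, "/mnt/my-data"))
```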
STEP 5: List the Contents
The dbutils.fs.ls() command is a method provided by Databricks Utilities (dbutils) for listing the contents of a directory in DBFS. The command displays all the files and directories available in the specified path.
Here’s the syntax for using dbutils.fs.ls():
dbutils.fs.ls("<path>")
In this syntax, <path> is the directory you want to list. The path can be a DBFS path (e.g., “/foo” or “dbfs:/foo”).
The command returns a list of FileInfo objects, each representing a file or a directory. Each FileInfo object has the following attributes:
path: The full path of the file or directory.
name: The name of the file or directory.
size: The size of the file in bytes, or 0 if the path is a directory.
Here’s an example of how you might use dbutils.fs.ls():
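# List the top-level entries under the mount point and print each name
files = dbutils.fs.ls("/mnt/my-data")
for file in files:
    print(file.name)
This code prints the names of all files and directories directly under the “/mnt/my-data” directory.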
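Building on the listing call above, the size attribute can be used for a quick sanity check after mounting. This is a minimal sketch with a hypothetical helper name; it looks only at the top level of the path and does not descend into subdirectories.

```python
def summarize_directory(dbutils, path: str) -> dict:
    """Count the top-level entries under path and total their sizes in bytes."""
    entries = dbutils.fs.ls(path)
    return {
        "entries": len(entries),
        # Directories report a size of 0, so only files contribute here.
        "total_bytes": sum(entry.size for entry in entries),
    }


# In a notebook:
# print(summarize_directory(dbutils, "/mnt/my-data"))
```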
Unmounting Storage Locations
Unmounting a storage location in DBFS involves detaching it from the file system. This means that the data in the storage location is no longer accessible through the mount point in DBFS. Here are some reasons why you might want to unmount a storage location:
Access Control: If you need to restrict access to the data in the storage location, you can unmount it. Once unmounted, the data is no longer accessible through DBFS.
Data Security: If the security credentials used to mount the storage location have been compromised, you should unmount the storage location to prevent unauthorized access.
Storage Management: If you no longer need to access the data in the storage location through DBFS, you can unmount it to simplify your storage management.
Credential Rotation: If the storage relies on a secret that is rotated, expires, or is deleted, errors can occur, such as 401 Unauthorized. To resolve such an error, you must unmount and remount the storage.
Here’s how you can unmount a storage location in DBFS:
dbutils.fs.unmount("/mnt/<mount-name>")
In this syntax, replace <mount-name> with the name of your mount point. This command removes the mount point from DBFS.
To avoid errors, do not unmount a storage location while other jobs are reading from or writing to it. After modifying a mount, always run dbutils.fs.refreshMounts() on all other running clusters to propagate the change.
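Calling dbutils.fs.unmount() on a path that is not mounted raises an error, so a guard can be convenient in scripts that may run more than once. The helper below is a sketch with a hypothetical name; it does not detect whether other jobs are still using the mount, which remains your responsibility.

```python
def unmount_if_mounted(dbutils, mount_point: str) -> bool:
    """Unmount mount_point if it exists; return True if an unmount happened."""
    mounted = any(m.mountPoint == mount_point for m in dbutils.fs.mounts())
    if mounted:
        dbutils.fs.unmount(mount_point)
    return mounted


# In a notebook on the cluster that changes the mount:
# unmount_if_mounted(dbutils, "/mnt/my-data")
# Afterwards, on every other running cluster:
# dbutils.fs.refreshMounts()
```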
Challenges while working with mounted locations in DBFS
Working with mounted locations in DBFS can present several challenges:
Data Governance: Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and instead managing data governance with Unity Catalog.
Security: Improper configuration of mounts can provide unsecured access to all users in your workspace. Therefore, you should check with your workspace and cloud administrators before configuring or altering data mounts.
Data Access: To avoid errors, never modify a mount point while other jobs are reading or writing. After modifying a mount, always run dbutils.fs.refreshMounts() on all other running clusters to propagate any mount updates.
Credential Rotation: If the storage relies on a secret that is rotated, expires, or is deleted, errors such as 401 Unauthorized can occur. To resolve such an error, you must unmount and remount the storage.
Compatibility with Unity Catalog: Unity Catalog secures access to data in external locations by using full cloud URI paths to identify grants on managed object storage directories. DBFS mounts use a different data access model that bypasses Unity Catalog entirely. Do not reuse cloud object storage volumes between DBFS mounts and UC external volumes.
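For the credential rotation case in the list above, the unmount-and-remount sequence might be wrapped as follows. This is a sketch under the assumption that you already have the source URI and the refreshed configuration values; the helper name is hypothetical.

```python
def remount(dbutils, source: str, mount_point: str, extra_configs: dict) -> None:
    """Drop an existing mount, if any, and recreate it with fresh credentials."""
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.unmount(mount_point)
    dbutils.fs.mount(
        source=source,
        mount_point=mount_point,
        extra_configs=extra_configs,
    )
    # Other running clusters still need dbutils.fs.refreshMounts() afterwards.
```

Run it only when no other job is reading from or writing to the mount point.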
Real-world use cases of mounting storage
Here are some real-world use cases of mounting storage in DBFS:
Data Lake Storage: If your organization uses a data lake for storing large amounts of unstructured or semi-structured data, you can mount your data lake storage to DBFS. This allows Databricks to access and analyze the data directly from the data lake.
Machine Learning Workflows: In machine learning workflows, you often need to access large datasets stored in cloud storage. By mounting your cloud storage bucket to DBFS, you can read the data directly into your machine learning models running on Databricks.
Data Migration: If you’re migrating data from one storage system to another, you can mount both storage systems to DBFS and use Databricks to perform the data migration.
Collaboration: If you’re working with a team and want to share data or results, you can mount a shared cloud storage bucket to DBFS. This way, all team members can read from and write to the same shared location.
Backup and Archiving: You can mount your backup storage to DBFS and use Databricks to automate your backup and archiving processes.
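The data migration case above can be sketched with dbutils.fs.cp(), assuming both storage systems are already mounted. The helper name and the mount paths in the usage comment are hypothetical, and large migrations may be better served by other tools than a single recursive copy.

```python
def copy_between_mounts(dbutils, source_dir: str, target_dir: str) -> int:
    """Recursively copy source_dir to target_dir.

    Returns the number of top-level entries now present in target_dir.
    """
    dbutils.fs.cp(source_dir, target_dir, recurse=True)
    return len(dbutils.fs.ls(target_dir))


# In a notebook:
# copy_between_mounts(dbutils, "/mnt/old-storage/sales", "/mnt/new-storage/sales")
```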
Conclusion
Mounting extra storage in DBFS enhances data management capabilities in Databricks, allowing for scalable solutions to accommodate growing data needs. Despite challenges, proactive planning and real-world use cases demonstrate the value of this approach in enabling organizations to innovate and derive actionable insights from their data. Ultimately, by leveraging DBFS to mount additional storage, users can future-proof their data infrastructure and drive business success through data-driven strategies.