A New Way of Managing Deep Learning Datasets

What is Hub?
Hub by Activeloop is an open-source Python package that arranges data in NumPy-like arrays. It integrates smoothly with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. With the Hub API, we can update the data, visualize it, and create machine learning pipelines.
Hub allows us to store images, audio, video, and time-series data in a way that can be accessed at lightning speed. The data can be stored on GCS/S3 buckets, local storage, or the Activeloop cloud, and it can be streamed directly into PyTorch model training, so there is no need to set up separate data pipelines. Hub also comes with data version control, dataset search queries, and distributed workloads.
My experience with Hub was amazing, as I was able to create and push data to the cloud within a couple of minutes. In this blog, we are going to see how Hub can be used to create and manage a dataset, covering:
Initializing a dataset on Activeloop cloud
Processing the images
Pushing the data to the cloud
Data version control
Data visualization
Activeloop Storage
Activeloop provides free storage for open-source and private datasets. You can also earn up to 200 GB of free storage by referring people. Activeloop's Hub interfaces with the Database for AI, which lets us visualize datasets with their labels, and complex search queries let us analyze the data effectively. The platform also hosts more than 100 datasets for image segmentation, classification, and object detection.

To create an account, you can sign up on the Activeloop website or type `!activeloop register`. The command will ask you for a username, password, and email. After successfully creating an account, we will log in using `!activeloop login`. Now we can create and manage cloud datasets directly from a local machine.
If you are using a Jupyter Notebook, use “!”; otherwise, add the commands in the CLI without it.
!activeloop register
!activeloop login -u <username> -p <password>
Initializing a Hub Dataset
In this tutorial, we are going to use the Multi-class Weather dataset from Kaggle, available under CC BY 4.0. The dataset contains four folders based on weather classification: Sunrise, Sunshine, Rain, and Cloudy.
First, we need to install the `hub` and `kaggle` packages. The `kaggle` package lets us download the dataset directly, which we then unzip.
!pip install hub kaggle
!kaggle datasets download -d pratik2901/multiclass-weather-dataset
!unzip multiclass-weather-dataset
In the next step, we will create a Hub dataset on the Activeloop cloud. The `hub.dataset` function can either create a new dataset or access an existing one. You can also provide an AWS bucket address to create the dataset on Amazon's servers. To create a dataset on Activeloop, we pass a URL containing the username and dataset name:
“hub://<username>/<datasetname>”
import hub
ds = hub.dataset('hub://kingabzpro/muticlass-weather-dataset')
Data Preprocessing
We need to prepare the data before converting it into Hub format. The code below extracts the folder names and stores them in the `class_names` variable. In the second part, we create a list of all files available in the dataset folder.
from PIL import Image
import numpy as np
import os

dataset_folder = '/work/multiclass-weather-dataset/Multi-class Weather Dataset'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
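The folder walk can be sanity-checked on a throwaway directory that mimics the same `<class>/<image>` layout (the class and file names below are made up for illustration):

```python
import os
import tempfile

# Build a tiny stand-in for the dataset layout: <root>/<class>/<file>
root = tempfile.mkdtemp()
for cls, files in [('Rain', ['r1.jpg', 'r2.jpg']), ('Cloudy', ['c1.jpg'])]:
    os.makedirs(os.path.join(root, cls))
    for f in files:
        open(os.path.join(root, cls, f), 'w').close()

class_names = sorted(os.listdir(root))  # sorted for a stable order

# Same walk as above: collect every file under every class folder
files_list = []
for dirpath, dirnames, filenames in os.walk(root):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

print(class_names)      # ['Cloudy', 'Rain']
print(len(files_list))  # 3
```

Each entry of `files_list` keeps its class folder in the path, which is exactly what the labeling step below relies on.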
The `file_to_hub` function takes three arguments: a file name, a dataset sample, and the class names. It extracts the label from each image path and converts it into an integer. It also converts image files into NumPy-like arrays and appends them to tensors. For this project, we only need two tensors, one for labels and one for image data.
@hub.compute
def file_to_hub(file_name, sample_out, class_names):
    ## The first two arguments are always default arguments:
    #    1st argument is an element of the input iterable (list, dataset, array, ...)
    #    2nd argument is a dataset sample
    ## Other arguments are optional

    # Find the label number corresponding to the file
    label_text = os.path.basename(os.path.dirname(file_name))
    label_num = class_names.index(label_text)

    # Append the label and image to the output sample
    sample_out.labels.append(np.uint32(label_num))
    sample_out.images.append(hub.read(file_name))

    return sample_out
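The label lookup inside `file_to_hub` depends only on the path layout, so it can be checked in isolation (the file path below is hypothetical):

```python
import os

class_names = ['Cloudy', 'Rain', 'Shine', 'Sunrise']  # assumed folder order

# Hypothetical file path following the <class>/<image> layout
file_name = '/work/multiclass-weather-dataset/Multi-class Weather Dataset/Rain/rain1.jpg'

label_text = os.path.basename(os.path.dirname(file_name))  # parent folder name
label_num = class_names.index(label_text)                  # position in class_names

print(label_text, label_num)  # Rain 1
```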
Let’s create an image tensor with ‘png’ compression and a simple label tensor. Make sure the tensor names match the ones used in the `file_to_hub` function. To learn more about tensors, see API Summary - Hub 2.0.
Finally, we will run the `file_to_hub` function by providing `files_list`, the Hub dataset instance `ds`, and `class_names`. It will take a few minutes, as the data is converted and pushed to the cloud.
with ds:
    ds.create_tensor('images', htype='image', sample_compression='png')
    ds.create_tensor('labels', htype='class_label', class_names=class_names)

    file_to_hub(class_names=class_names).eval(files_list, ds, num_workers=2)
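To see the control flow that `@hub.compute` and `.eval` imply (a per-sample function applied over an input list), here is a toy re-implementation in plain Python — an illustration of the pattern, not Hub's actual code:

```python
import os

def compute(fn):
    """Toy analogue of @hub.compute: bind extra kwargs now, run per-sample later."""
    class Transform:
        def __init__(self, **kwargs):
            self.kwargs = kwargs

        def eval(self, inputs, sample_out):
            for item in inputs:  # 1st argument: one element of the input iterable
                fn(item, sample_out, **self.kwargs)
            return sample_out

    return lambda **kwargs: Transform(**kwargs)

@compute
def add_label(file_name, sample_out, class_names):
    # Same label logic as file_to_hub, appended to a plain list
    label = os.path.basename(os.path.dirname(file_name))
    sample_out.append(class_names.index(label))

labels = add_label(class_names=['Cloudy', 'Rain']).eval(
    ['/data/Rain/a.jpg', '/data/Cloudy/b.jpg'], [])
print(labels)  # [1, 0]
```

In the real API, `sample_out` is a dataset sample with tensors instead of a list, and `.eval` can parallelize the work across `num_workers`.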
Data Visualization
The dataset is now publicly available at multiclass-weather-dataset. We can explore the dataset with its labels or add a description so that others can learn about the license and the distribution of the data. Activeloop is constantly adding new features to improve the viewing experience.

We can also access our dataset using the Python API. We will use PIL’s `Image.fromarray` function to convert an array to an image and display it in a Jupyter notebook.
Image.fromarray(ds["images"][0].numpy())

To access a label, we will use `class_names`, which contains the categorical information, and index it with the value from the "labels" tensor.
class_names = ds["labels"].info.class_names
class_names[ds["labels"][0].numpy()[0]]
>>> 'Cloudy'
Committing
We can also create different branches and manage different versions, just like with Git and DVC. In this section, we are going to update the `class_names` information and create a commit with a message.
ds.labels.info.update(class_names = class_names)
ds.commit("Class names added")
>>> '455ec7d2b49a36c14f3d80d0879369c4d0a70143'
As we can see, the log shows that we have successfully committed the changes to the main branch. To learn more about version control, check out Dataset Version Control - Hub 2.0.
log = ds.log()
---------------
Hub Version Log
---------------
Current Branch: main
Commit : 455ec7d2b49a36c14f3d80d0879369c4d0a70143 (main)
Author : kingabzpro
Time : 2022-01-31 08:32:08
Message: Class names added
You can also view all of your branches and commits using Hub UI.

Gif by author
Conclusion
Hub 2.0 comes with new data management tools that make ML engineers' lives easier. Hub can be integrated with AWS/GCP storage and provides a direct data stream for deep learning frameworks such as PyTorch. It also provides interactive visualization through the Activeloop cloud and version control for tracking ML experiments. I think Hub will become an MLOps solution for data management in the future, as it solves many core issues that data scientists and engineers face daily.
Source: KDNuggets
The Tech Platform