A New Way of Managing Deep Learning Datasets

Create, version-control, query, and visualize image, audio, and video datasets using Hub 2.0 by Activeloop.




 

What is Hub?

 
Hub by Activeloop is an open-source Python package that arranges data in NumPy-like arrays. It integrates smoothly with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. We can update and visualize the data, and create machine learning pipelines, using the Hub API.

Hub lets us store images, audio, video, and time-series data in a format that can be accessed at high speed. The data can be stored in GCS/S3 buckets, on local storage, or on Activeloop cloud. It can be used directly to train a PyTorch model, so you don't need to set up separate data pipelines. Hub also comes with data version control, dataset search queries, and distributed workloads.

My experience with Hub was amazing, as I was able to create and push data to the cloud within a couple of minutes. In this blog, we are going to see how Hub can be used to create and manage a dataset. We will cover:

  • Initializing a dataset on Activeloop cloud
  • Processing the images 
  • Pushing the data to the cloud 
  • Data version control
  • Data visualization 

 

Activeloop Storage

 
Activeloop provides free storage for both open-source and private datasets. You can also earn up to 200 GB of free storage by referring people. Hub interfaces with Activeloop's Database for AI, which lets us visualize datasets with their labels and run complex search queries to analyze the data effectively. The platform also hosts more than 100 datasets for image segmentation, classification, and object detection.

 

Activeloop’s Database for AI

 

To create an account, sign up on the Activeloop website or run `!activeloop register`. The command will ask for a username, password, and email. After creating the account, log in with `!activeloop login`. Now we can create and manage cloud datasets directly from a local machine.

If you are using a Jupyter Notebook, prefix the commands with “!”; otherwise, run them in the CLI without it.

!activeloop register
!activeloop login -u <username> -p <password>


 

Initializing a Hub Dataset

 
In this tutorial, we are going to use the Kaggle dataset Multi-class Weather Dataset, released under CC BY 4.0. The dataset contains four folders, one per weather class: Sunrise, Sunshine, Rain, and Cloudy.

First, we need to install the hub and kaggle packages. The kaggle package lets us download the dataset directly; we then unzip the archive.

!pip install hub kaggle
!kaggle datasets download -d pratik2901/multiclass-weather-dataset
!unzip multiclass-weather-dataset


In the next step, we will create a Hub dataset on the Activeloop cloud. The dataset function can either create a new dataset or access an existing one. You can also provide an AWS bucket address to create the dataset on Amazon S3. To create a dataset on Activeloop, we need to pass a URL containing the username and dataset name:

“hub://<username>/<datasetname>”

import hub
ds = hub.dataset('hub://kingabzpro/muticlass-weather-dataset')
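
The path format above can be assembled and checked with plain string handling before it is passed to `hub.dataset`. The sketch below is only an illustration: `activeloop_path` and `storage_backend` are hypothetical helpers, not part of Hub, and the `s3://` and `gcs://` prefixes are assumptions that should be verified against the Hub documentation.

```python
def activeloop_path(username: str, dataset_name: str) -> str:
    """Build a 'hub://<username>/<datasetname>' path (hypothetical helper)."""
    for part in (username, dataset_name):
        # Reject empty components and stray slashes that would change the path depth.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"hub://{username}/{dataset_name}"


def storage_backend(path: str) -> str:
    """Guess where a dataset lives from its path prefix (assumed prefixes)."""
    prefixes = {
        "hub://": "Activeloop cloud",
        "s3://": "Amazon S3",
        "gcs://": "Google Cloud Storage",
    }
    for prefix, backend in prefixes.items():
        if path.startswith(prefix):
            return backend
    # Anything without a known prefix is treated as a local folder.
    return "local storage"


path = activeloop_path("kingabzpro", "muticlass-weather-dataset")
print(path, "->", storage_backend(path))
```

The resulting string is what gets passed as the single argument to `hub.dataset`.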


 

Data Preprocessing

 
We need to prepare the data before converting it into Hub format. The code below extracts the folder names and stores them in the `class_names` variable. In the second part, we create a list of all files in the dataset folder.

from PIL import Image
import numpy as np
import os

dataset_folder = '/work/multiclass-weather-dataset/Multi-class Weather Dataset'

class_names = os.listdir(dataset_folder)

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
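
One caveat with the listing step is that `os.listdir` returns entries in arbitrary order and `os.walk` picks up every file, not only images. A slightly more defensive variant, sketched below, sorts the class folders so label indices stay reproducible and keeps only image files. The function name `list_classes_and_images` and the extension set are assumptions made for illustration, and the demo runs on a throwaway temporary folder rather than the real dataset.

```python
import os
import tempfile

# Assumed image extensions; adjust to match the files in your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_classes_and_images(dataset_folder):
    """Return (sorted class folder names, list of image file paths)."""
    class_names = sorted(
        name
        for name in os.listdir(dataset_folder)
        if os.path.isdir(os.path.join(dataset_folder, name))
    )
    files_list = []
    for dirpath, _dirnames, filenames in os.walk(dataset_folder):
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lower()
            if extension in IMAGE_EXTENSIONS:
                files_list.append(os.path.join(dirpath, filename))
    return class_names, files_list


# Demo on a temporary folder that mimics the class-per-folder layout.
with tempfile.TemporaryDirectory() as root:
    for folder, filename in [("Rain", "rain1.jpg"), ("Cloudy", "cloudy1.jpg"), ("Cloudy", "notes.txt")]:
        os.makedirs(os.path.join(root, folder), exist_ok=True)
        open(os.path.join(root, folder, filename), "w").close()

    demo_classes, demo_files = list_classes_and_images(root)
    print(demo_classes)     # ['Cloudy', 'Rain']
    print(len(demo_files))  # 2
```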


The file_to_hub function takes three arguments: file name, dataset sample, and class names. It extracts the label of each image from its parent folder name and converts it into an integer. It also reads image files into NumPy-like arrays and appends them to tensors. For this project, we only need two tensors, one for labels and one for image data.

@hub.compute
def file_to_hub(file_name, sample_out, class_names):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample
    # Other arguments are optional
    
    # Find the label number corresponding to the file
    label_text = os.path.basename(os.path.dirname(file_name))
    label_num = class_names.index(label_text)
    
    # Append the label and image to the output sample
    sample_out.labels.append(np.uint32(label_num))
    sample_out.images.append(hub.read(file_name))
    
    return sample_out
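
The label lookup inside the function can be tried on its own, without Hub, to see how a file path turns into an integer. This is a minimal sketch: `label_index` is a hypothetical helper, the example path is made up, and the class list simply reuses the four folder names mentioned earlier in sorted order.

```python
import os


def label_index(file_name, class_names):
    """Map a file path to the index of its parent folder in class_names."""
    # The parent folder name is the class label, e.g. ".../Rain/rain12.jpg" -> "Rain".
    label_text = os.path.basename(os.path.dirname(file_name))
    try:
        return class_names.index(label_text)
    except ValueError:
        raise ValueError(f"{file_name!r} is not inside a known class folder") from None


example_classes = ["Cloudy", "Rain", "Sunrise", "Sunshine"]
print(label_index("/data/weather/Rain/rain12.jpg", example_classes))  # prints 1
```

Raising a descriptive error for unknown folders makes it easier to spot stray files before the upload starts.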


Let’s create an image tensor with ‘png’ compression and a simple label tensor. Make sure the tensor names match the ones used in the file_to_hub function. To learn more about tensors, see API Summary - Hub 2.0.

Finally, we will run the file_to_hub function by providing files_list, the Hub dataset instance “ds”, and class_names. It will take a few minutes, as the data is converted and pushed to the cloud.

with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'png')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
    file_to_hub(class_names=class_names).eval(files_list, ds, num_workers = 2)
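
Because file_to_hub appends to tensors named `images` and `labels`, a mismatch in tensor names only shows up once the transform is running. A small pre-flight check along these lines can catch it earlier. `missing_tensors` is a hypothetical helper, and the note below assumes that the dataset exposes its tensor names as the keys of `ds.tensors`.

```python
# Tensor names that file_to_hub appends to.
REQUIRED_TENSORS = ("images", "labels")


def missing_tensors(existing_names, required=REQUIRED_TENSORS):
    """Return the required tensor names that are absent from existing_names."""
    existing = set(existing_names)
    return [name for name in required if name not in existing]


# A typo such as "label" instead of "labels" is reported before any upload starts.
print(missing_tensors(["images", "label"]))  # prints ['labels']
```

With a real dataset, the call would look like `missing_tensors(ds.tensors)`, and an empty list means the names line up.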


 

Data Visualization

 
The dataset is now publicly available at multiclass-weather-dataset. We can explore the dataset with labels or add a description so that others can learn about license information and the distribution of the data. Activeloop is constantly adding new features to improve the viewing experience.

 

Image by author | muticlass-weather-dataset

 

We can also access our dataset using the Python API. We will use PIL’s Image.fromarray function to convert an array to an image and display it in a Jupyter notebook.

Image.fromarray(ds["images"][0].numpy())



 

To access the label, we will use class_names, which contains the category names, and index it with the value from the "labels" tensor.

class_names = ds["labels"].info.class_names
class_names[ds["labels"][0].numpy()[0]]
>>> 'Cloudy'


Committing

 
We can also create branches and manage different versions, as in Git and DVC. In this section, we are going to update the class_names information and create a commit with a message.

ds.labels.info.update(class_names = class_names)

ds.commit("Class names added")
>>> '455ec7d2b49a36c14f3d80d0879369c4d0a70143'


As the log shows, we have successfully committed the changes to the main branch. To learn more about version control, check out Dataset Version Control - Hub 2.0.

log = ds.log()
---------------
Hub Version Log
---------------

Current Branch: main

Commit : 455ec7d2b49a36c14f3d80d0879369c4d0a70143 (main) 
Author : kingabzpro
Time   : 2022-01-31 08:32:08
Message: Class names added
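
The commit above lands on the main branch. To keep experimental changes separate, the same commit can be made on a new branch instead. The helper below is a sketch under assumptions: `commit_on_branch` is a hypothetical wrapper, it assumes the dataset object offers `checkout(name, create=True)` for creating a branch alongside the `commit(message)` call used above, and it should be checked against the Hub version control documentation before use.

```python
def commit_on_branch(ds, branch, message):
    """Create `branch`, switch to it, and commit pending changes there.

    `ds` is expected to behave like a Hub dataset: it needs a
    `checkout(name, create=True)` method (assumed) and a `commit(message)`
    method that returns the commit id.
    """
    ds.checkout(branch, create=True)
    return ds.commit(message)
```

A call such as `commit_on_branch(ds, "relabel", "Class names added")` would then return the new commit id, where the branch name `relabel` is only an example.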


You can also view all of your branches and commits using the Hub UI.

 


 

Conclusion

 
Hub 2.0 comes with new data management tools that make ML engineers' lives easier. Hub can be integrated with AWS/GCP storage and provides a direct data stream for deep learning frameworks such as PyTorch. It also provides interactive visualization through the Activeloop cloud and version control for tracking ML experiments. I think Hub will become an MLOps solution for data management in the future, as it solves many core issues that data scientists and engineers face daily.

In this blog, we have learned about Hub and how to create and push data to the Activeloop cloud. The next natural step is to use the same dataset to train a model and deploy it to production. If you are interested in learning more and want to train an image classification model, check out Training an Image Classification Model in PyTorch.

 

Deep learning Projects Using Hub

 

 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.