A New Way of Managing Deep Learning Datasets
Create, version-control, query, and visualize image, audio, and video datasets using Hub 2.0 by Activeloop.
Image by author
What is Hub?
Hub by Activeloop is an open-source Python package that arranges data in NumPy-like arrays. It integrates smoothly with deep learning frameworks such as TensorFlow and PyTorch for faster GPU processing and training. We can update the data, visualize it, and create machine learning pipelines using the Hub API.
Hub allows us to store images, audio, video, and time-series data in a way that can be accessed at lightning speed. The data can be stored in GCS/S3 buckets, on local storage, or on the Activeloop cloud. It can be used directly to train a PyTorch model, so you don't need to set up data pipelines. Hub also comes with data version control, dataset search queries, and distributed workloads.
My experience with Hub was amazing, as I was able to create and push data to the cloud within a couple of minutes. In this blog, we are going to see how Hub can be used to create and manage a dataset:
- Initializing a dataset on Activeloop cloud
- Processing the images
- Pushing the data to the cloud
- Data version control
- Data visualization
Activeloop Storage
Activeloop provides free storage for both open-source and private datasets. You can also earn up to 200 GB of free storage by referring people. Hub interfaces with Activeloop's Database for AI, which lets us visualize datasets with their labels and run complex search queries to analyze the data effectively. The platform also hosts more than 100 datasets for image segmentation, classification, and object detection.
Activeloop’s Database for AI
To create an account, you can sign up on the Activeloop website or type `!activeloop register`. The command will ask you for a username, password, and email. After successfully creating an account, we will log in using `!activeloop login`. Now we can create and manage cloud datasets directly from a local machine.
If you are using a Jupyter Notebook, prefix the commands with `!`; otherwise, type them directly in the CLI without it.
!activeloop register
!activeloop login -u <username> -p <password>
Initializing a Hub Dataset
In this tutorial, we are going to use the Kaggle dataset Multi-class Weather, licensed under CC BY 4.0. The dataset contains four folders, one per weather class: Sunrise, Sunshine, Rain, and Cloudy.
First, we need to install the hub and kaggle packages. The kaggle package will allow us to download the dataset directly and unzip it.
!pip install hub kaggle
!kaggle datasets download -d pratik2901/multiclass-weather-dataset
!unzip multiclass-weather-dataset
In the next step, we will create a Hub dataset on the Activeloop cloud. The `hub.dataset` function can either create a new dataset or access an existing one. You can also provide an AWS bucket address to create the dataset on Amazon S3. To create a dataset on Activeloop, we need to pass a URL containing the username and dataset name:
`hub://<username>/<datasetname>`
import hub

ds = hub.dataset('hub://kingabzpro/muticlass-weather-dataset')
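The URL pattern above can be assembled with a small helper. This is only a sketch: `activeloop_path` is a hypothetical name, not part of the Hub API, and the check it performs (no empty parts, no slashes) is an assumption about what a sensible path looks like rather than a rule taken from the Hub documentation.

```python
def activeloop_path(username, dataset_name):
    """Build a 'hub://<username>/<datasetname>' string (hypothetical helper)."""
    for part in (username, dataset_name):
        # Assumed sanity check: each part must be non-empty and contain no '/'.
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return f"hub://{username}/{dataset_name}"


# The username and dataset name used in this tutorial.
dataset_path = activeloop_path("kingabzpro", "muticlass-weather-dataset")
print(dataset_path)
```

The resulting string is what gets passed to `hub.dataset`.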
Data Preprocessing
We need to prepare the data before converting it into Hub format. The code below extracts the folder names and stores them in the `class_names` variable. In the second part, we create a list of all the files available in the dataset folder.
from PIL import Image
import numpy as np
import os

dataset_folder = '/work/multiclass-weather-dataset/Multi-class Weather Dataset'

# Each sub-folder name is a class label
class_names = os.listdir(dataset_folder)

# Collect the full path of every file in the dataset folder
files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))
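One detail worth knowing: `os.listdir` returns entries in arbitrary order, so the integer assigned to each class can differ between machines. The sketch below is a variation on the code above, not the author's method. It keeps only sub-directories and sorts them so the label indices are reproducible. `scan_dataset_folder` is a hypothetical helper, and the temporary folder tree with its file names is a stand-in for the real dataset.

```python
import os
import tempfile


def scan_dataset_folder(dataset_folder):
    """Return (class_names, files_list) for a folder-per-class dataset."""
    # Keep only sub-directories and sort them so label indices are stable.
    class_names = sorted(
        name
        for name in os.listdir(dataset_folder)
        if os.path.isdir(os.path.join(dataset_folder, name))
    )
    files_list = []
    for dirpath, _dirnames, filenames in os.walk(dataset_folder):
        for filename in filenames:
            files_list.append(os.path.join(dirpath, filename))
    return class_names, sorted(files_list)


# Build a tiny stand-in folder tree (hypothetical file names) to try it out.
with tempfile.TemporaryDirectory() as root:
    for label in ("Sunrise", "Sunshine", "Rain", "Cloudy"):
        os.makedirs(os.path.join(root, label))
        open(os.path.join(root, label, f"{label.lower()}1.jpg"), "w").close()
    demo_class_names, demo_paths = scan_dataset_folder(root)
    # Store paths relative to the root so they stay readable after cleanup.
    demo_files = [os.path.relpath(path, root) for path in demo_paths]

print(demo_class_names)
print(demo_files)
```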
The `file_to_hub` function takes three arguments: a file name, a dataset sample, and the class names. It derives the label of each image from the name of its parent folder and converts it into an integer. It also reads the image file into a NumPy-like array and appends it to the tensor. For this project, we only need two tensors, one for labels and one for image data.
@hub.compute
def file_to_hub(file_name, sample_out, class_names):
    ## First two arguments are always default arguments containing:
    #     1st argument is an element of the input iterable (list, dataset, array,...)
    #     2nd argument is a dataset sample
    # Other arguments are optional

    # Find the label number corresponding to the file
    label_text = os.path.basename(os.path.dirname(file_name))
    label_num = class_names.index(label_text)

    # Append the label and image to the output sample
    sample_out.labels.append(np.uint32(label_num))
    sample_out.images.append(hub.read(file_name))

    return sample_out
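The label lookup inside `file_to_hub` can be tried on its own, without Hub installed. The snippet below is a minimal sketch of just that step. `label_from_path`, the example class list, and the example path are illustrative names that do not appear in the original code.

```python
import os


def label_from_path(file_name, class_names):
    """Map a file to an integer label using the name of its parent folder."""
    label_text = os.path.basename(os.path.dirname(file_name))
    # list.index raises ValueError if the folder is not a known class.
    return class_names.index(label_text)


# Hypothetical class list and file path, for illustration only.
example_classes = ["Cloudy", "Rain", "Sunrise", "Sunshine"]
example_path = os.path.join("Multi-class Weather Dataset", "Rain", "rain1.jpg")
example_label = label_from_path(example_path, example_classes)
print(example_label)
```

Because `list.index` raises `ValueError` for an unknown folder, a stray file outside the class folders would stop the conversion, so it is worth checking `files_list` before uploading.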
Let's create an image tensor with 'png' compression and a simple label tensor. Make sure the tensor names match the ones used in the `file_to_hub` function. To learn more about tensors: API Summary - Hub 2.0
Finally, we will run the `file_to_hub` function by providing `files_list`, the Hub dataset instance `ds`, and `class_names`. It will take a few minutes, as the data will be converted and pushed to the cloud.
with ds:
    ds.create_tensor('images', htype = 'image', sample_compression = 'png')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)

    file_to_hub(class_names=class_names).eval(files_list, ds, num_workers = 2)
Data Visualization
The dataset is now publicly available at multiclass-weather-dataset. We can explore the dataset with its labels or add a description so that others can learn more about the license and the distribution of the data. Activeloop is constantly adding new features to make the viewing experience better.
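If you want to include the distribution of the data in the description, one way to compute it is to count the files per class folder before uploading. This is a sketch that uses only the standard library. `class_distribution` is a hypothetical helper and the sample paths are made up for illustration.

```python
import os
from collections import Counter


def class_distribution(file_paths):
    """Count files per class, where the class is the parent folder name."""
    return Counter(os.path.basename(os.path.dirname(path)) for path in file_paths)


# Made-up paths standing in for the real files_list.
sample_paths = [
    os.path.join("data", "Cloudy", "cloudy1.jpg"),
    os.path.join("data", "Cloudy", "cloudy2.jpg"),
    os.path.join("data", "Rain", "rain1.jpg"),
    os.path.join("data", "Sunrise", "sunrise1.jpg"),
]
distribution = class_distribution(sample_paths)

# A one-line summary that could be pasted into a dataset description.
summary = ", ".join(
    f"{label}: {count}" for label, count in sorted(distribution.items())
)
print(summary)
```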
Image by author | muticlass-weather-dataset
We can also access our dataset using the Python API. We will use PIL's `Image.fromarray` function to convert an array to an image and display it in a Jupyter notebook.
Image.fromarray(ds["images"][0].numpy())
To access the label, we will use `class_names`, which contains the categorical information, and index it with the value from the "labels" tensor.
class_names = ds["labels"].info.class_names
class_names[ds["labels"][0].numpy()[0]]
>>> 'Cloudy'
Committing
We can also create different branches and manage different versions, as in Git and DVC. In this section, we are going to update the `class_names` information and create a commit with a message.
ds.labels.info.update(class_names = class_names)
ds.commit("Class names added")
>>> '455ec7d2b49a36c14f3d80d0879369c4d0a70143'
As we can see, the log shows that we have successfully committed the changes to the main branch. To learn more about version control, check out Dataset Version Control - Hub 2.0.
log = ds.log()

---------------
Hub Version Log
---------------

Current Branch: main
Commit : 455ec7d2b49a36c14f3d80d0879369c4d0a70143 (main)
Author : kingabzpro
Time   : 2022-01-31 08:32:08
Message: Class names added
You can also view all of your branches and commits using the Hub UI.
Gif by author
Conclusion
Hub 2.0 comes with new data management tools that make ML engineers' lives easier. Hub can be integrated with AWS/GCP storage and provides a direct data stream to deep learning frameworks such as PyTorch. It also provides interactive visualization through the Activeloop cloud and version control for tracking ML experiments. I think Hub will become an MLOps solution for data management in the future, as it solves many of the core issues that data scientists and engineers face daily.
In this blog, we have learned about Hub and how to create and push data to the Activeloop cloud. The next natural step is to use the same dataset to train a model and deploy it to production. So, if you are interested in learning more and want to train an image classification model, check out Training an Image Classification Model in Pytorch.
Deep Learning Projects Using Hub
- Creating Object Detection Datasets
- Creating Complex Datasets
- Creating Video Datasets
- Creating Time-Series Datasets
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.