There and Back Again… a RAPIDS Tale

This blog post explores the challenges of acquiring sufficient data and the limitations posed by biased datasets using RapidsAI cuDF.

By Kris Manohar, Executive Director at ICPC on June 14, 2023 in Machine Learning

By Kris Manohar & Kevin Baboolal

Image by Editor

Editor's Note: We are thrilled to announce that this post has been selected as the winner of the KDnuggets & NVIDIA Blog Writing Contest.

Introduction

Machine learning has revolutionized various domains by leveraging vast amounts of data. However, there are situations where acquiring sufficient data becomes a challenge due to cost or scarcity. In such cases, traditional approaches often struggle to provide accurate predictions. This blog post explores the limitations posed by small datasets and unveils an innovative solution proposed by TTLAB that harnesses the power of the nearest neighbor approach and a specialized kernel. We will delve into the details of their algorithm, its benefits, and how GPU optimization accelerates its execution.

The Challenge of Limited Data

In machine learning, having a substantial amount of data is crucial for training accurate models. However, when faced with a small dataset comprising only a few hundred rows, the shortcomings become evident. One common issue is the zero frequency problem encountered in some classification algorithms such as the Naive Bayes Classifier. This occurs when the algorithm encounters an unseen category value during testing, leading to a zero probability estimation for that case. Similarly, regression tasks face challenges when the test set contains values that were absent in the training set. You may even find your choice of algorithm is improved (though sub-optimal) when these missing features are excluded. These issues also manifest in larger datasets with highly imbalanced classes.

Overcoming Data Scarcity

Although train-test splits often mitigate these issues, there remains a hidden problem when dealing with smaller datasets. Forcing an algorithm to generalize based on fewer samples can lead to suboptimal predictions. Even if the algorithm runs, its predictions may lack robustness and accuracy. The simple solution of acquiring more data is not always feasible due to cost or availability constraints. In such situations, an innovative approach proposed by TTLAB proves to be robust and accurate.

The TTLAB Algorithm

TTLAB's algorithm tackles the challenges posed by biased and limited datasets. Their approach involves taking the weighted average of all rows in the training dataset to predict the value for a target variable in a test sample. The key lies in adjusting the weights of each training row for every test row, based on a parameterized non-linear function that calculates the distance between two points in the feature space. Although the weighting function used has a single parameter (the rate of decay of influence of a training sample as its distance from the test sample increases), the computing effort to optimize over this parameter could be large. By considering the entire training dataset, the algorithm delivers robust predictions. This approach has shown remarkable success in enhancing the performance of popular models such as random forests and naive Bayes. As the algorithm gains popularity, efforts are underway to further enhance its efficiency. The current implementation involves tuning the hyperparameter kappa, which requires a grid search. To expedite this process, a successive quadratic approximation is being explored, promising faster parameter optimization. Additionally, ongoing peer reviews aim to validate and refine the algorithm for broader adoption.

To implement the TTLAB algorithm for classification for loops and numpy proved inefficient resulting in very long runtimes. The CPU implementation showcased in the linked publication focuses on classification problems, demonstrating the versatility and efficacy of the approach. https://arxiv.org/pdf/2205.14779.pdf. The publication also shows that the algorithm benefits greatly from vectorization, hinting at further speed improvements that can be gained from GPU acceleration with CuPy. In fact, to perform hyper-parameter tuning and random K-folds for result validation would have taken weeks for the multitude of datasets being tested. By leveraging the power of GPUs, the computations were distributed effectively, resulting in improved performance.

Accelerating Execution with GPUs

Even with optimizations like vectorization and .apply refactoring, the execution time remains impractical for real-world applications. However, with GPU optimization, the runtime is drastically reduced, bringing execution times down from hours to minutes. This remarkable acceleration opens up possibilities for using the algorithm in scenarios where prompt results are essential.

Following the lessons learnt from the CPU implementation, we attempted to further optimize our implementation. For this, we moved up the layer to CuDF Dataframes. Vectorizing calculations onto the GPU is a breeze with CuDF. For us, it was as simple as changing import pandas to import CuDF (you must vectorize properly in pandas.)

train_df["sum_diffs"] = 0
train_df["sum_diffs"] = train_df[diff_cols].sum(axis=1).values
train_df["d"] = train_df["sum_diffs"] ** 0.5
train_df["frac"] = 1 / (1 + train_df["d"]) ** kappa
train_df["part"] = train_df[target_col] * train_df["frac"]
test_df.loc[index, "pred"] = train_df["part"].sum() / train_df["frac"].sum()

Further down our rabbit hole we need to rely on Numba kernels. At this point, things get tricky. Recall why the algorithm’s predictions are robust because each prediction uses all the rows in the training dataframe. However, the Numba kernels don’t support passing CuDF dataframes. Right now the we are experimenting with some tricks suggested on Github to handle this case. (https://github.com/rapidsai/cudf/issues/13375)

For now, we can at least pass of the raw compute to a numba kernel via the .apply_rows

def predict_kernel(F, T, numer, denom, kappa):
    for i, (x, t) in enumerate(zip(F, T)):
        d = abs(x - t)  # the distance measure
        w = 1 / pow(d, kappa)  # parameterize non-linear scaling
        numer[i] = w
        denom[i] = d


_tdf = train_df[[att, target_col]].apply_rows(
    predict_kernel,
    incols={att: "F", "G3": "T"},
    outcols={"numer": np.float64, "denom": np.float64},
    kwargs={"kappa": kappa},
)

p = _tdf["numer"].sum() / _tdf["denom"].sum()  # prediction - weighted average

At this point, we did not eliminate all for loops, but simply pushing most of the number crunching to Numba reduced the CuDf runtime > 50% landing us in around the 2 to 4 seconds for the standard 80-20 train-test split.

Wrap up

It has been an exhilarating and delightful journey exploring the capabilities of the rapids, cupy, and cudf libraries for various machine learning tasks. These libraries have proven to be user-friendly and easily understandable, making it accessible to most users. The design and maintenance of these libraries are commendable, allowing users to dive deep into the intricacies when necessary. In just a few hours a day over the course of a week, we were able to progress from being novices to pushing the boundaries of the library by implementing a highly customized prediction algorithm. Our next aim is to achieve unprecedented speed, aiming to break the micro-second barrier with large datasets ranging from 20K to 30K. Once this milestone is reached, we plan to release the algorithm as a pip package powered by rapids, making it available for wider adoption and usage.

Kris Manohar is a executive director at ICPC, Trinidad and Tobago.