Innovations in Deep Learning for Recommender Systems
Chapter 1: The Rapid Evolution of Recommender Systems
Recommender systems are evolving rapidly within industrial machine learning. This trend is unsurprising from a business perspective: better recommendations lead to more user engagement. The underlying technology, however, is quite intricate. With the advent of deep learning, fueled by the accessibility of GPUs, the complexity of recommender systems has surged.
In this article, we will explore several significant modeling advancements from the past decade, retracing the crucial milestones in the rise of deep learning within recommender systems. This narrative showcases technological innovations, scientific inquiries, and a global race among various organizations.
Our journey begins in 2017 in Singapore.
Section 1.1: Neural Collaborative Filtering (NCF)
No discussion of deep learning in recommender systems would be complete without highlighting one of the field's pivotal developments: Neural Collaborative Filtering (NCF), presented by He et al. (2017) from the National University of Singapore.
Before NCF, matrix factorization was the prevailing method, where latent vectors (or embeddings) for users and items were learned to generate recommendations. The dot product between the user and item vectors indicated the strength of the predicted match. This approach essentially functions as a linear model of latent factors.
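As a toy illustration of that linearity, here is how a matrix factorization model scores a single user/item pair; the latent values below are made up for the example:

```python
import numpy as np

# Toy matrix factorization scoring: the predicted match strength is
# just the dot product of the learned latent vectors (values made up).
user_vec = np.array([0.9, 0.1, 0.4])   # latent factors for one user
item_vec = np.array([0.8, 0.0, 0.3])   # latent factors for one item
score = user_vec @ item_vec            # linear in the latent factors
print(score)  # 0.84
```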
The revolutionary concept behind NCF is substituting the inner product in matrix factorization with a neural network. Practically, this involves concatenating user and item embeddings and feeding them into a multi-layer perceptron (MLP) that predicts user engagement, such as clicks. Both the MLP weights and embedding weights are adjusted during model training through backpropagation.
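To make the architecture concrete, here is a minimal sketch of the NCF idea in PyTorch; the layer sizes and names are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Concatenate user and item embeddings and let an MLP,
    rather than a dot product, model the interaction."""
    def __init__(self, n_users: int, n_items: int, emb_dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, user_ids, item_ids):
        x = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.mlp(x))  # predicted engagement probability
```

Both the embedding tables and the MLP weights are trained end-to-end by backpropagation, as described above.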
The hypothesis underpinning NCF is that user/item interactions are non-linear, contrary to the linear assumption of matrix factorization. He et al. demonstrated that adding layers to the MLP enhances performance. With four layers, they outperformed the best matrix factorization techniques of the time by approximately 5% in hit rate on benchmark datasets such as MovieLens and Pinterest.
This research marked a significant shift towards deep learning in recommender systems.
This video titled "Deep Learning for Recommender Systems" by Nick Pentreath provides an insightful overview of the advancements in this field.
Section 1.2: Wide & Deep Learning
Our exploration continues from Singapore to Mountain View, California. While NCF revolutionized recommender systems, it overlooked a critical element: cross features. The concept of cross features was highlighted in Google's 2016 paper, "Wide & Deep Learning for Recommender Systems."
Cross features are second-order features created by combining two original features. For instance, in the Google Play Store, original features could include the impressed app and the list of user-installed apps. Combining these can yield powerful cross features, such as:
AND(user_installed_app='netflix', impression_app='hulu')
This is true if the user has Netflix installed and the impressed app is Hulu. The authors argue that incorporating cross features of varying granularity facilitates both memorization (from detailed crosses) and generalization (from broader crosses).
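A hand-engineered cross feature like this amounts to a single boolean rule. A minimal sketch, with hypothetical feature names borrowed from the Google Play example:

```python
# Hypothetical hand-engineered cross feature from the Google Play example:
# AND(user_installed_app='netflix', impression_app='hulu')
def netflix_hulu_cross(user_installed_apps: set, impression_app: str) -> int:
    return int("netflix" in user_installed_apps and impression_app == "hulu")

print(netflix_hulu_cross({"netflix", "spotify"}, "hulu"))  # 1
```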
The Wide&Deep architecture consists of a wide module, a linear layer that directly inputs cross features, and a deep module, akin to NCF. Both modules are combined into a unified output task head to learn from user/app interactions.
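Below is a minimal sketch of that combination in PyTorch, assuming the cross features arrive as precomputed binary indicators; module sizes are illustrative:

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """A wide linear layer over precomputed cross features plus a deep
    MLP over concatenated embeddings; both feed one output head."""
    def __init__(self, n_cross: int, n_users: int, n_items: int, emb_dim: int = 16):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.wide = nn.Linear(n_cross, 1)            # memorization
        self.deep = nn.Sequential(                   # generalization
            nn.Linear(2 * emb_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cross_feats, user_ids, item_ids):
        deep_in = torch.cat([self.user_emb(user_ids), self.item_emb(item_ids)], dim=-1)
        return torch.sigmoid(self.wide(cross_feats) + self.deep(deep_in))
```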
The results were compelling: moving from a deep-only model to a wide and deep approach resulted in a 1% increase in online app acquisitions. Given Google's significant revenue from its Play Store, the impact of Wide&Deep is evident.
In the video "Deep Learning for Personalized Search and Recommender Systems Part 1," the significance of cross features is further elaborated.
Section 1.3: Deep & Cross Network (DCN)
While Wide&Deep demonstrated the importance of cross features, it necessitated manual feature engineering, a labor-intensive process requiring significant resources and expertise. This is where the Deep & Cross Network (DCN), introduced in a 2017 Google paper, comes into play.
DCN replaces the wide component of Wide&Deep with a cross neural network designed to learn cross features of any order. Unlike a standard MLP, where each neuron in the next layer is a linear combination of the previous layer's neurons, each cross layer forms second-order combinations of the network's input with the previous layer's output.
As a result, a cross neural network of depth L learns cross features represented as polynomials of degrees up to L. Experiments confirmed that DCN outperformed a model with only the deep component, achieving a statistically significant 0.1% lower log loss on the Criteo display ads benchmark dataset—all without manual feature engineering.
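Here is a minimal sketch of one cross layer, following the vector form x_next = x0 * (x . w) + b + x from the DCN paper, where x0 is the network input and x the previous layer's output:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN cross layer: x_next = x0 * (x . w) + b + x.
    (x . w) is a scalar per example, so x0 is explicitly crossed with x."""
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(dim))

    def forward(self, x0, x):
        return x0 * (x @ self.w).unsqueeze(-1) + self.b + x

# Stacking layers raises the polynomial degree of the learned crosses:
x0 = torch.randn(4, 8)   # a batch of embedded feature vectors
x = x0
for layer in [CrossLayer(8), CrossLayer(8)]:
    x = layer(x0, x)
```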
Section 1.4: DeepFM
Next, our journey takes us to Huawei in 2017, where the "DeepFM" architecture was introduced. DeepFM similarly replaces manual feature engineering in the wide component of Wide&Deep with a dedicated neural network for learning cross features. However, instead of a cross neural network, it employs a factorization machine (FM) layer.
The FM layer calculates dot products for all pairs of embeddings. For example, if a movie recommender uses four ID features, such as user ID, movie ID, actor IDs, and director ID, the model learns embeddings for each and computes six dot products for various combinations. This approach revives the concept of matrix factorization, combining the FM layer's output with the deep component to generate predictions.
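A minimal sketch of that FM-style interaction for the four example features, with random placeholder embeddings:

```python
import itertools
import torch

# One embedding per ID feature; in DeepFM these same embeddings
# also feed the deep component.
embs = {f: torch.randn(8) for f in ["user", "movie", "actors", "director"]}

# FM layer: a dot product for every pair of feature embeddings.
pairwise = [torch.dot(embs[a], embs[b])
            for a, b in itertools.combinations(embs, 2)]
print(len(pairwise))  # 6 interactions for 4 features
```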
DeepFM has proven effective, outperforming several competitors, including Google's Wide&Deep, by over 0.37% and 0.42% in AUC and log loss, respectively, on internal datasets.
Chapter 2: Recent Advancements in Recommender Systems
As we progress, we shift our focus to Meta's DLRM ("Deep Learning Recommendation Model") architecture, introduced in 2019. DLRM transforms all categorical features into embeddings using embedding tables, while dense features are processed through an MLP to produce an embedding of their own. All embeddings share the same dimension, and the model computes dot products of all embedding pairs, concatenating them into a single vector for the final MLP, which outputs predictions.
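A minimal sketch of DLRM's interaction step, assuming five equal-size embeddings (one of them being the MLP output for the dense features):

```python
import torch

embs = torch.randn(5, 16)                 # (num_features, emb_dim)
dots = embs @ embs.T                      # all pairwise dot products
iu = torch.triu_indices(5, 5, offset=1)   # keep each unordered pair once
interactions = dots[iu[0], iu[1]]         # 10 values, fed to the final MLP
```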
DLRM resembles a simplified version of DeepFM, focusing primarily on feature interactions captured through dot products. Research indicates that DLRM outperformed DCN in training and validation accuracy on the Criteo display ads dataset, suggesting that the deep component in DCN may be superfluous for optimal recommendations.
However, DLRM's feature interactions are limited to second-order interactions. To address this limitation, the final milestone in our exploration is DHEN, or "Deep Hierarchical Ensemble Network." DHEN's innovative approach creates a hierarchy of cross features that expand with the number of layers.
For example, if two input features (A and B) enter DHEN, a two-layer module would generate a hierarchy of cross features up to the second order, including interactions like:
A, A x A, A x B, B, B x B
where "x" can represent various interactions such as dot products or self-attention.
DHEN's complexity is considerable, and training it efficiently required a new distributed training paradigm, termed "Hybrid Sharded Data Parallel," which the authors report achieves significantly higher training throughput.
In conclusion, our exploration highlights the critical evolution of recommender systems. Each landmark represents a significant advancement:
- NCF: Replacing the matrix factorization dot product with an MLP over user and item embeddings.
- Wide&Deep: The importance of directly incorporating cross features into the task head.
- DCN: The need for automated cross feature learning, moving beyond manual engineering.
- DeepFM: Integrating cross features through FM layers while retaining deep components.
- DLRM: Emphasizing feature interactions through dot products, potentially minimizing deep components.
- DHEN: Expanding the hierarchy of feature interactions beyond second order, along with optimization strategies.
The journey of innovation in this field is far from over. As of this writing, DCN has progressed to DCN-M, and DeepFM has evolved into xDeepFM. The competitive landscape continues to evolve, with Huawei's FinalMLP currently leading the Criteo benchmark leaderboard. Given the substantial economic incentives for enhancing recommendations, we can expect ongoing breakthroughs in the future. Stay tuned for more developments.