# deep neural networks for youtube recommendations

Published:

This is summary and what I learned from the paper “Deep neural networks for youtube recommendations”

### 0. Introduction

As YouTube recommendations are responsible for helping more than a billion users discover personalized content, this paper will cover how deep learning has affected to YouTube video recommendations system. It is challenging from 3 major perspectives: Scale from massive user base and corpus, Freshness that many hours of video are uploaded per second, and Noise from sparsity and a variety of unobservable external factors.

In following section, this paper will deal with system overview, model description with experimental results.

### 1. System Overview

As printed above, the system is comprised of 2 neural networks : one for candidate generation, one for ranking. Candidate generation networks take users’ activity history as input and provides broad personalization. Ranking networks take a rich set of features describing the video and user as input then present a few “best” recommendations in a list. This 2-stage approach has some advantages, one is to make small number of personalized recommendation list appearing on the device from a very large corpus of videos and second is to enable blending candidates generated from different sources. For iterative improvements, has used offline metrics like precision, recall and ranking loss. And for the final determination of model effectiveness, used live A/B testing to measure subtle changes in CTR, watch time, etc. Let’s see from Candidate generate networks.

### 2. Candidate generation

With candidate generation networks, YouTube corpus is winnowed to hundreds of user-relevant videos.

### 2-1) Recommendation as Classification

Pose recommendation as extreme multiclass classification, so it can have below formula which means the probability of watching video class $i$$at time $t$$ ($w_t$$) of user $U$$ and context $C$$, while V is a corpus. $P(w_t=i|U,C)=\frac{e^{v_iu}}{\sum_{j\in V^{e^{v_ju}}}}$ here, $u\in\mathbb{R}^n$$ represents a high-dimensional set of user, context pair and $v_j\in\mathbb{R}$$represent embedding of each candidate videos. The task of DNN is to learn $u$$ as a function of user’s history and context.

To efficiently train with millions of classes, used “candidate sampling” that samples negative classes and correct this sampling with importance weighting. At serving time, it requires to choose top N videos to recommend to the user, so scoring millions of items should be provided to the users during a tens of milliseconds.

### 2-2) Model architectue

Feed High dimensional embeddings for each video in a fixed vocabulary into neural network. A user’s watch history is represented by a variable-length sequence of sparse video IDs which is mapped to a dense vector representation via this high dimensional embeddings. And the average of embeddings performed best when has used as inputs to network then the embeddings are learned through normal gradient descent back propagation updates.

As the figure above, features are concatenated into a wide first layer, follwed by several layers of ReLU. As a key advantage of DNN, arbitrary continuous and categorical features can be eaily added to the model. Simiarly to watch history, search history is tokenized, embeded, averaged then represent a dense search history. In case of new users, demographic features are important for resonable recommendations so user’s geographic region and device are concatenated. (Purple color of the figure above). And as the yellow part of the image, simple features like gender, logged-in state and age are input directly into the network.

As machine learning systems are trained to predict future from historical examples, it exhibit bias towards the past videos. To correct this bias, feed the age of the training example as a feature during training and at serving time, this is** set to zero**.

Recommendation often solves a surrogate problem and transfers the result to to a particular context. For example, predicting ratings of movie can lead to effective movie recommendations. With experiments, belows have extracted.

• Episodic series are usually watched sequentially
• Users often discover from the beginning with the popular then focusing on smaller niches. So found much better performance predicting the uses’s next watch, rather than predicting a randomly held-out watch.

### 2-3) Label, Context selection.

While many systems choose the labels and context by holding out random item and predict it from user’s history. But it leaks future information and ignores asymmetric comsumption patterns described above. So rollback a user’s history by choosing a random watch and only input actions the user took before the held-out label watch.

In experiments, as features and depth are added, it improves precision.

### 3-1) Network overview

Ranking layer specializes and calibrate predictions for the particular user interface like thumbnail. Ranking network has only a few hundred videos being scored while millions scored in candidate generation, so have access to many more features describing video and user-video relationship. Ranking network is similar DNN as candidate generation that scores to each video impression using logistic regression. Final ranking is constantly changed from A/B testing.

### 3-2) Feature engineering

There are several classification of features.

1. Continuous / Categorical
2. Univalent (i.e video ID) / Multivalent (i.e list of video IDs user watched)
3. Impression (: describe properties of item, computed for each item scored) / Query (: describe properties of user/context, computed for once)

And there are 3 important things in feature engineering.

1. Features describing** past user actions on related items** are powerful. For example, the feature like “When was the last time the user watched a video on this topic” describes user’s action well.
2. Propagate information from candidate generation into ranking in the form of features are important like ‘which sorces nominated this video candidate?’
3. Features describing the frequency of past impressions are important for introducing “churn” in recommendation For ecample, if a user was recommended a video but didn’t watch, then the model will degrade the priority of this video

Also, likely to candidate generation, used embeddings to map categorical features to dense representations. Each unique ID space, so called “vocabulary” has a separate learned embeddings and these vocabularies are simple look-up tables. When vocabulary is with very large cardinality (like video IDs), then it just includes top N after sorting. For categorical features in same ID space (such as video ID of impression, last video ID watched by user, etc.) shares underlying embeddings. Even though each features is fed separately for learning specialized representations, it can speed up training and reduce memory requirements.

Proper normalization of continuous features was critical for convergence. Feature $x$$with distribution $f$$ is transformed to $\widetilde{x}$$by scaling which is distributed in $[0,1)$$. And powers of this ($\widetilde{x}^2$$) has improved offline accuracy. ### 3-3) Objective and experiments Goal is to predict expected watch time with given training set of positive (video impression was clicked) and negative. To predict this time, used weighted logistic regriession which positive impressions get weight by watch time and negative impressions get unit weight. Then, the odds by logistic regression are $\frac{\sum(T_i)}{N-k}$$. ($T_i$$: watch time of $i$$th impression, $k$$: number of positive impressions, $N$$ : number of training data) If positive data is small, then odds are approximately $E[T](1+P)$$, where $E[T]$$ is expected watch time and $P$$is click probability. And since $P$$ is small, odds is close to $E[T]$\$.

With different hidden layer configurations experiements, increasing the width of hidden layers improves results as increasing the depth.

### 4. Conclusion

1. Using age of training example as an input feature removes bias towards the past videos.
2. Deep learning adjusted to Ranking has outperformed linear and tree-based method.
3. Logistic regression was modified by weighting trainig examples with positive and negative examples. And with this approach, watch-time weighted ranking evaluation is better than predicting CTR directly

Categories: