Practical Lessons from Predicting Clicks on Ads at Facebook

5 minute read


Want to share what I learned, feeled after I read and study the paper. Thesedays, usually study with machine learning, so will start with that category first.

The first paper to study is ‘Practical Lessons from Predicting Clicks on Ads at Facebook’ which has issued from Facebook team for predicting clicks on Ads at Facebook.

Let’s start the journey.

0. Introduction

According to the paper, over 750 million daily active users and over 1 million active advertisers are being predicted to click on Facebook ads. Authors has improved the click prediction rate by combining decision trees with logistic regression and they described this outperformed either of these mathods on its own by over 3%.

1. Expereiment condition

1-1) Data

Used offline data selected from an arbitrary week of the 2013, 4Q Then partitioned into training and test set

1-2) Evaluation metrics

Uses accuracy of prediction instead of profit-revenue related metrics. In this work, uses NE and calibration as major evaluation metric.

NE (Normalized Entropy, Normalized Cross Entropy)

\(NE=\frac{-\frac{1}{N}\sum_{i=1}^{n}\left ( \frac{1+y_{i}}{2} \log \left ( p_{i} \right ) + \frac{1-y_{i}}{2} \log \left ( 1-p_{i} \right )\right )}{-(p*\log(p)+(1-p)*log(1-p))}\)

NE is the predictive log loss normalized by the entropy of the background CTR then it can be set as above, when training set has N examples with labels \(y_{i}\in\left \{-1,1 \right \}\) and estimated probability of click \(p_{i}\). (\(i=1,2,3,...N\)) So as prediction accuracy is higer (as number of \(y_{i}\) is larger), NE will be lower.

Calibration (Normalized Entropy, Normalized Cross Entropy)

Calibration is the ratio of the average estimated CTR and empirical CTR. In other words, it is the ratio of the number of expected clicks to the number of actually observed clicks as below. \(\frac{number \thinspace of \thinspace actually \thinspace observed \thinspace clicks}{number \thinspace of \thinspace expected \thinspace clicks}\)


2-1) Model structure

Suggests the concatenation of boosted decision trees and of a probabilistic sparse linear classifier. For training, used SGD based linear regression and *BOPR (Bayesian online learning scheme for probit regression) To imporve the accuracy of linear classifier, there are 2 ways. 1 : Treat continuous features as categorical feature 2 : Build tuple input features like crating a new categorical feature that taking all possible values with Cartesian product. (Of course useless combinations can be pruned out)

And boosted decision trees are a powerful and very convenient way to implement non-linear and tuple transformations described above. When you see the tree above, and assume each node has binary output like 0,1,0,1,0. Then linear classifier has binary vector input as [0,1,0,1,0]. The NE comparison table shows combination of decision tree and LR decreases NE by more than 3.4% relative to the model with no tree transforms.

2-2) Data freshness

Experiment shows data freshness effects the prediction accuracy. To maximize the data freshness, the linear online classifier is one option to train directly as the labelled Ads impressions arrive. From experiments, it shows “Per-coordinate learning rate” \(\eta_{t,i}=\frac{\alpha }{\beta+\sqrt{\sum_{j=1}^{t}\bigtriangledown_{j,i}^{2}}}\) has the best prediction rate for SGD-based online learning of logistic regression, compared to “per-weight square root”,”per-weight”,”global” or “constant” rate.

Trained the data set and compared the prediction accuracy between LR and BOPR. Relative to LR, BOPR has a liitle bit lower NE (99.82% / 100%). One advantage of LR is smaller model size because there is only a weight associated to each feature. One advantage of BOPR is that being a Bayesian formulation, it provides a full predictive distribution over the probability of click. This can be used to compute percentiles of the predictive distribution for explore/exploit learning schemes.

3. Implementation in real world

3-1) Number of Boosting tree

Proceed experiments with tree number 1 ~ 2000. As the number of trees increases NE decreases, but the gain from adding trees yields diminishing return as the graph of -log(x).

3-2) Boosting features

Usually, small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution. And historical features provide considerably more explanatory power than contextual features. After ordering features by importance, top 10 features are all historical features and there are only 2 features among the top 20 features.

The value of contextual features depends exclusively on current information regarding the context in which an ad is to be shown, such as the device used by the users or the current page that the user is on.
On the contrary, the historical features depend on previous interaction for the ad or user, for example the click through rate of the ad in last week, or the average click through rate of the user

3-3) Uniform subsampling, Negative downsampling, Model Re-calibration

To reduce the cost of training, uniform subsampling is considered. To improve class imbalance between +1 and -1, negative downsampling can improve the performance After downsampling with negative label, it needs to re-calibrate the model because negative downsampling effects the performance as well. (For example, if the average CTR before samling is 0.1% and we do a 0.01 negative downsampling, the empirical CTR will become roughly 10%)