https://research.google.com/pubs/archive/45530.pdf
This paper describes the system used for YouTube’s personalized recommendations. It’s a somewhat typical “experience” paper, but notable in a few ways:
- The scale is large: obviously YouTube has a huge number of users, and they are searching through a large recommendation space (~millions of videos). They need to have high-performance models as a result, but they also have enough data to effectively train models with hundreds of millions of parameters.
- They don’t use the now very popular recurrent models, instead relying on a standard feed-forward network with extensive feature engineering.
- They don’t use an explicit matrix factorization model, which is quite common for this task. There is an implicit factorization in the form of the embeddings learned as part of the network.
- They find more evidence that hierarchical softmax isn’t as effective as negative sampling.
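To make the last point a bit more concrete, here is a minimal sketch of a softmax loss computed over the true video plus a handful of sampled negatives, rather than over the full catalog. The function names, the uniform sampling, and the toy vectors are all assumptions for illustration; the sketch also leaves out any correction for the sampling distribution, so it should not be read as the paper's exact training objective.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sampled_softmax_loss(user_vec, video_vecs, positive_id, num_negatives, rng):
    """Cross-entropy over the positive video and a few uniformly sampled negatives."""
    pool = [i for i in range(len(video_vecs)) if i != positive_id]
    negatives = rng.sample(pool, num_negatives)  # assumption: uniform sampling
    ids = [positive_id] + negatives
    logits = [dot(user_vec, video_vecs[i]) for i in ids]
    # Numerically stable log-sum-exp over the sampled classes only.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # negative log-probability of the positive

# Toy data: five hypothetical videos with 2-dimensional embeddings.
rng = random.Random(0)
video_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0], [0.0, -1.0]]
loss = sampled_softmax_loss([1.0, 0.2], video_vecs, positive_id=0,
                            num_negatives=3, rng=rng)
```

The cost of each training step scales with the number of sampled negatives instead of the catalog size, which is the main appeal at this scale.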
The overall model setup is refreshingly straightforward: the input features are concatenated into a single large vector, which is passed through several fully connected layers with ReLUs before a prediction is made. Two different models are used. A “generative” model (not the standard usage of the term; the paper calls this stage candidate generation) builds a lookup vector used to quickly identify related videos. A second, more expensive model then computes an individual score for each of those videos.
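As a rough illustration of that setup, the sketch below runs a flat input vector through a stack of fully connected layers with ReLU activations. The layer sizes, the random initialization, and the helper names are hypothetical; nothing here reflects the actual dimensions or training procedure from the paper.

```python
import random

def relu(xs):
    return [max(0.0, x) for x in xs]

def dense(xs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, xs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(xs, layers):
    """Apply each dense layer followed by a ReLU."""
    for weights, biases in layers:
        xs = relu(dense(xs, weights, biases))
    return xs

rng = random.Random(0)
layer_sizes = [9, 8, 4]  # hypothetical: 9 input features, then 8 and 4 units
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]
hidden = forward([1.0] * 9, layers)
```

The final hidden vector would then feed whatever output layer the task needs, for example a softmax over videos or a single score.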
The input features are roughly what you’d expect:
- Video embedding: an embedding learned from a user’s watch history. To compact the embeddings into a fixed shape, they take the average over the input sequence.
- Search embedding: an average of the embeddings of a user’s search queries.
- Language embedding: not described well? Some combination of a language model for the user (searches?) and that of the video.
- Statistics on the last time the user watched, or was recommended/shown, the video (e.g. last watch time, square(last watch time), sqrt(last watch time)).
- Population features: age, gender, geography.
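The feature list above can be sketched as a small function that averages the variable-length embedding sequences, expands a scalar into its square and square root, and concatenates everything into one flat vector. All names, the zero-vector fallback for an empty history, and the toy values are assumptions made for illustration.

```python
import math

def average_embeddings(vectors, dim):
    """Average a variable-length list of embeddings into one fixed-size vector."""
    if not vectors:
        return [0.0] * dim  # assumption: an empty history maps to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def expand_scalar(x):
    """Return x together with its square and its square root."""
    return [x, x * x, math.sqrt(x)]

def build_input(watch_vecs, search_vecs, last_watch, population, dim):
    """Concatenate all feature groups into a single flat input vector."""
    return (average_embeddings(watch_vecs, dim)
            + average_embeddings(search_vecs, dim)
            + expand_scalar(last_watch)
            + list(population))

features = build_input(
    watch_vecs=[[1.0, 2.0], [3.0, 4.0]],  # two watched videos
    search_vecs=[[0.0, 1.0]],             # one search query
    last_watch=4.0,                       # hypothetical "time since last watch"
    population=[0.5, 1.0],                # hypothetical encoded age and gender
    dim=2,
)
# features == [2.0, 3.0, 0.0, 1.0, 4.0, 16.0, 2.0, 0.5, 1.0]
```

Supplying the square and square root alongside the raw value gives the network easy access to simple non-linear transformations of the feature.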
The generative model uses the video and search embeddings, along with the population features, to learn an embedding for each video. This embedding can then be used with an approximate nearest neighbors lookup to select candidate videos to recommend.
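The lookup step can be sketched as follows. This version does an exact, brute-force top-k search by dot product, standing in for the approximate nearest neighbors index a real system would need at this scale. The query vector, the video IDs, and the choice of dot product as the similarity measure are assumptions for the example.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_candidates(query, video_vectors, k):
    """Exact top-k by dot product; a stand-in for an approximate index."""
    scored = ((dot(query, vec), vid) for vid, vec in video_vectors.items())
    return [vid for _, vid in heapq.nlargest(k, scored)]

# Hypothetical learned embeddings for three videos.
video_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
candidates = top_k_candidates([1.0, 0.2], video_vectors, k=2)
# candidates == ["a", "c"]
```

The brute-force scan is linear in the number of videos, which is exactly why an approximate index becomes necessary once the catalog reaches millions of items.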
The second model takes each of the videos from the generative model and produces a score used to order them.
Overall, I enjoyed reading the paper. I would have liked the description of the features to be a bit more organized, but the approach used was refreshing in its pragmatism. Seeing an effective use of deep learning at such large scale is interesting as well.