Wednesday, July 18, 2018

[Notes] Precalculated Values

Example: [Leetcode] 357. Count Numbers with Unique Digits: https://leetcode.com/problems/count-numbers-with-unique-digits/description/


Statement: Given a non-negative integer n, count all numbers with unique digits, x, where 0 ≤ x < 10^n.

class Solution {
    // Answers for n = 0..10. For n > 10 the count stops growing, because a
    // number with more than 10 digits must repeat one of the 10 distinct
    // digits, so the index is clamped at 10.
    private int[] precalculated = new int[] { 
        1, 10, 91, 739, 5275, 32491, 168571, 712891, 2345851, 5611771, 8877691
    };

    public int countNumbersWithUniqueDigits(int n) {
        return precalculated[Math.min(n, 10)];
    }
}
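
For reference, here is a minimal sketch of how the table above could be regenerated; the class name and the printing are my own, but the recurrence (9 choices for the leading digit, then 9, 8, ... for the rest) is the standard count:

import java.util.Arrays;

class PrecalcGenerator {
    public static void main(String[] args) {
        int[] table = new int[11];
        table[0] = 1;  // only the number 0 lies in [0, 10^0)
        for (int n = 1; n <= 10; n++) {
            // Numbers with exactly n digits, all unique: 9 choices for the
            // leading digit (1-9), then 9, 8, ... for the remaining digits.
            int exactlyN = 9;
            for (int k = 0; k < n - 1; k++) {
                exactlyN *= 9 - k;
            }
            table[n] = table[n - 1] + exactlyN;
        }
        // Prints [1, 10, 91, 739, 5275, 32491, 168571, 712891, 2345851, 5611771, 8877691]
        System.out.println(Arrays.toString(table));
    }
}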

Sunday, July 8, 2018

[Notes] Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph

Paper: https://static.googleusercontent.com/media/research.google.com/zh-TW//pubs/archive/34407.pdf


ABSTRACT:
  • Adsorption provides a simple method to efficiently propagate preference information through a variety of graphs.

INTRODUCTION:
  • The goal is to create a personalized page of video recommendations that not only shows the latest and most popular videos, but also provides users with recommendations tailored to their viewing habits.
  • The Video Co-View Graph:
    • Video-video co-view graph
    • User-video co-view graph
    • We attempt to use the broadest definition of co-view whenever possible.

THE ADSORPTION ALGORITHM
  • The genesis of the family of algorithms that we will collectively call adsorption is the following question: assuming we wish to classify a node in a graph in terms of (class) labels present on some of the other nodes, what is a principled way to do it?
  • The idea of recommending videos commonly co-viewed with a user’s watched videos can nevertheless be meaningfully abstracted as finding labeled nodes that have multiple short paths from the user node.
  • Thus our desiderata for our recommendation system include the ideas that a video v is relevant to a user u if: 
      • u and v have a short path between them;
      • u and v have several paths between them;
      • u and v have paths that avoid high-degree nodes.
  • Three equivalent views of the algorithm: averaging, random walks, and linear systems.
  • Adsorption via Averaging
    • Each node has two roles, forwarding labels and collecting labels. 
    • In the “full-information” model, let us imagine that each node keeps track of the history of all labels it receives.
    • Formal description:
      • G = (V, E, w), where V denotes the set of vertices (nodes), E denotes the set of edges, and w : E → R denotes a nonnegative weight function on the edges. 
      • Let L denote a set of labels, and assume that each node v in a subset V_L ⊆ V carries a probability distribution L_v on the label set L. 
        • We often refer to V_L as the set of labeled nodes. 
      • For the sake of exposition, we will introduce a pre-processing step, where for each vertex v ∈ V_L, we create a “shadow” vertex v~ with exactly one out-neighbor, namely v, connected via an edge (v~, v) of weight 1; furthermore, for each v ∈ V_L, we will re-locate the label distribution L_v from v to v~, leaving v with no label distribution.
      • Let V~ denote the set of shadow vertices, V~ = {v~ | v ∈ V_L}. 
      • From now on, we will assume that at the beginning of the algorithm, only vertices in V~ have non-vacuous label distributions.
    • Algorithm.
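
The averaging view admits a minimal one-iteration sketch; the map-based graph representation and all names below are my own assumptions, not the paper's:

import java.util.HashMap;
import java.util.Map;

class AdsorptionAveraging {
    // inWeights.get(v).get(u) = w(u, v), the weight of the edge from u into v.
    // labels.get(v) = L_v as a map from label to probability; shadow vertices
    // keep their fixed distributions and carry no entry in inWeights.
    static Map<String, Map<String, Double>> averageOnce(
            Map<String, Map<String, Double>> inWeights,
            Map<String, Map<String, Double>> labels) {
        Map<String, Map<String, Double>> next = new HashMap<>(labels);
        for (Map.Entry<String, Map<String, Double>> vEntry : inWeights.entrySet()) {
            Map<String, Double> in = vEntry.getValue();
            double total = in.values().stream().mapToDouble(Double::doubleValue).sum();
            Map<String, Double> lv = new HashMap<>();
            for (Map.Entry<String, Double> e : in.entrySet()) {
                double nvU = e.getValue() / total;  // N_v(u)
                Map<String, Double> lu = labels.getOrDefault(e.getKey(), Map.of());
                for (Map.Entry<String, Double> l : lu.entrySet()) {
                    lv.merge(l.getKey(), nvU * l.getValue(), Double::sum);
                }
            }
            next.put(vEntry.getKey(), lv);  // L_v = Σ_u N_v(u) L_u
        }
        return next;  // iterate until the label distributions converge
    }
}
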
  • Adsorption via Random Walks
    • For a vertex v ∈ V, denote by N_v the probability distribution on the set of neighbors of v described by N_v(u) = w(u, v)/(Σ_u' w(u', v)), that is, the probability of u is proportional to the weight on the edge (u, v). 
    • The label distribution of a vertex v is simply a convex combination of the label distributions at its neighbors, that is, L_v = Σ_u N_v(u) L_u.
    • Therefore, if we pick an in-neighbor u of v at random according to N_v and sample a label according to the distribution L_u, then for each label l ∈ L, L_v(l) is precisely equal to Exp_u [L_u(l)], where the expectation arises from the process of picking a neighbor u according to N_v. 
    • Extending this to neighbors at distance 2, it is easy to see that for each label l ∈ L, L_v(l) = Exp_w Exp_u [L_w(l)], where an in-neighbor u of v is chosen according to N_v and an in-neighbor w of u is chosen according to N_u. Expanding this out, we see that L_v(l) = Σ_w Σ_u N_v(u) N_u(w) L_w(l).
    • N_v(u) is the probability of reaching u from v in one step of a random walk starting from v and picking a neighbor according to N_v, and similarly, N_u(w) is the probability of picking a neighbor w of u according to N_u. 
    • Notice also the crucial use of the Markov property (memorylessness) here, that is, conditioned on the random walk having reached u, the only information used in picking w is N_u, which depends only on u, and not on where we initiated the random walk from.
    • Algorithm.
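
The random-walk view admits a direct Monte Carlo sketch: start at v, repeatedly step to an in-neighbor drawn from N, and emit a label once a shadow vertex is reached; estimating L_v(l) amounts to repeating this and taking the empirical frequency of l. The data structures mirror the averaging sketch above and are my own assumptions; the sketch also assumes every walk eventually reaches a shadow vertex:

import java.util.Map;
import java.util.Random;

class AdsorptionRandomWalk {
    // inNeighbors.get(v) = N_v as a map from in-neighbor u to probability N_v(u);
    // shadowLabels.get(s) = the fixed label distribution at shadow vertex s.
    static String sampleLabel(String v,
                              Map<String, Map<String, Double>> inNeighbors,
                              Map<String, Map<String, Double>> shadowLabels,
                              Random rng) {
        String cur = v;
        while (!shadowLabels.containsKey(cur)) {
            cur = sample(inNeighbors.get(cur), rng);  // one memoryless step
        }
        return sample(shadowLabels.get(cur), rng);    // emit a label from the shadow vertex
    }

    // Draw a key from a discrete distribution given as key -> probability.
    static String sample(Map<String, Double> dist, Random rng) {
        double r = rng.nextDouble();
        double acc = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : dist.entrySet()) {
            acc += e.getValue();
            last = e.getKey();
            if (r < acc) {
                return last;
            }
        }
        return last;  // guard against floating-point round-off
    }
}
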
  • Adsorption via Linear Systems
    • Requiring L_v = Σ_u N_v(u) L_u to hold at every vertex turns the previous view into a system of linear equations; solving it would give an embedding of the vertices of the graph into L_1 (since each convex combination can be written as a vector of unit L_1 norm in n dimensions, where n = |V|); this is a topic of active research in computer science.
    • The system of linear equations has a unique solution if and only if the underlying graph is connected.
    • Algorithm.
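
To make the linear-system view concrete, here is a toy sketch; the graph, labels, and numbers are entirely my own illustration. One shadow vertex s~ (carrying label l with probability 1) feeds v1, and v1 and v2 are mutually connected, giving two equations in the two unknowns L_v1(l) and L_v2(l):

class AdsorptionLinearSystem {
    // Solve A x = b by Gaussian elimination with partial pivoting.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            int piv = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[piv][col])) piv = r;
            }
            double[] rowTmp = a[col]; a[col] = a[piv]; a[piv] = rowTmp;
            double bTmp = b[col]; b[col] = b[piv]; b[piv] = bTmp;
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // v1's in-neighbors are s~ and v2 with equal weight: x1 = 0.5 * 1 + 0.5 * x2.
        // v2's only in-neighbor is v1:                       x2 = x1.
        // Rearranged into A x = b form:
        double[][] a = { { 1.0, -0.5 }, { -1.0, 1.0 } };
        double[] b = { 0.5, 0.0 };
        double[] x = solve(a, b);
        // With a single label and a connected graph, all mass absorbs: both are 1.0.
        System.out.printf("L_v1(l) = %.3f, L_v2(l) = %.3f%n", x[0], x[1]);
    }
}
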
  • Injection and Dummy Probabilities
    • The probability with which we enter the absorbing state v~ might be determined as a function of various features of v. Thus, this parameter, which we call the injection probability, is usually a crucial design choice in deploying a live recommendation system.
    • Instead of the standard random walk on an edge-weighted graph, one may consider a “hobbled walk”: at each step, with some probability, which we call the abandonment probability or the dummy probability, the algorithm abandons the walk without producing any output label.
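
A simplified sketch of how these per-node probabilities could modify each walk step; the outcome names are mine, and this sketch assumes whatever probability mass is left after injection and abandonment continues the walk (the variants tested below set the continuation probability explicitly):

import java.util.Random;

class HobbledWalkStep {
    enum Outcome { INJECT, ABANDON, CONTINUE }

    // With injectionProb, enter the absorbing state v~ and emit a label;
    // with dummyProb, abandon the walk, producing no output label;
    // otherwise, take a normal step of the random walk.
    static Outcome step(double injectionProb, double dummyProb, Random rng) {
        double r = rng.nextDouble();
        if (r < injectionProb) return Outcome.INJECT;
        if (r < injectionProb + dummyProb) return Outcome.ABANDON;
        return Outcome.CONTINUE;
    }
}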

EXPERIMENTAL SETUP
  • Algorithms and Variants Tested
    • The first reference algorithm is called GP, for global popularity. 
      • This algorithm first sorts all videos by their popularity during the training period. 
      • When we need recommendations for a user u, we proceed in order of this popularity measure and output the top videos that the user has not watched during the training period.
    • The second algorithm, called LP (for local popularity), takes into account the fact that what is globally popular is poorly personalized for many users. The idea behind this algorithm is another very commonly employed item-based-recommendation heuristic, and is based on co-views. 
      • For each video v, compute the list C(v) of videos z that it was co-viewed with, that is, users who watched v also watched z. 
      • Furthermore, for each video v and video z co-viewed with v, assign a score c(v, z) given by the number of users who co-viewed v and z.
      • When we need recommendations for user u, look at the set W(u) of all videos watched by u during the training period, and consider the set C(W(u)) = {z ∈ C(v) | v ∈ W(u)} of videos co-viewed with the videos watched by u. 
      • For each z ∈ C(W(u)), assign a score c(u, z) = Σ_v∈W(u) c(v, z).
      • Sort this set by total score, and output the first videos from this sorted list that user u has not watched already.
    • The third and final reference algorithm rectifies the bias of raw co-view counts toward globally popular videos: rather than define the “co-view score” c(v, z) between videos v and z to be the number of users who co-viewed v and z, we define c(v, z) by c(v, z) = |U(v)∩U(z)|/|U(v)∪U(z)|, where U(v) denotes the set of users who watched v. 
      • As before, we define c(u, z) for user u and video z to be Σ_v∈W(u) c(v, z). 
      • We call the resulting algorithm LR (for local rare); a sketch of this scoring appears after the adsorption variants below.
    • Adsorption-N: At each user node, injection probability and dummy probability are 0; probability of continuing the random walk is 1. At each video node, injection probability is 1/4, dummy probability is 0, and probability of continuing the random walk is 1/2. 
    • Adsorption-D: At each user node, injection probability is 0, dummy probability is 1/4, and probability of continuing the random walk is 3/4. At each video node, injection probability is 1/4, dummy probability is 1/4, and probability of continuing the random walk is 1/2.
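
A brute-force sketch of the LR scoring described above; all names are my own assumptions, a real system would precompute the co-view lists C(v) rather than scan every video, and watched videos are assumed to appear as keys in usersOf:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LocalRare {
    // usersOf.get(v) = U(v), the set of users who watched video v.
    // Jaccard-style score: c(v, z) = |U(v) ∩ U(z)| / |U(v) ∪ U(z)|.
    static double coViewScore(Set<String> uv, Set<String> uz) {
        Set<String> inter = new HashSet<>(uv);
        inter.retainAll(uz);
        Set<String> union = new HashSet<>(uv);
        union.addAll(uz);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // c(u, z) = Σ_{v ∈ W(u)} c(v, z), then rank the unwatched videos.
    static List<String> recommend(Set<String> watched,
                                  Map<String, Set<String>> usersOf,
                                  int k) {
        Map<String, Double> score = new HashMap<>();
        for (String v : watched) {
            for (String z : usersOf.keySet()) {
                if (!watched.contains(z)) {
                    score.merge(z, coViewScore(usersOf.get(v), usersOf.get(z)), Double::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort((x, y) -> Double.compare(score.get(y), score.get(x)));
        return ranked.subList(0, Math.min(k, ranked.size()));
    }
}

Replacing coViewScore with the raw intersection size |U(v) ∩ U(z)| recovers the LP heuristic.
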
  • How the adsorption algorithm is used for the video recommendation problem:
    • One way to use the algorithm is to run the algorithm on the user–video view graph, and use the label distribution derived for each user as the recommendations; the labels will be the most relevant videos for that user.
    • An equivalent mechanism is one where for each user, we collect the distributions computed for each of her watched videos and take their average. 
    • Although this yields equivalent answers, it has an important benefit: we can use this procedure to make recommendations to new users (those who were not in the training set) as soon as they watch even a single video.
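
A sketch of this second mechanism; the names are my own assumptions, and the input is the per-video label distribution produced by an adsorption run (the labels here are candidate videos):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class NewUserRecommendation {
    // videoLabelDist.get(v) = the label distribution computed at video node v.
    // For a (possibly new) user, average the distributions of her watched videos.
    static Map<String, Double> recommend(Set<String> watched,
                                         Map<String, Map<String, Double>> videoLabelDist) {
        Map<String, Double> avg = new HashMap<>();
        for (String v : watched) {
            Map<String, Double> dist = videoLabelDist.getOrDefault(v, Map.of());
            for (Map.Entry<String, Double> e : dist.entrySet()) {
                avg.merge(e.getKey(), e.getValue() / watched.size(), Double::sum);
            }
        }
        return avg;  // rank candidate videos by probability, skipping already-watched ones
    }
}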

EMPIRICAL RESULTS
  • Top-1 performance:
    • First, note that GP has the best performance for users who viewed just one video during the training period.
      • But it quickly degrades as the view frequency increases, becoming the worst among all the algorithms! 
      • This suggests that for very casual users (who visit a system infrequently), recommending the most popular videos is a useful strategy. 
      • However, for users with very high view frequency, the idea of the “popular but unwatched” videos is a somewhat questionable notion.
    • Algorithm Adsorption-D performs significantly better as the view frequency increases beyond 5, and the gains become much larger as the view frequency increases further.