Reference Book: Simon J. D. Prince, Computer Vision: Models, Learning, and Inference
Question: How to fit probability models to data {x_i}?
Answer: Learn about the parameters θ of the model.
- maximum likelihood
- maximum a posteriori
- Bayesian approach
Maximum likelihood (ML)
- Likelihood function:
- Pr(x_i | θ) at single data point x_i
- Pr(x_{1...I} | θ) for a set of points
- Assume that the data points are drawn independently from the distribution
- Pr(x_{1...I} | θ) = \prod_{i from 1 to I} Pr(x_i | θ)
- Estimate of the parameter
- \hat{θ} = argmax_θ [ Pr(x_{1...I} | θ) ]
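A minimal sketch of the ML recipe above, assuming a univariate normal model with θ = (mu, sigma2). The choice of distribution, the toy data, and the function names are illustrative assumptions, not part of the notes.

```python
import math

def log_likelihood(data, mu, sigma2):
    """log of prod_i Pr(x_i | theta) for a univariate normal, theta = (mu, sigma2)."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

def fit_normal_ml(data):
    """Closed-form argmax of the likelihood for the normal model."""
    n = len(data)
    mu_hat = sum(data) / n
    # The ML variance divides by n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, sigma2_hat

data = [1.2, 0.7, 2.1, 1.6, 0.9]  # toy data, made up for illustration
mu_hat, sigma2_hat = fit_normal_ml(data)
```

Working with the log of the product is safe because log is monotonic, so the argmax is unchanged. Any other (mu, sigma2) passed to log_likelihood should score lower than the fitted pair.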
- Example #1: The skip-gram model
- Reference:
- Given a corpus of words w and their contexts c
- Consider the conditional probabilities Pr(c|w)
- Goal: Set the parameters θ of Pr(c|w;θ) so as to maximize the corpus probability:
- argmax_θ \prod_{w in Text} [ \prod_{c in C(w)} Pr(c|w;θ) ]
- Equivalently, with D the set of all (w, c) pairs extracted from the text: argmax_θ \prod_{(w, c) in D} Pr(c|w;θ)
- Model in Pr(c|w;θ):
- e^{v_c \cdot v_w} / \sum_{c' in C} e^{v_{c'} \cdot v_w}
- v_c, v_w: vector representations of c and w (these vectors are the parameters θ)
- C: all available contexts
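A small sketch of the model Pr(c|w;θ) above as a softmax over context scores. The 2-dimensional toy vectors and the helper names dot and context_probs are hypothetical, chosen only for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def context_probs(v_w, context_vectors):
    """Pr(c | w) for every context c: softmax of the scores v_c . v_w."""
    scores = [dot(v_c, v_w) for v_c in context_vectors]
    m = max(scores)  # shift by the max for numerical stability; ratios are unchanged
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizer: one term per context in C
    return [e / z for e in exps]

context_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy v_c for three contexts
v_w = [2.0, 0.5]                                          # toy v_w
probs = context_probs(v_w, context_vectors)
```

The probabilities sum to 1, and the context whose vector aligns best with v_w gets the largest share.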
- Estimate: Take log
- argmax_θ \sum_{(w, c) in D} log Pr(c|w;θ)
- argmax_θ \sum_{(w, c) in D} (v_c \cdot v_w - log(...) )
- Very expensive to compute: log(...) is the log of the softmax denominator, which sums over all contexts c' in C for every (w, c) pair
- Solutions:
- Hierarchical softmax
- Negative sampling
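A sketch of the log objective for the softmax model, to make the cost visible: every (w, c) term has to score all contexts to get the normalizer. The vocabulary, vectors, and function names are made-up assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_prob(c, w, word_vecs, ctx_vecs):
    """log Pr(c | w) = v_c . v_w - log sum_{c'} exp(v_c' . v_w)."""
    v_w = word_vecs[w]
    # This loop touches every context in C, for each single (w, c) pair.
    scores = {c2: dot(v, v_w) for c2, v in ctx_vecs.items()}
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[c] - log_z

def corpus_log_likelihood(pairs, word_vecs, ctx_vecs):
    """Sum of log Pr(c | w) over all observed (w, c) pairs."""
    return sum(log_prob(c, w, word_vecs, ctx_vecs) for w, c in pairs)

word_vecs = {"cat": [1.0, 0.2], "dog": [0.9, 0.1]}                          # toy v_w
ctx_vecs = {"purrs": [1.0, 1.0], "barks": [1.0, -1.0], "the": [0.0, 0.0]}   # toy v_c
D = [("cat", "purrs"), ("dog", "barks"), ("cat", "the")]
ll = corpus_log_likelihood(D, word_vecs, ctx_vecs)
```

With a realistic vocabulary the inner loop runs over a very large C, which is what hierarchical softmax and negative sampling are meant to avoid.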
- Negative sampling:
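- Pr( D=1 | w,c;θ ) = σ(v_c \cdot v_w), σ: sigmoid; D=1 means the pair (w, c) came from the corpus
- Estimate: argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w)
- This alone has a trivial solution (make every v_c \cdot v_w large), so negative examples are needed
- D': set of randomly generated incorrect (w, c) pairs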
- Estimate:
- argmax_θ \sum_{(w, c) in D} log Pr( D=1 | w,c;θ ) + \sum_{(w, c) in D'} log Pr( D=0 | w,c;θ )
- argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w) + \sum_{(w, c) in D'} log σ(-v_c \cdot v_w)
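A minimal sketch of the negative-sampling objective above, using the identity Pr(D=0) = 1 - σ(x) = σ(-x). The toy vectors, the pair sets, and the names log_sigmoid and neg_sampling_objective are assumptions for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable log sigma(x), where sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs):
    """sum_D log sigma(v_c . v_w) + sum_D' log sigma(-v_c . v_w)."""
    pos = sum(log_sigmoid(dot(ctx_vecs[c], word_vecs[w])) for w, c in D)
    neg = sum(log_sigmoid(-dot(ctx_vecs[c], word_vecs[w])) for w, c in D_neg)
    return pos + neg

word_vecs = {"cat": [1.0, 0.2]}                          # toy v_w
ctx_vecs = {"purrs": [1.0, 1.0], "barks": [1.0, -1.0]}   # toy v_c
D = [("cat", "purrs")]       # observed pair
D_neg = [("cat", "barks")]   # random incorrect pair
objective = neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs)
```

Each term now costs one dot product, with no sum over all contexts in C.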
- Random sampling:
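- Reference: Mikolov et al., "Distributed Representations of Words and Phrases and their Compositionality"
- Per observed pair (w, c), the D' term becomes: \sum_{i = 1 to k} E_{c_i ~ P_n(c)} [ log σ(-v_{c_i} \cdot v_w) ], with k negatives c_i drawn from the noise distribution P_n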
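A sketch of drawing k negatives from a noise distribution P_n. It assumes P_n is the unigram distribution raised to the 3/4 power, the choice reported in the paper referenced above. The counts and function names are hypothetical.

```python
import random

def noise_distribution(counts, power=0.75):
    """P_n proportional to count ** power (power=0.75 is an assumption, see lead-in)."""
    weights = {item: n ** power for item, n in counts.items()}
    total = sum(weights.values())
    return {item: x / total for item, x in weights.items()}

def sample_negatives(p_n, k, rng):
    """Draw k negatives independently from P_n, with replacement."""
    items = list(p_n)
    return rng.choices(items, weights=[p_n[i] for i in items], k=k)

counts = {"the": 100, "cat": 10, "purrs": 1}  # toy unigram counts
p_n = noise_distribution(counts)
negatives = sample_negatives(p_n, k=5, rng=random.Random(0))
```

Raising counts to a power below 1 flattens the distribution, so rare items are sampled more often than their raw frequency would suggest. Averaging log σ(-score) over the k samples gives a Monte Carlo estimate of the expectation.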
- Example #2: Bernoulli trial