Saturday, May 6, 2017

[word2vec] Fitting Probability Models


Reference Book: Simon J. D. Prince, Computer Vision: Models, Learning, and Inference

Question: How to fit probability models to data {x_i}?

Answer: Learn the parameters θ of the model from the data.

Methods:
  • maximum likelihood
  • maximum a posteriori
  • Bayesian approach


Maximum likelihood (ML)
  • Likelihood function:
    • Pr(x_i | θ) at a single data point x_i
    • Pr(x_{1...I} | θ) for a set of points
      • Assume the points are drawn independently from the distribution
      • Pr(x_{1...I} | θ) = \prod_{i=1}^{I} Pr(x_i | θ)
  • Estimate of the parameter
    • \hat{θ} = argmax_θ [ Pr(x_{1...I} | θ) ]
  • Example #1: The skip-gram model
    • Reference: https://arxiv.org/pdf/1402.3722v1.pdf
    • Given a corpus of words w and their contexts c 
    • Consider the conditional probabilities Pr(c|w)
    • Goal: Set the parameters θ of Pr(c|w;θ) so as to maximize the corpus probability: 
      • argmax_θ \prod_{w in Text} [ \prod_{c in C(w)} Pr(c|w;θ) ]
      • argmax_θ \prod_{(w, c) in D} Pr(c|w;θ)
    • Model for Pr(c|w;θ): a softmax over all contexts (see the softmax sketch after this outline)
      • e^{v_c \cdot v_w} / \sum_{c' in C} e^{v_{c'} \cdot v_w}
        • v_c, v_w: vector representations of c and w
        • C: the set of all available contexts
    • Estimate: Take the log to turn the product into a sum
      • argmax_θ \sum_{(w, c) in D} log Pr(c|w;θ)
      • argmax_θ \sum_{(w, c) in D} ( v_c \cdot v_w - log \sum_{c' in C} e^{v_{c'} \cdot v_w} )
      • Very expensive to compute: the log \sum term runs over every context c', typically the whole vocabulary
      • Solutions:
        • Hierarchical softmax
        • Negative sampling
    • Negative sampling:
      • Pr( D=1 | w,c;θ ) = σ(v_c \cdot v_w), where σ is the sigmoid: the probability that the pair (w, c) came from the corpus data
      • Estimate: argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w)
        • Degenerate on its own: it is trivially maximized by making every v_c \cdot v_w large, so incorrect pairs are needed as a counterweight
      • D': random (w, c) pairs assumed to be incorrect (negative samples)
      • Estimate (see the negative-sampling sketch after this outline):
        • argmax_θ \sum_{(w, c) in D} log Pr( D=1 | w,c;θ ) + \sum_{(w, c) in D'} log Pr( D=0 | w,c;θ )
        • argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w) + \sum_{(w, c) in D'} log σ(-v_c \cdot v_w)
          • using Pr( D=0 | w,c;θ ) = 1 - σ(v_c \cdot v_w) = σ(-v_c \cdot v_w)
      • Random sampling: negative pairs in D' are built by pairing each w with randomly drawn contexts; word2vec draws them from the unigram distribution raised to the 3/4 power
  • Example #2: Bernoulli trial
    • Data x_i ∈ {0, 1} with Pr(x_i = 1 | λ) = λ
    • Likelihood: Pr(x_{1...I} | λ) = \prod_{i=1}^{I} λ^{x_i} (1 - λ)^{1 - x_i}
    • ML estimate: \hat{λ} = (1/I) \sum_{i=1}^{I} x_i, the sample mean (see the Bernoulli sketch below)
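
A minimal sketch of the skip-gram softmax model in Python. The vocabulary size V, the embedding dimension d, and the random vectors are toy assumptions for illustration, not values from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                         # toy vocabulary size and embedding dimension (assumed)
    W = rng.normal(scale=0.1, size=(V, d))  # word vectors v_w, one row per word
    C = rng.normal(scale=0.1, size=(V, d))  # context vectors v_c, one row per context

    def softmax_prob(w, c):
        # Pr(c | w; θ) = e^{v_c \cdot v_w} / \sum_{c'} e^{v_{c'} \cdot v_w}
        scores = C @ W[w]        # v_{c'} \cdot v_w for every context c'
        scores -= scores.max()   # subtract the max for numerical stability
        exp_scores = np.exp(scores)
        return exp_scores[c] / exp_scores.sum()  # the denominator sums over all V contexts

    print(softmax_prob(w=3, c=7))

The denominator touches all V context vectors for a single probability, which is exactly the cost that hierarchical softmax and negative sampling avoid.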
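
A matching sketch of the negative-sampling objective for one positive pair plus k negatives. Drawing negatives uniformly here is a simplification; word2vec samples them from the unigram distribution raised to the 3/4 power:

    import numpy as np

    rng = np.random.default_rng(0)
    V, d = 1000, 50                         # same toy setup as above (assumed)
    W = rng.normal(scale=0.1, size=(V, d))  # word vectors v_w
    C = rng.normal(scale=0.1, size=(V, d))  # context vectors v_c

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_objective(w, c, k=5):
        # log σ(v_c \cdot v_w) + \sum_{(w, c') in D'} log σ(-v_{c'} \cdot v_w)
        positive = np.log(sigmoid(C[c] @ W[w]))
        negatives = rng.integers(0, V, size=k)  # uniform negatives (simplified)
        negative = np.log(sigmoid(-(C[negatives] @ W[w]))).sum()
        return positive + negative

    print(neg_sampling_objective(w=3, c=7))

Each pair now costs k + 1 dot products instead of a sum over the whole vocabulary.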
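
And a quick sketch of the Bernoulli ML fit; the coin-flip data is synthetic, generated only to illustrate the estimate:

    import numpy as np

    rng = np.random.default_rng(1)
    lam_true = 0.3
    x = rng.binomial(1, lam_true, size=1000)  # I = 1000 Bernoulli trials

    def log_likelihood(lam, x):
        # log Pr(x_{1...I} | λ) = \sum_i [ x_i log λ + (1 - x_i) log(1 - λ) ]
        return np.sum(x * np.log(lam) + (1 - x) * np.log(1 - lam))

    lam_hat = x.mean()  # closed-form ML estimate: the sample mean
    print(lam_hat, log_likelihood(lam_hat, x))

    # Grid check: the sample mean indeed maximizes the log-likelihood
    grid = np.linspace(0.01, 0.99, 99)
    print(grid[np.argmax([log_likelihood(l, x) for l in grid])])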
