Plover Tsai: May 2017

Tuesday, May 23, 2017

"Toad Oil" (《蝦蟆的油》): Akira Kurosawa in Search of Akira Kurosawa

（P114）

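"The whole class was looking at me."

"My face turned red, and for a moment I could not move."

"Many teachers in the old days were free-spirited people with rich personalities."

"By comparison, too many of today's teachers are ordinary salarymen. Or rather, too many are bureaucratic types."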

（P228）

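"No one is more frightening than a petty bureaucrat kept by those in power."

"Take the Nazis as an example. Hitler was of course a madman, but one look at Himmler and Eichmann shows that the further down the organization you go, the more madmen of genius emerge. As for the commandants and guards of the concentration camps, they were beasts beyond imagination."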

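A separate matter: "Five years ago, as I stepped off the plane from China and left the terminal at Dulles Airport, I was about to put on one of the five face masks I had prepared when I took my first breath of American air and immediately put the mask away. The air was so sweet and fresh, with an oddly luxurious feel, that I was astonished. I grew up in China, and in my hometown I wore a mask every time I went out, or I might fall ill. But the moment I breathed the air outside the airport, I felt free."

The reactions were: "Those who demean themselves will be demeaned by others." "How can someone who does not even love their own country love anyone else?"

What Kurosawa said seems to have some truth in it. In Sunnyvale I found that the local air really is a luxury, as are the blue skies and blazing sun, not to mention the luxurious salary levels and the constant waste of oil and food. Everyone can love their own country, but make no mistake, America really is that great ~~(except that the prisons go a bit too far, the racial problems are rather serious, the food is not tasty enough, the work pressure is high, it is not very friendly toward certain countries, and so on and so on)~~.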

Saturday, May 6, 2017

[word2vec] Fitting Probability Models

Reference Book: Simon J. D. Prince, Computer Vision: Models, Learning, and Inference

Question: How to fit probability models to data {x_i}?

Answer: Learn about the parameters θ of the model.

Methods:

maximum likelihood
maximum a posteriori
Bayesian approach

Maximum likelihood (ML)

Likelihood function:

Pr(x_i | θ) at single data point x_i
Pr(x_{1...I} | θ) for a set of points

Assume that the points are drawn independently from the distribution:
Pr(x_{1...I} | θ) = \prod_{i=1}^{I} Pr(x_i | θ)

Estimate of the parameter

\hat{θ} = argmax_θ [ Pr(x_{1...I} | θ) ]
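As a minimal sketch of the argmax above, the snippet below fits a univariate normal by maximum likelihood. The choice of a normal distribution, the toy data, and the function names are assumptions made for illustration only. Because the points are assumed independent, the log-likelihood is a sum over data points.

```python
import math

def log_likelihood(data, mu, sigma2):
    """Sum of log N(x_i | mu, sigma2) over independently drawn points."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - squared_error / (2 * sigma2)

def fit_normal_ml(data):
    """Closed-form ML estimates of the mean and variance."""
    n = len(data)
    mu_hat = sum(data) / n
    # The ML variance divides by n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n
    return mu_hat, sigma2_hat

# Toy data (hypothetical values).
data = [2.1, 1.9, 2.4, 2.0, 1.6]
mu_hat, sigma2_hat = fit_normal_ml(data)
print(mu_hat, sigma2_hat)
```

Any other parameter setting should give a log-likelihood no larger than the one at (mu_hat, sigma2_hat), which is a quick sanity check on the fit.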

Example #1: The skip-gram model

Reference: https://arxiv.org/pdf/1402.3722v1.pdf
Given a corpus of words w and their contexts c
Consider the conditional probabilities Pr(c|w)
Goal: Set the parameters θ of Pr(c|w;θ) so as to maximize the corpus probability:

argmax_θ \prod_{w in Text} [ \prod_{c in C(w)} Pr(c|w;θ) ]
argmax_θ \prod_{(w, c) in D} Pr(c|w;θ)

Model for Pr(c|w;θ) (softmax):

e^{v_c \cdot v_w} / \sum_{c' in C} e^{v_{c'} \cdot v_w}

v_c, v_w: vector representation for c and w
C: all available contexts
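The softmax model can be sketched in a few lines of plain Python. The two-dimensional vectors and the three context names below are made up for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax_prob(v_w, context_vectors, c):
    """Pr(c | w) = exp(v_c . v_w) / sum over c' in C of exp(v_c' . v_w)."""
    scores = {
        name: sum(a * b for a, b in zip(vec, v_w))
        for name, vec in context_vectors.items()
    }
    m = max(scores.values())  # shift scores for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())  # sums over ALL contexts
    return math.exp(scores[c] - m) / z

# Hypothetical context vectors v_c and one word vector v_w.
context_vectors = {"bark": [1.0, 0.0], "meow": [0.0, 1.0], "run": [0.5, 0.5]}
v_dog = [2.0, 0.0]
probs = {c: softmax_prob(v_dog, context_vectors, c) for c in context_vectors}
print(probs)
```

Note that the normalizer z touches every context in C, which is where the cost comes from when C is a full vocabulary.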

Estimate: Take log

argmax_θ \sum_{(w, c) in D} log Pr(c|w;θ)
= argmax_θ \sum_{(w, c) in D} ( v_c \cdot v_w - log \sum_{c' in C} e^{v_{c'} \cdot v_w} )
Very expensive to compute, because the log term sums over all contexts c' in C for every pair (w, c).

https://www.tensorflow.org/tutorials/word2vec

Solutions:

Hierarchical softmax
Negative sampling

Negative sampling:

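Pr( D=1 | w,c;θ ) = σ(v_c \cdot v_w), σ: sigmoid; D=1 means the pair (w, c) came from the corpus
Estimate: argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w)
This objective alone has a trivial solution (make every v_c \cdot v_w large), so negative examples are needed.
D': a set of randomly generated incorrect (w, c) pairs
Estimate:

argmax_θ \sum_{(w, c) in D} log Pr( D=1 | w,c;θ ) + \sum_{(w, c) in D'} log Pr( D=0 | w,c;θ )
argmax_θ \sum_{(w, c) in D} log σ(v_c \cdot v_w) + \sum_{(w, c) in D'} log σ(-v_c \cdot v_w)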
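A rough sketch of evaluating the negative-sampling objective follows. The vectors, the pair lists, and the helper names are hypothetical, and the code only computes the objective value; it does not optimize it.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    """Numerically stable log σ(x)."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def neg_sampling_objective(pos_pairs, neg_pairs, word_vecs, ctx_vecs):
    """Sum over D of log σ(v_c . v_w) plus sum over D' of log σ(-v_c . v_w)."""
    total = 0.0
    for w, c in pos_pairs:  # observed pairs, D
        total += log_sigmoid(dot(ctx_vecs[c], word_vecs[w]))
    for w, c in neg_pairs:  # random incorrect pairs, D'
        total += log_sigmoid(-dot(ctx_vecs[c], word_vecs[w]))
    return total

# Hypothetical toy vectors.
word_vecs = {"dog": [2.0, 0.0]}
ctx_vecs = {"bark": [1.0, 0.0], "meow": [-1.0, 0.0]}
D = [("dog", "bark")]
D_neg = [("dog", "meow")]

good = neg_sampling_objective(D, D_neg, word_vecs, ctx_vecs)
swapped = neg_sampling_objective(D_neg, D, word_vecs, ctx_vecs)
print(good, swapped)
```

With these vectors the objective is higher when the observed pair is treated as positive than when the roles of D and D' are swapped. Each term costs one dot product, with no sum over all contexts.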

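Random sampling:

Reference: Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
\sum_{i=1}^{k} E_{w_i ~ P_n(w)} [ log σ(-v_{w_i} \cdot v_w) ]
k: number of negative samples per (w, c) pair; P_n(w): noise distribution; v_{w_i}: context vector of the sampled word w_i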
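A sketch of drawing the k negative samples is below. It assumes the noise distribution P_n(w) is the unigram distribution raised to the 3/4 power, which is the choice reported in the cited paper. The toy corpus and function names are made up.

```python
import random
from collections import Counter

def noise_distribution(tokens, power=0.75):
    """P_n(w) proportional to count(w) ** power."""
    counts = Counter(tokens)
    weights = {w: n ** power for w, n in counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(p_n, k, rng):
    """Draw k words independently from the noise distribution."""
    words = list(p_n)
    return rng.choices(words, weights=[p_n[w] for w in words], k=k)

# Hypothetical toy corpus.
tokens = "the cat sat on the mat the end".split()
p_n = noise_distribution(tokens)
negatives = sample_negatives(p_n, k=5, rng=random.Random(0))
print(p_n)
print(negatives)
```

Raising counts to a power below 1 flattens the distribution, so frequent words such as "the" are sampled somewhat less often than their raw frequency would suggest.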

Example #2: Bernoulli trial

Part 1: https://www.youtube.com/watch?v=I_dhPETvll8
Part 2: https://www.youtube.com/watch?v=Z582V53dfr8
Part 3: https://www.youtube.com/watch?v=jpHreXjtw1Q
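For the Bernoulli case, a small sketch under the usual model Pr(x | θ) = θ^x (1 - θ)^(1 - x): the ML estimate is the fraction of successes, and a grid search over θ agrees with it. The data and function names are illustrative assumptions, not taken from the videos.

```python
import math

def bernoulli_log_likelihood(xs, theta):
    """log Pr(x_{1...I} | theta) for independent 0/1 outcomes."""
    k = sum(xs)  # number of successes
    n = len(xs)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def fit_bernoulli_ml(xs):
    """Closed-form ML estimate: the fraction of successes."""
    return sum(xs) / len(xs)

# Hypothetical outcomes: 7 successes in 10 trials.
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
theta_hat = fit_bernoulli_ml(xs)

# Cross-check with a brute-force argmax over a grid of theta values.
grid = [i / 100 for i in range(1, 100)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(xs, t))
print(theta_hat, theta_grid)
```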
