# Skip-Gram Modelling

##### The paper below is A Closer Look at Skip-gram Modelling by Guthrie et al.

First of all, beyond n-grams. In addition to n-grams, it allows tokens to be skipped.

### Definition of k-skip-n-gram

we define k-skip-n-gram for a sentence w_1 … w_m to be the set {w_i_1, w_i_2, … w_i_n | \sum_{j=1}^n i_j - i_{j-1} < k}

In English, it is that for a certain skil distance k allow a total k or less skips to construct the n gram. For example, 12345 can be constructed for 2-skip-bi-gram to be {12, 13, 23, 24, 34, … }

##### The paper below is word2vec Parameter Learning Explained by Xin Rong

### Continuous bag-of-word model

#### One word context

This is the simplest continuous bag os word that consider one word per context, then we will predict target word by a given context word. The we have a one-hot encoded vector, going through NN and have (simply copying the kth row).

The objective is to maximize the conditional probability of observing the actual output word given the input context by max p(word output | word input) |

#### Multi-word context

When we have multiple words, then what we see is
max p(word output | word**s** input)
and the output weight function is basically the same.

### Skip-Gram model

opposite to continuous bag of words, when input layer is target word, and context is output layer. Instead of outputing one multinomial distribution, we are outpuing multinomial distributions. Each output is computed using the same hidden→output matrix. Then the loss function is still softmax.

### Optimizations

#### Hierarchical softmax

It uses binary tree to represent all words in the vocabularies. Then it doesn’t use vector representaiton of the words but use huffman tree to encode everything.

To go through the tree, we have a random walk on p(n, left) = softmax(v^T, h)

#### Negative Sampling

To deal with difficulty that we have too many vectors to update per iteration, we do a negative sampling that (besides updating output word)have sample a few words as negative samples.