Introduction

Let's imagine that we are the first research scientists ever to attempt to design algorithms that will let a computer make some sense of human language (we'll pick English for the sake of this blog). We start by choosing a task that the computer should perform with a certain level of competence once it has the abilities we want to bless it with. Solving analogies seems like a reasonable task for this purpose. If we ask our computer "king is to queen as man is to ___?", it should hopefully reply "woman".

Looking at the task closely, we realize that we can't just type "King is to queen as man is to what?" into our computer and expect it to answer. Our computer isn't a human that can make sense of English symbols yet. Written language is a symbolic representation of our thoughts. We encode our thoughts in symbols so that we can convey them to other humans. People who are unable to speak encode their thoughts in sign language. There is one problem with conveying thoughts to another person by encoding them in symbols: the other person has to know how to decode those symbols. You are able to decode the symbolic representation of English, which you are doing effortlessly right now while reading this blog. But if this blog were written in a language you didn't know, it would just be a collection of meaningless symbols to you. Therefore, we have to encode our thoughts or messages with rules which the receiver knows how to decode. In our case, the receiver is a computer, so we have to convey our query in a form it can understand.

Let's suppose that in the end we design a function find_analogy which takes in three arguments arg1, arg2 and arg3. It does some mathematical computation on these three arguments and returns the word that answers the question "arg1 is to arg2 as arg3 is to what?". If we want the answer to our question "King is to queen as man is to what?", we need to run the line of code find_analogy(king, queen, man). But our computer doesn't know what the words "king", "queen" and "man" mean. We have to encode the words in a form the computer understands. Well, we know that a computer can make sense of mathematical entities like scalars and vectors. Why don't we encode every word in the dictionary so that each is represented by a unique scalar/vector? A sketch of the interface we're aiming for is shown below.
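Here is a minimal sketch of that interface, assuming we somehow already had a dictionary `embeddings` mapping each word to a vector (building such vectors is exactly what the rest of this post is about). The arithmetic `queen - king + man` is only a preview of the kind of computation we'll end up with:

```python
import numpy as np

def find_analogy(arg1, arg2, arg3, embeddings):
    # "arg1 is to arg2 as arg3 is to ?"  ->  look for the word whose vector
    # is closest in direction to (arg2 - arg1 + arg3).
    target = embeddings[arg2] - embeddings[arg1] + embeddings[arg3]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (arg1, arg2, arg3):
            continue  # don't just return one of the inputs
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# find_analogy("king", "queen", "man", embeddings)   # hopefully returns "woman"
```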

Let's start with the simplest solution and represent our words with scalars such that each word is represented by its position in a standard English dictionary. So the word 'aardvark' is represented by the scalar 1 and the word 'zebra' by the scalar 273000 (roughly the number of words in the Oxford dictionary). We can immediately sense something wrong about this representation. What would adding or subtracting the representations of two words even mean? The analogy task with which we intend to challenge our computer's understanding requires a sense of meaning. It should be able to reason that the king-queen relationship is the same as the man-woman relationship, since we remove the abstract concept of masculinity from king and man and add the abstract concept of femininity to arrive at queen and woman respectively. If we trick our computer by asking a different question like "king is to man as queen is to what?", then it should be able to do some computation which resembles subtracting the abstract concept of royalty from king and queen to arrive at man and woman respectively. There's no way a computer can capture such abstract concepts if we just encode each word by its position in the dictionary. Thus, with a little bit of thought, we come to the conclusion that representing words with vectors makes more sense.

And so, we arrive at our next problem. How do we find good vector representations of words? Let's answer this question by trying out different possibilities.

Let $|V|$ be the length of our vocabulary/dictionary. We can represent each word by a $|V|$-dimensional vector which has a 1 at the index equal to the position of the word in the dictionary and 0 everywhere else. Therefore, the representation of the word 'aardvark' is ${\left[\begin{array}{c}1\\0\\0\\\vdots\\0\end{array}\right]}$ and 'zebra' is represented by ${\left[\begin{array}{c}0\\0\\0\\\vdots\\1\end{array}\right]}$. We'll call this encoding the 'one-hot' representation of our words.
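As a concrete illustration, here is a minimal sketch of one-hot vectors in NumPy; the vocabulary size of 273000 is just the rough figure quoted above:

```python
import numpy as np

V = 273000                      # rough size of the Oxford dictionary

def one_hot(position, V):
    """Return the one-hot vector for the word at `position` (1-indexed)."""
    vec = np.zeros(V)
    vec[position - 1] = 1.0
    return vec

aardvark = one_hot(1, V)        # [1, 0, 0, ..., 0]
zebra    = one_hot(V, V)        # [0, 0, 0, ..., 1]
```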

Now, digressing a little, let us ponder the question: given two vectors, how do we know whether they point in roughly the same direction? We can quantify this notion by calculating the cosine of the angle between them. Let ${a}$ and ${b}$ be two vectors and ${\theta}$ the angle between them. Then ${\cos\theta = \large\frac{a.b}{\Vert a\Vert \Vert b \Vert}}$. If ${\cos\theta = 1}$, the vectors point in the same direction; if it is $0$, they are perpendicular to each other; and if it is $-1$, they point in opposite directions.

Note: We’ll restrict the word vectors to be unit vectors. Then, calculating the cosine similarity is the same as taking the dot product.

How does this relate to the problem at hand? Well, we want the notion of similarity between two words to be captured by an operation as simple as the dot product of their vectors. In other words, we want the vectors of words with similar meanings to point in roughly the same direction. This way we can design a function called is_similar which takes in the vectors of two words and computes their dot product. If the dot product is close to 1, the function says that the two words are similar. So, we hope that we can design word vectors in such a way that the vectors of synonymous words point in roughly the same direction (and antonyms in the opposite), as in the sketch below.
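A minimal sketch of such a function might look like this; the 0.8 threshold is an arbitrary choice for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def is_similar(vec_a, vec_b, threshold=0.8):
    # For unit vectors the dot product *is* the cosine similarity.
    return cosine_similarity(vec_a, vec_b) > threshold
```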

Now, knowing the English language, we know that the words 'motel' and 'hotel' are very similar, and so we want the vectors of these two words to give a dot product close to 1, according to the concept we just discussed. But if we take the dot product of the vectors from our one-hot representation of these words, it comes out as 0. In fact, the dot product of any two word vectors in the one-hot representation is 0. This is disheartening, since it means that one-hot representations are not capable of capturing similarity between two words. Each word vector is orthogonal to every other word vector in this representation, which is the same as saying that no two words are similar to each other, nor are any two opposite, which isn't the case in the real world. So, we part ways with the one-hot representation and search for other alternatives.

Count-based Methods:

We are looking for some concept about words which we can latch onto to form good vector representations. Through a fortunate stroke of serendipity we come across a quote by the English linguist John Rupert Firth: "A word is characterized by the company it keeps". This quote implies that a word's meaning is decided by the context in which it appears. We can validate this statement by looking at some examples. Notice the use of the word "bank" in the following two sentences.

  1. I deposited the cheque in the bank.
  2. I was sitting alongside a river bank.

In the two sentences above, we can see that the word "bank" takes on two completely different meanings based on the context in which it appears. In the first sentence, "bank" means a building where financial operations are carried out, while in the second it means the ground alongside a river. Therefore, the meaning of the word "bank" changed depending on the words it was surrounded by (its context words). Moreover, consider the following examples:

  1. The service at hotel we stayed in was very good.
  2. The motel we are going to stay at is famous for the excellent service it provides.

Since 'motel' and 'hotel' are pretty much synonymous, they tend to be surrounded by similar words like 'stay', 'service' etc. Let's refer to the surrounding words of a word as its "context words". We can say that two words are similar if they have very similar context words.

To make use of this concept, let's just count co-occurrences of words. We first collect a large corpus of text. We then design a two-dimensional matrix ${M}$ (let's call it the count matrix) where ${M_{ij}}$ = the number of times the $j^{th}$ word of the dictionary occurred in the context of the $i^{th}$ word of the dictionary in the corpus. Each row of this matrix represents a word in the vocabulary. The first row represents the word 'aardvark', and the first cell of this row is the number of times 'aardvark' occurred in the context of 'aardvark'. Similarly, the last cell of this row is the number of times 'zebra' occurred in the context of 'aardvark'. Let $|V|$ be the number of words in our vocabulary. Each row contains ${|V|}$ cells and there are ${|V|}$ such rows, so ${M}$ is a ${|V|\times|V|}$ matrix. We need to specify one more detail to design this matrix: what does it mean to occur in the context of something? To formally specify this, we introduce a hyperparameter called the "context window", denoted by the letter c. c is the number of words on either side of a word that count as its context words. For example, in the sentence "I hope this year is better than the previous one.", if we choose c = 2 and our center word is 'year', then the words 'hope', 'this', 'is' and 'better' are its context words.

To fill out the cells of the matrix, we first create a ${|V|\times|V|}$ matrix filled with 0's. Then we take the first word in our large corpus, collect its context words, and increment the counts in the respective cells. We repeat this step for each word in the corpus, as in the sketch below.
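Here is a rough sketch of that procedure, assuming we already have the corpus as a list of tokens (`corpus_tokens`) and a mapping `word_to_index` from words to their dictionary positions:

```python
import numpy as np

def build_count_matrix(corpus_tokens, word_to_index, c=2):
    """Fill the |V| x |V| co-occurrence matrix M from a tokenized corpus.

    M[i][j] = number of times word j appears within `c` words of word i.
    """
    V = len(word_to_index)
    M = np.zeros((V, V))
    for pos, word in enumerate(corpus_tokens):
        i = word_to_index[word]
        # context window: c words on either side of the center word
        start, end = max(0, pos - c), min(len(corpus_tokens), pos + c + 1)
        for ctx_pos in range(start, end):
            if ctx_pos == pos:
                continue
            j = word_to_index[corpus_tokens[ctx_pos]]
            M[i, j] += 1
    return M
```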

We can see intuitively that synonymous words will have very similar rows, because in a large corpus of text they will have similar context words. We can take the row for any word, normalise it (to turn it into a unit vector) and declare that row to be the vector representation of that word. Then, taking the dot product of the vectors of similar words will result in a number close to 1.

This marks our first breakthrough!!! We have devised a representation of words which our computer can understand and which lets it tell us which words are similar by doing an operation as simple as a dot product. But it is only when we put this algorithm into practice in the real world that we notice it has numerous shortcomings:

  1. Every word is represented by a ${|V|}$-dimensional vector. ${|V|}$ can be of the order of millions because it is the size of our vocabulary. Storing a vector for every word takes a toll on the memory of our computer.
  2. ${M}$ is a huge ${|V|\times |V|}$ matrix which is sparse, meaning that most of its entries are 0. This is because most words never occur in the context of most other words; e.g. 'summer' and 'snow' probably never occur in each other's context.
  3. Some words occur numerous times in the context of other words without being very informative; e.g. words like 'the' and 'is' occur many times in the context of almost every word. This leads to a drastic imbalance in word frequency.
  4. New words are constantly being added to the vocabulary. Words like 'tweet' and 'google' may not be in the vocabulary of older texts, but they are so prevalent now that they need their own definitions in a standard dictionary.

Let's try to tackle these shortcomings one by one.

We have the concept of Singular Value Decomposition (SVD) to our rescue for tackling the first shortcoming. Let's understand what SVD does with a simple analogy. We can write a large number like 68 as $2\times2\times17$. Writing down $68$ in factorized form reveals a lot of facts: it is divisible by $2$, $4$ and $17$, it is a multiple of a prime number, etc. Similarly, factoring a matrix can reveal a lot about it. SVD says that any matrix ${M}$ can be factorized into three simpler matrices ${A, \Sigma}$ and ${B}$ as ${M = A\Sigma B^{T}}$, where $A$ and $B$ are orthogonal matrices and ${\Sigma}$ is a diagonal matrix.

Let's do this with our count matrix ${M}$. We can write $M$ as ${A\Sigma B^{T}}$, where each of ${A, \Sigma}$ and ${B}$ is a ${|V|\times |V|}$ matrix. Then we take only the first $k$ columns of ${A}$, which gives a ${|V|\times k}$ matrix, and declare the rows of this matrix to be the vector representations of our words. This solves our first problem, because each word is now a ${k}$-dimensional vector, where ${k}$ can be chosen by us, trading off dimensionality against how much of the information in ${M}$ we retain.
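A sketch of this step with NumPy might look as follows; the choice of $k = 300$ is arbitrary, and a small random matrix stands in for the real count matrix so the snippet runs on its own:

```python
import numpy as np

# A small random matrix stands in for the |V| x |V| count matrix M from before.
M = np.random.rand(1000, 1000)

A, sigma, B_T = np.linalg.svd(M)       # M = A @ np.diag(sigma) @ B_T

k = 300                                # dimensionality we choose to keep
word_vectors = A[:, :k]                # one k-dimensional row per word

# Normalise each row to unit length so dot products behave like cosine similarity.
word_vectors /= np.linalg.norm(word_vectors, axis=1, keepdims=True)
```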

But the SVD method we used to circumvent one of our problems has shortcomings of its own. SVD is not a trivial operation to perform: it does not scale well to large matrices, since the time taken to compute the SVD of a ${|V|\times|V|}$ matrix grows roughly cubically with ${|V|}$. The matrix we want to decompose is already huge and will only get larger as new words are added to the vocabulary. These shortcomings are severe enough to compel us to look for other ways to find good vector representations of our words.

Iteration-based methods:

Although our previous method didn't work out in the end, we found that the guiding principle "A word is characterized by the company it keeps" is pretty good. We would like to continue on this train of thought to devise other methods of finding word vectors. How else does the company of a word characterize it? We can notice that context words are pretty good predictors of a center word. That's what we have been doing since childhood when solving "fill in the blanks" questions in our exams. Try to complete the following sentence: "A ___ was seen flying in the sky after taking off from the airport." If you read only up to the word "sky", there would be many possibilities in your mind for the word that fills in the blank, e.g. 'bird', 'airplane', 'helicopter' etc. Once you read the whole sentence and see the word "airport", you immediately lock in "airplane" as the most likely answer. We can also see it the other way round. Given a word, say "football", you would expect words like "kick", "score" and "goal" to be in its vicinity in a large corpus, rather than words like "racket", "serve" etc.

This shows that:

1. A word is a good predictor of its context words.

2. Context words are good predictors of a center word.

These two concepts give us the foundation for building good vectors for our words.

Skip-gram model:

Let's go with the first concept above and see what we can conjure up. We can start by taking the first word in a large corpus and trying to *predict* its context words using its word vector. If the predictions are correct, then everything is fine; if they are not, then there must be something wrong with the word vectors. In that case, we *modify* the vectors so that they don't make the same mistake next time. We do this for every word in the corpus, and hopefully, by the end, we will have corrected all the bad word vectors. But to correct bad word vectors we need to have them in the first place. Therefore, we'll initialize a d-dimensional word vector for every word randomly and improve them over time using the algorithm we just envisaged.

I've highlighted the words *predict* and *modify* because it is not yet clear how we would carry out these operations.

Let's tackle the problem of prediction first. How do we predict the context words of a given word? For that, we can extend our notion of the dot product. Earlier we considered the dot product of the vectors of two words to be a measure of their similarity. Now, let's see the dot product of two word vectors as a measure of the probability of the words occurring in each other's vicinity. In other words, if the dot product of the vectors of two words, say 'a' and 'b', is close to 1, then they are very likely to occur near each other in the corpus, which means that when we see 'a' we can safely predict that 'b' will be nearby.

Note: I am using the term "dot product" to mean the cosine similarity between the vectors. If the vectors are of unit length, the cosine similarity is the same as the dot product.

Let's see how we can convert dot products of word vectors into probabilities with an example. Take the word vector of "deep", take its dot product with the vector of every word in the vocabulary, and collect the results in a list. This gives us a list of length ${|V|}$. We can see this list as a "score" for the probability of each word in the vocabulary occurring with the word "deep". We hope that the score for the word "learning" is close to 1 (because if we see the word "deep", there's a high chance that the word "learning" is nearby in a corpus). Wouldn't it be great if we could turn this list of scores into a list of probabilities? This list of probabilities would tell us the probability of each word in the vocabulary occurring in the context of the word "deep". Since all the scores are between -1 and 1, let's exponentiate them to turn them into positive numbers (because ${e^{x}}$ is always positive, and ${x \lt y \implies e^{x} \lt e^{y}}$, which makes sure larger scores turn into larger positive numbers). To squeeze these numbers between 0 and 1, we divide each entry in the list by the sum of all the entries. This way all the numbers sum to 1, which makes them a valid probability distribution. The sequence of operations we just did to convert scores to probabilities is collectively called the Softmax operation.
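In code, the Softmax operation is just a couple of lines; the matrix `U` and the vector `v_deep` mentioned in the comment are assumed to exist already:

```python
import numpy as np

def softmax(scores):
    """Turn a vector of scores into a probability distribution."""
    exp_scores = np.exp(scores)           # exponentiate: everything becomes positive
    return exp_scores / exp_scores.sum()  # divide by the sum: entries now sum to 1

# Assuming U is a |V| x d matrix whose rows are the vectors of every word in the
# vocabulary, and v_deep is the vector of the word "deep":
#     probabilities_deep = softmax(U @ v_deep)
```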

Let's write down what we did in mathematical notations.

Let ${v}$ denote the word vector of the word "deep" and ${u_{1}, u_{2}, ..., u_{|V|}}$ be the word vectors of every word in the vocabulary. (If you're wondering why we're using two letters ${u}$ and ${v}$ for word vectors, it's because the roles of the words are different: ${v}$ denotes the word vector of the center word and ${u}$ denotes the word vectors of context words.)

  1. Take the dot product of each ${u_{i}}$ with ${v}$ and collect the results in a list called ${scores_{deep}}$.

    ${scores_{deep} = \left[v.u_{1},\space v.u_{2}, ..., \space v.u_{|V|}\right]}$.

  2. Exponentiate each entry in ${scores_{deep}}$

    ${scores_{deep}} = {\left[exp(v.u_{1}),\space exp(v.u_{2}), ..., \space exp(v.u_{|V|})\right]}.$

    Store the sum of all the entries of ${scores_{deep}}$ in a variable 'Sum'

    Sum = ${exp(v.u_{1})\, +\, exp(v.u_{2})\, +\, ... \,+ \,exp(v.u_{|V|}) = \normalsize\Sigma_{\small i=1}^{\small|V|} \small exp(v.u_{i})}$

  3. Divide each entry by Sum and rename the list to $probabilities_{deep}$, because the entries of the list now convey the probabilities of words occurring alongside "deep".

    ${probabilities_{deep}} = {\left[\large\frac{exp(v.u_{1})}{Sum},\space \frac{exp(v.u_{2})}{Sum}, ..., \space \frac{exp(v.u_{|V|})}{Sum}\right]}.$

If we want the probability of the word "aardvark" occurring in the context of "deep", we just index the first element of the list $probabilities_{deep}$ to get our answer, which is ${\frac{\normalsize exp(v.u_{1})}{\normalsize Sum}}$. This quantity is basically $P( aardvark | deep)$, i.e. the probability of seeing the word "aardvark" given that we have just seen the word "deep". A list like $probabilities_{deep}$ is called a "probability distribution" over all the words in the vocabulary.

Now that we've worked out a way to convert dot-product to probability let's see how we can design good word vectors for our words.

Let " A blog about word vectors" be a string of words in our big corpus. Since words "A", "about", "word" and "vectors" are in context of word "blog", we would want $P(a, about, word, vectors | blog)$ to be close to 1.

Let’s make a “naive” assumption that context words are independent of each other given the center word.

Then, under this assumption, the joint probability factorizes into a product of individual conditional probabilities:

$P(a, about, word, vectors | blog) = P(a | blog)\times P(about | blog)\times P(word| blog)\times P(vectors| blog)$. Let’s denote this quantity by $L$. Our objective will be to bring $L$ closer to $1$.

$L = P(a | blog)\times P(about | blog)\times P(word| blog)\times P(vectors| blog)$

We have already devised a way to calculate each of the four factors above.

$L = \Large\frac{exp(u_{a}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})}\times \Large\frac{exp(u_{about}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})}\times \Large\frac{exp(u_{word}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})}\times \Large\frac{exp(u_{vectors}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})}$

When calculating $L$ on our computer, we encounter one problem: $L$ is a product of numbers which are between 0 and 1. Multiplying many such numbers together leads to underflow errors. To circumvent this, we instead maximize $log L$. Since $log$ is monotonically increasing, maximizing $log L$ is the same as maximizing $L$, and taking the log turns a product of numbers between $0$ and $1$ into a sum of negative numbers, which our computer can handle.

Writing our new objective down:

$logL = log\frac{exp(u_{a}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})} + log\frac{exp(u_{about}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})} + log\frac{exp(u_{word}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})} + log\frac{exp(u_{vectors}.v_{blog})}{\Sigma_{i=1}^{|V|} exp(u_{i}.v_{blog})}$

Now that we've worked out a method to calculate our objective using an example, let's try to write it in more general terms.

Since we are going through an entire corpus word by word, predicting the context of every word, we will encounter every word in two roles: one in which it is the center word, and another in which it is a context word for some other word. Let's use ${v_{w}}$ to denote the embedding of a word ${w}$ when it acts as a center word and ${u_{w}}$ to denote its embedding when it acts as a context word.

Let the size of the context window be c, which means that for each word we consider the c words to its left and the c words to its right as its context. Let $w_{i}$ be the center word and ${v_{w_i}}$ its word vector. Let ${u_{w_{i+j}}}$ be the context vector of the word $j$ positions away from ${w_{i}}$, and ${u_{k}}$ the context vector of the word at the $k^{th}$ position in the dictionary.

Then our objective for this particular word is :

$logL_{\large w_{i}} = \sum_{j=-c,\ j\neq 0}^{c} logP(w_{i+j}| w_{i}) = \sum_{j=-c,\ j\neq 0}^{c} log\Large\frac{exp(u_{w_{i+j}}.v_{w_i})}{\sum_{k=1}^{|V|}exp(u_{k}.v_{w_i})}$

Simplifying further,

$logL_{w_{i}}= \normalsize\sum_{j=-c,\ j\neq 0}^{c}u_{w_{i+j}}.v_{w_i} - 2c\,log\sum_{k=1}^{|V|}exp(u_{k}.v_{w_i})$

Instead of maximizing $logL_{w_{i}}$, let's minimize $-logL_{w_{i}}$. Both are essentially the same operation, but we choose to go the latter way because now we can refer to $-logL_{w_{i}}$ as the $Loss$, which is appealing intuitively (we can say that we are minimising the $Loss$).

So, $Loss =-\normalsize\sum_{j=-c,\ j\neq 0}^{c}u_{w_{i+j}}.v_{w_i} + 2c \,log\sum_{k=1}^{|V|}exp(u_{k}.v_{w_i})$
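As a sanity check, here is a direct (and deliberately naive) translation of this loss into NumPy; `v_center`, `context_indices` and `U` are assumed inputs, and a real implementation would subtract the maximum score before exponentiating to stay numerically stable:

```python
import numpy as np

def skipgram_loss(v_center, context_indices, U):
    """Loss for one center word, as derived above.

    v_center        : d-dimensional vector of the center word (v_{w_i})
    context_indices : indices of its 2c context words in the vocabulary
    U               : |V| x d matrix whose k-th row is u_k
    """
    scores = U @ v_center                        # u_k . v_{w_i} for every word k
    log_sum = np.log(np.sum(np.exp(scores)))     # log sum_k exp(u_k . v_{w_i})
    return -np.sum(scores[context_indices]) + len(context_indices) * log_sum
```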

The problem we face now is: how do we minimize it? We can see that the $Loss$ depends on the vectors $u_{1},...,u_{|V|}$ and $v_{w_i}$. We can tinker with these to minimize the $Loss$. The way we do that is by using an algorithm called Gradient Descent.

Suppose we want to modify ${v_{w_i}}$ to decrease the $Loss$. According to gradient descent, we should first take the gradient of the $Loss$ with respect to $v_{w_i}$, which we denote by $\large\frac{\partial Loss}{\partial v_{w_i}}$. Then we modify $v_{w_i}$ using the following rule:

$v_{w_i} \leftarrow v_{w_i} - \alpha\frac{\partial Loss}{\partial v_{w_i}}$ where $\alpha$ is called the 'learning rate'.

We can modify $u_{1},...,u_{|V|}$ similarly.
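If gradient descent itself feels abstract, here is the update rule applied to a toy one-dimensional function; it has nothing to do with words, but it shows how repeatedly stepping against the gradient drives the objective down:

```python
# Gradient descent on a toy function f(x) = x**2, whose gradient is 2*x.
x = 5.0          # arbitrary starting point
alpha = 0.1      # learning rate

for step in range(100):
    grad = 2 * x             # gradient of f at the current x
    x = x - alpha * grad     # the update rule from above

print(x)         # very close to 0, the minimiser of f
```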

Summary:

For every word in large corpus of words:

  1. Collect the context words of that word.
  2. Calculate the $Loss$ using the word vectors of the context words and the word itself.
  3. Modify those word vectors using gradient descent.

Hopefully, at the end of this procedure, we will have satisfactory word vectors.
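Below is a bare-bones sketch of this whole loop in NumPy. It follows the loss we derived literally, which means it recomputes a softmax over the full vocabulary for every word; real implementations rely on tricks like negative sampling to avoid this. The hyperparameter values are arbitrary placeholders:

```python
import numpy as np

def train_skipgram(corpus_tokens, word_to_index, d=50, c=2, alpha=0.05, epochs=1):
    """Bare-bones sketch of the procedure summarised above."""
    V = len(word_to_index)
    rng = np.random.default_rng(0)
    center_vecs = rng.normal(scale=0.1, size=(V, d))    # v_w : vectors for center words
    context_vecs = rng.normal(scale=0.1, size=(V, d))   # u_w : vectors for context words

    for _ in range(epochs):
        for pos, word in enumerate(corpus_tokens):
            i = word_to_index[word]
            # indices of the (up to) 2c context words around position `pos`
            window = [word_to_index[corpus_tokens[p]]
                      for p in range(max(0, pos - c), min(len(corpus_tokens), pos + c + 1))
                      if p != pos]
            if not window:
                continue
            v = center_vecs[i]
            probs = np.exp(context_vecs @ v)
            probs /= probs.sum()                         # softmax over the whole vocabulary

            # Gradients of the Loss derived above w.r.t. v_{w_i} and every u_k
            grad_v = -context_vecs[window].sum(axis=0) + len(window) * (context_vecs.T @ probs)
            grad_u = len(window) * np.outer(probs, v)
            for j in window:
                grad_u[j] -= v

            center_vecs[i] = v - alpha * grad_v          # gradient-descent updates
            context_vecs -= alpha * grad_u

    return center_vecs, context_vecs
```

After training, the rows of `center_vecs` (normalised to unit length if we want dot products to behave like cosine similarity) can serve as our word vectors.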

Continuous bag of words:

Now, let's try to go the other way around and predict a word using its context words. Let $w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i}, w_{i+1},...,w_{i+c}$ be a string of words from our corpus. This time the quantity we are trying to maximize is $P( w_{i} | w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i+1},...,w_{i+c})$, where $w_{i}$ is any word in the corpus and $w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i+1},...,w_{i+c}$ are the $2c$ context words in its left and right context.

How do we express $P( w_{i} | w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i+1},...,w_{i+c})$ in terms of the word vectors of these words? First, we can build a single vector out of all the context word vectors by averaging them together. Let that single vector be denoted by $v_{w_{i}}$.

$v_{w_{i}} = \frac{v_{w_{i-c}}\,+\,...\,+\,v_{w_{i-1}}\,+\,v_{w_{i+1}}\,+\,...\,+\,v_{w_{i+c}}}{2c}$. Let $u_{w_{i}}$ be the word vector for the word $w_{i}$. Then, we can calculate $P( w_{i} | w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i+1},...,w_{i+c})$

as follows:

$P( w_{i} | w_{i-c}, w_{i-c+1},..., w_{i-1},w_{i+1},...,w_{i+c}) = \frac{\large exp(u_{w_{i}}.v_{w_{i}})}{\large\sum_{k=1}^{|V|}exp(u_{k}.v_{w_{i}})}$. Same as before, we try to minimize the negative log of this quantity, which we have aptly named $Loss$.

$Loss = -log\frac{exp(u_{w_{i}}.v_{w_{i}})}{\sum_{k=1}^{|V|}exp(u_{k}.v_{w_{i}})} = -u_{w_{i}}.v_{w_{i}} + log\sum_{k=1}^{|V|}exp(u_{k}.v_{w_{i}})$

We now modify the word vectors by gradient descent as described above.

Summary:

For every word in large corpus of words:

  1. Treat that word as a blank.
  2. Collect its context words and take the average of their word vectors.
  3. Calculate the $Loss$.
  4. Modify the word vectors using gradient descent.

Hopefully, at the end of this procedure, we will have satisfactory word vectors.
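For completeness, here is a sketch of a single CBOW update, mirroring the skip-gram sketch above; the gradients follow directly from the $Loss$ formula, the matrix names `U` and `V_ctx` are assumptions, and again no efficiency tricks are applied:

```python
import numpy as np

def cbow_step(center_index, context_indices, U, V_ctx, alpha=0.05):
    """One gradient-descent step of the CBOW objective for a single blank.

    U     : |V| x d matrix of u_w vectors (the words being predicted)
    V_ctx : |V| x d matrix of v_w vectors (the words acting as context)
    """
    v_avg = V_ctx[context_indices].mean(axis=0)   # average of the 2c context vectors
    probs = np.exp(U @ v_avg)
    probs /= probs.sum()                          # P(word | context) for every word

    # Loss = -u_center . v_avg + log sum_k exp(u_k . v_avg); its gradients:
    grad_v_avg = -U[center_index] + U.T @ probs
    grad_U = np.outer(probs, v_avg)
    grad_U[center_index] -= v_avg

    U -= alpha * grad_U
    # The averaged vector spreads its gradient equally over the context words
    # (np.subtract.at handles repeated context indices correctly).
    np.subtract.at(V_ctx, context_indices, alpha * grad_v_avg / len(context_indices))
    return U, V_ctx
```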