Figure: 2-dimensional word embeddings example (source)

Representing linguistic objects (such as words and documents) as numeric vectors in a vector space allows these objects to be used in statistical and machine-learning models, and for them to be transformed and manipulated algebraically.


Word (or document) embeddings are a particular type of numeric vector representation: a dense representation (i.e. one in which most elements are non-zero) which aims to capture the semantic meaning of a word (or the content of a document) in a low-dimensional space. This dense representation better captures the linguistic relationships between words (or documents), which tends to make it better suited to natural language applications. A dense low-dimensional representation is also computationally much more efficient than a sparse representation such as one-hot encoding, and requires less memory to store.

Here I go through a conceptual explanation of various word and document embedding methods, and show implementations of some of them in Python using GenSim for a similar-product recommendation task. I intend to keep adding algorithms, information and code to this post in future. Many novel and powerful algorithms are being published in the natural language domain all the time; I’ve made a list of some of them here.

Bag-Of-Words Representation of a Document

A Bag-Of-Words (BOW) representation of a piece of text is simply a count of word occurrences:

| raw text (document) | because | being | but | dreams | feet | have | I | my | on | only | poor | softly | spread | tread | under | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| But I, being poor, have only my dreams | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| I have spread my dreams under your feet | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| Tread softly because you tread on my dreams | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 1 | 0 |
| My Dreams, softly I have my dreams | 0 | 0 | 0 | 2 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Your feet! Your feet tread on my poor dreams | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |

This table is called a document-term matrix. Each row is a document and each column is a term (a word).

For example, the Bag-Of-Words representation of the piece of text “My Dreams, softly I have my dreams” is the numeric vector

\[\Big[0,0,0,2,0,1,1,2,0,0,0,1,0,0,0,0,0\Big]\]

The number of columns in the document-term matrix is equal to the size of the vocabulary in the whole text corpus i.e. the number of unique words across all documents.

Notice that word order is not captured by the Bag-Of-Words representation, nor are synonyms (different words with similar meanings) or homonyms (words with the same spelling but different meanings).

Each column of the document-term matrix is a numeric vector representation (embedding) of a word, describing the word using its distribution over documents. Each row of the document-term matrix is a numeric vector representation (embedding) of a document, describing the document using its distribution over words.
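
To make this concrete, here is a minimal sketch that builds the document-term matrix above using GenSim (the library used for the implementations later in this post). The tokenisation here is deliberately naive and just illustrative.

```python
from gensim.corpora import Dictionary

documents = [
    "But I, being poor, have only my dreams",
    "I have spread my dreams under your feet",
    "Tread softly because you tread on my dreams",
    "My Dreams, softly I have my dreams",
    "Your feet! Your feet tread on my poor dreams",
]

# naive tokenisation: lowercase and strip surrounding punctuation
tokenized = [
    [word.strip(".,!") for word in doc.lower().split()]
    for doc in documents
]

dictionary = Dictionary(tokenized)                       # maps each unique word to an integer id
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# each document is stored as a sparse list of (word_id, count) pairs
print(dictionary.token2id)
print(bow_corpus[3])   # Bag-Of-Words vector for "My Dreams, softly I have my dreams"
```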

TF-IDF and Other Term-Weighting Schemes

Rather than simply using the raw word counts in the sparse Bag-Of-Words representation of documents described in the previous section, certain words can be given greater importance (weight). Terms (words) given greater weight have more influence in calculations of distance between documents. The weighting scheme is entirely under the control of the user, and can be tailored to any particular situation.

A popular weighting approach is TF-IDF, which up-weights words which are common within the document and down-weights words which are common across all documents:

\[\underset{\text{in document } i}{\underbrace{\text{weight of term } j}} \quad=\quad \underset{\text{prevalence of term } j \text{ in document } i}{\underbrace{\text{Term Frequency}}} \quad\times\quad \underset{\text{rarity of term } j \text{ over all documents}}{\underbrace{\text{Inverse Document Frequency}}}\]

There are various different choices for the TF and IDF components of the calculation. The Wikipedia TF-IDF page provides a good selection.
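
Continuing the sketch above, GenSim's TfidfModel applies one such weighting scheme to a Bag-Of-Words corpus (it uses GenSim's default choice of TF and IDF variants, which can be changed through the model's parameters):

```python
from gensim.models import TfidfModel

# bow_corpus and dictionary as built in the Bag-Of-Words sketch above
tfidf = TfidfModel(bow_corpus)                  # learns document frequencies from the corpus
tfidf_corpus = [tfidf[bow] for bow in bow_corpus]

# words common to many documents (e.g. "my", "dreams") receive low weights,
# while words confined to few documents receive high weights
for word_id, weight in tfidf_corpus[0]:
    print(dictionary[word_id], round(weight, 3))
```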

Word Embedding Methods

Word2Vec Word Embeddings

The Word2Vec model generates word embeddings by modelling the relationship between a word and its context (context meaning the set of other words which surround it).
A single training example for the Word2Vec model consists of a target word and a set of surrounding context words:

\[\begin{array}{lcl} \text{example with context window of size 2:} &\space& \text{I have measured} \underset{\text{context words}}{ \underbrace{\text{out my}}} \overset{\text{target word}}{\overbrace{\text{life}}} \underset{\text{context words}}{ \underbrace{\text{with coffee}}} \text{ spoons.} \\ \text{target word} &=& \{\text{life}\} \\ \text{context words} &=& \{ \text{out,my,with,coffee} \} \\ \end{array}\]

There are two variants of the Word2Vec model:

  1. The Skip-Gram Word2Vec model predicts what the context words are, given a target word as input.

  2. The Continuous Bag of Words (CBOW) Word2Vec model predicts what the target word is, given a set of context words as input.

The Skip-Gram model generates the word vector representations via maximisation of the following objective function: (source)

\[\begin{array}{lcl} \text{skip-gram objective function} &=& \text{average total log probability (under the model) that the true context words } \\ &\space& \quad w_{t-c},...,w_{t-1},w_{t+1},...,w_{t+c} \quad \text{ are in the context of target word } w_t \\ &=& \displaystyle\frac{1}{T}\sum_{t=1}^T \sum_{-c\leq j \leq c, j\neq0} log \space p\Big(w_{t+j}\Bigl|w_t\Big) \\ &=& \displaystyle\frac{1}{T}\sum_{t=1}^T \sum_{-c\leq j \leq c, j\neq0} log \space \displaystyle\frac{exp\Big({v'_{t+j}}^T v_t \Big)}{\sum_{w=1}^W exp\Big({v'_w}^T v_t \Big)} \hspace{20mm} \text{i.e. } p(\cdot|\cdot) \text{ is the softmax function} \\ w_i &=& \text{word } i \\ v_i &=& \text{vector representation of target word } i \\ v'_i &=& \text{vector representation of context word } i \\ &\space& \quad \text{i.e. a word has a different vector representation as a target word and as a context word} \\ T &=& \text{number of target words } (w_1,w_2,...,w_T) \text{ in training data} \\ W &=& \text{number of words in vocabulary (i.e. over all text in training data)} \\ c &=& \text{size of training context (number of context words to left and right of target word)} \\ \end{array}\]

This objective is computationally impractical to optimise in reality, and approximations of it such as Hierarchical Softmax and Negative Sampling are used instead (there are some details in this paper).

The word vectors are typically optimised via gradient descent using a neural network.

The objective function for the Continuous Bag of Words model is very similar, except that the target word is predicted based on the context words.
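
For reference, here is a minimal sketch of training a Skip-Gram Word2Vec model with GenSim (parameter names follow GenSim 4.x; the tiny corpus from the earlier sketches and the parameter values are purely illustrative, not tuned):

```python
from gensim.models import Word2Vec

# tokenized corpus from the Bag-Of-Words sketch above (a real corpus would be far larger)
model = Word2Vec(
    sentences=tokenized,
    vector_size=50,   # dimension of the word embeddings
    window=2,         # context window size (c in the objective above)
    sg=1,             # 1 = Skip-Gram, 0 = Continuous Bag of Words (CBOW)
    negative=5,       # use Negative Sampling rather than the full softmax
    min_count=1,
    epochs=100,
)

print(model.wv["dreams"])                # embedding (target-word vector) for "dreams"
print(model.wv.most_similar("dreams"))   # nearest words by cosine similarity
```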

FastText Word Embeddings

The FastText algorithm (original paper) is based on the Word2Vec skip-gram model, except that the model learns vector representations for parts of words (character n-grams) rather than only for complete words. Words are then represented as the sum of their constituent n-grams.

Each word in the training corpus contributes multiple n-grams to the training data (\(n\) is a hyperparameter chosen by the user), for example:

\[\begin{array}{lcl} \text{original word: } \\ \hspace{30mm} \text{<where>} \\ \text{constituent 3-grams: } \\ \hspace{30mm} \text{<wh>} \hspace{20mm}\text{(the 2-gram at beginning of the word is included in the n-grams)}\\ \hspace{30mm} \text{<whe>} \\ \hspace{30mm} \text{<her>} \\ \hspace{30mm} \text{<ere>} \\ \hspace{30mm} \text{<re>} \hspace{20mm}\text{(the 2-gram at end of the word is included in the n-grams)}\\ \hspace{30mm} \text{<where>} \hspace{20mm}\text{(the complete word is included in its n-grams)}\\ \end{array}\]

A notable strength of the FastText algorithm is that it can provide meaningful word embeddings for words not seen during model training. For example, it could provide a word embedding for the word “nightly” even if the word “nightly” were not in the training data, but the words “night” and “rightly” were (e.g. it could learn 4-grams \(<nigh>,<ight>,..\) from “night”, and 4-grams \(<ight>,<ghtl>,<htly>,..\) from “rightly”). If n-gram(s) that are required in order to generate a word embedding are completely missing from the training data, then the FastText algorithm can still provide a useful (if possibly noisy/lossy) word embedding for that word by simply ignoring the missing n-gram(s), and summing over the available learned n-gram(s).
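
A corresponding sketch for FastText in GenSim follows; min_n and max_n control the range of n-gram lengths, and the corpus and parameter values are again illustrative only:

```python
from gensim.models import FastText

# tokenized corpus from the earlier sketches (a real corpus would be far larger)
model = FastText(
    sentences=tokenized,
    vector_size=50,
    window=2,
    min_count=1,
    min_n=3,          # smallest n-gram length
    max_n=6,          # largest n-gram length
    epochs=100,
)

# a word never seen during training still gets an embedding,
# built from the vectors of its constituent character n-grams
print(model.wv["dreamt"])
```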

Document Embedding Methods

Simple Averaging of Word Vectors

A simple way to generate a document embedding (a numeric vector representation of a document) is to average the word vector representations of all of the words in the document (using any word embedding method):

\[\begin{array}{lcl} D &=& \displaystyle\frac{1}{n} \begin{bmatrix} w_{11} \\ w_{21} \\ . \\ . \\ w_{d1} \\ \end{bmatrix} + \displaystyle\frac{1}{n} \begin{bmatrix} w_{12} \\ w_{22} \\ . \\ . \\ w_{d2} \\ \end{bmatrix} + \quad . . . \quad + \displaystyle\frac{1}{n} \begin{bmatrix} w_{1n} \\ w_{2n} \\ . \\ . \\ w_{dn} \\ \end{bmatrix} \\ n &=& \text{number of words in document} \\ d &=& \text{dimension of word embedding} \\ \end{array}\]

A word-weighting scheme such as TF-IDF could be used to perform a weighted average.

This method has the advantage of simplicity and speed, although (as with all methods involving means) it is sensitive to outliers, and vectors pointing in opposite directions cancel each other out. Despite this, in practice I’ve seen it work very well.
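
A minimal sketch of this averaging, assuming a set of pretrained word vectors loaded through GenSim's downloader API (the model name "glove-wiki-gigaword-100" is the standard pretrained GloVe set which I believe corresponds to the wiki-100 vectors used later in this post; out-of-vocabulary words are simply skipped):

```python
import numpy as np
import gensim.downloader as api

word_vectors = api.load("glove-wiki-gigaword-100")   # pretrained 100-dimensional GloVe vectors

def document_embedding(tokens, keyed_vectors):
    """Average the word vectors of all in-vocabulary words in the document."""
    vectors = [keyed_vectors[w] for w in tokens if w in keyed_vectors]
    if not vectors:
        return np.zeros(keyed_vectors.vector_size)
    return np.mean(vectors, axis=0)

doc_vec = document_embedding("i have spread my dreams under your feet".split(), word_vectors)
print(doc_vec.shape)   # (100,)
```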

Latent Semantic Indexing/Analysis (LSI/LSA)

Singular Value Decomposition (SVD) is a mathematical technique in which a real-valued matrix \(\mathbf{A}\) is decomposed into a product of 3 separate matrices (each of which has specific properties):

\[\underset{m\times n}{\mathbf{A}} \quad=\quad \underset{m\times m}{\mathbf{U}} \hspace{2mm} \underset{m\times n}{\mathbf{\Sigma}} \hspace{2mm} \underset{n\times n}{\mathbf{V}^T} \\\]

This decomposition can be used to obtain an optimal low-rank approximation \(\hat{\mathbf{A}}\) of the original matrix \(\mathbf{A}\) (optimal in the squared error sense). This approximating matrix \(\hat{\mathbf{A}}\) has the same dimensions as the original matrix, but has additional restrictions on its structure. Specifically, the number of linearly independent columns in \(\hat{\mathbf{A}}\) is equal to the rank of the approximation.
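
A small numpy sketch of this truncation: keeping only the \(k\) largest singular values gives the optimal rank-\(k\) approximation in the squared-error (Frobenius-norm) sense:

```python
import numpy as np

A = np.random.rand(6, 5)                       # stand-in for a document-term matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                           # rank of the approximation
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of A

print(np.linalg.matrix_rank(A_hat))             # 2
print(np.linalg.norm(A - A_hat))                # residual error of the rank-2 approximation
```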

Latent Semantic Analysis (also called Latent Semantic Indexing) is the application of Singular Value Decomposition (SVD) to a document-term matrix (i.e. this matrix) in order to force dependency between words (the additional restrictions on the structure of the approximation \(\hat{\mathbf{A}}\) force the representation to exploit the semantic relationships between words as it tries to recreate the original matrix \(\mathbf{A}\)). The structure of the low-rank approximation of the original document-term matrix using SVD has a very elegant interpretation: words and documents are expressed as linear combinations of latent underlying topics/concepts:

(Figure source: Latent Semantic Indexing, https://www.youtube.com/watch?v=M1duqgg8-IM)

A term-weighting scheme such as TF-IDF can be applied to the document-term matrix before applying Latent Semantic Analysis.

There are many alternatives to SVD for factorising the Document-Term Matrix, for example Probabilistic Latent Semantic Analysis and Non-Negative Matrix Factorisation.
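
A minimal GenSim sketch of Latent Semantic Indexing applied to the TF-IDF corpus from the earlier sketches (num_topics, the rank of the approximation, is chosen purely for illustration):

```python
from gensim.models import LsiModel

# tfidf_corpus and dictionary as built in the TF-IDF sketch above
lsi = LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=2)

# each document is now a dense 2-dimensional vector of topic weights (its embedding)
lsi_corpus = [lsi[doc] for doc in tfidf_corpus]
print(lsi_corpus[0])

# each latent topic is a weighted combination of words
print(lsi.print_topics(num_topics=2))
```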

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) assumes that the corpus of documents was generated by a probabilistic (hierarchical Bayesian) model (here is the original paper).

The LDA model assumes that each document \(\mathbf{w}=\big\{w_1,w_2,...,w_N\big\}\) in the corpus was created by the following probabilistic process:

\[\begin{array}{lcl} 1. &\space& \text{choose } N \sim Poisson\big(\xi\big) \hspace{16mm}\text{(choose number of words in document) } \\ 2. &\space& \text{choose } \theta \sim Dirichlet\big(\alpha\big) \hspace{15mm}\text{(choose distribution of topics in document)} \\ 3. &\space& \text{for each of the } N \text{ words } w_n \text{ in the document:} \\ &\space& \hspace{10mm} (a) \quad \text{choose a topic } z_n \sim Multinomial\big(\theta\big) \\ &\space& \hspace{10mm} (b) \quad \text{choose a word } w_n \text{ from } p\big(w_n\bigl|z_n,\beta \big) \hspace{10mm} (\text{i.e. from the distribution of words in topic } z_n ) \\ &\space& \\ N &=& \text{number of words in document} \\ \theta &=& k-\text{dimensional vector of parameters of multinomial distribution (distribution of topics in document)} \\ k &=& \text{total number of topics/concepts (assumed known - specified by user)} \\ \beta &=& k\times V \text{ matrix of word probabilities } \\ &\space& \text{(row-wise, gives the distribution of words in each topic)} \\ \beta_{ij} &=& \text{entry in row } i \text{ column } j \text{ of matrix } \beta \\ &=& \text{probability of word } j \text{ given topic } i \\ &=& p(w^{(j)}=1|z^{(i)}=1) \\ V &=& \text{size of vocabulary (count of all unique words across all documents)}\\ \end{array}\]

Given this generative process, posterior distributions of the model parameters can be computed from the observed data (the corpus of documents) using Bayes’ rule.

Here are some excerpts from the paper:

  • “The parameters \(\alpha\) and \(\beta\) are corpus-level parameters, assumed to be sampled once in the process of generating a corpus.” (i.e. they are sampled once, and then are used to generate all of the documents).

  • “The variables \(\theta_d\) are document-level variables, sampled once per document.”

  • “The variables \(z_{dn}\) and \(w_{dn}\) are word-level variables and are sampled once for each word in each document.”

Posterior estimates of the distribution of topics within a particular document (\(\theta\)) provide a document embedding for that document (i.e. a numeric vector representation of that document). In other words, documents containing similar mixtures of topics appear closer to one another in the embedding space.

Another interesting application of LDA is collaborative filtering, where the LDA algorithm becomes a recommender system for proposing items to users. This is achieved by modelling a user-item matrix rather than a document-term matrix. Each user is modelled as a mixture/distribution of product categories.
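
A minimal GenSim sketch of LDA, in which the inferred topic distribution of each document is used as its embedding (num_topics, passes and random_state are illustrative choices on the tiny corpus from the earlier sketches):

```python
from gensim.models import LdaModel

# bow_corpus and dictionary as built in the Bag-Of-Words sketch above
lda = LdaModel(
    corpus=bow_corpus,
    id2word=dictionary,
    num_topics=2,      # k: total number of topics, specified by the user
    passes=50,         # passes over the (tiny) corpus
    random_state=42,
)

# posterior topic distribution (theta) for a document = its document embedding
doc_embedding = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)
print(doc_embedding)     # e.g. [(0, 0.91), (1, 0.09)]

# the distribution of words within each topic (the rows of beta)
print(lda.print_topics())
```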

Measuring Distance/Similarity Between Embeddings

Because the Bag-Of-Words (BOW) (or any other embedding) method represents pieces of text as numeric vectors, metrics which measure the similarity/distance between numeric vectors can be used to measure how close any 2 given texts are to one another. Here are some common options:

\[\begin{array}{lcl} \overset{\rightarrow}{p} &=& \Big[p_1,p_2,...,p_k\Big] \\ \overset{\rightarrow}{q} &=& \Big[q_1,q_2,...,q_k\Big] \\ \end{array}\]

Cosine Similarity:

\[s_{\text{cosine}}\Big(\overset{\rightarrow}{p},\overset{\rightarrow}{q}\Big) \quad=\quad \displaystyle\frac{\overset{\rightarrow}{p}\cdot\overset{\rightarrow}{q}}{||\overset{\rightarrow}{p}||\hspace{2mm}||\overset{\rightarrow}{q}||} \quad\in\quad [-1,1] \]

Euclidean Distance:

\[d_{\text{euclidean}}\Big(\overset{\rightarrow}{p},\overset{\rightarrow}{q}\Big) \quad=\quad \sqrt{ \displaystyle\sum_{i=1}^k \Big(p_i-q_i\Big)^2 } \quad\in\quad \mathbb{R}_{\geq0}\\\]
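
Both measures are straightforward to compute with numpy; the sketch below applies them to two Bag-Of-Words rows from the earlier document-term matrix:

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between vectors p and q, in [-1, 1]."""
    return np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def euclidean_distance(p, q):
    """Straight-line distance between vectors p and q."""
    return np.linalg.norm(p - q)

p = np.array([0, 0, 0, 2, 0, 1, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)  # "My Dreams, softly I have my dreams"
q = np.array([0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # "But I, being poor, have only my dreams"

print(cosine_similarity(p, q))
print(euclidean_distance(p, q))
```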


Word Mover’s Distance

Word Mover’s Distance is a measure of the dissimilarity between 2 documents, based on the amount of “work” required to convert one document into the other. It is based on Earth Mover’s Distance (link to original paper), which is a measure of the dissimilarity between 2 multivariate probability distributions.

Conceptually, if a histogram of a distribution \(X\) is considered as piles of dirt within bins, then the Earth Mover’s Distance between \(X\) and a distribution \(Y\) is the total amount of work required to move all of that dirt into a new set of empty bins so that the new arrangement of dirt is the distribution \(Y\) (using the optimal dirt-moving strategy). Note that this amount of work is a function both of the amount of dirt that is moved and of the distance that the dirt is moved, e.g. it takes less work to move dirt (density) between adjacent bins than to move the same amount of dirt (density) between bins that are far apart.

Earth Mover’s Distance is calculated by solving the following linear program:

\[\begin{array}{lcl} P &=& \Big\{ (p_1,w_{p_1}), (p_2,w_{p_2}), ..., (p_m,w_{p_m}) \Big\} \hspace{20mm} (\text{source distribution}) \\ Q &=& \Big\{ (q_1,w_{q_1}), (q_2,w_{q_2}), ..., (q_n,w_{q_n}) \Big\} \hspace{20mm} (\text{target distribution}) \\ p_i &=& \text{label of histogram bin } i \text{ of source distribution} \\ w_{p_i} &=& \text{amount of density ("dirt") in bin } p_i \text{ of source distribution} \\ q_j &=& \text{label of histogram bin } j \text{ of target distribution} \\ w_{q_j} &=& \text{amount of density ("dirt") in bin } q_j \text{ of target distribution} \\ d_{ij} &=& \text{ground distance between bin } p_i \text{ of source distribution and bin } q_j \text{ of target distribution} \\ f_{ij} &=& \text{amount of density ("dirt") moved from bin } p_i \text{ of source distribution to bin } q_j \text{ of target distribution} \\ &\space& \\ \\ &\space& \text{minimize the overall cost } \hspace{5mm} WORK(P,Q,F)=\displaystyle\sum_{i=1}^m\sum_{j=1}^n d_{ij} f_{ij} \hspace{5mm} \text{ subject to:} \\ f_{ij} &\geq& 0 \hspace{20mm} 1\leq i\leq m,\space 1\leq j \leq n \hspace{20mm} \text{ can only move a positive amount of density ("dirt") from bin } p_i \text{ to bin } q_j \\ \displaystyle\sum_{j=1}^n f_{ij} &\leq& w_{p_i} \hspace{20mm} 1 \leq i \leq m \hspace{20mm} \text{can't move more density ("dirt") out of source bin } p_i \text{ than the total density ("dirt") in source bin } p_i \\ \displaystyle\sum_{i=1}^m f_{ij} &\leq& w_{q_j} \hspace{20mm} 1 \leq j \leq n \hspace{20mm} \text{can't move more density ("dirt") into target bin } q_j \text{ than the total required density ("dirt") in target bin } q_j \\ \displaystyle\sum_{i=1}^m\sum_{j=1}^n f_{ij} &=& \text{min}\Big( \displaystyle\sum_{i=1}^m w_{p_i}, \space \displaystyle\sum_{j=1}^n w_{q_j} \Big) \hspace{20mm} \text{total amount of density ("dirt") moved cannot exceed the total density ("dirt") in the distribution with least density ("dirt")} \\ &\space& \hspace{60mm} \text{(although the total amount of density ("dirt") will be the same in both distributions for proper probability distributions)} \\ \text{EMD}\Big(P,Q\Big) &=& \displaystyle\frac{\text{total work}}{\text{total flow}} \\ &=& \displaystyle\frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^n d_{ij} f_{ij}}{\displaystyle\sum_{i=1}^m\sum_{j=1}^n f_{ij}} \\ \end{array}\]

This concept is illustrated in the figure below:

The metric described in the original paper is actually a little more general than this, extending beyond this histogram case.
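
To make the linear program concrete, here is a small sketch that solves it directly with scipy.optimize.linprog (purely illustrative; dedicated optimal-transport libraries solve this far more efficiently):

```python
import numpy as np
from scipy.optimize import linprog

def earth_movers_distance(w_p, w_q, d):
    """Solve the EMD linear program above.
    w_p: (m,) source bin weights; w_q: (n,) target bin weights;
    d:   (m, n) ground distances between source and target bins."""
    m, n = d.shape
    c = d.flatten()                          # objective: sum_ij d_ij * f_ij  (f flattened row-wise)

    # inequality constraints: row sums of f <= w_p, column sums of f <= w_q
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1       # sum_j f_ij <= w_p[i]
    for j in range(n):
        A_ub[m + j, j::n] = 1                # sum_i f_ij <= w_q[j]
    b_ub = np.concatenate([w_p, w_q])

    # equality constraint: total flow = min(total source weight, total target weight)
    A_eq = np.ones((1, m * n))
    b_eq = [min(w_p.sum(), w_q.sum())]

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    total_work = result.fun
    total_flow = result.x.sum()
    return total_work / total_flow

# two 2-bin histograms: 0.1 of density must move between bins a ground distance of 1 apart
w_p = np.array([0.4, 0.6])
w_q = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(earth_movers_distance(w_p, w_q, d))    # 0.1
```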

Word Mover’s Distance applies this Earth Mover’s Distance concept to documents:

  • Each document is represented by a normalised Bag-Of-Words vector (word counts normalised to sum to 1). Using the previous terminology, this is the histogram: each word in the document is a bin, and the amount of “dirt” (density) in each bin is the relative frequency of that word in the document (with word rarity factored in too, if TF-IDF weighting is used).

  • The distance between any 2 words is measured using the distance between them in a word-embedding space (any word embedding method and distance measure can be used).

Here is a simplified example from the original paper (noting that the algorithm in practice can map a single source word to multiple target words):
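
GenSim exposes Word Mover’s Distance directly on a set of word vectors via wmdistance. Here is a sketch, reusing the pretrained vectors loaded in the word-averaging sketch above; note that this method requires an optimal-transport backend to be installed, and the first two example sentences are adapted from the original paper:

```python
# word_vectors as loaded in the word-vector-averaging sketch above
doc_1 = "obama speaks to the media in illinois".split()
doc_2 = "the president greets the press in chicago".split()
doc_3 = "i have spread my dreams under your feet".split()

print(word_vectors.wmdistance(doc_1, doc_2))   # smaller distance: similar meaning, few shared words
print(word_vectors.wmdistance(doc_1, doc_3))   # larger distance: unrelated documents
```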

Document Similarity Implementations in Python using GenSim

I implemented the following four document-embedding methods on an Amazon product dataset using the GenSim Python library:

  1. Each document represented as an average over its word embeddings, using pretrained GloVe wiki-100 word embeddings

  2. TF-IDF Bag-of-Words representation

  3. Latent Semantic Indexing

  4. Latent Dirichlet Allocation

Here are 2 examples of the similar product recommendations produced:

The recommendations were obtained using only text data (I fetched the images afterward for illustration).
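
As a rough sketch of how such recommendations can be produced with GenSim (reusing the LSI model and corpus from the earlier sketches; the notebook linked below contains the actual pipeline I used):

```python
from gensim.similarities import MatrixSimilarity

# lsi, tfidf, dictionary and tfidf_corpus as built in the earlier sketches
index = MatrixSimilarity(lsi[tfidf_corpus], num_features=lsi.num_topics)

# embed a query document and rank all documents by cosine similarity to it
query_bow = dictionary.doc2bow("my poor dreams".lower().split())
query_lsi = lsi[tfidf[query_bow]]
similarities = sorted(enumerate(index[query_lsi]), key=lambda x: -x[1])
print(similarities[:3])   # indices of the 3 most similar documents and their similarity scores
```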

You can download the full code here: Document embedding implementation in GenSim (jupyter notebook)