For many recommender algorithms (algorithms for making personalised product recommendations to customers), it is convenient to represent products and customers as vectors of information.

I refer here to the context of recommending products to customers, but the frameworks discussed could just as easily be used to recommend any kind of item (e.g. movie) to any kind of item consumer (e.g. web-page user).

This page explores and illustrates the concept of describing customers and products using numeric vectors.

Suppose that our hypothetical business has the following four products to offer:

product_id   product_desc
product_1    red, green, green, green
product_2    red, blue, green
product_3    blue, blue, blue
product_4    red, red, blue

Suppose that the only attribute information that we have for these products is their colouring, i.e. the number of red, blue and green elements in each product. In real applications, each product vector could be thousands of attributes long, encoding any desirable information about that product.

Counting the occurrences of each colour in each product description gives:

product_id   product_desc               tf_red   tf_blue   tf_green
product_1    red, green, green, green   1        0         3
product_2    red, blue, green           1        1         1
product_3    blue, blue, blue           0        3         0
product_4    red, red, blue             2        1         0
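
For concreteness, here is one way to hold these term-frequency vectors in R. This is a sketch of my own (the object name tf is hypothetical), though the row labels match those used in the matrices printed further down.

# term-frequency matrix: one row per product, one column per colour
tf <- rbind(
  rggg = c(red = 1, blue = 0, green = 3),
  rbg  = c(red = 1, blue = 1, green = 1),
  bbb  = c(red = 0, blue = 3, green = 0),
  rrb  = c(red = 2, blue = 1, green = 0)
)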

Here are the four products, coded as numeric vectors:

\[(1,0,3)\]

\[(1,1,1)\]

\[(0,3,0)\]

\[(2,1,0)\]

Now that they are represented as vectors, we can visualise these four products in Euclidean space:

In the language of information retrieval, these are called term frequency vectors, since they are counts of word occurrence in the bag-of-words describing each product.

The smaller the angle \(\theta\) between any 2 product vectors, the more similar those 2 products are. Because we are only concerned with the angle, we can scale each of the product vectors to have length 1 (unit L2 norm) without changing any of the angles between them.

Here are the same product vectors in the product space, scaled to have (L2) length 1:

Notice that products with more attributes (colours) in common have a smaller angle between them.
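
This normalisation is a one-liner in R (continuing from the hypothetical tf matrix sketched above):

# scale each product vector to unit L2 length (directions, and hence angles, are unchanged)
tf_unit <- tf / sqrt(rowSums(tf^2))
rowSums(tf_unit^2)   # every row now has squared length 1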

The angles (in degrees) between the product vectors are:

##      rggg rbg bbb rrb
## rggg    0  43  90  74
## rbg    43   0  55  39
## bbb    90  55   0  63
## rrb    74  39  63   0

The largest angle possible between a pair of products is \(90^\circ\) (\(\displaystyle\frac{\pi}{2}\) radians). This occurs when 2 products have no attributes in common.

The smallest possible angle between a pair of products is \(0^\circ\). This occurs when the two product vectors point in exactly the same direction (e.g. when the two products have identical attributes).

The cosine of the angle \(\theta\) between 2 vectors gives a convenient metric for measuring the similarity between them, since (for non-negative vectors) \(\cos(\theta)\) lies within the range \([0,1]\):

\(\cos(\theta)\)   meaning
1                  the vectors point in an identical direction (e.g. they are the same vector)
0                  the vectors are orthogonal (\(90^\circ\) apart)

\(\cos(\theta)\), in this context, is called the cosine similarity.

Since, in this context, the vectors have non-negative components, only angles \(\theta\in[0^\circ,90^\circ]\) are possible.

The cosine similarities between the product vectors are:

##           rggg       rbg       bbb       rrb
## rggg 1.0000000 0.7302967 0.0000000 0.2828427
## rbg  0.7302967 1.0000000 0.5773503 0.7745967
## bbb  0.0000000 0.5773503 1.0000000 0.4472136
## rrb  0.2828427 0.7745967 0.4472136 1.0000000
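
Both of these matrices can be reproduced from the unit-length vectors (a sketch, continuing from the hypothetical tf_unit object above):

# cosine similarity = dot product of unit-length vectors
cos_sim <- tf_unit %*% t(tf_unit)
round(cos_sim, 7)

# recover the angles in degrees (pmin() guards against values just above 1 from floating-point rounding)
round(acos(pmin(cos_sim, 1)) * 180 / pi)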

Customers as Vectors

If we can represent customers as vectors in the same Euclidean space as the product vectors, then we could use the angle between a customer vector and a product vector as a measure of how appropriate (‘similar’) that product is to that customer.

For example, we might code a customer who likes the colour Red a lot, likes Blue a little, and dislikes Green as:

\[\Big(5,2,0\Big)\]

The angle between this customer and the product rbg is:

\[\begin{array}{lcl} \theta &=& \cos^{-1}\Bigg(\displaystyle\frac{(5,2,0)\cdot(1,1,1)}{|(5,2,0)|\space|(1,1,1)|}\Bigg) \\ &=& 0.722 \text{ radians} \\ &=& 41.37^\circ \\ \cos(\theta) &=& 0.75 \end{array}\]

whereas a product better suited to them might be rb \((1,1,0)\), which has an angle of

\[\begin{array}{lcl} \theta &=& \cos^{-1}\Bigg(\displaystyle\frac{(5,2,0)\cdot(1,1,0)}{|(5,2,0)|\space|(1,1,0)|}\Bigg) \\ &=& 0.405 \text{ radians} \\ &=& 23.2^\circ \\ \cos(\theta) &=& 0.92\\ \end{array}\]

between it and the customer vector.
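
As a quick check in R (the helper function angle_deg() is mine, not from the original post):

# angle in degrees between two vectors
angle_deg <- function(a, b) acos(sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))) * 180 / pi

customer <- c(5, 2, 0)
angle_deg(customer, c(1, 1, 1))   # rbg: approximately 41.37 degrees
angle_deg(customer, c(1, 1, 0))   # rb:  approximately 23.20 degrees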

One way to generate customer vectors like this (which have the same form as the product vectors) is to represent each customer as an aggregation of all of the products (vectors) that they have historically bought.

For example, suppose now that a customer has bought the following products:

Product     Red   Blue   Green   n_bought
product_2   1     1      1       1
product_4   2     1      0       1

We can represent this customer as a vector in the product-vector space by averaging over these 2 product vectors that the customer has bought:

\[\begin{array}{lcl} \overset{\rightarrow{}}{\text{customer}}_i &=& (\displaystyle\frac{1+2}{2} \space,\space \frac{1+1}{2} \space,\space \frac{1+0}{2})\\ &=& ( 1.5 \space,\space 1 \space,\space 0.5 ) \end{array}\]

Here is the customer plotted in the product-vector space:

Notice that the customer vector lies precisely in the middle of the two product vectors that they bought.

Suppose that this customer buys another of product_4, making their purchase history:

Product     Red   Blue   Green
product_2   1     1      1
product_4   2     1      0
product_4   2     1      0

The customer vector is now:

\[\begin{array}{lcl} \overset{\rightarrow{}}{\text{customer}}_i &=& (\displaystyle\frac{1+2+2}{3} \space,\space \frac{1+1+1}{3} \space,\space \frac{1+0+0}{3})\\ &=& ( 1.67 \space,\space 1 \space,\space 0.33 ) \end{array}\]
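
In R, this averaging is simply colMeans() over the purchase history (a sketch; the object names are mine):

# purchase history: one row per purchase (product_4 appears twice)
purchases <- rbind(
  c(1, 1, 1),   # product_2
  c(2, 1, 0),   # product_4
  c(2, 1, 0)    # product_4, bought again
)
customer <- colMeans(purchases)
customer   # (1.6666667, 1.0000000, 0.3333333)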

Here is the customer (under this new scenario) plotted in the customer/product Euclidean space:

Notice that by buying this product (rrb) again, this customer (vector) has shifted closer to that product (vector).

The angle (in degrees) between this customer vector and each of the product vectors is:

## $`customer angle to product vector1: (rggg)`
##          [,1]
## [1,] 64.68351
## 
## $`customer angle to product vector2: (rbg)`
##          [,1]
## [1,] 28.56083
## 
## $`customer angle to product vector3: (bbb)`
##          [,1]
## [1,] 59.52964
## 
## $`customer angle to product vector4: (rrb)`
##         [,1]
## [1,] 10.6707

The customer, naturally, is closest to the products that they actually bought.
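
The list of angles above can be reproduced in one line (reusing the hypothetical tf matrix, customer vector and angle_deg() helper from the earlier sketches):

# angle in degrees from the customer vector to each of the four product vectors
round(apply(tf, 1, function(p) angle_deg(customer, p)), 5)
##     rggg      rbg      bbb      rrb
## 64.68351 28.56083 59.52964 10.67070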

Suppose now that we want to recommend one of the following 2 products to this customer:

Product     Red   Blue   Green   angle \(\theta\) between customer & product vectors   \(\cos(\theta)\)
product_5   0     1      1       \(61^\circ\)                                           0.4780914
product_6   1     1      0       \(17^\circ\)                                           0.9561829

Because the angle between the customer vector and the product_6 vector is smaller than the angle between the customer vector and the product_5 vector, we’d recommend product_6 to them. Equivalently, we could have made this decision based on product_6 having the larger \(\cos(\theta)\).

As a reminder, the customer vector is \(\Big(1.67, 1, 0.33\Big)\).
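
The recommendation rule itself is then a one-line argmin over the candidate products (continuing the earlier sketch, reusing customer and angle_deg()):

candidates <- rbind(
  product_5 = c(0, 1, 1),
  product_6 = c(1, 1, 0)
)
angles <- apply(candidates, 1, function(p) angle_deg(customer, p))
names(which.min(angles))   # "product_6" (approximately 17 degrees vs 61 degrees)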

Inverse Document Frequency (IDF)

If some product attributes are more common than others (e.g. 80% of products have colour ‘black’ in them), then we would like to place less weight on these attributes in the cosine similarity calculation. For example, if a customer has bought lots of products with the word black in their product description, then the customer should not be described as having a preference for the colour black if the word black appears in almost all of the products.

A common heuristic used to combat this behaviour is Inverse Document Frequency (IDF).

IDF doesn’t have a single definition; there are many variants (see the examples at https://en.wikipedia.org/wiki/Tf-idf).

The definition of IDF used here is a smoothed variant modelled on the python library scikit-learn (note that scikit-learn, under its default settings, additionally adds 1 to the quantity below)

\[\begin{array}{lcl} \text{idf}(w) &=& \log\Bigg(\displaystyle\frac{1+n}{1+\text{df}(w)}\Bigg) \\ w &=& \text{particular word/attribute under consideration} \\ n &=& \text{total number of products} \\ \text{df}(w) &=& \text{total number of products containing this word/attribute} \end{array}\]

(see https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

IDF upweights words that are rare and downweights words that are common.

Here is how this IDF function looks for a database of 1000 unique products:
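
A sketch that draws this curve in base R (the function name idf is mine):

# smoothed IDF, as defined above
idf <- function(df_w, n) log((1 + n) / (1 + df_w))

n  <- 1000   # total number of products in the database
df <- 1:n    # number of products containing the word
plot(df, idf(df, n), type = "l",
     xlab = "df(w): number of products containing word w",
     ylab = "idf(w)")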

Here is how the inclusion of IDF affects the product vectors in our example:

product_id   product_desc               tf_red   tf_blue   tf_green   red_idf     blue_idf    green_idf   red_tf_idf   blue_tf_idf   green_tf_idf
product_1    red, green, green, green   1        0         3          0.2231436   0.2231436   0.5108256   0.2231436    0.0000000     1.5324769
product_2    red, blue, green           1        1         1          0.2231436   0.2231436   0.5108256   0.2231436    0.2231436     0.5108256
product_3    blue, blue, blue           0        3         0          0.2231436   0.2231436   0.5108256   0.0000000    0.6694307     0.0000000
product_4    red, red, blue             2        1         0          0.2231436   0.2231436   0.5108256   0.4462871    0.2231436     0.0000000

Each element (word/attribute) of a product vector is now the term frequency (number of times a word/attribute appears in this product) multiplied by the IDF weighting for this word/attribute (scaling for word rarity).
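
This transformation takes only a few lines in R (a sketch, continuing from the hypothetical tf matrix above):

# df(w): number of products containing each colour at least once
df_w  <- colSums(tf > 0)                          # red 3, blue 3, green 2
idf_w <- log((1 + nrow(tf)) / (1 + df_w))         # idf weight per colour

tf_idf      <- sweep(tf, 2, idf_w, `*`)           # term frequency x idf weight
tf_idf_unit <- tf_idf / sqrt(rowSums(tf_idf^2))   # normalise rows to unit L2 length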

Here are the TF-IDF-transformed product vectors (all vectors are normalised to have L2 unit length):


Notice that the influence of the word green is stronger than the influence of the words blue or red. For example, the product rbg (“red, blue, green”) has a smaller angle with the green axis than it does with the blue or red axes. Green, being a rarer attribute, has been upweighted by TF-IDF relative to the other attributes.

Further Reading

  • Deepak K. Agarwal and Bee-Chung Chen (2016). Statistical Methods for Recommender Systems.