For many recommender algorithms (algorithms for making personalised product recommendations to customers), it is convenient to represent products and customers as vectors of information.
I refer here to the context of recommending products to customers, but the frameworks discussed could just as easily be used to recommend any kind of item (e.g. a movie) to any kind of consumer of items (e.g. a web-page user).
This page explores and illustrates the concept of describing customers and products using numeric vectors.
Suppose that our hypothetical business has the following four products to offer:
Suppose that the only attribute information that we have for these products is their colouring, i.e. the amount of red, blue and green in each product. In real applications, each product vector could be thousands of attributes long, encoding any desirable information about that product.
product_id | product_desc |
---|---|
product_1 | red, green, green, green |
product_2 | red, blue, green |
product_3 | blue, blue, blue |
product_4 | red, red, blue |
Counting the number of times that each colour (word) appears in a product’s description gives us a numeric representation of that product:

product_id | product_desc | tf_red | tf_blue | tf_green |
---|---|---|---|---|
product_1 | red, green, green, green | 1 | 0 | 3 |
product_2 | red, blue, green | 1 | 1 | 1 |
product_3 | blue, blue, blue | 0 | 2 | 0 |
product_4 | red, red, blue | 2 | 1 | 0 |
Here are the four products, coded as numeric vectors (with elements ordered red, blue, green):
\[(1,0,3)\]
\[(1,1,1)\]
\[(0,2,0)\]
\[(2,1,0)\]
Now that they are represented as vectors, we can visualise these four products in Euclidean space:
In the language of information retrieval, these are called term frequency vectors, since they are counts of word occurrence in the bag-of-words describing each product.
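Here is a minimal sketch of how these term frequency vectors might be built in R (base R only; the object names `product_desc`, `vocab` and `tf` are hypothetical, not from the original analysis):

```r
# hypothetical product descriptions, each a bag of words
product_desc <- list(
  rggg = c("red", "green", "green", "green"),
  rbg  = c("red", "blue", "green"),
  bbb  = c("blue", "blue", "blue"),
  rrb  = c("red", "red", "blue")
)
vocab <- c("red", "blue", "green")

# term-frequency matrix: count each vocabulary word in each description
tf <- t(sapply(product_desc, function(words) {
  sapply(vocab, function(w) sum(words == w))
}))
tf
##      red blue green
## rggg   1    0     3
## rbg    1    1     1
## bbb    0    2     0
## rrb    2    1     0
```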
The smaller the angle \(\theta\) between any 2 product vectors, the more similar those 2 products are. Because we are only concerned with angle, we can scale each of the product vectors to have length 1 (L2 norm).
Here are the same product vectors in the product space, scaled to have (L2) length 1:
Notice that products with more attributes (colours) in common have a smaller angle between them.
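That scaling is a one-liner in the same sketch, reusing the `tf` matrix from above:

```r
# scale each product (row) vector to unit L2 length
tf_unit <- tf / sqrt(rowSums(tf^2))
```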
The angles (in degrees) between the product vectors are:
## rggg rbg bbb rrb
## rggg 0 43 90 74
## rbg 43 0 55 39
## bbb 90 55 0 63
## rrb 74 39 63 0
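A sketch of how this angle matrix might be computed, reusing the `tf` matrix from the earlier sketch (`cosine_sim` is a hypothetical helper):

```r
# cosine of the angle between every pair of row vectors
cosine_sim <- function(x) {
  dot   <- tcrossprod(x)       # dot products between all pairs of rows
  norms <- sqrt(diag(dot))     # L2 length of each row vector
  dot / outer(norms, norms)    # cos(theta) for every pair of rows
}

# angles in degrees (pmin guards against floating-point cosines just above 1)
cos_mat <- cosine_sim(tf)
round(acos(pmin(cos_mat, 1)) * 180 / pi)
```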
The largest angle possible between a pair of products is \(90^\circ\) (\(\displaystyle\frac{\pi}{2}\) radians). This occurs when 2 products have no attributes in common.
The smallest possible angle between a pair of products is \(0^\circ\). This occurs when the two products have the same attributes in the same proportions (their vectors point in the same direction).
The cosine of the angle \(\theta\) between 2 vectors gives a convenient metric for measuring the similarity between them, since here \(\cos(\theta)\) lies within the range \([0,1]\):
\(\cos(\theta)\) | meaning |
---|---|
1 | vectors point in the same direction (one is a positive multiple of the other) |
0 | vectors are orthogonal (\(90^\circ\) apart) |
\(\cos(\theta)\), in this context, is called the cosine similarity.
Since, in this context, the vectors are non-negative valued, only angles \(\theta\in[0^\circ,90^\circ]\) are possible.
The cosine similarities between the product vectors are:
## rggg rbg bbb rrb
## rggg 1.0000000 0.7302967 0.0000000 0.2828427
## rbg 0.7302967 1.0000000 0.5773503 0.7745967
## bbb 0.0000000 0.5773503 1.0000000 0.4472136
## rrb 0.2828427 0.7745967 0.4472136 1.0000000
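These values come straight from the `cosine_sim` helper sketched above, without the \(\cos^{-1}\) step:

```r
round(cosine_sim(tf), 7)
```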
If we can represent customers as vectors in the same euclidean space as the product vectors, then we could use the angle between a customer vector and a product vector as a measure of how appropriate (‘similar’) that product is to that customer.
For example, we might code a customer who likes the colour red a lot, likes blue a little, and dislikes green as:
\[\Big(5,2,0\Big)\]
The angle between this customer and the product rbg is:
\[\begin{array}{lcl} \theta &=& \cos^{-1}\Bigg(\displaystyle\frac{(5,2,0)\cdot(1,1,1)}{|(5,2,0)|\space|(1,1,1)|}\Bigg) \\ &=& 0.722 \text{ radians} \\ &=& 41.37^\circ \\ \cos(\theta) &=& 0.75 \end{array}\]
whereas a product better suited to this customer might be rb \((1,1,0)\), which has an angle of
\[\begin{array}{lcl} \theta &=& \cos^{-1}\Bigg(\displaystyle\frac{(5,2,0)\cdot(1,1,0)}{|(5,2,0)|\space|(1,1,0)|}\Bigg) \\ &=& 0.405 \text{ radians} \\ &=& 23.2^\circ \\ \cos(\theta) &=& 0.92 \end{array}\]
between it and the customer vector.
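Here is a sketch of these two worked examples (base R; `angle_between` is a hypothetical helper):

```r
# angle (and cosine similarity) between two vectors
angle_between <- function(a, b) {
  cos_theta <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  c(radians = acos(cos_theta),
    degrees = acos(cos_theta) * 180 / pi,
    cosine  = cos_theta)
}

customer <- c(5, 2, 0)
angle_between(customer, c(1, 1, 1))   # rbg: ~0.722 radians, ~41.4 degrees, cosine ~0.75
angle_between(customer, c(1, 1, 0))   # rb:  ~0.405 radians, ~23.2 degrees, cosine ~0.92
```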
One way to generate customer vectors like this (which have the same form as the product vectors) is to represent each customer as an aggregation of all of the products (vectors) that they have historically bought.
For example, suppose now that a customer has bought the following products:
Product | Red | Blue | Green | n_bought |
---|---|---|---|---|
product_2 | 1 | 1 | 1 | 1 |
product_4 | 2 | 1 | 0 | 1 |
We can represent this customer as a vector in the product-vector space by averaging over these 2 product vectors that the customer has bought:
\[\begin{array}{lcl} \overrightarrow{\text{customer}}_i &=& \Big(\displaystyle\frac{1+2}{2} \space,\space \frac{1+1}{2} \space,\space \frac{1+0}{2}\Big)\\ &=& ( 1.5 \space,\space 1 \space,\space 0.5 ) \end{array}\]
Here is the customer plotted in the product-vector space:
Notice that the customer vector lies precisely in the middle of the two product vectors that they bought.
Suppose that this customer buys another of product_4, making their purchase history:
Product | Red | Blue | Green |
---|---|---|---|
product_2 | 1 | 1 | 1 |
product_4 | 2 | 1 | 0 |
product_4 | 2 | 1 | 0 |
The customer vector is now:
\[\begin{array}{lcl} \overrightarrow{\text{customer}}_i &=& \Big(\displaystyle\frac{1+2+2}{3} \space,\space \frac{1+1+1}{3} \space,\space \frac{1+0+0}{3}\Big)\\ &=& ( 1.67 \space,\space 1 \space,\space 0.33 ) \end{array}\]
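As a sketch, this averaging over the purchase history might be done like this (`purchases` is a hypothetical matrix with one row per product bought):

```r
# one row per product purchased (product_4 appears twice)
purchases <- rbind(
  product_2 = c(red = 1, blue = 1, green = 1),
  product_4 = c(red = 2, blue = 1, green = 0),
  product_4 = c(red = 2, blue = 1, green = 0)
)

# the customer vector is the column-wise mean of the purchased product vectors
customer <- colMeans(purchases)
round(customer, 2)
##  red blue green
## 1.67 1.00 0.33
```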
Here is the customer (under this new scenario) plotted in the customer/product euclidean space:
Notice that by buying this product (rrb), this customer (vector) has shifted closer to that product (vector).
The angle (in degrees) between this customer vector and each of the product vectors is:
## $`customer angle to product vector1: (rggg)`
## [,1]
## [1,] 64.68351
##
## $`customer angle to product vector2: (rbg)`
## [,1]
## [1,] 28.56083
##
## $`customer angle to product vector3: (bbb)`
## [,1]
## [1,] 59.52964
##
## $`customer angle to product vector4: (rrb)`
## [,1]
## [1,] 10.6707
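These four angles can be reproduced with the `angle_between` helper and the `customer` vector from the earlier sketches:

```r
products <- rbind(rggg = c(1, 0, 3), rbg = c(1, 1, 1), bbb = c(0, 2, 0), rrb = c(2, 1, 0))
apply(products, 1, function(p) unname(angle_between(customer, p)["degrees"]))
##     rggg      rbg      bbb      rrb
## 64.68351 28.56083 59.52964 10.67070
```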
The customer, naturally, is closest to the products that they actually bought.
Suppose now that we want to recommend one of the following 2 products to this customer:
Product | Red | Blue | Green | angle \(\theta\) between customer & product vectors | \(cos(\theta)\) |
---|---|---|---|---|---|
product_5 | 0 | 1 | 1 | \(61^\circ\) | 0.4780914 |
product_6 | 1 | 1 | 0 | \(17^\circ\) | 0.9561829 |
As a reminder, the customer vector is \((1.67, 1, 0.33)\). Because the angle between the customer vector and the product_6 vector is smaller than the angle between the customer vector and the product_5 vector, we’d recommend product_6 to this customer. We could equally have made this decision by choosing the product with the larger \(\cos(\theta)\).
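A sketch of this recommendation step, reusing the `customer` vector and the `angle_between` helper from above:

```r
# candidate products to recommend
candidates <- rbind(product_5 = c(0, 1, 1), product_6 = c(1, 1, 0))

# cosine similarity between the customer and each candidate product
sims <- apply(candidates, 1, function(p) unname(angle_between(customer, p)["cosine"]))
sims
## product_5 product_6
## 0.4780914 0.9561829

# recommend the candidate with the largest cosine similarity
names(which.max(sims))
## [1] "product_6"
```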
If some product attributes are more common than others (e.g. 80% of products have colour ‘black’ in them), then we would like to place less weight on these attributes in the cosine similarity calculation. For example, if a customer has bought lots of products with the word black in their product description, then the customer should not be described as having a preference for the colour black if the word black appears in almost all of the products.
A commonly used heuristic for combating this behaviour is Inverse Document Frequency (IDF).
IDF doesn’t have a single definition; many variants exist (see the examples at https://en.wikipedia.org/wiki/Tf-idf).
The smoothed definition of IDF used here follows the Python library scikit-learn (whose default settings additionally add 1 to this value, so that words appearing in every product are not ignored entirely; see https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):
\[\begin{array}{lcl} \text{idf}(w) &=& \log\Bigg(\displaystyle\frac{1+n}{1+\text{df}(w)}\Bigg) \\ w &=& \text{particular word/attribute under consideration} \\ n &=& \text{total number of products} \\ \text{df}(w) &=& \text{total number of products containing this word/attribute} \end{array}\]
IDF upweights words that are rare and downweights words that are common.
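A sketch of this IDF function, checked against our 4-product example (`idf` is a hypothetical helper):

```r
# smoothed IDF: log((1 + n) / (1 + df))
idf <- function(df, n) log((1 + n) / (1 + df))

idf(df = 3, n = 4)   # red and blue each appear in 3 of the 4 products: 0.2231436
idf(df = 2, n = 4)   # green appears in 2 of the 4 products:            0.5108256
```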
Here is how this IDF function looks for a database of 1,000 unique products:
Here is how the inclusion of IDF affects the product vectors in our example:
product_id | product_desc | tf_red | tf_blue | tf_green | red_idf | blue_idf | green_idf | red_tf_idf | blue_tf_idf | green_tf_idf |
---|---|---|---|---|---|---|---|---|---|---|
product_1 | red, green, green, green | 1 | 0 | 3 | 0.2231436 | 0.2231436 | 0.5108256 | 0.2231436 | 0.0000000 | 1.5324769 |
product_2 | red, blue, green | 1 | 1 | 1 | 0.2231436 | 0.2231436 | 0.5108256 | 0.2231436 | 0.2231436 | 0.5108256 |
product_3 | blue, blue, blue | 0 | 2 | 0 | 0.2231436 | 0.2231436 | 0.5108256 | 0.0000000 | 0.4462871 | 0.0000000 |
product_4 | red, red, blue | 2 | 1 | 0 | 0.2231436 | 0.2231436 | 0.5108256 | 0.4462871 | 0.2231436 | 0.0000000 |
Each element (word/attribute) of a product vector is now the term frequency (number of times a word/attribute appears in this product) multiplied by the IDF weighting for this word/attribute (scaling for word rarity).
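A sketch of that computation, reusing the `tf` matrix and `idf` helper from the earlier sketches:

```r
# IDF weight for each vocabulary word (red and blue have df = 3; green has df = 2)
idf_w <- idf(df = c(red = 3, blue = 3, green = 2), n = 4)

# multiply each term-frequency column by its word's IDF weight
tf_idf <- sweep(tf, 2, idf_w, `*`)

# normalise each product (row) vector to unit L2 length
tf_idf_unit <- tf_idf / sqrt(rowSums(tf_idf^2))
```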
Here are the TF-IDF-transformed product vectors (all vectors are normalised to have L2 unit length):
Notice that the influence of the word green is now stronger than the influence of the words blue or red. For example, the product rbg (“red, blue, green”) has a smaller angle with the green axis than it does with the blue or red axes. Green, being a rarer attribute, has been upweighted by TF-IDF relative to the other attributes.