A Guide to Grocery Recommendation Systems

Rishabh Malhotra
Dec 20, 2020

Introduction

Every one of us has been recommended movies. When you visit Netflix, Amazon Prime Video, or similar platforms, movie recommendations are everywhere. But have you ever been recommended groceries? You have if you regularly use platforms like Grofers or Amazon Pantry. So, “how is it different from movie recommendations?” I hear you ask. You rate the movies you watch, but you do not rate your groceries. This article is therefore about how to recommend groceries to a user based on their previous purchases.
Broadly, there are two types of recommendation systems: item-based and user-based. This article covers how to develop a basic user-based recommendation system.

Collaborative Filtering

Collaborative filtering is based on the nearest-neighbor philosophy. For example, say you have to recommend n items to a user X. User X is compared with the other users through a similarity index, and the N most similar users are treated as X’s nearest neighbors. Recommendation values are then computed from these similarities.

This method is mainly used for movie recommendations, but it translates easily to grocery recommendations. As mentioned above, movie recommendation systems use user ratings to recommend movies. Since you do not rate your groceries, the ratings are instead derived from how frequently each item is bought. An item that an individual buys more often gets a higher rating than items purchased relatively infrequently.

Data Preprocessing

The Ta-Feng Grocery Dataset can be used for this recommendation task. The dataset was gathered over 4 months from a store in Japan in 2000–2001. It contains the purchase history of every customer who visited the store during that period. This dataset needs to be converted into a format that is less readable for humans but easier for a machine to process. It is therefore converted into a 2-D matrix in which each column represents a customer. For the Ta-Feng dataset, this gives a 32216 x 2012 matrix, where 32216 is the number of products and 2012 is the number of customers who bought something during the 4 months. Each customer vector is thus 32216-dimensional, and each entry records the number of times an item was bought. Because the data is large, it is best to avoid nested loops when building the matrix.
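One way to build such a matrix without nested loops is sketched below. This is a minimal illustration, not the code used in the project: it assumes the transactions have already been loaded as two parallel lists (one product ID and one customer ID per purchase record), and the function name build_evaluation_matrix is hypothetical.

```python
import numpy as np

def build_evaluation_matrix(product_ids, customer_ids):
    """Return (matrix, products, customers).

    matrix[i, j] is the number of purchase records in which
    customer j bought product i.
    """
    product_ids = np.asarray(product_ids)
    customer_ids = np.asarray(customer_ids)

    # np.unique maps every raw ID to a dense row / column index.
    products, rows = np.unique(product_ids, return_inverse=True)
    customers, cols = np.unique(customer_ids, return_inverse=True)

    matrix = np.zeros((len(products), len(customers)), dtype=np.int64)
    # np.add.at accumulates repeated (row, col) pairs correctly,
    # so no explicit Python loop over the records is needed.
    np.add.at(matrix, (rows, cols), 1)
    return matrix, products, customers
```

For example, passing the products ["milk", "bread", "milk", "eggs", "milk"] with the customers ["c1", "c1", "c1", "c2", "c2"] gives a 3 x 2 matrix in which the "milk" row holds 2 for c1 and 1 for c2.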

The Evaluation Matrix

The matrix depicts the interaction between products and customers. The rows represent the product categories and the columns represent the customers. Each entry xij in the matrix tells how many times the ith product was bought by the jth user.

Evaluation Matrix

Since the number of products is higher than the number of users, a user may buy one particular product many times and most other products rarely or never. A single heavily purchased product can then push up the mean value of the jth user’s evaluations. It can also make the predictions inconsistent, because the frequencies for the jth user will be high for some products and as low as 0 for others.

To narrow the range of the frequency values, we can apply a simple cap. If any entry in the table is greater than or equal to 5, we set that entry to 5, so no entry in the table can exceed 5. This gives a defined range of values from 0 to 5.
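The cap described above is a one-line operation on the evaluation matrix. The sketch below assumes the matrix is a NumPy array; the function name cap_frequencies and its cap parameter are illustrative.

```python
import numpy as np

def cap_frequencies(matrix, cap=5):
    """Replace every entry larger than `cap` with `cap` (element-wise)."""
    return np.minimum(np.asarray(matrix), cap)
```

To see the effect on a user's mean, take a user whose purchase counts are 12, 0, 0 and 1. The mean is 3.25 before capping and 1.5 after, so a single heavily purchased product no longer dominates it.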

Recommendation Value Calculation

The recommendation value can be calculated for every user with respect to every product. In this recommendation system, the recommendation values are calculated from the evaluation matrix. The matrix has n products (product categories) and m users, so the recommendation value matrix has the same size as the evaluation matrix, i.e. n x m.

Each entry in the recommendation matrix is a scalar that represents the recommendation value of a product for a user. Let rij denote this entry, where i is the product and j is the customer, so rij is the recommendation value of product i for user j. To recommend products to a particular user u, we simply calculate riu for i from 1 to n. Sorting these values makes it easy to pick the items to recommend: the higher the value of riu, the more likely product i is to be suggested to user u. The formula used to calculate the recommendation values is as follows:

Recommendation value calculation

Here rij is the recommendation value discussed above and x̄j is the mean of the jth user’s evaluations over all products, i.e. the sum of the entries M[i][j] of the evaluation matrix M, with j fixed and i varying from 1 to n, divided by n, the number of products. The formula also contains a key variable, x*ij, which represents xij - x̄j. Here xij denotes the evaluation value of the jth user for product i.

N serves as a hyperparameter: it is the number of users similar to the target user j that are considered significant for the recommendation. In the denominator, we divide by the mean of the similarity values of all the other users, while in the numerator we take only the N significant users. Slj represents the similarity between user l and user j, which is calculated using the Pearson formula.
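The calculation described above can be sketched in code under a few stated assumptions. The sketch takes a precomputed user-by-user similarity matrix as input and uses the common user-based form rij = x̄j + (sum over the N nearest neighbours l of Slj · x*il) / normaliser. The normaliser used here, the sum of |Slj| over those same N neighbours, is an assumption and may differ from the project's exact denominator. Since x̄j and the normaliser are constant for a given user, any positive normaliser yields the same ranking of products for that user. The names recommendation_values and top_k_products are hypothetical.

```python
import numpy as np

def recommendation_values(matrix, similarity, n_neighbors):
    """Return R with R[i, j] = recommendation value of product i for user j.

    matrix:     products x users evaluation matrix
    similarity: users x users similarity matrix (S[l, j] = S_lj)
    """
    X = np.asarray(matrix, dtype=float)
    S = np.array(similarity, dtype=float)   # copy; the caller's matrix is untouched
    np.fill_diagonal(S, -np.inf)            # a user is never their own neighbour

    user_means = X.mean(axis=0)             # x̄_j for every user j
    centered = X - user_means               # x*_ij = x_ij - x̄_j
    n_users = X.shape[1]
    k = min(n_neighbors, n_users - 1)       # cannot have more neighbours than other users

    R = np.empty_like(X)
    for j in range(n_users):
        neighbors = np.argsort(S[:, j])[::-1][:k]   # indices of the k most similar users
        weights = S[neighbors, j]
        norm = np.abs(weights).sum()                # assumed normaliser (see lead-in)
        if norm == 0:
            R[:, j] = user_means[j]                 # no usable neighbours: fall back to the mean
        else:
            R[:, j] = user_means[j] + centered[:, neighbors] @ weights / norm
    return R

def top_k_products(R, user, k):
    """Indices of the k products with the highest recommendation value for `user`."""
    return np.argsort(R[:, user])[::-1][:k]
```

The loop runs once per user rather than once per user-product pair, which keeps the work inside vectorised NumPy operations.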

Identifying Similar Users

To identify if a user l is similar to user j, we use Pearson’s Correlation formula which is stated below:

Pearson Correlation

Here xj denotes the evaluation vector of the jth user, and x̄j is a vector of the same length as xj in which every entry is the mean of the jth user’s evaluations. The formula yields a scalar value that describes the correlation between user l and user j.
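Pearson correlation in its standard form is the dot product of the two mean-centred vectors divided by the product of their lengths. The sketch below computes it for all pairs of users at once, assuming the evaluation matrix is a products x users NumPy array. The function name pearson_similarity is illustrative, and setting the similarity to 0 for a user whose evaluations are all identical is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def pearson_similarity(matrix):
    """Return the users x users matrix S with S[l, j] = Pearson correlation
    between the evaluation vectors (columns) of user l and user j."""
    X = np.asarray(matrix, dtype=float)
    centered = X - X.mean(axis=0, keepdims=True)   # x_j - x̄_j for every user
    norms = np.linalg.norm(centered, axis=0)       # length of each centred vector
    norms[norms == 0] = 1.0                        # constant vectors: similarity becomes 0 (assumption)
    normalized = centered / norms
    return normalized.T @ normalized
```

For users whose evaluations are not all identical, the result agrees with np.corrcoef(matrix, rowvar=False).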

Evaluation

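The most basic way to evaluate your recommendation system is to use precision, recall, and the F1-score. Precision is the fraction of the recommended items that are relevant to the user, and recall is the fraction of the relevant items that were recommended.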

The F1-score is simply the harmonic mean of these two values.
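These metrics can be computed for a single user as sketched below. The sketch assumes that "relevant" means the items the user actually bought in a held-out period, which is a common choice but is not spelled out here; the function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(recommended, relevant):
    """Return (precision, recall, f1) for one user's recommendation list."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)   # recommended items that are also relevant
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall, f1_score(precision, recall)
```

For example, recommending milk, bread and eggs to a user who actually bought milk, eggs, rice and tea gives 2 hits, so precision is 2/3, recall is 1/2 and the F1-score is 4/7.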

Results

We applied the methods described above to the Ta-Feng dataset. With the hyperparameters set to 30 neighbours and 30 recommended items, we get the following results:
precision = 0.9276119402985076
recall = 0.2908470004675024
fscore = 0.44284323649683205

As we increased the number of recommendations, precision, recall, and fscore followed this trend:

Trend for precision, recall and fscore over the number of items we recommend

Contributors

The project was made by Rishabh Malhotra and Yashesh Dagar under the guidance of our professor, Tanmoy Chakraborty.

  1. Rishabh Malhotra did research, Exploratory Data Analysis, Data Preprocessing and some part of the Calculation of Recommendation Values
  2. Yashesh Dagar did research, the remaining Calculation of Recommendation Values and the Evaluation part
