# K-Nearest Neighbor Collaborative Filtering and Mean Absolute Error

Suppose that an online bookseller has collected ratings from 20 past users (U1-U20) on a selection of recent books. Ratings range from 1 (worst) to 5 (best). Two new users (NU1 and NU2) have recently visited the site and rated some of the books ("?" represents a missing rating). The two new users' ratings are given in the last two rows of the table.

The books, in column order: *True Believer*, *The Da Vinci Code*, *The World Is Flat*, *My Life So Far*, *The Taking*, *The Kite Runner*, *Runny Babbit*, *Harry Potter*.

| User | Ratings (unrated books omitted) |
|------|---------------------------------|
| U1   | 1 5 3 3 5 |
| U2   | 5 4 3 2 1 |
| U3   | 3 1 2 2 5 |
| U4   | 3 4 1 3 |
| U5   | 2 4 3 2 2 |
| U6   | 5 3 1 3 1 |
| U7   | 1 4 5 5 2 4 |
| U8   | 2 1 4 5 1 |
| U9   | 3 2 2 5 |
| U10  | 3 5 1 4 4 |
| U11  | 2 1 2 3 |
| U12  | 4 4 2 1 1 4 |
| U13  | 2 4 4 5 |
| U14  | 5 3 3 2 1 1 |
| U15  | 2 3 3 2 |
| U16  | 3 2 1 1 4 4 |
| U17  | 1 5 1 2 4 4 |
| U18  | 5 4 3 3 4 5 |
| U19  | 4 2 5 1 5 |
| U20  | 2 5 1 1 5 3 4 |
| NU1  | 3 5 4 2 3 5 |
| NU2  | 5 2 2 4 1 3 |

Using the K-Nearest Neighbor algorithm, predict the ratings of these new users for each of the books they have not yet rated. Use the Pearson correlation coefficient (see Assignment 1) as the similarity measure.

1. (20 points) First, compute the correlations between the new users (NU1 and NU2) and all other users (you can show these as added columns in the original spreadsheet). Then, for each new user, compute the predicted rating for each of the unrated items using K = 3 (i.e., the 3 nearest neighbors). Use the weighted-average function to compute the predictions from the ratings of the nearest neighbors. Be sure to show the intermediate steps in your work and briefly explain how you computed the predictions.
2. (20 points) Measure the Mean Absolute Error (MAE) of the predictions, using NU1 and NU2 as test users. You can compute MAE by generating predictions for the items already rated by each test user (e.g., for NU1 these are all items except “The Da Vinci Code” and “Runny Babbit”). Then, for each of these items, compute the absolute value of the difference between the predicted and the actual rating. Finally, average these errors across all 12 compared items (6 for NU1 and 6 for NU2) to obtain the MAE.
3. (20 points) Item-Based Collaborative Filtering. Using the same data as above and the item-based collaborative filtering algorithm, compute the predicted rating of NU1 on the book “The Da Vinci Code”. Note that in this case you will need to find the K most similar items (books) to the target item based on their rating vectors (the columns of the table), and then use NU1’s ratings on those K neighbor items. For this problem use K = 2, and use Cosine Similarity to identify the most similar neighbors to “The Da Vinci Code”.
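The user-based procedure in Problem 1 can be sketched as follows. The rating lists, the `None` placeholder for “?”, and the toy vectors in the usage example are illustrative assumptions, not the assignment data; the denominator here uses the sum of absolute similarities, which is one common convention for the weighted average.

```python
import math

def pearson(u, v):
    """Pearson correlation over items both users have rated (None = unrated)."""
    common = [i for i in range(len(u)) if u[i] is not None and v[i] is not None]
    if len(common) < 2:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = (math.sqrt(sum((u[i] - mu) ** 2 for i in common)) *
           math.sqrt(sum((v[i] - mv) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(target, others, item, k=3):
    """Weighted-average prediction for `item` from the k most correlated
    users who actually rated it."""
    sims = sorted(((pearson(target, o), o) for o in others if o[item] is not None),
                  key=lambda s: s[0], reverse=True)[:k]
    num = sum(s * o[item] for s, o in sims)
    den = sum(abs(s) for s, o in sims)
    return num / den if den else None

# Illustrative toy data, not the assignment table:
nu = [5, 4, None]                      # target user, item 2 unrated
neighbors = [[5, 4, 3], [1, 2, 5]]
print(predict(nu, neighbors, 2, k=1))  # most similar neighbor's rating dominates
```

With the assignment table you would compute the correlation of NU1 (or NU2) against each of U1-U20 over their co-rated books, then apply `predict` with K = 3 to each unrated item.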
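The MAE computation in Problem 2 reduces to averaging absolute prediction errors; a minimal sketch (the example rating pairs are made up, not taken from the table):

```python
def mae(pairs):
    """Mean Absolute Error over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

# Illustrative: two (predicted, actual) pairs with errors 0.5 and 2.0.
print(mae([(3.5, 3), (2, 4)]))  # -> 1.25
```

For the assignment, the list would hold the 12 (predicted, actual) pairs for the items NU1 and NU2 have already rated.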
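For Problem 3, a sketch of item-based prediction, with cosine similarity computed only over users who rated both items. The column vectors, the item-index mapping, and the `None` convention are illustrative assumptions, not the assignment data:

```python
import math

def cosine(a, b):
    """Cosine similarity over users who rated both items (None = unrated)."""
    common = [i for i in range(len(a)) if a[i] is not None and b[i] is not None]
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(a[i] ** 2 for i in common))
    nb = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (na * nb) if na and nb else 0.0

def item_based_predict(user_ratings, target_col, item_cols, k=2):
    """Predict the user's rating on the target item as a weighted average of
    the user's ratings on the k most cosine-similar items."""
    sims = sorted(((cosine(target_col, col), j) for j, col in item_cols.items()
                   if user_ratings[j] is not None),
                  key=lambda s: s[0], reverse=True)[:k]
    num = sum(s * user_ratings[j] for s, j in sims)
    den = sum(s for s, j in sims)
    return num / den if den else None

# Illustrative toy columns (3 users, item 2 is the target):
target = [5, 3, 4]
others = {0: [5, 3, 4], 1: [1, 5, 2]}  # item id -> rating column
print(item_based_predict([4, 2, None], target, others, k=2))
```

For the assignment, `target_col` would be the “The Da Vinci Code” column, `item_cols` the other seven book columns, and `user_ratings` NU1's row, with K = 2.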