Back to School RS
The Challenge: Design, Measure, Mix, Propose and Justify
This capstone project will bring together the skills you’ve learned across the four prior courses of the Recommender Systems specialization in a single project. You will be given a data set and a specific scenario and will be expected to research, propose, and justify a recommender specifically designed to match that data set and scenario. You will carry out this project individually in four parts, all of which will be submitted together as a final project report:
Design. Your first challenge is to understand your data set and scenario and produce a research design. This design will identify a set of metrics and evaluation techniques you will use to evaluate possible algorithms for your recommender, and will outline your plans for exploring both individual and hybrid recommender algorithms. Pay particular attention to how you plan to separate training and test data to ensure that your tests are valid. The plan can be brief (2-3 pages) but must explain how the metrics you choose relate to the business goals in the scenario.
Measure. Next, you will work with a set of at least three different base algorithms (drawn from among those you’ve studied) to understand how they perform on the provided data set, using your selected metrics. As part of this step, you may need to tune some of the base algorithms to get reasonable performance.
Mix. With complex objectives, it is likely that no single algorithm will produce a set of recommendations that meet all of the goals of the scenario. Thus, you need to explore hybrid algorithms to provide the best result set. We expect you to explore at least two, and possibly several different hybrids to find the best results.
Proposal and Reflection. Finally, you should present your recommendation for the algorithm (including possibly a hybrid algorithm) that should be used to fulfill the scenario. You should justify the result and the means used to achieve it, and should address a set of questions about your exploration.
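The Design step stresses separating training and test data yourself. As a minimal sketch of one common approach (a per-user holdout, where each user keeps some ratings for training and hides the rest for testing), the following illustrates the idea; the 20% holdout fraction, tuple layout, and sample data are illustrative assumptions, not part of the assignment:

```python
import random

def per_user_split(ratings, holdout_frac=0.2, seed=42):
    """Split (user, item, rating) tuples so that every user appears in
    both sets: a fraction of each user's ratings is hidden for testing."""
    rng = random.Random(seed)
    by_user = {}
    for user, item, rating in ratings:
        by_user.setdefault(user, []).append((item, rating))
    train, test = [], []
    for user, pairs in by_user.items():
        rng.shuffle(pairs)
        n_test = max(1, int(len(pairs) * holdout_frac))  # hide at least one rating
        for item, rating in pairs[:n_test]:
            test.append((user, item, rating))
        for item, rating in pairs[n_test:]:
            train.append((user, item, rating))
    return train, test

# Toy data for illustration only
ratings = [("u1", "i1", 4), ("u1", "i2", 5), ("u1", "i3", 3),
           ("u2", "i1", 2), ("u2", "i4", 4)]
train, test = per_user_split(ratings)
```

Holding out per user (rather than a single global split) ensures every user has training data available when their test ratings are predicted.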
One Project, Two Paths
This project was designed to support programmers in the course (as most of our specialization enrollees have taken the honors track with programming) but also has an option for non-programmers.
- The programming path will include using LensKit for the base algorithms, for evaluation metrics, and to compute the mixtures.
- The non-programming path will provide you with some raw result data over a set of base algorithms; you can use spreadsheets or tools of your choice to compute metrics and hybrids.
The Data Set
You will be using a data set derived from Amazon.com with product metadata and ratings data on office products. The data set is provided courtesy of Julian McAuley at UCSD and covers actual data from May 1996 through July 2014. To make your computation more tractable, we’ve used a dense subset of the data (the 5-core subset), which includes only items and users with at least five ratings. [Note that the original data sets are available at http://jmcauley.ucsd.edu/data/amazon/, though they should not be used for this capstone.]
Note: There are separate data set extracts for those of you completing the programming (honors) track and the non-programming track.
The non-programming track dataset is smaller to make it more feasible for use in spreadsheet computation. For each item, your meta-data includes:
• An item number
• Amazon’s product identifier (“asin”, the Amazon Standard Identification Number)
• The item’s brand name
• The item title
• The item category (both leaf category and full path)
• A price in dollars
• An availability score between 0 and 1 reflecting how widely the product is stocked in retail stores; higher scores indicate broad availability, while lower scores indicate products not found in most big-box stores
Note that the availability score is synthetic (we created it), but for purposes of this capstone, treat it as if it were real data. You also are provided with a rating matrix with a row for each item and columns representing each user (ratings are on a 1-5 star scale). Your rating matrix includes all the ratings data you will receive (we have not separated out test and training data -- that’s your responsibility).
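Since you must create your own test split and then score predictions against it, a minimal RMSE sketch over held-out ratings may be useful; the dictionary-of-(user, item)-pairs layout and the toy numbers are assumptions for illustration (this is not LensKit code):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over the (user, item) pairs that appear
    in both the predicted and the held-out (actual) rating dictionaries."""
    keys = set(predicted) & set(actual)
    return math.sqrt(sum((predicted[k] - actual[k]) ** 2 for k in keys) / len(keys))

# Toy held-out ratings and predictions for illustration only
actual = {("u1", "i1"): 4.0, ("u1", "i2"): 5.0}
predicted = {("u1", "i1"): 3.5, ("u1", "i2"): 4.5}
error = rmse(predicted, actual)  # sqrt((0.25 + 0.25) / 2) = 0.5
```

Only pairs present in both dictionaries are scored, so predictions for items outside the test set do not affect the result.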
Part I: Designing a Measurement and Evaluation Plan (joint)
Part II: Measurement (separate)
Part III: Mixing the Algorithms (separate)
Part IV: Proposal and Reflection (joint)
Recommendation systems: hybrid RS, content-based, linear, user-user, item-item
Metrics: non-linear, RMSE, NDCG, Precision@N, top-N
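For the ranking metrics listed above, a minimal sketch of Precision@N and binary-relevance NDCG follows; the cutoff, list contents, and relevance sets are illustrative assumptions:

```python
import math

def precision_at_n(recommended, relevant, n=10):
    """Fraction of the top-n recommended items that are relevant."""
    top = recommended[:n]
    return sum(1 for item in top if item in relevant) / len(top)

def ndcg_at_n(recommended, relevant, n=10):
    """Binary-relevance NDCG: discounted cumulative gain of the top-n
    list, divided by the ideal DCG (all relevant items ranked first)."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:n]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), n)))
    return dcg / ideal if ideal > 0 else 0.0

# Toy ranked list and relevance judgments for illustration only
recs = ["i1", "i2", "i3", "i4"]
relevant = {"i1", "i3"}
p = precision_at_n(recs, relevant, n=4)  # 2 of 4 recommended items are relevant
g = ndcg_at_n(recs, relevant, n=4)
```

NDCG rewards placing relevant items near the top of the list, which Precision@N alone does not capture.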
See the PDF of the Capstone Assignments for full details.