Published: 2022-06-29 | Category: Programming Knowledge | Contributor: 李佳

DIS 2006/2007

Exercise 8: TF/IDF ranking

In this exercise we look at how TF/IDF ranking works.

There are 5 different documents in the collection:

Task 1. For the query Q = "Beijing duck recipe", find the two top-ranked documents according to the TF/IDF ranking. Assume the cosine similarity measure and the culinary term set T = {beijing, dish, duck, rabbit, recipe, roast}. Are the top-ranked documents relevant to the query?
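The procedure asked for in Task 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents A, B and C are hypothetical stand-ins (they are not the exercise's D1 to D5), term frequency is taken as the raw count, and IDF is taken as ln(N/df). The exercise's own spreadsheet may use a different weighting variant, and the function names are invented for this sketch.

```python
import math
from collections import Counter

# Term set T from the exercise.
TERMS = ["beijing", "dish", "duck", "rabbit", "recipe", "roast"]

# Hypothetical toy documents, NOT the exercise's D1..D5.
DOCS = {
    "A": "beijing duck recipe roast duck",
    "B": "rabbit recipe dish",
    "C": "roast rabbit dish",
}

def tf_vector(text):
    """Raw term counts over TERMS (assumed TF variant)."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in TERMS]

def idf_weights(tf_vectors):
    """Assumed IDF variant: ln(N / df); 0 for terms that occur nowhere."""
    n = len(tf_vectors)
    weights = []
    for i in range(len(TERMS)):
        df = sum(1 for v in tf_vectors if v[i] > 0)
        weights.append(math.log(n / df) if df else 0.0)
    return weights

def cosine(u, v):
    """Cosine of the angle between two vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (name, score) pairs sorted by descending cosine similarity."""
    tfs = {name: tf_vector(text) for name, text in docs.items()}
    idf = idf_weights(list(tfs.values()))
    q = [f * w for f, w in zip(tf_vector(query), idf)]
    scores = {
        name: cosine(q, [f * w for f, w in zip(tf, idf)])
        for name, tf in tfs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

RANKING = rank("Beijing duck recipe", DOCS)
for name, score in RANKING:
    print(f"{name}: {score:.3f}")
```

With these toy documents, A ranks first, B second, and C scores zero because it shares no term with the query. Replacing DOCS with the five documents of the collection gives the ranking for the actual task, provided the same weighting variant is intended.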

Task 2. Assume that the author of document D5 goes on to tell more about her summer trip to China before doing the cooking, and so uses the word "Beijing" three times instead of just once. What happens to the rank of D5? How can this be interpreted in the vector space retrieval model (vectors and the angles between them)? Is this change in the ranking of D5 a desirable property of TF/IDF? Why?
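The geometric reading that Task 2 asks about can be explored with a small sketch. Everything numeric here is an assumption made for illustration: the IDF weights are invented, and D_ONCE and D_THRICE are hypothetical term counts, not the real D5. The point is only to show how changing one term frequency rotates the document vector relative to the query vector.

```python
import math

TERMS = ["beijing", "dish", "duck", "rabbit", "recipe", "roast"]

# Hypothetical IDF weights, chosen only for illustration.
IDF = {"beijing": 1.0, "dish": 0.5, "duck": 1.0,
       "rabbit": 0.5, "recipe": 0.5, "roast": 0.5}

def weighted(tf):
    """TF*IDF vector over TERMS from a dict of raw term counts."""
    return [tf.get(t, 0) * IDF[t] for t in TERMS]

def angle_deg(u, v):
    """Angle in degrees between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = dot / (norm_u * norm_v)
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

QUERY = {"beijing": 1, "duck": 1, "recipe": 1}

# Hypothetical document counts, NOT the real D5.
D_ONCE = {"beijing": 1, "duck": 1, "recipe": 1, "roast": 1}
D_THRICE = dict(D_ONCE, beijing=3)  # same document, "beijing" used 3 times

ANGLE_ONCE = angle_deg(weighted(QUERY), weighted(D_ONCE))
ANGLE_THRICE = angle_deg(weighted(QUERY), weighted(D_THRICE))
print(f"angle with beijing x1: {ANGLE_ONCE:.2f} degrees")
print(f"angle with beijing x3: {ANGLE_THRICE:.2f} degrees")
```

In this toy setup the extra occurrences pull the document vector toward the "beijing" axis and away from the query direction, so the angle widens and the cosine score drops. Whether D5 moves up or down in the exercise depends on its actual term counts and on the IDF values of the real collection.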

Solution

Excel sheet with calculations