cascading-simhash: a library to cluster by minhashes in Hadoop
By Nate Murray | Published: May 9, 2011
simhashing
Say you have a large corpus of web documents and you want to group them together by some notion of “similarity”. For instance, we may want to detect plagiarism or find content that appears on multiple pages of a site.
In this scenario, it’s impractical to do a pairwise comparison of all documents. Fortunately, we can use simhashing.
Broadly speaking, simhashing is an algorithm that calculates a “cluster id” (the minimum hash, or minhash) from the content. Because the minhash for an item is calculated independently of the other items in the set, minhashing is an ideal candidate for MapReduce.
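To make the idea concrete, here is a minimal sketch in plain Python. It is not the cascading-simhash implementation. The helper names (`shingles`, `stable_hash`, `minhash`, `cluster`), the choice of three-word shingles, and the use of a truncated MD5 digest as the hash function are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text.

    The shingle size of 3 is an illustrative assumption.
    """
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def stable_hash(s):
    """64-bit hash that is stable across processes (unlike built-in hash()).

    Truncated MD5 is an assumption; any well-mixed hash would do.
    """
    digest = hashlib.md5(s.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, size=3):
    """Cluster id for one document: the minimum hash over its shingles."""
    return min(stable_hash(s) for s in shingles(text, size))


def cluster(docs):
    """Group document ids by minhash; docs maps id -> text."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        # "Map" step: each document is hashed without looking at any other.
        key = minhash(text)
        # "Reduce" step: documents sharing a key land in the same group.
        groups[key].append(doc_id)
    return dict(groups)
```

Each call to `minhash` looks at one document only, which is why the work splits cleanly across mappers, with the grouping left to the reducers. With a well-mixed hash function, two documents share a minhash with probability roughly equal to the Jaccard similarity of their shingle sets, so a single minhash is a coarse cluster id rather than an exact similarity test.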