The exact probability law for the approximated similarity from the Minhashing method
We propose a probabilistic setting in which we study the probability law of the Rajaraman and Ullman RU algorithm and a modied version of it denoted by RUM. These algorithms aim at estimating the similarity index between huge texts in the context of the web. We give a foundation of this method by showing, in the ideal case of carefully chosen probability laws, the exact similarity is the mathematical expectation of the random similarity provided by the algorithm. Some extensions are given.
Keywords: Minshashing, algorithms, similarity, estimation, probability laws, convergence of algorithm