Name Matching Experiment
(Part 6)
Eurospider carried out a simple experiment with the popular Levenshtein distance string metric. Around 600 names taken from the media were used as queries against a test database of more than 1000 entries. For each of the 600 names, the test database contained the full and correct name, which differed from the form used in the media. The database entries found for each query were ranked by ascending Levenshtein distance. Finally, yield and precision were determined for the case in which only the top n ranks are sifted. What can we learn from this?
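To make the setup concrete, here is a minimal Python sketch of the ranking step. It is not Eurospider's implementation: the function names, the lowercasing of names before comparison, and the three sample database entries are assumptions made purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query: str, database: list[str]) -> list[str]:
    """Return database entries sorted by ascending distance to the query.
    Lowercasing is an illustrative assumption; ties are broken alphabetically."""
    return sorted(
        database,
        key=lambda entry: (levenshtein(query.lower(), entry.lower()), entry),
    )


# Hypothetical example: a media spelling searched against three entries.
database = ["Jane Smyth", "John Smith", "Joan Smits"]
ranking = rank_by_distance("Jon Smith", database)
print(ranking)
```

In this toy example the entry at distance 1 is ranked first, followed by the entries at distance 2 and 3. Sifting the top n ranks then means reviewing only the first n entries of such a list.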
Results
We can see that the more ranks are sifted, the more correct hits (true positives) are found. As expected, the precision declines at the same time: the more ranks are sifted, the more false hits (false positives) are found as well. The sharp drop in the precision curve means that the verification effort increases significantly.
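The trade-off can be illustrated with a small Python sketch. The definitions used here are assumptions, since the exact formulas are not spelled out: yield (often called recall) is taken as the share of queries whose correct entry appears within the top n ranks, and precision as the share of all sifted entries that are correct. The four ranked lists are invented toy data, not results from the experiment.

```python
def yield_and_precision(rankings, correct, n):
    """Compute (yield, precision) when only the top n ranks are sifted.

    rankings: dict mapping each query to its ranked list of entries
    correct:  dict mapping each query to its single correct entry
    """
    true_positives = 0
    sifted = 0
    for query, ranked in rankings.items():
        top = ranked[:n]
        sifted += len(top)  # every sifted entry has to be verified
        if correct[query] in top:
            true_positives += 1
    yield_value = true_positives / len(rankings) if rankings else 0.0
    precision = true_positives / sifted if sifted else 0.0
    return yield_value, precision


# Invented toy data: the correct entry sits at rank 1, 1, 2 and 3.
rankings = {
    "q1": ["A", "B", "C"],
    "q2": ["D", "E", "F"],
    "q3": ["G", "H", "I"],
    "q4": ["J", "K", "L"],
}
correct = {"q1": "A", "q2": "D", "q3": "H", "q4": "L"}

for n in (1, 2, 3):
    y, p = yield_and_precision(rankings, correct, n)
    print(f"top {n}: yield={y:.3f} precision={p:.3f}")
```

With this toy data, yield rises from 0.5 to 0.75 to 1.0 as n goes from 1 to 3, while precision falls from 0.5 to 0.375 to about 0.333. Each additional rank adds one more entry per query that has to be verified, but only occasionally a new correct hit.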