Mauritz Zachrisson

Augmenting Jaro-Winkler to Detect Duplicate Financial Transactions

Abstract. This thesis proposes a duplicate detection algorithm for short segments of text, based on an augmented Jaro-Winkler string similarity metric. By using term frequency to identify and remove stop words, the algorithm injects prior knowledge about language structure. The algorithm is trained and evaluated on two small datasets of payment transactions. On one of the datasets it outperforms standard distance metrics by a statistically significant margin, as measured by the area under the Precision-Recall and Receiver Operating Characteristic (ROC) curves. In a high-precision scenario, measured by the F0.5 score, it also outperforms some standard distance algorithms.
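
The general idea can be illustrated with a minimal sketch. This is not the implementation developed in the thesis. It assumes whitespace tokenisation, a stop-word rule that drops any term occurring in more than a given share of the transactions, and the commonly used Jaro-Winkler prefix weight of 0.1 over at most four characters. All function names, the `max_doc_ratio` parameter, and the example transaction texts are hypothetical.

```python
from collections import Counter


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if not s1 or not s2:
        return 0.0  # assumption: an empty string carries no evidence
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def frequent_terms(texts, max_doc_ratio: float = 0.5) -> set:
    """Hypothetical stop-word rule: terms found in more than max_doc_ratio of texts."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.lower().split()))
    limit = max_doc_ratio * len(texts)
    return {term for term, n in counts.items() if n > limit}


def strip_terms(text: str, stop_words: set) -> str:
    """Remove stop words from a lower-cased, whitespace-tokenised text."""
    return " ".join(t for t in text.lower().split() if t not in stop_words)


def augmented_similarity(a: str, b: str, stop_words: set) -> float:
    """Jaro-Winkler similarity computed after stop-word removal."""
    return jaro_winkler(strip_terms(a, stop_words), strip_terms(b, stop_words))


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up transaction texts, for illustration only.
transactions = [
    "card payment spotify",
    "card payment netflix",
    "card payment ica",
    "swish anna",
]
stop_words = frequent_terms(transactions)
raw = jaro_winkler(transactions[0], transactions[1])
augmented = augmented_similarity(transactions[0], transactions[1], stop_words)
```

In this toy example, the shared boilerplate "card payment" inflates the plain Jaro-Winkler score of two unrelated transactions. Once the frequent terms are removed, the score depends only on the distinguishing merchant names, so `augmented` is lower than `raw`.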