Python 3


optional: Advanced Topics in Data Science

Fuzzy String Matching

Approximate string matching can help us with synonyms, spell corrections and suggestions.

Note that this week's exercises include Fuzzy String exercises (with further discussion), and this week's data folder contains an additional 01_text_analysis.ipynb notebook.

In computer science, fuzzy string matching (also called approximate string matching) is a technique for finding strings that match a pattern approximately rather than exactly. It may be used in a search application to find matches even when users misspell words or enter only partial words.

A well-respected Python library for fuzzy matching is fuzzywuzzy. It uses a metric called Levenshtein distance to compare two strings and score how similar they are. The metric measures the difference between two sequences of characters: specifically, the minimum number of single-character edits needed to transform one sequence into the other. These edits can be:

insertions
deletions
substitutions
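transpositions (a single edit only in the Damerau-Levenshtein variant; plain Levenshtein distance counts a swap of two adjacent characters as two edits)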
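To make the idea of "minimum number of edits" concrete, here is a minimal sketch of the classic Levenshtein distance using dynamic programming. It is an illustration only, not the code fuzzywuzzy runs, and the function name levenshtein is chosen here for the example. It counts insertions, deletions and substitutions; a transposition is not a separate operation in this version, so swapping two adjacent characters costs two edits.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty string) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            # substitution is free when the two characters already match
            substitute_cost = previous[j - 1] + (char_a != char_b)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))           # 3
print(levenshtein("Google Inc", "Google, Inc."))  # 2 (insert "," and ".")
print(levenshtein("ab", "ba"))                    # 2 (a swap costs two edits here)
```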

Consider these three strings:

Google, Inc.

Google Inc

Google, Incorporated

These strings read the same to a human, but any pair of them would fail an equality (==) test. A regex could be written to match all three, but it would have to account for each specific difference (as well as any number of other possible variations).

Fuzzy matching might be used for:

spell checking
punctuation correction
duplicate records with varying entry formats
matching records between data systems
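The three company-name strings above can be compared directly to see why an equality test is not enough. The sketch below uses difflib.SequenceMatcher from the standard library as a stand-in scorer (an assumption made for illustration, so that nothing needs to be installed); the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher

names = ["Google, Inc.", "Google Inc", "Google, Incorporated"]


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Compare every pair: equality fails each time, but the similarity stays high.
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(f"{first!r} vs {second!r}: "
              f"equal={first == second}, similarity={similarity(first, second):.2f}")
```

A threshold on the similarity (which value to use is a judgment call for each data set) could then decide whether two records refer to the same company.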

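Fuzzy logic values range from 0 (not at all True) to 1 (completely True) and can take any value in between. fuzzywuzzy reports its similarity scores in the same spirit, as integers from 0 (no similarity) to 100 (identical strings).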

fuzzywuzzy was developed at SeatGeek to help them scan multiple websites describing events and seating in different ways. Here is an article they prepared when they introduced fuzzywuzzy to the public as an open source project:

https://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/


fuzzywuzzy basics

core methods for matching

Below are examples from the SeatGeek tutorial explaining how they came up with their fuzzy string matching approach, along with commentary about the four main functions used:

.ratio(): compare the "likeness" of two strings
.partial_ratio(): match on words that are substrings
.token_sort_ratio(): tokenizes words and compares them in different orders
.token_set_ratio(): tokenizes both strings, then compares their shared tokens (intersection) and leftover tokens (remainder)

from fuzzywuzzy import fuzz

fuzz.ratio(): compare the "likeness" of two strings

SeatGeek: works fine for very short strings (such as a single word) and very long strings (such as a full book), but not so much for 3-10 word labels. The naive approach is far too sensitive to minor differences in word order, missing or extra words, and other such issues.

fuzz.ratio("YANKEES", "NEW YORK YANKEES") ⇒ 60
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 75
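For readers without fuzzywuzzy installed, a whole-string ratio on the same 0 to 100 scale can be sketched with the standard library. This assumes difflib.SequenceMatcher as the scorer and a helper name, simple_ratio, invented for this example; fuzzywuzzy's own scores may differ by a point depending on version and rounding, so no exact outputs are claimed here.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


pairs = [
    ("YANKEES", "NEW YORK YANKEES"),        # extra words lower the score
    ("NEW YORK METS", "NEW YORK YANKEES"),  # different teams still score fairly high
    ("New York Mets vs Atlanta Braves",
     "Atlanta Braves vs New York Mets"),    # same words, different order
]
for a, b in pairs:
    print(f"{a!r} | {b!r} -> {simple_ratio(a, b)}")
```

The last pair illustrates the sensitivity to word order: both strings contain exactly the same words, yet the whole-string score falls short of 100.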

fuzz.partial_ratio(): match on words that are substrings

SeatGeek: we use a heuristic we call “best partial” when two strings are of noticeably different lengths (such as the case above). If the shorter string is length m, and the longer string is length n, we’re basically interested in the score of the best matching length-m substring.

fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES") ⇒ 100
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES") ⇒ 69
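The "best matching length-m substring" idea can be sketched as a brute-force sliding window: score the shorter string against every length-m slice of the longer one and keep the best score. This is an illustrative assumption about how to realize the heuristic, not fuzzywuzzy's actual implementation (which picks candidate windows more cleverly); the names simple_ratio and best_partial_ratio are hypothetical.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def best_partial_ratio(s1: str, s2: str) -> int:
    """Score the shorter string against each same-length window of the longer."""
    shorter, longer = sorted((s1, s2), key=len)
    m = len(shorter)
    if m == 0:
        return 0  # nothing to match against
    # Every substring of `longer` that has exactly m characters.
    windows = (longer[i:i + m] for i in range(len(longer) - m + 1))
    return max(simple_ratio(shorter, window) for window in windows)


print(best_partial_ratio("YANKEES", "NEW YORK YANKEES"))        # 100: exact substring
print(best_partial_ratio("NEW YORK METS", "NEW YORK YANKEES"))
```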

fuzz.token_sort_ratio(): tokenizes words and compares them in different orders

SeatGeek: we also have to deal with differences in string construction. Here is an extremely common pattern, where one seller constructs strings as “<HOME_TEAM> vs <AWAY_TEAM>” and another constructs strings as “<AWAY_TEAM> vs <HOME_TEAM>”. The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:

"new york mets vs atlanta braves" -> "atlanta braves mets new vs york"

We then compare the transformed strings with a simple ratio().

fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 100
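The tokenize, sort and rejoin steps described above can be sketched in a few lines. This version only lower-cases and splits on whitespace (fuzzywuzzy does additional cleanup of its own), and it uses difflib.SequenceMatcher as an assumed stand-in for the final ratio; sort_tokens and token_sort_ratio_sketch are names made up for this example.

```python
from difflib import SequenceMatcher


def sort_tokens(text: str) -> str:
    """Lower-case, split on whitespace, sort alphabetically, rejoin."""
    return " ".join(sorted(text.lower().split()))


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Compare two strings after normalizing their word order."""
    matcher = SequenceMatcher(None, sort_tokens(s1), sort_tokens(s2))
    return round(100 * matcher.ratio())


print(sort_tokens("new york mets vs atlanta braves"))
# atlanta braves mets new vs york
print(token_sort_ratio_sketch("New York Mets vs Atlanta Braves",
                              "Atlanta Braves vs New York Mets"))
# 100
```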

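fuzz.token_set_ratio(): tokenizes both strings and compares shared and leftover tokens

SeatGeek: Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.

In the example below, t0 is the sorted intersection of the two token sets, t1 is t0 followed by the sorted leftover tokens of the first string, and t2 is t0 followed by the sorted leftover tokens of the second string. The three strings are compared pairwise and the highest score is returned.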

t0 = "angels mariners"
t1 = "angels mariners vs"
t2 = "angels mariners anaheim angeles at los of seattle"
fuzz.ratio(t0, t1) ⇒ 90
fuzz.ratio(t0, t2) ⇒ 46
fuzz.ratio(t1, t2) ⇒ 50
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners") ⇒ 90
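The construction of t0, t1 and t2 can be reproduced with Python sets. The sketch below is a simplified illustration under stated assumptions: it splits on whitespace only, uses difflib.SequenceMatcher as the scorer, and uses helper names (build_comparison_strings, token_set_ratio_sketch) invented here. The final score may differ by a point from the figure quoted above depending on rounding.

```python
from difflib import SequenceMatcher


def simple_ratio(s1: str, s2: str) -> int:
    """Whole-string similarity as an integer from 0 to 100."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


def build_comparison_strings(s1: str, s2: str) -> tuple:
    """Return (t0, t1, t2): shared tokens, then shared + each remainder."""
    tokens1, tokens2 = set(s1.lower().split()), set(s2.lower().split())
    shared = " ".join(sorted(tokens1 & tokens2))  # intersection
    rest1 = " ".join(sorted(tokens1 - tokens2))   # remainder of s1
    rest2 = " ".join(sorted(tokens2 - tokens1))   # remainder of s2
    return shared, f"{shared} {rest1}".strip(), f"{shared} {rest2}".strip()


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Best pairwise score among the three comparison strings."""
    t0, t1, t2 = build_comparison_strings(s1, s2)
    return max(simple_ratio(t0, t1), simple_ratio(t0, t2), simple_ratio(t1, t2))


a = "mariners vs angels"
b = "los angeles angels of anaheim at seattle mariners"
print(build_comparison_strings(a, b))
print(token_set_ratio_sketch(a, b))
```

Because t0 contains only the shared tokens, a short string whose words all appear in a longer one scores highly, which is why this method tolerates extra descriptive words.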
