print("apple" == "apple") # True
print("apple" == "orange") # False

However, things get more complicated if we wish to match a similar string, one that's merely close enough:

print("apple" == "apples") # False
print("orange" == "ornage") # False
print("pear" == "paer") # False

As humans, we can tell that "apple" and "apples" are very similar, "orange" and "ornage" are very similar (likely a typo), and the same goes for "pear" and "paer" (another typo). Unfortunately, Python recognises them as completely different strings.

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]

Given a list of words, we want some way to check which words are closer to one another (we're expecting "apple", "apples" and "snapple" to be pretty close!). In this article, we'll explore a simple method to do this.

Cosine Similarity Formula

cos(a, b) = dot(a, b) / magnitude(a) / magnitude(b)

a and b are vectors, and dot refers to the dot product, while magnitude refers to the magnitude of the vector. Don't worry if these sound foreign to you, as more explanation will come.

The cosine similarity function returns a float between 0 and 1: a score of 0 means the 2 words are not similar at all, while 1 means the 2 words are exactly the same. A 0.9 means 2 words are very similar, while a 0.3 means they are not.
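To see the arithmetic in action, here's a quick worked example with two small made-up vectors (plain lists here, just to show the formula; the hand-picked numbers are illustrative only):

```python
import math

a = [1, 2]
b = [2, 1]

dot = sum(x * y for x, y in zip(a, b))    # 1*2 + 2*1 = 4
mag_a = math.sqrt(sum(x * x for x in a))  # sqrt(5)
mag_b = math.sqrt(sum(x * x for x in b))  # sqrt(5)
score = dot / (mag_a * mag_b)
print(score)  # approximately 0.8
```

The two vectors point in similar (but not identical) directions, so the score lands close to 1 without reaching it.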

Vectorising Our Words

Let's first find a way to vectorize our words — in other words, we transform each word into a dictionary, like this:

"aab" becomes
{"a":2, "b":1, "aa":1, "ab":1}
"test" becomes
{"t":2, "e":1, "s":1, "te":1, "es":1, "st":1}
"apple" becomes
{"a":1, "p":2, "l":1, "e":1, "ap":1, "pp":1, "pl":1, "le":1}

Note that each string is broken up into 1) individual letters (unigrams) and also 2) groups of 2 letters (bigrams). These unigrams and bigrams become the keys of the dictionary, while the counts of each unigram or bigram become the values of the dictionary.

For instance, in "aab", "a" appears twice, so there is an "a":2 key-value pair. "b" appears only once, so there is a "b":1 key-value pair. Now we move on to bigrams — there are only 2 bigrams, "aa" and "ab", each appearing once, so the key-value pairs "aa":1 and "ab":1 appear.
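Each bigram is simply a two-character slice of the string, which we can check by hand:

```python
word = "aab"
# Every bigram is word[i:i+2]; the last valid start index is len(word) - 2
bigrams = [word[i:i+2] for i in range(len(word) - 1)]
print(bigrams)  # ['aa', 'ab']
```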

def vectorize(word, ngrams=[1,2]):
    output = {}
    for ngram in ngrams:
        # Slide a window of size `ngram` across the word;
        # the +1 ensures the final n-gram is included.
        for i in range(len(word)-ngram+1):
            key = word[i:i+ngram]
            if key not in output:
                output[key] = 1
            else:
                output[key] += 1
    return output
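As a side note, the standard library's collections.Counter builds the same dictionary in fewer lines — handy for double-checking our examples (this alternative formulation is a sketch, not part of the method above):

```python
from collections import Counter

def vectorize_counter(word, ngrams=(1, 2)):
    # Count every n-gram slice in one pass; equivalent to the loop above
    return Counter(word[i:i+n] for n in ngrams for i in range(len(word) - n + 1))

print(vectorize_counter("aab"))  # Counter({'a': 2, 'b': 1, 'aa': 1, 'ab': 1})
```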

Now that we have the vectorize function, let's vectorize all the words that we want to look at:

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
vectors = { w: vectorize(w) for w in words }

Here, vectors is a dictionary whose keys are the words and whose values are the vectors of each word.

{
    "apple": {"a":1, "p":2, ....},
    "orange": {"o":1, "r":1, ...},
    ...
}

Applying The Cosine Similarity Algorithm

Now that we can turn our words into vectors, we can start writing the cosine similarity function:

def cos(a, b):

    def dot(a, b):
        # Sum of products over the union of keys;
        # a key missing from either vector contributes 0.
        out = 0
        for key in {**a, **b}:
            out += a.get(key,0) * b.get(key,0)
        return out

    def magnitude(vec):
        # Euclidean length: square root of the sum of squared counts.
        out = 0
        for v in vec.values():
            out += v ** 2
        return out ** 0.5

    return dot(a, b) / magnitude(a) / magnitude(b)
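A quick sanity check: any vector compared with itself should score (approximately) 1, since the dot product then equals the squared magnitude. Using the "apple" vector from earlier:

```python
vec = {"a": 1, "p": 2, "l": 1, "e": 1, "ap": 1, "pp": 1, "pl": 1, "le": 1}
dot = sum(v * v for v in vec.values())  # 11: dot product of vec with itself
magnitude = dot ** 0.5                  # sqrt(11)
score = dot / magnitude / magnitude
print(score)  # approximately 1.0
```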

Let's test this function with a couple of words.

a = "apple"
b = "apples"
c = "orange"

Here, we are expecting "apple" and "apples" to have a higher cosine similarity score than "apple" and "orange".

av = vectorize(a)
bv = vectorize(b)
cv = vectorize(c)
print(cos(av, bv)) # 0.9198662110077999
print(cos(av, cv)) # 0.18181818181818182

Here, "apple" and "apples" return a score of 0.92, while "apple" and "orange" return a score of 0.18 — "apple" and "apples" have a much higher similarity than "apple" and "orange" as expected.

Visualising Similarity Score Between Words

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
vectors = { w: vectorize(w) for w in words }

We have here a dictionary vectors with keys being each word, and values being the vectorized version of each word.

import pandas as pd

data = []
for word1, vec1 in vectors.items():
    row = []
    for word2, vec2 in vectors.items():
        row.append(cos(vec1, vec2))
    data.append(row)
print(pd.DataFrame(data, columns=words, index=words))

Comparing every word against every other word and putting the scores neatly in a pandas dataframe, we get the following results:

         apple     orange    pear      apples    snapple     zzzzz
apple    1.000000  0.181818  0.455842  0.919866  0.856349    0.0
orange   0.181818  1.000000  0.341882  0.167248  0.233550    0.0
pear     0.455842  0.341882  1.000000  0.419314  0.390360    0.0
apples   0.919866  0.167248  0.419314  1.000000  0.859338    0.0
snapple  0.856349  0.233550  0.390360  0.859338  1.000000    0.0
zzzzz    0.000000  0.000000  0.000000  0.000000  0.000000    1.0

Here, scores of 1 appear because we are comparing each word with itself — exact matches return 1, which means perfectly similar. Let's ignore the diagonal containing the bunch of 1s.

We can see that "apples" is the most similar to "apple" with a score of 0.92, followed by "snapple" with a score of 0.86. Other words like "orange" or "pear" are less similar to "apple" and thus have lower scores. A word like "zzzzz" (not a real word, but here to make a point) has no n-grams in common with the others at all, so its scores are 0.
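Putting this to use, we can look up the closest match for a misspelt word by taking the word with the highest score. The snippet below is a standalone sketch: "aple" is a hypothetical typo (not from our list), and compact versions of vectorize and cos are repeated here so it runs on its own:

```python
def vectorize(word, ngrams=(1, 2)):
    output = {}
    for ngram in ngrams:
        for i in range(len(word) - ngram + 1):
            key = word[i:i+ngram]
            output[key] = output.get(key, 0) + 1
    return output

def cos(a, b):
    # Keys missing from either vector contribute 0, so iterating a's keys suffices
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    magnitude = lambda v: sum(x * x for x in v.values()) ** 0.5
    return dot / (magnitude(a) * magnitude(b))

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
query = "aple"  # hypothetical typo
best = max(words, key=lambda w: cos(vectorize(query), vectorize(w)))
print(best)  # "apple"
```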

The Code in One Place

Here's a shortened version of the code all in one place:
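This consolidates the vectorize and cos functions and the pandas comparison from the sections above:

```python
import pandas as pd

def vectorize(word, ngrams=[1, 2]):
    # Count every unigram and bigram in the word
    output = {}
    for ngram in ngrams:
        for i in range(len(word) - ngram + 1):
            key = word[i:i+ngram]
            output[key] = output.get(key, 0) + 1
    return output

def cos(a, b):
    # Cosine similarity between two dict-based vectors
    def dot(a, b):
        return sum(a.get(key, 0) * b.get(key, 0) for key in {**a, **b})
    def magnitude(vec):
        return sum(v ** 2 for v in vec.values()) ** 0.5
    return dot(a, b) / magnitude(a) / magnitude(b)

words = ["apple", "orange", "pear", "apples", "snapple", "zzzzz"]
vectors = {w: vectorize(w) for w in words}
data = [[cos(v1, v2) for v2 in vectors.values()] for v1 in vectors.values()]
print(pd.DataFrame(data, columns=words, index=words))
```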

Conclusion

Hope this was helpful, and thanks for reading!