Understanding and acting upon text has long been intrinsic to our daily lives. From following a cooking recipe to building a cash flow statement, an understanding of syntax (grammatical structure) and semantics (the rules governing a sentence’s meaning) plays a vital role in comprehension, even if only intuitively.
But how would a computer perform such a task? Natural Language Processing (NLP) is a broad field of study that gives machines linguistic capabilities such as interpreting text and its meaning.
Let’s take a step back and first answer a very basic question. How would you as a human being compare two words?
For our brain, the job can be simplified as comparing the word against already known patterns of phonetics, meaning, and appearance, forming connections to other brain areas. The emergent network categorizes the unknown word and creates mental hubs that make it easier to retrieve the information when required.
Let’s check some of the existing concepts used by NLP to do the job with the following words: back and bank.
In many cases, words first need to be tokenized, which in this simple case means chopping the character string into pieces of adjacent words or letters of length n (also known as n-grams). As an example, the 2-grams of the word back are ba, ac, and ck.
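A minimal sketch of character n-gram tokenization in Python (the helper name `ngrams` is our own choice):

```python
def ngrams(word, n=2):
    """Return the character n-grams of a word as a list."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(ngrams("back"))  # ['ba', 'ac', 'ck']
print(ngrams("bank"))  # ['ba', 'an', 'nk']
```

Note that a sliding window of size n over a word of length m yields m − n + 1 grams.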
In general, words are often separated by white space. However, not all white space is the same. For example, vice versa carries an individual meaning only as a combination of multiple words. Different tokenisations are used to split a text stream into units with separate meanings.
Let’s use a gram size of 2 and compare the words back and bank. The similarity scores vary from 0.0 to 1.0, where 1.0 means that the words are identical and 0.0 that they are completely different.
There are a number of ways to measure the similarity of two words based on their n-grams (fixed-length tokens) or q-grams (variable-length tokens or entire words), both syntactically and semantically. We are going to introduce three common syntactic similarity metrics below.
The Jaccard similarity compares the similarity and diversity of the words: the size of the intersection of their gram sets divided by the size of their union.
Intersection (ba): 1
Union (ba, ac, ck, an, nk): 5
Jaccard Similarity: 1/5 = 0.2
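The calculation above can be sketched in Python as the standard set-based Jaccard index over 2-grams (the function name `jaccard_similarity` is ours):

```python
def jaccard_similarity(a, b, n=2):
    """Set-based Jaccard similarity over character n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    # Size of the intersection divided by the size of the union.
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(jaccard_similarity("back", "bank"))  # 0.2
```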
The cosine similarity considers strings as vectors. The similarity score is the cosine of the angle between the vectors, which makes this metric a measure of orientation rather than magnitude.
The vector model of the word depends on the distinct grams that represent the word. For simplicity, we are not considering the term frequency–inverse document frequency (TF-IDF) matrix here, a numerical statistic that intends to reflect how important a letter or n-gram is within a word (the “magnitude” of the vector).
With the vector model, the dot product can be calculated to find the cosine of the angle between them.
It follows that identical words will be represented by the same vectors. In our case, over the distinct grams (ba, ac, ck, an, nk), back maps to (1, 1, 1, 0, 0) and bank to (1, 0, 0, 1, 1):
Cosine Similarity: 1/(√3 × √3) = 1/3 ≈ 0.33
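A minimal sketch of this calculation in Python, using binary occurrence vectors over the distinct grams of both words (helper name `cosine_similarity` is our own):

```python
from math import sqrt

def cosine_similarity(a, b, n=2):
    """Cosine similarity between binary n-gram occurrence vectors."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    vocab = sorted(grams_a | grams_b)
    # Binary vector model: 1 if the gram occurs in the word, else 0.
    va = [1 if g in grams_a else 0 for g in vocab]
    vb = [1 if g in grams_b else 0 for g in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return dot / norm

print(round(cosine_similarity("back", "bank"), 2))  # 0.33
```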
The Levenshtein distance counts the number of deletions, insertions, or substitutions required to transform one string into another string.
Transforming back into bank only requires replacing the letter c with the letter n:
Transformation Costs (back, bank): 1
Normalized Levenshtein Similarity: 1 − cost/length(longest word) = 1 − (1/4) = 0.75
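A minimal sketch in Python, using the classic dynamic-programming edit distance and normalizing by the length of the longer word (the function name `levenshtein_similarity` is ours):

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity: 1 - edit_distance / len(longest word)."""
    # Row-by-row dynamic programming over insertions, deletions, substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))

print(levenshtein_similarity("back", "bank"))  # 0.75
```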
Jaccard similarity is usually used to compare binary feature vectors, based on the union and intersection of sets. When it comes to comparing documents, Jaccard can be a good representation of word occurrence, but it is not an adequate similarity metric for documents that share the same set of words yet differ in meaning.
The cosine similarity is widely used to compare longer documents. A higher cosine similarity between two texts usually means that the documents have more words in common. However, when the magnitude (word count) of a document is a relevant factor for the analysis, it is not the best approach available.
The Levenshtein distance, on the other hand, is applicable in spelling correction and information retrieval, but it is sensitive to abbreviations, which incur large edit costs despite identical meaning.
These and other techniques are widely used for search, information extraction, identifying spam, automated data preparation, etc.
We just had a look at some of the most widely used similarity metrics, focusing on basic syntactic ones; there are many more out there. Naturally, quite a few more techniques are required to automate data preparation as we do at Onedot. Stay tuned!
We love hearing back from the community, so don’t hesitate to leave a comment. We hope you enjoyed it, and thanks for reading!