Text similarity is used in many sectors, for example in search engines, in customer service, and in legal matters (where it links related documents).
Consider the following two sentences:
Dear Ms. John Doe, can you confirm your purchase of a plane ticket to Hong Kong?
Ms. A approved purchase of a plane ticket to Hong Kong.
A human can easily tell that these two sentences convey a very similar meaning despite being phrased in completely different ways. In essence, the goal is to compute how close two pieces of text are, either in meaning or in surface form. The former is referred to as semantic similarity and the latter as syntactic similarity.
In computational linguistics, there are multiple ways to compute features that capture the semantics of documents, and multiple algorithms that capture their dependency structure in order to focus on their meaning. Some of these are listed below, each with a score out of 7 for its outcome.
- Jaccard Similarity :: 1/7
- Different embeddings + K-means :: 2/7
- Different embeddings + Cosine Similarity :: 3/7
- Word2Vec + Smooth Inverse Frequency + Cosine Similarity :: 4/7
- Different embeddings + LSI + Cosine Similarity :: 3/7
- Different embeddings + LDA + Jensen-Shannon distance :: 4/7
- Different embeddings + Word Mover's Distance :: 5/7
- Different embeddings + Variational Autoencoder (VAE) :: 5/7
- Different embeddings + Universal Sentence Encoder :: 5/7
- Different embeddings + Siamese Manhattan LSTM :: 6/7
- BERT embeddings + Cosine Similarity :: 7/7
- Knowledge-based Measures :: 7/7
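As an illustration of the simplest entry in the list, here is a minimal sketch of Jaccard similarity applied to the two example sentences above. The tokenizer (lowercasing and keeping alphanumeric runs) and the function names are assumptions made for this sketch, not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty texts: define the score as 0
    return len(tokens_a & tokens_b) / len(union)

s1 = "Dear Ms. John Doe, can you confirm your purchase of a plane ticket to Hong Kong?"
s2 = "Ms. A approved purchase of a plane ticket to Hong Kong."
score = jaccard_similarity(s1, s2)
print(round(score, 3))  # 9 shared words out of 17 distinct words -> 0.529
```

The score only reflects word overlap, so it says nothing about whether "confirm" and "approved" are related in meaning, which is one reason this method sits at the bottom of the list.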
One approach we followed to compute pairwise text similarity (https://github.com/atheo89/Text-Maching-Similarity-NLP/blob/main/syntactic-text-similarity.ipynb) was to transform the documents into term frequency-inverse document frequency (TF-IDF) vectors and then compute the cosine similarity between them.
TF-IDF: a statistical measure of how relevant a word is to a document within a collection of documents. It is obtained by multiplying two metrics: the term frequency (TF), how many times a word appears in a document, and the inverse document frequency (IDF), which lowers the weight of words that appear in many documents of the collection.
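The multiplication described above can be sketched in a few lines of plain Python. This is a simplified illustration: it assumes TF is the raw count divided by the document length and IDF is log(N / df), whereas libraries such as scikit-learn use smoothed variants. The toy corpus and the function name `tf_idf` are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already split into words.
corpus = [
    "confirm your plane ticket".split(),
    "approved purchase of a plane ticket".split(),
    "refund for a train ticket".split(),
]
weights = tf_idf(corpus)
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```

In this corpus "ticket" occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing, while a word unique to one document, such as "confirm", gets the highest weight there.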
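Cosine Similarity: measures the angular separation (θ) between two N-dimensional vectors by computing the cosine of the angle between them. The vector for each text is built from the unique word roots present in the texts. The cosine similarity index is the dot product of the two vectors divided by the product of their lengths, cos θ = (A · B) / (‖A‖ ‖B‖), so for non-negative vectors such as TF-IDF it ranges from 0 (no terms in common) to 1 (same direction).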
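A minimal sketch of the computation, assuming both texts have already been turned into numeric vectors of equal length. Note that the dot product is divided by the product of the two vector lengths; the function name and the sample vectors are chosen for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)

# Vectors pointing the same way score (approximately) 1 regardless of length.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# Vectors with no component in common score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because only the direction matters, a long document and a short one that use the same words in the same proportions still come out as highly similar.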
We fed the algorithm JSON data from two categories (contact and respond), and the results were surprisingly accurate. The heat map below shows the percentage of similarity between the texts: green marks the texts in the expected pair, and yellow marks identical texts, which of course score 100%.