Segments of text are related if they share information such as entities, concepts, objects, or actions. In some business scenarios, it is useful to find text that is similar to a text of interest. The related-text functionality can find similar text at the clause, sentence, or document level. Relatedness is based on semantic, lexical, or cosine similarity.
Lexical similarity is based on the principle that similar words occur in similar contexts. The following facets are considered when determining lexical similarity:
• Word frequency (how often the words occur)
• Word proximity (the distance between the words)
• Word order (the order in which the words occur)
• Word co-occurrence (the words that occur together regardless of order)
• Word distribution (how the words are distributed over the text segments)
• Vector space model (converting text into vectors and then using a geometric measure to calculate the distance between them; a vector is a quantity that has both direction and magnitude)
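A few of the facets above can be illustrated with a minimal Python sketch. It is not the method used by any particular product: the function names (tokenize, word_frequency, shared_word_ratio, min_word_distance) are hypothetical, the tokenizer is deliberately naive, and the overlap ratio is only one plausible way to score order-independent co-occurrence.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_frequency(text):
    """Word frequency: how often each word occurs in the segment."""
    return Counter(tokenize(text))


def shared_word_ratio(text_a, text_b):
    """Order-independent overlap: shared distinct words / all distinct words."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 0.0  # assumption: two empty segments are treated as unrelated
    return len(words_a & words_b) / len(words_a | words_b)


def min_word_distance(text, first, second):
    """Word proximity: smallest distance in tokens between two words, or None."""
    tokens = tokenize(text)
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    if not pos_first or not pos_second:
        return None  # one of the words does not occur in the segment
    return min(abs(i - j) for i in pos_first for j in pos_second)
```

For example, shared_word_ratio("the cat sat", "the dog sat") is 0.5, because two of the four distinct words are shared, and the result does not change if the words in either segment are reordered.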
Semantic similarity is a measure of the likeness in meaning between units of text. The semantic similarity of text segments is calculated using information from lexical and thesaurus databases (which record the meanings of words and the relations between them) and from corpus linguistics (the study of language through collections of text).
Cosine similarity is calculated by converting text into vectors (a vector is a quantity that has both direction and magnitude) and then measuring the cosine of the angle between the vectors.
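The calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: text is turned into a simple word-count vector (one of several possible vectorizations), and the names text_to_vector and cosine_similarity are hypothetical.

```python
import math
import re
from collections import Counter


def text_to_vector(text):
    """Convert text into a sparse word-count vector (assumed bag-of-words model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse word-count vectors."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```

With this sketch, "the cat sat" and "the dog sat" score 2/3, since they share two of three words, while two segments with no words in common score 0. Because word counts are never negative, the result always falls between 0 and 1.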
By using the related-text functionality, we can discover previously unknown information.
Relevant references from Wikipedia:
Semantic similarity or semantic relatedness is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content as opposed to similarity which can be estimated regarding their syntactical representation (e.g. their string format). http://en.wikipedia.org/wiki/Semantic_similarity
In linguistics, lexical similarity is a measure of the degree to which the word sets of two given languages are similar. A lexical similarity of 1 (or 100%) would mean a total overlap between vocabularies, whereas 0 means there are no common words. http://en.wikipedia.org/wiki/Lexical_similarity
Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1]. http://en.wikipedia.org/wiki/Cosine_similarity
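The definition quoted above corresponds to the standard formula below. The symbols are notation introduced here for illustration: A and B are two n-dimensional vectors (for example, word-count vectors for two text segments) and θ is the angle between them.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

Multiplying A or B by a positive constant scales the numerator and the denominator by the same factor, which is why the result depends only on orientation and not on magnitude. When every component is non-negative, as with word counts, the numerator cannot be negative, so the value stays within [0, 1].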