Clustering is the process of grouping data so that the items in the same group (i.e. a cluster) are similar to one another on certain characteristics and dissimilar to the items in other groups. In other words, the objective of clustering is to maximize similarity within groups and minimize similarity between groups. One benefit of clustering is that it helps identify the features that distinguish one group from another.
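The objective described above can be made concrete with a small sketch. It assumes similarity is measured as Euclidean distance (smaller distance means more similar) and uses hypothetical toy points and group names; real data would usually need a different measure.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: 2-D points already assigned to two groups.
groups = {
    "a": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def within_group_distance(groups):
    """Average distance between pairs of points that share a group."""
    return mean(
        dist(p, q)
        for points in groups.values()
        for p, q in combinations(points, 2)
    )

def between_group_distance(groups):
    """Average distance between pairs of points from different groups."""
    return mean(
        dist(p, q)
        for (_, ps), (_, qs) in combinations(groups.items(), 2)
        for p in ps
        for q in qs
    )

within = within_group_distance(groups)
between = between_group_distance(groups)
```

For a good grouping, `within` should be clearly smaller than `between`; moving a point into the wrong group would raise the first number and lower the second.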

Examples of clustering are all around us, for instance:

• A corporation is divided into departments such as sales, marketing, and customer support (i.e. clusters)

• A grocery store is laid out in departments such as dairy, meat, produce, and bakery (i.e. clusters)

• In a shopping mall, most restaurants are likely to be located in the food court (i.e. a cluster)

Text clustering groups similar pieces of text into the same cluster. Each cluster consists of a number of phrases. A clustering result is considered better when the phrases within a cluster are more similar to one another than they are to the phrases in other clusters. Text clustering has the following benefits:

• Discovery – identifying previously unknown issues or opportunities

• Topic extraction – consistently and accurately identifying topics within textual data

• Summarization – synthesizing similar topics together to retain the most important concepts
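As an illustration of grouping similar phrases, here is a minimal sketch rather than a production method. It assumes a bag-of-words representation, cosine similarity, and a greedy single-pass grouping; the function names, the `threshold` value, and the sample phrases are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def vectorize(phrase):
    """Bag-of-words vector: lowercase word -> count."""
    return Counter(phrase.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_phrases(phrases, threshold=0.3):
    """Greedy single-pass clustering: each phrase joins the first cluster
    whose seed phrase is similar enough, otherwise it starts a new cluster."""
    clusters = []  # list of (seed_vector, member_phrases)
    for phrase in phrases:
        vec = vectorize(phrase)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(phrase)
                break
        else:
            clusters.append((vec, [phrase]))
    return [members for _, members in clusters]

# Hypothetical customer comments.
phrases = [
    "battery drains too fast",
    "delivery arrived late",
    "battery life is too short",
    "late delivery again",
]
result = cluster_phrases(phrases)
```

With these sample phrases, the battery comments end up in one cluster and the delivery comments in another, without either topic being named in advance. Word overlap is a crude similarity signal, so this sketch will miss phrases that express the same idea in different words.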

Clustering vs. Categorization

Categorization is suitable when you want to assign new text to a known category; clustering is useful when you want to discover structure that was not previously known.
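The contrast can be sketched with a toy categorizer. This is only an assumption-laden illustration: the `CATEGORIES` keyword sets and the `categorize` function are hypothetical, and real categorizers typically use trained models rather than keyword overlap.

```python
# Hypothetical known categories, each described by a few keywords.
CATEGORIES = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "shipping": {"delivery", "package", "late", "tracking"},
}

def categorize(text, categories=CATEGORIES):
    """Assign text to the known category sharing the most words with it,
    or return None when no category matches at all."""
    words = set(text.lower().split())
    best, best_overlap = None, 0
    for name, keywords in categories.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best, best_overlap = name, overlap
    return best

label = categorize("my package delivery is late")   # fits a known category
unknown = categorize("the app crashes on login")    # fits none of them
```

The second comment falls outside every predefined category, so categorization alone cannot surface it as a topic. Clustering such unmatched text is one way to discover that a new group, such as app crashes, exists.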