What are some options for clustering on categorical data? What if the dataset contains a combination of numeric and categorical features?

  • K-Modes: K-Modes is a modification of K-Means for datasets in which every feature is categorical. Instead of numerical distance, it measures dissimilarity as the number of features on which two observations mismatch, and it represents each cluster by the mode of each feature rather than the mean. Cluster assignment and center updates then iterate in the same way as K-Means.
  • K-Medoids (PAM Clustering): PAM stands for Partitioning Around Medoids. Each cluster is represented by a medoid, an actual observation from the dataset, and the algorithm works with any pairwise dissimilarity measure. For mixed data it is commonly paired with the Gower distance, which computes a partial dissimilarity for each feature according to its type (for example, a range-normalized absolute difference for numeric features and a match/mismatch indicator for categorical ones) and averages them. PAM is more robust to outliers than K-Means but can be computationally expensive on large datasets.
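
To make the two dissimilarity measures concrete, here is a minimal pure-Python sketch on a made-up six-row dataset. All function names and the toy data are hypothetical, chosen only for illustration. The Gower variant shown is one common formulation, an unweighted average with range-normalized numeric features. The medoid search is an exhaustive stand-in for PAM's build/swap heuristic, so it is only workable on tiny inputs.

```python
from collections import Counter
from itertools import combinations


def matching_dissimilarity(a, b):
    """K-Modes style dissimilarity: number of features on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category per feature; plays the role of a K-Modes cluster center."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def gower_distance(a, b, is_numeric, ranges):
    """Unweighted average of per-feature partial dissimilarities, each in [0, 1]."""
    parts = []
    for x, y, numeric, value_range in zip(a, b, is_numeric, ranges):
        if numeric:
            # Numeric feature: absolute difference scaled by the observed range.
            parts.append(abs(x - y) / value_range if value_range else 0.0)
        else:
            # Categorical feature: 0 on a match, 1 on a mismatch.
            parts.append(0.0 if x == y else 1.0)
    return sum(parts) / len(parts)


def best_medoids(dist, k):
    """Exhaustive k-medoids: try every set of k rows as medoids (tiny data only)."""
    n = len(dist)

    def cost(medoids):
        # Total distance from each row to its nearest medoid.
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    medoids = min(combinations(range(n), k), key=cost)
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Hypothetical mixed-type rows: (age, colour, size).
rows = [
    (25, "red", "S"),
    (27, "red", "S"),
    (28, "red", "M"),
    (60, "blue", "L"),
    (62, "blue", "L"),
    (58, "green", "L"),
]
is_numeric = (True, False, False)
# Observed range of each numeric column; unused (None) for categorical columns.
ranges = [
    max(col) - min(col) if numeric else None
    for col, numeric in zip(zip(*rows), is_numeric)
]

# Pairwise Gower distance matrix, then medoid-based clustering on top of it.
dist = [[gower_distance(a, b, is_numeric, ranges) for b in rows] for a in rows]
medoids, labels = best_medoids(dist, k=2)
print("medoid rows:", [rows[m] for m in medoids])
print("labels:", labels)
```

Because the clustering step only sees the precomputed distance matrix, the same medoid search would work with any other dissimilarity. That separation is what lets a medoid-based method handle mixed data types.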
