Data mining algorithms in plain English

It may not be interesting if you're a data mining guru, but this explanation of the top 10 most influential data mining algorithms in plain English is a good read for the rest of us, though whether it really counts as “plain English” is debatable.

Here's a good one, on k-means:

You might be wondering:
 
Given this set of vectors, how do we cluster together patients that have similar age, pulse, blood pressure, etc.?
 
Want to know the best part?
 
You tell k-means how many clusters you want. K-means takes care of the rest.
 
How does k-means take care of the rest? k-means has lots of variations to optimize for certain types of data.
 
At a high level, they all do something like this:
  1. k-means picks points in multi-dimensional space to represent each of the k clusters. These are called centroids.
  2. Every patient will be closest to 1 of these k centroids. They hopefully won’t all be closest to the same one, so they’ll form a cluster around their nearest centroid.
  3. What we have are k clusters, and each patient is now a member of a cluster.
  4. k-means then finds the center for each of the k clusters based on its cluster members (yep, using the patient vectors!).
  5. This center becomes the new centroid for the cluster.
  6. Since the centroid is in a different place now, patients might now be closer to other centroids. In other words, they may change cluster membership.
  7. Steps 2-6 are repeated until the centroids no longer change, and the cluster memberships stabilize. This is called convergence.
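
For the code-inclined, the seven steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the method from the quoted article: the function names, the choice to start from k randomly sampled patients, the squared Euclidean distance, and the made-up (age, pulse, blood pressure) numbers are all assumptions of mine.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))

def k_means(vectors, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k starting centroids (assumption: k distinct input vectors).
    centroids = rng.sample(vectors, k)
    assignments = None
    for _ in range(max_iterations):
        # Steps 2-3: every vector joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Step 7: stop once cluster memberships no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Steps 4-5: move each centroid to the center of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
        # Step 6 happens implicitly: the next pass re-checks memberships.
    return centroids, assignments

# Hypothetical patients as (age, pulse, systolic blood pressure).
patients = [
    (25, 60, 110), (30, 62, 115), (28, 65, 112),
    (70, 85, 150), (68, 88, 155), (75, 90, 160),
]
centroids, assignments = k_means(patients, k=2)
print(assignments)
```

On this toy data the three younger patients land in one cluster and the three older ones in the other. Real data would usually need its features scaled first, since otherwise whichever measurement has the largest numbers dominates the distance.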
     

This seems like a great idea for a book: the central data algorithms of the third industrial revolution, this networked, online age. One chapter per algorithm, with a discussion of how it manifests itself on the key websites, applications, hardware, and other services we use all the time now. If you are a data mining expert in need of someone to be the “plain English” side of a writing team, call me maybe.