Similar Posts using the Jaccard Index

Today, I've added a Similar Posts section when viewing an individual post, which is calculated using the Jaccard Index. The Jaccard Index is used to compare the similarity and diversity of data sets, and in my case the data sets are the post tags. The calculation is quite simple:

Jaccard Index

The reason why the Jaccard Index is so well suited for this is because it doesn't just calculate how many tags appear in each post, but instead it calculates this based on the total number of tags in both posts. An example would be that posts matching 4 out of 5 tags would appear higher than posts matching 5 out of 20 tags.

My database structure is pretty typical in the way that I have one table for posts and another for tags, which are linked via the post ID. When viewing a post I already have the set of tags to compare the other posts to, so from these tags I can generate the SQL via:

$tags = array('tag1','tag2','tag3');
$sql = "SELECT `posts`.* FROM `posts`,`tags` WHERE  `posts`.`id` = `tags`.`post_id` AND `tags`.`post_id` != '{CURRENT_ID}' GROUP BY `posts`.`id` ORDER BY SUM(IF(" . implode('||', array_map(create_function('$a', 'return "`tags`.`tag`='" . $a . "'";'), $tags)) . ",1,0))/(COUNT(`tags`.`tag`)+" . count($tags) . ") DESC LIMIT 0,3";

Which will produce the following SQL:

SELECT `posts`.* FROM `posts`,`tags` WHERE `posts`.`id` = `tags`.`post_id` AND `tags`.`post_id` != '{CURRENT_ID}' GROUP BY `posts`.`id` ORDER BY SUM(IF(`tags`.`tag`='tag1'||`tags`.`tag`='tag3'||`tags`.`tag`='tag3',1,0))/(COUNT(`tags`.`tag`)+3) DESC LIMIT 0,3

When this SQL is run it will return the top 3 posts (excluding the current post) in order of the tag similarity score. You can see it in action when viewing any individual post on my blog.

08-Jan-2014 at 4:11pm
  • (optional)
  • Security Code