
Ok, I made something interesting (to me) last night. It’s probably not worth taking the time to write up, but someday someone might find it and think it useful. I apologize for the messiness of the code; if I take this further and clean it up, I’ll update this post.

I was working on a few text-analysis features for Marked and decided I wanted to be able to show repeated words on a per-paragraph basis. The following is the experiment I did as a proof-of-concept. I decided that I wanted to do this in JavaScript/jQuery for various reasons, so my existing Ruby scripts were mostly useless. Thus, I devised a way to handle it entirely in a WebKit browser.

Demo

For the purposes of demonstration, I set up a single-page, dynamic version of this. You can enter any article URL and see the processing take place. Hover over a bolded word to see where it repeats in the paragraph. Note that it’s pulling through a proxy of Marky the Markdownifier, and that some markup will return blanks. Obviously, it also helps to have a lot of text in the article you’re analyzing. Hard up for ideas? Try this, or this. Give it a few seconds to load; it pulls in the content in the background and I haven’t put a progress indicator on it yet.

Breakdown

I started with a Porter Stemmer using a script from tartarus.org. Stemming reduces a word to a root form, so that plurals, conjugations and other variants of a word all match each other. For example, “repeats”, “repeated” and “repeating” all stem to “repeat”. From there, I do a frequency check to find word roots used more than once within the block being processed, creating an array of the repeated roots. Then I re-parse the block, one word at a time, adding some markup to words whose root is found in the previously-created array.
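
Here’s a minimal sketch of that count-the-roots step on its own. It assumes a toy suffix-stripper (`toyStem`, a made-up stand-in, not the tartarus.org Porter stemmer) so the snippet runs without any other scripts. The name `repeatedStems` is also just for illustration.

```javascript
// Toy stand-in for the Porter stemmer: strips a few common English suffixes.
// Assumption for illustration only; the real stemmer handles far more cases.
function toyStem(word) {
  return word.replace(/(ing|ed|es|s)$/, '');
}

// Return the stems that occur more than once in a block of text.
function repeatedStems(text) {
  // Object.create(null) avoids collisions with inherited keys like "constructor"
  var counts = Object.create(null);
  text.toLowerCase().split(/\s+/).forEach(function (raw) {
    var word = raw.replace(/[^a-z']/g, ''); // drop punctuation and digits
    if (!word) { return; }
    var stem = toyStem(word);
    counts[stem] = (counts[stem] || 0) + 1;
  });
  return Object.keys(counts).filter(function (stem) {
    return counts[stem] > 1;
  });
}

// "dogs" matches "dog" and "walking" matches "walked" once they are stemmed
console.log(repeatedStems('The dog walked while other dogs were walking'));
```

With a real stemmer swapped in for `toyStem`, the same counting should catch far more variants than simple suffix stripping does.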

Here’s the main script with a few comments. Until I get this prettied up, I won’t go into a step-by-step. Feel free to lift and improve as you like. Remember to include the porter-stemmer script before this script. Oh, and because I’m lazy, you’ll need to include jQuery as well. The in_array function is stolen from php.js.

// from php.js
function in_array (needle, haystack) {
  // declare key locally so it doesn't leak into the global scope
  for (var key in haystack) {
    if (haystack[key] == needle) {
      return true;
    }
  }
  return false;
}

// short, common words to skip when counting
// entries are lowercase because words are lowercased before the comparison
var stopwords = ['1','2','3','4','5','6','7','8','9','0','one','two','three','four','five','about','actually','always','even','given','into','just','not','im','thats','its','arent','weve','ive','didnt','dont','the','of','to','and','a','in','is','it','you','that','he','was','for','on','are','with','as','i','his','they','be','at','have','this','from','or','had','by','hot','but','some','what','there','we','can','out','were','all','your','when','up','use','how','said','an','each','she','which','do','their','if','will','way','many','then','them','would','like','so','these','her','see','him','has','more','could','go','come','did','my','no','get','me','say','too','here','must','such','try','us','own','oh','any','youll','youre','also','than','those','though','thing','things'];

// takes the text of a paragraph element as input
// returns marked up text with repeated words in 'b' tags with a class matching their "stemmed" root
function checkWords(input) {

  var words = input.split(' ');
  var wordcount = {};

  // build an object to count word frequency
  $.each(words,function(i){
    // swap slashes for spaces, straighten curly apostrophes, strip everything but letters
    var thisWord = String(this).replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase();
    if (!in_array(thisWord,stopwords)) {
      var word = stemmer(thisWord);
      if (wordcount[word] > 0 && word.length) {
        wordcount[word] += 1;
      } else {
        wordcount[word] = 1;
      }
    }
  });
  
  // convert the object to an object array
  // include only words repeated more than once within the paragraph
  var topwords = new Array();
  $.each(wordcount,function(w,i){
    if (i > 1)
      topwords.push({'word':w,'freq':i});
  });

  // convert the object array to a flat array
  var topwordsArr = new Array();
  $.each(topwords,function(i) {
    topwordsArr.push(String(this['word']));
  });
  
  // re-parse the output, marking up repeated words based on their stems
  var output = '';
  $.each(words,function(w) {
    var aWord = String(this);
    // same cleanup as in the counting pass, so the stems line up
    var stripWord = stemmer(aWord.replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase());
    if (in_array(stripWord,topwordsArr))
      output += ' <b class="'+stripWord+'">'+aWord+'</b>';
    else
      output += ' '+aWord;
  });
  return output;
}

(function($){
  // grab common top-level elements
  var grafs = $('p,ul,ol,blockquote,h1,h2,h3,h4,h5,h6,pre code',$('#content'));
  // navigate each element found
  $.each(grafs,function(a,g){
    // if it's a paragraph, we'll process it
    if (grafs[a].tagName == "P") {
      $('#work').append($('<p>').html(checkWords($(grafs[a]).text())));
    // if not, we just stick it back into the DOM
    } else {
      $('#work').append(grafs[a]);
    }
  });
  // set up hover listeners on the 'b' elements
  // the class is pulled from the hovered element
  // all similar words are highlighted on hover
  $('b','#work').hover(function(){
    var thisClass = this.className;
    $('.'+thisClass).addClass('highlight');
  },function(){
    $('.highlight').removeClass('highlight');
  });
})(jQuery);
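
One possible tweak, if the linear scan in `in_array` ever feels slow on long articles: build a lookup object from the stopwords once and test membership with a property check. This is only a sketch. `makeLookup`, `sampleStopwords`, `isStopword` and `contentWords` are names I made up for the example, and the short list stands in for the full stopwords array.

```javascript
// Build a lookup object once so each membership test is a single property check.
function makeLookup(list) {
  var lookup = Object.create(null); // no inherited keys such as "constructor"
  for (var i = 0; i < list.length; i++) {
    lookup[list[i]] = true;
  }
  return lookup;
}

// Shortened, hypothetical stand-in for the full stopwords array
var sampleStopwords = ['the', 'of', 'to', 'and', 'a'];
var isStopword = makeLookup(sampleStopwords);

// Return the lowercased words of a text that are not stopwords
function contentWords(text) {
  return text.toLowerCase().split(/\s+/).filter(function (w) {
    return w.length > 0 && !isStopword[w];
  });
}

console.log(contentWords('The root of the word'));
```

The same lookup could replace the flat array of repeated roots in the second pass, since that is also just a membership test.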

This little bit of CSS will add the highlighting and a nice fade in compatible browsers:

b {
  font-weight:bold;
  -webkit-transition:color .2s ease-in-out;
  -moz-transition:color .2s ease-in-out;
  -o-transition:color .2s ease-in-out;
  transition:color .2s ease-in-out;
}
.highlight { color:rgba(207, 95, 205, 1);}

If I decide to include this in Marked, it will definitely get some revamping. Like I said… proof-of-concept. Check out the demo, though; it’s kind of neat.