Word repetition checking with JavaScript


OK, I made something interesting (to me) last night. It’s probably not worth taking the time to write up, but someday someone might find it useful. I apologize for the messiness of the code; if I take this further and clean it up, I’ll update this post.

I was working on a few text-analysis features for Marked and decided I wanted to be able to show repeated words on a per-paragraph basis. The following is the experiment I did as a proof-of-concept. I decided that I wanted to do this in JavaScript/jQuery for various reasons, so my existing Ruby scripts were mostly useless. Thus, I devised a way to handle it entirely in a WebKit browser.

Demo

For the purposes of demonstration, I set up a single-page, dynamic version of this. You can enter any article URL and see the processing take place. Hover over a bolded word to see where it repeats in the paragraph. Note that it’s pulling through a proxy of Marky the Markdownifier, and that some markup will return blanks. Obviously, it also helps to have a lot of text in the article you’re analyzing. Hard up for ideas? Try this, or this. Give it a few seconds to load; it pulls in the content in the background and I haven’t put a progress indicator on it yet.

Breakdown

I started with a Porter Stemmer using a script from tartarus.org. Stemming reduces a word to a root form, so that plurals, conjugations and other variations of a word all match each other. From there, I do a frequency check to find word roots used more than once within the block being processed, creating an array of the repeated roots. Then I re-parse the block, one word at a time, adding some markup to words whose root is found in the previously created array.
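The three steps just described (stem, count, mark up) can be sketched without jQuery. This is only an illustration under loud assumptions: `naiveStem` is a hypothetical stand-in that strips a few suffixes and is not the Porter algorithm, the function names are made up for this example, and stopword filtering is left out to keep it short.

```javascript
// ASSUMPTION: crude stand-in for the Porter stemmer so the example runs alone.
// It only strips a few common suffixes; it is NOT the real algorithm.
function naiveStem(word) {
  return word.replace(/(ing|ed|es|s)$/, '');
}

// Strip everything except letters and apostrophes, then lowercase.
function clean(raw) {
  return raw.replace(/[^a-z']/gi, '').toLowerCase();
}

// Step 1 and 2: stem each word and count how often each root occurs.
function countStems(text, stem) {
  var counts = Object.create(null); // no inherited keys such as "constructor"
  text.split(/\s+/).forEach(function (raw) {
    var word = clean(raw);
    if (!word) return; // skip tokens that were pure punctuation
    var root = stem(word);
    counts[root] = (counts[root] || 0) + 1;
  });
  return counts;
}

// Step 3: re-parse the text and wrap words whose root occurs more than once.
function markRepeats(text, stem) {
  var counts = countStems(text, stem);
  return text.split(/\s+/).map(function (raw) {
    var root = stem(clean(raw));
    return counts[root] > 1
      ? '<b class="' + root + '">' + raw + '</b>'
      : raw;
  }).join(' ');
}

var marked = markRepeats('Testing tests tested the test', naiveStem);
```

With the real stemmer you would pass `stemmer` in place of `naiveStem`; the class on each `b` tag is the shared root, which is what makes the hover highlighting possible.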

Here’s the main script with a few comments. Until I get this prettied up, I won’t go into a step-by-step. Feel free to lift and improve as you like. Remember to include the porter-stemmer script before this script. Oh, and because I’m lazy, you’ll need to include jQuery as well. The in_array function is stolen from php.js.

// from php.js
function in_array (needle, haystack) {
  for (var key in haystack) { // declare key so it doesn't leak into the global scope
    if (haystack[key] == needle) {
      return true;
    }
  }
  return false;
}

// short, common words to skip when counting
var stopwords = ['1','2','3','4','5','6','7','8','9','0','one','two','three','four','five','about','actually','always','even','given','into','just','not','Im','thats','its','arent','weve','ive','didnt','dont','the','of','to','and','a','in','is','it','you','that','he','was','for','on','are','with','as','I','his','they','be','at','one','have','this','from','or','had','by','hot','but','some','what','there','we','can','out','were','all','your','when','up','use','how','said','an','each','she','which','do','their','if','will','way','many','then','them','would','like','so','these','her','see','him','has','more','could','go','come','did','my','no','get','me','say','too','here','must','such','try','us','own','oh','any','youll','youre','also','than','those','though','thing','things'];

// takes the text of a paragraph element as input
// returns marked up text with repeated words in 'b' tags with a class matching their "stemmed" root
function checkWords(input) {

  var words = input.split(' ');
  var wordcount = {};

  // build an object to count word frequency
  $.each(words,function(i){
    // swap slashes for spaces, straighten curly apostrophes, drop other punctuation
    var thisWord = String(this).replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase();
    if (!in_array(thisWord,stopwords)) {
      var word = stemmer(thisWord);
      if (wordcount[word] > 0 && word.length) {
        wordcount[word] += 1;
      } else {
        wordcount[word] = 1;
      }
    }
  });

  // convert the object to an object array
  // include only words repeated more than once within the paragraph
  var topwords = new Array();
  $.each(wordcount,function(w,i){
    if (i > 1)
      topwords.push({'word':w,'freq':i});
  });

  // convert the object array to a flat array
  var topwordsArr = new Array();
  $.each(topwords,function(i) {
    topwordsArr.push(String(this['word']));
  });

  // re-parse the output, marking up repeated words based on their stems
  var output = '';
  $.each(words,function(w) {
    var aWord = String(this);
    var stripWord = stemmer(aWord.replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase());
    if (in_array(stripWord,topwordsArr))
      output += ' <b class="'+stripWord+'">'+aWord+'</b>';
    else
      output += ' '+aWord;
  });
  return output;
}

(function($){
  // grab common top-level elements
  var grafs = $('p,ul,ol,blockquote,h1,h2,h3,h4,h5,h6,pre code',$('#content'));
  // navigate each element found
  $.each(grafs,function(a,g){
    // if it's a paragraph, we'll process it
    if (grafs[a].tagName == "P") {
      $('#work').append($('<p>').html(checkWords($(grafs[a]).text())));
    // if not, we just stick it back into the DOM
    } else {
      $('#work').append(grafs[a]);
    }
  });
  // set up hover listeners on the 'b' elements
  // the class is pulled from the hovered element
  // all similar words are highlighted on hover
  $('b','#work').hover(function(){
    var thisClass = this.className;
    $('.'+thisClass).addClass('highlight');
  },function(){
    $('.highlight').removeClass('highlight');
  });
})(jQuery);
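One wrinkle worth noting: the cleanup step keeps apostrophes and lowercases everything, while the stopword list stores contractions without apostrophes ('dont', 'thats') and has an uppercase 'I'. If that reading is right, "don't" and "i" would slip past the `in_array` check. The sketch below is one possible fix, not part of the script above: `normalize` mirrors the cleanup chain (assuming the second replace is meant to straighten curly apostrophes), and `isStopword` and `stopwordSet` are hypothetical names introduced here.

```javascript
// Mirrors the script's cleanup chain. ASSUMPTION: the second replace is
// meant to turn curly apostrophes into straight ones.
function normalize(token) {
  return String(token)
    .replace(/[\/\\]/g, ' ')
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/[^a-z' ]/gi, '')
    .toLowerCase();
}

// Hypothetical lookup table built from a few of the post's stopwords.
// Keys are lowercased so 'I' is stored as 'i'.
var stopwordSet = Object.create(null);
['I', 'dont', 'thats', 'the'].forEach(function (w) {
  stopwordSet[w.toLowerCase()] = true;
});

// Compare with apostrophes removed, so "don't" matches the stored 'dont'.
function isStopword(normalized) {
  return stopwordSet[normalized.replace(/'/g, '')] === true;
}

var word = normalize('Don\u2019t,'); // curly apostrophe plus trailing comma
var skipped = isStopword(word);
```

An object lookup like this also avoids walking the whole stopword array for every word, which starts to matter on long articles.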

This little bit of CSS will add the highlighting and a nice fade in compatible browsers:

b {
  font-weight:bold;
  -webkit-transition:color .2s ease-in-out;
  -moz-transition:color .2s ease-in-out;
  -o-transition:color .2s ease-in-out;
  transition:color .2s ease-in-out;
}
.highlight { color:rgba(207, 95, 205, 1); }

If I decide to include this in Marked, it will definitely get some revamping. Like I said… proof-of-concept. Check out the demo, though. It’s kind of neat.

Brett Terpstra

Brett is a writer and developer living in Minnesota, USA. You can follow him as ttscoff on Twitter, GitHub, and Mastodon. Keep up with this blog by subscribing in your favorite news reader.
