Ok, I made something interesting (to me) last night. It’s probably not worth taking the time to write up, but someday someone might find it and think it useful. I apologize for the messiness of the code; if I take this further and clean it up, I’ll update this post.
I was working on a few text-analysis features for Marked and decided I wanted to be able to show repeated words on a per-paragraph basis. The following is the experiment I did as a proof-of-concept. I decided that I wanted to do this in JavaScript/jQuery for various reasons, so my existing Ruby scripts were mostly useless. Thus, I devised a way to handle it entirely in a WebKit browser.
Demo
For the purposes of demonstration, I set up a single-page, dynamic version of this. You can enter any article URL and see the processing take place. Hover over a bolded word to see where it repeats in the paragraph. Note that it’s pulling through a proxy of Marky the Markdownifier, and that some markup will return blanks. Obviously, it also helps to have a lot of text in the article you’re analyzing. Hard up for ideas? Try this, or this. Give it a few seconds to load; it pulls in the content in the background, and I haven’t put a progress indicator on it yet.
Breakdown
I started with a Porter Stemmer using a script from tartarus.org. Stemming reduces a word to a root form, so that plurals, conjugations and other variations of a word all match each other. From there, I do a frequency check to find word roots used more than once within the block being processed, creating an array of the repeated words. Then I re-parse the block, one word at a time, adding some markup to words whose root is found in the previously created array.
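The three steps described above (stem, count, re-parse) can be sketched without jQuery or the real stemmer. This is only an illustration: `naiveStem`, `countStems`, `repeatedStems` and `markRepeats` are made-up names, and `naiveStem` is a crude suffix stripper standing in for the Porter Stemmer, not an implementation of it.

```javascript
// Hypothetical stand-in for a real Porter stemmer: strips a few common
// suffixes only. It is NOT the Porter algorithm.
function naiveStem(word) {
  return word.replace(/(ing|ed|es|s)$/, '');
}

// Step 1: count how often each stem occurs in a block of text.
function countStems(text) {
  var counts = Object.create(null); // no inherited keys to collide with
  text.toLowerCase().split(/\s+/).forEach(function (raw) {
    var stem = naiveStem(raw.replace(/[^a-z']/g, ''));
    if (stem) {
      counts[stem] = (counts[stem] || 0) + 1;
    }
  });
  return counts;
}

// Step 2: keep only the stems that occur more than once.
function repeatedStems(counts) {
  return Object.keys(counts).filter(function (stem) {
    return counts[stem] > 1;
  });
}

// Step 3: walk the words again and wrap the repeats in <b> tags,
// using the stem as the class name.
function markRepeats(text) {
  var repeats = repeatedStems(countStems(text));
  return text.split(/\s+/).map(function (raw) {
    var stem = naiveStem(raw.toLowerCase().replace(/[^a-z']/g, ''));
    return repeats.indexOf(stem) !== -1
      ? '<b class="' + stem + '">' + raw + '</b>'
      : raw;
  }).join(' ');
}

console.log(markRepeats('Walking dogs walked a dog'));
// <b class="walk">Walking</b> <b class="dog">dogs</b> <b class="walk">walked</b> a <b class="dog">dog</b>
```

Because the class name is the stem, every variation of a repeated word ends up sharing one class, which is what makes the hover highlighting possible later on.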
Here’s the main script with a few comments. Until I get this prettied up, I won’t go into a step-by-step. Feel free to lift and improve as you like. Remember to include the porter-stemmer script before this script. Oh, and because I’m lazy, you’ll need to include jQuery as well. The in_array function is stolen from php.js.
// from php.js
function in_array (needle, haystack) {
  // declare key locally so it doesn't leak into the global scope
  for (var key in haystack) {
    if (haystack[key] == needle) {
      return true;
    }
  }
  return false;
}
// short, common words to skip when counting
// entries are lowercase because words are lowercased before the lookup
var stopwords = ['1','2','3','4','5','6','7','8','9','0','one','two','three','four','five','about','actually','always','even','given','into','just','not','im','thats','its','arent','weve','ive','didnt','dont','the','of','to','and','a','in','is','it','you','that','he','was','for','on','are','with','as','i','his','they','be','at','have','this','from','or','had','by','hot','but','some','what','there','we','can','out','were','all','your','when','up','use','how','said','an','each','she','which','do','their','if','will','way','many','then','them','would','like','so','these','her','see','him','has','more','could','go','come','did','my','no','get','me','say','too','here','must','such','try','us','own','oh','any','youll','youre','also','than','those','though','thing','things'];
// takes the text of a paragraph element as input
// returns marked up text with repeated words in 'b' tags with a class matching their "stemmed" root
function checkWords(input) {
  var words = input.split(' ');
  var wordcount = {};
  // build an object to count word frequency
  $.each(words, function(i) {
    // slashes become spaces, curly apostrophes become straight ones,
    // then everything except letters, apostrophes and spaces is dropped
    var thisWord = String(this).replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase();
    if (!in_array(thisWord, stopwords)) {
      var word = stemmer(thisWord);
      if (wordcount[word] > 0 && word.length) {
        wordcount[word] += 1;
      } else {
        wordcount[word] = 1;
      }
    }
  });
  // convert the object to an object array
  // include only words repeated more than once within the paragraph
  var topwords = [];
  $.each(wordcount, function(w, i) {
    if (i > 1) {
      topwords.push({'word': w, 'freq': i});
    }
  });
  // convert the object array to a flat array
  var topwordsArr = [];
  $.each(topwords, function(i) {
    topwordsArr.push(String(this['word']));
  });
  // re-parse the output, marking up repeated words based on their stems
  var output = '';
  $.each(words, function(w) {
    var aWord = String(this);
    var stripWord = stemmer(aWord.replace(/[\/\\]/,' ').replace(/’/g,"'").replace(/[^a-z' ]/gi,'').toLowerCase());
    if (in_array(stripWord, topwordsArr)) {
      output += ' <b class="' + stripWord + '">' + aWord + '</b>';
    } else {
      output += ' ' + aWord;
    }
  });
  return output;
}
(function($){
  // grab common top-level elements
  var grafs = $('p,ul,ol,blockquote,h1,h2,h3,h4,h5,h6,pre code', $('#content'));
  // navigate each element found
  $.each(grafs, function(a, g) {
    // if it's a paragraph, we'll process it
    if (grafs[a].tagName == "P") {
      $('#work').append($('<p>').html(checkWords($(grafs[a]).text())));
    // if not, we just stick it back into the DOM
    } else {
      $('#work').append(grafs[a]);
    }
  });
  // set up hover listeners on the 'b' elements
  // the class is pulled from the hovered element
  // all similar words are highlighted on hover
  $('b', '#work').hover(function() {
    var thisClass = this.className;
    $('.' + thisClass).addClass('highlight');
  }, function() {
    $('.highlight').removeClass('highlight');
  });
})(jQuery);
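The script runs the same cleanup chain on each word in two places, once while counting and once while re-parsing. If I clean this up, that chain would probably move into one helper, something like the sketch below. `normalizeWord` is a hypothetical name that isn't in the script above. The sketch also assumes the second replace is meant to turn curly apostrophes into straight ones, and it adds a `g` flag to the slash replacement so that every slash is converted, not just the first.

```javascript
// Hypothetical helper, not part of the original script.
// Lowercases a word and strips everything but letters, apostrophes and spaces.
function normalizeWord(raw) {
  return String(raw)
    .replace(/[\/\\]/g, ' ')          // slashes and backslashes become spaces
    .replace(/[\u2018\u2019]/g, "'")  // assumed intent: curly quotes become straight
    .replace(/[^a-z' ]/gi, '')        // drop digits, punctuation, and so on
    .toLowerCase();
}

console.log(normalizeWord('Don\u2019t!')); // don't
console.log(normalizeWord('and/or'));      // and or
```

Both passes could then call `stemmer(normalizeWord(word))`, so the counting pass and the markup pass can't drift apart.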
This little bit of CSS will add the highlighting and a nice fade in compatible browsers:
b {
  font-weight:bold;
  -webkit-transition:color .2s ease-in-out;
  -moz-transition:color .2s ease-in-out;
  -o-transition:color .2s ease-in-out;
  transition:color .2s ease-in-out;
}
.highlight { color:rgba(207, 95, 205, 1); }
If I decide to include this in Marked, it will definitely get some revamping. Like I said… proof-of-concept. Check out the demo, though; it’s kind of neat.