Ok, I made something interesting (to me) last night. It’s probably not worth taking the time to write up, but someday someone might find it and think it useful. I apologize for the messiness of the code; if I take this further and clean it up, I’ll update this post.
For the purposes of demonstration, I set up a single-page, dynamic version of this. You can enter any article URL and see the processing take place. Hover over a bolded word to see where it repeats in the paragraph. Note that it’s pulling through a proxy of Marky the Markdownifier, and that some markup will return blanks. Obviously, it also helps to have a lot of text in the article you’re analyzing. Hard up for ideas? Try this, or this. Give it a few seconds to load; it pulls in the content in the background, and I haven’t put a progress indicator on it yet.
I started with a Porter Stemmer using a script from tartarus.org. Stemming reduces a word to a root form, so that plurals, conjugations and various anomalous representations of a word all match each other. From there, I do a frequency check to find word roots used more than once within the block being processed, creating an array of the repeated roots. Then I re-parse the block, one word at a time, adding some markup to words whose root is found in the previously-created array.
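Here’s a minimal sketch of those three steps (stem, count, re-parse), not the actual script. It assumes a few things that aren’t in this post: `naiveStem` is a crude stand-in for the real Porter stemmer, the function names `findRepeatedRoots` and `markRepeats` are made up for illustration, and the `<b class="repeat">` markup is just a guess at something hoverable. It also skips jQuery and in_array in favor of a plain `Set`.

```javascript
// Stand-in for the real Porter stemmer (assumption): strips a few common
// suffixes from words longer than three letters. This is NOT the Porter
// algorithm; swap in the tartarus.org script for real use.
function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 ? lower.replace(/(ing|ed|es|s)$/, "") : lower;
}

// Pass 1: count every root in the block and keep those seen more than once.
function findRepeatedRoots(text, stem) {
  const counts = new Map();
  for (const word of text.match(/[A-Za-z']+/g) || []) {
    const root = stem(word);
    counts.set(root, (counts.get(root) || 0) + 1);
  }
  return new Set([...counts].filter(([, n]) => n > 1).map(([root]) => root));
}

// Pass 2: walk the block again, one word at a time, and wrap any word
// whose root is in the repeated set. The markup here is hypothetical.
function markRepeats(text, stem) {
  const repeated = findRepeatedRoots(text, stem);
  return text.replace(/[A-Za-z']+/g, (word) => {
    const root = stem(word);
    return repeated.has(root)
      ? `<b class="repeat" data-root="${root}">${word}</b>`
      : word;
  });
}

const marked = markRepeats("The cat jumped and the cats jump.", naiveStem);
console.log(marked);
```

In that sample sentence, “cat”/“cats” and “jumped”/“jump” get wrapped because they share a root, while “and” is left alone. Note that “the” gets wrapped too; a sketch like this would probably want a stop-word list before it’s useful on real text.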
Here’s the main script with a few comments. Until I get this prettied up, I won’t go into a step-by-step. Feel free to lift and improve as you like. Remember to include the porter-stemmer script before this script. Oh, and because I’m lazy, you’ll need to include jQuery as well. The in_array function is stolen from php.js.
This little bit of CSS will add the highlighting and a nice fade in compatible browsers:
If I decide to include this in Marked, it will definitely get some revamping. Like I said… proof-of-concept. Check out the demo, though; it’s kind of neat.