If you happen to be converting a blog from WordPress to Jekyll, this tip might be of use, especially if you want to make sure links to your existing posts continue to work. While this is most likely to be an issue if you’re changing your permalink structure, you can still run into a few hiccups even if you maintain it.

Note that this post assumes you have a working knowledge of Ruby and can get the WordPress importer script to run on its own, including installing the sequel and MySQL gems. If you’re not that far yet, check back once you have it working.

I’m posting this to document my own discoveries, and I highly doubt it will be of much use to anyone else. While I have this working for my particular needs, this post only details enough to give you an idea how to implement your own. If you’ve read this far and don’t know what’s going on, you should probably skip this one.

I converted from having no dates in the url (/post-name) to having full dates (/2013/04/04/post-name). I handled this with Apache redirects. If you’re using another server platform, you’ll need to adjust the rule output accordingly.

I used a heavily modified version of a migration script borrowed from the original Jekyll package. Among the myriad conversions it runs, it gathers permalinks and creates rules for all existing posts as it reads them from the WordPress database. The following is the concept, but you’ll need to reassemble it in your own import script.

First, I set up a variable in the class initialization to hold the redirects as they’re gathered.

class WordPressImporter
    def initialize(dbname, user, pass, host = 'localhost', domain = '')
        ...
        @redirects = []
        ...
    end

Collecting and generating rules

During the process function in the import script, it gathers all of the posts and writes out the Markdown files for you. At the end of this function I add to my @redirects array using the information in the post variable.

@redirects << { 
    'source' => "^" + post[:post_name].to_s + "/?$", 
    'target' => slug 
}

The source line needs to generate a regular expression that matches the original URL of the post on your site. In my case this is just “^post-title/?$”. If you have an existing permalink structure using dates or categories (or anything else), you’ll need to add the bits in to create a matching rule. For example, if your link structure is “yoursite.net/2013/04/post-title”, your regular expression needs to be “^2013/04/post-title/?$”. You would generate it with:

date = post[:post_date]
regex = "^%02d/%02d/%s/?$" % [date.year, date.month, post[:post_name]]
@redirects << { 'source' => regex, 'target' => slug }
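WordPress post names are normally plain slugs, but if one ever contains a character that means something in a regular expression (a dot, say), the rule will match more than intended. Here's a minimal sketch of a helper that escapes the name first. `redirect_for` is a hypothetical name of my own, not something in the importer, and the date prefix only applies if your old permalinks had one. Ruby's `Regexp.escape` also puts a backslash in front of hyphens, which is harmless in the resulting rule.

```ruby
require 'date'

# Hypothetical helper, not part of the Jekyll importer: builds one redirect
# entry in the same hash format used for @redirects.
def redirect_for(post_name, slug, date = nil)
  # Escape regex metacharacters so the name only matches itself
  name = Regexp.escape(post_name.to_s)
  # Prepend year/month only when the old permalinks included them
  prefix = date ? format("%04d/%02d/", date.year, date.month) : ""
  { 'source' => "^#{prefix}#{name}/?$", 'target' => slug }
end

# Example: an old /2013/04/post-title link pointing at the new dated URL
rule = redirect_for("post-title", "/2013/04/04/post-title/", Date.new(2013, 4, 4))
puts "RewriteRule #{rule['source']} #{rule['target']} [R=301,L]"
```

Inside the import script you would then write `@redirects << redirect_for(post[:post_name], slug)` in place of the hash literal.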

Adding the redirects to htaccess

I created an .htaccess file for the site in my “source” folder. It just needs to be at the root of the site source so that it’s copied over when the site is generated. Whether you create a new one or are using an existing one, you’ll need to add some markers so the script knows where to insert/update the rules. Add two lines to the htaccess anywhere that makes sense for page redirects:

#===== PAGE REMAPS
#===== PAGE REMAPS

The start and end markers are the same.

Here’s the part of the script that reads the htaccess in, inserts the generated rules between those markers and writes the result back out to .htaccess. Note that the File.open("source/.htaccess",'r') needs to be modified if your htaccess exists in a folder other than “source”. This should be defined within the WordPressImporter class.

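def process_redirects
  $stderr.puts "Processing redirects"
  marker = "#===== PAGE REMAPS"
  content = []
  File.open("source/.htaccess",'r') do |f|
    # A limit of -1 keeps trailing empty strings, so the number of parts
    # is always the number of markers plus one
    content = f.read.split(marker, -1)
  end
  # Everything before the first marker (the whole file if there are none,
  # so an .htaccess without markers is appended to rather than wiped)
  before = content[0].to_s
  # Two markers: drop the old rules between them and keep what follows.
  # One marker: keep what follows it. No markers: nothing follows.
  after = content.length >= 3 ? content[2..-1].join(marker) : content[1].to_s
  File.open("source/.htaccess", 'w') do |f|
    f.puts(before)
    f.puts(marker)
    @redirects.each { |redirect|
      f.puts("RewriteRule #{redirect['source']} #{redirect['target']} [R=301,L]")
    }
    f.puts(marker)
    # Drop the line break left over from the marker line so blank lines
    # don't pile up on repeated runs
    f.puts(after.sub(/\A\r?\n/, ''))
  end
end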

Now just run the process_redirects function right after the process function and — if you hacked it all together properly — you should get an .htaccess file with all of your old links mapped to new ones. I highly recommend testing it on a local server before deploying anything.
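Before firing up a local server, you can do a rough sanity check in plain Ruby. This sketch pulls the rules back out of the marked section and reports where a given old path would be sent. The helper names (`remap_rules`, `resolve`) are my own inventions for illustration. Two assumptions to keep in mind: Apache uses PCRE rather than Ruby's regex engine (for simple patterns like these they behave the same), and in an .htaccess file the path is matched without its leading slash, so test with "post-title" rather than "/post-title".

```ruby
MARKER = "#===== PAGE REMAPS"

# Hypothetical helper: returns [pattern, target] pairs from the marked section
def remap_rules(htaccess_text)
  # The text between the first and second marker holds the generated rules
  section = htaccess_text.split(MARKER, -1)[1].to_s
  section.scan(/^RewriteRule\s+(\S+)\s+(\S+)/)
end

# Returns the target of the first rule whose pattern matches, or nil
def resolve(rules, path)
  hit = rules.find { |pattern, _target| Regexp.new(pattern).match(path) }
  hit && hit[1]
end

# Inline sample standing in for File.read("source/.htaccess")
sample = <<~HTACCESS
  RewriteEngine On
  #===== PAGE REMAPS
  RewriteRule ^post-title/?$ /2013/04/04/post-title/ [R=301,L]
  #===== PAGE REMAPS
HTACCESS

rules = remap_rules(sample)
puts resolve(rules, "post-title/").inspect
puts resolve(rules, "no-such-post").inspect
```

This only tells you the patterns are well formed and match what you expect. It doesn't replace testing against a real server, since rule order, other directives, and whether the rewrite engine is enabled all affect the outcome.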

Double checking

You can use a sitemap from your old site to check against the new site for missing links and repair and redirect them as needed. Here’s a basic script which may require some adjustment. Run it with sitemapchecker.rb yourstagingsite.com http://yourwordpresssite.com/yoursitemap.xml. If you do happen to use this and you have a large site, let me know. I have a version with progress meters…

#!/usr/bin/env ruby
# sitemapchecker.rb
# Brett Terpstra 2013, no rights reserved
require 'rexml/document'
require 'net/http'
require 'uri'


def get_xml(url)
  # File.exists? was removed in Ruby 3.2; File.exist? works on every version
  if File.exist?(url)
    res = File.read(url)
  else
    url = "http://" + url unless url =~ /^https?:\/\//
    res = Net::HTTP.get_response(URI.parse(url)).body
  end
  res
end

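def test_url(url)
  # get_response raises on connection problems instead of returning nil,
  # so failures are caught here and recorded rather than aborting the run
  res = Net::HTTP.get_response(URI.parse(url))
  res.code
rescue StandardError
  "FAILURE"
end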

def check_sitemap(target_domain,sitemap)
  # gsub returns a new string, so the result has to be assigned back
  target_domain = target_domain.gsub(/^https?:\/\//,'')
  results = []
  doc = REXML::Document.new(get_xml(sitemap))
  # Document.new never returns nil; an unusable sitemap shows up as a missing root
  raise "Error parsing #{sitemap}" unless doc.root
  urls = doc.get_elements("urlset/url")

  urls.each { |url|
    target = url.elements['loc'].text
    parts = target.match(/^(https?:\/\/)?([^\/]+)?(.*)$/)
    prefix = parts[1] || "http://"
    target = "#{prefix}#{target_domain}#{parts[3]}"
    res = test_url(target)
    results << { 'url' => target, 'result' => res }
  }
  outfile = "sitemap_check_#{target_domain}_#{Time.now.strftime('%m-%d-%Y')}.txt"
  $stderr.puts "Writing results to #{outfile}"
  File.open(outfile,'w') do |f|
    results.each { |res|
      f.puts %Q{#{res['result']}\t#{res['url']}}
    }
  end
end

if ARGV.length < 2
  puts "This script pulls an existing sitemap, remaps urls to a new domain,"
  puts "and checks to see if they exist at the new location."
  puts
  puts "The first argument should be the new domain to test on, followed by"
  puts "a filename or web url for the current sitemap (multiple allowed)."
  puts
  puts "A file titled \"sitemap_check_[yourdomain]_date.txt\" will be output"
  puts "in the current directory. I know, I should make that a CLI option. Whatever."
  puts
  puts "> #{File.basename(__FILE__)} stage.yoursite.com http://yoursite.com/yoursitemap.xml"
else
  target_domain = ARGV[0]
  ARGV.shift
  ARGV.each { |smap|
    check_sitemap(target_domain,smap)
  }
end
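The results file is one status code and URL per line, separated by a tab, which gets tedious to eyeball on a site of any size. Here's a rough sketch that tallies the codes and lists the URLs that need attention. `summarize` is a hypothetical helper, not part of the script above. It assumes 200 and 301 are the results you want to see: `Net::HTTP.get_response` doesn't follow redirects, so an old permalink that your new rules catch should come back as 301 rather than 200. Adjust the list if your setup differs.

```ruby
# Hypothetical helper: tallies lines of the form "<status>\t<url>" as written
# by sitemapchecker.rb. Returns [counts_by_status, rows_needing_attention].
def summarize(lines, expected = %w[200 301])
  # Skip anything that isn't a status/url pair
  rows = lines.map { |line| line.chomp.split("\t", 2) }
              .select { |row| row.length == 2 }
  counts = Hash.new(0)
  rows.each { |status, _url| counts[status] += 1 }
  problems = rows.reject { |status, _url| expected.include?(status) }
  [counts, problems]
end

# Only runs when a results file is passed on the command line
if __FILE__ == $0 && ARGV[0]
  counts, problems = summarize(File.readlines(ARGV[0]))
  counts.sort.each { |status, n| puts "#{status}\t#{n}" }
  puts
  problems.each { |status, url| puts "#{status}\t#{url}" }
end
```

Saved under a name of your choosing, it takes the results file as its only argument and prints the per-code counts followed by every URL that came back with something other than the expected codes.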

Hope somebody finds all of this useful.