If you happen to be converting a blog from WordPress to Jekyll, this tip might be of use, especially if you want to make sure links to your existing posts continue to work. While this is most likely to be an issue if you’re changing your permalink structure, you can still run into a few hiccups even if you maintain it.
Note that this post assumes you have a working knowledge of Ruby and can get the WordPress importer script to run on its own, including installing the sequel and MySQL gems. If you’re not that far yet, check back once you have it working.
I’m posting this to document my own discoveries, and I highly doubt it will be of much use to anyone else. While I have this working for my particular needs, this post only details enough to give you an idea of how to implement your own. If you’ve read this far and don’t know what’s going on, you should probably skip this one.
I converted from having no dates in the URL (/post-name) to having full dates (/2013/04/04/post-name). I handled this with Apache redirects. If you’re using another server platform, you’ll need to adjust the rule output accordingly.
I used a heavily modified version of a migration script borrowed from the original Jekyll package. Among the myriad conversions it runs, it gathers permalinks and creates rules for all existing posts as it reads them from the WordPress database. The following is the concept, but you’ll need to reassemble it in your own import script.
First, I set up a variable in the class initialization to hold the redirects as they’re gathered.
class WordPressImporter
  def initialize(dbname, user, pass, host = 'localhost', domain = '')
    ...
    @redirects = []
    ...
  end
Collecting and generating rules
The process method in the import script gathers all of the posts and writes out the Markdown files for you. At the end of this method I add to my @redirects array using the information in the post variable.
@redirects << {
  'source' => "^" + post[:post_name].to_s + "/?$",
  'target' => slug
}
The source line needs to generate a regular expression that matches the original URL of the post on your site. In my case this is just “^post-title/?$”. (In an .htaccess file, Apache matches the pattern against the path without its leading slash, which is why none appears here.) If you have an existing permalink structure using dates or categories (or anything else), you’ll need to add those bits to create a matching rule. For example, if your link structure is “yoursite.net/2013/04/post-title”, your regular expression needs to be “^2013/04/post-title/?$”. You would generate it with:
date = post[:post_date]
regex = "^%02d/%02d/%s/?$" % [date.year, date.month, post[:post_name]]
@redirects << { 'source' => regex, 'target' => slug }
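To see what those patterns look like and what they will and won't match, here is a minimal sketch. It is not code from the importer: source_pattern is a hypothetical helper, the slug and date are made-up samples, and Ruby's regex engine is standing in for Apache's.

```ruby
require 'date'

# Hypothetical helper (not in the importer): builds the 'source' pattern for
# an old permalink, with an optional year/month prefix.
def source_pattern(post_name, date = nil)
  prefix = date ? "%04d/%02d/" % [date.year, date.month] : ""
  "^#{prefix}#{post_name}/?$"
end

plain = source_pattern("post-name")
dated = source_pattern("post-name", Date.new(2013, 4, 4))

puts plain  # ^post-name/?$
puts dated  # ^2013/04/post-name/?$

# Apache matches .htaccess patterns against the path without its leading
# slash, so the sample paths below omit it too.
["post-name", "post-name/", "post-name/feed", "other-post-name"].each do |path|
  puts "#{path}: #{Regexp.new(plain).match?(path)}"
end
```

Only the first two sample paths match. The ^ and $ anchors keep the rule from catching sub-paths or other posts whose names merely end with the same text.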
Adding the redirects to htaccess
I created an .htaccess file for the site in my “source” folder. It just needs to be at the root of the site source so that it’s copied over when the site is generated. Whether you create a new one or use an existing one, you’ll need to add some markers so the script knows where to insert/update the rules. The rules rely on mod_rewrite, so the file also needs a RewriteEngine On line. Add two lines to the htaccess anywhere that makes sense for page redirects:
#===== PAGE REMAPS
#===== PAGE REMAPS
The start and end markers are the same.
Here’s the part of the script that reads the htaccess in, inserts the generated rules between those markers, and writes the result back out to .htaccess. Note that the File.open("source/.htaccess",'r') call (and the matching write call) needs to be modified if your htaccess exists in a folder other than “source”. This should be defined within the WordPressImporter class.
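def process_redirects
  $stderr.puts "Processing redirects"
  content = []
  File.open("source/.htaccess",'r') do |f|
    # Limit of 3: text before the first marker, the old rules, and
    # everything after the second marker
    content = f.read.split(/\#===== PAGE REMAPS/, 3)
  end
  # With no markers in the file, content[0] is the whole file, so existing
  # directives are kept and the rules are appended at the end
  before = content[0].to_s
  # Drop the line break left by the closing marker so repeated runs don't
  # accumulate blank lines
  after = content[2].to_s.sub(/\A\r?\n/, '')
  File.open("source/.htaccess", 'w') do |f|
    f.puts(before) unless before.empty?
    f.puts("#===== PAGE REMAPS")
    @redirects.each { |redirect|
      f.puts("RewriteRule #{redirect['source']} #{redirect['target']} [R=301,L]")
    }
    f.puts "#===== PAGE REMAPS"
    f.puts(after) unless after.empty?
  end
end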
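The marker handling is easier to reason about when it's separated from the file I/O. What follows is a rough sketch under my own assumptions, not the code from the import script: splice_rules is a hypothetical name, it works purely on strings, the sample htaccess content is invented, and it appends a fresh marker block when the input has no markers yet.

```ruby
MARKER = "#===== PAGE REMAPS"

# Hypothetical helper: returns htaccess text with everything between the two
# markers replaced by the given rule lines.
def splice_rules(htaccess, rules)
  # Limit of 3 keeps any text after the second marker in one piece
  before, _old_rules, after = htaccess.split(MARKER, 3)
  before = before.to_s
  before += "\n" unless before.empty? || before.end_with?("\n")
  # Strip the line break that followed the closing marker
  after = after.to_s.sub(/\A\r?\n/, '')
  block = ([MARKER] + rules + [MARKER]).join("\n") + "\n"
  before + block + after
end

# Invented sample file content
original = <<~HTACCESS
  RewriteEngine On
  #===== PAGE REMAPS
  RewriteRule ^stale/?$ /gone/ [R=301,L]
  #===== PAGE REMAPS
  ErrorDocument 404 /404.html
HTACCESS

rules = ["RewriteRule ^post-name/?$ /2013/04/04/post-name/ [R=301,L]"]
updated = splice_rules(original, rules)
puts updated
```

Because the old rules between the markers are thrown away on every run, feeding the output back in with the same rules should give the same text, which makes the import safe to re-run.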
Now just run the process_redirects method right after the process method and, if you hacked it all together properly, you should get an .htaccess file with all of your old links mapped to new ones. I highly recommend testing it on a local server before deploying anything.
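Before pointing a browser at the local server, a quick offline pass can catch old paths that no rule covers. This is only a sketch: resolve is a hypothetical helper, the sample rules are made up, and Ruby's regex engine stands in for Apache's. That is close enough for simple anchored patterns like these, but it's no substitute for testing against the real server.

```ruby
# Hypothetical helper: returns the target of the first rule whose pattern
# matches the path (first match wins, as with rules flagged [L]).
def resolve(path, redirects)
  # .htaccess patterns are matched without the leading slash
  relative = path.sub(%r{\A/}, '')
  hit = redirects.find { |r| Regexp.new(r['source']).match?(relative) }
  hit && hit['target']
end

# Made-up rules in the same shape as the @redirects entries
redirects = [
  { 'source' => "^post-name/?$",    'target' => "/2013/04/04/post-name/" },
  { 'source' => "^another-post/?$", 'target' => "/2012/11/30/another-post/" }
]

["/post-name", "/another-post/", "/missing-post"].each do |old_path|
  target = resolve(old_path, redirects)
  puts "#{old_path} -> #{target || 'NO RULE'}"
end
```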
Double checking
You can use a sitemap from your old site to check against the new site for missing links and repair and redirect them as needed. Here’s a basic script which may require some adjustment. Run it with sitemapchecker.rb yourstagingsite.com http://yourwordpresssite.com/yoursitemap.xml. If you do happen to use this and you have a large site, let me know. I have a version with progress meters…
#!/usr/bin/env ruby
# sitemapchecker.rb
# Brett Terpstra 2013, no rights reserved
require 'rexml/document'
require 'net/http'
require 'uri'
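def get_xml(url)
  # File.exists? was removed in Ruby 3.2; File.exist? works everywhere
  if File.exist?(url)
    # Local sitemap file
    res = File.read(url)
  else
    # Remote sitemap: default to http:// when no scheme is given
    url = "http://" + url unless url =~ /^https?:\/\//
    res = Net::HTTP.get_response(URI.parse(url)).body
  end
  res
end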
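def test_url(url)
  res = Net::HTTP.get_response(URI.parse(url))
  res.nil? ? "FAILURE" : res.code
rescue StandardError
  # Connection errors, timeouts and bad URLs raise rather than return nil
  "FAILURE"
end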
def check_sitemap(target_domain,sitemap)
  # gsub returns a new string, so the result has to be assigned
  target_domain = target_domain.gsub(/^https?:\/\//,'')
  results = []
  doc = REXML::Document.new(get_xml(sitemap))
  # An empty or non-XML response leaves the document without a root element
  raise "Error parsing #{sitemap}" unless doc.root
  urls = doc.get_elements("urlset/url")
  urls.each { |url|
    target = url.elements['loc'].text
    # Swap the sitemap's domain for the target domain, keeping the path
    parts = target.match(/^(https?:\/\/)?([^\/]+)?(.*)$/)
    prefix = parts[1] || "http://"
    target = "#{prefix}#{target_domain}#{parts[3]}"
    res = test_url(target)
    results << { 'url' => target, 'result' => res }
  }
  outfile = "sitemap_check_#{target_domain}_#{Time.now.strftime('%m-%d-%Y')}.txt"
  $stderr.puts "Writing results to #{outfile}"
  File.open(outfile,'w') do |f|
    results.each { |res|
      f.puts %Q{#{res['result']}\t#{res['url']}}
    }
  end
end
if ARGV.length < 2
  puts "This script pulls an existing sitemap, remaps urls to a new domain,"
  puts "and checks to see if they exist at the new location."
  puts
  puts "The first argument should be the new domain to test on, followed by"
  puts "a filename or web url for the current sitemap (multiple allowed)."
  puts
  puts "A file titled \"sitemap_check_[yourdomain]_date.txt\" will be output"
  puts "in the current directory. I know, I should make that a CLI option. Whatever."
  puts
  puts "> #{File.basename(__FILE__)} stage.yoursite.com http://yoursite.com/yoursitemap.xml"
else
  target_domain = ARGV[0]
  ARGV.shift
  ARGV.each { |smap|
    check_sitemap(target_domain,smap)
  }
end
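Once the report exists, the lines worth reading are the ones that aren't 200. Here's a small sketch for summarizing it, assuming the tab-separated status/URL format the script writes. summarize is a hypothetical helper and the sample report lines are invented for illustration.

```ruby
# Hypothetical helper: groups report lines ("status<TAB>url") by status.
def summarize(report_text)
  groups = Hash.new { |hash, key| hash[key] = [] }
  report_text.each_line do |line|
    status, url = line.chomp.split("\t", 2)
    next if url.nil? # skip blank or malformed lines
    groups[status] << url
  end
  groups
end

# Invented sample data; in practice, pass in File.read of the generated report
sample = "200\thttp://stage.yoursite.com/2013/04/04/post-name/\n" \
         "404\thttp://stage.yoursite.com/old-page\n" \
         "301\thttp://stage.yoursite.com/post-name\n"

summarize(sample).each do |status, urls|
  next if status == "200"
  puts "#{status} (#{urls.length})"
  urls.each { |url| puts "  #{url}" }
end
```

A 301 in the report means a redirect rule fired. Net::HTTP.get_response does not follow redirects, so the checker reports the redirect itself rather than the page it leads to.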
Hope somebody finds all of this useful.