The XML Data Liberation Front

Despite the grandiose title, this post is pretty specific: converting RegExRX files to Markdown so I can include them in my nvALT snippets collection. I’m sharing it anyway because you can use it as a base to modify and start “rescuing” your own data out of other applications. I understand why applications of any complexity store their data in structured files, whether XML, JSON, or a database format, but I like to keep my data portable. Since the Data Liberation Army isn’t huge in number, the onus falls on us to find our own ways.

This script specifically works with XML and outputs to Markdown, but you could easily make the idea work with JSON files, binary plists (with a little help from plutil), or SQLite database queries, and output to any format you wanted with a little templating.
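To make the "same idea, different format" point concrete, here is a minimal sketch of the JSON variant using only Ruby's standard library. The JSON structure, its field names (`title`, `pattern`, `notes`), and the `snippet_to_markdown` helper are all made up for illustration; a real app's export would look different.

```ruby
require 'json'
require 'erb'

# Hypothetical JSON export from some snippets app; the field names are assumptions.
json = <<~'JSON'
  [
    {"title": "Match digits", "pattern": "\\d+", "notes": "One or more digits"}
  ]
JSON

# A tiny ERB template: title as a heading, pattern as an indented code block.
TEMPLATE = <<~'ERB'
  # <%= snippet['title'] %>

      <%= snippet['pattern'] %>

  <%= snippet['notes'] %>
ERB

# Render one parsed snippet (a Hash) through the template.
# The local variable `snippet` is visible to the template via `binding`.
def snippet_to_markdown(snippet, template)
  ERB.new(template).result(binding)
end

snippets = JSON.parse(json)
markdown = snippets.map { |s| snippet_to_markdown(s, TEMPLATE) }
puts markdown.first
```

Swapping `JSON.parse` for whatever reads your source format is the only part that really changes; the templating step stays the same.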

Ok, diatribe over. Back to the script.

Out of all the editors/testers for regular expressions out there, I’ve always come back to RegExRX. It’s not pretty (the Mac App Store icon couldn’t even get shadow transparency right), but it has all the features I could ask for. As I work, I save my successful regular expressions to RegExRX files. These are plain text XML files with the patterns stored as hex. This makes them pretty human-unreadable, and you know me…

I wrote a script to convert a folder full of these .regexrx files to Markdown files I could drop into nvALT or Quiver. I won’t go into a ton of detail on this because I’m pretty sure there aren’t more than 5 people in the world who will ever need this script, but…
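If you're curious what "stored as hex" means in practice, the sketch below round-trips a pattern through hex using `Array#pack` and `String#unpack`, which is the same `pack('H*')` call the script relies on. The sample pattern and the helper names `hex_to_text` and `text_to_hex` are mine, not taken from an actual RegExRX file.

```ruby
# Decode a hex string (two hex digits per byte) back into readable text.
def hex_to_text(hex)
  [hex].pack('H*').force_encoding('utf-8')
end

# Encode text as hex, only here so the example can round-trip.
def text_to_hex(text)
  text.unpack('H*').first
end

pattern = '\d+\s*'            # illustrative pattern, not from a real file
hex = text_to_hex(pattern)    # "5c642b5c732a"
puts hex
puts hex_to_text(hex)
```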

You can specify a few options when you run the script:

$ regexrx2md.rb -h
Usage: /Users/ttscoff/scripts/regexrx2md.rb [OPTIONS]
-o, --output-dir=DIRECTORY       Output folder, defaults to "markdown output"
-p, --prefix=PREFIX              Prefix added before output filenames
-t, --template=TEMPLATE          Use alternate ERB template
-h, --help                       Display this screen

Specify an output folder, a note title prefix, and your own template for the output (there’s a default one if you don’t make your own). A template is an ERB file that uses the variables @title, @flags, @search, @replace, and @source. The @source one is the contents of the “source text” in RegExRX, a string or text block to test the expression against. There are also helpers like “@source.indent”, which indents every line 4 spaces (to make a Markdown code block), and .to_js, which replaces forward slashes with \/ so you can use /[search]/ in your template. Note that .to_js only skips slashes that are directly preceded by a backslash; it doesn’t handle trickier cases such as an escaped backslash followed by a slash. I don’t run into those in RegExRX (its copy-as feature does the escaping automatically), but that’s something I’ll probably fix sooner rather than later.
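Here's a rough sketch of how the two helpers behave inside a template. The helpers are condensed copies of the ones in the script further down; the `Snippet` class is a hypothetical stand-in for the script's RegexRX object, holding just the instance variables a template sees, and the sample pattern and test string are made up.

```ruby
require 'erb'

# Condensed copies of the script's two String helpers.
class String
  # Indent every line by four spaces (a Markdown code block).
  def indent
    split("\n").map { |line| "    #{line}\n" }.join
  end

  # Escape forward slashes that aren't already preceded by a backslash.
  def to_js
    gsub(%r{(?<!\\)/}) { '\/' }
  end
end

# Hypothetical stand-in for the RegexRX object.
class Snippet
  def initialize(search, source)
    @search = search
    @source = source
  end

  # Instance variables are exposed to the template through `binding`.
  def render(template)
    ERB.new(template).result(binding)
  end
end

template = "/<%= @search.to_js %>/\n\n<%= @source.indent %>"
snippet = Snippet.new('https?://\S+', "see http://example.com\nand more")
puts snippet.render(template)
```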

Here’s an example template that imports nicely into Quiver:

regexrx_quiver_template.erb
<% if @flags %>

<% end %>

### Search

```javascript
/<%= @search.to_js %>/<%= @flags %>
```
<% if @replace %>

### Replace

```javascript
'<%= @replace %>'
```
<% end %>
<% if @source %>

### Test string

```text
<%= @source %>
```
<% end %>

The result in Quiver:

Side note: annoyingly, a lot of other snippet apps (like SnippetsLab) can’t just import Markdown files as notes. I had to import the results of this script in Codebox (which I think is now defunct) and then import that library in SnippetsLab.

And here’s the Ruby script. You need to have Nokogiri installed, which is (usually) just a matter of running gem install nokogiri (though depending on your setup you may need sudo gem install nokogiri and there’s a 50% chance you run into issues with libxml2 that you’ll have to search the web about).

regexrx2md.rb
#!/usr/bin/env ruby
require 'fileutils'
require 'nokogiri'
require 'optparse'
require 'erb'

def class_exists?(class_name)
  klass = Module.const_get(class_name)
  return klass.is_a?(Class)
rescue NameError
  return false
end

if class_exists? 'Encoding'
  Encoding.default_external = Encoding::UTF_8 if Encoding.respond_to?('default_external')
  Encoding.default_internal = Encoding::UTF_8 if Encoding.respond_to?('default_internal')
end

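class String
  # Decode a hex-encoded string into text.
  # Note: this shadows Ruby's built-in String#unpack(format).
  def unpack
    [self].pack('H*')
  end

  # Indent every line by 4 spaces (a Markdown code block)
  def indent
    out = ''
    self.split("\n").each {|line|
      out += "    #{line}\n"
    }
    out
  end

  # Escape forward slashes that aren't already preceded by a backslash
  def to_js
    self.gsub(/(?mi)(?<!\\)\//,'\/')
  end
end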

class RegexRX
  attr_reader :title, :search, :flags, :replace, :source

  def initialize(file)
    doc = File.open(file) { |f| Nokogiri::XML(f) }
    @content = doc.xpath('RegExRX_Document')

    @title = doc.xpath("//Window").first["Title"].strip

    @search = grabString('fldSearch')

    @flags = ''

    @flags += 's' if grabOpt('Dot Matches Newline')
    @flags += 'i' unless grabOpt('Case Sensitive')
    @flags += 'm' if grabOpt('Treat Target As One Line')

    if @flags.length == 0
      @flags = false
    end

    # @regex = '/' + @search + '/' + @flags

    if grabPref('Do Replace')
      @replace = grabString('fldReplace')
    else
      @replace = false
    end

    @source = false
    source = grabString('fldSource')
    if source.length > 0
      @source = source
    end
  end

  def to_markdown(template)
    out = ERB.new(template).result(binding)

    out.force_encoding('utf-8')
  end


  # Fetch a named control's content and decode it from hex
  def grabString(name)
    out = @content.xpath("//Control[@name=\"#{name}\"]").first
                  .content
                  .strip
                  .force_encoding('utf-8')
    out.unpack
  end

  def grabPref(name)
    @content.xpath("//Preference[@name=\"#{name}\"]").first["value"] == "true"
  end

  def grabOpt(name)
    @content.xpath("//OptionMenu[@text=\"#{name}\"]").first["checked"] == "true"
  end
end

options = {}
optparse = OptionParser.new do|opts|
  opts.banner = "Usage: #{__FILE__} [OPTIONS]"
  options[:prefix] = ''
  options[:output] = 'markdown output'
  opts.on( '-o', '--output-dir=DIRECTORY', 'Output folder, defaults to "markdown output"') do |output|
    options[:output] = output
  end
  opts.on( '-p','--prefix=PREFIX', 'Prefix added before output filenames' ) do |prefix|
    options[:prefix] = prefix.strip + ' '
  end
  options[:template] = nil
  opts.on( '-t','--template=TEMPLATE', 'Use alternate ERB template' ) do |template|
    options[:template] = template
  end
  opts.on( '-h', '--help', 'Display this screen' ) do
    puts opts
    exit
  end
end
optparse.parse!

default_template = <<-ENDOFTEMPLATE
# <%= @title %>
<% if @flags %>

**Flags:** _<%= @flags %>_
<% end %>

**Search:**

<%= @search.indent %>
<% if @replace %>

**Replace:**

<%= @replace.indent %>
<% end %>
<% if @source %>
---

## Test string:

```text
<%= @source %>
```
<% end %>

ENDOFTEMPLATE

# If ERB template is specified, use that instead of the default
if options[:template]
  if File.exist?(File.expand_path(options[:template])) && File.basename(options[:template]) =~ /\.erb$/
    template = IO.read(File.expand_path(options[:template]))
  else
    $stderr.puts %Q{Specified template "#{options[:template]}" is not a valid template}
    Process.exit 1
  end
else
  template = default_template
end

FileUtils.mkdir_p(options[:output]) unless File.exist?(options[:output])

Dir.glob('*.regexrx').each {|file|
  # $stderr.puts "Reading #{file}"
  rx = RegexRX.new(file)
  filename = File.join(options[:output], options[:prefix] + rx.title + '.md')
  File.open(filename, 'w') {|f|
    f.print(rx.to_markdown(template))
  }
  $stderr.puts "Regex written to #{filename}"
}

Even if you don’t use RegExRX, I hope this inspires some data liberation for some folks.

Brett Terpstra

Brett is a writer and developer living in Minnesota, USA. You can follow him as ttscoff on Twitter, GitHub, and Mastodon. Keep up with this blog by subscribing in your favorite news reader.
