The MatchData object in Ruby gsub blocks

This post is absolutely only of interest to Ruby programmers. Just to save you some time.

I use regular expressions in Ruby a lot. One of the features I’ve come to use frequently is the block syntax for gsub calls. Whereas the other syntaxes for gsub really only provide back referencing for capture groups in replacements, the block syntax allows much more flexibility.

You have access to the $ variables for capture groups, but you also have the full power of the Regexp class available to the captures within the block. Just in case anybody else doesn’t know, here’s the scoop…

Typically, gsub (the global version of sub) is used as a pattern/replacement method with simple \1, \2 back references to make use of capture groups in the pattern (regular expression).

puts "A grin".gsub(/\b[A-Z] (\w+)/, 'Cheshire \1')
=> Cheshire grin

You can also pass a hash as the second argument. Each match is looked up as a key in the hash and replaced with the corresponding value.

puts "A grin".gsub(/(\w+)/, 'grin' => 'cat', 'A' => 'Cheshire')
=> Cheshire cat
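
One thing worth knowing about the hash form: a match that isn't a key in the hash gets replaced with an empty string, because the lookup returns nil. Here's a small sketch of that behavior. The "A wide grin" input and the keep_unknown hash are my own additions for illustration; a default block on the hash is one way to keep unknown matches intact.

```ruby
# Each matched word is used as a key into the replacement hash.
replacements = { 'grin' => 'cat', 'A' => 'Cheshire' }

with_all_keys = "A grin".gsub(/\w+/, replacements)
puts with_all_keys            # Cheshire cat

# "wide" is not a key, so the lookup returns nil and the word is dropped.
with_missing_key = "A wide grin".gsub(/\w+/, replacements)
puts with_missing_key.inspect # "Cheshire  cat" (note the two spaces)

# A hash with a default block hands unknown matches back unchanged.
keep_unknown = Hash.new { |_hash, key| key }.update(replacements)
with_default = "A wide grin".gsub(/\w+/, keep_unknown)
puts with_default             # Cheshire wide cat
```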

These are essential tools for quick string manipulations. As you move on to parsing larger quantities of text, you usually want to do something further with the matches, whether it’s additional logic or just more complex manipulations than simple \1 syntax provides. That’s where the block format is perfect.

A gsub call with a block looks like this:

string = "How puzzling all these changes are!"

string.gsub!(/\b(\w+)/) do |match|
	# Reverse words starting with an uppercase letter or "t"; sort the letters of the rest
	if match =~ /^(\p{Lu}|t)/
		match.reverse
	else
		match.split('').sort.join
	end
end

puts string

=> woH gilnpuzz all eseht aceghns aer!

Within that block, I always expected “match” to carry the full set of MatchData methods with it, but it’s just a String containing the overall match. You do have access to the $ variables, which you can use for referencing capture groups ($1, $2, …) in the match. However, you also have access to Regexp.last_match, which provides a MatchData object for the current gsub iteration with all of its methods, such as :names, :length and :offset, the matched string (:to_s), the original string (:string), etc.
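
To make that concrete, here's a minimal sketch using named capture groups. The time_pattern regex, the sample sentence and the seen array are all made up for illustration; the MatchData calls (:names, :[] with a group name, :begin) are standard.

```ruby
time_pattern = /(?<hour>\d{1,2}):(?<minute>\d{2})/
seen = []

result = "Tea at 6:00, not 4:30".gsub(time_pattern) do |match|
  m = Regexp.last_match # MatchData for this iteration
  # The block param is a plain String; the MatchData knows the group
  # names, the captures, and where the match starts.
  seen << [match.class, m.names, m[:hour], m.begin(0)]
  "#{m[:hour]}h#{m[:minute]}"
end

puts result # Tea at 6h00, not 4h30
p seen      # [[String, ["hour", "minute"], "6", 7], [String, ["hour", "minute"], "4", 17]]
```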

You can even get the “pre” and “post” parts of the original string (the text before and after the current match) for checking context within a broader search expression. I won’t go into a detailed example, but here’s sample usage:

"I'm late, I'm late".gsub(/(\w+)/) do |match| 
	m = Regexp.last_match
	string = m.to_s
	before_string = m.pre_match
	after_string = m.post_match
	# ...
	string
end
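
As a slightly more filled-in sketch of the same idea (the rules here are invented purely for illustration): shout any “late” that has already been said once, and add an exclamation point when nothing follows the match.

```ruby
result = "I'm late, I'm late".gsub(/late/) do |match|
  m = Regexp.last_match
  # pre_match is everything in the original string before this match
  word = m.pre_match.include?('late') ? match.upcase : match
  # post_match is everything after it; empty means we're at the end
  m.post_match.empty? ? "#{word}!" : word
end

puts result # I'm late, I'm LATE!
```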

You can actually leave off the block param (|match|) entirely; the equivalent of the “match” variable is then Regexp.last_match.to_s.

my_string.gsub!(/[[:punct:]]/) do
	match = Regexp.last_match.to_s
	# ...
end

You could also use Regexp.last_match[0]. The MatchData object provides direct access to capture group strings when addressed as an array (:[]), 0 being the full matched string.

Store the :last_match object for each iteration in a variable at the top of the block. Any further regex operation within the block (=~, match, sub, gsub, scan and so on) overwrites last_match, and the $ variables along with it.
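
Here's a small sketch of what goes wrong if you don't. The strings and the pointless inner match are made up; the point is only that the inner =~ replaces the outer MatchData (with nil, since it fails to match).

```ruby
unsafe = "late date".gsub(/(\w)ate/) do
  "x" =~ /y/ # a failed inner match resets last_match to nil
  Regexp.last_match.nil? ? "?" : Regexp.last_match[1]
end

safe = "late date".gsub(/(\w)ate/) do
  m = Regexp.last_match # grab it first
  "x" =~ /y/            # same inner match, but m is unaffected
  m[1].upcase
end

puts unsafe # ? ?
puts safe   # L D
```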

For short runs, you can put the block format in a single line with brace syntax ({ … }) and ternary operators. Here’s an overdrawn example to illustrate a simple one-liner:

class String
	def hatter
		gsub(/[[:alpha:]]/) {|m| Regexp.last_match.offset(0)[0] % 3 == 1 ? m.upcase : m.downcase }
		# that was the one-liner!
	end
end

string = "But I don't want to go among mad people.\nOh, you can't help that. We're all mad here. I'm mad. You're mad.\nHow do you know I'm mad?\nYou must be. Or you wouldn't have come here."

puts string.hatter

=>	bUt I dOn'T wAnt to go amOng maD pEopLe.
	oh, yOu Can't HelP tHat. wE'rE aLl Mad heRe. i'M mAd. yoU'rE mAd.
	hoW dO yOu KnoW i'm Mad?
	yOu MusT bE. Or You woUldN't haVe ComE hEre.

Loads of fun. Of course this is only useful for string manipulation/processing up to a certain limit, at which point you’ll probably want to start studying StringScanner.
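
For a taste of what that looks like, here's a minimal StringScanner sketch that splits a string into word and non-word tokens. The token labels (:word, :other) and the two patterns are my own choices for illustration, not anything StringScanner prescribes.

```ruby
require 'strscan'

scanner = StringScanner.new("I'm late, I'm late")
tokens = []

# scan only matches at the current position and advances past the match;
# the two patterns are complementary, so every pass consumes something.
until scanner.eos?
  if (word = scanner.scan(/[\w']+/))
    tokens << [:word, word]
  elsif (other = scanner.scan(/[^\w']+/))
    tokens << [:other, other]
  end
end

p tokens.select { |type, _| type == :word }.map(&:last)
# ["I'm", "late", "I'm", "late"]
```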

“I haven’t the slightest idea,” said the Hatter.

