frugal technology, simple living and guerrilla large-appliance repair
Thu, 16 Jun 2016

Using Ruby to delete blocks of text across multiple lines

I tend to learn things in programming when I have a problem to solve. This is just such a case.

I was working with a huge XML file, and I needed to trim out elements that begin with <generic tag>, end with </generic tag>, and contain an arbitrary amount of text and other tags, spread across multiple lines, in between.

At first I tried using the Nokogiri gem, but it just wasn't happening. I was working on my Election Results script, and ... the election -- they hold it on a certain date, you know.

I would have to brute-force it. Like I always do.

My whole idea this cycle was to dump my giant sed hack from elections past and use mostly (if not all) Ruby to parse the XML I get from the state of California and provide the JSON output my fellow dev needed for the front end. (I also have a ton of fixed-width ASCII from Los Angeles County to deal with, as well as scraped HTML from San Bernardino County, but those are other tales for other times.)

With the state data, I had the XML-to-JSON conversion covered with Ruby's Crack gem. But I just couldn't pare down the XML to make the JSON a manageable size.

So I figured out how to do the deletions I needed in UNIX's sed utility and went with that.

I was calling out to the UNIX shell from my Ruby script for the things I couldn't do in Ruby, but when that turned out to be all the things except for the XML-to-JSON conversion, I reverted to a Bash script with a call out to Ruby for that final operation.

It has been bothering me ever since.

And now I finally figured out how to use Ruby to delete text that begins and ends with certain expressions and includes an arbitrary amount of text in between -- and to do it across multiple lines.

I used the gsub method, with (.*) to represent the arbitrary text in the middle, and added the /m switch so that the dot also matches newlines, which lets the pattern span multiple lines.
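
To see what the /m switch changes, here is a minimal sketch using a made-up sample string (the text and the <start> tag are only for illustration, not my real election data). It runs the same pattern with and without /m:

```ruby
# Hypothetical sample text; the tag name and contents are invented.
sample = "keep this line\n<start>\ndrop me\nand me\n</start>\nkeep this one too\n"

# Without /m, the dot does not match newlines, so the pattern
# cannot reach from <start> to </start> and nothing is replaced.
without_m = sample.gsub(/<start>(.*)<\/start>\n/, "")

# With /m, the dot also matches newlines, so the whole block
# (including the newline after </start>) is removed.
with_m = sample.gsub(/<start>(.*)<\/start>\n/m, "")

puts without_m == sample
puts with_m
```

The first puts should report that the text is unchanged, and the second should show only the two "keep" lines.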

So far it works great. I haven't written it into my script yet, but I will do that soon.

Here is a little program I wrote to practice opening, creating and writing to files as well as deleting the blocks of text according to my chosen search:

#!/usr/bin/env ruby

=begin

This program provides practice in reading
text files into ruby variables, and creating
and writing to text files, and it also
shows how to use the *gsub* method
to delete blocks of text with a definite
beginning and ending but indeterminate middle
-- and doing so across multiple lines using
the /m switch

=end

# Read a text file into a variable

text_to_trim = File.read("text_to_trim.txt")

# An extra *puts* creates space in the output.
# You could also add \n\n when a blank line
# is needed (puts does not add its own newline
# to a string that already ends in one), as in:
# puts "This is the original text:\n\n"

puts "This is the original text:"

puts

puts text_to_trim

puts

# I am using the gsub! method to replace text
# (the ! makes it replace the text in place)
# The expression used here selects text between
# (and including) two known pieces of text,
# <start> and </start>, and uses (.*) to include
# everything (and anything) between those
# two pieces of text.
#
# The key to this script -- and the solution to
# the problem I've had for weeks -- is using /m,
# the multi-line switch, which lets the dot in (.*)
# match newlines as well.
#
# Also, I want to DELETE this text, so I'm replacing it with "".
# And the expression ends with \n (just before /m) because
# each line in my text ends with a newline, and I want
# that removed along with the closing tag.

text_to_trim.gsub!(/<start>(.*)<\/start>\n/m, "")

# See what your output looks like

puts "This is the trimmed text:"

puts

puts text_to_trim

# Creating and opening a text file to store the results

trimmed_text = File.open("trimmed_text.txt", "w")

# Writing to the text file

trimmed_text.write(text_to_trim)

# Closing the file so the output is flushed to disk

trimmed_text.close
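
One caveat worth checking before this goes into a real script: (.*) is greedy, so if the file holds more than one <start> block, the match runs from the first <start> all the way to the last </start> and takes everything in between with it. A lazy (.*?) stops at the nearest closing tag instead. Here is a small sketch with invented sample text (again, not my real data) that compares the two:

```ruby
# Hypothetical text with two blocks and a line ("b") between them.
text = "a\n<start>\none\n</start>\nb\n<start>\ntwo\n</start>\nc\n"

# Greedy: one match from the first <start> to the last </start>,
# so the "b" line between the blocks is deleted too.
greedy = text.gsub(/<start>(.*)<\/start>\n/m, "")

# Lazy: each <start> ... </start> block is matched on its own,
# so the "b" line survives.
lazy = text.gsub(/<start>(.*?)<\/start>\n/m, "")

puts greedy
puts lazy
```

With a single block in the file, the two patterns behave the same, which is why my practice program works as written.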

That's it. Feel free to use this code however you wish. I'm writing it to learn programming (and to learn how to do it in Ruby), and explaining myself in these blog posts helps me "cement" my own learning.

If you're a beginner, so am I. If you're not, I still am.

Until my next entry, there's madness to these methods. Or is it the other way around?