processing large xml data files

XML is often used to convey very large data sets. A perfect example is Wikipedia’s data: the full Wikipedia datasets are available in several different slices.

This dataset is about 24GB uncompressed for all articles in English, which makes it an excellent set for pushing the limits of your parser. Clearly, you can’t load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, methods in your code are called as the parser scans through the XML document. The parser never builds a document tree in memory, so this mode scales to whatever dataset size you need.

Nokogiri, REXML, and the direct libxml bindings for Ruby all expose a SAX mode. They differ in how the parser is invoked and in what the callback methods are named, but you use the callbacks in exactly the same way.

Nokogiri

require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here; called when you create the handler with Wikihandler.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end

end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('dump.xml')
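
As a sketch of how these callbacks combine, here is a handler that prints the title of every article. The page-title logic is illustrative and assumes the MediaWiki dump layout, where each <page> carries a <title> element:

require 'nokogiri'

# Sketch only: assumes the MediaWiki dump layout, where each article's
# name appears in a <title> element.
class TitlePrinter < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    @in_title = (name == 'title')
    @buffer = '' if @in_title
  end

  def characters(s)
    # text can arrive in several chunks, so accumulate it
    @buffer << s if @in_title
  end

  def end_element(name)
    puts @buffer if name == 'title'
    @in_title = false
  end
end

Nokogiri::XML::SAX::Parser.new(TitlePrinter.new).parse_file('dump.xml')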

libxml

require 'xml/libxml'

class Wikihandler2
  include XML::SaxParser::Callbacks
  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
    # same role as Nokogiri's start_element above
  end

  def on_end_element_ns(name, prefix, uri)
  end

  def on_characters(s)
  end

end

parser = XML::SaxParser.file('dump.xml')
parser.callbacks = Wikihandler2.new
parser.parse
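
As a sketch of the same pattern in use, here is a callbacks object that counts the <page> elements in the dump (again, the page-counting logic assumes the MediaWiki layout):

require 'xml/libxml'

# Sketch only: count <page> elements with the namespace-aware callbacks.
class PageCounter
  include XML::SaxParser::Callbacks
  attr_reader :pages

  def initialize
    @pages = 0
  end

  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
    @pages += 1 if name == 'page'
  end
end

counter = PageCounter.new
parser = XML::SaxParser.file('dump.xml')
parser.callbacks = counter
parser.parse
puts "#{counter.pages} pages"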

rexml

require 'rexml/document'
require 'rexml/streamlistener'

class Wikihandler3
  include REXML::StreamListener

  def tag_start(name, attr_hash)
  end

  def tag_end(name)
  end

  def text(s)
  end
end

REXML::Document.parse_stream(File.new('dump.xml'), Wikihandler3.new)
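
REXML streams many other events to the listener as well (comments, processing instructions, doctype declarations, and so on); including REXML::StreamListener gives your handler no-op defaults for any callbacks you don’t define.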

If you see an error like “parser error : Detected an entity reference loop”, you either need to upgrade libxml2 to a version that handles large files better, or filter the data first. The Wikipedia data dumps above can be converted into a form that is safe for the older libxml2 library with a command like the one below. Note that the &amp; substitution must come last, so that a double-escaped entity like &amp;lt; is not collapsed to &lt; and then converted a second time:

bunzip2 < enwiki-latest-pages-articles.xml.bz2 | sed 's/&lt;/&#60;/g' | sed 's/&gt;/&#62;/g' | sed 's/&quot;/&#34;/g' | sed 's/&amp;/&#38;/g' > dump.xml

I have a script that I use to find issues inside a large XML document. I call it with the name of an element to split around, followed by the names of the files to split. This helps zero in on data problems in a huge dataset by bisecting the problem on each pass. To split the document above, use “axeml.rb page dump.xml”.

I intentionally avoid calling an actual XML parser, since the point is to find syntax errors in a large file. The script also makes assumptions about formatting (e.g. tags separated out on their own lines).

The script is axeml.rb, as in split-with-an-axe :)

#!/usr/bin/ruby

def split(fname, element)
  f = File.new fname
  half = File.size(fname) / 2

  # The first line is assumed to hold the opening root tag; reuse it as the
  # opener for both halves and synthesize a matching closing tag.
  opener = f.gets
  closer = ''
  closer = "</#{$1}>" if opener =~ /<([-\w]+)/

  c = opener.size
  base = File.basename(fname, '.xml')
  f1 = File.new(base + '.0.xml', 'w+')
  f2 = File.new(base + '.1.xml', 'w+')
  f1.puts opener

  # Copy lines into the first half until we pass the midpoint of the file.
  line = ''
  while c < half
    line = f.gets
    c += line.size
    f1.puts line
  end

  if element
    # Continue to the end of the named element so it isn't cut in half.
    fullelement = "</#{element}>"
    until line.include?(fullelement)
      line = f.gets
      break if !line
      f1.puts line
    end
  end
  f1.puts closer

  # Everything remaining goes into the second half.
  f2.puts opener
  while line = f.gets
    f2.puts line
  end

  [f, f1, f2].each { |io| io.close }
end

element = ARGV.shift
ARGV.each { |file| split file, element }
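
Each run splits every file you name into two roughly equal halves (base.0.xml and base.1.xml), so you can re-run your parser on each half and keep narrowing in on the bad record:

ruby axeml.rb page dump.xml
# produces dump.0.xml and dump.1.xml; parse each, then split the failing half again
ruby axeml.rb page dump.0.xml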