Processing large XML data files

XML is often used to convey very large data sets. A perfect example is Wikipedia's data. The full Wikipedia datasets are available in several different slices.

The dataset for all articles in English is about 24GB uncompressed. This is an excellent set to push the limits of your parser. Clearly, you can't load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, methods in your code are called as the parser scans through the XML document. The parser does not retain any state as it goes, so this mode scales up to whatever dataset size you need.

Nokogiri, REXML, and the direct libxml interface in Ruby all expose a SAX mode. They differ in how the parser is invoked and in the names of the callback methods, but you use the callbacks in the very same way; they are only named differently.


require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here; called as part of Wikihandler.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an ActiveRecord object if appropriate
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected,
    # and save your ActiveRecord object now
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('dump.xml')


require 'xml/libxml'

class Wikihandler2
  include XML::SaxParser::Callbacks

  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
  end

  def on_end_element_ns(name, prefix, uri)
  end

  def on_characters(s)
  end
end

parser = XML::SaxParser.file('dump.xml')
parser.callbacks = Wikihandler2.new
parser.parse


require 'rexml/document'
require 'rexml/streamlistener'

class Wikihandler3
  include REXML::StreamListener

  def tag_start(name, attr_hash)
  end

  def tag_end(name)
  end

  def text(s)
  end
end

REXML::Document.parse_stream(File.new('dump.xml'), Wikihandler3.new)
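To make the callback pattern concrete, here is a self-contained sketch using REXML's stream listener. The `<page>`/`<title>` element names match the shape of the Wikipedia dump; the `TitleCollector` class name and the inline sample document are illustrative, not part of any library:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# collects the <title> text of every page as the stream goes by;
# only the current text buffer is held in memory, never the document
class TitleCollector
  include REXML::StreamListener
  attr_reader :titles

  def initialize
    @titles = []
    @buffer = nil
  end

  def tag_start(name, attr_hash)
    @buffer = +'' if name == 'title'
  end

  def text(s)
    @buffer << s if @buffer
  end

  def tag_end(name)
    if name == 'title'
      @titles << @buffer
      @buffer = nil
    end
  end
end

xml = <<~XML
  <mediawiki>
    <page><title>Ruby</title></page>
    <page><title>XML</title></page>
  </mediawiki>
XML

listener = TitleCollector.new
REXML::Document.parse_stream(xml, listener)
listener.titles  # => ["Ruby", "XML"]
```

The same shape carries over to the Nokogiri and libxml handlers above: accumulate text in the characters callback, act on it in the end-element callback.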

If you see an error like “parser error : Detected an entity reference loop”, you either need to upgrade libxml2 (newer versions handle large files better) or filter the data first. The Wikipedia data dumps above can be converted into a form that is safe for the older libxml2 library with a command like:

bunzip2 <enwiki-latest-pages-articles.xml.bz2 | sed 's/&lt;/&#60;/g' | sed 's/&gt;/&#62;/g' | sed 's/&quot;/&#34;/g' | sed 's/&amp;/&#38;/g' >dump.xml
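The same filtering can be done in Ruby if sed isn't handy. This sketch applies the substitutions in the same order as the pipeline above, a line at a time so the file never has to fit in memory (`numeric_entities` is a made-up helper name, not a library function):

```ruby
# replace the named entities with numeric character references,
# in the same order as the sed pipeline
ENTITIES = {
  '&lt;'   => '&#60;',
  '&gt;'   => '&#62;',
  '&quot;' => '&#34;',
  '&amp;'  => '&#38;'
}

def numeric_entities(line)
  ENTITIES.inject(line) { |s, (from, to)| s.gsub(from, to) }
end

# to stream stdin to stdout, uncomment:
#   ARGF.each_line { |line| puts numeric_entities(line) }
numeric_entities('&lt;b&gt; &amp; &quot;x&quot;')
# => "&#60;b&#62; &#38; &#34;x&#34;"
```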

I have a script that I use to find issues inside a large XML document. I call it with the name of an element to split around and a list of file names, each of which gets split in half. This can help zero in on data problems in a huge dataset by bisecting the problem on each pass. To split the document above, use “axeml.rb page dump.xml”.

I intentionally avoid calling an actual XML parser, since the point is to help find syntax errors in a large file. The script also makes assumptions about formatting (e.g. tags separated out on individual lines).

The script is axeml.rb, as in split-with-an-axe :)


def split(fname, element)
  f = File.open(fname)
  half = File.size(fname) / 2
  opener = f.gets
  closer = ''
  closer = "</#{$1}>" if opener =~ /<([-_\w]+)/

  c = opener.size
  base = File.basename(fname, '.xml')
  f1 = File.open(base + '.0.xml', "w+")
  f2 = File.open(base + '.1.xml', "w+")
  f1.puts opener
  line = ''
  while c < half
    line = f.gets
    c += line.size
    f1.puts line
  end
  if element
    fullelement = "</#{element}>"
    # continue to the end of the named element
    until line =~ /#{fullelement}/
      line = f.gets
      break if !line
      f1.puts line
    end
  end
  f1.puts closer
  f2.puts opener
  while line = f.gets
    f2.puts line
  end
  [f, f1, f2].each { |io| io.close }
end

element = ARGV.shift
ARGV.each { |file| split file, element }
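To sanity-check the approach, the same splitting idea can be exercised on a tiny fixture that matches the script's formatting assumptions (one tag per line, a single root element). The file names and the `<page>` element below are illustrative, standing in for the real 24GB dump:

```ruby
require 'rexml/document'

# build a tiny fixture in the shape axeml.rb assumes
xml = +"<mediawiki>\n"
6.times { |i| xml << "<page>\n<title>Page #{i}</title>\n</page>\n" }
xml << "</mediawiki>\n"
File.write('tiny.xml', xml)

# apply the same idea as the script: copy lines until the halfway byte
# count, keep going to the end of the current <page>, then close the
# root in the first half and reopen it in the second
lines  = File.readlines('tiny.xml')
opener = lines.first
closer = "</mediawiki>\n"
body   = lines[1..-2]
half   = xml.bytesize / 2

first, second = [opener], [opener]
count, i = opener.bytesize, 0
while count < half || (i > 0 && body[i - 1] !~ %r{</page>})
  first << body[i]
  count += body[i].bytesize
  i += 1
end
first << closer
second.concat(body[i..-1]) << closer

File.write('tiny.0.xml', first.join)
File.write('tiny.1.xml', second.join)

# both halves now parse as standalone documents with 3 pages each
p REXML::Document.new(File.read('tiny.0.xml')).root.elements.size  # => 3
p REXML::Document.new(File.read('tiny.1.xml')).root.elements.size  # => 3
```

If either half fails to parse, the syntax error is in that half; repeating the split on the bad half narrows the problem down quickly.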