XML is often used to convey very large data sets. A perfect example is Wikipedia's data. The full Wikipedia datasets are available in several different slices.
The dump of all English-language articles is about 24GB uncompressed. This is an excellent set to push the limits of your parser. Clearly, you can't load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, methods in your code are called as the parser scans through the XML document. The parser does not build a document tree or keep what it has already read, so memory use stays roughly constant and this mode scales to whatever dataset size you need. Any state you need, such as the current element or the text collected so far, you keep yourself in the handler.
Nokogiri, REXML, and the libxml bindings for Ruby all expose a SAX mode. They differ in how the parser is invoked and in the names and arguments of the callback methods. You use the callbacks in the very same way; they are only named differently.
Nokogiri
require 'nokogiri'
class Wikihandler < Nokogiri::XML::SAX::Document
  def initialize
    # do one-time setup here, called as part of Wikihandler.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('dump.xml')
libxml
require 'xml/libxml'
class Wikihandler2
  include XML::SaxParser::Callbacks

  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
  end

  def on_end_element_ns(name, prefix, uri)
  end

  def on_characters(s)
  end
end

parser = XML::SaxParser.file('dump.xml')
parser.callbacks = Wikihandler2.new
parser.parse
rexml
require 'rexml/document'
require 'rexml/streamlistener'
class Wikihandler3
  # supplies do-nothing defaults for the callbacks not defined below
  include REXML::StreamListener

  def tag_start(name, attr_hash)
  end

  def tag_end(name)
  end

  def text(s)
  end
end

REXML::Document.parse_stream(File.new('dump.xml'), Wikihandler3.new)
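The handlers above are empty skeletons. As a minimal sketch of what the callbacks might do, here is a REXML listener that collects page titles. The class name TitleCollector and the tiny inline sample document are made up for illustration; a real dump has far more structure. Note that the text callback may be called more than once for a single element, so the handler appends to a buffer instead of assigning.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical handler: collects the text of every <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
    @buffer = String.new
  end

  def tag_start(name, _attrs)
    return unless name == 'title'
    @in_title = true
    @buffer = String.new
  end

  # text can arrive in several pieces, so append rather than assign
  def text(s)
    @buffer << s if @in_title
  end

  def tag_end(name)
    return unless name == 'title'
    @titles << @buffer.strip
    @in_title = false
  end
end

# Made-up sample standing in for a dump file.
SAMPLE = <<~XML
  <mediawiki>
    <page><title>Ruby</title><text>A language.</text></page>
    <page><title>XML</title><text>A format.</text></page>
  </mediawiki>
XML

collector = TitleCollector.new
REXML::Document.parse_stream(StringIO.new(SAMPLE), collector)
puts collector.titles.inspect
```

For a real dump you would pass File.new('dump.xml') instead of the StringIO, and probably write each title out as it is found instead of keeping them all in an array.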
If you see an error like “parser error : Detected an entity reference loop”, you either need to upgrade libxml2 to a version that handles large files better or filter the data first. The Wikipedia data dumps above can be converted into a form that is safe for older versions of libxml2 with a command like:
bunzip2 <enwiki-latest-pages-articles.xml.bz2 | sed 's/\</\</g' | sed 's/\>/\>/g' | sed 's/\"/\"/g' | sed 's/\&/\&/g' >dump.xml
I have a script that I use to find problems inside a large XML document. I call it with the element name to split around, followed by the names of the files to split. Each file is cut into two halves, which helps zero in on data problems in a huge dataset by bisecting the problem on each pass. To split the document above, use “axeml.rb page dump.xml”.
I intentionally avoid calling an actual XML parser, since the point is to help find syntax errors in a large file. The script also makes assumptions about formatting (e.g. that tags are separated out on their own lines).
The script is axeml.rb, as in split-with-an-axe :)
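#!/usr/bin/ruby
# Split an XML file roughly in half at an element boundary.
def split(fname, element)
  f = File.new(fname)
  half = File.size(fname) / 2
  # the first line is assumed to hold the root element's opening tag
  opener = f.gets
  closer = ''
  closer = "</#{$1}>" if opener =~ /<([-_\w]+)/
  c = opener.bytesize
  base = File.basename(fname, '.xml')
  f1 = File.new(base + '.0.xml', 'w')
  f2 = File.new(base + '.1.xml', 'w')
  f1.puts opener
  line = ''
  # copy lines into the first half until we pass the midpoint
  while c < half && (line = f.gets)
    c += line.bytesize
    f1.puts line
  end
  if element && line
    fullelement = "</#{element}>"
    # continue to the end of the named element
    until line.include?(fullelement)
      line = f.gets
      break if !line
      f1.puts line
    end
  end
  f1.puts closer
  # the second half gets the opener plus everything that is left
  f2.puts opener
  while (line = f.gets)
    f2.puts line
  end
ensure
  [f, f1, f2].each { |io| io.close if io }
end

element = ARGV.shift
ARGV.each { |file| split file, element }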
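After a split you still need to find out which half contains the problem. One possible way, sketched here with REXML from the standard distribution, is to stream each half through a listener that does nothing and see whether the parser raises. NullListener and first_xml_error are names invented for this example, and REXML will be slow on multi-gigabyte halves, so treat this as an illustration of the bisecting loop and not as part of axeml.rb.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Listener that ignores every event; we only care whether parsing raises.
class NullListener
  include REXML::StreamListener
end

# Returns nil if the input parses cleanly, otherwise the parser's error message.
def first_xml_error(io)
  REXML::Document.parse_stream(io, NullListener.new)
  nil
rescue REXML::ParseException => e
  e.message
end

# Check the two halves that "axeml.rb page dump.xml" would have written,
# if they exist in the current directory.
%w[dump.0.xml dump.1.xml].each do |name|
  next unless File.exist?(name)
  error = File.open(name) { |io| first_xml_error(io) }
  puts "#{name}: #{error ? 'error' : 'ok'}"
  puts error if error
end
```

Whichever half reports an error is the one to rename and split again on the next pass.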


You might want to try VTD-XML (http://vtd-xml.sf.net) for processing large or huge XML documents… it has two APIs: the standard version processes XML up to 2GB in size, and the extended version allows documents up to 256 GB in size… XPath 1.0 is built in… it has tons of other features
The problem I have with such parsers is that they read forward-only and do not cache. Xponent developed a caching parser, allowing one to do most anything in one pass. It is currently in public open beta test.