XML is often used to convey very large data sets. A perfect example is Wikipedia's data. The full Wikipedia datasets are available in several different slices.
The dataset of all English articles is about 24GB uncompressed. This is an excellent set to push the limits of your parser. Clearly, you can't load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, methods in your code are called as the parser scans through the XML document. The parser does not build up the document in memory as it goes, so this mode scales to whatever dataset size you need; any state you need between callbacks has to be kept by your own handler.
Nokogiri, REXML, and the direct libxml interface in Ruby all expose a SAX mode. They differ in how the parser is invoked and in the names of the callback methods. You use the callbacks in the very same way; they are only named differently.
Nokogiri
require 'nokogiri'
class Wikihandler < Nokogiri::XML::SAX::Document
  def initialize
    # do one-time setup here, called as part of Class.new
  end
  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
  end
  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end
  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end
end
parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('dump.xml')
libxml
require 'xml/libxml'
class Wikihandler2
  include XML::SaxParser::Callbacks
  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
  end
  def on_end_element_ns(name, prefix, uri)
  end
  def on_characters(s)
  end
end
parser = XML::SaxParser.file('dump.xml')
parser.callbacks = Wikihandler2.new
parser.parse
rexml
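require 'rexml/document'
require 'rexml/streamlistener'
class Wikihandler3
  # StreamListener supplies no-op versions of the callbacks not defined here
  # (xmldecl, comment, cdata, ...), which the stream parser also calls
  include REXML::StreamListener
  def tag_start(name, attr_hash)
  end
  def tag_end(name)
  end
  def text(s)
  end
end
REXML::Document.parse_stream(File.new('dump.xml'), Wikihandler3.new)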
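To make the empty callbacks above more concrete, here is a minimal sketch of a handler that actually collects something. It uses REXML only because it ships with Ruby. The class name `TitleCollector`, the constant `SAMPLE_XML`, and the `title` element inside each `page` are assumptions made for illustration, not something taken from the real dump format. The point it shows is that the handler, not the parser, keeps whatever state is needed between callbacks.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects the text of every <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @buffer = nil # non-nil only while we are inside a <title> element
  end

  def tag_start(name, _attrs)
    @buffer = String.new if name == 'title'
  end

  def text(s)
    # text may arrive in several chunks, so append rather than assign
    @buffer << s if @buffer
  end

  def tag_end(name)
    return unless name == 'title' && @buffer
    @titles << @buffer.strip
    @buffer = nil
  end
end

# Tiny stand-in for a real dump (assumed structure).
SAMPLE_XML = <<~XML
  <mediawiki>
    <page><title>Ruby</title><text>A language.</text></page>
    <page><title>XML</title><text>A markup format.</text></page>
  </mediawiki>
XML

collector = TitleCollector.new
REXML::Document.parse_stream(SAMPLE_XML, collector)
puts collector.titles.inspect
```

For a real file you would pass `File.new('dump.xml')` instead of the string, and probably write each title out as you go rather than accumulating them all in an array.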
If you see an error like "parser error : Detected an entity reference loop", you either need to upgrade libxml2 to a version that handles large files better, or filter the data first. The Wikipedia data dumps above can be converted into a form that is safe for the older libxml2 library with a command like:
bunzip2 <enwiki-latest-pages-articles.xml.bz2 | sed 's/</</g' | sed 's/>/>/g' | sed 's/"/"/g' | sed 's/&/&/g' >dump.xml
I have a script that I use to find issues inside a large XML document. I call it with the name of the element to split around, followed by the names of the files to split. Each file is cut into two halves, and the cut is placed just after the end of the named element so that both halves stay balanced. This helps zero in on data problems in a huge dataset by bisecting the problem on each pass. To split the document above, use "axeml.rb page dump.xml".
I intentionally avoid calling an actual XML parser, since this is meant to help find syntax errors in a large file. It also makes assumptions about formatting (e.g. that tags are separated out on their own lines, and that the first line holds the root element's opening tag).
The script is axeml.rb, as in split-with-an-axe :)
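#!/usr/bin/ruby
# Split each named file roughly in half, ending the first half at the
# close of the given element so both halves stay balanced.
def split(fname, element)
  f = File.new fname
  half = File.size(fname) / 2
  # the first line is assumed to hold the root element's opening tag
  opener = f.gets
  closer = ''
  closer = "</#{$1}>" if opener =~ /<([-_\w]+)/
  c = opener.bytesize
  base = File.basename(fname, '.xml')
  f1 = File.new(base + '.0.xml', "w+")
  f2 = File.new(base + '.1.xml', "w+")
  f1.puts opener
  line = ''
  # copy lines into the first half until about half the bytes are written
  while c < half
    line = f.gets
    break if !line
    c += line.bytesize
    f1.puts line
  end
  if element && line
    fullelement = Regexp.escape("</#{element}>")
    # continue to the end of the named element
    until line =~ /#{fullelement}/
      line = f.gets
      break if !line
      f1.puts line
    end
  end
  f1.puts closer
  f2.puts opener
  # the rest of the input, including the original closing tag, goes to the second half
  while (line = f.gets)
    f2.puts line
  end
ensure
  [f, f1, f2].each { |io| io.close if io }
end
element = ARGV.shift
ARGV.each { |file| split file, element }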
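Once a file has been split, you still need to decide which half contains the problem. The following is only a sketch of one way to do that, assuming the failure you are chasing is a well-formedness error that REXML also rejects (another parser, such as an older libxml2, may complain about different things). The names `NullListener` and `well_formed?` are hypothetical helpers, not part of axeml.rb.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Listener that ignores every event; we only care whether parsing succeeds.
class NullListener
  include REXML::StreamListener
end

# Returns true if the source (a String or an open IO) streams through
# REXML without a parse error, false otherwise.
def well_formed?(source)
  REXML::Document.parse_stream(source, NullListener.new)
  true
rescue REXML::ParseException
  false
end

puts well_formed?('<a><b/></a>')   # balanced tags
puts well_formed?('<a><b></a>')    # <b> is never closed
```

To use it on the halves, open each one and pass the file handle, for example `File.open('dump.0.xml') { |io| well_formed?(io) }`, then run axeml.rb again on whichever half fails.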


You might want to try VTD-XML (http://vtd-xml.sf.net) for processing large or huge XML documents. It has two APIs: the standard version processes XML up to 2GB in size, and the extended version allows documents up to 256GB in size. XPath 1.0 is built in, and it has tons of other features.
The problem I have with such parsers is that they read forward-only and do not cache. Xponent developed a caching parser, allowing one to do almost anything in one pass. It is currently in public open beta test.
Hi Brad,
I came across your site when searching for XML and large datasets.
I agree with you that SAX makes a lot of sense, however also StAX may be considered.
I have worked on many projects as a software engineer (mostly in the 1990s) and as an architect (mostly in the last 10 years). One of those projects involved processing extremely large XML datasets (containing 1,000,000+ financial transactions), and I joined a team where development had been outsourced. The guys had come up with a 'solution' using JAXB. Of course that could not work, at least not if you load all the data into memory and work from there (which was exactly what they had been doing). I estimated the maximum dataset to be around 120,000 transactions, depending on available memory. In practice it was much worse, of course (only about 30,000 at most). So I recommended a different solution based on SAX or StAX (actually, I had already built one). As a spin-off of that project I decided to develop that solution further into an MDE-based approach. Based on the specification (XSD), the code generator would generate the XML parser, which could send logical events to the listening processor. This is what I did, and it worked very well. You can find the information here: http://dijkstra-ict.nl/documents/XMLParserTechnologyForProcessingHugeXMLfiles.pdf.
If this is of interest to anyone here, drop me a note.
lolke.dijkstra@dijkstra-ict.com
Kind regards,
Lolke Dijkstra