XML is often used to convey very large data sets. A perfect example is Wikipedia's data: the full Wikipedia datasets are available in several different slices.
This dataset is about 24GB uncompressed for all articles in English, which makes it an excellent set for pushing the limits of your parser. Clearly, you can't load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, methods in your code are called as the parser scans through the XML document. The parser does not build a document tree or hold on to what it has already seen, so this mode scales to whatever dataset size you need.
Nokogiri, REXML, and the direct libxml bindings in Ruby all expose a SAX mode. They differ in how the parser is invoked and in the names of the callback methods, but you use the callbacks in exactly the same way; they are only named differently.
Nokogiri
require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document
  def initialize
    # do one-time setup here, called as part of Wikihandler.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an ActiveRecord object if appropriate
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected,
    # and save your ActiveRecord object now
  end
end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('dump.xml')
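To make the buffering pattern concrete, here is a minimal filled-in sketch that prints every page title in the dump. The title element is part of the MediaWiki export format; the TitlePrinter name is just illustrative. Note that characters can be called more than once for a single text node, so append rather than assign:

require 'nokogiri'

# Collects the text of every <title> element in the dump and prints it.
class TitlePrinter < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    @buffer = '' if name == 'title'   # start collecting text
  end

  def characters(s)
    # text may arrive in several chunks for one element, so append
    @buffer << s if @buffer
  end

  def end_element(name)
    if name == 'title'
      puts @buffer
      @buffer = nil                   # stop collecting
    end
  end
end

Nokogiri::XML::SAX::Parser.new(TitlePrinter.new).parse_file('dump.xml')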
libxml
require 'xml/libxml'

class Wikihandler2
  include XML::SaxParser::Callbacks

  def on_start_element_ns(name, attributes, prefix, uri, namespaces)
  end

  def on_end_element_ns(name, prefix, uri)
  end

  def on_characters(s)
  end
end

parser = XML::SaxParser.file('dump.xml')
parser.callbacks = Wikihandler2.new
parser.parse
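The libxml parser can also read from an IO object rather than a file, which means you can stream straight out of the compressed dump without ever writing 24GB to disk. A sketch, assuming bunzip2 is on your path and Wikihandler2 is defined as above:

require 'xml/libxml'

# Parse the dump straight out of bunzip2 instead of uncompressing to disk.
io = IO.popen('bunzip2 -c enwiki-latest-pages-articles.xml.bz2')
parser = XML::SaxParser.io(io)
parser.callbacks = Wikihandler2.new
parser.parse
io.close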
REXML
require 'rexml/document'

class Wikihandler3
  def tag_start(name, attr_hash)
  end

  def tag_end(name)
  end

  def text(s)
  end
end

REXML::Document.parse_stream(File.new('dump.xml'), Wikihandler3.new)
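If you would rather not define every callback, REXML also ships a StreamListener module that provides empty defaults for all of them; mix it in and write only the ones you need:

require 'rexml/document'
require 'rexml/streamlistener'

class TitleListener
  include REXML::StreamListener   # supplies no-op defaults for every callback

  def text(s)
    # only the callbacks you care about need to be written
  end
end

REXML::Document.parse_stream(File.new('dump.xml'), TitleListener.new)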
If you see an error like “parser error : Detected an entity reference loop”, you either need to upgrade libxml2, which handles large files better in newer releases, or filter the data first. The Wikipedia data dumps above can be converted into a form that is safe for older libxml2 releases with a command like:
bunzip2 <enwiki-latest-pages-articles.xml.bz2 | sed 's/&lt;/</g' | sed 's/&gt;/>/g' | sed 's/&quot;/"/g' | sed 's/&amp;/\&/g' >dump.xml
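If sed is not handy, the same streaming filter is a few lines of Ruby; this sketch reads the pipe a line at a time so memory use stays flat, and it assumes the same four entity replacements as the sed version:

# Streaming equivalent of the sed pipeline above; one line in memory at a time.
File.open('dump.xml', 'w') do |out|
  IO.popen('bunzip2 -c enwiki-latest-pages-articles.xml.bz2') do |io|
    io.each_line do |line|
      out.write line.gsub('&lt;', '<').gsub('&gt;', '>').gsub('&quot;', '"').gsub('&amp;', '&')
    end
  end
end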
I have a script that I use to find issues inside a large XML document. I call it with the name of an element to split around and a list of file names, each of which is split in half. This helps zero in on data problems in a huge dataset by bisecting the problem on each pass. To split the document above, use “axeml.rb page dump.xml”.
I intentionally avoid calling an actual XML parser, since the point is to find syntax errors in a large file. The script also makes assumptions about formatting (e.g. tags separated out on their own lines).
The script is axeml.rb, as in split-with-an-axe :)
#!/usr/bin/ruby
# Split an XML file roughly in half, closing the root element in the first
# half and re-opening it in the second so both halves stay parseable.
def split(fname, element)
  f = File.new fname
  half = File.size(fname) / 2
  opener = f.gets                       # assume the root tag is on the first line
  closer = ''
  closer = "</#{$1}>" if opener =~ /<([-_\w]+)/
  c = opener.size
  base = File.basename(fname, '.xml')
  f1 = File.new(base + '.0.xml', "w+")
  f2 = File.new(base + '.1.xml', "w+")
  f1.puts opener
  line = ''
  while c < half                        # copy the first half of the file
    line = f.gets
    c += line.size
    f1.puts line
  end
  if element
    fullelement = "</#{element}>"
    # continue to the end of the named element
    until line =~ /#{fullelement}/
      line = f.gets
      break if !line
      f1.puts line
    end
  end
  f1.puts closer                        # close the root in the first half
  f2.puts opener                        # and re-open it in the second
  while f.gets
    f2.puts $_
  end
end

element = ARGV.shift
ARGV.each { |file| split file, element }
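A typical session bisects its way down: split, re-test each half with your parser, then split again whichever half still provokes the error, halving the search space on each pass:

axeml.rb page dump.xml      # writes dump.0.xml and dump.1.xml
axeml.rb page dump.0.xml    # if the error reproduces in the first half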