XML is often used to convey very large data sets. A perfect example is Wikipedia’s data. The full Wikipedia datasets are available in several different slices.
The dump of all English articles is about 24GB uncompressed, which makes it an excellent set for pushing the limits of your parser. Clearly, you can’t load a file like this into memory to process it. To handle a file this large, you need to use SAX mode. In this mode, the parser calls methods in your code as it scans through the XML document. The parser itself does not build a document tree or retain any state as it goes, so as long as your own handlers don’t accumulate data, this mode scales up to whatever dataset size you need.
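To make the callback style concrete, here is a minimal sketch using REXML's stream listener, one of the SAX-style interfaces available in Ruby. The PageCounter class and the tiny inline sample are made up for illustration, and the sketch assumes the dump wraps each article in a page element with a title child. For a real dump you would pass an open File instead of the StringIO.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'
require 'stringio'

# Hypothetical listener (not from the original post): counts <page>
# elements and records the text of each <title> as the parser streams by.
class PageCounter
  include REXML::StreamListener

  attr_reader :page_count, :titles

  def initialize
    @page_count = 0
    @titles = []
    @in_title = false
    @buffer = +''
  end

  # Called for every opening tag.
  def tag_start(name, _attrs)
    case name
    when 'page'
      @page_count += 1
    when 'title'
      @in_title = true
      @buffer = +''
    end
  end

  # Called for text nodes; may fire more than once per element,
  # so the pieces are appended to a buffer.
  def text(data)
    @buffer << data if @in_title
  end

  # Called for every closing tag.
  def tag_end(name)
    return unless name == 'title'

    @titles << @buffer.strip
    @in_title = false
  end
end

# Assumed, simplified stand-in for a Wikipedia dump.
SAMPLE = <<~XML
  <mediawiki>
    <page><title>Ruby</title><text>...</text></page>
    <page><title>XML</title><text>...</text></page>
  </mediawiki>
XML

counter = PageCounter.new
REXML::Document.parse_stream(StringIO.new(SAMPLE), counter)
puts "#{counter.page_count} pages: #{counter.titles.join(', ')}"
```

Collecting titles in an array is fine for a demo, but on a multi-gigabyte dump you would handle each title inside tag_end (write it out, update a counter) rather than keep them all, otherwise the memory advantage is lost.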
Nokogiri, REXML, and the direct libxml interfaces in Ruby all expose a SAX mode. These differ in the way the

