On re-evaluating old hammers. XML::Simple is crazy slow.

Everyone uses XML, whether you want to or not, and whether you think it’s a good thing or a bad thing. Somehow, somewhere, you are subjected to it: as a configuration file format, for data interchange, or, lord knows, something worse. Maybe you’re doing something insane like updating an XML document in real time to treat it like a database. The fact of the matter is that XML is everywhere, and in 2012, we take it for granted.

At my day job, we deal with lots and lots of transactions, and lots and lots of data. Most of this data is binary blobs (PDFs), but oftentimes it comes pre-packaged with some sort of “configuration file,” something we take for granted. After all, it’s 2012. Most of our real hardware is running on SSDs; how expensive can it really be to parse a little bit of XML?

So, this blog post isn’t really about what we do every day at my job. At my job, we discovered that our old trusty XML “parsing” module, XML::Simple, was slow. That probably doesn’t come as a surprise to any of you seasoned perl developers out there. But what surprised me was just how slow it was. XML::Simple is a pretty useful little module: it allows you to take an XML document and turn it into a perl data structure with very minimal effort. However, this convenience comes at a larger-than-expected cost. I went out to the XMark project website and grabbed their “ready made document,” a monster 116 megabyte XML file. I wrote a simple perl script to parse this file:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use XML::Simple;
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.xml";

# Parse the XML file straight into a perl data structure and time it.
my $start_time = Time::HiRes::time;
my $data_in = XMLin($infile);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

and ran it on my nice, relatively modern computer with an SSD. And I waited, and I waited. I’m not a particularly patient person when it comes to these things, so after about 2 minutes of waiting, I hit Ctrl-C. This test was no good, I decided; clearly the problem was that simply *reading* the file off the disk was taking much too long (even though, again, I have an SSD. It was late, what can I say). So I wrote the following bit of code using File::Slurp:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use File::Slurp;
use XML::Simple;
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.xml";

# Time reading the whole file off disk into a scalar.
my $start_time = Time::HiRes::time;
my $data = read_file $infile;
print STDERR "Slurp: ".(Time::HiRes::time - $start_time)."\n";

# Time turning that scalar into a perl data structure.
$start_time = Time::HiRes::time;
my $data_in = XMLin($data);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

And ran it. I was certain I was going to find that reading this file was taking longer than I expected, but I was immediately greeted with:

{11:23:40} (Eduardos-Mac-Pro) ☹ [ 1 ]
<~/play/xml_vs_json> $ perl scripts/xmlsimple.pl 
Slurp: 0.345096111297607

and a whole bunch of waiting. So, it was only taking a third of a second to read 116 megabytes off of disk. I’d have to congratulate Uri on a fast little module. And so I let my mind wander and thought, well, clearly there has to be a faster way to parse XML from within perl. This led me to a perlmonks thread titled “Fastest XML Parser,” which seemed promising. The collective wisdom of the monks agreed that the fastest module was clearly XML::LibXML. This made good sense, as this module is perl bindings into the venerable libxml2 library. It didn’t do exactly what I wanted: instead of turning an XML document into a perl data structure, it gave me a DOM to play with. But maybe that was the best I could expect. Eventually, the XML::Simple solution finished, by the way:

{10:56:26} (Eduardos-Mac-Pro) ☺
<~/play/xml_vs_json> $ perl scripts/xmlin.pl 
Slurp: 0.346162080764771
Parse: 124.664380073547

124 seconds. Just over two minutes. That was clearly not going to fly. So, I went off and created a solution with XML::LibXML. The code was, basically, identical for testing purposes:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use File::Slurp;
use XML::LibXML;
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.xml";

my $start_time = Time::HiRes::time;
my $data = read_file $infile;
print STDERR "Slurp: ".(Time::HiRes::time - $start_time)."\n";

# libxml2 builds the tree in C; we get back a DOM document object, not a hash.
$start_time = Time::HiRes::time;
my $dom = XML::LibXML->load_xml(string => $data);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

Again, it didn’t do exactly what I wanted, but sometimes it takes a tough man to cook a tender chicken. Instead of a ready-made perl hash, I now had a DOM to query.
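
For illustration, here is roughly what pulling values out of that DOM looks like with XPath. Treat this as a sketch, not part of the benchmark: the element names (item, name) are my assumption about the XMark auction document’s layout.

#!/usr/bin/env perl
use warnings;
use strict;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(location => 'data/standard.xml');

# findnodes() evaluates an XPath expression and returns the matching nodes.
# NOTE: '//item' and './name' are assumptions about the XMark schema.
for my $item ($dom->findnodes('//item')) {
    my $name = $item->findvalue('./name');   # text content of the child node
    print "$name\n";
}

The runtimes were considerably nicer too: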

{11:31:40} (Eduardos-Mac-Pro) ☹ [ 1 ]
<~/play/xml_vs_json/scripts> $ perl xml-libxml.pl 
Slurp: 0.37142014503479
Parse: 3.30488514900208

I want that to sink in. Switching XML parsing modules from XML::Simple to XML::LibXML gave me a roughly 3700% performance increase: from 124 seconds to 3.3 seconds to parse. This was clearly valuable, but it wasn’t a fair comparison, since I wasn’t getting a perl data structure back. After a little bit of digging I discovered that, as is so often the case in perl, I was not the first person to want a faster version of XML::Simple. Some kind soul had invested their time and effort and provided CPAN with XML::LibXML::Simple, a “re-implementation” of XML::Simple using libxml2. It doesn’t have all the features of XML::Simple (for example, it provides an XMLin but no XMLout), but for the problem at hand, it may have been the exact thing I wanted. So, I wrote the following snippet:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use File::Slurp;
use XML::LibXML::Simple qw(XMLin);
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.xml";

my $start_time = Time::HiRes::time;
my $data = read_file $infile;
print STDERR "Slurp: ".(Time::HiRes::time - $start_time)."\n";

# Same XMLin() interface as XML::Simple, but backed by libxml2.
$start_time = Time::HiRes::time;
my $data_in = XMLin($data);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

which is basically the first slurp-based gist with the XML parsing module swapped out. I ran it and hoped for the best. The best, however, was not particularly great. File::Slurp was still as fast as ever, but I had greatly underestimated the cost of turning a DOM document into a perl data structure:

{11:35:46} (Eduardos-Mac-Pro) ☺
<~/play/xml_vs_json/scripts> $ perl xml-libxml-simple.pl 
Slurp: 0.362099885940552
Parse: 60.4428260326385

How were we back here? Mind you, simply changing from XML::Simple to XML::LibXML::Simple had still doubled the performance, but unmarshalling the XML into perl structures had once again made it slow. Much slower. Nearly 2000% slower than simply retrieving the DOM. I began to get sad. Then I began to get a crazy idea… Why does it have to be XML at all? Sure, sometimes we get stuck with XML, but oftentimes I just *choose* it because it is convenient and because I know that XML is a nice text format for me to marshal a data structure for IPC or persistence or god knows what. Was I always doomed to take this kind of hit coming in and out of perl hashes? So, I wrote the following bit of code:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use File::Slurp;
use XML::Simple;
use JSON;
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.xml";

my $start_time = Time::HiRes::time;
my $data = read_file $infile;
print STDERR "Slurp: ".(Time::HiRes::time - $start_time)."\n";

$start_time = Time::HiRes::time;
my $data_in = XMLin($data);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

# Re-marshal the same structure as JSON and write it to STDOUT.
$start_time = Time::HiRes::time;
my $encoded_json = encode_json $data_in;
print STDERR "JSON_Encode: ".(Time::HiRes::time - $start_time)."\n";
print $encoded_json;

This little bit of code simply takes the XML data, turns it into JSON, and writes it out. Mind you, the irony of the fact that I reached for XML::Simple even though I had already learned it was slow is not lost on me. So entrenched was my habit when I wrote this code that I had not yet internalized the fact that there was a faster option. With this code, I took my 116 megabyte XML document and turned it into a 99 megabyte JSON file. The contents are “similar.” I won’t call them identical, but for *my* purposes they are. And then I wrote some code to read this file and unmarshal it into perl data structures:


#!/usr/bin/env perl
use warnings;
use strict;
use FindBin;
use File::Slurp;
use JSON;
use Time::HiRes;

my $infile = "$FindBin::Bin/../data/standard.json";

my $start_time = Time::HiRes::time;
my $data = read_file $infile;
print STDERR "Slurp: ".(Time::HiRes::time - $start_time)."\n";

# Same shape of work as before, just decode_json instead of XMLin.
$start_time = Time::HiRes::time;
my $data_in = decode_json($data);
print STDERR "Parse: ".(Time::HiRes::time - $start_time)."\n";

Such a silly little change, right? From XMLin to decode_json. It was still doing the same basic work: reading a file off of disk into a scalar, and then turning that scalar into a deeply nested perl data structure. The performance surprised me:

{11:44:39} (Eduardos-Mac-Pro) ☺
<~/play/xml_vs_json/scripts> $ perl json.pl 
Slurp: 0.315643787384033
Parse: 1.24264287948608

Preposterous. More than 2.5x faster than the XML::LibXML solution. Ridiculously faster. Annoyingly faster. All those clock cycles I had wasted in my own code, writing out a job file as XML only to turn around in another process and XMLin it. XML had never failed me, and it had always seemed fast enough… but now, dealing with billions of transactions and petabytes of data… now it mattered.
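
If you, like me, have code that XMLouts a job file in one process and XMLins it in another, the swap is almost mechanical. Here’s a minimal sketch; the job structure and file name are hypothetical, purely for illustration:

#!/usr/bin/env perl
use warnings;
use strict;
use File::Slurp;
use JSON;

# A hypothetical job description of the kind you might XMLout to disk.
my $job = {
    id    => 42,
    files => [ 'a.pdf', 'b.pdf' ],
};

# Writer side: marshal the structure as JSON instead of XML.
write_file( 'job.json', encode_json($job) );

# Reader side (normally a different process): unmarshal it back.
my $job_in = decode_json( read_file('job.json') );
print 'Job ', $job_in->{id}, ' has ', scalar @{ $job_in->{files} }, " files\n";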

I know JSON is not XML. I knew JSON was fast and lightweight and easy. I use JSON. I simply didn’t realize that for this use case, a use case I believe is probably quite common, the difference was so drastic. If you’re going to write out a job file, if you just need to marshal state, give JSON a try. In my use case, with zero added lines of code, it was 10,000% faster. That’s a lot fewer clock cycles on the cloud you have to pay for.
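
And if you’d rather measure than take my word for it, core perl ships with the Benchmark module. A sketch, assuming you already have the same document in both formats (and a sanely sized sample; three full XMLin parses of a 116 megabyte file will take a while):

#!/usr/bin/env perl
use warnings;
use strict;
use Benchmark qw(cmpthese);
use File::Slurp;
use XML::LibXML::Simple qw(XMLin);
use JSON;

# Assumes standard.xml and standard.json contain equivalent data.
my $xml  = read_file('data/standard.xml');
my $json = read_file('data/standard.json');

# Run each parser three times and print a relative comparison table.
cmpthese( 3, {
    xml  => sub { XMLin($xml) },
    json => sub { decode_json($json) },
} );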


4 Responses to On re-evaluating old hammers. XML::Simple is crazy slow.

  1. japhattack says:

    Not that I particularly like XML::Simple, but a fair number of people don’t even bother to read the documentation. From XML::Simple::FAQ:

    Why is XML::Simple so slow?
    If you find that XML::Simple is very slow reading XML, the most likely
    reason is that you have XML::SAX installed but no additional SAX parser
    module. The XML::SAX distribution includes an XML parser written
    entirely in Perl. This is very portable but not very fast. For better
    performance install either XML::SAX::Expat or XML::LibXML.

    • earino says:

      Greetings! Thanks very much for your feedback. I do agree that most folks simply don’t read the FAQ, and I probably should have mentioned it. However, in my case, both of those modules were already installed, and the performance we saw was the performance we got. I did not, however, explicitly force a different XML parser using $XML::Simple::PREFERRED_PARSER, though that may very well be an interesting benchmark on its own. Again, thanks!
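
      For anyone who wants to run that experiment, the knob being referred to is a documented XML::Simple package variable. A sketch (picking XML::LibXML::SAX is my choice here, not something benchmarked above):

      use XML::Simple;

      # Tell XML::Simple which SAX parser to use instead of whatever
      # XML::SAX::ParserFactory would pick by default.
      $XML::Simple::PREFERRED_PARSER = 'XML::LibXML::SAX';

      my $data = XMLin('data/standard.xml');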

  2. How long did it take to s/xml/json/ for your test? If I understand correctly, your actual project can just output JSON directly … but for simple xml, maybe we can do a deliciously evil regex to convert to JSON and then parse. XML::Simple::EvilJSONHack here we come…

  3. As the author of XML::Simple, some thoughts …

    1. I’m not even slightly surprised that XML::Simple is slow. Any ‘design’ that went into the module was aimed at optimising for programmer convenience and not speed.

    2. I ran your program on my desktop machine with the linked file and it took 189 seconds when XML::Simple used the PurePerl parser and 34 seconds when it used the XML::SAX::ExpatXS parser. So either your machine is much slower than mine or your default setup is using the PurePerl parser.

    3. The main reason XML::LibXML is so fast is precisely because it doesn’t build Perl data structures – the libxml2 library uses C code to assemble C data structures into a DOM tree. Crossing the boundary from C space to Perl space (and back) is expensive – a SAX parser will cross that boundary several times *per element* whereas XML::LibXML only has to cross that boundary several times *per document* (during the parse phase).

    4. JSON parsers can be faster than XML parsers for one very simple reason – XML is *much* more complex to parse than JSON. I have worked with XML quite a lot and I have never grown to like it. JSON, on the other hand, I liked from day one.

    5. When I have to parse XML (or in fact HTML), I use XML::LibXML. It really is awesome and once you understand XPath, it’s easier than XML::Simple and much more consistent (ref: PerlMonks Node ID 490846). Make the switch – you won’t regret it 🙂

    Thanks for the write-up – anything that drives people to choose a better module (e.g. XML::LibXML) or a better data format (e.g. JSON) has got to be good.

    PS: Apparently your hateful blog software assumes I’m a spammer if I include a link.