(First posted on the LibExpat Documentation Site)
I'm in the intriguing position of being employed to work on a free software project by the Linux Foundation. I'm expanding the test coverage of the Expat library under the eagle eye of maintainer Sebastian Pipping, which is an excellent way of discovering how the brilliant, twisted and almost entirely undocumented internals work. (There will be articles. Many articles.)
One thing you expect to come across when writing tests is the occasional bug. I came across a serious one in the parsing of external parameter entities. CVE-2017-9233 (to give it its formal identification) says that bad XML in an external entity will cause the parser to go into an infinite loop and never return control to the application.
For those of you who haven't been saturated in XML terminology for the last however long, an example is in order. Suppose you have the following trivial piece of XML:
<!DOCTYPE doc SYSTEM "http://example.com/level1.dtd">
<doc/>
And the DTD level1.dtd that it reads:
<!ELEMENT doc EMPTY>
<!ENTITY % e SYSTEM "http://example.com/level2.ent">
%e;
And the external entity definition in yet another resource, level2.ent:
<!ELEMENT el EMPTY>
<el/>
Now a quick riffle through the XML standards will tell you that ordinary elements such as <el> aren't allowed in DTDs. All you can legally use are the DTD declarations <!ELEMENT>, <!ATTLIST>, <!ENTITY> and <!NOTATION>. When our entity %e in level1.dtd gets substituted in, it will put the <el/> element straight into the DTD, leaving us with a malformed DTD. The parser should detect this and reject the whole shebang.
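If you want to watch the rejection happen, here is a minimal sketch using Python's standard xml.parsers.expat wrapper. It assumes your Python is linked against a patched Expat (on an unpatched one the call might never return), and it serves the three resources from a hypothetical in-memory dictionary called RESOURCES instead of downloading anything. The helper names serve_from_memory and parse_example are mine, not part of any library.

```python
from xml.parsers import expat

# Hypothetical in-memory stand-ins for the resources in the example above.
RESOURCES = {
    "http://example.com/level1.dtd":
        b'<!ELEMENT doc EMPTY>\n'
        b'<!ENTITY % e SYSTEM "http://example.com/level2.ent">\n'
        b'%e;\n',
    "http://example.com/level2.ent":
        b'<!ELEMENT el EMPTY>\n'
        b'<el/>\n',
}
DOCUMENT = b'<!DOCTYPE doc SYSTEM "http://example.com/level1.dtd">\n<doc/>\n'


def serve_from_memory(parser):
    """Make parser, and any child parsers, read external entities from RESOURCES."""
    def handler(context, base, system_id, public_id):
        child = parser.ExternalEntityParserCreate(context)
        serve_from_memory(child)  # nested entities need a handler too
        child.Parse(RESOURCES[system_id], True)
        return 1  # non-zero tells Expat the entity was handled
    parser.ExternalEntityRefHandler = handler


def parse_example():
    """Return None if the document parses, or the error text if it is rejected."""
    parser = expat.ParserCreate()
    # Ask Expat to read the external DTD and external parameter entities.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    serve_from_memory(parser)
    try:
        parser.Parse(DOCUMENT, True)
    except expat.ExpatError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(parse_example())
```

On a patched library I would expect parse_example() to come back with an error message rather than None, because the error raised while parsing level2.ent propagates out through both handlers.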
What actually happened is hard to follow in the source code. (Many articles.) When parsing reached the 'e' of <el/>, the tokenizer recognised it as a valid start of an ordinary element (which is true) and returned the appropriate value, XML_TOK_INSTANCE_START.
The calling code knew how to recognise the limited number of tokens that are legal and most of the tokens that aren't legal in a DTD, but unfortunately it missed this specific case. Without any better instructions from its decision logic, the code assumed that it had found something valid and tried tokenising again, in case there was more text to be parsed.
It did this expecting that its internal pointers were updated to point to that unparsed text. That would have been true if it had dealt with something legal, but since "<e" isn't a legal part of a DTD the pointers had been left alone. The tokenizer therefore saw "<e" again, returned XML_TOK_INSTANCE_START again, and the whole thing repeated until the application was killed.
In brief:
- The lowest levels of the parser recognise "<e" as starting an element.
- The next level up correctly doesn't recognise it as a valid case.
- ...but also fails to recognise it as an invalid case.
- The parser assumes it was successful and tries again on the same string.
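The steps above can be sketched as a toy model. This is emphatically not Expat's code: the tokenizer, the token names other than XML_TOK_INSTANCE_START, the process function and the max_steps guard are all invented for illustration. The point is only to show how a dispatcher that neither accepts nor rejects a token, and so never advances its position, ends up seeing the same token forever.

```python
TOK_NONE = "none"
TOK_DECL = "declaration"
TOK_INSTANCE_START = "XML_TOK_INSTANCE_START"


def tokenize(text, pos):
    """Return (token, next_pos) for the text starting at pos."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos == len(text):
        return TOK_NONE, pos
    if text.startswith("<!", pos):
        end = text.find(">", pos)
        return TOK_DECL, (end + 1 if end != -1 else len(text))
    # In this toy, anything else is treated as the start of an element.
    return TOK_INSTANCE_START, pos + 2


def process(text, reject_instance_start, max_steps=1000):
    """Consume declarations; report how the loop ends."""
    pos = 0
    for _ in range(max_steps):
        token, next_pos = tokenize(text, pos)
        if token == TOK_NONE:
            return "ok"
        if token == TOK_DECL:
            pos = next_pos  # legal token: move past it
            continue
        if reject_instance_start and token == TOK_INSTANCE_START:
            return "syntax error"  # the missing case, now handled
        # Unrecognised token: nothing advances pos, so the next
        # iteration tokenizes exactly the same text again.
    return "stuck"  # the guard stands in for "until the application was killed"


if __name__ == "__main__":
    entity = "<!ELEMENT el EMPTY> <el/>"
    print(process(entity, reject_instance_start=False))
    print(process(entity, reject_instance_start=True))
```

Without the guard, the first call would spin forever, which is the behaviour the CVE describes. With the extra case in place the same input is rejected on the first pass.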
Why Do I Care?
"How does this affect me?" you may well ask. "I don't use Expat. I don't even write in C. How can I possibly be affected?" The answer is, you may well be using Expat unknowingly. There are wrapper libraries for Expat in Python, Java and many other languages. Many other applications and libraries (particularly C and C++ libraries like Poco, libDOM or libwww) use Expat under the hood.
The libraries often enable external entity parsing, which does potentially allow this bug to bite your application. Some of them will, or used to, download the URIs for you, which is a serious problem. Some of the applications are just parsing local configuration files, but even those might follow a DTD if one was inserted into the configuration file somehow. Other applications explicitly use external DTDs, leaving you vulnerable if those sites are compromised or malicious.
In short, it's entirely possible for you to be using Expat and not know it.
What Should I Do?
First and most obviously, upgrade. The current version of libexpat has patched this bug, and a few other things; read the change log for details.
The other thing you should always do is consider how you use your XML parser. I am not a security expert by any stretch of the imagination, but even I know not to download arbitrary URIs, for example. Reading "https://www.w3c.org/xml/fluffy.ent" is probably safe; reading "http://evil.haxx0rs.org/xml/i-pwn-u.dtd" probably isn't. You should check that any library you use doesn't automatically download arbitrary URIs for you, and you should be careful about what URIs you do allow to be downloaded.
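As a rough illustration of being careful about which URIs get followed, here is a sketch of an allowlist check wired into Python's xml.parsers.expat wrapper. The set TRUSTED_HOSTS, the helpers is_trusted_uri, guard and parse_guarded, and the fetch callable you pass in are all hypothetical names of mine; which hosts you trust, and how you actually retrieve them, is your decision. Treat it as a starting point rather than a vetted security control.

```python
from urllib.parse import urlsplit
from xml.parsers import expat

# Hypothetical allowlist: hosts you have decided to trust.
TRUSTED_HOSTS = {"www.w3c.org"}


def is_trusted_uri(system_id):
    """Accept only https URIs whose host is on the allowlist."""
    parts = urlsplit(system_id or "")
    return parts.scheme == "https" and parts.hostname in TRUSTED_HOSTS


def guard(parser, fetch):
    """Install a handler that refuses any external entity not on the allowlist."""
    def handler(context, base, system_id, public_id):
        if not is_trusted_uri(system_id):
            return 0  # zero makes Expat stop with an external-entity error
        child = parser.ExternalEntityParserCreate(context)
        guard(child, fetch)  # nested references get the same check
        child.Parse(fetch(system_id), True)
        return 1
    parser.ExternalEntityRefHandler = handler


def parse_guarded(document, fetch):
    """Return True if document parses, False if Expat reports an error."""
    parser = expat.ParserCreate()
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    guard(parser, fetch)
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False
    return True
```

Here fetch is any function taking a URI and returning bytes. Because the check runs before fetch is called, an untrusted URI is never downloaded at all; the parse simply fails.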