XML indentation in .NET

Author Tim | 22.02.2010 | Category Uncategorized

Upon integration of ) to manage WeSay’s XML-encoded .lift file, the exact format of that file has taken on a new importance. Because Mercurial (and Chorus at this point) uses a standard line-diffing tool to express the difference between two revisions, line breaks, indentation and other whitespace have suddenly become an issue where they normally are not in XML documents. As it turns out, formatting XML in .NET is not entirely trivial. Though XmlWriter and XmlReader, as well as their respective XmlWriterSettings and XmlReaderSettings, have various switches for enabling and disabling indentation and line breaks on attributes, these are bound together by subtle interactions which I hope to shed some light on in this post.

First some background:
The indentation and whitespace in a given XML file can be of interest for at least two reasons:
- Line differs typically care about whitespace, and the name itself bears witness to the importance of newlines.
- Readability. It’s much easier for a human to read a nicely formatted XML file.

In WeSay the .lift file is frequently created from two separate files: first, a valid .lift file, and second, a .lift fragment file. Each time an entry is added or modified, these two files are merged to form the new .lift file. For this reason we are interested in the interaction between an XmlReader and the XmlWriter that outputs said reader’s data.
 
As an example we will use some very simple XML rather than an actual lift file, as that construct is unnecessarily complex for this discussion.
Here are the two source files we will be working with:

File 1:
<one
    at1="at1"
    at2="at2">
    <two>
      <three
        at1="3at1"
        at2="3at2" />
    </two>
</one>

File 2:
<four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
</four>
 
 Here is our envisioned result:
 
Resulting File:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two>
      <three
        at1="3at1"
        at2="3at2" />
    </two>
  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

All of these files have been written with indentation and new lines for each element and attribute. This makes for OK readability and should keep our diff files nice and small.

The first thing we are going to do is to see what happens when we start with completely unformatted input files, default readers and a default writer:

File 1:
<one at1="at1" at2="at2"><two><three at1="3at1" at2="3at2" /></two></one>

File 2:
<four at1="at1" at2="at2"><five><six at1="3at1" at2="3at2" /></five></four>

And the code goes something like this:
XmlReaderSettings readerSettings = new XmlReaderSettings
{
    // The source files are fragments, not complete documents.
    ConformanceLevel = ConformanceLevel.Fragment
};

XmlWriterSettings writerSettings = new XmlWriterSettings
{
    ConformanceLevel = ConformanceLevel.Document
};

XmlReader reader0 = XmlReader.Create(stream0, readerSettings);
XmlReader reader1 = XmlReader.Create(stream1, readerSettings);
XmlWriter writer = XmlWriter.Create(stream2, writerSettings);

// Both fragments go under a single <document> root element.
writer.WriteStartElement("document");
while (!reader0.EOF)
{
    writer.WriteNode(reader0, true);
}
while (!reader1.EOF)
{
    writer.WriteNode(reader1, true);
}
writer.WriteEndElement();
writer.Close();

With these settings the resulting file looks like this:

<?xml version="1.0" encoding="utf-8"?><document><one at1="at1" at2="at2"><two><three at1="3at1" at2="3at2" /></two></one><four at1="at1" at2="at2"><five><six at1="3at1" at2="3at2" /></five></four></document>

Just one long line… pretty much the worst case possible for a line-diffing tool and for reading. So let’s spruce it up and add some formatting to the result file by changing the XmlWriterSettings a bit:

XmlWriterSettings writerSettings = new XmlWriterSettings
{
    Indent = true,
    NewLineOnAttributes = true,
    ConformanceLevel = ConformanceLevel.Document
};

Resulting file:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two>
      <three
        at1="3at1"
        at2="3at2" />
    </two>
  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

*sigh*… beautiful.
But being geeks we just can’t help but fix something that ain’t broke. So in spite of this beautiful result we now want to try and fix up the two source files. This isn’t entirely unreasonable considering you may want to look at the source files while debugging, and it would be nice if they were a bit more legible. So just for kicks, let’s see what happens when we put a single line break in a source file, say after the <two> element.

File 1:
<one at1="at1" at2="at2"><two>
<three
at1="3at1" at2="3at2"></three></two></one>

Resulting File:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two>
<three
        at1="3at1"
        at2="3at2" /></two>

  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

?!!?
What happened?! Not only do we have a line break after the <two> element, but the <three> element and the closing </two> element are not indented either!
This brings us to our first interesting observation: whitespace in a source document causes the writer to ignore its Indent property until the element containing the whitespace (in our case <two>) is closed. This is true of other whitespace, such as spaces, as well. Here is the resulting file if I substitute the newline of our last example with a simple space:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two> <three
        at1="3at1"
        at2="3at2"></three></two>

  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

Interestingly, you’ll notice that the NewLineOnAttributes property of the XmlWriterSettings is NOT ignored. This is even more interesting when you consider that this property is ignored UNLESS the Indent property is true. Here it is straight from the horse’s mouth (i.e. the XmlWriterSettings.NewLineOnAttributes documentation): “This setting has no effect when the Indent property value is false.”
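That dependency on Indent is easy to check in isolation. Here is a minimal sketch that writes the same element twice, once with Indent off and once with it on. The class and method names (NewLineOnAttributesDemo, Write) are made up for this illustration, and OmitXmlDeclaration is only set to keep the output short:

```csharp
using System;
using System.Text;
using System.Xml;

public static class NewLineOnAttributesDemo
{
    // Writes <one at1="at1" at2="at2" /> with NewLineOnAttributes always on
    // and Indent switched by the caller. (Hypothetical helper for illustration.)
    public static string Write(bool indent)
    {
        var settings = new XmlWriterSettings
        {
            Indent = indent,
            NewLineOnAttributes = true, // only has an effect when Indent is true
            OmitXmlDeclaration = true   // keeps the demo output short
        };

        var output = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("one");
            writer.WriteAttributeString("at1", "at1");
            writer.WriteAttributeString("at2", "at2");
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Write(false)); // everything stays on one line
        Console.WriteLine(Write(true));  // each attribute starts a new line
    }
}
```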

Ok.. so we’ve established that whitespace is an issue. The easiest way to get around this is to instruct the reader to ignore whitespace so that the writer doesn’t get too clever on us:

XmlReaderSettings readerSettings = new XmlReaderSettings
{
    IgnoreWhitespace = true,
    ConformanceLevel = ConformanceLevel.Fragment
};
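Putting the pieces together, here is a minimal, self-contained sketch of the merge with whitespace ignored on the way in and indentation applied on the way out. The FragmentMerger class, its Merge method and the inline sample strings are stand-ins I made up for illustration (the real code reads from streams), so treat this as a sketch under those assumptions rather than WeSay’s actual implementation:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class FragmentMerger
{
    // Copies every fragment under a single <document> root element.
    // (Hypothetical helper; the names are not taken from WeSay.)
    public static string Merge(params string[] fragments)
    {
        var readerSettings = new XmlReaderSettings
        {
            IgnoreWhitespace = true, // drop formatting whitespace from the sources
            ConformanceLevel = ConformanceLevel.Fragment
        };
        var writerSettings = new XmlWriterSettings
        {
            Indent = true,
            NewLineOnAttributes = true,
            ConformanceLevel = ConformanceLevel.Document
        };

        var output = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(output, writerSettings))
        {
            writer.WriteStartElement("document");
            foreach (string fragment in fragments)
            {
                using (XmlReader reader = XmlReader.Create(new StringReader(fragment), readerSettings))
                {
                    // A reader in its initial state is copied in full by WriteNode.
                    while (!reader.EOF)
                    {
                        writer.WriteNode(reader, true);
                    }
                }
            }
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        // File 1 deliberately contains a line break after <two>.
        string file1 = "<one at1=\"at1\" at2=\"at2\"><two>\n<three at1=\"3at1\" at2=\"3at2\" /></two></one>";
        string file2 = "<four at1=\"at1\" at2=\"at2\"><five><six at1=\"3at1\" at2=\"3at2\" /></five></four>";
        Console.WriteLine(Merge(file1, file2));
    }
}
```

Because the output goes to a StringBuilder here, the XML declaration will report utf-16 rather than utf-8; writing to a stream as in the outline above avoids that.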

So now we are back on track and looking good! To celebrate, let’s tell the
world how happy we are! Let’s write a string into our first file that
will proclaim our joy! Of course we will do this without spaces.. just
in case.

File 1:
<one at1="at1" at2="at2"><two>I’mSoHappy<three at1="3at1" at2="3at2"/></two></one>

Resulting file:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two>I’mSoHappy<three
        at1="3at1"
        at2="3at2" /></two>

  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

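Arrrgh!
It did it again!!! So here is observation number two: text content in an element has the same effect as whitespace. Once an element contains text, the writer stops indenting until that element is closed.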
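The text case can be reproduced without any input files at all, which makes it easier to poke at. The sketch below drives an XmlWriter by hand and optionally writes a string into <two> before its child element. MixedContentDemo and Write are names invented for this example, and the “text” payload is arbitrary:

```csharp
using System;
using System.Text;
using System.Xml;

public static class MixedContentDemo
{
    // Writes <two> containing <three />, optionally preceded by some text.
    // (Hypothetical helper for illustration.)
    public static string Write(bool withText)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            OmitXmlDeclaration = true // keeps the demo output short
        };

        var output = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("two");
            if (withText)
            {
                // Text turns <two> into mixed content; indentation stops here
                // and only resumes once <two> has been closed.
                writer.WriteString("text");
            }
            writer.WriteStartElement("three");
            writer.WriteEndElement();
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Write(false)); // <three /> on its own indented line
        Console.WriteLine(Write(true));  // everything runs together on one line
    }
}
```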

Finally, WeSay uses an XPathNavigator in some places, and in the course of my testing I noticed that XmlWriter.WriteNode() behaves slightly differently when it is passed an XPathNavigator rather than an XmlReader. Specifically, it seems to always ignore whitespace. So passing an XmlReader (with IgnoreWhitespace = false) to WriteNode for the first file and an XPathNavigator (created from an XmlDocument) for the second file, where the files look like this:

File 1:
<one at1="at1" at2="at2"><two>
<three at1="3at1" at2="3at2"/></two></one>

File 2:
<four at1="at1" at2="at2"><five>
<six
at1="3at1" at2="3at2" /></five></four>

Results in a result file looking like this:
<?xml version="1.0" encoding="utf-8"?>
<document>
  <one
    at1="at1"
    at2="at2">
    <two>
<three
        at1="3at1"
        at2="3at2" /></two>
  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six

        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

Here’s an outline of the code:

XmlReaderSettings readerSettings = new XmlReaderSettings
{
    ConformanceLevel = ConformanceLevel.Fragment
};

XmlWriterSettings writerSettings = new XmlWriterSettings
{
    Indent = true,
    NewLineOnAttributes = true,
    ConformanceLevel = ConformanceLevel.Document
};

XmlReader reader = XmlReader.Create(stream0, readerSettings);
XmlReader reader2 = XmlReader.Create(stream1, readerSettings);
XmlDocument document = new XmlDocument();
document.Load(reader2);
XmlWriter writer = XmlWriter.Create(stream2, writerSettings);

writer.WriteStartElement("document");
// First file: copied straight from the reader, whitespace included.
while (!reader.EOF)
{
    writer.WriteNode(reader, true);
}
// Second file: copied through an XPathNavigator.
writer.WriteNode(document.CreateNavigator(), true);
writer.WriteEndElement();
writer.Close();

Note that this is the case even when you create the XmlDocument from an XmlReader with IgnoreWhitespace = false.
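One possible explanation, which I have not verified against the WeSay code, is that the whitespace never reaches the navigator in the first place: XmlDocument.PreserveWhitespace defaults to false, so insignificant whitespace is dropped while the document is loaded. The sketch below checks only that part. PreserveWhitespaceDemo and FirstChildType are names made up for this illustration:

```csharp
using System;
using System.Xml;

public static class PreserveWhitespaceDemo
{
    // Loads the XML and reports the node type of the root element's first child.
    // (Hypothetical helper for illustration.)
    public static XmlNodeType FirstChildType(string xml, bool preserveWhitespace)
    {
        // PreserveWhitespace has to be set before the document is loaded.
        var document = new XmlDocument { PreserveWhitespace = preserveWhitespace };
        document.LoadXml(xml);
        return document.DocumentElement.FirstChild.NodeType;
    }

    public static void Main()
    {
        // Same shape as File 2 above: a line break right after <five>.
        string xml = "<five>\n<six at1=\"3at1\" at2=\"3at2\" /></five>";

        Console.WriteLine(FirstChildType(xml, false)); // the <six> element
        Console.WriteLine(FirstChildType(xml, true));  // the whitespace node
    }
}
```

If that is indeed the cause, setting PreserveWhitespace = true before loading should make the navigator path carry the whitespace through as well, but I have not tested how the writer then formats it.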

So that about wraps it up. This was not meant to be an exhaustive study of all the XmlReader/XmlWriter/XmlDocument/XmlWriterSettings/XmlReaderSettings/XPathNavigator interactions, so if you find anything else unusual, or find that I grossly misrepresented something, please feel free to let me know!

Addendum:
After testing Mono’s response to all this at Cambell’s request, I discovered an idiosyncrasy in .NET. It seems that attributes with a preceding namespace prefix (such as xmlns:at1 below) are not given their own line, so:

File 1:
<one xmlns:at1="at1" at2="at2"><two><three at1="3at1" at2="3at2"/></two></one>

Results in:

<?xml version="1.0" encoding="utf-8"?>
<document>
  <one xmlns:at1="at1"
    at2="at2">

    <two>
      <three
        at1="3at1"
        at2="3at2" />
    </two>
  </one>
  <four
    at1="at1"
    at2="at2">
    <five>
      <six
        at1="3at1"
        at2="3at2" />
    </five>
  </four>
</document>

Unfortunately (from a cross-platform consistency standpoint), Mono does not exhibit this behavior and (correctly?) inserts a newline before that attribute.