Merge Files – The Daily WTF

XML is arguably an overspecified language. Every aspect of XML has a standard for interacting with it, transforming it, or manipulating it, and that standard is itself defined in XML. Every XML-related specification fits together into a soup that does all things and solves every problem you could possibly have.

Owe, though, had a problem that didn’t quite map to any XML specification(s). Specifically, Owe needed to parse absolutely broken XML files.

bool Sorter::Work()
{
	if(this->__disposed)
		throw gcnew ObjectDisposedException("Object has been disposed");
	
	if(this->_splitFiles)
	{
		List<Document^>^ docs = gcnew List<Document^>();
		for each(FileInfo ^file in this->_sourceDir->GetFiles("*.xml"))
		{
		FLD].*$");
			Regex ^isMainLevelRec = gcnew Regex("^\\s+\\<REC NAME=\\\""+this->_mainLevel+"\\\".*$");
			while(!reader->EndOfStream)
			{
				String ^line = reader->ReadLine();
				if(!isRecOrFld->IsMatch(line) && !isEndOfRecOrFld->IsMatch(line))
					continue;
				if(isMainLevelRec->IsMatch(line) && !String::IsNullOrEmpty(sb->ToString()) && !first)
				{
					sb->AppendLine("</FILE>");
					XElement^ xml = XElement::Parse(sb->ToString());
					String ^key = String::Empty;
					for each(XElement ^rec in xml->Elements("REC"))
					{
						key = this->findKey(rec);
						if(!String::IsNullOrEmpty(key))
							break;
					}
					docs->Add(gcnew Document(key, gcnew XElement("container", xml)));
					sb = gcnew StringBuilder("<FILE NAME=\"blah\">");
					first = true;
					added = true;
				}
				sb->AppendLine(line);
				if(first && !added)
					first = false;
				if(added)
					added = false;
			}
			delete reader;
			file->Delete();
		}
		int i = 1;
		for each(Document ^doc in docs)
		{
			XElement ^splitted = doc->GetData()->Element("FILE");
			splitted->Save(Path::Combine(this->_sourceDir->FullName, this->_docPrefix + "_" + i++ + ".xml"));
			delete splitted;
		}
		delete docs;
	}
	List<Document^>^ docs = gcnew List<Document^>();
	for each(FileInfo ^file in this->_sourceDir->GetFiles(String::Format("{0}*.xml", this->_docPrefix)))
	{
		XElement ^xml = XElement::Load(file->FullName);
		String ^key = findKey(xml->Element("REC")); // will always be first element in document order
		Document ^doc = gcnew Document(key, gcnew XElement("data", xml));
		docs->Add(doc);
		file->Delete();
	}
	List<Document^>^ sorted = MergeSorter::MergeSort(docs);
	XElement ^sortedMergedXml = gcnew XElement("FILE", gcnew XAttribute("NAME", "MergedStuff"));
	for each(Document ^doc in sorted)
	{
		sortedMergedXml->Add(doc->GetData()->Element("FILE")->Elements("REC"));
	}
	sortedMergedXml->Save(Path::Combine(this->_sourceDir->FullName, String::Format("{0}_mergedAndSorted.xml", this->_docPrefix)));
	// returning a sane value
	return true;
}

This is in the .NET dialect of C++, so the weird ^ sigil is a handle to a garbage-collected object.

There’s a lot going on here. The purpose of this function is to optionally split some previously merged XML files into separate XML files, then take a set of XML files and merge them back together (correctly sorted).

So we start by checking that this object hasn’t been disposed, and throwing an exception if it has. Then we try to split.

To do that, we search the directory for “*.xml” files and then… load each file and immediately save it back? The belief behind this code seems to be that it fixes up the whitespace, because we need some whitespace later. But the .NET XML writer doesn’t add whitespace, it just preserves it, so I suspect this step isn’t needed, or at least shouldn’t be. I can imagine a world where this somehow makes the code work, for reasons it’s best not to think about.

Owe writes to the previous developers: “Thanks guys, I really appreciate this!”

Now, since we’re iterating through the entire directory of XML files, some files were previously merged (and need to be unmerged) and others were never merged at all. How do we tell them apart? We find each element named “REC” and check whether its “NAME” attribute equals our _mainLevel value. If there are at least two such elements, we know that this file was previously merged and therefore needs to be unmerged.

Owe writes: “Thanks guys, I really appreciate this!”
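The detection rule described above is simple enough to sketch on its own. The following is a minimal illustration in standard C++ rather than C++/CLI; the Rec struct and the looksMerged name are hypothetical stand-ins, and it assumes the REC elements have already been read out of the file somehow.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed <REC> element: only its NAME attribute.
struct Rec {
    std::string name;
};

// A file is treated as "previously merged" when at least two of its REC
// elements carry the main-level name.
bool looksMerged(const std::vector<Rec>& recs, const std::string& mainLevel) {
    const auto matches = std::count_if(
        recs.begin(), recs.end(),
        [&](const Rec& rec) { return rec.name == mainLevel; });
    return matches >= 2;
}
```

A file with a single main-level record is left alone; only files with two or more go through the split.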

And then we get into the dreaded regular expression parsing of XML. This is done because the XML files are not actually valid XML. So it’s a mix of string operations and regular expression matching to try to interpret the data. And remember the whitespace we thought we needed when we were re-saving the documents? Here’s why: our regular expressions match on leading whitespace.

Owe writes: “Thanks guys, I really appreciate this!”
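To see why the leading whitespace matters, here is a rough sketch in standard C++ (std::regex, not .NET’s Regex). The pattern mirrors isMainLevelRec from the listing; mainLevelPattern and splitAtMainLevel are hypothetical names, the splitting is a simplification of the original’s flag juggling, and it assumes the main-level name contains no regex metacharacters.

```cpp
#include <regex>
#include <string>
#include <vector>

// One or more leading whitespace characters, then <REC NAME="mainLevel".
// Assumption: mainLevel contains no regex metacharacters.
std::regex mainLevelPattern(const std::string& mainLevel) {
    return std::regex("^\\s+<REC NAME=\"" + mainLevel + "\".*$");
}

// Groups lines into chunks, starting a new chunk at every line that the
// pattern recognizes as a main-level record.
std::vector<std::vector<std::string>> splitAtMainLevel(
        const std::vector<std::string>& lines, const std::string& mainLevel) {
    const std::regex startsRecord = mainLevelPattern(mainLevel);
    std::vector<std::vector<std::string>> chunks;
    for (const std::string& line : lines) {
        if (chunks.empty() || std::regex_match(line, startsRecord)) {
            chunks.emplace_back();
        }
        chunks.back().push_back(line);
    }
    return chunks;
}
```

Because the pattern demands at least one whitespace character before the tag, an unindented file never triggers a split and everything lands in a single chunk.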

Once we’ve constructed all the documents in memory, we can dump them into a new set of files. And then, once that’s done, we can open those files again, because now it’s time for the merge. Here we find the “REC” elements and build new XML documents based on them. Then the MergeSorter::MergeSort function actually does the merge sort – and frankly, I dread to think what that looks like.

The merge sorter sorts the documents, but we actually want to output a single document with the elements in that sorted order, so we create a final XML document, iterate through all of our sorted document fragments, and then insert their “REC” elements into the output.

Owe writes: “Thanks guys, I really appreciate this!”
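The final step boils down to “sort by key, then concatenate”. A minimal sketch in standard C++ follows; the Document struct here is a hypothetical simplification that holds records as strings, std::stable_sort stands in for MergeSorter::MergeSort (whose implementation isn’t shown), and comparing keys as plain strings is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical simplification of the listing's Document: a sort key plus
// the REC elements of one file, kept as raw strings.
struct Document {
    std::string key;
    std::vector<std::string> recs;
};

// Sorts the documents by key, then appends each document's records to a
// single output sequence in that order.
std::vector<std::string> mergeSorted(std::vector<Document> docs) {
    std::stable_sort(docs.begin(), docs.end(),
        [](const Document& a, const Document& b) { return a.key < b.key; });
    std::vector<std::string> merged;
    for (const Document& doc : docs) {
        merged.insert(merged.end(), doc.recs.begin(), doc.recs.end());
    }
    return merged;
}
```

Records keep their original order within each document; only the documents themselves are reordered.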

Even though the code and the whole process here are terrible, the core of the WTF is “we need to store our XML with elements arranged in a certain order”. XML is not for that. But apparently, they don’t know what XML is for, since they do things in their documents that an XML parser can’t successfully parse. Or, perhaps more accurately, they couldn’t figure out how to parse them as XML, hence the regular expressions and string manipulation.

If the documents were sane, the whole thing could probably be solved with some fairly simple (by XML standards) XQuery/XSLT operations. Instead, we have this. Thanks guys, I really appreciate this.