
Solved: Preventing Timeout error using XmlDocument.Load and SgmlReader

 programming  C#

I was working on a project that used the SgmlReader to clean up an invalid web page as I scraped data from it. I was using the System.Net.WebRequest class and related functionality.

However, the page I was requesting declared an XHTML DTD yet contained malformed XHTML, and loading it into an XmlDocument through the SgmlReader caused a timeout error.

Here is my code, which timed out on line 21 (the doc.Load call):

var req = WebRequest.Create("https://SomeWebsiteWithXhtmlDTDAndMalformedXhtml.com");

using (var res = req.GetResponse())  
{
  using (Stream responseStream = res.GetResponseStream())
  {
    using (StreamReader reader = new StreamReader(responseStream))
    {
      using (SgmlReader sgmlReader = new SgmlReader())
      {
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        // create document
        XmlDocument doc = new XmlDocument();
        doc.NameTable.Add("http://www.w3.org/1999/xhtml");
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
      }
    }
  }
}

The document was malformed: many of the tags were missing their closing tags. Specifically, one of the meta tags was missing its closing slash. Rather than this:

<meta name="viewport" content="width=device-width, initial-scale=1.0" />

it used

<meta name="viewport" content="width=device-width, initial-scale=1.0">

Notice that the last slash is missing, making this an invalid XHTML element. Even though SgmlReader is supposed to "fix" these types of problems, it seems that the DTD declaration was getting in the way.

The fix was to stop streaming the response from the WebRequest object directly into the SgmlReader. Instead, I used the System.Net.WebClient class to download the page as a string, edited the string, and read it in from there.

Here is the replacement code.

using (WebClient client = new WebClient())  
{
  string fipsDoc = client.DownloadString("https://SomeWebsiteWithXhtmlDTDAndMalformedXhtml.com");
  fipsDoc = fipsDoc.Substring(fipsDoc.IndexOf('\n') + 1); // drop the first line (the DOCTYPE); works for both \n and \r\n endings

  StringReader reader = new StringReader(fipsDoc);

  using (SgmlReader sgmlReader = new SgmlReader())
  {
    sgmlReader.DocType = "HTML";
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;

    // create document
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = false;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
  }
}

The fix is on line 4: I simply removed the first line of the downloaded document, which is where the DTD was declared. I don't know for sure why it was causing a timeout, but I'm guessing it has something to do with SgmlReader making assumptions about the content.
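
Dropping the first line only works when the DOCTYPE sits alone on line one. As a minimal sketch, assuming the declaration has no internal subset (so no ">" appears before its end), a regular expression can remove the first DOCTYPE wherever it occurs. DoctypeStripper and Strip are hypothetical names, not part of SgmlReader or .NET:

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoctypeStripper
{
    // Matches a <!DOCTYPE ...> declaration plus any whitespace that follows it.
    // Assumption: the declaration contains no '>' before its closing one
    // (i.e. no internal DTD subset).
    private static readonly Regex Doctype = new Regex(
        @"<!DOCTYPE[^>]*>\s*",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Removes the first DOCTYPE declaration, if any; otherwise returns the input unchanged.
    public static string Strip(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        return Doctype.Replace(html, string.Empty, 1);
    }
}
```

In the replacement code, fipsDoc = DoctypeStripper.Strip(fipsDoc); would take the place of the Substring call.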

I'm a bit unhappy that I had to read it in as a string, manipulate it, then pass it through a StringReader, into the SgmlReader, and finally into the document, only then to get processed by my business logic. However, this solution works for my specific use case, so I'm not going to sweat it.
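
If the intermediate string is a concern, one alternative is to consume the first line from the reader before handing it on. The sketch below is untested against SgmlReader and assumes, as in the fix above, that the DOCTYPE occupies exactly the first line of the response. FirstLineSkipper and SkipFirstLine are hypothetical names:

```csharp
using System;
using System.IO;

public static class FirstLineSkipper
{
    // Reads and discards the first line (including its line terminator),
    // leaving the reader positioned at the start of the second line.
    // TextReader.ReadLine treats "\n", "\r", and "\r\n" as line terminators.
    public static TextReader SkipFirstLine(TextReader reader)
    {
        if (reader == null) throw new ArgumentNullException(nameof(reader));
        reader.ReadLine();
        return reader;
    }
}
```

Since the original code assigned a StreamReader to sgmlReader.InputStream, the same StreamReader could in principle be passed through SkipFirstLine first, which would avoid the string copy. Whether that also avoids the timeout has not been verified here.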