Cleaning HTML Snippets in .NET with HtmlAgilityPack

Sometimes you may have to sanitize a piece of HTML before storing or displaying it.

Say you have the snippet below and you want to:

  • Remove the itemscope and itemtype attributes from the first div
  • Drop the meta tag completely

<div itemscope itemtype="http://schema.org/Product">
  <meta itemprop="name" content="something">
  <p> Some stuff </p>
</div>

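You might be tempted to reach for Regex.Replace, but that is a bad idea: regular expressions cannot reliably handle nested tags, quoted attribute values, or attributes that appear in a different order.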
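
As a small illustration of the problem, here is a sketch of a naive pattern-based attempt. The class and method names (RegexPitfall, StripMetaNaively) and the input string are made up for this example; they are not part of the original snippet. A ">" inside a quoted attribute value is legal HTML, yet it ends the match early and leaves debris behind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPitfall
{
    // Hypothetical naive attempt: delete any <meta ...> tag with a pattern.
    public static string StripMetaNaively(string html)
    {
        return Regex.Replace(html, "<meta[^>]*>", "");
    }

    public static void Main()
    {
        // The '>' inside the content attribute stops [^>]* too soon.
        string input = "<meta itemprop=\"name\" content=\"a > b\">";

        // The match covers only: <meta itemprop="name" content="a >
        // so the tail of the tag survives in the output.
        Console.WriteLine(StripMetaNaively(input)); // prints:  b">
    }
}
```

A parser-based approach does not have this problem, because it works on the parsed node tree rather than on raw characters.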

Instead, use HtmlAgilityPack (available as a NuGet package), which provides the ability to load, examine, and modify HTML documents and snippets.

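using System.Linq;
using HtmlAgilityPack;

// Returns the sanitized HTML as a string.
private string Sanitize(string s)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    // Find the first div that carries the schema.org Product itemtype.
    var div = doc.DocumentNode
        .Descendants("div")
        .FirstOrDefault(n => n.GetAttributeValue("itemtype", null) == "http://schema.org/Product");

    if (div != null)
    {
        // Remove both microdata attributes by name.
        div.Attributes.Remove("itemtype");
        div.Attributes.Remove("itemscope");
    }

    // Find the first meta tag whose itemprop is "name".
    var meta = doc.DocumentNode
        .Descendants("meta")
        .FirstOrDefault(n => n.GetAttributeValue("itemprop", null) == "name");

    if (meta != null)
        meta.Remove();

    // Serialize the modified document back to a string.
    return doc.DocumentNode.WriteContentTo();
}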

This example demonstrates how to remove attributes and nodes from an HTML snippet.
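
If the input may contain more than one matching element, the same idea can be extended to every match instead of only the first. The following is a minimal sketch under that assumption; MicrodataStripper and StripAll are hypothetical names, and the choice to drop every meta tag that has an itemprop attribute is an assumption made for illustration, not something the snippet above requires.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class MicrodataStripper
{
    // Hypothetical helper: removes itemscope/itemtype from every element
    // and drops every <meta> element that has an itemprop attribute.
    public static string StripAll(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() takes a snapshot first, so nodes can be removed
        // without modifying the tree while it is being enumerated.
        var metas = doc.DocumentNode
            .Descendants("meta")
            .Where(n => n.Attributes.Contains("itemprop"))
            .ToList();

        foreach (var meta in metas)
            meta.Remove();

        // Only element nodes are touched; text and comment nodes are skipped.
        var elements = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();

        foreach (var element in elements)
        {
            element.Attributes.Remove("itemscope");
            element.Attributes.Remove("itemtype");
        }

        return doc.DocumentNode.WriteContentTo();
    }
}
```

Calling MicrodataStripper.StripAll on the snippet from the beginning of this post should leave the div and the p in place, with the microdata attributes and the meta tag gone.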