Cleaning HTML Snippets in .NET with HtmlAgilityPack
Sometimes you may have to sanitize a piece of HTML before storing or displaying it.
Say you have the following snippet and you want to:
- Remove the itemscope and itemtype attributes from the first div
- Drop the meta tag completely
<div itemscope itemtype="http://schema.org/Product">
<meta itemprop="name" content="something">
<p> Some stuff </p>
</div>
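You might be tempted to reach for Regex.Replace, but that would be a terrible idea: a pattern written for this exact markup breaks as soon as attribute order, quoting, or whitespace changes.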
Instead use HtmlAgilityPack, which provides the ability to load, examine, and modify HTML documents and snippets.
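using System.Linq;
using HtmlAgilityPack;

private string Sanitize(string s)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    // Find the first div whose itemtype marks it as a schema.org Product.
    var div = doc.DocumentNode
        .Descendants("div")
        .FirstOrDefault(n => n.GetAttributeValue("itemtype", null) == "http://schema.org/Product");
    if (div != null)
    {
        // Remove both microdata attributes by name via the Attributes collection.
        div.Attributes.Remove("itemtype");
        div.Attributes.Remove("itemscope");
    }

    // Find the meta tag with itemprop="name" and detach it from the tree.
    var meta = doc.DocumentNode
        .Descendants("meta")
        .FirstOrDefault(n => n.GetAttributeValue("itemprop", null) == "name");
    if (meta != null)
        meta.Remove();

    // Serialize the modified snippet back to a string.
    return doc.DocumentNode.WriteContentTo();
}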
This example demonstrates how to remove attributes and nodes from an HTML snippet.
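If a snippet contains more than one annotated element, the same idea can be applied to every matching node instead of only the first. The following is a minimal sketch, not part of the original example: the class name MicrodataStripper, the method Strip, and the decision to drop every meta tag that carries an itemprop attribute are assumptions made for illustration. It relies on the HtmlAgilityPack NuGet package being referenced.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class MicrodataStripper // hypothetical helper, not from the original post
{
    // Removes the named attributes from every element and drops every
    // <meta> element that has an itemprop attribute (an assumed policy).
    public static string Strip(string html, params string[] attributeNames)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToList() snapshots the matches so that removing nodes
        // does not disturb the lazy Descendants() enumeration.
        var metas = doc.DocumentNode
            .Descendants("meta")
            .Where(n => n.Attributes.Contains("itemprop"))
            .ToList();
        foreach (var meta in metas)
            meta.Remove();

        // Visit element nodes only; text and comment nodes carry no attributes of interest.
        var elements = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .ToList();
        foreach (var element in elements)
        {
            foreach (var name in attributeNames)
                element.Attributes.Remove(name);
        }

        return doc.DocumentNode.WriteContentTo();
    }

    public static void Main()
    {
        const string html =
            "<div itemscope itemtype=\"http://schema.org/Product\">\n" +
            "<meta itemprop=\"name\" content=\"something\">\n" +
            "<p> Some stuff </p>\n" +
            "</div>";

        // Strip the two attributes targeted in the example above.
        Console.WriteLine(Strip(html, "itemscope", "itemtype"));
    }
}
```

Passing the attribute names as parameters keeps the policy in the caller's hands, so the same helper could strip other attributes without changing the traversal code.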