Cleaning HTML Snippets in .NET with HtmlAgilityPack

Sometimes you may have to sanitize a piece of HTML before storing or displaying it.

Say you have the snippet below and you want to:

  • Remove the itemscope and itemtype attributes from the first div
  • Drop the meta tag completely

<div itemscope itemtype="http://schema.org/Product">
  <meta itemprop="name" content="something">
  <p> Some stuff </p>
</div>

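You might be tempted to reach for Regex.Replace, but that would be a terrible idea: HTML allows attributes in any order and with either quote style, so a pattern that works on one snippet silently misses the next.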
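To make the regex problem concrete, here is a minimal sketch. The pattern and the NaiveRegexDemo class are hypothetical, written only for illustration; they are not from the original post. The pattern handles the exact snippet above but misses the same meta tag as soon as the attributes are reordered or single-quoted.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveRegexDemo
{
    // Hypothetical naive pattern: assumes itemprop comes first and is double-quoted.
    public static readonly Regex NaiveMeta =
        new Regex("<meta itemprop=\"name\"[^>]*>");

    // Removes every match of the naive pattern from the input.
    public static string StripMeta(string html)
    {
        return NaiveMeta.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // The exact tag from the snippet: the pattern matches and the tag is removed.
        Console.WriteLine(StripMeta("<meta itemprop=\"name\" content=\"something\">").Length); // 0

        // Same tag, attributes reordered: no match, so the tag would survive.
        Console.WriteLine(NaiveMeta.IsMatch("<meta content=\"something\" itemprop=\"name\">")); // False

        // Same tag, single quotes: again no match.
        Console.WriteLine(NaiveMeta.IsMatch("<meta itemprop='name' content='something'>")); // False
    }
}
```

You could keep patching the pattern for each variation, but a parser already handles all of them.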

Instead use HtmlAgilityPack, which provides the ability to load, examine, and modify HTML documents and snippets.

// Requires: using System.Linq; using HtmlAgilityPack;
private string Sanitize(string s)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(s);

    var div = doc.DocumentNode
        .Descendants("div")
        .FirstOrDefault(n => n.GetAttributeValue("itemtype", null) == "http://schema.org/Product");

    if (div != null)
    {
        // Remove the attributes by name via the node's attribute collection.
        div.Attributes.Remove("itemtype");
        div.Attributes.Remove("itemscope");
    }

    var meta = doc.DocumentNode
        .Descendants("meta")
        .FirstOrDefault(n => n.GetAttributeValue("itemprop", null) == "name");

    if (meta != null)
        meta.Remove();

    return doc.DocumentNode.WriteContentTo();
}

This example demonstrates how to remove attributes and nodes from an HTML snippet.

On the iOS AppStore and Windows Spyware

Earlier I tweeted that I thought there was a connection between the iOS AppStore and the plethora of spyware, adware, and other garbage readily distributed through popular software aggregation sites like download.com.

I had just read HowToGeek’s excellent exposé on the effects of installing the most popular programs on download.com. There was a time (long ago) when I would recommend download.com as a safe place to get useful software like ccleaner. But the site has been in steady decline: in addition to noxious and tricky advertisements, most of the programs, even my beloved ccleaner, have started bundling garbage as part of the install process.

There are two things about Apple’s ecosystem that I think train users to believe that sites like download.com are safe and efficient sources of software.

3 Things I Learned This Week 2015-01-11

Schema.org itemref

We’re adding more schema.org markup to a site for work. The schema.org standard defines a lot of extra tags you can put on your HTML to tell crawlers more specific details about what the page is displaying.

We’re using a few different schemas (Recipe, Product, Author, etc.) across the page, and sometimes you need to tell the crawler that two sections of the page are actually describing the same item.

Calling one task from another in Rake

I added a rake task to create another one of these three things posts. Since I’m using Octopress, there’s already a rake task to generate the shell of a new post: rake new_post["Post Title Here"], which I would like to reuse rather than duplicate. Keeping things DRY, you know?

You can use rake’s task command to get a reference to another task and then invoke it, passing along whatever parameters you want:

Rakefile
desc "Create a new three things post"
task :three_things do |t, args|
  today = Date.today.to_s
  task(:new_post).invoke("3 Things I Learned This Week #{today}")
end

Spell check in Vim

I do most of my non-C# coding and writing in Vim (including this blog) and recently turned on Vim’s spell checking support. It works pretty well. With a good terminal emulator you can even get right-click support on misspelled words.

But without right-click support, these commands are working for me:

  • ]s and [s - move to the next or previous misspelled word
  • z= - show a list of spelling suggestions
  • zg - mark a word as “good”, i.e. spelled correctly

3 Things I Learned This Week (2015-01-02)

Happy New Year!

Top Posts of 2014

Lots of people are recapping the most visited pages of 2014, so here’s mine.

Using SASS With Visual Studio 2013 Essentials - Apparently it’s not altogether obvious how to get SASS compilation with Visual Studio. Lots of people came here from StackOverflow and Google looking for answers.

Saving Changes with Entity Framework 6 in ASP.NET MVC 5 - I was following a tutorial and got briefly stuck on why an update action wasn’t saving the changes back to the DB. Elaborating on the Unit of Work pattern provides clarity.

Multiple Angular Versions On The Same Page - At work we had an app that was using Angular but also included a plugin that itself was using Angular, although a different version. I had to isolate my Angular application from the plugin. Though not perfect, my solution did solve my immediate issue, and a few others have found it helpful as well.

Export a Global to the Window Object with Browserify - Browserify wraps all your code up in closures, but what happens when you need to export some objects for other scripts to access?

app.UseGoogleAuthentication() Does Not Accept 2 Arguments Azure Tutorial - When following an older tutorial on identity management for Azure, the sample code included a line that no longer compiles. It took me a while to figure out what the new arguments were.

Meeting Minutes are Actually Useful

I’ve started taking better meeting notes and emailing them to participants and those who couldn’t make the meeting. Keeping them in Google Docs makes them easily searchable and shareable. It’s nice to have them all in one place and to be able to specifically call out TODO items that were decided on. That makes it easy to follow up and make sure people are doing what they said they would do.

Inject a custom handler into Sitecore processing

I needed to inject a global handler into a Sitecore app that could check the status of something and perform a redirect if necessary. Sitecore uses a processing pipeline behind the scenes, and it’s pretty easy to inject a custom processor into it.

In my case, I was adding it to the httpRequestBegin pipeline, so I created a class that has a public void Process(HttpRequestArgs) method. Then I just needed to configure the class as part of the pipeline in sitecore.config.

~/MyApp.Web/Pipelines/Redirector.cs
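using Sitecore.Pipelines.HttpRequest;

// The namespace matches the processor type registered in the config below.
namespace MyApp.Web.Pipelines
{
    public class Redirector
    {
        public void Process(HttpRequestArgs args)
        {
            var status = GetTheStatusOfThings();
            if (status == Statuses.RedirectNeeded)
            {
                args.Context.Response.Redirect(RedirectUrl);
                // Stop the remaining processors from running for this request.
                args.AbortPipeline();
            }
        }
    }
}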
Sitecore.config
... <processor type="MyApp.Web.Pipelines.Redirector, MyApp.Web"/> ...

Note: MyApp.Web is the name of the DLL generated when I compile the MVC project.

3 Things I Learned This Week (2014-12-19)

One of the hosts of the Ruby Rogues podcast recently recommended taking time each week to write down three things you learned. I’m going to try this for a while and see if it’s helpful.

Nuget .nuspec files add Referenced DLLs to the bin/ During Build

If your project depends on a local DLL, one that is not already available via NuGet, you can add it to the <references> section of the .nuspec. When your library is referenced in Visual Studio, that DLL will come along for the ride even though there’s no direct reference to it in Visual Studio.

Compiling Ruby on Cygwin

Make sure you have all the build tools and libraries then use the instructions from the ruby-build project.

  • gcc
  • gcc-core
  • git
  • libtool
  • libncurses-devel
  • libncurses
  • make
  • openssh
  • openssl
  • openssl-devel
  • zlib
  • zlib-devel
  • libyaml
  • libyaml-devel
  • patch
  • patch-utils
  • libcrypt
  • libcrypt-devel
  • libiconv
  • libiconv-devel
  • curl
  • wget

Some of those may not actually be required but you should have them anyway :)

HT: Jean-Paul S. Boodhoo

Developing ASP.NET UserControls That Target and Modify Another Control

Sort of like how the ASP.NET Validation controls have a ControlToValidate property that takes the string ID of the desired control.

I was developing a wrapper for a jQuery modal plugin and wanted the modal to declare which control triggered it to open:

<a runat="server" ID="ModalTrigger" href="#">Open Modal</a>

<!-- Near the Bottom -->
<uc:MyModalControl runat="server" ID="Modal"
    ModalID="modal-one" TriggerControl="ModalTrigger">
    <Contents>
        This is the modal stuff
    </Contents>
</uc:MyModalControl>

My modal control would need to add a data-reveal-id="modal-one" attribute to the Open Modal link.

protected void Page_Load(object sender, EventArgs e)
{
    Page.PreRender += Page_PreRender;
}

void Page_PreRender(object sender, EventArgs e)
{
    if (TriggerControl == null)
        return;

    AddDataRevealIdToTriggerControl();
}

private void AddDataRevealIdToTriggerControl()
{
    var triggerControl = FindControlRecursive(Page, TriggerControl) as HtmlControl;
    if (triggerControl == null)
    {
        throw new InvalidOperationException(
            string.Format("Could not locate TriggerControl '{0}'. Did you include runat=\"server\"?", TriggerControl));
    }

    triggerControl.Attributes["data-reveal-id"] = ModalId;
}

public string TriggerControl { get; set; }

private static Control FindControlRecursive(Control root, string id)
{
    if (root.ID == id)
        return root;

    foreach (Control c in root.Controls)
    {
        var ctr = FindControlRecursive(c, id);
        if (ctr != null)
            return ctr;
    }

    return null;
}

OSX VPN Not Routing Intranet Traffic

Mostly so I can find this again if I need it.

I was sitting in the waiting room at the local auto shop, waiting for them to finish looking at my brakes, and tried to connect to our corporate VPN so I could look into some error emails I was getting.

Unfortunately I was unable to git pull the latest version of the code in question. I was getting an error about being unable to ssh to the git server.

That’s odd, usually if the VPN connects OK, I have no problems accessing the internal resources. Using ping to check the connection, I noticed that the internal traffic was not being routed over the VPN and the connection was being dropped by the local WIFI’s router.

It turns out that both my VPN and the WIFI connection I was using are configured to use 10.*.*.* IP addresses. So when I tried to ping 10.24.1.1, the internal IP of the git server, OSX was routing the data to the local WIFI instead of out over the VPN.

If only I could configure the network stack to send traffic to 10.24.*.* through the VPN!

Frustrations with System.DateTime - Part 1

System.DateTime is a frustrating object for a number of reasons, many of which I hope to elaborate on in future posts.

Imagine you have a class that, for correctness, requires a DateTime parameter to be in UTC. How can you communicate this requirement to users of your class, and how can you enforce it?
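To make the question concrete, here is one possible runtime approach, sketched under assumptions. The Appointment class and its StartsAtUtc property are hypothetical names invented for this illustration and are not necessarily where this series ends up. The sketch uses the parameter name to communicate the expectation and a DateTime.Kind check to enforce it, which also shows the frustration: the compiler cannot help, so the error only surfaces when the code runs.

```csharp
using System;

// Hypothetical class, used only to illustrate the UTC requirement.
public class Appointment
{
    public DateTime StartsAtUtc { get; }

    public Appointment(DateTime startsAtUtc)
    {
        // DateTime.Kind is a runtime value, not part of the type,
        // so the requirement can only be checked here.
        if (startsAtUtc.Kind != DateTimeKind.Utc)
        {
            throw new ArgumentException(
                "Expected a UTC DateTime but got Kind=" + startsAtUtc.Kind + ".",
                nameof(startsAtUtc));
        }

        StartsAtUtc = startsAtUtc;
    }
}

public static class UtcGuardDemo
{
    public static void Main()
    {
        var ok = new Appointment(DateTime.UtcNow);
        Console.WriteLine(ok.StartsAtUtc.Kind); // Utc

        try
        {
            new Appointment(DateTime.Now); // Kind is Local, so this throws
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine(ex.ParamName); // startsAtUtc
        }
    }
}
```

Note that a DateTime built without an explicit kind, such as new DateTime(2015, 1, 1), has Kind Unspecified and is rejected as well, even if the caller intended it as UTC.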

On Training New Developers

Training any new employee is often an expensive undertaking. During the training period, both the trainer and trainee are working under capacity. The senior developers have to devote time to showing off the code base and performing quality control on the new employee’s contributions. The new employee, on the other hand, has not quite built a full knowledge of the business domain and development process and will understandably take longer to perform even rudimentary tasks.

But I think that proper preparation can go a long way towards minimizing those costs.

Keep Your Azure Secrets Safely Out Of Git

Application development these days often requires maintaining and securing credentials for numerous third party services and external tools. Your database, error tracker, and email integration services all require you to present a token or password for authentication and authorization.

If you’re keeping your source code out in the open, like on GitHub or CodePlex, or what-have-you, you want to be careful to keep those secrets to yourself.

ASP.NET and Windows Azure make that pretty easy through the web.config’s ability to import external files.