Parsing HTML Documents with the Html Agility Pack

Introduction

Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser – namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.

The .NET Framework offers several classes for accessing data from a remote website, notably the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf and String.Substring, or on regular expressions.
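
To make the string parsing approach concrete, here is a minimal sketch that pulls the page title out of a block of markup using only String.IndexOf and String.Substring. The TitleScraper class, the ExtractTitle method, and the sample markup are hypothetical names invented for this illustration; in a real scraper the markup would more likely come from a call such as WebClient.DownloadString.

```csharp
using System;

static class TitleScraper
{
    // Hypothetical helper: returns the text between <title> and </title>,
    // or null if either tag cannot be found.
    public static string ExtractTitle(string html)
    {
        const string openTag = "<title>";
        const string closeTag = "</title>";

        // Locate the opening tag, ignoring case (<TITLE> is also valid HTML).
        int start = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += openTag.Length;

        // Search for the closing tag only after the opening tag.
        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return html.Substring(start, end - start).Trim();
    }

    static void Main()
    {
        // Assumed sample markup; a real scraper would download this instead.
        string html = "<html><head><title> Sample Page </title></head><body></body></html>";
        Console.WriteLine(ExtractTitle(html));
    }
}
```

Even this small example shows why the approach is brittle: a title tag that carries an attribute, such as <title lang="en">, no longer matches the literal search string, so the method returns null.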

Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.)
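
As a rough sketch of what this looks like in code, the following example loads a string of markup into an HtmlDocument, uses an XPath expression to select every anchor element that has an href attribute, and then steps from each node up to its parent. It assumes the HtmlAgilityPack assembly is referenced by the project; the LinkLister class, the ExtractLinks method, and the sample markup are hypothetical names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class LinkLister
{
    // Hypothetical helper: describes each <a href="..."> element in the markup
    // as "text -> href (parent: <name>)".
    public static List<string> ExtractLinks(string html)
    {
        var results = new List<string>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return results;

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", string.Empty);
            // ParentNode moves from the anchor up to the element that contains it.
            results.Add(string.Format("{0} -> {1} (parent: <{2}>)",
                link.InnerText, href, link.ParentNode.Name));
        }

        return results;
    }

    static void Main()
    {
        // Assumed sample markup for the illustration.
        string html = "<html><body><p><a href=\"/one\">One</a> and <a href=\"/two\">Two</a></p></body></html>";

        foreach (string line in ExtractLinks(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

To parse a live page rather than a string, the library's HtmlWeb class can download and load the document in one step, for example new HtmlWeb().Load(url), which returns an HtmlDocument that can be queried the same way.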

This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library’s utility. A complete, working demo is available for download at the end of this article. Read on to learn more!

Read complete article from 4GuysFromRolla.com
