Processing HTML with Hpricot

In this world of Web 2.0 mashups and easy API access, it is quite refreshing how easy it is to pull data from third-party sites and re-mash it into something new. Unfortunately, not everyone has been bitten by this bug, so we as developers sometimes have to do a little more legwork to get the information we need. A common technique is the screen scrape, where your application acts like a browser and parses the HTML returned from the third-party server.

Although this should be simple enough, anyone who has ever tried it knows the pain of dancing with regular expressions in an attempt to find the tags that you need. Luckily, we Rubyists have the Hpricot library, which takes the hard work out of parsing HTML. Hpricot allows developers to access HTML elements via CSS selectors and XPath, so you can target specific tags really easily. And because its HTML scanner is written in C, it is pretty fast too.
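To make the regular expression pain a little more concrete, here is a minimal sketch in plain Ruby (no Hpricot involved). The method name naive_paragraphs and the sample markup are made up for illustration. The pattern happily matches every bare p tag, so it cannot tell a sidebar paragraph from the main content, and it silently misses any p tag that carries an attribute.

```ruby
# Hypothetical sample markup, invented for this illustration.
SAMPLE_HTML = <<~HTML
  <div id="sub-content">
    <p>This would be some sort of sidebar</p>
  </div>
  <div id="content">
    <p>This is paragraph 1</p>
    <p>This is paragraph 2</p>
  </div>
HTML

# A deliberately naive scrape: grab the text between each <p> and </p>.
# The regex knows nothing about which div a paragraph lives in.
def naive_paragraphs(html)
  html.scan(%r{<p>(.*?)</p>}m).flatten
end

# The sidebar paragraph comes along with the two content paragraphs.
puts naive_paragraphs(SAMPLE_HTML).inspect

# A tag with an attribute does not match the pattern at all.
puts naive_paragraphs('<p class="intro">Hello</p>').inspect
```

You can keep patching the pattern to cope with attributes, nesting and whitespace, but that is exactly the dance a real parser saves you from.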


Hpricot is a gem, so installation is as easy as:

  gem install hpricot

Then just require the library at the top of your Ruby file:

  require 'hpricot' 


Let's take this HTML snippet:

  <html>
    <head>
      <title>Snippet</title>
    </head>
    <body>
      <div id="container">
        <div id="navigation">
          <ul>
            <li><a href="/">Home</a></li>
            <li><a href="/contact">Contact</a></li>
          </ul>
        </div>
        <div id="sub-content">
          <p>This would be some sort of sidebar</p>
        </div>
        <div id="content">
          <p>This is paragraph 1</p>
          <p>This is paragraph 2</p>
        </div>
      </div>
    </body>
  </html>

We can easily pull out the content of the paragraphs by doing this (let's assume the HTML is already stored in the variable @html):

  doc = Hpricot(@html)
  pars = []
  # Collect the inner HTML of every p tag directly inside the content div
  doc.search("div[@id=content]/p").each do |p|
    pars << p.inner_html
  end

Yep – that’s it. You now have an array with two elements that are the same as the copy in the two p tags. Notice that the p tag in the sub-content div isn’t pulled in?
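The example assumes @html is already populated. As a rough sketch of where it might come from, the helper below fetches a page with Net::HTTP from Ruby's standard library. The name fetch_html and the example.com URL are placeholders of mine, not part of Hpricot, and a real scraper would also want timeouts and redirect handling.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (the name is an assumption, not an Hpricot API):
# fetch a page over HTTP and return its body as a string.
def fetch_html(url)
  uri = URI.parse(url)
  response = Net::HTTP.get_response(uri)

  # Anything other than a 2xx response is treated as a failure here;
  # redirects are not followed in this sketch.
  unless response.is_a?(Net::HTTPSuccess)
    raise "Request for #{url} failed with status #{response.code}"
  end

  response.body
end

# Usage (needs network access, and permission from the site owner):
#   @html = fetch_html("http://example.com/")
```

The returned string can then be handed straight to Hpricot(@html).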

It doesn't end there, though. You can also manipulate the HTML, which can come in handy if you wanted to, say, create a quick and dirty mobile version. Let's say we wanted to remove the sub-content div from the mobile version. We could do this:

  doc = Hpricot(@html)
  # Drop the sidebar div from the document
  doc.search("div[@id=sub-content]").remove
  puts doc

The resultant HTML no longer has a div called sub-content!

Adding a new class to the navigation ul is as simple as:

  doc = Hpricot(@html)
  # Set class="nav" on the ul inside the navigation div
  doc.search("div[@id=navigation]/ul").set("class", "nav")

This is just the tip of the iceberg – the library is really powerful and simple to use. Go and check out the official page for more (less trivial) examples.

Disclaimer: You should make sure you have permission from the website owner before screen-scraping their site.


Category: programming Time: 2007-11-21 Views: 1
