29 Aug 2008

HTML DOM and easy screen scraping in PHP

One of my favourite features in JavaScript is its ability to interact with the DOM so easily. This is made even easier by the various JavaScript libraries, whose selector engines are largely based on CSS selectors.

Working with XML is easy in PHP thanks to extensions such as SimpleXML; unfortunately, HTML is far more tedious. Thankfully, I found PHP Simple HTML DOM Parser (they could really do with a shorter name! I’ll go for PSDP).
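To give an idea of what I mean by tedious, here is a rough sketch of pulling image sources out of a page using only PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than PSDP. The helper name image_sources() and the sample markup are made up for illustration, and it assumes a reasonably recent PHP with the DOM extension enabled.

```php
<?php

// Hypothetical helper: collect the src of every <img> in an HTML string
// using only the bundled DOM extension (no third-party library).
function image_sources(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of letting them spill into the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];

    // XPath rather than a CSS selector: match every <img> that has a src.
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }

    return $sources;
}

// Made-up sample markup, purely for illustration.
$sample = '<html><body><div id="content">'
        . '<img src="/logo.png"><p>Hi</p><img src="/photo.jpg">'
        . '</div></body></html>';

foreach (image_sources($sample) as $src) {
    echo $src . "\n";
}
```

It works, but you need to silence parser warnings yourself and write XPath instead of the CSS-style selectors most of us already know from JavaScript.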

It is an open source PHP solution to DOM parsing and, from the documentation, it seems to be modelled on jQuery. So if you’re familiar with jQuery (or any other JavaScript library) and PHP, you’ll find it a breeze to pick up.

So what use is it anyway? Since PHP runs on the server, you can’t dynamically access the DOM in the browser like you can in JavaScript. Instead, it could be used for screen scraping, or perhaps even as a template engine.

Here is a quick example of how it’s used (mostly borrowed from the docs):

<?php

require 'simple_html_dom.php';

// Listing all (one) of the images on google.com
$html = file_get_html('http://www.google.com');

foreach ($html->find('img') as $element) {
    // Print the src attribute of each <img> element
    echo $element->src . "<br/>";
}

echo "<br/>";

// Listing some of the things found on my website home page
$html = file_get_html('http://www.dougalmatthews.com/');

foreach ($html->find('#content h1') as $element) {
    // Print the text of each <h1> inside the element with id "content"
    echo $element->plaintext . "<br/>";
}

I could keep listing examples, but there isn’t much point as there are quite a few nice ones in the PSDP documentation.

With this information and my previous post on RSS feeds, you can now easily create feeds for websites that you wish had them! This is exactly how I created the feeds found on this page. If you do make any, let me know; I can host them under that address if you need hosting or just want to share ;)
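As a rough idea of the second half of that job, here is a minimal sketch that turns a list of scraped items into an RSS 2.0 document using PHP's bundled DOMDocument. It is not necessarily how my own feeds are generated: the build_rss() helper, the item array layout and the example.com data are all made up for illustration.

```php
<?php

// Hypothetical helper: turn scraped items into an RSS 2.0 document.
// Each item is an array with 'title', 'link' and 'description' keys.
function build_rss(string $title, string $link, string $description, array $items): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);

    $channel = $doc->createElement('channel');
    $rss->appendChild($channel);

    // RSS 2.0 requires a title, link and description on the channel.
    $fields = ['title' => $title, 'link' => $link, 'description' => $description];
    foreach ($fields as $name => $value) {
        $node = $doc->createElement($name);
        // Text nodes are escaped on output, so & and < are safe here.
        $node->appendChild($doc->createTextNode($value));
        $channel->appendChild($node);
    }

    foreach ($items as $item) {
        $itemNode = $doc->createElement('item');
        foreach (['title', 'link', 'description'] as $name) {
            $node = $doc->createElement($name);
            // Missing keys become empty elements rather than errors.
            $node->appendChild($doc->createTextNode($item[$name] ?? ''));
            $itemNode->appendChild($node);
        }
        $channel->appendChild($itemNode);
    }

    return $doc->saveXML();
}

// Made-up data standing in for whatever the scraper found.
$items = [
    ['title' => 'First post', 'link' => 'http://example.com/1', 'description' => 'Tom & Jerry'],
    ['title' => 'Second post', 'link' => 'http://example.com/2', 'description' => 'More news'],
];

echo build_rss('Example feed', 'http://example.com/', 'Scraped headlines', $items);
```

In practice the $items array would be filled inside a loop over whatever the scraper finds, for example the plaintext of each matched heading.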

I really like this project. It’s so easy to get started with, as they followed a standard(ish) approach that JavaScript has proven to work really well. Nice work!

