Chomper Stomping
jQuery/JavaScript/CSS 3/HTML 5, Java/PHP/Python/ActionScript, Git, Chrome/Firefox Extensions, Wordpress/Game/iPhone App Development and other random techie tidbits I've collected




July 18, 2008

Creating A PHP Spider/Bot/Crawler

This is the first in a series of posts about creating a PHP spider/crawler/bot. I’m not actually sure what to call it.

At my work, we deal with hundreds of shipments to and from hundreds of vendors. We have thousands (possibly more) of “tracking codes” for FedEx, UPS, DHL, Conway, etc. All of these carriers have websites where you can look up a specific tracking code and see the status of your delivery. FedEx, USPS, UPS, and DHL each have an API, but Conway does not. The question I’m facing is: do I integrate four different APIs plus a bot to get the info I need, or do I just brute-force it all with a bot?

First things first, though: can I even make a bot in PHP? Yes! As a test case, I have created a simple page-scraping bot. It opens the OCA website, scrapes today’s scripture reading, and prints it out:

<?php
// Open a read-only connection to the desired page
// (fetching a URL with fopen requires allow_url_fopen to be enabled)
$handle = fopen("http://www.oca.org/Reading.asp?SID=25", "r");
if ($handle === false) {
    die("Could not open the page.");
}
// Pull down the contents of that page
$contents = stream_get_contents($handle);
// Close the connection
fclose($handle);

/* Convert everything to lower case so the string matching is easier (not strictly necessary; note that it also lower-cases the printed output) */
$contents = strtolower($contents);

/* I looked at the HTML coming down and found that this was the most consistent marker for the beginning of the "scripture" content. */
$start = 'class="scriptureheader">';
// Find the position of the start text in the page content
$start_pos = strpos($contents, $start);
if ($start_pos === false) {
    die("Start marker not found.");
}

// Drop the text before the start position
$first_trim = substr($contents, $start_pos);

/* This looked to be the most consistent ending of the scripture in the HTML */
$stop = '</table>';
// Find the position of the stop string in the remaining content
$stop_pos = strpos($first_trim, $stop);
if ($stop_pos === false) {
    die("Stop marker not found.");
}

// Drop everything after the stop position
$second_trim = substr($first_trim, 0, $stop_pos);

// Print out the results
print "<div>$second_trim</div>";

It’s important to note that the start and stop strings I chose do not have any spaces in them; strpos does an exact match, so any difference in whitespace between a marker and the page’s HTML would make the lookup fail. This script could (and would need to) be greatly improved to really be useful. I’d probably recommend using regular expressions instead of the “strpos” function. Next I’m going to look at the cURL library for PHP. I’ll re-use the same website and navigate its forms to get the scripture readings for Sundays instead of the current day.
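
As a rough illustration of the regular-expression idea mentioned above, here is a minimal sketch. The helper name extract_between and the sample markup are made up for this example (they are not from the original script), and it assumes PHP 7.1 or later for the nullable return type. It matches the start marker case-insensitively, so the page does not need to be lower-cased first.

```php
<?php
/**
 * Hypothetical helper: return the text between a start marker and a stop
 * marker, or null when either marker is missing.
 */
function extract_between(string $html, string $start, string $stop): ?string
{
    // preg_quote escapes regex metacharacters (and the "/" delimiter)
    // so both markers are matched as literal text.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($stop, '/') . '/is';

    // "i" makes the match case-insensitive; "s" lets "." span newlines.
    // The lazy "(.*?)" stops at the first occurrence of the stop marker.
    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    return trim($matches[1]);
}

// Made-up sample markup standing in for the downloaded page contents.
$sample = '<table><tr><td CLASS="scriptureheader">Reading</td></tr>'
        . '<tr><td>Sample text</td></tr></table>';

$fragment = extract_between($sample, 'class="scriptureheader">', '</table>');

echo $fragment === null ? "Markers not found\n" : $fragment . "\n";
```

Unlike the strpos version, the captured fragment excludes the start marker itself and keeps the original letter case of the page. For anything beyond a quick scrape, a real HTML parser would likely be more robust than either approach.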



About the Author

Christopher McCulloh
E-Commerce developer at Finish Line. Co-Author of HTML, XHTML and CSS All-in-one Desk Reference for Dummies. Graduated from IU with a Bachelors of Media Arts and Science and a Certificate in Applied Computer Science. Tech Editor for Building Facebook Applications for Dummies and Building Websites All-in-one for Dummies 2nd Edition. Creator and maintainer of the Status-bar Calculator Firefox Extension. Three years professional experience in Java E-Commerce Development and four years professional experience with PHP, for a combined total of seven years professional JavaScript/HTML/CSS experience.




 
 

 

 




7 Comments


  1. Thank you for this great article. I was looking for a resource that helps me with spider creation and your article simply has given me a starting point. Thank you again,
    Sumit


  2. Austin Oblouk

    Thank you for this article. I also recommend using str_replace to strip characters that could harm your database if you are working with hostile sites and caching info. Also, the PHP GD library is very nice when working with cURL; you can really make some nice GUI improvements.


  3. erick

    Hey, I was looking for something similar, but for this delivery company http://www.sendaexpress.com.mx/cotizador.asp to calculate shipping rate, any ideas? ..

    Thanks


  4. So, I posted this article two years ago and I know I eventually figured out how to do it using cURL but I have absolutely no recollection as to how I actually accomplished this. So, @erick, no, sorry, I don’t have any idea… I really wish I had done a follow up post…


  5. Good article. I use this same approach on a project I’m currently working on (http://www.quickscrape.com), and I would recommend using cURL. How do you handle recursion?


  6. Peter

    Very good article. Is it somehow possible to make the bot identify itself under a particular name when it enters a site (e.g., Google’s bot is identified in statistics as Googlebot)? Thanks very much for any advice on how to name the bot, so that site owners know they have been visited by my bot.


  7. Aneesh Dogra

    Nice for beginners …
    Need something advanced for me..
    Will be waiting for more articles…


