Creating A PHP Spider/Bot/Crawler

This is the first in a series of posts about creating a PHP Spider/Crawler/Bot. I’m actually not sure what to call it.

At my work, we deal with hundreds of shipments to and from hundreds of vendors. We have thousands (possibly more) “tracking codes” for FedEx, UPS, DHL, Conway, etc. All of these carriers have websites where you can look up a specific tracking code and see the status of your delivery. FedEx, USPS, UPS, and DHL all have APIs, but Conway does not. The question I’m facing is: do I integrate four different APIs plus a bot to get the info I need, or do I just brute force it all with a bot?

First things first though: can I even make a bot in PHP? Yes! As a test case I have created a simple page-scraping bot. It opens the OCA website, scrapes today’s scripture reading, and prints it out:

<?php
// Get a connection to the desired page (requires allow_url_fopen to be enabled)
$handle = fopen("http://www.oca.org/Reading.asp?SID=25", "r");
if ($handle === false) {
    exit("Could not open the page.");
}
// Pull down the contents of that page
$contents = stream_get_contents($handle);
// Close the connection
fclose($handle);

/* Convert everything to lower case so it's easier to do my string matching (not necessarily a must) */
$contents = strtolower($contents);

/* I looked at the HTML coming down, and found that this was the best consistent place to look at to determine the beginning of the "scripture" content. */
$start = 'class="scriptureheader">';
// Find the position of the start text in the page content
$start_pos = strpos($contents, $start);
// strpos returns false when there is no match, so compare with ===
if ($start_pos === false) {
    exit("Start marker not found.");
}

// Drop the text before the start position, and the marker itself so it is not printed
$first_trim = substr($contents, $start_pos + strlen($start));

/* This looked to be the best consistent ending of the scripture in the HTML */
$stop = '</table>';
// Find the position of the stop string in the content
$stop_pos = strpos($first_trim, $stop);
if ($stop_pos === false) {
    exit("Stop marker not found.");
}

// Drop everything after the stop position
$second_trim = substr($first_trim, 0, $stop_pos);

// Print out the results
print "<div>$second_trim</div>";

It’s important to note that the start and stop strings have to match the downloaded HTML character for character (after lowercasing), so I picked markers that contain no spaces. This could (and would need to) be greatly improved to be really useful. I’d probably recommend using regular expressions instead of the “strpos” function. Next I’m going to look at the cURL library for PHP. I’ll reuse the same website and navigate its forms to get the scripture readings for Sundays instead of the current day.
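For what it's worth, here is a rough sketch of what the regular expression approach might look like. The function name extract_between and the sample markup are made up for illustration (they are not from the OCA site), and it runs against a hard-coded string so it can be tried without hitting the network:

```php
<?php
// Hypothetical helper: return the text found between a start marker and a
// stop marker, or null when the markers do not appear in that order.
function extract_between(string $html, string $start, string $stop): ?string
{
    // preg_quote escapes regex metacharacters in the literal markers.
    // Flags: i = case-insensitive match, s = let "." match newlines too.
    // (.*?) is non-greedy, so matching ends at the FIRST stop marker.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($stop, '/') . '/is';

    if (preg_match($pattern, $html, $matches) !== 1) {
        return null;
    }

    return trim($matches[1]);
}

// Made-up sample markup standing in for the downloaded page.
$sample = '<table><tr><td CLASS="ScriptureHeader">John 1:1-5</td></tr></table>';

$reading = extract_between($sample, 'class="scriptureheader">', '</table>');

if ($reading !== null) {
    print "<div>$reading</div>";
}
```

Because the match is case-insensitive, the strtolower step is no longer needed and the scraped text keeps its original capitalization. Like the strpos version, this still returns an HTML fragment with whatever tags sit between the two markers.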


4 Responses to “Creating A PHP Spider/Bot/Crawler”

  1. Thank you for this great article. I was looking for a resource that helps me with spider creation and your article simply has given me a starting point. Thank you again,
    Sumit

  2. Austin Oblouk says:

    Thank you for this article. I also recommend str_replace characters that could harm your database if you are working with hostile sites and caching info. Also using the php GDLib is very nice when working with cURL, you can really make some nice gui improvements.

  3. erick says:

    Hey, I was looking for something similar, but for this delivery company http://www.sendaexpress.com.mx/cotizador.asp to calculate shipping rate, any ideas? ..

    Thanks

  4. So, I posted this article two years ago and I know I eventually figured out how to do it using cURL but I have absolutely no recollection as to how I actually accomplished this. So, @erick, no, sorry, I don’t have any idea… I really wish I had done a follow up post…
