This is the first in a series of posts about creating a PHP spider/crawler/bot. I’m actually not sure what to call it.
At my work, we deal with hundreds of shipments to and from hundreds of vendors. We have thousands (possibly more) “tracking codes” for FedEx, UPS, DHL, Conway, etc. All of these places have websites where you can go to look up a specific tracking code and see the status of your delivery. FedEx has an API, as do USPS, UPS, and DHL, but Conway does not. The question I’m facing is: do I integrate four different APIs plus a bot to get the info I need, or do I just brute force it all with a bot?
First things first though: can I even make a bot in PHP? Yes! As a test case I have created a simple page-scraping bot. It opens the OCA website, scrapes today’s scripture reading, and prints it out:
<?php
//get a connection to the desired page (fopen on a URL requires allow_url_fopen to be enabled)
$handle = fopen("http://www.oca.org/Reading.asp?SID=25", "r");
if ($handle === false) {
    die("Could not open the page.");
}
//pull down the contents of that page
$contents = stream_get_contents($handle);
//close the connection
fclose($handle);
/*convert everything to lower case so it's easier to do my string matching (not necessarily a must)*/
$contents = strtolower($contents);
/*I looked at the HTML coming down, and found that this was the best consistent place to look at to determine the beginning of the "scripture" content.*/
$start = 'class="scriptureheader">';
//find the position of the start text in the page content
$start_pos = strpos($contents, $start);
//strpos returns false when the text is missing, so check with === before using the position
if ($start_pos === false) {
    die("Start marker not found.");
}
//drop the text before the start position, along with the start marker itself
$first_trim = substr($contents, $start_pos + strlen($start));
/*this looked to be the best consistent ending of the scripture in the HTML*/
$stop = '</table>';
//find the position of the stop string in the content
$stop_pos = strpos($first_trim, $stop);
if ($stop_pos === false) {
    die("Stop marker not found.");
}
//drop everything after the stop position
$second_trim = substr($first_trim, 0, $stop_pos);
//print out the results
print "<div>$second_trim</div>";
It’s important to note that the start and stop strings I chose do not have any spaces in them: strpos only finds an exact match, so any difference in whitespace between the marker and the page’s HTML makes the lookup fail. This could (and would need to) be greatly improved to really be useful. I’d probably recommend incorporating regular expressions instead of using the “strpos” function. Next I’m going to look at the cURL library for PHP. I’ll re-use the same website to navigate forms and get the scripture readings for Sundays instead of the current day.
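As a rough idea of what the regular-expression version might look like, here is a minimal sketch. The function name extract_between and the sample markup are made up for illustration; they are not part of the bot above, and the real OCA page may need a different pattern.

```php
<?php
/**
 * Hypothetical helper: return the text between a start marker and a stop
 * marker, or null when either marker is missing.
 */
function extract_between(string $html, string $start, string $stop): ?string
{
    // preg_quote() escapes any regex metacharacters in the markers.
    // Modifier "i" makes the match case-insensitive (so no strtolower is needed),
    // modifier "s" lets "." match newlines, and ".*?" stops at the first stop marker.
    $pattern = '~' . preg_quote($start, '~') . '(.*?)' . preg_quote($stop, '~') . '~is';

    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[1]);
    }

    return null;
}

// Made-up sample markup standing in for the downloaded page contents.
$sample = '<table><tr><td CLASS="scriptureheader">John 1:1-5</td></tr></table><p>footer</p>';

$reading = extract_between($sample, 'class="scriptureheader">', '</table>');

if ($reading === null) {
    print "<div>No reading found.</div>";
} else {
    print "<div>$reading</div>";
}
```

Because the match is case-insensitive, the page text keeps its original capitalization instead of being forced to lower case. For anything more involved than one pair of markers, an HTML parser would probably hold up better than a pattern.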
Tags: bot, code, crawler, example, PHP, sample code, spider, tutorial
Thank you for this great article. I was looking for a resource that helps me with spider creation and your article simply has given me a starting point. Thank you again,
Sumit
Thank you for this article. I also recommend using str_replace to strip characters that could harm your database if you are working with hostile sites and caching info. Also, the PHP GD library is very nice when working with cURL; you can really make some nice GUI improvements.
Hey, I was looking for something similar, but for this delivery company, http://www.sendaexpress.com.mx/cotizador.asp, to calculate shipping rates. Any ideas?
Thanks
So, I posted this article two years ago, and I know I eventually figured out how to do it using cURL, but I have absolutely no recollection of how I actually accomplished it. So, @erick, no, sorry, I don’t have any idea… I really wish I had done a follow-up post…