Today’s post is the first in a series about creating a PHP spider/crawler/bot. I’m actually not sure what to call it.
At my work, we deal with hundreds of shipments to and from hundreds of vendors. We have thousands (possibly more) of “tracking codes” for FedEx, UPS, DHL, Conway, etc. All of these carriers have websites where you can look up a specific tracking code and see the status of your delivery. FedEx has an API, as do USPS, UPS, and DHL, but Conway does not. The question I’m facing is: do I integrate four different APIs plus a bot to get the info I need, or do I just brute-force it all with a bot?
First things first, though: can I even make a bot in PHP? Yes! As a test case I have created a simple page-scraping bot. It opens the OCA website, scrapes today’s scripture reading off the page, and prints it out:
<?php
// Get a connection to the desired page (requires allow_url_fopen to be enabled)
$handle = fopen("http://www.oca.org/Reading.asp?SID=25", "r");
if ($handle === false) {
    die("Could not open the page.");
}
// Pull down the contents of that page
$contents = stream_get_contents($handle);
// Close the connection
fclose($handle);
/* Convert everything to lower case so it's easier to do my string matching (not necessarily a must) */
$contents = strtolower($contents);
/* I looked at the HTML coming down, and found that this was the best consistent place to look at to determine the beginning of the "scripture" content. */
$start = 'class="scriptureheader">';
// Find the position of the start text in the page content
$start_pos = strpos($contents, $start);
if ($start_pos === false) {
    die("Could not find the start of the scripture.");
}
// Drop everything up to and including the start text, so the marker itself is not printed
$first_trim = substr($contents, $start_pos + strlen($start));
/* This looked to be the best consistent ending of the scripture in the HTML */
$stop = '</table>';
// Find the position of the stop string in the remaining content
$stop_pos = strpos($first_trim, $stop);
if ($stop_pos === false) {
    die("Could not find the end of the scripture.");
}
// Drop everything after the stop position
$second_trim = substr($first_trim, 0, $stop_pos);
// Print out the results
print "<div>$second_trim</div>";
It’s important to note that I kept spaces out of the start and stop strings: strpos only finds an exact, character-for-character match, so any difference in whitespace between my string and the page’s HTML would make the lookup fail. This could (and would need to) be greatly improved to be really useful. I’d probably recommend incorporating regular expressions instead of using the strpos function. Next I’m going to look at the cURL library for PHP. I’ll reuse the same website and navigate its forms to get the scripture readings for Sundays instead of the current day.
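For the curious, here is a rough sketch of what the regular expression version might look like. It is only an illustration, not tested against the live OCA page: the extract_between helper is a name I made up, and the sample markup is invented so the snippet runs without a network connection. One nice side effect is that the case-insensitive flag makes the strtolower step unnecessary, so the scripture keeps its original capitalization.

```php
<?php
// Hypothetical helper (not part of the original script): returns the text
// between a start marker and a stop marker, or null if either is missing.
function extract_between(string $html, string $start, string $stop): ?string
{
    // preg_quote escapes regex metacharacters in the markers; "#" is the delimiter.
    // Flags: i = case-insensitive match, s = let "." match newlines too.
    // (.*?) is lazy, so it stops at the first occurrence of the stop marker.
    $pattern = '#' . preg_quote($start, '#') . '(.*?)' . preg_quote($stop, '#') . '#is';
    if (preg_match($pattern, $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

// Invented sample markup standing in for the downloaded page.
$sample = '<table><tr><td CLASS="ScriptureHeader">John 1:1-5</td></tr>'
        . '<tr><td>In the beginning...</td></tr></table><p>footer</p>';

$reading = extract_between($sample, 'class="scriptureheader">', '</table>');
if ($reading === null) {
    print "Could not find the scripture.\n";
} else {
    print "<div>$reading</div>\n";
}
```

With the real page, $sample would simply be replaced by the $contents pulled down with stream_get_contents.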