Chomper Stomping jQuery/JavaScript/CSS 3/HTML 5, Java/PHP/Python/ActionScript, Git, Chrome/Firefox Extensions, Wordpress/Game/iPhone App Development and other random techie tidbits I've collected

23 Jul 08

preg_*

Regular expressions are fun! (repeat 5x, or until you believe it)

I'm working on a webbot, and right now I need it to drop all the HTML and just leave me with the text. So I wrote this regex:

/<.*>/

(explanation: start the match at "<", take any character "." any number of times "*", and finish at ">". The catch is that "*" is greedy, so the match does not stop at the first ">"; it runs all the way to the very last ">" it can find.)

Of course, it then matched everything from the first < all the way to the last >, dropping all text that was properly encapsulated by HTML tags.
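To make the greedy behavior concrete, here is a minimal sketch. The sample line is made up for illustration and is not something my webbot actually fetched:

```php
<?php
// Hypothetical sample line; not from the original post.
$line = '<p>Hello <b>world</b>!</p>';

// Greedy ".*" grabs as much as it can, so the single match spans
// from the first "<" to the last ">" on the line.
preg_match('/<.*>/', $line, $matches);
$greedyMatch = $matches[0];

// Removing that one big match removes the text along with the tags.
$greedyResult = preg_replace('/<.*>/', '', $line);

var_dump($greedyMatch);
var_dump($greedyResult);
```

On this input the one match is the entire line, so the "cleaned" result is an empty string.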

So, next I wrote this:

/<[a-zA-Z\t "=0-9_\-\\/]*>/

(explanation: start at "<", match any character I could think of that might show up inside a tag, which never includes ">", "[a-zA-Z\t "=0-9_\-\\/]" any number of times "*", and then match ">". Since ">" is not in the list, the match stops at the first ">".)

Wow... that's... insanity... and I probably even missed something. It did, however, drop only the HTML tags themselves. Still, it's nasty looking.
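As a rough sketch of what "missed something" can look like, here is the hand-written character list run against a link tag. The sample line is hypothetical, and I'm assuming the pattern lives in a single-quoted PHP string, where "\\/" reaches the regex engine as an escaped "/":

```php
<?php
// The hand-written character list, held in a single-quoted string.
$pattern = '/<[a-zA-Z\t "=0-9_\-\\/]*>/';

// Hypothetical sample line; the URL contains ":" and ".",
// neither of which is in the list.
$line = '<a href="http://example.com/index.html">link</a>';

// The opening tag cannot match because of ":" and ".",
// so only the closing "</a>" gets removed.
$result = preg_replace($pattern, '', $line);

var_dump($result);
```

The opening tag survives untouched, which is exactly the kind of hole a list of allowed characters tends to leave.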

I then realized I could just write this:

/<[^>]*>/

(explanation: start at "<" find any character except > "[^>]" any number of times "*" and stop as soon as you come to ">")

Yeah, it looks like some sort of ASCII art of "The Cheat" or something, but it very elegantly finds the beginning and end of each tag. See, regex is fun!
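A quick way to see that each tag is matched on its own is to collect the matches with preg_match_all. This is only a sketch, and the sample line is invented for the example:

```php
<?php
// Hypothetical sample line; not from the original post.
$line = '<p>Hello <a href="http://example.com/index.html">world</a>!</p>';

// Each match starts at "<" and ends at the nearest ">", because
// "[^>]*" cannot step over a ">".
preg_match_all('/<[^>]*>/', $line, $matches);
$tags = $matches[0];

print_r($tags);
```

Here $tags ends up holding four separate entries: the opening "p" tag, the full "a" tag with its href, and the two closing tags.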

Here is the final code btw:

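// Match one tag: "<", then any run of characters other than ">", then ">".
$htmlSearch = '/<[^>]*>/';
// Replace every tag in $line with an empty string, leaving only the text.
$cleanLine = preg_replace($htmlSearch, "", $line);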
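For anyone who wants to try it, here is one way the two lines might be wrapped up and run over a few lines of input. The function name dropTags and the sample lines are my own placeholders, not part of the bot itself:

```php
<?php
// Hypothetical helper name; the post only shows the two lines above.
function dropTags(string $line): string
{
    $htmlSearch = '/<[^>]*>/';
    return preg_replace($htmlSearch, '', $line);
}

// Hypothetical input lines standing in for what the webbot fetched.
$lines = [
    '<h1>Title</h1>',
    '<p>Some <em>text</em> here.</p>',
    '<img alt="a > b">caption',   // caveat: ">" inside an attribute value
];

$cleanLines = array_map('dropTags', $lines);
print_r($cleanLines);
```

The first two lines come out as plain text. The third shows a limit of the pattern: a ">" inside a quoted attribute ends the match early, so part of the tag is left behind. PHP also has a built-in strip_tags() function that may be worth comparing against if that kind of input matters.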

Comments (1) Trackbacks (0)
  1. Thank you very much for the post on phpmanual.net regarding enabling curl when using xampp.

