Friday, February 21, 2014

Using a Regular Expression to Search for HTML Nodes

I am a lazy coder and data analyst.  I use text editing tools to search, replace, and otherwise manipulate text so I can then use a spreadsheet tool to break it into columns and rows.  Then I can extract data or write code that follows patterns without a lot of manual work.  If I am working on code, I can copy and paste it back into my text editor, remove the tabs, and then I have clean code.  Brilliant!

One thing I need to do when scrubbing information that is structured as HTML or XML (or XHTML) is remove extraneous nodes that contain no data.  For instance, I would remove <style> blocks.  Using a tool like TextPad for Windows, you can use the following regular expression to select an entire node (i.e., the start and end tags and everything in between, including line breaks).
This regex allows me to search and replace (with nothing) to remove the <style> blocks.  To target another block, just replace the two instances of "style" with the tag you want to find (e.g., "script").
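
If you would rather script the same cleanup than do it in an editor, here is a minimal Python sketch of the idea.  It is not the TextPad pattern itself; the function name remove_nodes and the sample markup are made up for illustration, and the pattern is just one reasonable way to match a start tag, its end tag, and everything in between, including line breaks.

```python
import re


def remove_nodes(markup: str, tag: str) -> str:
    """Remove every <tag ...>...</tag> block, contents included.

    Hypothetical helper for illustration; the tag name appears twice in
    the pattern (start tag and end tag), as described in the post.
    """
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}\s*>",
        # DOTALL lets "." cross line breaks; IGNORECASE also catches <STYLE>.
        flags=re.DOTALL | re.IGNORECASE,
    )
    return pattern.sub("", markup)


# Made-up sample input, not taken from a real page.
sample = """<html><head>
<style type="text/css">
  p { color: red; }
</style>
</head><body><p>Data</p></body></html>"""

cleaned = remove_nodes(sample, "style")
print(cleaned)
```

The non-greedy `.*?` stops at the first closing tag, so two separate blocks are removed one at a time instead of swallowing everything between them.  That also means this sketch does not handle a tag nested inside another tag of the same name, and it leaves self-closing tags alone; for anything that messy, a real HTML or XML parser is the safer choice.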
