Friday, February 21, 2014

Using a Regular Expression to Search for HTML Nodes

I am a lazy coder and data analyzer.  I use text editing tools to search and replace and manipulate so I can then use a spreadsheet tool to break text into columns and rows.  Then I can extract data or write code that follows patterns without a lot of manual work.  If I am working on code, I can copy and paste it back into my text editor, remove the tabs, and then I have clean code.  Brilliant!

One thing I need to do when scrubbing information that is structured by HTML or XML (or XHTML) is remove extraneous nodes that contain no data.  For instance, I would remove <style> blocks.  Using a tool like TextPad for Windows, you can use the following regular expression to select an entire node (i.e., the start and end tags and everything in between, including line breaks), as long as the text between the tags contains no "<" or ">" characters.
<style\b[^>]*>([^<>]*)</style>
With this regex I can search for the <style> blocks and replace each one with nothing, which removes them.  To remove a different kind of block, just replace the two instances of "style" with the tag you want to find (e.g., "script").
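
If you would rather script this than do it in a text editor, here is a minimal sketch of the same search and replace in Python.  It assumes Python's built-in re module behaves like the editor's regex engine for this pattern; the function name remove_nodes and the sample HTML are made up for illustration and are not part of any tool mentioned above.

```python
import re


def remove_nodes(html: str, tag: str) -> str:
    """Remove every <tag ...>...</tag> block whose body has no '<' or '>'."""
    # Same pattern as in the post, with the tag name filled in twice.
    # re.escape guards against tag names containing regex metacharacters.
    name = re.escape(tag)
    pattern = re.compile(r"<%s\b[^>]*>([^<>]*)</%s>" % (name, name))
    # Replacing each match with an empty string deletes the whole node.
    return pattern.sub("", html)


# Hypothetical sample input: a style block spanning several lines.
sample = (
    '<p>Hello</p>\n'
    '<style type="text/css">\n'
    'body { color: red; }\n'
    '</style>\n'
    '<p>World</p>'
)

cleaned = remove_nodes(sample, "style")
print(cleaned)

# A body that contains '>' (here a CSS child selector) is left untouched,
# because [^<>]* cannot cross an angle bracket.
untouched = remove_nodes("<style>a > b {}</style>", "style")
print(untouched)
```

The negated character class [^<>]* matches line breaks, so multi-line blocks are removed in one pass.  The second call shows the trade-off: a block whose content includes an angle bracket does not match at all, which is worth checking before running this on script blocks.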
