|
This is one of my journal's many "channels." |
|
Over the years I've written this code in a few different languages: take some HTML input, process it according to some of the basic rules a browser would use, and spit out plain text (no tags or HTML entities). By "basic rules a browser would use," I mean that e.g. a series of <p> tags should not result in a long blank gap, but a series of <br /> tags should. Line breaks (r or n) don't matter except within a pre-formatted section. Etc., etc.
My first attempt, I think, was in straight UserTalk. Then i rewrote it a couple of times with regular expressions (still UserTalk) to make it faster. Then a client needed it in a language that could be used on any Mac OS X box, so I rewrote it in Perl, which was faster still (and was much better about converting the HTML entities to UniCode).
The Perl script uses lots of regular expressions, and so makes many passes over the input, changing the text in place. It worked well enough for most HTML, but long documents with a very high ratio of tags-to-text (that is, very tag heavy) would process very slowly. Unfortunately the script was run automatically in the background by a "regular" GUI application, and so the app would seem to freeze up for a little while as it processed one of these pathological cases.
Over the last week I rewrote it again, this time in pure C++. It's a command line tool with the same basic interface that the Perl script had: you can pass it an argument to specify the input file and output files. Omitting either one causes it to use standard input and/or output.
The new tool makes a single pass through the text, doesn't use any regular expressions, and generates slightly better output. Actually, it's more honest to say that it makes three passes through the text: first it converts UTF-8 to UTF-16 (but that's an OS API service), then it processes the UTF-16, then it converts back to UTF-8 (again, just done by the OS) for output.
The timing results speak for themselves. These tests use the worst, most pathological example we had. It's a 200 KB file that's about 90% tags (specifically, it's a long email exchange where everybody top-posted and quoted everything else, and everyone used HTML messages.)
$ time striphtmltags.pl - < ./striphtmltags.input.html > ./striphtmltags.output.txt # old one real 0m20.201s user 0m19.774s sys 0m0.352s $ time newstriphtml < ./striphtmltags.input.html > ./striphtmltags.output.txt # new one real 0m0.048s user 0m0.039s sys 0m0.010s
They wanted it faster. For this worst-case scenario, it's 420 times faster. Zoom.
| November, 2007 | ||||||
| Sun | Mon | Tue | Wed | Thu | Fri | Sat |
| 1 | 2 | 3 | ||||
| 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| 25 | 26 | 27 | 28 | 29 | 30 | |
| Sep Dec | ||||||
|
TruerWords
is Seth Dillingham's personal web site. Read'em and weep, baby. |