
“Faster Code (Converting HTML to Text)”

From: Seth Dillingham In Response To: Top of Thread.  
Date Posted: Monday, November 12, 2007 10:01:09 AM Replies: 0
Enclosures: None.

Over the years I've written this code in a few different languages: take some HTML input, process it according to some of the basic rules a browser would use, and spit out plain text (no tags or HTML entities). By "basic rules a browser would use," I mean, for example, that a series of <p> tags should not produce a long blank gap, but a series of <br /> tags should. Line breaks (\r or \n) don't matter except within a pre-formatted section. Etc., etc.
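One of those browser rules, whitespace handling, can be sketched in C++. This is only an illustration of the rule described above, not the actual code from any of the versions; the function name is hypothetical:

```cpp
#include <cctype>
#include <string>

// Outside a pre-formatted section, a browser collapses any run of
// whitespace (spaces, tabs, \r, \n) into a single space; inside <pre>
// it is preserved verbatim. Hypothetical helper, for illustration only.
std::string collapseWhitespace(const std::string& text, bool inPre) {
    if (inPre) return text;  // preserve formatting verbatim
    std::string out;
    bool pendingSpace = false;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            pendingSpace = !out.empty();  // drop leading whitespace
        } else {
            if (pendingSpace) out.push_back(' ');
            out.push_back(static_cast<char>(c));
            pendingSpace = false;
        }
    }
    return out;
}
```

So `collapseWhitespace("a\r\n  b", false)` yields `"a b"`, while the same input with `inPre` set to true comes back untouched.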

My first attempt, I think, was in straight UserTalk. Then I rewrote it a couple of times with regular expressions (still UserTalk) to make it faster. Then a client needed it in a language that could run on any Mac OS X box, so I rewrote it in Perl, which was faster still (and much better about converting HTML entities to Unicode).

The Perl script uses lots of regular expressions, so it makes many passes over the input, changing the text in place. It worked well enough for most HTML, but long documents with a very high ratio of tags to text (that is, very tag-heavy documents) would process very slowly. Unfortunately, the script was run automatically in the background by a "regular" GUI application, so the app would seem to freeze for a while as it processed one of these pathological cases.

Over the last week I rewrote it again, this time in pure C++. It's a command-line tool with the same basic interface the Perl script had: you can pass it arguments to specify the input and output files. Omitting either one causes it to use standard input and/or output.
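That fall-back-to-stdio convention can be sketched like this; the helper names are hypothetical and this is not the tool's actual argument handling:

```cpp
#include <cstdio>

// Hypothetical helpers illustrating the CLI convention described above:
// argv[1] is an optional input path, argv[2] an optional output path;
// a missing argument falls back to stdin or stdout.
FILE* openInput(int argc, char** argv) {
    return argc > 1 ? std::fopen(argv[1], "rb") : stdin;
}

FILE* openOutput(int argc, char** argv) {
    return argc > 2 ? std::fopen(argv[2], "wb") : stdout;
}
```

With this scheme, `tool in.html out.txt`, `tool in.html > out.txt`, and `tool < in.html > out.txt` all behave the same way, which makes the tool easy to drop into a pipeline.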

The new tool makes a single pass through the text, doesn't use any regular expressions, and generates slightly better output. Actually, it's more honest to say that it makes three passes through the text: first it converts UTF-8 to UTF-16 (but that's an OS API service), then it processes the UTF-16, then it converts back to UTF-8 (again, just done by the OS) for output.
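The core single-pass idea, walking the text once with a small state machine instead of repeated regex substitutions, can be sketched as follows. The real tool operates on UTF-16 and also handles entities, <pre>, <br />, and the rest; this stripped-down version only drops tags:

```cpp
#include <string>

// Minimal single-pass tag stripper: one walk over the input, one bit
// of state (are we inside a tag?). A sketch of the approach, not the
// actual tool; it ignores entities, comments, and quoted '>' in tags.
std::string stripTags(const std::string& html) {
    std::string out;
    bool inTag = false;
    for (char c : html) {
        if (c == '<')      inTag = true;
        else if (c == '>') inTag = false;
        else if (!inTag)   out.push_back(c);
    }
    return out;
}
```

Because each character is examined exactly once, the work is linear in the input size regardless of how tag-heavy the document is, which is exactly where the multi-pass regex version fell down.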

The timing results speak for themselves. These tests use the worst, most pathological example we had: a 200 KB file that's about 90% tags (specifically, a long email exchange where everybody top-posted, quoted everything else, and used HTML messages).

$ time - < ./striphtmltags.input.html > ./striphtmltags.output.txt # old one
real    0m20.201s
user    0m19.774s
sys     0m0.352s
$ time newstriphtml < ./striphtmltags.input.html > ./striphtmltags.output.txt # new one
real    0m0.048s
user    0m0.039s
sys     0m0.010s

They wanted it faster. For this worst-case scenario, it's roughly 420 times faster. Zoom.
