Finding and Activating URLs with Regular Expressions

How To Find and Activate URLs with Regular Expressions

Background

It's funny how one thing leads to another.

When Conversant mails out a plain-text email (versus HTML email), it strips out HTML links and puts the URL in parentheses after the linked text. Then it just strips out all other HTML tags, so that the people on the mailing list don't see HTML tags in their email.

After two and a half years of working this way, it was requested that we change from parens to angle brackets. So, I looked into it. Sounds easy enough, at first...

Except that it's not just about outgoing email. When you reply to one of those messages, it's posted back to the site, which means whatever we send out in the email must translate into valid content for displaying in a web page. When a plain text email comes in to Conversant, the angle brackets (< and >) are converted to entities (&lt; and &gt;).

Even that would be OK, but when URLs are sitting alone on a page (not in a tag), Frontier (on which Conversant is written) tries to activate them like this: http://conversant.macrobyte.net/

Frontier's URL activator didn't realize that it shouldn't include the &gt; at the end of the URL as part of the new link tag, so it ended up creating invalid links.

This led to the decision to stop converting > to the &gt; entity.

Then it was pointed out that there are still many situations where Frontier doesn't recognize valid URLs, or converts text into links when it should not (the most notorious example is the JavaScript comment delimiter, //, which it would convert to <a href="//">//</a>). So perhaps we should shut that feature off entirely, and redo it ourselves.

Not as easy as it sounds. If we're going to do it ourselves, and make it better than what we had, we might as well make it as good as we possibly can, right? It took four regular expressions to make it work exactly the way we wanted.

The Regular Expressions

  • Look for standard, simple URLs not contained by any real delimiters. Require at least two chars after protocol:

    (^|[ \t\r\n])((ftp|http|https|gopher|mailto|news|nntp|telnet|wais|file|prospero|aim|webcal):
    (([A-Za-z0-9$_.+!*(),;/?:@&~=-])|%[A-Fa-f0-9]{2}){2,}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:@&~=%-]*))?([A-Za-z0-9$_+!*();/?:~-]))
  • Stricter compliance to the URL specification, but only looks within angle-bracket delimiters (including entified angle brackets):

    ([<;])((ftp|http|https|gopher|mailto|news|nntp|telnet|wais|file|prospero|aim|webcal):
    [A-Za-z0-9/](([A-Za-z0-9$_.+!*(),;/?:@&~=-])|%[A-Fa-f0-9]{2})+(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:@&~=%-]*))?)>
  • Similar to above, but only find URLs in parens. This is for backwards compatibility, since we had 30 months of messages containing URLs wrapped in parens.

    (\()([A-Za-z][A-Za-z0-9+.-]+:
    [A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2})+(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]*))?)
  • Look for email addresses without the mailto: protocol specified. (The protocol isn't excluded, actually, but those with the protocol were already activated by the earlier expressions, so now they'll be wrapped in <A> tags and won't be caught here).

    (^|[ \t\r\n<;(])((([A-Za-z0-9$_.+%=-])|%[A-Fa-f0-9]{2})+
    @(([A-Za-z0-9$_.+!*,;/?:%&=-])|%[A-Fa-f0-9]{2})+\.[a-zA-Z0-9]{1,4})

(Brian Andresen helped write these regular expressions, originally.)
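
For illustration, here is a minimal sketch of how the first expression might be applied. It is written in Python rather than Frontier, which is an assumption made purely for the example; the two wrapped lines of the pattern are joined into one string, and the `activate` function name is hypothetical.

```python
import re

# First pattern from the list above (undelimited URLs), joined into one string.
SIMPLE_URL = re.compile(
    r"(^|[ \t\r\n])"
    r"((ftp|http|https|gopher|mailto|news|nntp|telnet|wais|file|prospero|aim|webcal):"
    r"(([A-Za-z0-9$_.+!*(),;/?:@&~=-])|%[A-Fa-f0-9]{2}){2,}"
    r"(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:@&~=%-]*))?"
    r"([A-Za-z0-9$_+!*();/?:~-]))"
)


def activate(text):
    """Wrap each undelimited URL in a link tag (hypothetical helper)."""
    # Group 1 is the start of the text or the whitespace before the URL;
    # group 2 is the URL itself. The leading whitespace is put back unchanged.
    return SIMPLE_URL.sub(r'\1<a href="\2">\2</a>', text)


print(activate("See http://conversant.macrobyte.net/ for details"))
# A bare JavaScript comment delimiter has no protocol, so it is left alone.
print(activate("// not a link"))
```

Because the pattern insists on a known protocol followed by a colon, the `//` case that tripped up the built-in activator never matches.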

As I said, it's funny how one thing leads to another. All of this just because we wanted to change from parentheses to angle brackets!

Change History

Oct. 27, 2002

Added ~ to char sets after Duncan Smeed pointed out that the URL-matching expressions didn't include ~, so wouldn't match some common HTTP URL formats.

Apr. 7, 2003

Updated with three changes made since October:

  • URLs in angle bracket or paren delimiters must now begin with a letter. Allowing numbers was too flexible, as it would activate timestamps containing colons.
  • Allow common URLs to be activated that are preceded by a '>' or a ')'
  • Added support for anchor specifiers in the URL. (URL#foo)

Apr. 18, 2003

Allow the first character after the anchor delimiter (#) to be a number. Previously, only a-z and A-Z were allowed. Strictly speaking, this is not correct as anchors are supposed to begin with alpha chars, but common practice (and a bug report from someone for whom I have immense respect) is that anchors often begin with numbers. So be it.

May 5, 2003

Changes to the 'common URLs' detector (the first regular expression):

  • added support for ITMS
  • require at least two characters after "protocol:"
  • restrict which characters can be the last character in the URL (It's OK to be more picky here, as it's only detecting undelimited URLs.) This works around a problem with people ending a sentence with a URL, followed by a period. Now it leaves the extra char out of the URL. This happens a lot, so this is the lesser of two evils (if someone pastes in a URL that ends with a period or comma, and doesn't enclose it in delimiters, the link won't be right, but that's very rare).

July 28, 2004

Email addresses without a protocol (see the last regular expression) must now end with \.[a-zA-Z0-9]{1,4}
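
A rough sketch of what that ending requirement does in practice, using the last pattern exactly as listed above. Python is used only for illustration (it is not what Conversant runs on), and `find_email` is a hypothetical helper name.

```python
import re

# Fourth pattern from the list above (bare email addresses), joined into one string.
BARE_EMAIL = re.compile(
    r"(^|[ \t\r\n<;(])"
    r"((([A-Za-z0-9$_.+%=-])|%[A-Fa-f0-9]{2})+"
    r"@(([A-Za-z0-9$_.+!*,;/?:%&=-])|%[A-Fa-f0-9]{2})+"
    r"\.[a-zA-Z0-9]{1,4})"
)


def find_email(text):
    """Return the first bare address found, or None (hypothetical helper)."""
    # Group 1 is the delimiter before the address; group 2 is the address.
    match = BARE_EMAIL.search(text)
    return match.group(2) if match else None


# The sentence-ending period is not followed by letters or digits,
# so the match backs up and ends at ".com".
print(find_email("Write to someone@example.com."))
# No dot-plus-suffix at the end, so nothing is activated.
print(find_email("ping root@localhost now"))
```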

December 28, 2004

Don't allow the % (percent sign) as the first character of an undelimited, sans-protocol email address (see the last pattern).

February 3, 2005

URLs in delimiters must now begin with a known protocol. Otherwise, we run into trouble with XML namespaces. For example, if the document is XHTML and imports the Dublin Core dtf, <dc:title> would be turned into a link!
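
The namespace problem is easy to reproduce. The sketch below, in Python for illustration only, compares the angle-bracket pattern listed above against a looser variant. `LOOSE` is my own reconstruction, not the expression that was actually in use before this change: it simply borrows the generic "letter, then scheme characters, then colon" prefix that the parenthesized pattern still shows. The names `CHARS`, `ANCHOR`, `KNOWN`, `STRICT`, and `LOOSE` are all hypothetical.

```python
import re

# Shared pieces of the angle-bracket pattern from the list above.
CHARS = r"(([A-Za-z0-9$_.+!*(),;/?:@&~=-])|%[A-Fa-f0-9]{2})+"
ANCHOR = r"(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:@&~=%-]*))?"
KNOWN = r"(ftp|http|https|gopher|mailto|news|nntp|telnet|wais|file|prospero|aim|webcal)"

# Second pattern from the list: only known protocols are accepted.
STRICT = re.compile(r"([<;])(" + KNOWN + r":[A-Za-z0-9/]" + CHARS + ANCHOR + r")>")

# Assumed looser variant: any scheme-like prefix is accepted.
LOOSE = re.compile(r"([<;])([A-Za-z][A-Za-z0-9+.-]+:[A-Za-z0-9/]" + CHARS + ANCHOR + r")>")

tag = "<dc:title>"
print(LOOSE.search(tag) is not None)   # the namespaced tag looks like a URL
print(STRICT.search(tag) is not None)  # the known-protocol list rejects it
print(STRICT.search("<http://example.com/>").group(2))
```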

November 23, 2005

No longer allow a comma as the last character of an undelimited URL. (See last item from changes for May 5, 2003.) When someone ends a phrase with a URL, followed by a comma and the rest of the sentence, the comma was being included in the URL. URLs very rarely end with commas, but people include them in sentences all the time, so this is a "lesser of two evils" change.
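
To see the last-character restriction at work, here is a small sketch that runs the first pattern (as listed above) over sentences ending in punctuation. It uses Python only as a convenient stand-in, and `first_url` is a hypothetical helper.

```python
import re

# First pattern from the list above. The final character class omits
# both the period and the comma, so neither can end an undelimited URL.
SIMPLE_URL = re.compile(
    r"(^|[ \t\r\n])"
    r"((ftp|http|https|gopher|mailto|news|nntp|telnet|wais|file|prospero|aim|webcal):"
    r"(([A-Za-z0-9$_.+!*(),;/?:@&~=-])|%[A-Fa-f0-9]{2}){2,}"
    r"(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*(),;/?:@&~=%-]*))?"
    r"([A-Za-z0-9$_+!*();/?:~-]))"
)


def first_url(text):
    """Return the first undelimited URL in text, or None (hypothetical helper)."""
    match = SIMPLE_URL.search(text)
    return match.group(2) if match else None


# Trailing period is left out of the URL.
print(first_url("Docs are at http://example.com/page."))
# A comma inside the URL survives; the trailing one does not.
print(first_url("Go to http://example.com/a,b, then log in"))
```

Periods and commas are still allowed in the middle of the URL; only the final position is restricted.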

September 28, 2006

Just reformatted the page slightly. No change to content.

Page last updated: 9/28/2006




TruerWords
is Seth Dillingham's
personal web site.
Read'em and weep, baby.