Topic: Character Sets and Conversant (and the Eudora Problem)


Author: Seth Dillingham

Date:4/20/2006

# 5474

Character Sets and Conversant (and the Eudora Problem)

We recently decided to 'standardize' all Conversant text (everything from templates to messages) on UTF-8. I've been quite happy with this decision, as it made it possible (well, easier... it was always possible) to host truly international and multi-lingual sites, and made it a lot easier to deal with content coming in from a variety of sources like Microsoft Word. For example, we no longer bat an eye at ‘fancy’ characters like “curly quotes” or — for another example — long dashes.

This hasn't come without some pain on our end, though. We have to figure out what character set was used for the text being sent when a new message is created. That's supposed to be pretty easy: email, for example, generally includes a special header called "Content-Type" which lists the character set.
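As a rough illustration of reading that header, here is a minimal Python sketch using the standard library's email package. The raw message and the address in it are invented for the example, and this is not a description of how Conversant does it.

```python
from email import message_from_string

# Hypothetical raw message, made up for illustration only.
raw = (
    "From: someone@example.com\n"
    "Subject: test\n"
    'Content-Type: text/html; charset="US-ASCII"\n'
    "\n"
    "<p>Hello</p>\n"
)

msg = message_from_string(raw)

# get_content_charset() returns the charset parameter of the
# Content-Type header in lower case, or None if there isn't one.
declared = msg.get_content_charset()
print(declared)
```

Note that this only tells you what the sender claims. Nothing here checks the claim against the actual bytes of the message.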

The problem is when the software that sent the email lies to us. This is where I'm stumped at the moment.

One of my clients uses Qualcomm's Eudora for all of his mail. He sends HTML messages to his Conversant site, and the messages always contain curly quotes.

Here's the problem: Eudora claims the message's character set is us-ascii. This is the simplest character set in use today, and converting it to UTF-8 should not be a problem... but those of you who have any experience with character sets already know what I'm going to say, right?

US-ASCII (a.k.a. ASCII) doesn't have curly quotes. It's a seven-bit set, so every valid code is 127 or lower, and the character codes for the curly quotes all need the eighth bit. It seems that Eudora must be lying to me about the character set.
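The seven-bit rule is easy to check mechanically. The sketch below lists every byte above 0x7F in a message body. The sample bytes are hypothetical: 0x93 and 0x94 happen to be the curly double quotes in windows-1252, but that choice is an assumption for the example, not a finding about Eudora.

```python
def bytes_outside_ascii(data):
    """Return (offset, byte value) pairs for every byte above 0x7F."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]

# Hypothetical body: 0x93 and 0x94 stand in for curly quotes.
sample = b"He said \x93hi\x94"

offenders = bytes_outside_ascii(sample)
for offset, value in offenders:
    print(f"byte 0x{value:02X} at offset {offset} is not US-ASCII")
```

If the list is non-empty, the body cannot really be US-ASCII, whatever the header says.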

This has plagued me for days, and the client is beginning to think I'm being lazy by blaming his Eudora. (Surely a big company would never do anything so obviously wrong with 'established' software, right?)

Anybody have any suggestions? I haven't had much luck looking for answers in Google. What character set is Eudora really using when it sends "above ASCII" text but claims it's all ASCII? Is it the platform-native character set, like windows-1252 (Windows Latin-1) or iso-8859-1, or something else entirely?


Author: Sean McMains

Date:4/20/2006

# 5475

Re: Character Sets and Conversant (and the Eudora Problem)

Hey Seth,

Sorry for throwing fuel on a fire, but the special characters aren't coming through correctly on Bloglines. I'm subscribed to your RSS feed at http://www.truerwords.net/index/rss, and see things like this:

...we no longer bat an eye at ‘fancy’ characters like “curly quotes” or — for another example — long dashes.

It looks fine on your website -- the problem only appears to manifest in Bloglines.
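For what it's worth, garbling of this shape is what you get when UTF-8 bytes are read one byte at a time as a single-byte character set. The sketch below reproduces it in Python. The choice of windows-1252 as the wrong character set is an assumption that happens to match the sample above, and it says nothing certain about what Bloglines or the feed is actually doing.

```python
def misread_utf8(text, wrong="windows-1252"):
    """Encode text as UTF-8, then decode the bytes as a single-byte charset."""
    return text.encode("utf-8").decode(wrong)

# Curly single quotes and a long dash, written as escapes.
print(misread_utf8("\u2018fancy\u2019"))
print(misread_utf8("\u2014"))
```

Each curly quote is three bytes in UTF-8, so each one turns into three unrelated characters.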

I don't know if that's a problem with the content type of your feed, with Bloglines, or something else, but I thought I'd bring it to your attention, anyway.

As far as the issue you mention goes, I'd suggest doing some experiments on the text that this guy sends to determine what the encoding really is. Once you've got that, it should be possible to scan the message text, see if it has any characters outside of the range for the indicated text encoding, and if it does, override the indicated encoding with the actual encoding.
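Something like the following sketch, in Python, is what I have in mind. The function name and the list of fallback encodings are made up for illustration, and the fallback order is a guess, since we don't yet know what Eudora is really sending.

```python
def decode_body(data, declared, fallbacks=("utf-8", "windows-1252")):
    """Decode with the declared charset if the bytes fit it;
    otherwise try each fallback in turn.
    Returns (text, name of the charset that worked)."""
    try:
        return data.decode(declared or "us-ascii"), declared or "us-ascii"
    except (UnicodeDecodeError, LookupError):
        pass  # bytes don't fit the declared charset, or it's unknown
    for name in fallbacks:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails,
    # though the result may not be what the sender meant.
    return data.decode("latin-1"), "latin-1"

# Hypothetical body that claims to be us-ascii but isn't.
text, used = decode_body(b"\x93hi\x94", "us-ascii")
utf8_bytes = text.encode("utf-8")  # normalized for storage
print(used, utf8_bytes)
```

Trying UTF-8 before any single-byte set is deliberate: arbitrary 8-bit text rarely passes as valid UTF-8, while a single-byte decoder will accept almost anything.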

Alternatively, you could email the Eudora folks and ask them about it (and whether they have plans to fix their problem). If they'll acknowledge the issue, at least, it should reassure your client that you're not just a lazy bum. :)

Sean


Author: Seth Dillingham

Date:4/20/2006

# 5476

RE: Character Sets and Conversant (and the Eudora Problem)

A bunch of people have pointed out that the RSS from my site looks funny now. Thanks, I got it, I'll fix it.

Seth


Author: Seth Dillingham

Date:4/21/2006

# 5477

Character Sets "Oops"

After my last post about switching to UTF-8 for all of our content in Conversant, people wrote to say that my XML feed had problems since we made the change.

Heh. That's ironic.

Anyway, some wrote to me privately, as if a bug is something I'd be embarrassed about. "Oh no! A bug in the software! Say it isn't so!" All this time I thought my software was bug-free, like everybody else's. (What matters is how we deal with them. They're inevitable.)

Well, now the bug should be fixed. All "above ASCII" text in the XML output was being automatically converted to numeric entities (like &#234;), but that's no longer necessary when the output is UTF-8.
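To show the difference between the two kinds of output, here is a small Python sketch. It is only an illustration of the general idea, not the code Conversant uses. The sample string is written with escapes so the example itself stays plain ASCII.

```python
# Curly double quotes and a long dash.
text = "\u201ccurly quotes\u201d \u2014 and long dashes"

# ASCII-only output: every character above ASCII becomes a
# numeric entity, written as an ampersand, '#', the code, ';'.
as_entities = text.encode("ascii", "xmlcharrefreplace")

# UTF-8 output: the characters are written directly as bytes,
# so no entity conversion is needed.
as_utf8 = text.encode("utf-8")

print(as_entities)
print(as_utf8)
```

Both forms decode back to the same text as long as the conversion is applied to characters rather than to individual bytes of already-encoded text.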

In other words, my “curly quotes” — and long dashes — should all look fine now. Feel free to give me a shout if something looks wrong, though.



TruerWords
is Seth Dillingham's
personal web site.
More than the sum of my parts.