“RE: The Incredible Non-Uniqueness of an MD5 Hash”
From: Seth Dillingham
In Response To: 1747 Re: The Incredible Non-Uniqueness of an MD5 Hash
Date Posted: Wednesday, March 6, 2002 12:49:09 AM
Replies: 1
Enclosures: None.
On 3/5/02, Greg Pierce said:
>You do have a bit of a catch-22 here, because two URLs could
>refer to the exact same page, if you're including args in
>your equation...ie,
>
>http://[host]/[path]?arg1=x&arg2=y
>
>is the programmatically the same thing as:
>
>http://[host]/[path]?arg2=y&arg1=x
>
>They would, however, generated different hashes...
That's not a catch-22, actually.
I didn't say that the object referred to by the ID had to be unique, I said only that the ID itself had to be unique.
This is a sort of web page indexing system. If more than one URL refers to exactly the same page, it's up to the crawler to determine that.
The reason I needed to use a hashing algorithm is that some URLs are too long for Frontier's table names. The names are limited to 255 characters, but (in spite of what others have said) there is no official limit to the length of a WWW URL.
Using the MD5 hash allows me to squeeze a URL of any length into a unique-in-practice string of 32 characters, which is *perfect* for my needs.
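As a rough illustration of the idea, written in Python rather than Frontier's UserTalk: the helper name `url_to_id`, the UTF-8 encoding, and the example URL are assumptions made for this sketch, not part of the original code.

```python
import hashlib

def url_to_id(url):
    """Hypothetical helper: map a URL of any length to a 32-character hex ID."""
    # MD5 operates on bytes, so the URL is encoded first (UTF-8 is an assumption).
    return hashlib.md5(url.encode("utf-8")).hexdigest()

# A made-up URL far longer than the 255-character table-name limit.
long_url = "http://example.com/page?" + "&".join("arg%d=x" % i for i in range(200))
short_id = url_to_id(long_url)
print(len(long_url), len(short_id))
```

Whatever the length of the input, the ID is always 32 hex characters, so it fits comfortably inside a table name.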
As I mentioned in the original piece, the effectively random distribution of the characters in the MD5 also allows me to distribute the items evenly over a space of virtually any size. (Instead of a single table containing items for each ID, I'll have a table with 16 subtables (one for each hex digit, 0-9 and a-f), and each of those will also have 16 subtables. If it's determined that more depth is needed, we could easily take it another level deep, and another, and another, etc.)
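The nested layout could be sketched like this, in Python rather than Frontier's UserTalk. The function name `bucket_path` and the choice to use the leading hex digits as subtable names are assumptions made for illustration; the post does not say which characters of the hash are used.

```python
import hashlib

def bucket_path(url, depth=2):
    """Hypothetical helper: subtable names for a URL, one hex digit per level,
    followed by the full 32-character ID as the item name."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # Each hex digit takes one of 16 values (0-9, a-f), so every level
    # fans out into 16 subtables; depth=2 gives 16 * 16 = 256 buckets.
    return list(digest[:depth]) + [digest]

path = bucket_path("http://example.com/")
print(".".join(path))
```

Going another level deep is just a matter of raising `depth`; each extra level multiplies the number of buckets by 16.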
Wow, I'm amazed at the amount of traffic this subject generated. I don't even think anybody else pointed to it! Thank you for the discussion, everybody.
Seth
TruerWords is Seth Dillingham's personal web site. Truer words were never spoken.