In this series of blog posts — which started on the Mimecast site before migrating here — I’ve been writing primarily about the technical complexities that make email a much more interesting business than it seems at first glance. But some of the most daunting complications are not technical; email needs to support the complexity of social interaction in general.
A straightforward example of this is the MIME protocol, which I co-designed twenty years ago. Some of the complexities in MIME (such as 7-bit encodings or multipart boundaries) might be called “contingent” or even “fundamentally unnecessary” because they exist only for backwards compatibility with the pre-MIME email world. If one redesigned email from scratch, these things would probably go away. However, much of the complexity of MIME comes from the fact that people want to be able to communicate a wide range of information. It needs to represent text, images, sounds, video, and so on, to the point where there are now over a thousand registered MIME types, each of which needs to be handled differently when displayed to the user. The world of MIME types is complicated not because we failed to make it simpler, but because human communication requires vast numbers of data types.
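For readers who have never looked at a raw message, here is a minimal sketch of both kinds of complexity, using Python's standard `email` package. The subject, body text, filename, and the eight bytes standing in for an image are all made up for illustration; the point is only that a non-ASCII sentence and a binary attachment both end up as pure 7-bit ASCII, split into typed parts by a boundary string.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Hypothetical message: one text part with a non-ASCII character,
# plus a tiny binary attachment (just the 8-byte PNG signature).
msg = EmailMessage()
msg["Subject"] = "Menu"
msg.set_content("Caf\u00e9 menu attached.", cte="quoted-printable")
png_stub = b"\x89PNG\r\n\x1a\n"
msg.add_attachment(png_stub, maintype="image", subtype="png",
                   filename="menu.png")

# Serialize to the form that travels over the wire.
wire = msg.as_bytes()

# Backwards compatibility: every byte is 7-bit ASCII, because the
# text was quoted-printable encoded and the attachment base64 encoded.
print(wire.isascii())

# Parsing reverses the process: the boundary splits the message into
# parts, and each part carries its own MIME type.
parsed = message_from_bytes(wire, policy=policy.default)
part_types = [part.get_content_type() for part in parsed.iter_parts()]
print(part_types)
```

Printing `wire.decode("ascii")` shows the boundary lines and the encoded bodies directly, which is the "contingent" machinery; the list of part types is the part that would survive a redesign.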
However, the single most complicated aspect of email (or of any other computer-mediated communication, although email always seems to wrestle with the problems first) is the lingering effect of the Tower of Babel. There are an estimated 6,700 languages in the world, and even though thousands are in the process of dying out, that still leaves thousands to support. People want to be able to send and receive email in their own languages, and this leads to staggering complexity.
Languages vary impressively. Most Western languages are written from left to right, but others such as Arabic and Hebrew go from right to left, and some Asian languages can be written from top to bottom. English speakers tend to think of text as a simple series of characters, but in other languages there are special marks that need to be added above or underneath some letters. In some languages, the same character has different representations in different contexts, for example when the letter is at the end of a word. Increasingly, non-English speaking people are, understandably, demanding the ability to represent their languages “properly.”
The real-world diversity of languages and their scripts is further complicated by the introduction, in the world of computers, of the notion of “character sets.” Contrary to casual assumption, character sets and languages do not map onto each other simply. There are dozens of character sets in which English can be represented, for example, and there are dozens of character sets that can represent more than one language. Worse still, there are “character sets” in common use that do not come from any standards body or process, but reflect the unilateral choices of a single vendor.
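A small Python sketch makes the mismatch concrete. The codec names below are the ones Python's standard library happens to ship, chosen only as examples: `cp037` is an EBCDIC code page, `koi8-r` is a Cyrillic set, and `cp1252` is the Windows code page, used here as one instance of a vendor-defined set.

```python
# The same English word in several character sets. The ASCII-compatible
# ones agree byte for byte; the EBCDIC code page does not.
english = {name: "Hello".encode(name)
           for name in ("ascii", "latin-1", "cp1252", "utf-8", "cp037")}
for name, raw in english.items():
    print(f"{name:8} {raw.hex(' ')}")

# The same single byte read under three different character sets
# yields three different characters.
one_byte = b"\xe9"
readings = {name: one_byte.decode(name)
            for name in ("latin-1", "cp437", "koi8-r")}
print(readings)

# A vendor-defined set: in cp1252 the byte 0x93 is a curly opening
# quote, while in ISO 8859-1 (latin-1) the same byte is a control code.
vendor_quote = b"\x93".decode("cp1252")
iso_reading = b"\x93".decode("latin-1")
print(repr(vendor_quote), repr(iso_reading))
```

Nothing in the bytes themselves says which reading is correct. That is why a message has to label its character set, and why a wrong or missing label turns text into gibberish.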
Some of you are tapping your fingers impatiently at this point: why am I nattering about character sets now that we have Unicode, a single, huge character set that is intended to supersede all of them? Al