The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the incorrect conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.

But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:

IT'S NOT THAT HARD.

In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only incorrect, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I'm really just trying to set a minimum bar here so that everyone can understand what's going on and can write code that has a hope of working with text in any language other than the subset of English that doesn't include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time, so today it's character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
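
Just to make that spare-bit business concrete, here is a tiny C sketch (not actual WordStar code, obviously) of what "borrowing" the eighth bit looks like:

    #include <stdio.h>

    int main(void)
    {
        unsigned char c = 'A';            /* ASCII 65: fits comfortably in 7 bits  */
        unsigned char marked = c | 0x80;  /* 193: same letter, high bit "borrowed" */
        printf("%d %d %d\n", c, marked, marked & 0x7F);   /* prints: 65 193 65     */
        return 0;
    }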

And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact, as soon as people started buying PCs outside of America, all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set" in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but darn near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
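
To see why forward was easy and backward was hopeless, here is a minimal sketch of DBCS iteration, with a hard-coded Shift-JIS-style lead-byte test standing in for a proper code-page-aware call such as Windows' IsDBCSLeadByte:

    #include <stdbool.h>

    /* Shift-JIS-style lead-byte ranges, hard-coded for illustration only;
       real code would ask the current code page (e.g. IsDBCSLeadByte on Windows). */
    static bool is_lead_byte(unsigned char c)
    {
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    }

    /* Stepping forward is easy: look at one byte and skip 1 or 2.
       Stepping backward has no such trick; you'd have to rescan from the start. */
    static const char *dbcs_next(const char *s)
    {
        return s + (is_lead_byte((unsigned char)*s) ? 2 : 1);
    }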

But still, most people just pretended that a byte was a character and a character was 8 bits, and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It's just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or by visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define, and in fact they have gone beyond 65,536, so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.

Encodings

That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark, and if you are swapping your high and low bytes it will look like FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
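
Here's a minimal sketch of what a "high-endian" UCS-2 writer does, byte order mark included; a low-endian writer would emit FF FE and swap each pair of bytes (the output file name is just for the example):

    #include <stdio.h>

    int main(void)
    {
        const unsigned short units[] = { 0xFEFF,                  /* byte order mark */
                                         0x0048, 0x0065, 0x006C,  /* H e l */
                                         0x006C, 0x006F };        /* l o   */
        FILE *f = fopen("hello-ucs2.txt", "wb");
        if (!f) return 1;
        for (size_t i = 0; i < sizeof units / sizeof units[0]; i++) {
            fputc(units[i] >> 8, f);     /* high byte first: "high-endian" order */
            fputc(units[i] & 0xFF, f);
        }
        fclose(f);   /* file contains: FE FF 00 48 00 65 00 6C 00 6C 00 6F */
        return 0;
    }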

For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the thought of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who's going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. But the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings.)
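
If you're curious what the hoop-jumping actually looks like, here is a minimal, non-validating sketch of a UTF-8 encoder for a single code point. (Modern UTF-8 tops out at 4 bytes, for code points up to U+10FFFF, even though the original design allowed up to 6.)

    #include <stdio.h>

    /* Encode one Unicode code point as UTF-8; returns the byte count (1-4).
       Code points 0-127 come out as one byte, identical to ASCII. */
    static int utf8_encode(unsigned long cp, unsigned char *out)
    {
        if (cp < 0x80) {
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = utf8_encode(0x0639, buf);        /* U+0639, the Arabic letter Ain */
        for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
        printf("\n");                            /* prints: D8 B9 */
        return 0;
    }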

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �
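
A minimal sketch of that catch, using ISO-8859-1 (Latin-1) as the target because its 256 codes happen to be exactly the first 256 Unicode code points; on Windows the comparable real API is WideCharToMultiByte, which likewise substitutes a default character (usually ?) for anything the target code page can't represent:

    #include <stddef.h>

    /* Convert Unicode code points to ISO-8859-1 (Latin-1). Anything above
       U+00FF has no equivalent in Latin-1 and degrades to a question mark.
       U+0048 U+0065 U+006C U+006C U+006F survives intact;
       U+05D2 (the Hebrew letter Gimel) comes out as '?'. */
    static void to_latin1(const unsigned long *cps, size_t n, unsigned char *out)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = (cps[i] <= 0xFF) ? (unsigned char)cps[i] : '?';
    }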

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF-7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages, all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section, because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency with which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day they write something that doesn't exactly conform to the letter-frequency distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don't.

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it, as so: L"Hello".
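
A tiny illustration of that convention (my own sketch, not CityDesk code); note that wchar_t happens to be 16 bits on Windows, which is what makes it line up with UCS-2 there, while on most Unix systems it is 32 bits:

    #include <wchar.h>

    int main(void)
    {
        wchar_t greeting[32];
        wcscpy(greeting, L"Hello, ");            /* wcscpy instead of strcpy */
        wcscat(greeting, L"world");              /* wcscat instead of strcat */
        wprintf(L"%ls is %zu characters long\n", /* wcslen instead of strlen */
                greeting, wcslen(greeting));
        return 0;
    }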

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That's the way all 29 language versions of Joel on Software are encoded, and I have not yet heard of a single person who has had any trouble viewing them.

This article is getting rather long, and I can't possibly cover everything there is to know about character encodings and Unicode, but I hope that if you've read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.


Source: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
