Jukka Korpela,
Helsinki University of Technology (HUT), Computing Centre, Finland
Note: this article is extracted from a more detailed document
available on the Web (at URL http://www.hut.fi/u/jkorpela/chars.html).
As it contains many references which are links
to other Web pages it might be useful to browse it from the Web.
N.B. All references with links have been underlined.
This article tries to clarify the concepts of character repertoire, character code, and character encoding especially in the Internet context. It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, ISO 8859 family (ISO Latin, especially ISO Latin 1) and MIME are used as examples. The document in itself does not contain solutions to practical problems with character codes; rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.
In computers and in data transmission between them, i.e. in digital data processing and transfer, data is as a rule internally presented as octets. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here.
Different conventions can be established regarding how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit which presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.
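A minimal Python sketch of the two points above: one octet value written in decimal, octal, and hexadecimal, and four octets interpreted as a real number. The choice of IEEE 754 single precision in big-endian order is an assumption made for illustration; the text does not name a specific standard.

```python
import struct

# One octet with the value 196, written in three notations.
octet = 196
notations = (str(octet), oct(octet), hex(octet))  # decimal, octal, hexadecimal

# A sequence of octets is just a sequence of small integers (0..255).
data = bytes([63, 128, 0, 0])

# Interpreted as a big-endian IEEE 754 single-precision number
# (an assumed convention), these four octets present the real number 1.0.
(value,) = struct.unpack(">f", data)

print(notations)
print(list(data), value)
```

The same four octets could equally be read as four characters or as one integer; nothing in the octets themselves says which convention applies.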
In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers.
Previously ASCII encoding was usually assumed by default (and it is still widely used). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.
The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is often confusing.
Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:
The phrase character set is used in a variety of meanings. Often it denotes just a character repertoire, but it may also refer to a character code or even to a character encoding. See, for example, OII's document Character Set Standards, which mentions several standards, some of which define a character code while others also specify a fixed encoding.
Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire.
The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.
In fact, the definition of ASCII also defines a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):
ASCII Printable characters
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
There are actually several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.
The international standard ISO 646 defines a character set similar to
US-ASCII but with the code positions corresponding to US-ASCII
characters @[\]{|} designated as "national use positions". It also
gives some liberties with characters #$^`~. The standard also defines
an "international reference version (IRV)", which is (in the 1991
edition of ISO 646) identical to US-ASCII.
The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.
dec | oct | hex | glyph | official Unicode name | National variants |
---|---|---|---|---|---|
35 | 43 | 23 | # | number sign | # £ Ù |
36 | 44 | 24 | $ | dollar sign | ¤ |
64 | 100 | 40 | @ | commercial at | É § Ä à ³ |
91 | 133 | 5B | [ | left square bracket | Ä Æ ° â ¡ ÿ é |
92 | 134 | 5C | \ | reverse solidus | Ö Ø ç Ñ ½ ¥ |
93 | 135 | 5D | ] | right square bracket | Å Ü § ê é ¿ |
94 | 136 | 5E | ^ | circumflex accent | Ü î |
95 | 137 | 5F | _ | low line | è |
96 | 140 | 60 | ` | grave accent | é ä µ ô ù |
123 | 173 | 7B | { | left curly bracket | ä æ é à ° ¨ |
124 | 174 | 7C | \| | vertical line | ö ø ù ò ñ f |
125 | 175 | 7D | } | right curly bracket | å ü è ç ¼ |
126 | 176 | 7E | ~ | tilde | ü ¯ ß ¨ û ì ´ |
Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Many systems which support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is not always portable from one system or application to another.
More information about national variants and their impact:
The character code defined by the ASCII standard is the
following: code values are assigned to characters consecutively in
the order in which the characters are listed in the "ASCII Printable
characters" table above (rowwise),
starting from 32 (assigned to the blank) and ending up with 126
(assigned to the tilde character ~).
The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value. Notice that code numbers from 0 to 31 and code numbers from 127 upwards do not correspond to any printable character. (Code numbers 0 - 31 and 127 are reserved for control purposes. They have standardized names, but in fact their usage varies a lot.) Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
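The relation between the ASCII code and the ASCII encoding can be sketched in Python as follows. This is only an illustration using Python's built-in `ascii` codec; the variable names are arbitrary.

```python
# The printable ASCII characters occupy code numbers 32 (space) to 126 (tilde).
printable = "".join(chr(code) for code in range(32, 127))

# The code number of a character, and its encoding as a single octet
# with the same value.
code_of_A = ord("A")             # 65
octets = "A~".encode("ascii")    # octets 65 and 126

# A character outside the ASCII repertoire cannot be encoded in ASCII.
try:
    "Ä".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(len(printable), code_of_A, list(octets), encodable)
```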
Sometimes the phrase "8-bit ASCII" is used. It is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.
The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.
In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:
ISO Latin 1 Printable characters
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿ À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Notes:
Typing characters on a computer may appear deceptively simple: you press a key labeled "A", and the character "A" appears on the screen. You also expect "A" to be included into a disk file when you save what you are typing, you expect "A" to appear on paper if you print your text, and you expect "A" to be sent if you send your product by E-mail or something like that.
Thus far, you should have learned that the presentation of a character in computer storage or on disk or in data transfer may vary a lot. You have probably realized that especially if it's not the common "A" but something more special (say, an "A" with an accent), strange things might happen, especially if data is not accompanied by adequate information about its encoding.
But you might still be too confident. You probably expect that on your system at least things are simpler than that. If you use your very own very personal computer and press the key labeled "A" on its keyboard, then shouldn't it be evident that in its storage and processor, on its disk, on its screen it's invariably "A"? Can't you just ignore its internal character code and character encoding? Well, probably yes - with "A". I wouldn't be so sure about "Ä", for instance.
When you press a key on your keyboard, then what actually happens is this. The keyboard sends the code of a character to the processor. The processor then, in addition to storing the data internally somewhere, normally sends it to the display device. Now, the keyboard settings and the display settings might be different from what you expect. Even if a key is labeled "Ä", it might send something else than the code of "Ä" in the character code used in your computer. Similarly, the display device, upon receiving such a code, might be set to display something different. Such mismatches are usually undesirable, but they are definitely possible.
Moreover, there are often keyboard restrictions. If your computer internally uses, say, the ISO Latin 1 character repertoire, you probably won't find keys for all 191 characters in it on your keyboard.
Therefore, you often need program or operating system specific ways of entering characters from a keyboard, either because there is no key for a character you need or there is but it does not work (properly). Two important examples of such ways:
It is often possible to use various "escape"
notations for characters. This somewhat vague term means
notations which are converted to characters according to some
specific rules by some programs. For example, in the HTML
language one can use the notation &Auml; or,
equivalently, the notation &#196; for the
character Ä. To take another example, in the C programming
language, one can usually write \304 (an octal escape) to denote
Ä within a string constant, although this makes the program
character code dependent. In cases like these, the character itself
does not occur in a file (such as an HTML document or a C source
program); instead, the file contains the "escape" notation as a
character sequence, which will then be interpreted in a
specific way by programs like a Web browser or a C compiler.
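As a hedged illustration, the following Python sketch resolves the HTML notations `&Auml;` and `&#196;` with the standard `html` module and shows an octal escape in a string constant. Python is used here in place of C; its octal escape syntax in string literals happens to be similar.

```python
import html

# HTML: the entity reference and the numeric character reference
# both denote the character Ä (code number 196).
from_entity = html.unescape("&Auml;")
from_numeric = html.unescape("&#196;")

# Python, like C, accepts an octal escape in a string constant;
# octal 304 is decimal 196.
from_octal = "\304"

print(from_entity, from_numeric, from_octal, ord(from_octal))
```

In each case the source text contains only ASCII characters; the program that reads the notation produces the character.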
It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated.
Naturally, such a sequence could be intended to present other than character data, too; it could be an image in a bitmap format, or a computer program in binary form, or numeric data in the internal format used in computers.
This problem can be handled in different ways in different systems when data is stored and processed within one computer system. For data transmission, a platform-independent method of specifying the encoding and other relevant information is needed. Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm. Attaching a human-readable note, such as a few words of explanation in an E-mail message body, is better than nothing. But since data is processed by programs which cannot understand such notes, the encoding should be specified in a standardized computer-readable form.
Internet media types, often
called MIME media types, can be used to specify a
major media type ("top level media type", such as text),
a subtype (such as html), and an
encoding (such as iso-8859-1).
They were originally developed to allow sending other than plain ASCII data by E-mail. They can be (and should be)
used for specifying the encoding when data is sent over a network,
e.g. by E-mail or using the HTTP protocol on the World Wide Web.
The media type concept is defined in RFC 2046. The procedure for registering types is given in RFC 2048. For less authoritative but more readably presented information on media types, see e.g. the document MIME Types by Chris Herborth. The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset".
Specifically, when data is sent in MIME format, the media type
and encoding are specified in a manner illustrated by the following
example:
Content-Type: text/html; charset=iso-8859-1
This specifies, in addition to saying that the media type is
text and the subtype is html, that the character
encoding is ISO 8859-1.
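For illustration, receiving software could pick these three pieces of information out of the header roughly as follows. This is a sketch using Python's standard `email` package on a made-up message with an empty body; it is not meant to describe how any particular mail program or browser works.

```python
from email import message_from_string

# A message consisting of one header and an empty body.
raw = "Content-Type: text/html; charset=iso-8859-1\n\n"
msg = message_from_string(raw)

maintype = msg.get_content_maintype()   # top level media type
subtype = msg.get_content_subtype()     # subtype
charset = msg.get_content_charset()     # the "charset" parameter, lowercased

print(maintype, subtype, charset)
```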
Several character encodings have alternate names in the registry. For example, the basic (ISO 646) variant of ASCII can be called "ASCII" or "ANSI_X3.4-1968" or "cp367" (plus a few other names); the preferred name in MIME context is, according to the registry, "US-ASCII". Similarly, ISO 8859-1 has several names, the preferred MIME name being "ISO-8859-1". The "native" encoding for Unicode, UCS-2, is named "ISO-10646-UCS-2" there.
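Many programming environments keep a similar table of alternate names. The sketch below uses Python's `codecs` module, whose alias table is Python's own and not the registry mentioned above; the name `latin1` is an additional alias that Python accepts and is not taken from this article.

```python
import codecs

# Alternate names; codecs.lookup() resolves each of them to a codec.
ascii_names = ["US-ASCII", "ASCII", "ANSI_X3.4-1968", "cp367"]
latin1_names = ["ISO-8859-1", "latin1"]

ascii_codecs = {codecs.lookup(name).name for name in ascii_names}
latin1_codecs = {codecs.lookup(name).name for name in latin1_names}

# Each group of names resolves to one and the same codec.
print(ascii_codecs, latin1_codecs)
```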
The Content-Type information is
an example of information in a header. Headers
relate to some data, describing its presentation and other
things, but are passed as logically separate from it. Possible
headers and their contents are defined in the basic MIME
specification, RFC 2045. Adequate headers should normally be generated
automatically by the software which sends the data (such as a
program for sending E-mail, or a Web server) and interpreted
automatically by receiving software (such as a program for reading
E-mail, or a Web browser). In E-mail messages, headers precede the
message body; it depends on the E-mail program whether and how it
displays the headers. For Web documents, a Web server is required
to send headers when it delivers a document to a browser (or other
user agent) which has sent a request for the document.
Basically, MIME should let people communicate smoothly without hindrances caused by character code and encoding differences. MIME should handle the necessary conversions automatically and invisibly.
For example, when person A sends E-mail to person B, the following should happen: The E-mail program used by A encodes A's message in some particular manner, probably according to some convention which is normal on the system where the program is used (such as ISO 8859-1 encoding on a typical modern Unix system). The program automatically includes information about this encoding into an E-mail header, which is usually invisible both when sending and when reading the message. The message, with the headers, is then delivered, through network connections, to B's system. When B uses his E-mail program (which may be very different from A's) to read the message, the program should automatically pick up the information about the encoding as specified in a header and interpret the message body according to it. For example, if B is using a Macintosh computer, the program would automatically convert the message into Mac's internal character encoding and only then display it. Thus, if the message was ISO 8859-1 encoded and contained the Ä (upper case A with dieresis) character, encoded as octet 196, the E-mail program used on the Mac should use a conversion table to map this to octet 128, which is the encoding for Ä on Mac. (If the program fails to do such a conversion, strange things will happen. ASCII characters would be displayed correctly, since they have the same codes in both encodings, but instead of Ä, the character corresponding to octet 196 in Mac encoding would appear - a symbol which looks like f in italics.)
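The conversion described in this example, and the effect of omitting it, can be sketched in Python. The codec names `latin-1` and `mac_roman` are Python's names for ISO 8859-1 and the classic Macintosh encoding; treating `mac_roman` as "the Mac encoding" of the text is an assumption.

```python
text = "Ä"

latin1_octets = text.encode("latin-1")    # octet 196
mac_octets = text.encode("mac_roman")     # octet 128

# Correct handling: decode with the sender's encoding, re-encode for the Mac.
converted = latin1_octets.decode("latin-1").encode("mac_roman")

# Faulty handling: the ISO 8859-1 octet is shown as if it were Mac encoded.
misread = latin1_octets.decode("mac_roman")

# ASCII characters have the same octet in both encodings.
same_for_ascii = "A".encode("latin-1") == "A".encode("mac_roman")

print(list(latin1_octets), list(mac_octets), list(converted), misread)
```

Here `misread` is the character U+0192 (Latin small letter f with hook), which matches the italic-looking f mentioned above.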
Unfortunately, there are deficiencies and errors in software so that users often have to struggle with character code conversion problems, perhaps correcting the actions taken by programs.
A typical minor (!) problem which may occur in communication in Western European languages other than English is that most characters get interpreted and displayed correctly but some "national letters" don't. For example, the character repertoire needed in German, Swedish, and Finnish is essentially ASCII plus a few letters like "ä" from the rest of ISO Latin 1. If a text in such a language is processed so that a necessary conversion is not applied, or an incorrect conversion is applied, the result might be that e.g. the word "später" becomes "spter" or "spÌter" or "spdter" or "sp=E4ter".
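Three of these corrupted forms can be reproduced with a short Python sketch. The explanations given in the comments are plausible mechanisms, assumed here for illustration, and not a diagnosis of any particular program; the form "spÌter" is left out because it depends on which pair of encodings gets confused.

```python
import quopri

octets = "später".encode("latin-1")   # ä is octet 228 (hex E4)

# The non-ASCII octet is silently dropped.
dropped = octets.decode("ascii", errors="ignore")

# The most significant bit is stripped: 228 - 128 = 100, the code of "d".
stripped = bytes(b & 0x7F for b in octets).decode("ascii")

# The quoted-printable transfer encoding is shown without being decoded.
quoted = quopri.encodestring(octets).decode("ascii")

print(dropped, stripped, quoted)
```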
To illustrate what may happen when text is sent in a grossly
invalid form, consider the following example. I'm sending myself
E-mail, using Netscape 4.0 (on Windows 95). In the mail composition
window, I set the encoding to UTF-8. The body of
my message is simply
Tämä on testi.
(That's Finnish for 'This is a test'. The second and fourth
characters are the letter a with umlaut.) Trying to read the mail on my
Unix account, using the Pine E-mail program (popular among Unix
users), I see the following (when in "full headers" mode;
irrelevant headers omitted here):
X-Mailer: Mozilla 4.0 [en] (Win95; I)
MIME-Version: 1.0
To: Jukka.Korpela@hut.fi
Subject: Test
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7
Content-Transfer-Encoding: 7bit

[The following text is in the "x-UNICODE-2-0-UTF-7" character set]
[Your display is set for the "ISO-8859-1" character set]
[Some characters may be displayed incorrectly]

T+O6Q- on testi.
Interesting, isn't it? I specifically requested
UTF-8 encoding, but Netscape used UTF-7. And it did not include a correct
header, since x-UNICODE-2-0-UTF-7
is not a registered "charset" name. Even if the encoding
had been a registered one, there would have been no guarantee that
my E-mail program would have been able to handle the encoding. The
example, "T+O6Q-" instead of "Tämä", illustrates what may
happen when an octet sequence is interpreted according to another
encoding than the intended one. In fact, it is difficult to say
what Netscape was really doing, since it seems to encode
incorrectly.
A correct UTF-7 encoding for "Tämä" would be "T+AOQ-m+AOQ-". The "+" and "-" characters correspond to octets indicating a switch to "shifted encoding" and back from it. The shifted encoding is based on presenting Unicode values first as 16-bit binary integers, then regrouping the bits and presenting the resulting six-bit groups as octets according to a table specified in RFC 2045 in the section on Base64. See also RFC 2152.
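The regrouping can be followed step by step in a small Python sketch. It builds the shifted sequence by hand from the 16-bit value and the Base64 table, and then uses Python's own `utf-7` decoder only as a cross-check; the variable names are arbitrary.

```python
import base64

# The Unicode value of ä as a 16-bit big-endian integer: octets 0 and 228.
sixteen_bit = "ä".encode("utf-16-be")

# Regroup the 16 bits into six-bit groups using the Base64 table,
# then drop the "=" padding, which UTF-7 does not use.
shifted = base64.b64encode(sixteen_bit).decode("ascii").rstrip("=")

# "+" switches to the shifted encoding and "-" switches back.
encoded = "T" + "+" + shifted + "-" + "m" + "+" + shifted + "-"

# Python's UTF-7 decoder reads this back as the original text.
decoded = encoded.encode("ascii").decode("utf-7")

print(shifted, encoded, decoded)
```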
Whenever text data is sent over a network, the sender and the recipient should have a joint agreement on the character encoding used. In the optimal case, this is handled by the software automatically, but in reality the users need to take some precautions.
Most importantly, make sure that any Internet software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: the header must be there and it must reflect the actual encoding used; and the encoding used must be one that is widely understood by the (potential) recipients' software. (One must often make compromises as regards the latter aim: you may need to use an encoding which is not yet widely supported to get your message through at all.)
Learn to use your Web browser, newsreader, and E-mail program so
that you can retrieve the encoding information for the page,
article, or message you are reading. (For example, on Netscape use
View Page Info; on News Xpress, use View Raw Format; on Pine, use h.)
If you use, say, Netscape to send E-mail or to post to Usenet news, make sure it sends the message in a reasonable form. In particular, make sure it does not duplicate the message by sending it both as plain text and as HTML (turn off the latter). As regards character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.
In particular, avoid sending data in a proprietary encoding (like the Macintosh encoding or a DOS encoding) to a public network. (At the very least, if you do that, make sure that the message headers specify the encoding!) There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding by the sending program. If you cannot find a way to configure your program to do that, get another program.
For matters related to this article please contact the author.