
A Tutorial on Character Code Issues

  Jukka Korpela,
Helsinki University of Technology (HUT), Computing Centre, Finland


Note: this article is extracted from a more detailed document available on the Web at http://www.hut.fi/u/jkorpela/chars.html. As it contains many references which are links to other Web pages, it might be useful to browse it from the Web.
N.B. All references with links have been underlined.

This article tries to clarify the concepts of character repertoire, character code, and character encoding, especially in the Internet context. It specifically avoids the term character set, which is confusingly used to denote repertoire or code or encoding. ASCII, the ISO 8859 family (ISO Latin, especially ISO Latin 1) and MIME are used as examples. The document itself does not contain solutions to practical problems with character codes; rather, it gives background information needed for understanding what solutions there might be, what the different solutions do - and what's really the problem in the first place.

The basics

In computers and in data transmission between them, i.e. in digital data processing and transfer, data is internally presented as octets, as a rule. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here.

Different conventions can be established as regards how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit which presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.

In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters to be represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers.

Previously ASCII encoding was usually assumed by default (and it is still widely used). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.

Definitions

The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is often confusing.

character repertoire
A set of distinct characters. No specific internal presentation in computers or data transfer is assumed. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form. Notice that a character repertoire may contain characters which look the same in some presentations but are regarded as logically distinct, such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha.
character code
A mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers. That is, it assigns a unique numerical code, a code position, to each character in the repertoire. As synonyms for "code position", the following terms are also in use: code number, code value, code element, code point, code set value. Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers; in fact, most character codes have "holes", such as code positions reserved for control functions or for possible future use.
character encoding
A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code and these are used as such as octets. Naturally, this only works for character repertoires with at most 256 characters. For larger sets, more complicated encodings are needed.
glyph
It is important to distinguish the character concept from the glyph concept. A glyph is a presentation of a particular shape which a character may have when rendered or displayed. For example, the character Z might be presented as a boldface Z or as an italic Z, and it would still be a presentation of the same character. On the other hand, lower-case z is defined to be a separate character - which in turn may have different glyph presentations.
Unicode
Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code identical with ISO 10646 and an encoding for it. In practice, people usually talk about Unicode rather than ISO 10646, partly because Unicode is more explicit about the meanings of characters, partly because Unicode charts of characters are available on the Web. (Unicode version 1.0 used somewhat different names for some characters than ISO 10646. In the current Unicode version, 2.0, the names have been made the same as in ISO 10646.)

Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:

  1. A character repertoire specifies a collection of characters, such as "a", "!", and "ä".
  2. A character code defines numeric codes for characters in a repertoire. For example, in the ISO 10646 character code the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240. (Note: Especially the per mille sign, presenting o/oo as a single character, can be incorrectly shown on display or on paper. That would be an illustration of the symptoms of the problems we are discussing.)
  3. A character encoding defines how sequences of numeric codes are presented as (i.e. mapped to) sequences of octets. In one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
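
The three layers above can be traced in a few lines of Python. This is only a sketch, not part of the original article: it assumes Python 3, whose built-in ord() returns the ISO 10646 (Unicode) code number, and it uses Python's "utf-16-be" codec as a stand-in for the two-octet encoding, with which it agrees for these four characters.

```python
# Repertoire: the characters themselves (written with escapes so that the
# source file stays plain ASCII): a, !, a with dieresis, per mille sign.
text = "a!\u00e4\u2030"

# Character code: the ISO 10646 / Unicode code number of each character.
code_numbers = [ord(ch) for ch in text]

# Character encoding: two octets per character, most significant octet first.
octets = list(text.encode("utf-16-be"))

print(code_numbers)   # [97, 33, 228, 8240]
print(octets)         # [0, 97, 0, 33, 0, 228, 32, 48]
```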

The phrase character set is used in a variety of meanings. Often it denotes just a character repertoire but it may also refer to a character code or even to a character encoding. See, for example OII's document Character Set Standards, which mentions several standards, some of which define a character code while others also specify a fixed encoding.

Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire.

Example: ASCII

The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.

In fact, the definition of ASCII also defines a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):


   ASCII Printable characters

  ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~ 

There are actually several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.

The international standard ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @[\]{|} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines an "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII.

The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.

dec  oct  hex  glyph  official Unicode name   National variants
 35   43  23   #      number sign             # £ Ù
 36   44  24   $      dollar sign             ¤
 64  100  40   @      commercial at           É § Ä à ³
 91  133  5B   [      left square bracket     Ä Æ ° â ¡ ÿ é
 92  134  5C   \      reverse solidus         Ö Ø ç Ñ ½ ¥
 93  135  5D   ]      right square bracket    Å Ü § ê é ¿ |
 94  136  5E   ^      circumflex accent       Ü î
 95  137  5F   _      low line                è
 96  140  60   `      grave accent            é ä µ ô ù
123  173  7B   {      left curly bracket      ä æ é à ° ¨
124  174  7C   |      vertical line           ö ø ù ò ñ f
125  175  7D   }      right curly bracket     å ü è ç ¼
126  176  7E   ~      tilde                   ü ¯ ß ¨ û ì ´

Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Many systems which support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is not always portable from one system or application to another.


The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed in the "ASCII Printable characters" table above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~).
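
The consecutive assignment can be checked with a short Python sketch. It relies on the assumption that Python's chr() and ord(), which work with Unicode code numbers, agree with ASCII for the values 0 - 127; the row width of 16 simply mirrors the table above.

```python
# Code numbers 32..126 are assigned consecutively to the printable characters.
printable = [chr(code) for code in range(32, 127)]

# Print the repertoire in rows of 16, in the same order as the table.
for start in range(0, len(printable), 16):
    print(" ".join(printable[start:start + 16]))

# The first character is the blank (32), the last one the tilde (126).
print(ord(printable[0]), ord(printable[-1]))   # 32 126
```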

The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value. Notice that code numbers from 0 to 31 and code numbers from 127 upwards do not correspond to any printable character. (Code numbers 0 - 31 and 127 are reserved for control purposes. They have standardized names, but in fact their usage varies a lot.) Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
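
As an illustration of the parity remark, the following Python sketch encodes a string as ASCII and then uses the otherwise unused most significant bit as a parity bit. The even-parity convention and the helper name with_even_parity are assumptions made for this example; the ASCII standard itself does not prescribe them.

```python
def with_even_parity(octet: int) -> int:
    """Set the most significant bit so that the octet has an even number of 1 bits."""
    ones = bin(octet & 0x7F).count("1")
    return (octet & 0x7F) | (0x80 if ones % 2 else 0)

# ASCII encoding: each octet has the same value as the code number.
ascii_octets = list("Cat".encode("ascii"))

# Sender adds the parity bit; receiver clears it to recover the ASCII octets.
parity_octets = [with_even_parity(o) for o in ascii_octets]
recovered = [o & 0x7F for o in parity_octets]

print(ascii_octets)    # [67, 97, 116]
print(parity_octets)   # [195, 225, 116]
print(recovered)       # [67, 97, 116]
```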

Sometimes the phrase "8-bit ASCII" is used. It is a misnomer used to refer to various character codes which are extensions of ASCII in the following sense: the character repertoire contains ASCII as a subset, the code numbers are in the range 0 - 255, and the code numbers of ASCII characters equal their ASCII codes.

Another example: ISO Latin 1 and ISO 8859-1

The ISO 8859-1 standard (which is part of the ISO 8859 family of standards) defines a character repertoire identified as "Latin alphabet No. 1", commonly called "ISO Latin 1", as well as a character code for it. The repertoire contains the ASCII repertoire as a subset, and the code numbers for those characters are the same as in ASCII. The standard also specifies an encoding, which is similar to that of ASCII: each code number is presented simply as one octet.

In addition to the ASCII characters, ISO Latin 1 contains various accented characters and other letters needed for writing languages of Western Europe, and some special characters. These characters occupy code positions 160 - 255, and they are:


   ISO Latin 1 Printable characters

  ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
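
The table can be regenerated, and the one-octet encoding checked, with a small Python sketch. It assumes that Python's "iso-8859-1" codec is a faithful implementation of the standard's code and encoding.

```python
# Octets 160..255, decoded as ISO 8859-1, give the 96 characters of the table
# (the first one is the no-break space).
latin1_upper = bytes(range(160, 256)).decode("iso-8859-1")
for start in range(0, len(latin1_upper), 16):
    print(" ".join(latin1_upper[start:start + 16]))

# Each code number is presented as one octet with the same value.
a_dieresis = "\u00c4"                           # upper case A with dieresis
print(ord(a_dieresis))                          # 196
print(list(a_dieresis.encode("iso-8859-1")))    # [196]

# ASCII characters keep their ASCII code numbers.
print("A".encode("ascii") == "A".encode("iso-8859-1"))   # True
```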


Typing characters

Typing characters on a computer may appear deceptively simple: you press a key labeled "A", and the character "A" appears on the screen. You also expect "A" to be included in a disk file when you save what you are typing, you expect "A" to appear on paper if you print your text, and you expect "A" to be sent if you send your product by E-mail or something like that.

Thus far, you should have learned that the presentation of a character in computer storage, on disk, or in data transfer may vary a lot. You have probably realized that if it's not the common "A" but something more special (say, an "A" with an accent), strange things might happen, especially if data is not accompanied by adequate information about its encoding.

But you might still be too confident. You probably expect that on your system at least things are simpler than that. If you use your very own very personal computer and press the key labeled "A" on its keyboard, then shouldn't it be evident that in its storage and processor, on its disk, on its screen it's invariably "A"? Can't you just ignore its internal character code and character encoding? Well, probably yes - with "A". I wouldn't be so sure about "Ä", for instance.

When you press a key on your keyboard, what actually happens is this. The keyboard sends the code of a character to the processor. The processor then, in addition to storing the data internally somewhere, normally sends it to the display device. Now, the keyboard settings and the display settings might be different from what you expect. Even if a key is labeled "Ä", it might send something other than the code of "Ä" in the character code used in your computer. Similarly, the display device, upon receiving such a code, might be set to display something different. Such mismatches are usually undesirable, but they are definitely possible.

Moreover, there are often keyboard restrictions. If your computer internally uses, say, the ISO Latin 1 character repertoire, you probably won't find keys for all 191 characters in it on your keyboard.

Therefore, you often need program or operating system specific ways of entering characters from a keyboard, either because there is no key for a character you need or there is but it does not work (properly). An important example of such ways:

It is often possible to use various "escape" notations for characters. This somewhat vague term means notations which are converted to characters according to some specific rules by some programs. For example, in the HTML language one can use the notation &Auml; or, equivalently, the notation &#196; for the character Ä. To take another example, in the C programming language, one can usually write \304 (the octal notation for 196; an octal escape takes at most three digits) to denote Ä within a string constant, although this makes the program character code dependent. In cases like these, the character itself does not occur in a file (such as an HTML document or a C source program); instead, the file contains the "escape" notation as a character sequence, which will then be interpreted in a specific way by programs like a Web browser or a C compiler.
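
The following Python sketch shows such notations being interpreted by a program. It is an analogy rather than the HTML or C processing itself: html.unescape() from the Python standard library plays the role of the Web browser, and a Python byte string with an octal escape stands in for the C string constant (the assumption being that the octet is then read as ISO 8859-1).

```python
import html

# HTML: a named and a numeric character reference for the same character.
named = html.unescape("&Auml;")
numeric = html.unescape("&#196;")

# Octal escape: the source contains a backslash and three digits, the resulting
# byte string contains the single octet 196, read here as ISO 8859-1.
octal_octets = b"\304"
from_octal = octal_octets.decode("iso-8859-1")

print(list(octal_octets))                            # [196]
print(named == numeric == from_octal == "\u00c4")    # True
```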

Information about encoding

The need for information about encoding

It is hopefully obvious from the preceding discussion that a sequence of octets can be interpreted in a multitude of ways when processed as character data. By looking at the octet sequence only, you cannot even know whether each octet presents one character or just part of a two-octet presentation of a character, or something more complicated.
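
A small Python sketch makes the ambiguity concrete. It takes the two octets 32 and 48 and interprets them in two ways; the codec names are Python's, and "utf-16-be" is used here as an assumed stand-in for a two-octet encoding of ISO 10646.

```python
octets = bytes([32, 48])

# One octet per character (ISO 8859-1, or equally ASCII): a space and a zero.
as_single_octets = octets.decode("iso-8859-1")

# Two octets per character: code number 32 * 256 + 48 = 8240, the per mille sign.
as_octet_pair = octets.decode("utf-16-be")

print(repr(as_single_octets))                   # ' 0'
print(len(as_octet_pair), ord(as_octet_pair))   # 1 8240
```

Nothing in the octets themselves tells which reading was intended.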

Naturally, such a sequence could be intended to present other than character data, too; it could be an image in a bitmap format, or a computer program in binary form, or numeric data in the internal format used in computers.

This problem can be handled in different ways in different systems when data is stored and processed within one computer system. For data transmission, a platform-independent method of specifying the encoding and other relevant information is needed. Such methods exist, although they are not always used widely enough. People still send each other data without specifying the encoding, and this may cause a lot of harm. Attaching a human-readable note, such as a few words of explanation in an E-mail message body, is better than nothing. But since data is processed by programs which cannot understand such notes, the encoding should be specified in a standardized computer-readable form.

The MIME solution

Internet media types, often called MIME media types, can be used to specify a major media type ("top level media type", such as text), a subtype (such as html), and an encoding (such as iso-8859-1). They were originally developed to allow sending other than plain ASCII data by E-mail. They can be (and should be) used for specifying the encoding when data is sent over a network, e.g. by E-mail or using the HTTP protocol on the World Wide Web.

The media type concept is defined in RFC 2046. The procedure for registering types is given in RFC 2048. For less authoritative but more readably presented information on media types, see e.g. document MIME Types by Chris Herborth. The technical term used to denote a character encoding in the Internet media type context is "character set", abbreviated "charset".

Specifically, when data is sent in MIME format, the media type and encoding are specified in a manner illustrated by the following example:
Content-Type: text/html; charset=iso-8859-1
This specifies, in addition to saying that the media type is text and subtype is html, that the character encoding is ISO 8859-1.
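
How receiving software might pick the pieces out of such a header can be sketched with Python's standard email package. This is an illustration only; the one-octet message body is made up for the example.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/html; charset=iso-8859-1"

media_type = msg.get_content_maintype()   # top level media type
subtype = msg.get_content_subtype()
charset = msg.get_content_charset()       # the "charset" parameter, lowercased

# Hypothetical body: one octet, interpreted according to the declared charset.
body_octets = bytes([196])
body_text = body_octets.decode(charset)

print(media_type, subtype, charset)       # text html iso-8859-1
print(ord(body_text))                     # 196, i.e. upper case A with dieresis
```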

Several character encodings have alternate names in the registry. For example, the basic (ISO 646) variant of ASCII can be called "ASCII" or "ANSI_X3.4-1968" or "cp367" (plus a few other names); the preferred name in MIME context is, according to the registry, "US-ASCII". Similarly, ISO 8859-1 has several names, the preferred MIME name being "ISO-8859-1". The "native" encoding for Unicode, UCS-2, is named "ISO-10646-UCS-2" there.
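
Software usually keeps a similar table of alternate names. As a sketch, Python's codecs module resolves several of the names mentioned above to one and the same codec. Note the assumption: this is Python's own alias table, not the official registry, and the two do not coincide in every detail.

```python
import codecs

# Alternate names for the basic variant of ASCII.
ascii_names = ["ASCII", "ANSI_X3.4-1968", "cp367", "US-ASCII"]
resolved = {name: codecs.lookup(name).name for name in ascii_names}
print(resolved)

# Likewise, "ISO-8859-1" and the informal "latin-1" resolve to the same codec.
same_latin1 = codecs.lookup("ISO-8859-1").name == codecs.lookup("latin-1").name
print(same_latin1)
```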

The Content-Type information is an example of information in a header. Headers relate to some data, describing its presentation and other things, but are passed as logically separate from it. Possible headers and their contents are defined in the basic MIME specification, RFC 2045. Adequate headers should normally be generated automatically by the software which sends the data (such as a program for sending E-mail, or a Web server) and interpreted automatically by receiving software (such as a program for reading E-mail, or a Web browser). In E-mail messages, headers precede the message body; it depends on the E-mail program whether and how it displays the headers. For Web documents, a Web server is required to send headers when it delivers a document to a browser (or other user agent) which has sent a request for the document.

How MIME should work in practice

Basically, MIME should let people communicate smoothly without hindrances caused by character code and encoding differences. MIME should handle the necessary conversions automatically and invisibly.

For example, when person A sends E-mail to person B, the following should happen: The E-mail program used by A encodes A's message in some particular manner, probably according to some convention which is normal on the system where the program is used (such as ISO 8859-1 encoding on a typical modern Unix system). The program automatically includes information about this encoding into an E-mail header, which is usually invisible both when sending and when reading the message. The message, with the headers, is then delivered, through network connections, to B's system. When B uses his E-mail program (which may be very different from A's) to read the message, the program should automatically pick up the information about the encoding as specified in a header and interpret the message body according to it. For example, if B is using a Macintosh computer, the program would automatically convert the message into Mac's internal character encoding and only then display it. Thus, if the message was ISO 8859-1 encoded and contained the Ä (upper case A with dieresis) character, encoded as octet 196, the E-mail program used on the Mac should use a conversion table to map this to octet 128, which is the encoding for Ä on Mac. (If the program fails to do such a conversion, strange things will happen. ASCII characters would be displayed correctly, since they have the same codes in both encodings, but instead of Ä, the character corresponding to octet 196 in Mac encoding would appear - a symbol which looks like f in italics.)
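
The conversion described above, and the effect of omitting it, can be reproduced with a short Python sketch. It assumes that Python's "mac_roman" codec corresponds to the Mac encoding meant in the text.

```python
received = bytes([196])       # upper case A with dieresis, ISO 8859-1 encoded

# Correct handling: interpret according to the declared encoding,
# then convert to the Mac's internal encoding.
text = received.decode("iso-8859-1")
mac_octets = text.encode("mac_roman")

# Faulty handling: treat the received octet as if it were already Mac encoded.
misread = received.decode("mac_roman")

print(list(mac_octets))       # [128]
print(hex(ord(misread)))      # 0x192 (LATIN SMALL LETTER F WITH HOOK)

# ASCII characters have the same codes in both encodings.
print("A".encode("iso-8859-1") == "A".encode("mac_roman"))   # True
```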

Problems with implementations - examples

Unfortunately, there are deficiencies and errors in software so that users often have to struggle with character code conversion problems, perhaps correcting the actions taken by programs.

A typical minor (!) problem which may occur in communication in Western European languages other than English is that most characters get interpreted and displayed correctly but some "national letters" don't. For example, the character repertoire needed in German, Swedish, and Finnish is essentially ASCII plus a few letters like "ä" from the rest of ISO Latin 1. If a text in such a language is processed so that a necessary conversion is not applied, or an incorrect conversion is applied, the result might be that e.g. the word "später" becomes "spter" or "spÌter" or "spdter" or "sp=E4ter".
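
Three of these garbled forms can be reproduced with a Python sketch. The explanations in the comments (a discarded octet, a cleared most significant bit, quoted-printable text that is never decoded) are plausible causes assumed for the illustration; the article does not say which software produced which result, and the "spÌter" form is not covered here.

```python
import quopri

# "später" as ISO 8859-1 octets; the a with dieresis is octet 228.
octets = "sp\u00e4ter".encode("iso-8859-1")

# Octets outside ASCII are simply discarded.
dropped = octets.decode("ascii", errors="ignore")

# The most significant bit is cleared: 228 becomes 100, the code of "d".
high_bit_cleared = bytes(o & 0x7F for o in octets).decode("ascii")

# Quoted-printable transfer encoding is applied but never decoded on arrival.
quoted_printable = quopri.encodestring(octets).decode("ascii")

print(dropped, high_bit_cleared, quoted_printable)   # spter spdter sp=E4ter
```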

To illustrate what may happen when text is sent in a grossly invalid form, consider the following example. I'm sending myself E-mail, using Netscape 4.0 (on Windows 95). In the mail composition window, I set the encoding to UTF-8. The body of my message is simply
Tämä on testi.
(That's Finnish for 'This is a test'. The second and fourth characters are the letter a with umlaut.) Trying to read the mail on my Unix account, using the Pine E-mail program (popular among Unix users), I see the following (when in "full headers" mode; irrelevant headers omitted here):

X-Mailer: Mozilla 4.0 [en] (Win95; I)
MIME-Version: 1.0
To: Jukka.Korpela@hut.fi
Subject: Test
X-Priority: 3 (Normal)
Content-Type: text/plain; charset=x-UNICODE-2-0-UTF-7
Content-Transfer-Encoding: 7bit

    [The following text is in the "x-UNICODE-2-0-UTF-7" character set]
    [Your display is set for the "ISO-8859-1" character set]
    [Some characters may be displayed incorrectly]

T+O6Q- on testi.

Interesting, isn't it? I specifically requested UTF-8 encoding, but Netscape used UTF-7. And it did not include a correct header, since x-UNICODE-2-0-UTF-7 is not a registered "charset" name. Even if the encoding had been a registered one, there would have been no guarantee that my E-mail program would have been able to handle the encoding. The example, "T+O6Q-" instead of "Tämä", illustrates what may happen when an octet sequence is interpreted according to another encoding than the intended one. In fact, it is difficult to say what Netscape was really doing, since it seems to encode incorrectly.

A correct UTF-7 encoding for "Tämä" would be "T+AOQ-m+AOQ-". The "+" and "-" characters correspond to octets indicating a switch to "shifted encoding" and back from it. The shifted encoding is based on presenting Unicode values first as 16-bit binary integers, then regrouping the bits and presenting the resulting six-bit groups as octets according to a table specified in RFC 2045 in the section on Base64. See also RFC 2152.
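
The steps just described can be followed by hand in Python. This is a sketch for the single word "Tämä" only, not a general UTF-7 encoder; it assumes that the standard base64 module implements the Base64 table of RFC 2045 and that the "=" padding it adds is simply dropped in UTF-7.

```python
import base64

a_dieresis = "\u00e4"          # lower case a with dieresis, code number 228

# Step 1: the code number as a 16-bit integer, i.e. the two octets 0 and 228.
sixteen_bit = a_dieresis.encode("utf-16-be")

# Step 2: regroup the bits into six-bit groups and map them with the Base64 table.
shifted = base64.b64encode(sixteen_bit).decode("ascii").rstrip("=")

# Step 3: mark the shifted sections with "+" and "-".
manual = "T+" + shifted + "-m+" + shifted + "-"

print(shifted)    # AOQ
print(manual)     # T+AOQ-m+AOQ-

# Cross-check: Python's own UTF-7 decoder recovers the original word.
decoded = manual.encode("ascii").decode("utf-7")
print(decoded == "T\u00e4m\u00e4")   # True
```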

Practical conclusions

Whenever text data is sent over a network, the sender and the recipient should have a joint agreement on the character encoding used. In the optimal case, this is handled by the software automatically, but in reality the users need to take some precautions.

Most importantly, make sure that any Internet software that you use to send data specifies the encoding correctly in suitable headers. There are two things involved: the header must be there and it must reflect the actual encoding used; and the encoding used must be one that is widely understood by the (potential) recipients' software. (One must often make compromises as regards the latter aim: you may need to use an encoding which is not yet widely supported to get your message through at all.)

Learn to use your Web browser, newsreader, and E-mail program so that you can retrieve the encoding information for the page, article, or message you are reading. (For example, on Netscape use View Page Info; on News Xpress, use View Raw Format; on Pine, use h.)

If you use, say, Netscape to send E-mail or to post to Usenet news, make sure it sends the message in a reasonable form. In particular, make sure it does not duplicate the message by sending it both as plain text and as HTML (turn off the latter). As regards character encoding, make sure it is something widely understood, such as ASCII, some ISO 8859 encoding, or UTF-8, depending on how large a character repertoire you need.

In particular, avoid sending data in a proprietary encoding (like the Macintosh encoding or a DOS encoding) to a public network. (At the very least, if you do that, make sure that the message heading specifies the encoding!) There's nothing wrong with using such an encoding within a single computer or in data transfer between similar computers. But when sent to the Internet, data should be converted to a more widely known encoding by the sending program. If you cannot find a way to configure your program to do that, get another program.

For matters related to this article please contact the author.

Cnl.Editor@cern.ch


Last Updated on June 24th, 1999 at 16:52:25
Copyright © CERN 1999 -- European Laboratory for Particle Physics