IME variants not present in the EACC/MARC21 character sets

These pages are maintained for reference purposes only. Please refer to the main page for the latest information on their status.

The problem:

The bibliographic utilities (RLG, OCLC) and most major local library system vendors have recently updated their software to rely upon third-party vendors for Chinese, Japanese and Korean input, and to conform to MARC21 standards. In practice this often means that users now use the standard Input Method Editors (IMEs) provided with Microsoft Windows 2000 and XP, or the more recent ones available on the Microsoft Office Proofing Tools 2003 CD.

(For more general information on the multilingual capabilities of Windows, see the file Multilingual Workshop. For some short guidelines to the most recent IMEs, see the file IME Workshop.)

However, for exchange purposes the MARC21 standard still allows only the characters of the older EACC character set. EACC was formulated before the Unicode character sets on which IMEs rely became available, so the two standards differ somewhat in their detailed principles and in how they decide whether certain characters are the same or different. Moreover, with hindsight one can acknowledge that Japanese and Korean catalogers in particular were forced to input traditional Chinese variants of characters rather than the usual Japanese and Korean forms.

Because the older versions of the utilities gave no access to any characters outside the EACC character set, catalogers had no choice but to input the standardized character, even in cases where normal Japanese or Korean usage differed quite visibly from the traditional original. For example, since 戸 and 户 were not available, 戶 was selected.

With the reliance on open IMEs, however, a situation now frequently occurs in which only the non-EACC character (the one in current normal usage) is easily accessible through the IME, while the traditionally prescribed EACC equivalent is unavailable, or available only with difficulty, in that particular IME. All EACC characters in actual use are currently mapped to valid Unicode characters (for the latest updates, see the MARC character set standards at the Library of Congress; a small unmapped subset should not occur within bibliographic records). The EACC character is therefore likely to be available in an IME for another language, most likely traditional Chinese; however, it may not be easy for a non-speaker to input it from there.

The issue is not limited to catalogers. Users will also be unable to search using the IME variant if that is not the encoding of the character used by the catalogers.

To solve these problems, we have prepared the following tables, so that catalogers can find the EACC-validated character and, using simple cut-and-paste, select it instead of the disallowed character chosen by the IME. The list can also serve system vendors as a guide to where normalization should occur: users inputting a non-EACC character will expect the same results as if they had entered the EACC character. For this latter purpose, the list will not become outdated even if the MARC21 interchange standard comes to allow all Unicode characters.
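As a rough illustration of the normalization idea for system vendors, the following Python sketch folds a few IME variants onto their EACC-validated equivalents before comparing a search query with a stored heading. The dictionary holds only a handful of rows copied from Table 1, and the names `IME_TO_EACC`, `to_eacc` and `same_search_key` are hypothetical; a real system would load the complete tables and would probably apply the folding at indexing time as well.

```python
# Illustrative excerpt of Table 1: code point of the IME variant mapped to
# the code point of the EACC-validated character. Not the full table.
IME_TO_EACC = {
    0x6238: 0x6236,  # Japanese form -> traditional form
    0x6237: 0x6236,  # simplified form -> traditional form
    0x5DFB: 0x5377,
    0x8AAC: 0x8AAA,
    0xFF62: 0x300C,  # halfwidth left corner bracket -> CJK corner bracket
    0xFF63: 0x300D,  # halfwidth right corner bracket -> CJK corner bracket
}


def to_eacc(text: str) -> str:
    """Replace every listed IME variant with its EACC-validated equivalent.

    Characters that are not in the mapping are passed through unchanged.
    """
    return text.translate(IME_TO_EACC)


def same_search_key(query: str, heading: str) -> bool:
    """Compare two strings after folding both to the EACC-validated forms."""
    return to_eacc(query) == to_eacc(heading)
```

Folding both sides of the comparison means that a user typing the IME variant and a cataloger who entered the EACC character end up with the same search key, whichever form the record actually contains.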

These tables were intended to be a group effort. The work was begun by catalogers at Yale University and was enhanced by colleagues at Columbia and Princeton Universities before its publication here. Please send any remarks to the editor of this page, currently Martin Heijdra, Princeton University; for how to recommend additions to Tables 1 and 2, please refer to the main page.

The tables

Table 1

The first table is a general table, including Japanese, Chinese and some Korean cases. You can also download this table in Excel format. This table was last updated October 21, 2005.

Important! Make sure you look at the following table in the regular font you use. (The default setting, if you have it installed, is Arial Unicode MS.) Character display differs depending on the font used. The display used internally by the IME may use another font than the one you have set in your program, resulting in slight visual differences.

Table 2

The second table deals with Korean only. There is a large set of Korean characters with duplicate readings, and these characters have been encoded twice in Unicode in order to be compatible with some Korean applications. The encoding produced by the standard Korean IME differs depending on the pronunciation used for input, even if the character looks the same. However, only one pronunciation results in a valid EACC character. The table lists 267 such cases, ordered according to the input which results in an invalid character. In one case, neither reading results in a valid EACC character; this case is listed in both tables. This Korean table is also available for download in Excel format, or for Web viewing. This table was last updated September 19, 2005.
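A program can at least flag likely candidates for this problem. The sketch below assumes that the duplicate encodings sit in the Unicode CJK Compatibility Ideographs block (U+F900 to U+FAFF), as the F92C entry in Table 1 does, and reports each such character together with the canonical equivalent recorded in the Unicode character database. The function name is hypothetical, and the result is only a hint to be checked against Table 2, which remains the authority on which value is valid in EACC.

```python
import unicodedata


def compatibility_candidates(text):
    """List characters from U+F900..U+FAFF that have a canonical equivalent.

    Returns tuples of (position, code point in hex, canonical equivalent in
    hex). A few code points in this block are ordinary ideographs with no
    decomposition; those are skipped.
    """
    found = []
    for position, ch in enumerate(text):
        if 0xF900 <= ord(ch) <= 0xFAFF:
            equivalent = unicodedata.decomposition(ch)  # "" if none exists
            if equivalent:
                found.append((position, f"{ord(ch):04X}", equivalent))
    return found
```

For example, `compatibility_candidates("\uF92C")` reports the F92C character with 90CE as its equivalent, which matches the corresponding row of Table 1.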

Since the second table is meant for Korean, it is best to use the Batang font for viewing.

Table 3

The third table deals with a discrepancy between the Unihan database and the authoritative LC EACC-to-Unicode listing. A September 2004 update mapped 235 EACC characters that had previously been assigned to the "Private Use Area" to Unicode characters, but this update is not yet reflected in the Unihan database. This means that there are valid characters one can use for which the EACC field in the Unihan database is empty. Programs will differ in how they deal with these characters, depending on what they base their mappings on.

Of the 235 characters, most are mapped to previously allowed characters; that is, while the Unihan database may list only one EACC equivalent for a Unicode character, this update mapped two or more EACC values to it. Since these Unicode characters were previously allowed, they should usually not be rejected. In addition, in June 2004 one more character had already been remapped to a newly allowed value (5861, 塡), which is also not yet reflected in the Unihan database.

All in all there are 19 exceptions, where there are newly allowed Unicode characters. These may or may not be rejected by a particular library system; officially, all these values are allowed for exchange. Of these:

12 characters do not present any special difficulty (note: 1 was already remapped in June 2004)

1 character should not present any difficulty but happens not to have a glyph in Arial Unicode MS

2 characters are in Extension A, for which glyphs are only available in special fonts such as SimSun (Founder Extension); provided you have such a font, many programs may be able to deal with them

3 characters are in Extension B, with glyphs only available in fonts such as SimSun (Founder Extension); programs rarely can deal with these characters, and you may wish to avoid them by replacing them with the "unavailable glyph" symbol geta (U+3013, 〓). Internet Explorer may show these characters in the table as an empty box; you can see the characters in recent versions of Office programs through copy-and-paste

1 character, finally, is an Ancient Hangul syllable; use a Korean font such as Batang for viewing

Table 3 lists all these 19 characters. It is also available in Excel format. It was last updated on September 19, 2005.
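The Extension A and Extension B cases can be recognized from the code point alone, which is one way to automate the geta substitution suggested above. The following Python sketch uses the Unicode block ranges U+3400 to U+4DBF for Extension A and U+20000 to U+2A6DF for Extension B; the function names are hypothetical, and whether substitution is appropriate at all depends on what the local system can display and store.

```python
GETA = "\u3013"  # the "unavailable glyph" symbol


def cjk_extension(ch):
    """Return "Extension A", "Extension B", or None for a single character."""
    code_point = ord(ch)
    if 0x3400 <= code_point <= 0x4DBF:
        return "Extension A"
    if 0x20000 <= code_point <= 0x2A6DF:
        return "Extension B"
    return None


def replace_extension_b(text):
    """Replace every Extension B character with the geta symbol.

    Extension A characters are left alone, since many programs can handle
    them when a suitable font is installed.
    """
    return "".join(
        GETA if cjk_extension(ch) == "Extension B" else ch for ch in text
    )
```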

Table 4

The fourth table lists a set of 8 characters which once had a valid EACC-to-Unicode mapping, but were removed from the MARC 21 repertoire in September 2004. They have an EACC value in the Unihan database, but no longer appear on the full LC list of allowed mappings. Systems which have not implemented the September 2004 update may accept these characters, although they should not.

Four of these characters involve Korean multiple readings, and are also listed in the Korean table.

Table 4 was last updated on September 19, 2005; it is also available in Excel format.

Table 1 General Table for Japanese, Chinese and Korean

Do not use         Use instead        Remarks
(Unicode value)    (Unicode value)

5DFB               5377               look at bottom, not top, to see difference; fonts vary
8AAC               8AAA
FF62               300C               use the shorter hooks
FF63               300D               use the shorter hooks
6238               6236
6237               6236
67FB               67E5
5BDB               5BEC               use character with dot
6B69               6B65               use character without stroke
5F5A               5F59
7232               70BA
9332               9304
59EB               59EC
7155               7199
95B2               95B1
63FA               6447               use character with stroke
8131               812B
85AB               85B0
6B73               6B72
865A               865B
6669               665A               careful; right element should have seven, not eight strokes
6E09               6D89               use character without stroke
7A0E               7A05
93B8               942B
90F7               9109
9115               9109
9EB9               9EB4
5C2D               582F
7476               7464
5036               4FF1               correct value uses the "eye" element with extended strokes
570F               5708               look at bottom of inside, not top, to see difference; fonts vary
7DD2               7DD6               use character with extra dot
F92C 郎            90CE               character produced using Korean ; correct character not available from Korean IME
90DE 郞            90CE               character produced using Korean ; correct character not available from Korean IME
7E4D               7E61
654D               6558
6F4A               6F35
6BCE               6BCF
7501               74F6
7E4B               7E6B               Non-valid Japanese version only has "cart" element; valid Chinese version has extra "mountain" at bottom of "cart" element
4EB7               5EC9
FF10               3007               FF10 is the "fullwidth digit zero", 3007 is the "ideographic number zero". In the Japanese IME, FF10 is called [全]数字, 3007 漢数字. Both are different from 0030 ("digit zero"), the Western zero, produced by the Chinese and Korean IME's, and available in the Japanese IME as [半]数字
2010               002D -             2010 is the "fullwidth hyphen", 002D is the "half-width hyphen/minus". In the Japanese IME, 2010 is called [全]ハイフン, 002D [半]ハイフン、マイナス
00B7 ·             30FB               00B7 is the Western "middle dot", 30FB is the "ideographic centered point"; non-acceptance of 00B7 in CJK sequences is a temporary problem in some systems, officially both values are allowed; use Japanese IME, not Chinese IME, to access 30FB
02BE ʾ             02BC ʼ             the mapping of the MARC-8 alif changed in March 2005 from 02BE ("modifier letter right half ring") to 02BC ("modifier letter apostrophe"); systems vary in which value they support
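
Because the zeros, hyphens, dots and apostrophe-like marks at the end of Table 1 are hard to tell apart by eye, it can help to ask a program which code points a pasted string actually contains. The following Python sketch uses the standard `unicodedata` module to report the Unicode value and formal character name of each character; the function name `describe` is hypothetical. Note that the formal Unicode names do not always match the informal descriptions used in the remarks (for instance, U+2010 is formally named HYPHEN and U+30FB is KATAKANA MIDDLE DOT).

```python
import unicodedata


def describe(text):
    """Return (Unicode value in hex, formal character name) per character."""
    return [
        (f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # Fullwidth zero, ideographic zero, and the Western digit zero.
    for value, name in describe("\uFF10\u3007" + "0"):
        print(value, name)
```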