These pages are maintained for reference purposes only. Please refer to the main page for the latest information on their status.
The bibliographic utilities (RLG, OCLC) and most major local library system vendors have recently updated their software to rely upon third-party vendors for Chinese, Japanese and Korean input, and to conform to MARC21 standards. In practice this often means that users now use the standard Input Method Editors (IMEs) provided with Microsoft Windows 2000 and XP, or the more recent ones available on the Microsoft Office Proofing Tools 2003 CD.
(For more general information on the multilingual capabilities of Windows, see the file Multilingual Workshop. For short guidelines to the most recent IMEs, see the file IME Workshop.)
However, for exchange purposes the repertoire of characters allowed by the MARC21 standard is still that of the older EACC character set. EACC was formulated before the Unicode character sets on which IMEs rely became available, so the two standards differ somewhat in their detailed principles and in how they decide whether certain characters are the same or different. Moreover, with hindsight one can acknowledge that Japanese and Korean catalogers in particular were forced to input traditional Chinese variants of characters rather than the usual Japanese and Korean ones.
Since the older versions of the utilities by definition offered no access to any characters outside the EACC character set, catalogers input the standardized character because no other was available, even where normal Japanese or Korean usage differed quite visibly from the traditional original. For example, since 戸 and 户 were not available, 戶 was selected.
With the reliance upon open IMEs, however, a new situation frequently occurs: the IME makes only the non-EACC character easily accessible (since it is the one in current normal usage), while the traditionally prescribed EACC equivalent is unavailable, or available only with difficulty, in that particular IME. Since all EACC characters actually in use are currently mapped to valid Unicode characters (for the latest updates, see the MARC character set standards at the Library of Congress; a small unmapped subset should not occur within bibliographic records), the characters are likely to be available in an IME for another language, most likely Traditional Chinese; however, it may not be easy for a non-speaker to input them from there.
The issue is not limited to catalogers. Users will also be unable to search using the IME variant if that is not the encoding of the character used by the catalogers.
To solve these problems, we have prepared the following tables, so that catalogers can find the EACC-validated character and, using simple cut-and-paste, select it instead of the disallowed character chosen by the IME. The list can also serve system vendors as a list of where normalization should occur: users inputting a non-EACC character will expect the same results as if they had entered the EACC character. For this latter purpose, the list will not become outdated even if the MARC21 interchange standard comes to allow all Unicode characters.
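For system vendors, the normalization described above might be sketched as follows in Python. This is only an illustration: the mapping is a small excerpt copied from the general table, and the names `NON_EACC_TO_EACC`, `to_eacc_form` and `find_disallowed` are hypothetical, not part of any library system.

```python
# Hypothetical excerpt of the general table:
# non-EACC code point -> EACC-valid code point.
# A real system would load the complete published tables instead.
NON_EACC_TO_EACC = {
    0x5DFB: 0x5377,  # 巻 -> 卷
    0x8AAC: 0x8AAA,  # 説 -> 說
    0x6238: 0x6236,  # 戸 -> 戶
    0x6237: 0x6236,  # 户 -> 戶
    0x67FB: 0x67E5,  # 査 -> 查
    0x6B69: 0x6B65,  # 歩 -> 步
}


def to_eacc_form(text):
    """Replace every listed non-EACC character with its EACC-valid equivalent."""
    return text.translate(NON_EACC_TO_EACC)


def find_disallowed(text):
    """Return (position, character found, suggested replacement) for each listed character."""
    return [
        (i, ch, chr(NON_EACC_TO_EACC[ord(ch)]))
        for i, ch in enumerate(text)
        if ord(ch) in NON_EACC_TO_EACC
    ]
```

Applying `to_eacc_form` both to indexed data and to search terms would make a query typed with the IME variant match a record cataloged with the EACC character, while `find_disallowed` could help a cataloger spot characters to replace before a record is exported.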
These tables were intended to be a group effort. They were begun by catalogers at Yale University and enhanced by colleagues at Columbia and Princeton Universities before their publication here. Please send any remarks to the editor of this page, currently Martin Heijdra, Princeton University; for how to recommend additions to tables 1 and 2, please refer to the main page.
The first table is a general table, including Japanese, Chinese and some Korean cases. You can also download this table in Excel format. This table was last updated October 21, 2005.
The second table deals with Korean only. There is a large set of Korean characters with duplicate readings, and these characters have been encoded twice in Unicode in order to be compatible with some Korean applications. The encoding produced by the standard Korean IME differs depending on the pronunciation used for input, even if the character looks the same. However, only one pronunciation results in a valid EACC character. The table lists 267 such cases, ordered according to the input that results in an invalid character. In one case, neither reading results in a valid EACC character; this case is listed in both tables. This Korean table is also available for download in Excel format, or for Web viewing. This table was last updated September 19, 2005.
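As an illustration of the duplicate encodings, the sketch below uses Python's standard `unicodedata` module. It assumes, which this page does not state, that the duplicates in question are the ones in Unicode's CJK Compatibility Ideographs block (U+F900 to U+FAFF), which carry a canonical decomposition to a unified ideograph. U+F92C, listed in the general table, is one such character. The function names are hypothetical.

```python
import unicodedata


def has_unified_equivalent(ch):
    """True if ch lies in the CJK Compatibility Ideographs block and
    has a canonical decomposition to another character."""
    return 0xF900 <= ord(ch) <= 0xFAFF and unicodedata.decomposition(ch) != ""


def fold_compatibility_ideographs(text):
    """Unicode normalization (NFC) replaces compatibility ideographs
    with the unified ideographs they decompose to."""
    return unicodedata.normalize("NFC", text)
```

For example, `fold_compatibility_ideographs("\uF92C")` yields U+90CE, which matches the replacement given in the general table. Normalization alone is not a substitute for the Korean table: the unified character it produces still has to be checked against the table, and invalid characters outside the compatibility block (such as U+90DE) are not affected by it.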
The third table deals with a discrepancy between the Unihan database and the authoritative LC EACC-to-Unicode listing. A September 2004 update mapped 235 previously "Private Use Area" EACC characters to Unicode characters, but this update is not yet reflected in the Unihan database. This means that there are valid characters one can use for which the EACC field in the Unihan database is not filled in. Programs will differ in how they deal with these characters, depending on what they base their mappings on.
Of the 235 characters, most are mapped to previously allowed characters; that is, while the Unihan database may list only one EACC equivalent for a Unicode character, after this update two or more EACC values map to it. Since these Unicode characters were already allowed, they should usually not be rejected. In addition, in June 2004 one more character had already been remapped to a newly allowed value (5861, 塡), which is also not yet reflected in the Unihan database.
All in all there are 19 exceptions, where there are newly allowed Unicode characters. These may or may not be rejected by a particular library system; officially, all these values are allowed for exchange. Of these:
Number | Description |
---|---|
12 | characters do not present any special difficulty (note: 1 was already remapped in June 2004) |
1 | character should not present any difficulty but happens not to have a glyph in Arial Unicode MS |
2 | characters are in Extension A, for which glyphs are only available in special fonts such as SimSun (Founder Extension); provided you have such a font, many programs may be able to deal with them |
3 | characters are in Extension B, with glyphs only available in fonts such as SimSun (Founder Extension); programs rarely can deal with these characters, and you may wish to avoid them by replacing them with the "unavailable glyph" symbol geta (U+3013, 〓). Internet Explorer may show these characters in the table as an empty box; you can see the characters in recent versions of Office programs through copy-and-paste |
1 | character, finally, is an Ancient Hangul syllable; use a Korean font such as Batang for viewing |
Table 3 lists all these 19 characters. It is also available in Excel format. It was last updated on September 19, 2005.
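The suggestion to substitute the geta symbol for Extension B characters can be sketched in Python as below. The block ranges are those of the Unicode CJK Unified Ideographs Extension A and Extension B blocks; the function names are hypothetical, and whether substitution is appropriate at all remains a local policy decision.

```python
GETA = "\u3013"  # the "unavailable glyph" symbol mentioned above


def cjk_extension(ch):
    """Return 'A' or 'B' if ch lies in CJK Unified Ideographs
    Extension A or Extension B, otherwise None."""
    cp = ord(ch)
    if 0x3400 <= cp <= 0x4DBF:
        return "A"
    if 0x20000 <= cp <= 0x2A6DF:
        return "B"
    return None


def replace_extension_b(text):
    """Substitute the geta symbol for every Extension B character;
    all other characters, including Extension A, are left unchanged."""
    return "".join(GETA if cjk_extension(ch) == "B" else ch for ch in text)
```

Extension A characters are deliberately left alone here, since the note above suggests many programs can handle them when a suitable font is installed.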
The fourth table lists a set of 8 characters that once had a valid EACC-to-Unicode mapping but were removed from the MARC 21 repertoire in September 2004. They have an EACC value in the Unihan database, but no longer appear on the full LC list of allowed mappings. Systems that have not implemented the September 2004 update may accept these characters, although they should not.
Four of these characters involve Korean multiple readings, and are also listed in the Korean table.
Table 4 was last updated on September 19, 2005; it is also available in Excel format.
Do not use | (Unicode value) | Use instead | (Unicode value) | Remarks |
---|---|---|---|---|
巻 | 5DFB | 卷 | 5377 | look at bottom, not top, to see difference; fonts vary |
説 | 8AAC | 說 | 8AAA | |
｢ | FF62 | 「 | 300C | use the shorter hooks |
｣ | FF63 | 」 | 300D | use the shorter hooks |
戸 | 6238 | 戶 | 6236 | |
户 | 6237 | 戶 | 6236 | |
査 | 67FB | 查 | 67E5 | |
寛 | 5BDB | 寬 | 5BEC | use character with dot |
歩 | 6B69 | 步 | 6B65 | use character without stroke |
彚 | 5F5A | 彙 | 5F59 | |
爲 | 7232 | 為 | 70BA | |
録 | 9332 | 錄 | 9304 | |
姫 | 59EB | 姬 | 59EC | |
煕 | 7155 | 熙 | 7199 | |
閲 | 95B2 | 閱 | 95B1 | |
揺 | 63FA | 摇 | 6447 | use character with stroke |
脱 | 8131 | 脫 | 812B | |
薫 | 85AB | 薰 | 85B0 | |
歳 | 6B73 | 歲 | 6B72 | |
虚 | 865A | 虛 | 865B | |
晩 | 6669 | 晚 | 665A | careful; right element should have seven, not eight strokes |
渉 | 6E09 | 涉 | 6D89 | use character without stroke |
税 | 7A0E | 稅 | 7A05 | |
鎸 | 93B8 | 鐫 | 942B | |
郷 | 90F7 | 鄉 | 9109 | |
鄕 | 9115 | 鄉 | 9109 | |
麹 | 9EB9 | 麴 | 9EB4 | |
尭 | 5C2D | 堯 | 582F | |
瑶 | 7476 | 瑤 | 7464 | |
倶 | 5036 | 俱 | 4FF1 | correct value uses the "eye" element with extended strokes |
圏 | 570F | 圈 | 5708 | look at bottom of inside, not top, to see difference; fonts vary |
緒 | 7DD2 | 緖 | 7DD6 | use character with extra dot |
郎 | F92C | 郎 | 90CE | character produced using Korean 낭; correct character not available from Korean IME |
郞 | 90DE | 郎 | 90CE | character produced using Korean 랑; correct character not available from Korean IME |
繍 | 7E4D | 繡 | 7E61 | |
敍 | 654D | 敘 | 6558 | |
潊 | 6F4A | 漵 | 6F35 | |
毎 | 6BCE | 每 | 6BCF | |
甁 | 7501 | 瓶 | 74F6 | |
繋 | 7E4B | 繫 | 7E6B | Non-valid Japanese version only has "cart" element; valid Chinese version has extra "mountain" at bottom of "cart" element |
亷 | 4EB7 | 廉 | 5EC9 | |
０ | FF10 | 〇 | 3007 | FF10 is the "fullwidth digit zero", 3007 is the "ideographic number zero". In the Japanese IME, FF10 is called [全]数字, 3007 漢数字. Both are different from 0030 ("digit zero"), the Western zero, produced by the Chinese and Korean IME's, and available in the Japanese IME as [半]数字 |
‐ | 2010 | - | 002D | 2010 is the "fullwidth hyphen", 002D is the "half-width hyphen/minus". In the Japanese IME, 2010 is called [全]ハイフン, 002D [半]ハイフン、マイナス |
· | 00B7 | ・ | 30FB | 00B7 is the Western "middle dot", 30FB is the "ideographic centered point"; non-acceptance of 00B7 in CJK sequences is a temporary problem in some systems, officially both values are allowed; use Japanese IME, not Chinese IME, to access 30FB |
ʾ | 02BE | ʼ | 02BC | the mapping of the MARC-8 alif changed in March 2005 from 02BE ("modifier letter right half ring") to 02BC ("modifier letter apostrophe"); systems vary in which value they support |