
Chinese, Japanese, Korean (CJK), and Unicode Text Encoding Glossary

 

 

Don't worry, it's not rocket science... it's brain surgery!

  • Japanese Encoding
    • Shift JIS - the most popular internal code used in Japan (used in Mac and Windows).
    • EUC-JP (also written EUC-JIS) - a coding standard very popular on UNIX-based systems and in PC software.
    • 7-Bit JIS - including NEW-JIS, OLD-JIS and NEC-JIS. NEW-JIS is similar to ISO-2022-JP.
    • ISO-2022-JP - an international Internet standard (RFC 1468) for encoding Japanese text in 7 bits.

  • Chinese Encoding
    • Big5 - commonly used in Taiwan and Hong Kong for traditional Chinese writing.
    • GB - commonly used in China and Singapore for simplified Chinese writing.
    • HZ - an Internet convention for encoding GB text in 7 bits, popular in newsgroups and email.
    • ISO-2022-GB (registered for Internet use as ISO-2022-CN, RFC 1922) - an international Internet standard for encoding Chinese text.
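
    As a rough illustration of the HZ convention listed above, the sketch below encodes a short mixed string with Python's built-in "hz" codec. The sample text and variable names are illustrative only. The point is that GB characters travel as plain 7-bit bytes, with the GB runs marked off inside the byte stream.

```python
# Minimal sketch: HZ carries GB text using only 7-bit bytes.
text = "中文 GB text"           # illustrative sample string
raw = text.encode("hz")         # Python's standard-library HZ codec

print(raw)                      # GB runs appear between ~{ and ~}
print(max(raw) < 0x80)          # every byte fits in 7 bits
print(raw.decode("hz") == text) # decoding restores the original text
```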

  • Korean Encoding
    • KSC5601 - The most popular internal code used in Korea.
    • ISO-2022-KR - an international Internet standard (RFC 1557) for encoding Korean text.

  • DBCS (e.g. Shift-JIS, GBK, KSC, Big5)
    DBCS stands for Double-Byte Character Set. Which codeset it refers to depends on the platform: on Japanese Windows DBCS means Shift-JIS, and on Hangeul (Korean) Windows it means KSC.
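
    To make the point concrete, the following sketch encodes one character with four double-byte codesets using Python's standard codecs. The codec names are Python's own ("euc_kr" stands in here for KSC), and the sample character is an arbitrary choice. Each codeset assigns the character a different 2-byte code.

```python
# Minimal sketch: the same character has a different 2-byte code in each DBCS.
text = "日"  # arbitrary sample character present in all four codesets
codec_names = ["shift_jis", "gbk", "euc_kr", "big5"]

encoded = {name: text.encode(name) for name in codec_names}
for name, raw in encoded.items():
    print(f"{name:10s} {raw.hex()}  ({len(raw)} bytes)")
```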

  • EUC-JP
    A codeset that mixes 1-byte and 2-byte characters. It is built on JIS X 0208 and conforms to ISO-2022. The lead byte of a double-byte character is in the range 0xA1-0xFE. Half-width katakana are represented as 2-byte codes (the byte 0x8E followed by the katakana byte).
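
    The byte ranges above can be checked with Python's built-in "euc_jp" codec. This is only a sketch; the two sample characters are arbitrary.

```python
# Minimal sketch: EUC-JP byte layout for a kanji and a half-width katakana.
kanji = "\u65e5".encode("euc_jp")   # 日 (JIS X 0208)
kana = "\uff71".encode("euc_jp")    # ｱ, half-width katakana A

print(kanji.hex())  # lead byte lies in 0xA1-0xFE
print(kana.hex())   # 0x8E prefix followed by the katakana byte
```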

  • JIS(ISO-2022-JP)
    This is a 7-bit multibyte codeset. It is built on JIS X 0208 and conforms to ISO-2022. It does not support half-width katakana.
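
    A short sketch with Python's built-in "iso2022_jp" codec shows what "7-bit and multibyte" looks like in practice: an escape sequence (ESC $ B) switches to JIS X 0208, each kanji then takes two 7-bit bytes, and a second escape sequence (ESC ( B) returns to ASCII. The sample text is arbitrary.

```python
# Minimal sketch: ISO-2022-JP output stays within 7 bits and uses escape sequences.
raw = "\u65e5\u672c".encode("iso2022_jp")   # 日本

print(raw)               # ESC $ B, two bytes per kanji, then ESC ( B
print(max(raw) < 0x80)   # no byte has the high bit set
```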

  • Unicode(UCS-2)
    This codeset corresponds to Plane 0 (the Basic Multilingual Plane) of ISO-10646. It is defined by Unicode 1.X. As a fixed 16-bit code, it can accommodate about 65 thousand characters.
     

  • Unicode(UTF-16)
    This codeset is a newer form of Unicode, defined by Unicode 2.X.
    It partially abandons the fixed 16-bit code: characters outside Plane 0 are represented by a pair of 16-bit units (a surrogate pair), so it can accommodate about 1 million characters. Unicode (UCS-2) is a subset of UTF-16.
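
    The surrogate-pair arithmetic can be sketched as follows. The function name to_utf16_units is hypothetical, and the sketch skips validation (it does not reject code points that fall in the surrogate range itself).

```python
# Minimal sketch: split a code point into UTF-16 code units.
def to_utf16_units(code_point):
    """Return the list of 16-bit units that represent code_point in UTF-16."""
    if code_point < 0x10000:
        return [code_point]              # same single unit as UCS-2
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return [high, low]

units = to_utf16_units(0x1F600)
print([hex(u) for u in units])
# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print(chr(0x1F600).encode("utf-16-be").hex())
```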

  • UTF-7
    This is a 7-bit serialized format, which can be safely used with older e-mail routers.
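
    As a small illustration, Python's built-in "utf-7" codec can show that non-ASCII text ends up as 7-bit bytes and survives a round trip. The sample string is arbitrary.

```python
# Minimal sketch: UTF-7 output contains only 7-bit bytes.
text = "Hi \u65e5\u672c"          # "Hi 日本"
raw = text.encode("utf-7")

print(raw)
print(max(raw) < 0x80)            # safe for 7-bit transports
print(raw.decode("utf-7") == text)
```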

  • UTF-8
    This is a multibyte format converted from UCS-4, as defined by ISO-10646. In its original definition it accommodates about 2 billion characters, and a character needs from 1 to 6 bytes (the later RFC 3629 definition limits sequences to 4 bytes). Characters 0 through 127 have the same codes as US-ASCII. The roughly 1 million characters of UTF-16 convert to and from UTF-8 without loss.
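
    The variable width and the ASCII compatibility can be seen with Python's built-in codecs. The sample characters are arbitrary. Python's encoder follows the 4-byte limit, so this sketch does not show 5- or 6-byte sequences.

```python
# Minimal sketch: UTF-8 byte counts grow with the code point.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]   # A, é, 日, an emoji
lengths = [len(s.encode("utf-8")) for s in samples]
print(lengths)

# Characters 0-127 are byte-for-byte identical to US-ASCII.
print("A".encode("utf-8") == "A".encode("ascii"))

# UTF-16 data converts to UTF-8 and back without loss.
utf16 = "\U0001F600".encode("utf-16-be")
round_trip = utf16.decode("utf-16-be").encode("utf-8").decode("utf-8").encode("utf-16-be")
print(round_trip == utf16)
```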

  • UCS-4
    This format has 32-bit fixed width, which is defined by ISO-10646.

  • Java Source
    This format is used in Java source files.
    This is basically ASCII, but each code point of 128 or above is represented by four hexadecimal digits in the form \uXXXX.
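
    A conversion into this form might look like the sketch below. The function name to_java_escapes is hypothetical, and it assumes that characters outside Plane 0 are written as two \uXXXX escapes, one per UTF-16 unit. It leaves ASCII characters, including backslashes, unchanged.

```python
# Minimal sketch: turn non-ASCII characters into \uXXXX escapes.
def to_java_escapes(text):
    """Return text with every non-ASCII character replaced by \\uXXXX escapes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # ASCII passes through unchanged
            continue
        units = ch.encode("utf-16-be")           # one or two 16-bit units
        for i in range(0, len(units), 2):
            unit = int.from_bytes(units[i:i + 2], "big")
            out.append("\\u%04x" % unit)
    return "".join(out)

print(to_java_escapes("\u65e5\u672c ok"))
```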

  • Unicode 3.0 with Language Tags
    This is a newer Unicode with language tags, defined by Unicode 3.0.
    Details of the format are available from the Unicode Consortium.
    When you publish your data, you must replace the Unicode language tags with XML language tags or a similar mechanism.

  • Cho-Kanji TRON code
    This is a new TRON code adopted by BTRON OS "Cho-Kanji".
    Details of the format are available from Personal Media Inc.

  • ISO-2022-ESC B(TM spec)
    This conforms fully to ISO-2022, and all characters in this format are represented in 7 bits.

  • Shift-Mojikyo(TM spec)
    The character code is Unicode, but it differs in that Mojikyo characters are assigned to the private use area.
    The Mojikyo TrueType fonts must be installed.
    More information is available at Mojikyo Net.

  • Unicode + (&M;)Mojikyo Tag
    This format adds tags that designate a Mojikyo number to Unicode text.
    The tag format is as follows.
    &Mnnnnnn;
    • it starts with '&M' and ends with ';'.
    • 6 decimal digits sit between '&M' and ';'.
    • all 6 digits are required; a number shorter than 6 digits is padded with leading '0's.
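
    The rules above can be sketched as a formatter and a parser. The names format_tag and parse_tags are hypothetical, and the sketch assumes Mojikyo numbers fit in six decimal digits, as the tag format implies.

```python
# Minimal sketch: build and find &Mnnnnnn; tags.
import re

TAG_PATTERN = re.compile(r"&M(\d{6});")   # '&M', exactly six digits, ';'

def format_tag(number):
    """Return the &M tag for a Mojikyo number, zero-padded to six digits."""
    if not 0 <= number <= 999999:
        raise ValueError("number does not fit in six digits")
    return f"&M{number:06d};"

def parse_tags(text):
    """Return the Mojikyo numbers of all &M tags found in text."""
    return [int(digits) for digits in TAG_PATTERN.findall(text)]

print(format_tag(1234))
print(parse_tags("abc&M001234;def&M123456;"))
```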

  • DBCS + (&M;)Mojikyo Tag
    This format adds tags that designate a Mojikyo number to DBCS text.
    For the tag format, see Unicode + (&M;)Mojikyo Tag.
    DBCS here means Shift-JIS, Big5 and so on.

  • Unicode + (@;)Mojikyo Tag
    This format adds tags that designate a Mojikyo number to Unicode text.
    The tag format is as follows.
    @nnnnnn;
    • it starts with '@' and ends with ';'.
    • 6 decimal digits sit between '@' and ';'.
    • all 6 digits are required; a number shorter than 6 digits is padded with leading '0's.
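
    Because the '@' form differs from the '&M' form only in its prefix, converting between them is a simple substitution. The sketch below shows one possible way; the function name at_to_amp is hypothetical. Requiring exactly six digits and a closing ';' keeps ordinary uses of '@', such as e-mail addresses, untouched.

```python
# Minimal sketch: rewrite @nnnnnn; tags as &Mnnnnnn; tags.
import re

AT_TAG = re.compile(r"@(\d{6});")   # '@', exactly six digits, ';'

def at_to_amp(text):
    """Return text with every @nnnnnn; tag rewritten as &Mnnnnnn;."""
    return AT_TAG.sub(r"&M\1;", text)

print(at_to_amp("x@001234;y"))
print(at_to_amp("user@example.com"))
```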

  • DBCS + (@;)Mojikyo Tag
    This format adds tags that designate a Mojikyo number to DBCS text.
    For the tag format, see Unicode + (@;)Mojikyo Tag.
    DBCS here means Shift-JIS, Big5 and so on.

 


webmaster@riverlion.com
http://www.riverlion.com/
©2002 riverlion.com