Posted by: repettas | May 16, 2008

Oracle Unicode Character Sets

Oracle started supporting Unicode-based character sets in Oracle 7. Below is a summary of the Unicode character sets that Oracle supports:

Oracle Unicode Supported Character Sets

AL24UTFFSS

AL24UTFFSS was the first Unicode character set supported by Oracle. It was introduced in Oracle 7.2. The AL24UTFFSS encoding scheme was based on the Unicode 1.1 standard, which is now obsolete. AL24UTFFSS has been desupported as of Oracle 9i. The migration path for existing AL24UTFFSS databases is to upgrade the database to 8.0 or 8.1, then convert the character set to UTF8 before upgrading the database further to 9i or 10g.

UTF8

UTF8 was the UTF-8 encoded character set introduced in Oracle 8 and 8i. It followed the Unicode 2.1 standard between Oracle 8.0 and 8.1.6, and was upgraded to Unicode version 3.0 for versions 8.1.7, 9i, 10g and 11g. To maintain compatibility with existing installations, this character set will remain at Unicode 3.0 in future Oracle releases. Although specific supplementary characters were not assigned in Unicode until version 3.1, the allocation for these characters was already defined in 3.0. So if supplementary characters are inserted into a UTF8 database, the actual data inside the database is not corrupted. Each supplementary character is treated as 2 separate undefined characters, occupying 6 bytes in storage. Oracle recommends that customers switch to AL32UTF8 for full supplementary character support.
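To make the 6-byte figure concrete, here is a minimal Python sketch. It assumes that "2 separate undefined characters" means each half of the UTF-16 surrogate pair is encoded as its own 3-byte UTF-8 sequence (the scheme the Unicode Consortium calls CESU-8). The function name is made up for illustration and is not an Oracle API.

```python
def encode_surrogates_separately(text: str) -> bytes:
    """Encode text so that each UTF-16 code unit becomes its own UTF-8 sequence.

    Illustrative only: this mimics the storage behaviour described above,
    under the assumption that it matches CESU-8.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" allows a lone surrogate to be encoded as 3 bytes
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


supplementary = "\U00020000"  # a CJK Extension B character, outside the BMP

six_byte_form = encode_surrogates_separately(supplementary)
four_byte_form = supplementary.encode("utf-8")

print(len(six_byte_form), six_byte_form.hex())    # 6 eda180edb080
print(len(four_byte_form), four_byte_form.hex())  # 4 f0a08080
```

Characters inside the Basic Multilingual Plane come out identical in both forms; only supplementary characters differ.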

UTFE

This is the UTF8 database character set for the EBCDIC platforms. It has the same properties as UTF8 on ASCII-based platforms. The EBCDIC Unicode transformation format is documented in Unicode Technical Report #16, UTF-EBCDIC, which can be found at http://www.unicode.org/unicode/reports/tr16/

AL32UTF8

This is the UTF-8 encoded character set introduced in Oracle 9i. AL32UTF8 is the database character set that supports the latest version (5.0 in Oracle 11.1) of the Unicode standard. It also provides support for the newly defined supplementary characters. All supplementary characters are stored as 4 bytes. AL32UTF8 was introduced because when UTF8 was designed (in the time of Oracle 8) there was no concept of supplementary characters; therefore UTF8 has a maximum of 3 bytes per character. Changing the design of UTF8 would break backward compatibility, so a new character set was introduced. The introduction of surrogate pairs should mean that no significant architecture changes are needed in future versions of the Unicode standard, so the current plan is to keep enhancing AL32UTF8 as necessary to support future versions of the Unicode standard. For example, in Oracle 10.1 this character set implemented the Unicode 3.2 standard, in Oracle 10.2 it was updated to support the Unicode 4.0.1 standard, and in Oracle 11.1 to the Unicode 5.0 standard.
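The byte counts above can be checked with a short Python sketch. It uses Python's standard UTF-8 codec as a stand-in for AL32UTF8, which is an assumption about equivalence rather than anything taken from Oracle. Both helper names are hypothetical.

```python
def utf8_byte_lengths(text: str) -> list[tuple[str, int]]:
    """Return (code point, byte count) for each character in standard UTF-8."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


def has_supplementary(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


sample = "A\u00e9\u20ac\U0001F600"  # 'A', e-acute, euro sign, an emoji
for code_point, size in utf8_byte_lengths(sample):
    print(code_point, size)

print(has_supplementary(sample))         # True: the emoji needs 4 bytes
print(has_supplementary("A\u00e9\u20ac"))  # False: fits in 3 bytes per character
```

A check like `has_supplementary` is one way to find out whether existing data contains characters that exceed the 3-byte limit described for UTF8.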

Please note that pre-Oracle 9 software can have some serious problems connecting to an AL32UTF8 database.

AL16UTF16

This is the first UTF-16 encoded character set in Oracle. It was introduced in Oracle 9i as the default national character set (NLS_NCHAR_CHARACTERSET). AL16UTF16 supports the latest version (5.0 in Oracle 11.1) of the Unicode standard. It also provides support for the newly defined supplementary characters. All supplementary characters are stored as 4 bytes. As with AL32UTF8, the plan is to keep enhancing AL16UTF16 as necessary to support future versions of the Unicode standard. AL16UTF16 cannot be used as a database character set (NLS_CHARACTERSET), only as the national character set (NLS_NCHAR_CHARACTERSET). The database character set is used to identify and to hold SQL, SQL metadata and PL/SQL source code. It must have either single-byte 7-bit ASCII or single-byte EBCDIC as a subset, whichever is native to the deployment platform. Therefore, it is not possible to use a fixed-width, multi-byte character set (such as AL16UTF16) as the database character set. Trying to create a database with AL16UTF16 as the database character set in 9i and up will give “ORA-12706: THIS CREATE DATABASE CHARACTER SET IS NOT ALLOWED”. AL16UTF16 is always in Big Endian byte order, regardless of the processor endianness.
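As a rough illustration of the storage sizes and byte order just described, the following Python sketch uses the standard `utf-16-be` codec as a stand-in for AL16UTF16. That equivalence is an assumption, and the helper name is made up for this example.

```python
def utf16_be_hex(text: str) -> str:
    """Hex dump of text in big-endian UTF-16 (most significant byte first)."""
    return text.encode("utf-16-be").hex()


print(utf16_be_hex("A"))           # 0041: one 2-byte code unit
print(utf16_be_hex("\u20ac"))      # 20ac: still one code unit
print(utf16_be_hex("\U00020000"))  # d840dc00: surrogate pair, 4 bytes

# For comparison, little-endian order swaps the bytes within each code unit.
print("A".encode("utf-16-le").hex())  # 4100
```

The first byte of `0041` is zero, so the plain ASCII letter is no longer a single 7-bit byte. This is consistent with the requirement above that the database character set contain single-byte ASCII or EBCDIC as a subset.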

There are only a few circumstances where you actually have an advantage of using the national characterset. In 99% of cases, simply use a UTF8 or AL32UTF8 database.

The following URLs contain a complete list of hex values and character descriptions for every Unicode character:

Unicode Version 5.0: http://www.unicode.org/Public/5.0.0/ucd/UnicodeData.txt
Unicode Version 4.0: http://www.unicode.org/Public/4.0-Update1/UnicodeData-4.0.1.txt
Unicode Version 3.2: http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt
Unicode Version 3.1: http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt
Unicode Version 3.0: http://www.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.txt
Unicode Versions 2.x: http://www.unicode.org/unicode/standard/versions/enumeratedVersions.html
Unicode Version 1.1: http://www.unicode.org/Public/1.1-Update/UnicodeData-1.1.5.txt

A description of the file format can be found at: http://www.unicode.org/Public/UNIDATA/UnicodeData.html

For a glossary of Unicode terms, see: http://www.unicode.org/glossary

At the locations above you can find the Unicode standard. All characters listed there are referenced by their UCS-2 code point.

Oracle currently has no plans to desupport UTF8; it simply encourages everyone to use AL32UTF8. All code points defined in UTF8 are also valid in AL32UTF8, so there is never an issue with going from UTF8 to AL32UTF8.

Responses

  1. Excellent writeup, informative and concise!

  2. Thanks for your information.

    Referring to the statement:
    “There are only a few circumstances where you actually have an advantage of using the national characterset.”

    Could you advise what those circumstances are? I am trying to determine the datatype to use (VARCHAR2 vs. NVARCHAR2) in an Oracle DB to hold East Asian language data with supplementary characters. I did some research on the web but have not found concrete information. With this question answered, I believe I can make the decision.

