Posted by: repettas | May 15, 2008

AL32UTF8/UTF8 (Unicode) Database Character Set Implications

Oracle recommends that if your environment (clients and servers) consists entirely of Oracle 9i or higher, you use AL32UTF8 as the NLS_CHARACTERSET; otherwise use UTF8.

For the time being do NOT use expdp/impdp (Data Pump) when going to UTF8 or another multibyte characterset on ALL 10g versions, including 10.1.0.5 and 10.2.0.3, as well as 11.1.0.6. Because of impdp Bug 5874989 it will provoke data corruption unless patch 5874989 is applied. The older import/export executables work fine and should be used instead.

If you are using 10.1.0.5, 10.2.0.2, 10.2.0.3 or 11.1.0.6 you can request a backport for Bug 5874989 if no patch has been issued for your platform yet.

Some Things to Be Aware Of:

On the database side you do not need “Unicode” support from the OS the database is running on, because Oracle’s AL32UTF8 implementation does not depend on OS features.

Server Side Considerations

AL32UTF8 is a varying-width characterset, which means that the code for a character can be 1, 2, 3, or 4 bytes long. This is a big difference from single-byte charactersets like WE8ISO8859P1 or WE8MSWIN1252, where every character is exactly 1 byte.

US7ASCII characters (A-Z, a-z, 0-9 and ./?,*,# etc.) are 1 byte in UTF8, so for most Western European languages the impact is rather limited, as only “special” characters like accented letters will use more bytes than in an 8-bit characterset. But if you convert a Cyrillic or Arabic system to AL32UTF8 then the data will take considerably more bytes to store.

This also means that you have to make sure your columns are big enough to store the additional bytes. By default the column size is defined in BYTES and not in CHARACTERS, so a “create table <name> (<column name> VARCHAR2(2000));” means that the column can store 2000 bytes, not necessarily 2000 characters.
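A quick way to see why a BYTE-sized column can overflow after conversion (a Python sketch; the 2000-byte limit mirrors the VARCHAR2(2000) example above):

```python
# A VARCHAR2(2000) column with BYTE semantics holds 2000 bytes, not
# 2000 characters. 1000 Euro signs are 1000 characters but 3000 bytes.
value = "€" * 1000
column_limit_bytes = 2000

char_count = len(value)                   # 1000 characters
byte_count = len(value.encode("utf-8"))   # 3000 bytes in UTF-8

print(char_count, byte_count)             # 1000 3000
print("fits" if byte_count <= column_limit_bytes else "overflows")  # overflows
```

Data that fit comfortably in a single-byte database can therefore raise ORA-12899 (value too large for column) after the characterset change.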

From 9i onwards however, it is possible to really define a column with the number of CHARACTERS you want to store. How this works and what the limits and current known problems are will be explained another day.

Codepoints:

There is a common misconception that a character always has the same code; for example the £ is often referred to as the “code 163” character. This is not correct! The code itself means nothing if you do not know what characterset you are using. The difference may look small, but it’s not. The pound sign, for example, is indeed 163 (A3 in hexadecimal) in the WE8ISO8859P1 and WE8MSWIN1252 charactersets, but in AL32UTF8 the pound sign is code 49827 (C2 A3 in hex).

So be careful if you use, for example, the ASCII or CHR(<code>) functions; the code for a character using CHR depends on the database characterset! If you do chr(163) in an AL32UTF8 database then 163 is an illegal character, as code 163 simply does not exist there; for the pound sign you should use chr(49827) in an AL32UTF8 system.
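The arithmetic behind 163 versus 49827 can be checked outside the database (a Python sketch; 49827 is simply the UTF-8 byte sequence C2 A3 read as a single number, which is how CHR() interprets its argument):

```python
pound = "£"

# In WE8ISO8859P1 / WE8MSWIN1252 the pound sign is the single byte 0xA3 = 163.
latin1_bytes = pound.encode("latin-1")
print(latin1_bytes.hex(), int.from_bytes(latin1_bytes, "big"))  # a3 163

# In UTF-8 it is the two bytes 0xC2 0xA3; read together that is 49827,
# the value you would pass to CHR() in an AL32UTF8 database.
utf8_bytes = pound.encode("utf-8")
print(utf8_bytes.hex(), int.from_bytes(utf8_bytes, "big"))      # c2a3 49827
```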

Instead of CHR() you can use UNISTR('\<code>'), a 9i new feature that takes the Unicode code point in hexadecimal (UNISTR('\00A3') for the pound sign) and works in every characterset that can represent the character.

Objects and User Names are Max 30 Bytes not Characters:

Identifiers of database objects are max 30 bytes long. Names must be from 1 to 30 bytes long with these exceptions:

  • Names of databases are limited to 8 bytes.
  • Names of database links can be as long as 128 bytes.

Oracle strongly suggests never to use non-US7ASCII names for a database or a database link. The following SQL statement will return all objects having a non-US7ASCII name:

select object_name from dba_objects where object_name <> convert(object_name, 'US7ASCII');

A username can be a maximum of 30 bytes long. The name can only contain characters from your database characterset and must follow the rules described in the section “Schema Object Naming Rules”. Oracle recommends that the user name contain at least one single-byte character regardless of whether the database character set also contains multibyte characters. The following SQL statement will return all users having a non-US7ASCII name:

select username from dba_users where username <> convert(username, 'US7ASCII');

Using CHAR semantics is for the moment not supported in the SYS schema, and that is where database object and user names are stored. If you have objects or users with non-US7ASCII names that take more than 30 bytes in UTF8, there is no alternative but to rename the affected objects or users to names that occupy no more than 30 bytes.

The password for users can contain only single-byte characters from your database character set regardless of whether the character set also contains multi-byte characters.

This means that in an (AL32)UTF8 database the user password can only contain US7ASCII characters, as they are the only single-byte characters in UTF8. This may provoke a problem: if you migrate from, for example, a CL8MSWIN1251 database, your users may have Cyrillic characters in their passwords; in CL8MSWIN1251 Cyrillic is single-byte, in UTF8 it is not. Note that passwords are stored in hashed form and will NOT be seen in csscan. You will need to reset the passwords of those users to US7ASCII strings. This restriction has been lifted in 11g, where you can use multi-byte characters in a password; note that such passwords need to be updated in 11g before they use the new 11g hashing scheme.
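Where the plaintext is known (the stored hashes, as noted above, reveal nothing), strings that must stay single-byte in UTF8 can be screened with a simple check; a Python sketch (the sample passwords are made up):

```python
def is_us7ascii(s: str) -> bool:
    """True if every character fits in 7-bit ASCII, i.e. single-byte in UTF-8."""
    return all(ord(ch) < 128 for ch in s)

print(is_us7ascii("Secret123"))  # True  - safe pre-11g password
print(is_us7ascii("Пароль123"))  # False - Cyrillic, multibyte in UTF-8
```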

Other Things to Watch Out For:

A few more points to keep in mind:

  • Make sure you use a correct NLS_LANG setting when using export/import to or from a UTF8 database.
  • DBMS_LOB.LOADFROMFILE may have some implications when used in a UTF8 database.
  • When using SQL*Loader, make sure you define the correct characterset of the file you load.
  • String functions (LENGTH, LIKE, SUBSTR, …) work with characters, not bytes. The exceptions, of course, are LENGTHB, SUBSTRB and INSTRB, which explicitly deal with bytes. Since you never know the exact length of a string in BYTES in a UTF8 environment, operations based on the byte length should be avoided unless you have a compelling reason to use them.
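Why byte-based operations are risky in UTF-8 is easiest to see by cutting a string at a byte boundary (a Python sketch of what a SUBSTRB-style operation can do to multibyte data):

```python
text = "naïve"              # 5 characters, but 6 bytes in UTF-8
raw = text.encode("utf-8")
print(len(text), len(raw))  # 5 6

# Character-based substring (like SUBSTR): always safe.
print(text[:3])             # naï

# Byte-based substring (like SUBSTRB): can split the 2-byte 'ï' in half,
# leaving an invalid UTF-8 sequence.
try:
    raw[:3].decode("utf-8")
except UnicodeDecodeError as exc:
    print("broken multibyte sequence:", exc.reason)
```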

If you do not use CHAR semantics then the column size will be tripled (a 20 BYTE CHAR column will show up as 60 bytes in the AL32UTF8 database). If you really need to keep BYTE sizes you can set _keep_remote_column_size=true on the materialized view side, but be aware that this will provoke ORA-01401 or ORA-12899 errors, as the data WILL expand when it contains non-US7ASCII characters. Avoid mixing BYTE and CHAR semantics in the same table; also note that _keep_remote_column_size=true is NOT compatible with CHAR semantics.

The Client Side:

Most people think that NLS_LANG should be UTF8 or AL32UTF8 because they are connecting to an AL32UTF8 database. This is not necessarily true; NLS_LANG has in fact no relation to the database characterset. Its purpose is to let Oracle know what the client characterset is, so that Oracle can do the needed conversion.

Please make sure you understand the difference between a client that can connect to a Unicode database (which is any 8.0 and up client for a UTF8 database and any 9i and up client for an AL32UTF8 database) and a “real” Unicode client. A “real” Unicode client is one that can display and insert all characters known by Unicode without the need to recompile or change OS settings.

Configuring Your Client To Be a Unicode Client on UNIX:

To have a Unix Unicode client you need to configure your Unix environment to use UTF8 first, then check that your telnet/terminal software can handle Unicode, and then (as the last step) set the NLS_LANG environment variable to AMERICAN_AMERICA.AL32UTF8 and start SQL*Plus.
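A minimal sketch of the settings involved (the locale name and connect string are examples; the actual values depend on your platform and listener configuration):

```shell
# 1. Make the Unix session itself UTF-8 (locale names vary per platform;
#    run `locale -a` to see what your system offers).
export LANG=en_US.UTF-8

# 2. Make sure your telnet/ssh/terminal software renders UTF-8 too, then
#    tell Oracle what the client characterset is via NLS_LANG.
export NLS_LANG=AMERICAN_AMERICA.AL32UTF8

# 3. Only now start SQL*Plus (connect string is an example):
# sqlplus scott/tiger@orcl
echo "$NLS_LANG"
```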

Configuring Your Client To Be a Unicode Client on Windows:

On Windows you cannot use sqlplusw.exe or sqlplus.exe as a Unicode / UTF8 client. Using sqlplusw.exe with NLS_LANG set to AL32UTF8 is totally incorrect. There are two Oracle-provided Unicode clients you can use on Windows: iSQL*Plus and Oracle SQL Developer. If you want to use or write a Unicode application on Windows then the application should be specifically written to use the Windows Unicode API; setting NLS_LANG to AMERICAN_AMERICA.AL32UTF8 is not enough. You should consult the application vendor or the vendor of the development environment to see how to properly code for Unicode.

The popular tool TOAD is NOT able to run against (AL32)UTF8 databases: it does not support databases with a Unicode character set such as UTF8 or AL16UTF16. The International version of Toad is the only version that does support (AL32)UTF8.

All Oracle 9i and up clients are compatible with an AL32UTF8 database, even if they use a non-AL32UTF8 NLS_LANG setting:

Note: You can use sqlplusw.exe on a Western European / US Windows client (with NLS_LANG set to AMERICAN_AMERICA.WE8MSWIN1252, which is the correct value for a Western European / US Windows system) to connect to an AL32UTF8 database. However, that SQL*Plus client will only be able to insert and view Western European characters. If another user on a correctly configured Chinese Windows system inserts data through SQL*Plus, it will not be visible in the Western European SQL*Plus client, and trying to update Chinese data from the Western European client will also fail.


Responses

  1. Where can I get patch 5874989? From Oracle’s Metalink site, which requires a user account.

  2. Hi, I was trying to install Oracle UCM when I received the message “the oracle database is not using al32utf8 as its configured character encoding”; any idea what to do?

  3. Reinstall Oracle database and select the “Custom Setup database” option. It will ask you to choose the character set.

  4. [...] public links >> utf8 AL32UTF8/UTF8 (Unicode) Database Character Set Implications Saved by astruyk on Sat 06-12-2008 Force MySQL Encoding to UTF8 Saved by fredrmartins on Mon [...]

  5. Good contribution thanks

  6. Truly thanks for your helpful article, especially the section “Configuring Your Client To Be a Unicode Client on Windows” is what I am looking for … Neither TOAD nor PL/SQL supports Unicode at client side, but Oracle SQL Developer does. The download link is here: http://www.oracle.com/technology/software/products/sql/index.html

    Regards,
    pitini

  7. We are using WE8MSWIN1252 character set in QA and would like to revert back to AL32UTF8 which does not work according to Oracle support. The technician that initially set up the database changed the character set to WE8MSWIN1252. How can we convert back to AL32UTF8. The resource is encountering problems in workflow if we leave WE8MSWIN1252 character set as is in QA. What can we do?

  8. Correction from previous e-mail

    We are using WE8MSWIN1252 character set in DEV and would like to revert back to AL32UTF8 which does not work according to Oracle support. The technician that initially set up the database changed the character set to WE8MSWIN1252. How can we convert back to AL32UTF8. The resource is encountering problems in workflow if we leave WE8MSWIN1252 character set as is in DEV. What can we do?

  9. PL/SQL Developer is able to handle AMERICAN_AMERICA.AL32UTF8 on the client side – on windows simply set your NLS_LANG environment variable to AMERICAN_AMERICA.AL32UTF8 and you’re good to go

  10. [...] http://repettas.wordpress.com [...]

  11. [...] CHINESE_CHINA.AL32UTF8 AMERICAN_AMERICA.ZHS16GBKLinksOracle Database Character Sets AL32UTF8/UTF8 (Unicode) Database Character Set Implicationshttp://blog.csdn.net/javatim/archive/2008/07/16/2661187.aspx [...]

  12. We are working on a data migration project from a database having AL32UTF8 characterset (source database) to UTF8 characterset (target) database. All the special characters are appearing as inverted question mark post migration. We need to inform the business on what are the special characters supported by AL32UTF8 which is not supported by UTF8 and would not get migrated properly. Could you please let me know where I could find the list of all characters that is supported by individual charactersets.

  13. You say “The popular tool TOAD is NOT able to run against (AL32)UTF8 databases. Toad does not support databases with a Unicode character set such as UTF8 or AL16UTF16. The International version of Toad is the only version of Toad that does support (AL32)UTF8″. Can you explain why you think that? I use TOAD to access a number of different databases using AL32UTF8 as the central character-set, and I’ve not encountered any problems. I’m running TOAD 9.7.2.5, so not the most up-to-date version.

  14. I want to show my Arabic data on a 3rd-party page but it appears like “?????”. Please help me out with a solution. Thanks.

  15. Hi,

    We are converting WE8ISO8859P1 oracle db characterset to AL32UTF8. Before conversion, i want to check implication on PL/SQL code for byte based SQL functions.

    What all points to consider while checking implications on PL/SQL code?

    I could find 3 methods on google surfing, SUBSTRB, LENGTHB, INSTRB. What do I check if these methods are used in PL/SQL code?

    What all other methods should I check?
    What do I check in PL/SQL if varchar and char type declarations exist in code?
    How do i check implication of database characterset change to AL32UTF8 for byte bases SQL function.

    My PL/SQL code uses SUBSTR, LENGTH functions. Could these methods need change on database character set change to AL32UTF8?

    Thanks in Advance.

    Regards,
    Rashmi

  16. Hi there to every body, it’s my first go to see of this web site; this weblog includes awesome and in fact fine stuff in support
    of visitors.

