Oracle’s recommendation that if your environment (clients and servers) consists entirely of Oracle 9i or higher, use AL32UTF8 as NLS_CHARACTERSET, otherwise use UTF8.
For the time being do NOT use expdp/impdp (Data Pump) when going to UTF8 or another multibyte characterset on ALL 10g versions including 10.1.0.5 and 10.2.0.3. This also includes 184.108.40.206. It will provoke data corruption unless patch 5874989 is applied. This is caused by the impdp Bug 5874989. The older import/export executables work fine and should be used.
If you are using 10.1.0.5, 10.2.0.2, 10.2.0.3 or 220.127.116.11 you can request a backport for Bug 5874989 if there hasn’t been a patch issued for your platform.
Some Things to Be Aware Of:
For the database side you do not need “Unicode” support from the OS where the database is running on because the Oracle database is running on because the Oracle AL32UTF8 implementation is not depending on OS features.
Server Side Considerations
AL32UTF8 is a varying width characterset, which means that the code for a character can be 1, 2, 3, or 4 bytes long. This is a big difference with charactersets like WE8ISO8859P1 or WE8MSWIN1252.
US7ASCII characters (A-z,a-z,0-1 and ./?,*,# etc…) are in UTF8 1 byte, so for most West European languages the impact is rather limited as only “special” characters like the unprintable ones will use more bytes then in a 8 bit characterset. But if you convert a Cyrillic or Arabic system to AL32UTF8 then the data will take considerable more bytes to store.
This also means that you have to make sure your columns are big enough to store the additional bytes. By default the column size is defined in BYTES and not in CHARACTERS. So a “create table <name> (<column name> VARCHAR2(2000));” means that the column can store 2000 bytes.
From 9i onwards however, it is possible to really define a column with the number of CHARACTERS you want to store. How this works and what the limits and current known problems are will be explained another day.
There is a common misconception that a character is always the same code, for example the £is often referred to as “code 163″ character. This is not correct! The code itself means nothing if you do not know what characterset you are using. The difference may look small, but it’s not. The pound sign for example is indeed 163 (A3 in hexadecimal) in the WE8ISO8859P1 and WE8MSWIN1252 charactersets, but in AL32UTF8 the pound sign is code 49827 (C2 A3 in hex).
So be careful if you use for example the ASCII or CHR(<code>) function, the code for a character using CHR depends on the database characterset! If you do chr(163) in a AL32UTF8 database then this 163 code is an illegal character, as 163 simply does not exist, for the pound you should use chr(49827) in a AL32UTF8 system.
Instead of CHR() you might use Unistr(‘\<code>’). Unistr() (is a 9i new feature) always works on every characterset that can display the character.
Objects and User Names are Max 30 Bytes not Characters:
Identifiers of database objects are max 30 bytes long. Names must be from 1 to 30 bytes long with these exceptions:
- Names of databases are limited to 8 bytes.
- Names of database links can be as long as 128 bytes.
Oracle strongly suggests never to use non-US7ASCII names for a database or a database link. The following SQL statement will return all objects having a non-US7ASCII name:
select object_name from dba_objects where object_name <> convert(object_name,’US7ASCII’);
A username can be a maximum of 30 bytes long. The name can only contain characters from your database characterset and must follow the rules described in the section “Schema Object Naming Rules”. Oracle recommends that the user name contain at least one single-byte character regardless of whether the database character set also contains multibyte characters. The following SQL statement will return all users having a non-US7ASCII name:
select username from dba_users where username <> convert(username, ‘US7ASCII’);
Using CHAR semantics is for the moment not supported in the SYS schema and that’s where the database object and user names are stored. If you have objects or users with non-US7ASCII names that take more than 30 bytes in UTF8 there is no alternative besides renaming the affected objects or user to use a name that will occupy no more than 30 bytes.
The password for users can contain only single-byte characters from your database character set regardless of whether the character set also contains multi-byte characters.
This means that in a (AL32)UTF8 database the user password can only contain US7ASCII characters as they are the only single-byte characters in UTF8. This may provoke a problem, if you migrate from (for example) a CL8MSWIN1251 database then your users can use Cyrillic in their passwords seen in CL8MSWIN1251 cyrillic i single byte, in UTF8 it is not. Note that passwords are stored in a hashed way will NOT be seen in csscan. You will need to reset for those clients the password to US7ASCII string. This restriction has been lifted in 11g, there you can use multi-byte characters as a password string. Please note that they need to be updated in 11g before they use the new 11g hashing system.
Other Things to Watch Out For:
Make sure you use a correct NLS_LANG setting when using export/import when exporting or importing to or from a UTF8 database. Using DBMS_LOB.LOADFROMFILE may have some implications when used in a UTF8 database. When using SQLLDR make sure you define the correct characterset of the “file” when you load. String functions work with characters not byte (length, like, substr …). There are of course exceptions like lengthB, substrB and instrB who explicitly deal with bytes. Since you never know the exact length of a string in BYTES in a UTF8 environment, operations based on the BYTE length should be avoided, unless you have a compelling reason to use them.
If you do not use CHAR semantics then the column size will be tripled (a CHAR 20 byte will show up as 60 Bytes in the AL32UTF8 db). If you really need to use BYTE you can use the _keep_remote_column_size=true at the MV side, but be aware that this will provoke ORA-1401 or ORA-12899 as the data WILL expand when it contains non-US7ASCII characters. Avoid the use of a mixture of BYTE and CHAR semantics in the same table and _keep_remote_column_size=true is NOT compatible with using CHAR semantics.
The Client Side:
Most people think that the NLS_LANG should be UTF8 or AL32UTF8 because they are connecting to a AL32UTF8 database. This is not necessarily true, the NLS_LANG has in fact no relation with the database characterset. It’s purpose is to let oracle know what the client characterset is, so that Oracle can do the needed conversion.
Please make sure that the difference between a client who can’t connect to a Unicode database (which is any 8.0 and up client for a UTF8 database and any 9i+ client for AL32UTF8 database) and a “real” Unicode client. A “real” Unicode client means a client who can display/insert all characters known by Unicode without the need to recompile or change the OS settings.
Configuring Your Client To Be a Unicode Client on UNIX:
To hae a Unix Unicode client you need to configure your Unix environment first to use UTF8, then you have to check your telnet software to be able to use Unicode and then (as the last step) you can set the NLS_LANG environment variable equal to AMERICAN_AMERICA.AL32UTF8 and start SQL*Plus.
Configuring Your Client To Be a Unicode Client on Windows:
On Windows you cannot use sqlplusw.exe or sqlplus.exe as a Unicode / UTF8 client. Using sqlplusw.exe with NLS_LANG set to AL32UTF8 is totally incorrect. There are 2 Oracle provided Unicode clients you can use on Windows: iSQL*Plus and Oracle SQL Developer. If you want to use / write a Unicode application on Windows then the application should be specifically written to use the Windows Unicode API, setting NLS_LANG to AMERICAN_AMERICA.AL32UTF8 is not enough. You should consult the application vendor or the vendor of the development environment to see how to properly code in Unicode.
The popular tool TOAD is NOT able to run against (AL32)UTF8 databases. Toad does not support databases with a Unicode character set such as UTF8 or AL16UTF16. The International version of Toad is the only version of Toad that does support (AL32)UTF8.
All Oracle 9i and up clients are compatible with a AL32UTF8 database, even if they use a non-AL32UTF8 NLS_LANG setting:
Note: You can use sqlplusw.exe on a Western European / US Windows Client (using a NLS_LANG set to AMERICAN_AMERICA.WE8MSWIN1252, which is the correct value for a Western European US Windows system to connect to a AL32UTF8 database. However, the SQL*Plus client will only be able to insert / view Western European characters. If another user using SQL*Plus on a correctly configured Chinese Windows System inserts data then this will not be visible in the Western European SQL*Plus client. If you try to update Chinese data using the Western European client this will also fail.