PeopleSoft

Character Sets in the Workplace

May 1, 2008 · 1 Comment

Some musings on Oracle Character Sets:

Choosing between UTF8 and AL32UTF8 (9i and up) comes down to if your environment (database) is already set up (clients and servers) and is 9i or higher, use AL32UTF8 as NLS_CHARACTERSET, otherwise use UTF8.

People often think they NEED to use the NLS_NCHAR_CHARACTERSET (National Character set) to have UNICODE support in Oracle, this is not true, the NLS_NCHAR_CHARACTERST (used for NCHAR, NVARCHAR2 columns) is in 9i and up and always in Unicode, but you can use “normal” CHAR and VARCHAR2 columns for storing unicode. You simply need to use a database that has a AL32UTF8 / UTF8 NLS_CHARACTERSET.

select value from NLS_DATABASE_PARAMETERS where parameter = ‘NLS_CHARCTERSET’;

On the database side you do not need “Unicode” support from the OS where the database is running on because the Oracle AL32UTF8 implementation is not dependent on OS features.

AL32UTF8 is a varying width characterset, which means that the code for a character can be 1, 2, 3 or 4 bytes long.

US7ASCII characters (A-Z, a-z, 0-1 and ./?, *# etc…) are in UTF8 1 byte, so for most West European languages the impact is rather limited as only “special” characters like the euro symbol and others will use more bytes than in a 8 bit characterset. But if you convert a Cyrillic or Arabic system to AL32UTF8 then the data will take considerable more bytes to store.

This also means that you have to make sure your columns are big enough to store the additional bytes. By default the column size is defined in BYTES and not in CHARACTERS. So a “create table <tablename [ VARCHAR2 (2000)] means that the column can store 2000 BYTES.

The National Character set (NLS_NCHAR_CHARACTERSET) is a character set which is defined in addition to the (normal) database character set and is used for data stored in NCHAR, NVARCHAR2 and NCLOB columns.

Your current value for the NLS_NCHAR_CHARACTERSET can be found with the following select statement:

select value from NLS_DATABASE_PARAMETERS where parameter=’NLS_NCHAR_CHARACTERSET’;

You can not have more than 2 charactersets defined in Oracle:
The NLS_CHARACTERSET is used for CHAR, VARCHAR2, CLOB columns while the NLS_NCHAR_CHARACTERSET is used for NCHAR, NVARCHAR2, and NCLOB columns.

NLS_NCHAR_CHARACTERSET is defined when the database is created and specified with the CREATE DATABASE command. The NLS_NCHAR_CHARACTERSET defaults to AL16UTF16 if nothing is specified.

From 9i up the NLS_NCHAR_CHARACTERSET can only have 2 values: UTF8 or AL16UTF16 which are Unicode charactersets. A lot of people think they need to use the NLS_NCHAR_CHARACTERSET to have Unicode support in Oracle, but this is not true, NLS_NCHAR_CHARACTERSET (NCHAR, NVARCHAR2) is always Unicode in 9i but you can perfectly use “normal” CHAR and VARCHAR2 columns in a database that has a AL32UTF8 / UTF8 NLS_CHARACTERSET.

When trying to use another NATIONAL characterset the CREATE DATABASE command will fail with “ORA-12714 invalid national character set specified”

The character set identifier is stored with the column definition. The following describes the datatypes and the National Character Set they can use:

There are three datatypes which can store data in the national character set:

  1. NCHAR – a fixed-length national character set character string. The length of the column is ALWAYS defined in the characters (it always uses CHAR semantics)
  2. NVARCHAR2 – a variable-length national character set character string. The length of the column is ALWAYS defined in characters (it always uses CHAR semantics)
  3. NCLOB – stores national character set data of up to four gigabytes. Data is always stored in UCS2 or AL16UTF16, even if the NLS_NCHAR_CHARACTERSET is UTF8. This has very limited impact.

If you use N-types, DO use the (N’…’) syntax when coding it so literals are denoted as being in the national character set by prepending letter ‘N’, for example:

create table testa(a nvarchar2(100));
insert into testa values(N’this is a NLS_NCHAR_CHARACTERSET string’);

In 9i and up, however it is possible to really define a column with the number of CHARACTERS you want to store. You can use the Oracle process CSSCAN when going to UTF8. CSSCAN will generate 3 files:

The instchkc.err – Database Scan Individual Exception Report looks like this:

Picture 10.jpg

The instchkc.out – Character Set Scanner v.2. looks like:

Picture 11.jpg
Addressing the many different languages in the same environment, whether it be the application or the database, has been a problem for quite some time and adds a layer of complication and requires a level of abstraction in order to understand how a database stores, presents, interprets and processes data that is stored in the database representing multiple language(s).

Unfortunately there is no single character set that contains enough characters to handle the requirements of the day-to-day processing activities for today’s ebusiness operations. The European Union requires several different national character sets just to cover all of its various languages for example.

Even for a relatively simple language like English, the 7-bit ASCII character set cannot encode all of the letters, punctuation and technical symbols that are contain within it.

The problem with ASCII and other national character sets is that they are not universal; they do not contain enough characters for handling multiple languages. Most of the national character sets are created using ASCII as their encoding base. While it is possible to store some combinations of languages like English and French in the same character set, there are some combinations where this is not possible, German and Chinese are examples as are Russian and Korean, or even the combination of Traditional Chinese and Simplified Chinese. There is also the possibility of conflicts between character sets, because the same numeric code value can be used to represent two entirely different characters, or the same character can be represented by different numeric code values in different character sets.

To overcome these limitations of existing character encodings, several organizations started working on the creation of a “Global Character Set” in the late 1980s. The need for this became more pervasive with the development and proliferation of the World Wide Web in the mid-1990s. The Internet became the underlying protocol for how we conducted business and with its emphasis on “World” that made the need for a universal character set a serious requirement.

The requirements for a global character set demand that it contain all of the major living scripts, support for legacy data and implementations and be a simple enough implementation to be used worldwide. Additionally it has to support multilingual users and organizations, conform to international standards and promote the exchange of data at the global level. The character set that surfaced to fullfil these requirements became known as the Unicode character set.

The character set of a database defines how characters are stored in the database. Therefore you are limited to storing just the characters defined in the character set. If you change character sets there is a possibility that characters that you currently use are not defined in the new character set and therefore you could “corrupt” your data.

Unicode Datatype Features

The concept of the Unicode datatype, introduced in Oracle9i allows customers to support Unicode columns in a non-Unicode database. This is a very powerful feature and is further enhanced by the inter-operability between SQL NCHAR types and other datatypes. Users can store, process and retrieve SQL NCHAR data the same as SQL CHAR data.

There are several major aspects for the new Unicode datatype features that will be discussed in this section:

  • Character set encoding
  • Character length semantics
  • Interoperability
  • Data loss handling
  • Unicode string processing

Unicode Character Set Encoding

Oracle9i and Oracle Database 10g support two Unicode encodings for the Unicode datatypes. The Oracle character set names are AL16UTF16 and UTF8. When the database is first created AL16UTF16 or UTF8 can be specified as the “NATIONAL CHARACTER SET” parameter. When the national character set is not specified, the default is AL16UTF16. For details on Unicode encoding support, see Oracle’s Globalization Support Guide.

Supplementary Character is a Unicode encoded character having a Unicode code point between U+10000 and U+10FFFF. In UTF-16 encoding, it is encoded with a surrogate pair that consists of a sequence of two Unicode values, where the first value is a high-surrogate in the range U+D800 through U+DBFF and the second is a low-surrogate in the range of U+DC00 through U+DFFF.

Here is an example of creating a database with AL16UTF16 as the NCHAR character set:

CREATE DATABASE mydb
MAXINSTANCES 1
MAXLOGMEMBERS 4
MAXDATAFILES 1021
DATAFFILE SIZE 200M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
CHARACTER SET AL32UTF8
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 SIZE 100M,
GROUP 2 SIZE 100M,
GROUP 3 SIZE 100M;

To migrate a CHAR(1) column in 8859-1 to UTF-8 we would automatically expand it to a CHAR(4) to ensure it can always accommodate one character assuming that supplementary characters may be used later on. In cases where the physical limit of the data type is exceeded, limitations must be placed on the storable data.

Changing Length Semantics
Introduced in Oracle9i, length semantics offers an alternative to redefining the sizes of columns that need expansion for multibyte conversion typically to UTF-8. Length semantics are useful for both string handling and defining the storage requirements for multibyte strings of varying widths. In single-byte character sets, the number of bytes and the number of characters in a string are typically the same. In multibyte character sets, a character or code unit consists of one or more bytes. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.

Picture 9.jpg

For example, you are migrating from an 8 bit Western European character set to a Unicode database (AL32UTF8). Suppose you have existing VARCHAR2 columns that contain
The NLS_LENGTH_SEMANTICS initialization parameter determines whether columns of character datatypes use byte or character semantics when they are created. Byte semantics is the default for the database character set. Character length semantics is the default and the only allowable kind of length semantics for NCHAR datatypes. The use cannot specify the CHAR or BYTE qualifier for NCHAR definitions.

You can check if you have that problem by using the Oracle utility csscan.

Standard views and checks:

select * from NLS_SESSION_PARAMETERS order by parameter;
select * from NLS_INSTANCE_PARAMETERS order by parameter;
select * from NLS_DATABASE_PARAMETERS order by parameter;
select * from V$NLS_PARAMETERS order by parameter;

Standard Database Name:

select NAME from V$DATABASE;

Oracle Software versions:

select &_SQLPLUS_RELEASE sqlplus_release, &_O_RELEASE Oracle_release from dual;
QL> select &_SQLPLUS_RELEASE sqlplus_release, &_O_RELEASE Oracle_release from dual;
old 1: select &_SQLPLUS_RELEASE sqlplus_release, &_O_RELEASE Oracle_release from dual
new 1: select 1002000300 sqlplus_release, 1002000300 Oracle_release from dual

SQLPLUS_RELEASE ORACLE_RELEASE
————— ————–
1002000300 1002000300

select * from V$VERSION;

SQL> select * from v$version;

BANNER
—————————————————————-
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 – 64bi
PL/SQL Release 10.2.0.3.0 – Production
CORE 10.2.0.3.0 Production
TNS for HPUX: Version 10.2.0.3.0 – Production
NLSRTL Version 10.2.0.3.0 – Production

The character set of a database defines how characters are stored in the database. Therefore you are limited to storing just the characters defined in the character set. If you change character sets there is a possibility that characters that you currently use are not defined in the new character set and therefore you could “corrupt” your data.

Unicode Datatype Features

The concept of the Unicode datatype, introduced in Oracle9i allows customers to support Unicode columns in a non-Unicode database. This is a very powerful feature and is further enhanced by the inter-operability between SQL NCHAR types and other datatypes. Users can store, process and retrieve SQL NCHAR data the same as SQL CHAR data.

There are several major aspects for the new Unicode datatype features that will be discussed in this section:

  • Character set encoding
  • Character length semantics
  • Interoperability
  • Data loss handling
  • Unicode string processing

Unicode Character Set Encoding

Oracle9i and Oracle Database 10g support two Unicode encodings for the Unicode datatypes. The Oracle character set names are AL16UTF16 and UTF8. When the database is first created AL16UTF16 or UTF8 can be specified as the “NATIONAL CHARACTER SET” parameter. When the national character set is not specified, the default is AL16UTF16. For details on Unicode encoding support, see Oracle’s Globalization Support Guide.

Supplementary Character is a Unicode encoded character having a Unicode code point between U+10000 and U+10FFFF. In UTF-16 encoding, it is encoded with a surrogate pair that consists of a sequence of two Unicode values, where the first value is a high-surrogate in the range U+D800 through U+DBFF and the second is a low-surrogate in the range of U+DC00 through U+DFFF.

Here is an example of creating a database with AL16UTF16 as the NCHAR character set:

CREATE DATABASE mydb
MAXINSTANCES 1
MAXLOGMEMBERS 4
MAXDATAFILES 1021
DATAFFILE SIZE 200M AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
CHARACTER SET AL32UTF8
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 SIZE 100M,
GROUP 2 SIZE 100M,
GROUP 3 SIZE 100M;

To migrate a CHAR(1) column in 8859-1 to UTF-8 we would automatically expand it to a CHAR(4) to ensure it can always accommodate one character assuming that supplementary characters may be used later on. In cases where the physical limit of the data type is exceeded, limitations must be placed on the storable data.

Categories: Blogroll