Overview

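In Oracle 10g and later, the default national character set (used for NCHAR, NVARCHAR2, and NCLOB data) is AL16UTF16, while the character set most commonly used for Unicode databases and for client or application connections is AL32UTF8. The advantage of AL32UTF8 over Oracle's older UTF8 character set is that it properly supports Unicode supplementary characters.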

Supplementary Character

Characters whose code points in the Unicode standard fall in the range U+10000 to U+10FFFF; in other words, Unicode characters above U+FFFF.

  • In UTF-8 these characters are each 4 bytes long.
  • In UTF-16 these characters require 2 surrogates (16-bit units).
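
The two byte-length claims above can be checked outside the database. The following is a minimal sketch in Python, not Oracle code; the sample characters and the helper name `encoded_sizes` are chosen only for illustration.

```python
# Compare how a BMP character and a supplementary character are encoded.
samples = {
    "BMP (U+D55C)": "\ud55c",                 # Hangul syllable, below U+FFFF
    "Supplementary (U+20000)": "\U00020000",  # CJK Extension B ideograph
}

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 code-unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2  # 2 bytes per 16-bit unit
    return utf8_bytes, utf16_units

for label, ch in samples.items():
    utf8_bytes, utf16_units = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8_bytes} bytes, UTF-16 = {utf16_units} code unit(s)")
```

The supplementary character takes 4 bytes in UTF-8 and 2 code units (a surrogate pair) in UTF-16, while the BMP character takes 3 bytes and 1 code unit.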

Differences by Encoding

AL32UTF8

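  • Oracle's UTF8 character set supports Unicode only up to version 3.0 and is CESU-8 compliant: a character above U+FFFF is stored as two 3-byte surrogate sequences (6 bytes), which is not valid standard UTF-8. Exchanging such data as XML declared with encoding UTF-8 can therefore cause problems.
  • AL32UTF8 encodes characters above U+FFFF in 4 bytes. It extends UTF8 to fix that character set's inability to handle supplementary characters properly.
  • For NCHAR, NVARCHAR2, and NCLOB data, Oracle by default uses AL16UTF16 as the national character set instead of a UTF-8 encoding.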
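
To see why the older UTF8 character set and AL32UTF8 disagree on supplementary characters, the sketch below contrasts standard UTF-8 with a CESU-8 style encoding in Python. It assumes that Oracle's UTF8 behaves like CESU-8 for these characters, as Oracle's documentation describes; the function `cesu8_encode` is a hypothetical helper written for this note, not an Oracle or Python API.

```python
def cesu8_encode(text):
    """Encode text CESU-8 style: each UTF-16 code unit gets its own UTF-8 sequence."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U00020000"                # a supplementary character
standard = ch.encode("utf-8")    # what AL32UTF8 is expected to store
legacy = cesu8_encode(ch)        # what a CESU-8 style encoding produces

print(len(standard), standard.hex())
print(len(legacy), legacy.hex())
```

The standard encoding is 4 bytes, while the CESU-8 style encoding is 6 bytes and is rejected by strict UTF-8 decoders, which is the kind of interoperability problem described for XML above.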

AL16UTF16

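  • Stores each character as one 2-byte (16-bit) code unit; a supplementary character takes two code units (a surrogate pair), that is, 4 bytes.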

Length Semantics

In single-byte character sets, the number of bytes and the number of characters in a string are the same. In multibyte character sets, a character or code point consists of one or more bytes. Calculating the number of characters based on byte lengths can be difficult in a variable-width character set. Calculating column lengths in bytes is called byte semantics, while measuring column lengths in characters is called character semantics.

Character semantics is useful for defining the storage requirements for multibyte strings of varying widths. For example, in a Unicode database (AL32UTF8), suppose that you need to define a VARCHAR2 column that can store up to five Chinese characters together with five English characters. Using byte semantics, this column requires 15 bytes for the Chinese characters, which are three bytes long, and 5 bytes for the English characters, which are one byte long, for a total of 20 bytes. Using character semantics, the column requires 10 characters.
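
The arithmetic in this example can be reproduced with a short Python sketch. It assumes that Python's "utf-8" codec gives the same byte counts as AL32UTF8 for these characters; the sample string and the helper names are made up for illustration.

```python
def char_length(s):
    """Length under character semantics: number of code points."""
    return len(s)

def byte_length(s, encoding="utf-8"):
    """Length under byte semantics: number of bytes once encoded."""
    return len(s.encode(encoding))

chinese = "中文字符集"   # five Chinese characters, 3 bytes each in UTF-8
english = "ascii"       # five English characters, 1 byte each
value = chinese + english

print("characters:", char_length(value))
print("bytes:", byte_length(value))
print("chinese bytes:", byte_length(chinese), "english bytes:", byte_length(english))
```

Under these assumptions the string needs a column declared as at least VARCHAR2(20 BYTE) or VARCHAR2(10 CHAR).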

The following expressions use byte semantics:

  • VARCHAR2(20 BYTE)

  • SUBSTRB(string, 1, 20)

Note the BYTE qualifier in the VARCHAR2 expression and the B suffix in the SQL function name.

The following expressions use character semantics:

  • VARCHAR2(10 CHAR)

  • SUBSTR(string, 1, 10)
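
The difference between the two function families can be illustrated with a rough Python analogue. This is only a sketch: `substr_chars` and `substr_bytes` are hypothetical helpers, and dropping a character that is cut in half is an assumption of this sketch, not a statement about how Oracle's SUBSTRB treats partial characters.

```python
def substr_chars(s, start, length):
    """Character-based substring with a 1-based start, loosely like SUBSTR."""
    return s[start - 1:start - 1 + length]

def substr_bytes(s, start, length, encoding="utf-8"):
    """Byte-based substring with a 1-based start, loosely like SUBSTRB.

    Bytes belonging to a character cut at either edge are dropped
    (an assumption of this sketch).
    """
    raw = s.encode(encoding)[start - 1:start - 1 + length]
    return raw.decode(encoding, errors="ignore")

value = "中文字符集ascii"   # 10 characters, 20 bytes in UTF-8

print(substr_chars(value, 1, 10))  # whole string: 10 characters
print(substr_bytes(value, 1, 20))  # whole string: 20 bytes
print(substr_bytes(value, 1, 10))  # 10 bytes cover only three Chinese characters
```

The same numeric length therefore selects very different amounts of text depending on which semantics apply.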

