VARCHAR2 (byte vs char)

A few days ago I got a question about the difference between defining a variable as varchar2(10 byte) and varchar2(10 char). I had to admit that I should have known this, but I didn't. I couldn't even tell what the default setting was when using varchar2(10). Actually, this turned out to be the opposite of what I would have guessed, so I definitely had to do some reading. This is what I found.

The default depends on the parameter NLS_LENGTH_SEMANTICS. When I checked this on my 11g installation, the default was, surprisingly, BYTE. So what is the difference between these two settings, CHAR and BYTE?
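You can check the current setting with the standard NLS dictionary views (a quick sketch; the value returned is simply BYTE or CHAR):

    -- Database-level default
    SELECT value
      FROM nls_database_parameters
     WHERE parameter = 'NLS_LENGTH_SEMANTICS';

    -- Setting for the current session
    SELECT value
      FROM nls_session_parameters
     WHERE parameter = 'NLS_LENGTH_SEMANTICS';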

With the BYTE setting, Oracle can store up to the number of bytes defined for the column. The number of possible characters then depends on the character set in use. With Unicode and other multi-byte character sets, the number of possible characters can be less than the number of bytes defined. By declaring varchar2(10 char) we are instead defining how many characters can be stored in the column, regardless of how many bytes they amount to.
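A small sketch to illustrate the two declarations (the table and data are made up, and I am assuming an AL32UTF8 database where æ, ø and å take two bytes each):

    CREATE TABLE t1 (
      col_byte VARCHAR2(10 BYTE),   -- room for at most 10 bytes
      col_char VARCHAR2(10 CHAR)    -- room for at most 10 characters
    );

    -- 'æøåæøåæøåæ' is 10 characters but 20 bytes in AL32UTF8
    INSERT INTO t1 (col_char) VALUES ('æøåæøåæøåæ');  -- succeeds
    INSERT INTO t1 (col_byte) VALUES ('æøåæøåæøåæ');  -- fails with ORA-12899: value too large for column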

With single-byte character sets there is no difference between the two settings, CHAR and BYTE: the number of bytes and the number of characters are the same. With multi-byte character sets, the number of bytes will be equal to or greater than the number of characters. With a fixed-width multi-byte character set such as UTF-16, the number of bytes will always be greater than the number of characters: since UTF-16 stores every character as 2 bytes, the number of characters is half the number of bytes.
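The LENGTH and LENGTHB functions show this nicely (still assuming an AL32UTF8 database):

    -- Character count vs. byte count for a string with Norwegian letters
    SELECT LENGTH('blåbærsyltetøy')  AS char_count,  -- 14 characters
           LENGTHB('blåbærsyltetøy') AS byte_count   -- 17 bytes (å, æ and ø take 2 bytes each)
      FROM dual;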

You might have noticed that many character functions have a related B-function: SUBSTR has a corresponding SUBSTRB, INSTR a corresponding INSTRB, and so on. The B stands for "byte". So when using a Unicode character set and defining a column as varchar2(10 byte), this is a case where you might need the corresponding B-functions, as in SUBSTRB(text, 1, 5).
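For example, SUBSTR counts characters while SUBSTRB counts bytes (again assuming AL32UTF8):

    SELECT SUBSTR('blåbær', 1, 4)  AS four_chars,  -- 'blåb' (the first 4 characters)
           SUBSTRB('blåbær', 1, 4) AS four_bytes   -- 'blå'  (b + l + å = 1 + 1 + 2 = 4 bytes)
      FROM dual;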

What about performance?
I have not found anything suggesting a performance difference between CHAR and BYTE. According to Tom Kyte, this is purely a table create issue:

    "but how would this affect performance?  It is a table create concept."

As long as we are using single-byte character sets (as we often do in the Nordic countries), it makes no difference whether we define our columns as varchar2(10 byte) or varchar2(10 char). We can then leave the NLS_LENGTH_SEMANTICS parameter at its default value, which is BYTE. If we are using a multi-byte character set, we probably have to analyse what should be used. Set the nls_length_semantics parameter to the desired value, and rely on the default by defining the datatype as, for instance, varchar2(10). Mixing char and byte semantics in the same database could lead to inconvenient consequences. What to choose depends on the situation, and I cannot think of any best practice here. But if you are using a variable-width multi-byte character set (such as AL32UTF8), I would imagine CHAR is preferable when you care about the number of characters that will fit.
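If you do go for CHAR semantics, one simple way is to change the session default before creating the tables (a sketch; the table and column names are just examples):

    -- Tables created after this, without an explicit BYTE/CHAR qualifier,
    -- will use CHAR semantics
    ALTER SESSION SET NLS_LENGTH_SEMANTICS = CHAR;

    CREATE TABLE t2 (
      name VARCHAR2(10)   -- now means 10 characters, not 10 bytes
    );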
