11gR2: “Unlucky” combination of a new feature, a fix, application design and code

Sometimes a new feature (and/or a fix) can give some very unexpected results. In this article I'll show how a mix of an 11g new feature (Adaptive Cursor Sharing), a fix (the obsolete threshold in 11.2), application design (a separate schema for every customer) and code issues (not prefixing objects with the schema name, and bind datatype mismatches) caused some major problems in a production system.

The other day I was contacted by some people having trouble with their production database. Over the last few months they had experienced several occasions where their application started to time out and the database seemed to be unavailable. Every time, they had to restart the database to get out of the situation. The Statspack report covering the "downtime" was not available (and the database was not licensed with the Diagnostics Pack featuring AWR and ASH); the snapshot after the incident never completed. One observation common to every incident was the CPU running at 100%. So what went wrong?

After getting access to the database, the reason for the problem was rather obvious. From several years on performance taskforces on 9i, I have seen many occasions where not using bind variables created similar problems in a database. After Oracle 10g came around, substituting some of the latches with more lightweight mutexes, I haven't seen "latch free" waits cause a database to become unavailable. Generally, I also think most developers have become more aware of the importance of using bind variables. But there are still many developers who don't know how to use binds with IN-lists, and object-relational mapping APIs such as Hibernate might handle these inefficiently, using a bind variable for every value in the IN-list: IN (:b1, :b2, :b3 …). This, together with the fact that the CPUs were running at 100% utilization and never recovering, made me run the following query (and the result was rather surprising … or was it really?):

Note! The sql_text values below are changed to protect the “innocents”.

SQL> select substr(sql_text,1,50), count(*) 
 2   from v$sql 
 3   group by substr(sql_text,1,50) 
 4   having count(*)>=100 
 5   order by 2 desc;

SUBSTR(SQL_TEXT,1,50)                                COUNT(*)
-------------------------------------------------- ----------
insert into ...                                         14939
update ...                                               1629
update ...                                               1047
update ...                                                961
update ...                                                863
update ...                                                489
update ...                                                404
insert into ...                                           288
update ...                                                279
update ...                                                240
insert into ...                                           226
...
99 rows selected.

Looking at these numbers I first thought I’d found a bind issue. But the next queries showed something else:

SQL> select sql_text from v$sql where sql_text like '%' and rownum<=500;

SQL_TEXT
----------------------------------------------------------------------------------------------------
insert into t (col1, col2, col3, col4, col5, col6) values( :b1, :b2, :b3, :b4, :b5, :b6)
insert into t (col1, col2, col3, col4, col5, col6) values( :b1, :b2, :b3, :b4, :b5, :b6)
insert into t (col1, col2, col3, col4, col5, col6) values( :b1, :b2, :b3, :b4, :b5, :b6)
insert into t (col1, col2, col3, col4, col5, col6) values( :b1, :b2, :b3, :b4, :b5, :b6)
...

...

500 rows selected.

Actually – I also did a count on the corresponding SQL-id:

SQL> select count(*) from v$sql where sql_id='0qpgcrnkw20sm';

COUNT(*)
----------
14909

Oracle 11g introduced the new feature "Adaptive Cursor Sharing". This feature was meant to deal with the well-known issues around bind peeking and "binds vs histograms". When companies started to upgrade to Oracle 11g Release 1, some started to experience issues with growing numbers of child cursors. Creating more child cursors was actually the intention of adaptive cursor sharing, but getting too many of them got some databases into trouble. Therefore, before 11g Release 2, Oracle launched a fix: the child cursors of a parent are marked as obsolete once the parent reaches 100 children (v$sql.is_obsolete). For this very reason, the result of the next query was expected:

SQL> select sql_id, child_number, is_obsolete, max(cnt)
 2   from (select sql_id, child_number, is_obsolete,
 3                count(*) over(partition by sql_id, child_number, is_obsolete) cnt, 
 4                count(*) over(partition by sql_id, child_number) total_cnt
 5         from v$sql)
 6   where total_cnt > 1 and sql_id='0qpgcrnkw20sm'
 7   group by sql_id, child_number, is_obsolete
 8   order by sql_id, child_number;

SQL_ID        CHILD_NUMBER IS_OBSOLETE   MAX(CNT)
------------- ------------ ----------- ----------
0qpgcrnkw20sm            0 Y                  138
0qpgcrnkw20sm            0 N                    1
0qpgcrnkw20sm            1 Y                  141
0qpgcrnkw20sm            1 N                    1
0qpgcrnkw20sm            2 Y                  142
0qpgcrnkw20sm            2 N                    1
...
0qpgcrnkw20sm           63 Y                  140
0qpgcrnkw20sm           63 N                    1
0qpgcrnkw20sm           64 N                    1
0qpgcrnkw20sm           64 Y                  144
0qpgcrnkw20sm           65 Y                  141
0qpgcrnkw20sm           65 N                    1
0qpgcrnkw20sm           66 N                    1
0qpgcrnkw20sm           66 Y                  141
0qpgcrnkw20sm           67 Y                  145
0qpgcrnkw20sm           67 N                    1
0qpgcrnkw20sm           68 Y                  143
0qpgcrnkw20sm           69 Y                  142
0qpgcrnkw20sm           70 Y                  140
0qpgcrnkw20sm           71 Y                  142
0qpgcrnkw20sm           72 Y                  141
0qpgcrnkw20sm           73 Y                  143
0qpgcrnkw20sm           74 Y                  146
...
0qpgcrnkw20sm           96 Y                  139
0qpgcrnkw20sm           97 Y                  143
0qpgcrnkw20sm           98 Y                  139
0qpgcrnkw20sm           99 Y                  130
 
173 rows selected.
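
As a side note, v$sql exposes the adaptive cursor sharing flags right next to is_obsolete. A quick way to check whether adaptive cursor sharing is actually kicking in for a statement is something like this (a minimal sketch, reusing the sql_id from above):

select sql_id, child_number, is_bind_sensitive, is_bind_aware, is_obsolete
from v$sql
where sql_id = '0qpgcrnkw20sm'
order by child_number;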

Because of the large number of child cursors I wrote a small Perl program running the following query in a loop:

select to_char(sysdate,'HH24:MI:SS'), count(*) 
from v$sql where sql_id='0qpgcrnkw20sm' and is_obsolete='N';

The results showed rather random numbers between 0 and 99, changing just about every second. This indicates that the corresponding child cursors were being invalidated once or more every single second. But why?

The next step was to check why we got these cursor mismatches. I had a feeling about what was coming, but was kind of curious why the number of child cursors was this high.

select * from
(select sql_id, nonshared_reason, count(*) from v$sql_shared_cursor 
unpivot
(nonshared_value for nonshared_reason in (
UNBOUND_CURSOR as 'UNBOUND_CURSOR',
SQL_TYPE_MISMATCH as 'SQL_TYPE_MISMATCH',
OPTIMIZER_MISMATCH as 'OPTIMIZER_MISMATCH',
OUTLINE_MISMATCH as 'OUTLINE_MISMATCH',
STATS_ROW_MISMATCH as 'STATS_ROW_MISMATCH',
LITERAL_MISMATCH as 'LITERAL_MISMATCH',
FORCE_HARD_PARSE as 'FORCE_HARD_PARSE',
EXPLAIN_PLAN_CURSOR as 'EXPLAIN_PLAN_CURSOR',
BUFFERED_DML_MISMATCH as 'BUFFERED_DML_MISMATCH',
PDML_ENV_MISMATCH as 'PDML_ENV_MISMATCH',
INST_DRTLD_MISMATCH as 'INST_DRTLD_MISMATCH',
SLAVE_QC_MISMATCH as 'SLAVE_QC_MISMATCH',
TYPECHECK_MISMATCH as 'TYPECHECK_MISMATCH',
AUTH_CHECK_MISMATCH as 'AUTH_CHECK_MISMATCH',
BIND_MISMATCH as 'BIND_MISMATCH',
DESCRIBE_MISMATCH as 'DESCRIBE_MISMATCH',
LANGUAGE_MISMATCH as 'LANGUAGE_MISMATCH',
TRANSLATION_MISMATCH as 'TRANSLATION_MISMATCH',
BIND_EQUIV_FAILURE as 'BIND_EQUIV_FAILURE',
INSUFF_PRIVS as 'INSUFF_PRIVS',
INSUFF_PRIVS_REM as 'INSUFF_PRIVS_REM',
REMOTE_TRANS_MISMATCH as 'REMOTE_TRANS_MISMATCH',
LOGMINER_SESSION_MISMATCH as 'LOGMINER_SESSION_MISMATCH',
INCOMP_LTRL_MISMATCH as 'INCOMP_LTRL_MISMATCH',
OVERLAP_TIME_MISMATCH as 'OVERLAP_TIME_MISMATCH',
EDITION_MISMATCH as 'EDITION_MISMATCH',
MV_QUERY_GEN_MISMATCH as 'MV_QUERY_GEN_MISMATCH',
USER_BIND_PEEK_MISMATCH as 'USER_BIND_PEEK_MISMATCH',
TYPCHK_DEP_MISMATCH as 'TYPCHK_DEP_MISMATCH',
NO_TRIGGER_MISMATCH as 'NO_TRIGGER_MISMATCH',
FLASHBACK_CURSOR as 'FLASHBACK_CURSOR',
ANYDATA_TRANSFORMATION as 'ANYDATA_TRANSFORMATION',
PDDL_ENV_MISMATCH as 'PDDL_ENV_MISMATCH',
TOP_LEVEL_RPI_CURSOR as 'TOP_LEVEL_RPI_CURSOR',
DIFFERENT_LONG_LENGTH as 'DIFFERENT_LONG_LENGTH',
LOGICAL_STANDBY_APPLY as 'LOGICAL_STANDBY_APPLY',
DIFF_CALL_DURN as 'DIFF_CALL_DURN',
BIND_UACS_DIFF as 'BIND_UACS_DIFF',
PLSQL_CMP_SWITCHS_DIFF as 'PLSQL_CMP_SWITCHS_DIFF',
CURSOR_PARTS_MISMATCH as 'CURSOR_PARTS_MISMATCH',
STB_OBJECT_MISMATCH as 'STB_OBJECT_MISMATCH',
CROSSEDITION_TRIGGER_MISMATCH as 'CROSSEDITION_TRIGGER_MISMATCH',
PQ_SLAVE_MISMATCH as 'PQ_SLAVE_MISMATCH',
TOP_LEVEL_DDL_MISMATCH as 'TOP_LEVEL_DDL_MISMATCH',
MULTI_PX_MISMATCH as 'MULTI_PX_MISMATCH',
BIND_PEEKED_PQ_MISMATCH as 'BIND_PEEKED_PQ_MISMATCH',
MV_REWRITE_MISMATCH as 'MV_REWRITE_MISMATCH',
ROLL_INVALID_MISMATCH as 'ROLL_INVALID_MISMATCH',
OPTIMIZER_MODE_MISMATCH as 'OPTIMIZER_MODE_MISMATCH',
PX_MISMATCH as 'PX_MISMATCH',
MV_STALEOBJ_MISMATCH as 'MV_STALEOBJ_MISMATCH',
FLASHBACK_TABLE_MISMATCH as 'FLASHBACK_TABLE_MISMATCH',
LITREP_COMP_MISMATCH as 'LITREP_COMP_MISMATCH',
PLSQL_DEBUG as 'PLSQL_DEBUG',
LOAD_OPTIMIZER_STATS as 'LOAD_OPTIMIZER_STATS',
ACL_MISMATCH as 'ACL_MISMATCH',
FLASHBACK_ARCHIVE_MISMATCH as 'FLASHBACK_ARCHIVE_MISMATCH',
LOCK_USER_SCHEMA_FAILED as 'LOCK_USER_SCHEMA_FAILED',
REMOTE_MAPPING_MISMATCH as 'REMOTE_MAPPING_MISMATCH',
LOAD_RUNTIME_HEAP_FAILED as 'LOAD_RUNTIME_HEAP_FAILED',
HASH_MATCH_FAILED as 'HASH_MATCH_FAILED',
PURGED_CURSOR as 'PURGED_CURSOR',
BIND_LENGTH_UPGRADEABLE as 'BIND_LENGTH_UPGRADEABLE',
USE_FEEDBACK_STATS as 'USE_FEEDBACK_STATS'
))
where nonshared_value = 'Y'
group by sql_id, nonshared_reason
)
where sql_id = '0qpgcrnkw20sm';


SQL_ID        NONSHARED_REASON                COUNT(*)
------------- ----------------------------- ----------
0qpgcrnkw20sm BIND_LENGTH_UPGRADEABLE             1627
0qpgcrnkw20sm AUTH_CHECK_MISMATCH                14721
0qpgcrnkw20sm TRANSLATION_MISMATCH               14721
0qpgcrnkw20sm BIND_MISMATCH                      10973

The BIND_MISMATCH and the BIND_LENGTH_UPGRADEABLE were the non-shared reasons I was expecting. BIND_MISMATCH is caused by mismatching bind datatypes from the client application. BIND_LENGTH_UPGRADEABLE is a subtype of BIND_MISMATCH, where the mismatch is caused by varying lengths of the values bound to the bind variables.
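
As far as I know, VARCHAR2 binds are graduated into length buckets (commonly cited as 32, 128, 2000 and 4000 bytes, though I have not verified the exact boundaries on this system), so the same statement executed with a short and a very long value for the same bind ends up with different bind metadata. A small sketch of how to provoke it in SQL*Plus, where t and col1 are stand-ins for the real (redacted) objects:

variable v varchar2(4000)
exec :v := 'short'
select count(*) from t where col1 = :v;
exec :v := rpad('x', 3000, 'x')
select count(*) from t where col1 = :v;
-- the second execution should need a new child cursor, flagged with
-- BIND_LENGTH_UPGRADEABLE = 'Y' in v$sql_shared_cursor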

But what about the AUTH_CHECK_MISMATCH and the TRANSLATION_MISMATCH? By their identical counts (14721) they looked rather related. The documentation gives the following definitions:

AUTH_CHECK_MISMATCH: (Y|N) 
Authorization/translation check failed for the existing child cursor

TRANSLATION_MISMATCH: (Y|N) 
The base objects of the existing child cursor do not match

BIND_MISMATCH: (Y|N) 
The bind metadata does not match the existing child cursor
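
The TRANSLATION_MISMATCH definition is the key one here. A hypothetical sketch (schema and table names invented) of how unqualified object names break cursor sharing across schemas:

-- parsed by customer A: the unqualified "t" translates to CUSTOMER_A.T
insert into t (col1, col2) values (:b1, :b2);
-- parsed by customer B: identical SQL text, but "t" now translates to
-- CUSTOMER_B.T, so the existing child cannot be shared
-- (TRANSLATION_MISMATCH = 'Y', and the authorization check differs too)
insert into t (col1, col2) values (:b1, :b2);
-- qualifying the owner gives each customer a distinct SQL text (and thus a
-- distinct parent cursor) instead of many children under one parent
insert into customer_a.t (col1, col2) values (:b1, :b2);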

This was exactly the issue in question. The application used a separate schema for every customer (about 26 schemas), without prefixing the objects with the schema name in the corresponding queries. This alone will generate 26 child cursors for every parent cursor. But with an obsolete threshold of 100 (in version 11.2.0.3 this was actually exposed as the hidden parameter "_cursor_obsolete_threshold"), this could not be the whole explanation. As many times before, I found that the bind mismatch was related to date columns. Using different datatypes for the same bind variable can lead to different kinds of trouble. On some occasions I've seen it lead to bad query plans (fetching all data from an access path, converting the date column to the bind datatype, which could be a varchar or a timestamp, and only then filtering out the requested rows). Or, as the query below shows, it can lead to bind mismatches:

SQL> select position, count(distinct datatype) 
 2   from v$sql s, v$sql_bind_metadata m 
 3   where S.CHILD_ADDRESS = M.ADDRESS and s.sql_id='0qpgcrnkw20sm'
 4   group by position
 5   having count(distinct datatype)>1
 6   order by 1;

POSITION   COUNT(DISTINCTDATATYPE)
---------- -----------------------
         4                       3
         5                       3
         6                       3

3 rows selected.

When I checked the actual datatypes, I found what I expected:

SQL> select distinct position, datatype 
 2   from v$sql s, v$sql_bind_metadata m
 3   where S.CHILD_ADDRESS =M.ADDRESS and s.sql_id='0qpgcrnkw20sm'
 4     and position in (4,5,6)
 5   order by 1;

POSITION   DATATYPE
---------- ----------
         4          1
         4         12
         4        180

         5          1
         5         12
         5        180

         6          1
         6         12
         6        180

9 rows selected.

The values 1, 12 and 180 correspond to the datatypes varchar2, date and timestamp (see sys.type$). This mismatch alone could create (3^3 =) 27 different child cursors, which is not enough to hit the obsolete threshold of 100. But combined with the TRANSLATION_MISMATCH (and the AUTH_CHECK_MISMATCH) it could create (26*27 =) 702 different child cursors. When you add some BIND_LENGTH_UPGRADEABLE mismatches on top, things get even uglier.
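
For readability, the type codes can be decoded inline (the mapping for 1, 12 and 180 is the one just mentioned):

select distinct position,
       decode(datatype, 1, 'VARCHAR2', 12, 'DATE', 180, 'TIMESTAMP',
              to_char(datatype)) datatype_name
from v$sql s, v$sql_bind_metadata m
where s.child_address = m.address
  and s.sql_id = '0qpgcrnkw20sm'
  and position in (4, 5, 6)
order by 1;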

One of the common reasons for bind mismatches in Java is the use of setNull(n, java.sql.Types.NULL). Using setNull is the right way to bind null values, but the Types.NULL type code is bound as a VARCHAR. The code should explicitly reference the right datatype:

PreparedStatement pstmt = conn.prepareStatement(INSERT_SQL);
pstmt.setInt(1, val1);
...
// bind the date columns as DATE even when the value is null, so every
// execution presents the same bind metadata to the database
if (val4 == null) { pstmt.setNull(4, java.sql.Types.DATE); } else { pstmt.setDate(4, val4); }
if (val5 == null) { pstmt.setNull(5, java.sql.Types.DATE); } else { pstmt.setDate(5, val5); }
if (val6 == null) { pstmt.setNull(6, java.sql.Types.DATE); } else { pstmt.setDate(6, val6); }

So – why did the database freeze?
Well, I never got to observe the hang situations myself. All the time I was monitoring the database, it survived the invalidations of cursors. The average number of active sessions was hovering around 4-6. At these low concurrency rates the database survives. But at some unlucky moment, with parent cursor invalidations coinciding with a peak in active sessions (all hard parsing to create new child cursors), this generates contention on the library cache latches (and/or mutexes). When that happens, the latch misses cause the sessions to spin on the CPU. On a busy system with continuously incoming requests, the system is now just about to jam: the incoming requests cause more contention on the library cache, more spinning on CPU, then latch free waits, and still more incoming requests. The database is "halted", not able to answer any requests, just spinning on the CPUs and never getting to do any real work.
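
If you suspect a system is heading into this state, the classic latch statistics are one place to look (a sketch; note that on 11g much of the library cache serialization is done by mutexes, which show up as "library cache: mutex X" waits rather than in v$latch):

select name, gets, misses, spin_gets
from v$latch
where name like 'library cache%'
order by misses desc;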

So – how do we recover from such a situation?
Well, one obvious answer would be to group all customers into one common schema. Oracle is a relational database, and one of its main responsibilities is actually to keep the relations among the data. If security is an issue, the company could consider Virtual Private Database (VPD). This action could be rather expensive, though, and might take a long time to get into production.
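
As a rough illustration of that direction (all names here are hypothetical; the policy function would return a predicate like 'customer_id = sys_context(...)'):

begin
  dbms_rls.add_policy(
    object_schema   => 'APP',                 -- hypothetical common schema
    object_name     => 'T',
    policy_name     => 'customer_isolation',
    function_schema => 'APP',
    policy_function => 'customer_predicate',  -- hypothetical predicate function
    statement_types => 'select,insert,update,delete',
    update_check    => true);
end;
/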

Another action could be to tweak the default value of the hidden parameter "_cursor_obsolete_threshold". This could reintroduce some of the problems seen in Oracle 11g Release 1. But looking at the metadata there seem to be very few columns with histograms, and adaptive cursor sharing, even though enabled, seems not to be in use (this could of course be because the actual cursors are invalidated by the unhealthy library cache; I really don't know yet). The trouble with the obsolete threshold parameter is that it is global for the entire system, and I would recommend changing it only on advice from Oracle Support.
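
For reference, raising the threshold would look something like this (again: a hidden parameter, so only on advice from Oracle Support; the value 200 is just an illustration, and the change takes effect after a restart):

alter system set "_cursor_obsolete_threshold" = 200 scope = spfile;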

A probably less expensive and less global approach would be to prefix every object (in queries) with the schema name. At least I would try to do this for the most heavily executed queries. This could lead to a need for a bigger shared pool, since you get more parent cursors, each with up to 100 child cursors. But the shared pool was already rather big. The memory dedicated to Oracle was originally 6G; with these settings Oracle used 2.9G for the buffer cache and about 2.9G for the shared pool. After increasing the total memory to 8.4G, Oracle started up with about 3G for the shared pool and the rest in the buffer cache. But the shared pool kept slowly increasing (and the buffer cache decreasing). After quite some time the database seemed to stabilize at about 5G in the shared pool and 3G in the buffer cache.
This raises the question: how does Oracle decide whether the shared pool should increase? Is 5G really not enough? I have a weird feeling that 5G is more than plenty, but that Oracle is being fooled by the unhealthy library cache.
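
One way to watch what the automatic memory management actually decides is the resize history (a sketch; with AMM on 11g this is v$memory_resize_ops, with ASMM the corresponding view is v$sga_resize_ops):

select component, oper_type, initial_size, target_size, status, start_time
from v$memory_resize_ops
where component in ('shared pool', 'DEFAULT buffer cache')
order by start_time;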

Of course the bind mismatches should also be fixed. This is a code issue, and something the developers really should learn from. It is a very common mistake, one I see far too often, and not only from novice developers.

I would really appreciate some comments on the topic, so feel free!
