A query with a SQL Profile attached switched to a different execution plan, even though the Note at the bottom of that plan indicated that the SQL Profile had been used. The new plan made the query fail with the classic parallel query error, caused by a lack of TEMP space following a massive parallel broadcast distribution:
ORA-12801: error signaled in parallel query server P013
ORA-01652: unable to extend temp segment by 128 in tablespace TEMP
Initially, the query failed when it reached 32GB of TEMP space. Once woken up, the on-call DBA enlarged the TEMP tablespace but, unfortunately, the next run failed as well, this time at 137GB of TEMP space.
When it was my turn to investigate, instead of continuing to enlarge the TEMP tablespace or changing the parallel distribution method, I decided to figure out why Oracle was refusing to use the SQL Profile and was compiling a new plan at practically every execution. Applying Tanel Poder's nonshared script to the sql_id of the real-world query gives this:
SQL> @nonshared gx28sa7z20btn
Show why existing SQL child cursors were not reused (V$SQL_SHARED_CURSOR)...

INST_ID           : 1
SQL_ID            : gx28sa7z20btn
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 1
REASON            : <reason>PQ Slave mismatch(5)</reason>
-----------------
INST_ID           : 1
SQL_ID            : gx28sa7z20btn
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 2
REASON            : <reason>PQ Slave mismatch(5)</reason>
PQ_SLAVE_MISMATCH : Y
-----------------
INST_ID           : 1
SQL_ID            : gx28sa7z20btn
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 3
REASON            : <reason>PQ Slave mismatch(5)</reason>
PQ_SLAVE_MISMATCH : Y
-----------------
INST_ID           : 1
SQL_ID            : gx28sa7z20btn
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 4
REASON            : <reason>PQ Slave mismatch(5)</reason>
PQ_SLAVE_MISMATCH : Y
-----------------
INST_ID           : 1
SQL_ID            : gx28sa7z20btn
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 5
REASON            : <reason>PQ Slave mismatch(5)</reason>
PQ_SLAVE_MISMATCH : Y
-----------------
What does this PQ_SLAVE_MISMATCH non-sharing reason mean?
Oracle defines it as
(Y|N) Top-level slave decides not to share cursor
This very short definition seems to indicate that a parallel slave refused to share the execution plan of its Query Coordinator and decided to hard parse its own execution plan, even though both (the QC and the PX slaves) were running in the same instance, on an 11.2.0.4 release.
I spent a couple of minutes looking at the query, trying to simplify it, until I found the part causing the execution plan mismatch. Having got a clue about what was happening in this real-life query, I engineered the following model, with which I reproduced the same behaviour in 11.2.0.4 and 12.1.0.1.0. Here is the setup (you will recognize one of Jonathan Lewis's table scripts):
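For readers who do not have the nonshared script at hand, a minimal sketch of a direct query that surfaces the same flag might look like the following. The `&sql_id` substitution variable is a placeholder of mine, not something from the original investigation, and the sketch assumes a release (11.2.0.2 or later) where the REASON column is available.

```sql
-- Sketch: list the child cursors of one statement that were not shared
-- because of PQ_SLAVE_MISMATCH.
-- &sql_id is a SQL*Plus substitution variable (placeholder).
select
    inst_id
   ,sql_id
   ,child_number
   ,pq_slave_mismatch
   ,dbms_lob.substr(reason, 200, 1) reason_head   -- REASON is a CLOB
from
    gv$sql_shared_cursor
where
    sql_id = '&sql_id'
and pq_slave_mismatch = 'Y'
order by
    inst_id
   ,child_number;
```

Unlike the script output, this only returns the flagged children, so a child cursor without the flag (such as child n°1 in the real-world listing) would not appear.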
select banner from gv$version where rownum=1;

BANNER
-----------------------------------------------------------------------------
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production

create table t1
as
with generator as (
    select --+ materialize
        rownum id
    from dual
    connect by
        level <= 1000
)
select
    trunc((rownum -1)/2) product_id
   ,mod(rownum-1,2)      flag1
   ,mod(rownum-1,3)      flag2
   ,rownum               n1
   ,lpad(rownum,30)      v1
from
    generator v1
   ,generator v2
where
    rownum <= 1e6;

create table t2
as
select
    level id
   ,date '2012-06-07' + mod((level-1)*5,10) + interval '5' minute start_date
   ,rpad('xx',10) padding
from
    dual
connect by
    level <= 1e6;

begin
  dbms_stats.gather_table_stats(user, 't1', method_opt => 'for all columns size 1');
  dbms_stats.gather_table_stats(user, 't2', method_opt => 'for all columns size 1');
end;
/
And here is the query I will use to reproduce the non-sharing reason:
with got_my_dates as (
    select /*+ materialize */
        id
       ,padding
       ,start_date
    from
        t2
    where
        start_date >  to_date('07/06/2012','dd/mm/yyyy')
    and start_date <= to_date('10/06/2012','dd/mm/yyyy')
)
select /*+ parallel(4) */
    t1.*
   ,cte.padding
   ,cte.start_date
from
    t1
   ,got_my_dates cte
where
    t1.product_id = cte.id
and t1.product_id <= 1e3;
A quick remark before starting the experiment: in the real-life query, the “with subquery” was automatically materialized by Oracle because it is referenced twice in the main query. The materialize hint I used above might therefore not be strictly necessary for the behaviour I am about to describe.
Let’s also confirm, before starting the experiment, that this query has not been previously parsed and as such is completely unknown:
SQL> select sql_id
     from gv$sql
     where sql_text like '%got_my_dates%'
     and   sql_text not like '%v$sql%';

no rows selected

In the following I will execute the above query, get its execution plan, and show how many child cursors it used during this very first execution:
SQL> -- run query

SQL_ID  1b7g55gx40k79, child number 0
-------------------------------------
Plan hash value: 2708956082

-------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes |    TQ  |IN-OUT|
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |       |       |       |      |
|   1 |  TEMP TABLE TRANSFORMATION  |                            |       |       |       |      |
|   2 |   PX COORDINATOR            |                            |       |       |       |      |
|   3 |    PX SEND QC (RANDOM)      | :TQ10000                   |  994K |   22M | Q1,00 | P->S |
|   4 |     LOAD AS SELECT          |                            |       |       | Q1,00 | PCWP |
|   5 |      PX BLOCK ITERATOR      |                            |  994K |   22M | Q1,00 | PCWC |
|*  6 |       TABLE ACCESS FULL     | T2                         |  994K |   22M | Q1,00 | PCWP |
|   7 |   PX COORDINATOR            |                            |       |       |       |      |
|   8 |    PX SEND QC (RANDOM)      | :TQ20001                   |  1830 |  134K | Q2,01 | P->S |
|*  9 |     HASH JOIN               |                            |  1830 |  134K | Q2,01 | PCWP |
|  10 |      JOIN FILTER CREATE     | :BF0000                    |  1830 | 84180 | Q2,01 | PCWP |
|  11 |       PX RECEIVE            |                            |  1830 | 84180 | Q2,01 | PCWP |
|  12 |        PX SEND BROADCAST    | :TQ20000                   |  1830 | 84180 | Q2,00 | P->P |
|  13 |         PX BLOCK ITERATOR   |                            |  1830 | 84180 | Q2,00 | PCWC |
|* 14 |          TABLE ACCESS FULL  | T1                         |  1830 | 84180 | Q2,00 | PCWP |
|* 15 |      VIEW                   |                            |  994K |   27M | Q2,01 | PCWP |
|  16 |       JOIN FILTER USE       | :BF0000                    |  994K |   22M | Q2,01 | PCWP |
|  17 |        PX BLOCK ITERATOR    |                            |  994K |   22M | Q2,01 | PCWC |
|* 18 |         TABLE ACCESS FULL   | SYS_TEMP_0FD9D6633_140F243 |  994K |   22M | Q2,01 | PCWP |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access(:Z>=:Z AND :Z<=:Z)
       filter(("START_DATE">TO_DATE(' 2012-06-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "START_DATE"<=TO_DATE(' 2012-06-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
   9 - access("T1"."PRODUCT_ID"="CTE"."ID")
  14 - access(:Z>=:Z AND :Z<=:Z)
       filter("T1"."PRODUCT_ID"<=1000)
  15 - filter("CTE"."ID"<=1000)
  18 - access(:Z>=:Z AND :Z<=:Z)
       filter(SYS_OP_BLOOM_FILTER(:BF0000,"C0"))

Note
-----
   - Degree of Parallelism is 4 because of hint

SQL> @gv$sql
Enter value for sql_id: 1b7g55gx40k79

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER LAST_LOAD_TIME      EXECUTIONS
------------- --------------- ------------ ------------------- ----------
1b7g55gx40k79      2708956082            0 2016-09-26/07:30:23          1
1b7g55gx40k79      2708956082            1 2016-09-26/07:30:24          0
Notice how a completely new query produces, during its very first execution, two child cursors. Interestingly, the number of executions seems to indicate that Oracle used the child cursor 0 while the number of executions of the child cursor n°1 has not been incremented.
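The gv$sql script used above is not reproduced in this post. A minimal sketch of a query that would return the same columns, assuming nothing beyond the standard gv$sql view and a `&sql_id` substitution variable, could be:

```sql
-- Sketch of what a script such as gv$sql.sql might contain (assumption:
-- the actual script is not shown in the post).
-- One row per child cursor, with its plan and execution count.
select
    sql_id
   ,plan_hash_value
   ,child_number
   ,last_load_time
   ,executions
from
    gv$sql
where
    sql_id = '&sql_id'
order by
    child_number;
```

A child cursor with EXECUTIONS = 0 sitting next to an executed child with the same PLAN_HASH_VALUE is the pattern to look for here.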
The reason for these two child cursors is:
SQL> @nonshared 1b7g55gx40k79
Show why existing SQL child cursors were not reused (V$SQL_SHARED_CURSOR)...

SQL_ID            : 1b7g55gx40k79
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF92115E6B0
CHILD_NUMBER      : 0
REASON            : <reason>PQ Slave mismatch(5)</reason>
CON_ID            : 1
-----------------
SQL_ID            : 1b7g55gx40k79
ADDRESS           : 00007FF92115E830
CHILD_ADDRESS     : 00007FF9214A40E8
CHILD_NUMBER      : 1
PQ_SLAVE_MISMATCH : Y
REASON            : <reason>PQ Slave mismatch(5)</reason>
CON_ID            : 1
-----------------
So here we are: in exactly the same situation as the real-world query.
Another remark worth mentioning is that the execution plan of child cursor n°1 is exactly identical to that of child cursor n°0 shown above, except for this bizarre Note at the bottom:
SQL_ID  1b7g55gx40k79, child number 1
-------------------------------------
with got_my_dates as (select /*+ materialize */ id ,padding ,start_date
from t2 where start_date > to_date('07/06/2012','dd/mm/yyyy')
and start_date <= to_date('10/06/2012','dd/mm/yyyy') )
select /*+ parallel(4) */ t1.* ,cte.padding ,cte.start_date
from t1 ,got_my_dates cte
where t1.product_id = cte.id and t1.product_id <= 1e3

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 4
I don’t clearly see where this automatic DOP is coming from, since Auto DOP is enabled neither on my personal laptop nor in the real-world application:
SQL> show parameter parallel_degree_policy

PARAMETER_NAME                  TYPE        VALUE
------------------------------- ----------- -------
parallel_degree_policy          string      MANUAL
Last but not least, while the query always requested 2*DOP (8) parallel servers, Oracle systematically gave it 12 (and sometimes 16) parallel servers:
SQL> select
         sql_id
        ,process_name px_slave
     from
         gv$sql_monitor
     where
         sql_id = '1b7g55gx40k79'
     and sql_exec_id = 16777216
     and trunc(sql_exec_start) = to_date('27092016','ddmmyyyy')
     and sql_text is null
     order by 2;

SQL_ID        PX_SL
------------- -----
1b7g55gx40k79 p000
1b7g55gx40k79 p000
1b7g55gx40k79 p001
1b7g55gx40k79 p001
1b7g55gx40k79 p002
1b7g55gx40k79 p002
1b7g55gx40k79 p003
1b7g55gx40k79 p003
1b7g55gx40k79 p004
1b7g55gx40k79 p005
1b7g55gx40k79 p006
1b7g55gx40k79 p007

12 rows selected.

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  C##MHOURI (7:4495)
 SQL ID              :  1b7g55gx40k79
 SQL Execution ID    :  16777216
 Execution Started   :  09/27/2016 07:29:34
 First Refresh Time  :  09/27/2016 07:29:34
 Last Refresh Time   :  09/27/2016 07:29:45
 Duration            :  11s
 Module/Action       :  SQL*Plus/-
 Service             :  orcl
 Program             :  sqlplus.exe
 Fetch Calls         :  3

Parallel Execution Details (DOP=4 , Servers Allocated=12)
The fact that more servers are allocated than requested is not innocent and should always catch your attention.
That said, as you might have already guessed from the title of this post, simplifying the real-world query showed that the parallel slave refuses to share the QC execution plan because of the materialisation of the Common Table Expression. This is why, when I prevented the materialisation of the CTE via the /*+ inline */ hint, the parallel slave shared the execution plan of its QC, as shown below:
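One way to keep an eye on this, sketched here under the assumption that the PX_SERVERS_REQUESTED and PX_SERVERS_ALLOCATED columns of gv$sql_monitor are populated on the coordinator row in your release, is to read both figures side by side for every monitored execution of the statement. The `&sql_id` substitution variable is a placeholder.

```sql
-- Sketch: compare requested and allocated PX servers per execution.
-- Assumption: the PX_* columns are filled on the query coordinator row;
-- rows belonging to PX slaves are expected to show nulls there.
select
    sql_id
   ,sql_exec_id
   ,sql_exec_start
   ,process_name
   ,px_maxdop
   ,px_servers_requested
   ,px_servers_allocated
from
    gv$sql_monitor
where
    sql_id = '&sql_id'
order by
    sql_exec_start
   ,process_name;
```

Whether the requested figure reflects 2*DOP or already accounts for the extra DFO tree is something I would verify against the text report rather than assume.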
with got_my_dates as (
    select /*+ inline */
        id
       ,padding
       ,start_date
    from
        t2
    where
        start_date >  to_date('07/06/2012','dd/mm/yyyy')
    and start_date <= to_date('10/06/2012','dd/mm/yyyy')
)
select /*+ parallel(4) */
    t1.*
   ,cte.padding
   ,cte.start_date
from
    t1
   ,got_my_dates cte
where
    t1.product_id = cte.id
and t1.product_id <= 1e3;

1000 rows selected.

SQL> start xpsimp

SQL_ID  4h1qa708b9p3j, child number 0
-------------------------------------
Plan hash value: 2637578939

-------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes |    TQ  |IN-OUT|
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |       |      |
|   1 |  PX COORDINATOR             |          |       |       |       |      |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |   681 | 47670 | Q1,02 | P->S |
|*  3 |    HASH JOIN BUFFERED       |          |   681 | 47670 | Q1,02 | PCWP |
|   4 |     JOIN FILTER CREATE      | :BF0000  |   681 | 16344 | Q1,02 | PCWP |
|   5 |      PX RECEIVE             |          |   681 | 16344 | Q1,02 | PCWP |
|   6 |       PX SEND HYBRID HASH   | :TQ10000 |   681 | 16344 | Q1,00 | P->P |
|   7 |        STATISTICS COLLECTOR |          |       |       | Q1,00 | PCWC |
|   8 |         PX BLOCK ITERATOR   |          |   681 | 16344 | Q1,00 | PCWC |
|*  9 |          TABLE ACCESS FULL  | T2       |   681 | 16344 | Q1,00 | PCWP |
|  10 |     PX RECEIVE              |          |  1830 | 84180 | Q1,02 | PCWP |
|  11 |      PX SEND HYBRID HASH    | :TQ10001 |  1830 | 84180 | Q1,01 | P->P |
|  12 |       JOIN FILTER USE       | :BF0000  |  1830 | 84180 | Q1,01 | PCWP |
|  13 |        PX BLOCK ITERATOR    |          |  1830 | 84180 | Q1,01 | PCWC |
|* 14 |         TABLE ACCESS FULL   | T1       |  1830 | 84180 | Q1,01 | PCWP |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T1"."PRODUCT_ID"="ID")
   9 - access(:Z>=:Z AND :Z<=:Z)
       filter(("ID"<=1000 AND
              "START_DATE">TO_DATE(' 2012-06-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "START_DATE"<=TO_DATE(' 2012-06-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
  14 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."PRODUCT_ID"<=1000 AND SYS_OP_BLOOM_FILTER(:BF0000,"T1"."PRODUCT_ID")))

Note
-----
   - Degree of Parallelism is 4 because of hint
Notice below how, now that the CTE is not materialized, the query uses a single child cursor:
SQL> @gv$sql2
Enter value for sql_id: 4h1qa708b9p3j

SQL_ID        PLAN_HASH_VALUE CHILD_NUMBER LAST_LOAD_TIME      EXECUTIONS END_OF_FETCH_COUNT
------------- --------------- ------------ ------------------- ---------- ------------------
4h1qa708b9p3j      2637578939            0 2016-09-27/18:00:54          1                  1

SQL> @nonshared 4h1qa708b9p3j
Show why existing SQL child cursors were not reused (V$SQL_SHARED_CURSOR)...

SQL_ID            : 4h1qa708b9p3j
ADDRESS           : 00007FF921744A58
CHILD_ADDRESS     : 00007FF921744698
CHILD_NUMBER      : 0
REASON            :
CON_ID            : 1
-----------------

And notice as well that, with the CTE no longer materialized, the number of requested parallel servers (2*DOP) equals the number of allocated servers, as shown below:
SQL> select
         sql_id
        ,process_name px_slave
     from
         gv$sql_monitor
     where
         sql_id = '4h1qa708b9p3j'
     and sql_exec_id = 16777216
     and trunc(sql_exec_start) = trunc(sysdate)
     and sql_text is null
     order by 2;

SQL_ID        PX_SL
------------- -----
4h1qa708b9p3j p000
4h1qa708b9p3j p001
4h1qa708b9p3j p002
4h1qa708b9p3j p003
4h1qa708b9p3j p004
4h1qa708b9p3j p005
4h1qa708b9p3j p006
4h1qa708b9p3j p007

8 rows selected.

Global Information
------------------------------
 Status              :  DONE (ALL ROWS)
 Instance ID         :  1
 Session             :  C##MHOURI (7:4495)
 SQL ID              :  4h1qa708b9p3j
 SQL Execution ID    :  16777216
 Execution Started   :  09/27/2016 18:00:54
 First Refresh Time  :  09/27/2016 18:00:54
 Last Refresh Time   :  09/27/2016 18:00:57
 Duration            :  3s
 Module/Action       :  SQL*Plus/-
 Service             :  orcl
 Program             :  sqlplus.exe
 Fetch Calls         :  3

Parallel Execution Details (DOP=4 , Servers Allocated=8)
And for those who can’t do without an execution plan, here is the new one:
Plan hash value: 2637578939

-------------------------------------------------------------------------------
| Id  | Operation                   | Name     | Rows  | Bytes |    TQ  |IN-OUT|
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |       |       |       |      |
|   1 |  PX COORDINATOR             |          |       |       |       |      |
|   2 |   PX SEND QC (RANDOM)       | :TQ10002 |   738 | 51660 | Q1,02 | P->S |
|*  3 |    HASH JOIN BUFFERED       |          |   738 | 51660 | Q1,02 | PCWP |
|   4 |     JOIN FILTER CREATE      | :BF0000  |   737 | 17688 | Q1,02 | PCWP |
|   5 |      PX RECEIVE             |          |   737 | 17688 | Q1,02 | PCWP |
|   6 |       PX SEND HYBRID HASH   | :TQ10000 |   737 | 17688 | Q1,00 | P->P |
|   7 |        STATISTICS COLLECTOR |          |       |       | Q1,00 | PCWC |
|   8 |         PX BLOCK ITERATOR   |          |   737 | 17688 | Q1,00 | PCWC |
|*  9 |          TABLE ACCESS FULL  | T2       |   737 | 17688 | Q1,00 | PCWP |
|  10 |     PX RECEIVE              |          |  1986 | 91356 | Q1,02 | PCWP |
|  11 |      PX SEND HYBRID HASH    | :TQ10001 |  1986 | 91356 | Q1,01 | P->P |
|  12 |       JOIN FILTER USE       | :BF0000  |  1986 | 91356 | Q1,01 | PCWP |
|  13 |        PX BLOCK ITERATOR    |          |  1986 | 91356 | Q1,01 | PCWC |
|* 14 |         TABLE ACCESS FULL   | T1       |  1986 | 91356 | Q1,01 | PCWP |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("T1"."PRODUCT_ID"="ID")
   9 - access(:Z>=:Z AND :Z<=:Z)
       filter(("ID"<=1000 AND
              "START_DATE">TO_DATE(' 2012-06-07 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
              "START_DATE"<=TO_DATE(' 2012-06-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss')))
  14 - access(:Z>=:Z AND :Z<=:Z)
       filter(("T1"."PRODUCT_ID"<=1000 AND SYS_OP_BLOOM_FILTER(:BF0000,"T1"."PRODUCT_ID")))

Note
-----
   - Degree of Parallelism is 4 because of hint
While the initial execution plan uses two DFO trees (the TQ column shows both Q1 and Q2 table queues), the new one uses a single DFO tree (Q1 only) with 3 DFOs. Having multiple parallelizers (that is, multiple DFO trees) can sometimes create issues, as in this case of TEMP TABLE TRANSFORMATION and PQ_SLAVE_MISMATCH.
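The number of DFO trees can also be checked at run time instead of being read off the plan. A minimal sketch, assuming it is run in the same session immediately after the parallel query has completed (v$pq_tqstat only reports on the session's last parallel statement):

```sql
-- Sketch: one row per table queue, server type and PX process.
-- DFO_NUMBER distinguishes the DFO trees; TQ_ID numbers the table
-- queues inside each tree.
select
    dfo_number
   ,tq_id
   ,server_type
   ,process
   ,num_rows
from
    v$pq_tqstat
order by
    dfo_number
   ,tq_id
   ,server_type desc
   ,process;
```

More than one distinct DFO_NUMBER in the output would be consistent with a plan that has several PX COORDINATOR operations, like the materialized version of the query.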
BOTTOM LINE
Even though this issue doesn’t reproduce in 12.1.0.2, for earlier Oracle releases watch out for parallel queries using a temp table transformation. You might be confronted with execution plan instability due to the PQ_SLAVE_MISMATCH non-sharing reason, where a parallel slave refuses to share the execution plan of its query coordinator. Bear in mind as well that, in such a situation, even a SQL Profile will not guarantee the same execution plan at each execution.
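To check whether a profiled statement is nevertheless drifting between plans, a rough sketch is to group its child cursors by plan and by the profile they report. The `&sql_id` substitution variable is a placeholder.

```sql
-- Sketch: how many child cursors and executions each plan accounts for,
-- and which SQL Profile (if any) each group of children reports.
select
    plan_hash_value
   ,sql_profile
   ,count(*)        child_cursors
   ,sum(executions) executions
from
    gv$sql
where
    sql_id = '&sql_id'
group by
    plan_hash_value
   ,sql_profile
order by
    plan_hash_value;
```

Several rows with the same non-null SQL_PROFILE but different PLAN_HASH_VALUE would match the symptom described in this post. Keep in mind that gv$sql only shows cursors still in the shared pool, so aged-out children will be missing.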
