If you are going to enable a foreign key integrity constraint, make sure you first have fresh and representative statistics on both the child and the parent table. Otherwise you might lose a lot of time enabling your constraint because of a hard-to-spot recursive query.
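To make the context concrete, here is a minimal sketch of the recommended order of operations, using hypothetical table and constraint names (EMP, DEPT, EMP_DEPT_FK) that are not part of the real case described below:

```sql
-- Hypothetical names for illustration: child table EMP, parent table DEPT.
-- Gather fresh statistics on both ends of the constraint first ...
exec dbms_stats.gather_table_stats(user, 'DEPT', cascade => true);
exec dbms_stats.gather_table_stats(user, 'EMP',  cascade => true);

-- ... and only then validate the constraint: this DDL silently runs a
-- recursive child-to-parent outer-join query whose execution plan
-- depends on those statistics.
alter table emp enable validate constraint emp_dept_fk;
```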
Here is a summary of a real-life 11.2.0.4 case:
SQL Text
------------------------------
BEGIN XZA_DDL.ENABLE_FK(owner => 'ABCDXY', nomtab => 'TABLE_ABC_C', nomfk => NULL); END;

Global Information
------------------------------
 Status              :  EXECUTING
 Instance ID         :  1
 Session             :  XYZ_ABC(13:25335)
 SQL ID              :  46bn1bvfkkpvn
 SQL Execution ID    :  16777216
 Execution Started   :  01/13/2020 16:54:37
 First Refresh Time  :  01/13/2020 16:54:43
 Last Refresh Time   :  01/14/2020 15:23:35
 Duration            :  80939s
 Module/Action       :  SQL Developer/-
 Service             :  SYS$USERS
 Program             :  SQL Developer

Global Stats
===========================================================================================
| Elapsed |   Cpu   |    IO    | Concurrency |  PL/SQL  |  Other   | Buffer | Read | Read  |
| Time(s) | Time(s) | Waits(s) |  Waits(s)   |  Time(s) | Waits(s) |  Gets  | Reqs | Bytes |
===========================================================================================
|   80937 |   79809 |     3.27 |        1.74 |     0.00 |     1124 |     2G |  215 |  21MB |
===========================================================================================
The very long FK-enabling process was still in EXECUTING status after more than 22 hours, as shown above. Let's check what ASH can tell us about this situation:
SQL> select sql_id, count(1)
  2  from gv$active_session_history
  3  where
  4    sample_time between to_date('14012020 12:00:00', 'ddmmyyyy hh24:mi:ss')
  5                    and to_date('14012020 15:45:00', 'ddmmyyyy hh24:mi:ss')
  6  group by sql_id
  7  order by 2 desc
  8  ;

SQL_ID          COUNT(1)
------------- ----------
                   17801
608srf0vf5f3q      12937  --> alter table ABCDXY.TABLE_ABC_C
a8gtvr24afy70      10919
1vtd595xys9fp       9135
7q4kq2auz89x1       4203
fdxa2ph5250x1       3058
Clearly, we must focus our attention on SQL_ID 608srf0vf5f3q.
From a wait event point of view, ASH shows this:
SQL> select event, count(1)
  2  from gv$active_session_history
  3  where
  4    sample_time between to_date('14012020 12:00:00', 'ddmmyyyy hh24:mi:ss')
  5                    and to_date('14012020 15:45:00', 'ddmmyyyy hh24:mi:ss')
  6  group by event
  7  order by 2 desc;

EVENT                      COUNT(1)
------------------------ ----------
                              48756
row cache lock                12194
db file parallel write         6867
db file scattered read         6519
db file sequential read        5356
direct path read               1794
../..
A lot of CPU burning (the samples with a null EVENT), followed by the row cache lock wait event. However, SQL_ID 608srf0vf5f3q is reported to be doing nothing but burning CPU, as the following proves:
SQL> select event, count(1)
  2  from gv$active_session_history
  3  where
  4    sample_time between to_date('14012020 12:00:00', 'ddmmyyyy hh24:mi:ss')
  5                    and to_date('14012020 15:45:00', 'ddmmyyyy hh24:mi:ss')
  6  and sql_id = '608srf0vf5f3q'
  7  group by event
  8  order by 2 desc;

EVENT      COUNT(1)
------- ----------
              12937
Since the enabling process was still running, I decided to "snap" SID 13 of the session running the PL/SQL stored procedure, which I got from the corresponding gv$sql_monitor entry:
SQL> @snapper all 5 1 13
Sampling SID 13 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v4.26 - by Tanel Poder ( http://blog.tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet!

--------------------------------------------------------------------------------------------------------
    SID, USERNAME, TYPE, STATISTIC                             ,      DELTA, HDELTA/SEC,  %TIME,
--------------------------------------------------------------------------------------------------------
     13, XYZ_ABC , STAT, session logical reads                 ,     142350,     27.03k,       ,
     13, XYZ_ABC , STAT, consistent gets                       ,     142351,     27.03k,       ,
     13, XYZ_ABC , STAT, consistent gets from cache            ,     142352,     27.03k,       ,
     13, XYZ_ABC , STAT, consistent gets from cache (fastpath) ,     142352,     27.03k,       ,
     13, XYZ_ABC , STAT, logical read bytes from cache         , 1166147584,    221.41M,       ,  --> this looks interesting
     13, XYZ_ABC , STAT, calls to kcmgcs                       ,         22,       4.18,       ,
     13, XYZ_ABC , STAT, no work - consistent read gets        ,     142334,     27.02k,       ,
     13, XYZ_ABC , STAT, table scans (short tables)            ,          3,        .57,       ,
     13, XYZ_ABC , STAT, table scan rows gotten                ,   23484525,      4.46M,       ,
     13, XYZ_ABC , STAT, table scan blocks gotten              ,     142335,     27.02k,       ,
     13, XYZ_ABC , STAT, buffer is pinned count                ,          3,        .57,       ,
     13, XYZ_ABC , TIME, DB CPU                                ,    6003174,      1.14s, 114.0%,
     13, XYZ_ABC , TIME, sql execute elapsed time              ,    6003624,      1.14s, 114.0%,
     13, XYZ_ABC , TIME, DB time                               ,    6003624,      1.14s, 114.0%,

--  End of Stats snap 1, end=2020-01-14 15:30:20, seconds=5.3

----------------------------------------------------------------------------------
  ActSes   %Thread | INST | SQL_ID        | SQL_CHILD | EVENT   | WAIT_CLASS
----------------------------------------------------------------------------------
    1.00    (100%) |    1 | 608srf0vf5f3q | 0         | ON CPU  | ON CPU

--  End of ASH snap 1, end=2020-01-14 15:30:20, seconds=5, samples_taken=48, AAS=1
All the investigations consistently show that the culprit query is 608srf0vf5f3q, and that it is burning a lot of CPU. The 221 MB/s of logical read bytes from cache correlates very well with that high CPU consumption. But we still don't know why, or what exactly is driving those hundreds of millions of logical reads from cache.
In addition, the ash_wait_chains script applied to SQL_ID 608srf0vf5f3q shows that this query is not blocked by any other process:
SQL> @ash_wait_chains event2 sql_id='608srf0vf5f3q' "timestamp'2020-01-14 12:00:00'" "timestamp'2020-01-14 15:45:00'"

-- Display ASH Wait Chain Signatures script v0.4 BETA by Tanel Poder ( http://blog.tanelpoder.com )

%This  SECONDS     AAS WAIT_CHAIN    LAST_SEEN
------ ------- ------- ------------- --------------------
100%     12937     1.0 -> ON CPU     2020-01-14 15:44:59
That's weird, isn't it?
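One shortcut worth mentioning in passing: on 11g and later, ASH also records a TOP_LEVEL_SQL_ID column, which ties each sampled cursor back to the top-level statement (here, the PL/SQL call) it runs under. A sketch of that check, against the same time window:

```sql
-- Which top-level statement is each sampled cursor running under?
select sql_id, top_level_sql_id, count(1)
from   gv$active_session_history
where  sample_time between to_date('14012020 12:00:00', 'ddmmyyyy hh24:mi:ss')
                       and to_date('14012020 15:45:00', 'ddmmyyyy hh24:mi:ss')
group  by sql_id, top_level_sql_id
order  by 3 desc;
```

It does not explain the CPU burning by itself, but it can confirm which session/call tree a busy cursor belongs to without tracing.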
After a couple of minutes going round in circles, I decided to use brute force by:
- Killing the process
- Activating the 10046 trace
- Running the process for a couple of minutes
- Stopping the process again
- Analyzing the trace file
alter session set tracefile_identifier = EnablingForeignKey;

@46on 12

exec XZA_DDL.ENABLE_FK(owner => 'ABCDXY', nomtab => 'TABLE_ABC_C', nomfk => NULL);

@46off
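For reference, @46on and @46off are Tanel Poder's helper scripts; without them, the equivalent standard commands would be something like this (level 12 captures both bind values and wait events):

```sql
-- Equivalent of @46on 12: enable extended SQL trace at level 12
alter session set events '10046 trace name context forever, level 12';

exec XZA_DDL.ENABLE_FK(owner => 'ABCDXY', nomtab => 'TABLE_ABC_C', nomfk => NULL);

-- Equivalent of @46off: switch the trace off
alter session set events '10046 trace name context off';
```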
And here is what I found in the tkprof-formatted trace file:
SQL ID: 56tj8jdcw15jb Plan Hash: 2257352843

select /*+ all_rows ordered dynamic_sampling(2) */ A.rowid, :1, :2, :3
from "ABCDXY"."TABLE_ABC_C" A, "ABCDXY"."TABLE_PARENT_ABC" B
where ("A"."COL_1" is not null
   and "A"."COL_2" is not null
   and "A"."COL_3" is not null
   and "A"."COL_4" is not null)
  and ("B"."COL_1" (+) = "A"."COL_1"
   and "B"."COL_2" (+) = "A"."COL_2"
   and "B"."COL_3" (+) = "A"."COL_3"
   and "B"."COL_4" (+) = "A"."COL_4")
  and ("B"."COL_1" is null
    or "B"."COL_2" is null
    or "B"."COL_3" is null
    or "B"."COL_4" is null)

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch        1    217.83     224.03          0    6179071          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total        3    217.83     224.03          0    6179071          0           0
This recursive query was enough for me to understand and fix the issue. I started by checking the historical execution statistics of this recursive query, as shown below:
SQL> @SqlHistStat
Enter value for sql_id: 56tj8jdcw15jb
Enter value for from_date: 01012020

SNAP_BEGIN                PLAN_HASH_VALUE    EXECS END_OF_FETCH  AVG_ETIME    AVG_LIO   AVG_ROWS
------------------------- --------------- -------- ------------ ---------- ---------- ----------
13-JAN-20 04.00.44.563 PM      2257352843        1            0        381   10934401          0
13-JAN-20 05.00.56.481 PM      2257352843        0            0       3551  104544327          0
13-JAN-20 06.00.05.981 PM      2257352843        0            0       3609  104844651          0
13-JAN-20 07.00.15.180 PM      2257352843        0            0       3619   84153912          0
13-JAN-20 08.00.29.468 PM      2257352843        0            0       3610  104607796          0
13-JAN-20 09.00.38.801 PM      2257352843        0            0       3619   93027006          0
13-JAN-20 10.00.49.488 PM      2257352843        0            0       3595  102089647          0
13-JAN-20 11.00.59.446 PM      2257352843        0            0       3545   99557582          0
14-JAN-20 12.00.09.039 AM      2257352843        0            0       3609  104246681          0
14-JAN-20 01.00.18.823 AM      2257352843        0            0       3608  106861170          0
14-JAN-20 02.00.28.120 AM      2257352843        0            0       3611  103242627          0
14-JAN-20 03.00.37.711 AM      2257352843        0            0       3610  106157348          0
14-JAN-20 04.00.47.161 AM      2257352843        0            0       3608  103770559          0
14-JAN-20 05.00.56.531 AM      2257352843        0            0       3547  103304809          0
14-JAN-20 06.00.05.487 AM      2257352843        0            0       3609  105243323          0
14-JAN-20 07.00.14.675 AM      2257352843        0            0       3608  107326829          0
14-JAN-20 08.00.23.442 AM      2257352843        0            0       3608  108137854          0
14-JAN-20 09.00.32.133 AM      2257352843        0            0       3609  107183542          0
14-JAN-20 10.00.41.111 AM      2257352843        0            0       3610  104855775          0
14-JAN-20 11.00.50.825 AM      2257352843        0            0       3554  100223228          0
14-JAN-20 12.00.00.751 PM      2257352843        0            0       3610   99540252          0
14-JAN-20 01.00.11.282 PM      2257352843        0            0       3614  101153086          0
14-JAN-20 02.00.21.974 PM      2257352843        0            0       3610   98943412          0
14-JAN-20 03.00.32.525 PM      2257352843        1            0        436   12358214          0

24 rows selected.
As you can see from the END_OF_FETCH column, this query spanned 24 one-hour snapshots without ever completing. It uses the following execution plan:
--------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Pstart| Pstop |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |       |       |       |
|*  1 |  FILTER                     |                  |       |       |       |
|   2 |   NESTED LOOPS OUTER        |                  |     1 |       |       |
|   3 |    PARTITION RANGE ALL      |                  |     1 |     1 |     2 |
|   4 |     TABLE ACCESS FULL       | TABLE_ABC_C      |     1 |     1 |     2 |
|   5 |    PARTITION RANGE ITERATOR |                  |     1 |   KEY |   KEY |
|*  6 |     TABLE ACCESS FULL       | TABLE_PARENT_ABC |     1 |   KEY |   KEY |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(("B"."COL_1" IS NULL OR "B"."COL_2" IS NULL OR
               "B"."COL_3" IS NULL OR "B"."COL_4" IS NULL))
   6 - filter(("B"."COL_1"="A"."COL_1" AND "B"."COL_2"="A"."COL_2" AND
               "B"."COL_3"="A"."COL_3" AND "B"."COL_4"="A"."COL_4"))
I don't know why, but whenever I have to fix a never-ending process, I almost always end up finding that a NESTED LOOPS join has something to do with the pain. Here the damage is easy to see: each row produced by the full scan of the child table drives a full scan of the corresponding parent table partition, so roughly 947K outer rows each rescan a slice of a 9-million-row table, block by block, entirely from the buffer cache. That is exactly the pure-CPU, logical-read-heavy profile that ASH and snapper reported.
It is now clear that the TABLE_ABC_C child table and its TABLE_PARENT_ABC parent needed fresh statistics before relaunching the FK-enabling process. I keep saying that whenever an estimated cardinality of 1 is reported in an execution plan, there is a big chance that it is due to stale or non-representative statistics.
SQL> exec dbms_stats.gather_table_stats('ABCDXY', 'TABLE_ABC_C', cascade => true, no_invalidate => false);

SQL> exec dbms_stats.gather_table_stats('ABCDXY', 'TABLE_PARENT_ABC', cascade => true, no_invalidate => false);

SQL> @HistStat
Enter value for owner: ABCDXY
Enter value for table_name: TABLE_ABC_C

OBJECT_NAME    OBJECT_TYPE  PREV_STAT_TIME           ROWCNT SAMPLESIZE ESTIMATE_PCT
-------------- ------------ -------------------- ---------- ---------- ------------
TABLE_ABC_C    TABLE
TABLE_ABC_C    TABLE        09/01/2020 16:59:25           0          0            0
TABLE_ABC_C    TABLE        14/01/2020 16:34:17      946798     946798          100
TABLE_ABC_C    TABLE        14/01/2020 17:09:02      946798     946798          100

SQL> @HistStat
Enter value for owner: ABCDXY
Enter value for table_name: TABLE_PARENT_ABC

OBJECT_NAME         OBJECT_TYPE  PREV_STAT_TIME            ROWCNT SAMPLESIZE ESTIMATE_PCT
------------------- ------------ -------------------- ----------- ---------- ------------
TABLE_PARENT_ABC    TABLE
TABLE_PARENT_ABC    TABLE        09/01/2020 16:59:25            0          0            0
TABLE_PARENT_ABC    TABLE        14/01/2020 16:34:43      9032840    9032840          100
TABLE_PARENT_ABC    TABLE        14/01/2020 17:08:52      9032840    9032840          100
And finally, the coast is clear to re-enable the foreign key, which now takes about 10 seconds, as shown below:
SQL> exec XZA_DDL.ENABLE_FK(owner => 'ABCDXY', nomtab => 'TABLE_ABC_C', nomfk => NULL);

PL/SQL procedure successfully completed.

Elapsed: 00:00:10.29  --> 10 seconds!
This was possible because the new execution plan switched from that dramatic NESTED LOOPS to a much more suitable HASH JOIN, as shown below:
SQL_ID  56tj8jdcw15jb, child number 0
-------------------------------------
Plan hash value: 1651427782

------------------------------------------------------------------------------------------
| Id  | Operation            | Name             | Rows  | Bytes |TempSpc| Pstart| Pstop |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |                  |       |       |       |       |       |
|   1 |  PARTITION RANGE ALL |                  |     1 |    90 |       |     1 |     3 |
|*  2 |   FILTER             |                  |       |       |       |       |       |
|*  3 |    HASH JOIN OUTER   |                  |     1 |    90 |    18M|       |       |
|   4 |     TABLE ACCESS FULL| TABLE_ABC_C      |   946K|    46M|       |     1 |     3 |
|   5 |     TABLE ACCESS FULL| TABLE_PARENT_ABC |  9032K|   335M|       |     1 |     3 |
------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("B"."COL_1" IS NULL OR "B"."COL_2" IS NULL OR
               "B"."COL_3" IS NULL OR "B"."COL_4" IS NULL))
   3 - access("B"."COL_4"="A"."COL_4" AND "B"."COL_1"="A"."COL_1" AND
              "B"."COL_2"="A"."COL_2" AND "B"."COL_3"="A"."COL_3")

SQL> @sqlstats
Enter value for sql_id: 56tj8jdcw15jb

     CHILD PLAN_HASH_VALUE   AVG_GETS   AVG_PIOS  AVG_ETIME     EXECS
---------- --------------- ---------- ---------- ---------- ---------
         0      1651427782     133519     109286  27.018412         1
Bottom Line
Regardless of the kind of SQL operation you are executing, always make sure you have given Oracle enough statistical information about your tables, indexes, and columns beforehand. If you don't, your luck will eventually run out and you will face performance trouble. In the case of enabling a foreign key constraint, the pain is exacerbated because it comes from a recursive query joining the parent and child tables: if that recursive query gets a wrong execution plan because of a lack of statistics, it will drastically delay the FK-enabling execution time, and it will be very hard to pinpoint that the root cause of the performance pain is that recursive query.
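As a practical pre-flight check before enabling an FK (assuming 11g or later, where the STALE_STATS column exists), you can query DBA_TAB_STATISTICS for both ends of the constraint; a sketch, using the schema and table names from this case:

```sql
-- Are statistics on the parent/child tables missing or stale?
select owner, table_name, num_rows, last_analyzed, stale_stats
from   dba_tab_statistics
where  owner = 'ABCDXY'
and    table_name in ('TABLE_ABC_C', 'TABLE_PARENT_ABC');
```

A NULL LAST_ANALYZED or STALE_STATS = 'YES' on either table is a strong hint to gather statistics first, exactly as this case demonstrated.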