Direct Path Reads and Cell Offloading

Most people associate direct path reads with an algorithm that simply controls the way a read is performed. In an Exadata environment, however, this is the algorithm that balances the load between the compute and the storage nodes. Something really important.
As usual, the algorithm is not perfect, and in some situations we need to be able to control it.

Here we’ll cover the following parameters:

_small_table_threshold
_very_large_object_threshold
_direct_read_decision_statistics_driven
_serial_direct_read

Each parameter has its own “charm” and reason to be here.

THE ALGORITHM

It is a well-known fact that the optimizer is not aware of Direct Path Reads /DPR/. All decisions are made at run time and are based on a few parameters.
The first and most important parameter is _small_table_threshold. It is calculated at instance startup and its value equals 2% of the buffer cache size (in blocks). The second parameter is _very_large_object_threshold, whose default value is 500. The Very Large Object Threshold is derived from it.

For direct path reads /DPR/ we have three thresholds:

Small Table Threshold – STT = _small_table_threshold – 2% of the buffer cache;
Medium Table Threshold – MTT = _small_table_threshold*5 – 10% of the buffer cache – deprecated;
Very Large Object Threshold – VLOT = (_small_table_threshold*_very_large_object_threshold)/2 – 5 times buffer cache size;

In a few words: everything below STT is never read with DPR, and everything above VLOT is always read with DPR.
If the object is between STT and VLOT, is not compressed with OLTP/HCC, and less than 50% of it is cached/dirty, then we have DPR. If the object is between STT and VLOT, is compressed with OLTP/HCC, and less than 95% of it is cached/dirty, then we have DPR.
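
If you want a rough idea of where these thresholds sit on your own instance, a query like the following can estimate them from the current buffer cache size. This is my own sketch, not from the original post: it assumes the default 2% relation for _small_table_threshold and the default 500 for _very_large_object_threshold, while the real values are fixed at instance startup and may differ.

select round(bc.bytes / bs.blksize * 0.02)           as stt_blocks,  -- ~ _small_table_threshold (2% of cache)
       round(bc.bytes / bs.blksize * 0.02) * 5       as mtt_blocks,  -- deprecated MTT (~10% of cache)
       round(bc.bytes / bs.blksize * 0.02) * 500 / 2 as vlot_blocks  -- ~ 5 times the buffer cache
from  (select sum(bytes) bytes from v$sgastat where name = 'buffer_cache') bc,
      (select to_number(value) blksize from v$parameter where name = 'db_block_size') bs;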

For clarity, the decision flow:

[Figure: DPR decision tree – STT, VLOT, compression and cached/dirty thresholds]

Yes, that’s right: we get more DPR for compressed objects. In my opinion there are two general reasons for this. First, we “typically” compress old/rarely accessed data. Second, in an Exadata environment the cells are capable of decompressing, so the hard work is done there. Nice.

As for MTT, this threshold is obsolete /since 11.2, AFAIK/. Previously it played the role that STT plays now.


TEST CASES

The Environment:

SQL>  select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE    11.2.0.4.0      Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production

SQL> set linesize 171
col KSPPINM for a45
col KSPPSTVL for a14
col KSPPDESC for a95
select ksppinm,ksppstvl
from x$ksppi x, x$ksppcv y
where (x.indx = y.indx)
and ksppinm in (
'_small_table_threshold'
,'_very_large_object_threshold'
,'_direct_read_decision_statistics_driven'
,'_serial_direct_read'
)
 10  order by ksppstvl;

KSPPINM                                       KSPPSTVL
--------------------------------------------- --------------
_very_large_object_threshold                  500
_small_table_threshold                        529
_serial_direct_read                           AUTO
_direct_read_decision_statistics_driven       TRUE

SQL>

The tests here were performed against both Exadata and non-Exadata environments.

For the examples we need to create two tables, and we must perform the tests outside of the SYS user.

-----for all examples we will need those 2 tables:
SQL> create table sdpx(x varchar2(3700),xx varchar2(3700));

Table created.
--or use compress for OLTP
SQL> create table sdpxc(x varchar2(3700),xx varchar2(3700)) compress for query;

Table created.

SQL>

Let’s start with STT>OBJECT_SIZE:

SQL> insert /*+ APPEND */ into sdpx
select DBMS_RANDOM.STRING('p',3700),
DBMS_RANDOM.STRING('p',3700)
  4  from dual connect by rownum <= 500;

500 rows created.

SQL> commit;

Commit complete.

SQL> alter system flush buffer_cache;

System altered.

SQL>  alter session set events 'trace [NSMTIO] disk highest, memory highest';

Session altered.

SQL>  select count(*) from sdpx;

  COUNT(*)
----------
       500

SQL>

--trace file result:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect's size: 518 (blocks), Threshold: MTT(2646 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::kctfsage::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: 1bjugqtm3abj6, plan_hash_value: 4229476807, Object#: 87836, Parition#: 0

From the trace we can see a cached read, with no checks reported apart from the object size. This is the expected behavior, because our object is 518 blocks and our STT is 529 blocks. But there is something odd in this trace file: it reports that our object is bigger than STT and smaller than MTT, which is obviously not true. I think this output is left over from an older version where MTT was still an active threshold.

Now OBJECT_SIZE > STT:
For this case I wrote a PL/SQL block that performs two basic tests on both tables.
The point of this block is simply to generate an accurate trace file.
Here it is:

set serverout on
alter session set events 'trace [NSMTIO] disk highest, memory highest';
---cached and dirty reads test for STT and MTT
declare
--set min object size
v_target_blks int default 768;
v_row_step int default 100;
v_cr_rows int;
v_alloc_blocks int;
v_dpr_s int;
v_dpr_c int;
v_tmp int;
begin

--truncate the table and drop statistics /if any/
execute immediate ('truncate table sdpx drop storage');
execute immediate ('truncate table sdpxc drop storage');
DBMS_STATS.DELETE_TABLE_STATS('','sdpx',no_invalidate => false );
DBMS_STATS.DELETE_TABLE_STATS('','sdpxc',no_invalidate => false );


--generate data for sdpx - no compression table:
insert into sdpx select rpad('X',3700,'X'),rpad('X',3700,'X') from dual connect by rownum <= v_target_blks; 
--deferred segment creation fix:
insert into sdpxc select rpad('X',3700,'X'),rpad('X',3700,'X') from dual connect by rownum = 1; 
commit;

--generate data for sdpxc - compression table:
loop
	select BLOCKS into v_alloc_blocks from dba_segments where segment_name='SDPXC';
	insert /*+ APPEND */ into sdpxc select DBMS_RANDOM.STRING('p',3700),DBMS_RANDOM.STRING('p',3700)
	from dual connect by rownum <= v_row_step; 
	commit;
	if v_alloc_blocks >= v_target_blks then
		exit;
	end if;
end loop;


--check cached blocks sdpx:
--flush buffers
execute immediate('alter system flush buffer_cache');
execute immediate('alter system flush shared_pool');
for c in (select rowid,rownum from sdpx) loop
	--perform buff read
	select count(*) into v_tmp from sdpx where rowid=c.rowid;
	--check statistics
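	--(note: statistic# 97 is assumed to be 'physical reads direct' on this 11.2.0.4 instance;
	-- on other versions, look the number up by name in v$statname)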
	select value into v_dpr_s from v$mystat where statistic#=97; 
	--perform FTS
	select count(*) into v_tmp from sdpx;
	--check statistics
	select value into v_dpr_c from v$mystat where statistic#=97; 
	--no dp, show stats
	if v_dpr_s=v_dpr_c then
		v_tmp:=(c.rownum/v_target_blks)*100;
		dbms_output.put_line('SDRX: cached read at '||v_tmp||'% cached rows');
		exit;
	end if;
end loop;


--check cached blocks sdpxc:
--flush buffers
execute immediate('alter system flush buffer_cache');
execute immediate('alter system flush shared_pool');
for c in (select rowid,rownum from sdpxc) loop
	--perform buff read
	select count(*) into v_tmp from sdpxc where rowid=c.rowid;
	--check statistics
	select value into v_dpr_s from v$mystat where statistic#=97; 
	--perform FTS
	select count(*) into v_cr_rows from sdpxc;
	--check statistics
	select value into v_dpr_c from v$mystat where statistic#=97; 
	--no dp, show stats
		v_tmp:=(c.rownum/v_cr_rows)*100;
	if v_dpr_s=v_dpr_c then
		dbms_output.put_line('SDRXC: cached read at '||v_tmp||'% cached rows');
		exit;
	end if;
end loop;


--check dirty blocks sdpx:
--flush buffers
execute immediate('alter system flush buffer_cache');
execute immediate('alter system flush shared_pool');
for c in (select rowid,rownum from sdpx) loop
	--perform update
	update sdpx set x=x where rowid=c.rowid;
	--check statistics
	select value into v_dpr_s from v$mystat where statistic#=97; 
	--perform FTS
	select count(*) into v_tmp from sdpx;
	--check statistics
	select value into v_dpr_c from v$mystat where statistic#=97; 
	--no dp, show stats
	if v_dpr_s=v_dpr_c then
		v_tmp:=(c.rownum/v_target_blks)*100;
		dbms_output.put_line('SDRX: cached read at '||v_tmp||'% dirty rows');
		exit;
	end if;
end loop;


--check cached blocks sdpxc:
--flush buffers
execute immediate('alter system flush buffer_cache');
execute immediate('alter system flush shared_pool');
for c in (select rowid,rownum from sdpxc) loop
	--perform update
	update sdpxc set x=x where rowid=c.rowid;
	--check statistics
	select value into v_dpr_s from v$mystat where statistic#=97; 
	--perform FTS
	select count(*) into v_cr_rows from sdpxc;
	--check statistics
	select value into v_dpr_c from v$mystat where statistic#=97; 
	--no dp, show stats
		v_tmp:=(c.rownum/v_cr_rows)*100;
	if v_dpr_s=v_dpr_c then
		dbms_output.put_line('SDRXC: cached read at '||v_tmp||'% dirty rows');
		exit;
	end if;
end loop;

rollback;

end;
/

Let’s run it and see the result:

SQL> /
SDRX: cached read at 55% cached rows
SDRXC: cached read at 97% cached rows
SDRX: cached read at 53% dirty rows
SDRXC: cached read at 92% dirty rows

PL/SQL procedure successfully completed.

SQL>

The PL/SQL block reports percentages based on rows rather than blocks, so it is not perfectly accurate. As I said before, the point is to generate an accurate trace file.
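
If you are not sure where that trace file went, a query like this (my addition, not part of the original test) should return the current session’s trace file on 11g:

select value from v$diag_info where name = 'Default Trace File';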

Now let’s check the trace:

----cached non-compressed table - last direct read at:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 87878, objn: 87874
ckpt: 1, nblks: 874, ntcache: 436, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87878, Object_Size = 874 blocks
SqlId = 1nb5kz7dm1jf6, plan_hash_value = 4229476807, Partition# = 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 87878, objn: 87874
ckpt: 0, nblks: 874, ntcache: 437, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87878, Object_Size = 874 blocks
SqlId = 1nb5kz7dm1jf6, plan_hash_value = 4229476807, Partition# = 0


--cached compressed - last direct read at:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 87879, objn: 87875
ckpt: 1, nblks: 728, ntcache: 690, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87879, Object_Size = 728 blocks
SqlId = aw4bnnzhm1vgz, plan_hash_value = 1064431592, Partition# = 0
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:NoDirectRead:[CACHE_READ]: tsn: 4, objd: 87879, objn: 87875
ckpt: 0, nblks: 728, ntcache: 698, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87879, Object_Size = 728 blocks
SqlId = aw4bnnzhm1vgz, plan_hash_value = 1064431592, Partition# = 0


---dirty non compressed table - last direct read at:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 87878, objn: 87874
ckpt: 0, nblks: 874, ntcache: 436, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87878, Object_Size = 874 blocks
SqlId = 1nb5kz7dm1jf6, plan_hash_value = 4229476807, Partition# = 0


---dirty compressed table - last direct read at:
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 87879, objn: 87875
ckpt: 0, nblks: 1524, ntcache: 1436, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87879, Object_Size = 1524 blocks
SqlId = aw4bnnzhm1vgz, plan_hash_value = 1064431592, Partition# = 0

From this we can see that for the non-compressed table, DPR stops at 50% cached or dirty blocks, while for the compressed table, DPR stops at 95% cached or dirty blocks.

Let’s try the VLOT now.
Again, a small PL/SQL block:

set serverout on
alter session set events 'trace [NSMTIO] disk highest, memory highest';
declare
v_stt int default 401;
v_vlot int default 500;
v_dpr_s int;
v_dpr_c int;
v_blks_c int;
v_tmp int;
begin
execute immediate ('truncate table sdpx drop storage');
execute immediate ('truncate table sdpxc drop storage');
DBMS_STATS.DELETE_TABLE_STATS('','sdpx',no_invalidate => false );
DBMS_STATS.DELETE_TABLE_STATS('','sdpxc',no_invalidate => false );
--generate data
insert /*+ APPEND */ into sdpx select rpad('X',3700,'X'),rpad('X',3700,'X') from dual connect by rownum <= (v_stt*v_vlot); 
commit;
execute immediate('alter system flush buffer_cache');
execute immediate('alter system flush shared_pool');
execute immediate('alter session set "_serial_direct_read"=never');
select count(*) into v_tmp from sdpx;
execute immediate('alter session set "_serial_direct_read"=auto');
--check statistics
select value into v_dpr_s from v$mystat where statistic#=97; 
select count(*) into v_tmp from sdpx;
--check statistics
select value into v_dpr_c from v$mystat where statistic#=97; 
if v_dpr_s!=v_dpr_c then
--check cached blocks
select count(distinct BLOCK#) into v_blks_c from v$bh where objd=(select dataobj# from sys.obj$ where name='SDPX');
dbms_output.put_line('Direct Path read');
dbms_output.put_line('Cached blocks: '||v_blks_c);
end if;
end;
/

In general, this PL/SQL block just caches as many of the table’s blocks as it can (via a buffered full scan with _serial_direct_read=never) and then performs a full table scan with the automatic decision back in place.

Let’s see the result:

SQL>  /
Direct Path read
Cached blocks: 17156

PL/SQL procedure successfully completed.

SQL>

It reports how many blocks were cached and that we had a DPR.

And the trace file:

NSMTIO: qertbFetch:DirectRead:[OBJECT_SIZE>VLOT]
NSMTIO: Additional Info: VLOT=132300
Object# = 87880, Object_Size = 201184 blocks
SqlId = 1nb5kz7dm1jf6, plan_hash_value = 4229476807, Partition# = 0

So, it just checks the object size and directly performs DPR because the object is bigger than the VLOT.

In conclusion:
DPR starts at STT;
For HCC/OLTP compressed objects, DPR stops at 95% cached/dirty blocks;
For non-compressed objects, DPR stops at 50% cached/dirty blocks;
Objects above VLOT are always read with DPR;

I want to point out for a second time that we must perform the tests outside of the SYS user.


CONTROLLING DIRECT PATH READS

_serial_direct_read is maybe the best-known way to control DPR and cell offloading. Yes, it is an underscore parameter, but it is not undocumented: you can find it in the official Oracle Exadata course, the Oracle Exadata documentation and in MOS notes.

The parameter can be changed dynamically at session level and takes the following values:
auto – automatic decisions about DPR – the algorithm above;
never – never offload/do DPR;
always – always offload/do DPR;

Unfortunately it cannot be specified as a hint, so you need to include it in your code or create a trigger that alters the specific sessions (a sketch follows below).
This is not a very comfortable way of enforcing DPR. Fortunately, we now have a better one :)
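
For completeness, here is what that can look like. The session-level ALTER is standard; the logon trigger is only my own sketch, and the user name REPORTING_USER is a made-up example:

--at session level:
alter session set "_serial_direct_read"=always;

--or via a logon trigger for specific sessions (hypothetical user name):
create or replace trigger trg_force_dpr
after logon on database
begin
  if sys_context('USERENV','SESSION_USER') = 'REPORTING_USER' then
    execute immediate 'alter session set "_serial_direct_read"=always';
  end if;
end;
/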

_direct_read_decision_statistics_driven is a game changer.
This parameter allows the object size used in the decision to be taken from the optimizer statistics instead of the segment header. And the best thing is that it defaults to TRUE /for versions >= 11.2.0.2/, which means we don’t have to change it at all. The only thing we have to do is take advantage of it.

Let’s make a short example:

--checking the DPR
SQL> truncate table sdpx drop storage;

Table truncated.

insert /*+ APPEND */ into sdpx
select DBMS_RANDOM.STRING('p',3700),
DBMS_RANDOM.STRING('p',3700)
  4  from dual connect by rownum <= 300;

300 rows created.

SQL> commit;

Commit complete.

SQL> exec DBMS_STATS.GATHER_TABLE_STATS ('','SDPX', no_invalidate => false);

PL/SQL procedure successfully completed.

--flush buffer cache - CAUTION - in prod it is better to wait for the object's blocks to age out :)
SQL> alter system flush buffer_cache;

System altered.

SQL>

--trace file output:
NSMTIO: qertbFetch:NoDirectRead:[- STT < OBJECT_SIZE < MTT]:Obect''s size: 314 (blocks), Threshold: MTT(2646 blocks),
_object_statistics: enabled, Sage: enabled,
Direct Read for serial qry: enabled(::::kctfsage::), Ascending SCN table scan: FALSE
flashback_table_scan: FALSE, Row Versions Query: FALSE
SqlId: gb1uqwc73sutd, plan_hash_value: 4229476807, Object#: 87889, Parition#: 0

--changing numblks to a value bigger than STT:
SQL> alter system flush buffer_cache;

System altered.

SQL> exec DBMS_STATS.SET_TABLE_STATS ('','SDPX', numblks => 550, no_invalidate => false);

PL/SQL procedure successfully completed.

SQL> select count(*) from sdpx;

  COUNT(*)
----------
       300

SQL>


--trace file output:
*** 2014-02-23 19:08:46.150
NSMTIO: qertbFetch:[MTT < OBJECT_SIZE < VLOT]: Checking cost to read from caches(local/remote) and checking storage reduction factors (OLTP/EHCC Comp)
NSMTIO: kcbdpc:DirectRead: tsn: 4, objd: 87889, objn: 87874
ckpt: 1, nblks: 550, ntcache: 0, ntdist:0
NSMTIO: Additional Info: VLOT=132300
Object# = 87889, Object_Size = 550 blocks
SqlId = gb1uqwc73sutd, plan_hash_value = 4229476807, Partition# = 0

Nice! We just changed numblks and achieved DPR!
In the beginning Oracle assumed we had a small object that should not be read with DPR. Then, after changing the numblks statistic to a value bigger than STT, we achieved DPR. That’s great: we have control at the object level!
But we must be careful, because this can change the decisions made by the optimizer for all plans related to that object. If the difference between the real and the faked value is not big, the chance of something going wrong is small.
The other thing to keep in mind is statistics gathering. The next time statistics are gathered, numblks can be reset to an inappropriate value again. We can prevent this by locking the object statistics and creating a job for manual statistics gathering on that object. For example, the job would gather the statistics with force for that object and then set numblks back to the appropriate value, if it isn’t already. A sketch of this approach follows below.
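
A minimal sketch of that approach (my own illustration; the schema name, the numblks value of 550 and the daily schedule are all assumptions, not from the original post):

--lock the statistics so regular gathering cannot silently overwrite numblks:
exec DBMS_STATS.LOCK_TABLE_STATS('SCOTT','SDPX');

--a job that re-gathers the stats with force and re-applies numblks:
begin
  dbms_scheduler.create_job(
    job_name        => 'SDPX_STATS_JOB',
    job_type        => 'PLSQL_BLOCK',
    job_action      => q'[begin
                            dbms_stats.gather_table_stats('SCOTT','SDPX', force => true);
                            dbms_stats.set_table_stats('SCOTT','SDPX', numblks => 550,
                                                       force => true, no_invalidate => false);
                          end;]',
    repeat_interval => 'FREQ=DAILY;BYHOUR=2',
    enabled         => true);
end;
/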

Happy geeking :)
Kovachev

Update: Changing STT does not affect the VLOT value; check the comments.


COMMENTS

  1. Niki, there is also event 10949 to influence serial direct path reads. It looks like the opposite of _serial_direct_read: in some cases we might want to use the conventional path even for big tables.

  2. Nikolay,
    you wrote:
    “Very Large Object Threshold – VLOT = (_small_table_threshold*_very_large_object_threshold)/2 ”

    I talked about it with Igor Usoltsev a year ago, when I played with index_stats and direct path reads – http://orasql.org/2013/03/07/just-test-of-adaptive-direct-path-reads-with-index_stats/
    Igor later tested my guess, as we agreed: http://iusoltsev.wordpress.com/2013/06/03/_very_large_object_threshold/
    and confirmed that “_very_large_object_threshold” is a percentage of _db_block_buffers

    Regards,
    Sayan Malakshinov

    • Hello,
      I’m saying: “VLOT = (_small_table_threshold*_very_large_object_threshold)/2 – 5 times buffer cache size”
      So, 5 times the buffer cache size. I was searching for a way to calculate VLOT and found two.
      The intention of this post is to describe the behavior of the algorithm and how to control it in a safe way /read the whole post :)/
      Changing hidden and undocumented parameters is definitely not a safe one.

      Let’s resolve it with a quick test:
      STT=940
      trc: NSMTIO: Additional Info: VLOT=235200

      STT=5
      trc: NSMTIO: Additional Info: VLOT=235200

      Indeed, STT does not affect VLOT.

      Thanks for pointing this out and for stopping by :)

  3. Hi Nikolay,
    Have you seen the behaviour of parallel settings in RAC affecting the decision for DPR?
    I have seen parallel queries choosing DPR over db file scattered reads when parallel_max_servers is increased.
    I have a test case to share if required.

    • I mean that if you have a plan with a parallel full scan operation (i.e. parallel slaves doing full scans) and your version is >= 10.1, then most likely direct path reads will be used. Also, a different DOP value can affect the plan, or even disable parallelism if it is equal to 0 or 1.

      • Hello,
        First – DPR is the default behavior of parallel operations. Indeed, there are cases in which parallel operations do not use DPR, but that is the opposite of what is described in the previous comment. In order to have a situation in which a parallel read operation uses buffered reads by default, we need “In-Memory Parallel Execution”. So the test case would have to be an In-Memory Parallel execution that chooses DPR.

        To be more specific, the documentation for in-memory parallel execution says:
        “If the size of the object is larger than the size of the buffer cache (single instance) or the size of the buffer cache multiplied by the number of active instances in an Oracle RAC cluster, then the object is read using direct-path reads”
        So, we would have to implement the same test cases as in here, but with in-memory parallel execution /I suppose the thresholds will be the same/, plus additional tests with different values for the PARALLEL_MAX_SERVERS parameter.

        The test case sounds interesting, that’s why I asked for a link :)

        Thanks.

  4. Nikolay,
    I didn’t write a blog post on it, and at the moment I am not able to reproduce the behaviour, so apologies for that.
    What I remember from the previous experience: the query was like

    select /*+ parallel */ count(1) from <table>;

    on a 2 node RAC.
    With parallel_force_local=true and parallel_max_servers=8
    it didn’t use DPR and used scattered reads.

    When I set parallel_force_local=false, the query used DPR with 16 sessions running in parallel.

    But I will try to reproduce it and then share.
