Thursday, November 27, 2008

OS Code Swarms - cool!

OS performance diagnostics for Oracle

I found a presentation on OS performance diagnostics for Oracle. It's a good start if you haven't done much performance tuning yourself. It will help you familiarize yourself with the different utilities and point you to some basic parameters you should be aware of. Most of the presentation is valid for any application running on a system, not just Oracle.
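I haven't reproduced the presentation's exact tool list here, but as a rough sketch of where such a walk-through usually starts on Solaris, these are the standard utilities to get familiar with first:

vmstat 5          # run queue, paging activity, overall CPU
mpstat 5          # per-CPU utilisation, cross-calls, mutex spins
iostat -xnz 5     # per-device I/O with service times, idle devices skipped
prstat -mL 5      # per-thread microstate accounting (USR/SYS/LAT/SLP/...)
netstat -s        # protocol statistics (e.g. the TCP listenDrop counters)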

Friday, November 14, 2008

Oracle's Listener Performance

Recently I posted a blog entry on TCP ListenDrops and the Listener. Since that tuning we haven't experienced a single listen drop - good. Nevertheless we started experiencing some performance issues with the same Oracle instance. The problem was that although there was plenty of CPU to spare, no increase in disk I/O and generally nothing wrong with the system at first glance, getting connected to Oracle could take as long as 4 seconds. Once connected, the database responded really fast. Because the connection time used to be well below 1s, the impact was relatively huge - especially when you have programs making a couple of connections and SQL queries to the database while a customer is waiting for a page to be built. Add almost 4s extra for each database connection and you end up with an extra 4-12s of delay for a user. So what happened?

First I wanted to check if it was the database problem after all. There was a CGI script which I knew was producing its results much more slowly than it used to. I was told it usually produced the report within 5s and now it takes 8-15s to do so. A quick look around the middleware box and at its historical data (CPU, run queue, network, etc.) didn't reveal anything obvious. So I slightly modified the shortlived.d script from the DTraceToolkit to print the total CPU time and real time it took to execute the CGI script in production. The real times were oscillating between 8-16s while the CPU time (vtime) was always within 2s. That means the script is definitely not CPU bound. A quick look at the script itself shows it doesn't actually do much - basically two database connections, and it prints parsed output. It strongly suggests that it spends most of its time waiting for the database, and that's where I should focus.
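The change was just a tweak to shortlived.d, but a minimal standalone sketch of the same idea looks roughly like this (report.cgi is a hypothetical name for the CGI program):

#!/usr/sbin/dtrace -qs
/* print real vs. on-CPU time for every run of a given program */

proc:::exec-success
/execname == "report.cgi"/
{
        self->t = timestamp;
        self->vt = vtimestamp;
}

proc:::exit
/self->t/
{
        printf("%s pid %d: real %d ms, cpu %d ms\n", execname, pid,
            (timestamp - self->t) / 1000000,
            (vtimestamp - self->vt) / 1000000);
        self->t = 0;
        self->vt = 0;
}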

I tried to connect to the database using the sqlplus client and it took a couple of seconds even before I got a prompt. My first thought was that it was the ListenDrop issue I was blogging about recently - it wasn't. So I checked if the problem was still there when connecting to the listener locally from the db host itself instead of going over the network, to eliminate everything in between. The issue was still there, so it definitely wasn't a network issue (at least not per se) but rather something local to the db box.
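A crude but effective way to measure this is to time an empty sqlplus session, first over the network and then locally on the db host against the same listener (the credentials and service name below are made up):

# on a client box, over the network
client# time sqlplus -S scott/tiger@PRODDB < /dev/null

# on the db host itself, still going through the listener
dbhost# time sqlplus -S scott/tiger@PRODDB < /dev/null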

I decided to use the truss utility on the listener to see why it takes so much time to establish a session. It is easy to filter out my connection, as nothing else connects locally to the database, so I just needed to look for an accept() related to the IP address of the db server. So I ran the sqlplus client and truss more or less at the same time and repeated it several times. It quickly turned out that it takes over 3s on average before the listener even does the accept() for my connection:


# truss -o /tmp/milek1 -d -f -v all -ED -p 15697
15697: 3.9228 0.0001 0.0001 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 35705


So the problem is somewhat related to the ListenDrop issue - the new connection is being backlogged for almost 4s before the listener picks it up; the listener just doesn't do accept()'s quickly enough to drain the queue in real time. So either we are generating more connections per second to the database, or for some reason the listener is slower. Because it was already late and we couldn't find any recent changes which would cause more connections to the database, I decided to look at what the listener actually does and how much time it takes.

After a quick analysis of the truss output for several connections to the listener, and knowing that the listener is a single-threaded process, the overall picture of what it does is:

[...]
pollsys()
accept()
read() - over a network descriptor!
pipe()
pipe()
fork1() - listener forks a child
fork1() - the child forks again; the grandchild will become the oracle process
execve() - the grandchild exec()s the given oracle binary
_exit() - the intermediate child exits
waitid() - listener waits for its child to exit
read() - from pipe
write() - to pipe
read() - from pipe
read() - from pipe
close() - close pipes and network fd
pollsys()
[...]


And it does this in a loop. Let's see the truss output from handling one such connection:

# truss -o /tmp/milek6 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek6
[...]
15697: 1.5426 1.0901 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5440 0.0014 0.0000 getsockname(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5442 0.0002 0.0000 getpeername(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) Err#134 ENOTCONN
15697: 1.5463 0.0021 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41612
15697: 1.5464 0.0001 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.5465 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.5466 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.5468 0.0002 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.5472 0.0004 0.0001 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.5473 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.5474 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.5475 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.5476 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.5478 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.5489 0.0011 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.5494 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.5497 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.5497 0.0000 0.0000 pipe() = 14 [15]
15697: 1.5498 0.0001 0.0000 pipe() = 16 [17]
15697: 1.5798 0.0300 0.0299 fork1() = 8195
15697: 1.5853 0.0055 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.6395 0.0542 0.0000 waitid(P_PID, 8195, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=8195 status=0x0000
15697: 1.6396 0.0001 0.0000 close(14) = 0
15697: 1.6397 0.0001 0.0000 close(17) = 0
15697: 1.7063 0.0666 0.0000 read(16, " N T P 0 8 1 9 7\n", 64) = 10
15697: 1.7072 0.0009 0.0000 getpid() = 15697 [1]
15697: 1.7073 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.7076 0.0003 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.7078 0.0002 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.7079 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8377 0.1298 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0020 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8397 0.0000 0.0000 close(15) = 0
15697: 1.8398 0.0001 0.0000 close(16) = 0
15697: 1.8400 0.0002 0.0000 close(13) = 0
15697: 1.8424 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x06919427
15697: 1.8425 0.0001 0.0001 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220
15697: 2.6446 0.8021 0.0000 pollsys(0x10041CDA8, 2, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
[...]


Now let's look closely at the timing. It took almost 0.3s (1.84 - 1.54 = 0.30) from the accept() to finishing handling that connection and moving on to the next accept(). Assuming that is the average time to handle one connection, the listener would be able to process only about 3 connections per second! That's not much.

Now add up the delta times of the fork1(), the waitid() and the two long read()s from the pipe in the truss output above: 0.0300 + 0.0542 + 0.0666 + 0.1298 = 0.2806s. Remember that it took about 0.3s for the entire loop to complete. So the stretch from the moment the listener does fork1() to the moment it reads from the child's pipe for the second time accounts for more than 90% of the real time spent handling a request.

When the listener does a fork1(), its child basically does another fork1() and exits:

# grep ^8195 /tmp/milek6
8195: 1.5798 1.5798 0.0000 fork1() (returning as child ...) = 15697
8195: 1.5855 0.0057 0.0000 getpid() = 8195 [15697]
8195: 1.5863 0.0008 0.0000 lwp_self() = 1
8195: 1.5864 0.0001 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.5868 0.0004 0.0000 schedctl() = 0xFFFFFFFF7F78A000
8195: 1.6115 0.0247 0.0246 fork1() = 8197
8195: 1.6159 0.0044 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
8195: 1.6163 0.0004 0.0000 _exit(0)


Let's see what happens in the 2nd child then:


# grep ^8197 /tmp/milek6
8197: 1.6115 1.6115 0.0000 fork1() (returning as child ...) = 8195
[...]
8197: 1.6433 0.0139 0.0137 execve("/dslrp/ua01/app/oracle/product/9.2.0/bin/oracle", 0x1002A74B0, 0x1112E95E0) argc = 2
[...]

8197: 1.7062 0.0000 0.0000 write(17, " N T P 0 8 1 9 7\n", 10) = 10
[exec has already completed; the new oracle process is up and is signaling to the listener that it is up]
[it writes to the pipe - this is the first read from the pipe on the listener side]


[...]
8197: 1.7153 0.0001 0.0000 cladm(CL_INITIALIZE, CL_GET_BOOTFLAG, 0xFFFFFFFF7D1044FC) = 0
8197: bootflags=CLUSTER_CONFIGURED|CLUSTER_BOOTED
8197: 1.7164 0.0011 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.7164 0.0000 0.0000 cladm(CL_CONFIG, CL_HIGHEST_NODEID, 0xFFFFFFFF7D104508) = 0
8197: nodeid=64
8197: 1.7165 0.0001 0.0000 cladm(CL_CONFIG, CL_NODEID, 0xFFFFFFFF7D104504) = 0
8197: nodeid=2
8197: 1.7176 0.0011 0.0010 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7186 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7196 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7206 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E88) = 0
8197: 1.7219 0.0013 0.0012 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7229 0.0010 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4E28) = 0
8197: 1.7238 0.0009 0.0009 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4EE8) = 0
8197: 1.7256 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7273 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7291 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7308 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7325 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7343 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7360 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7377 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7395 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7412 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7429 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7447 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7465 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7482 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7499 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7517 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7534 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7552 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7570 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7587 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7605 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7622 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7639 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7657 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7674 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7691 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7709 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7726 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7743 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7760 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7778 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7795 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7812 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7830 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7847 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7864 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7882 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7899 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7917 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7934 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7951 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7968 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.7986 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8003 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8020 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8037 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8054 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8072 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8089 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8106 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8123 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8140 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8158 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8175 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8192 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8209 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8226 0.0017 0.0016 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8244 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8262 0.0018 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8279 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8296 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8313 0.0017 0.0017 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4F08) = 0
8197: 1.8322 0.0009 0.0008 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF4FB8) = 0
8197: 1.8324 0.0002 0.0001 cladm(CL_CONFIG, 17, 0xFFFFFFFF7FFF5168) = 0
[...]
8197: 1.8377 0.0001 0.0000 write(17, "\0\0\0\0", 4) = 4
[this is the second time it writes to the pipe - the 2nd read from the pipe in the listener]

[...]
8197: 1.8397 0.0002 0.0000 write(17, "\0\0\0\0", 4) = 4
[the 3rd write to the pipe; after the 3rd read from the pipe the listener cleans up and moves on to another connection]

[...]


It calls cladm() 75 times and the total time it takes just to handle these 75 cladm() calls is about 0.11s - that's quite a lot compared to the 0.28s the listener spends waiting for a child. If I could get rid of these cladm() calls I could almost double the connection rate the listener can handle.
Let's see what exactly is calling cladm():

# dtrace -n syscall::cladm:entry'{@[ustack()]=count();}' -n tick-5s'{printa(@);exit(0);}'
[...]
libc.so.1`_cladm+0x4
nss_cluster.so.1`0xffffffff7d003364
nss_cluster.so.1`0xffffffff7d00219c
nss_cluster.so.1`0xffffffff7d0026e8
nss_cluster.so.1`0xffffffff7d002d1c
nss_cluster.so.1`0xffffffff7d0019b8
libc.so.1`nss_search+0x288
libnsl.so.1`_switch_gethostbyname_r+0x70
libnsl.so.1`_get_hostserv_inetnetdir_byname+0x958
libnsl.so.1`gethostbyname_r+0xb8
oracle`nttmyip+0x2c
oracle`nttcon+0x698
oracle`ntconn+0xd4
oracle`nsopen+0x840
oracle`nsgetinfo+0xb4
oracle`nsinh_hoffable+0x38
oracle`nsinh_hoff+0xe20
oracle`nsinherit+0x204
oracle`niotns+0x44c
oracle`osncon+0x3a0
64



A quick look at SunSolve and I found Document ID: 216260.

Let's confirm whether we have the cluster keyword in the /etc/nsswitch.conf file:


# grep ^hosts /etc/nsswitch.conf
hosts: cluster files [SUCCESS=return] dns


I removed the cluster keyword from the file and checked with truss again whether it made any difference.
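Assuming nothing else on the line changes, the hosts entry should now look like this:

# grep ^hosts /etc/nsswitch.conf
hosts: files [SUCCESS=return] dns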


# truss -o /tmp/milek4 -d -f -v all -ED -p 15697
# grep ^15697 /tmp/milek4
[...]
15697: 1.6867 0.0020 0.0020 accept(12, 0xFFFFFFFF7FFF889C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 13
15697: AF_INET name = 10.44.29.32 port = 41573
15697: 1.6869 0.0002 0.0000 getsockname(13, 0xFFFFFFFF7FFF888C, 0xFFFFFFFF7FFF88AC, SOV_DEFAULT) = 0
15697: AF_INET name = 10.44.29.18 port = 1521
15697: 1.6870 0.0001 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF88AC) = 0
15697: 1.6871 0.0001 0.0000 setsockopt(13, tcp, TCP_NODELAY, 0xFFFFFFFF7FFF8A14, 4, SOV_DEFAULT) = 0
15697: 1.6881 0.0010 0.0000 fcntl(13, F_SETFD, 0x00000001) = 0
15697: 1.6885 0.0004 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnk59.so", F_OK) = 0
15697: 1.6887 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libngss9.so", F_OK) Err#2 ENOENT
15697: 1.6888 0.0001 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnnts9.so", F_OK) Err#2 ENOENT
15697: 1.6890 0.0002 0.0000 access("/dslrp/ua01/app/oracle/product/9.2.0/lib/libnrad9.so", F_OK) = 0
15697: 1.6891 0.0001 0.0000 sigaction(SIGPIPE, 0xFFFFFFFF7FFF8B80, 0xFFFFFFFF7FFF8CA8) = 0
15697: new: hand = 0x00000001 mask = 0x9FBFF057 0x0000FFF7 0 0 flags = 0x000C
15697: old: hand = 0x00000001 mask = 0 0 0 0 flags = 0x0000
15697: 1.6893 0.0002 0.0000 pollsys(0x10041CDA8, 3, 0x00000000, 0x00000000) = 1
15697: fd=9 ev=POLLIN|POLLRDNORM rev=0
15697: fd=12 ev=POLLIN|POLLRDNORM rev=0
15697: fd=13 ev=POLLIN|POLLRDNORM rev=POLLIN|POLLRDNORM
15697: 1.6906 0.0013 0.0000 read(13, "\0F0\0\001\0\0\001 801 ,".., 8208) = 240
15697: 1.6911 0.0005 0.0000 ioctl(13, FIONBIO, 0xFFFFFFFF7FFF9E7C) = 0
15697: 1.6914 0.0003 0.0000 fcntl(13, F_SETFD, 0x00000000) = 0
15697: 1.6915 0.0001 0.0000 pipe() = 14 [15]
15697: 1.6918 0.0003 0.0000 pipe() = 16 [17]
15697: 1.7276 0.0358 0.0357 fork1() = 1214
15697: 1.7327 0.0051 0.0000 lwp_sigmask(SIG_SETMASK, 0x00000000, 0x00000000) = 0xFFBFFEFF [0x0000FFFF]
15697: 1.7874 0.0547 0.0000 waitid(P_PID, 1214, 0xFFFFFFFF7FFF6D10, WEXITED|WTRAPPED) = 0
15697: siginfo: SIGCLD CLD_EXITED pid=1214 status=0x0000
15697: 1.7877 0.0003 0.0000 close(14) = 0
15697: 1.7878 0.0001 0.0000 close(17) = 0
15697: 1.8755 0.0877 0.0000 read(16, " N T P 0 1 2 1 6\n", 64) = 10
15697: 1.8771 0.0016 0.0000 getpid() = 15697 [1]
15697: 1.8772 0.0001 0.0000 fcntl(16, F_SETFD, 0x00000001) = 0
15697: 1.8777 0.0005 0.0000 write(15, "\0\0\0 =", 4) = 4
15697: 1.8778 0.0001 0.0000 write(15, " ( A D D R E S S = ( P R".., 61) = 61
15697: 1.8779 0.0001 0.0000 write(15, "\0\00401", 4) = 4
15697: 1.8829 0.0050 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8848 0.0019 0.0000 read(16, "\0\0\0\0", 4) = 4
15697: 1.8851 0.0003 0.0000 close(15) = 0
15697: 1.8852 0.0001 0.0000 close(16) = 0
15697: 1.8854 0.0002 0.0000 close(13) = 0
15697: 1.8878 0.0024 0.0000 lseek(3, 0, SEEK_CUR) = 0x0690820E
15697: 1.8879 0.0001 0.0000 write(3, " 0 7 - N O V - 2 0 0 8 ".., 220) = 220


Now the timings are better. If you compare the numbers you will see a reduction of more than 35% in the time the listener waits for a child before it can move on to process another request (from about 0.28s down to about 0.18s). It improves the connection rate the listener can sustain from about 3.3 per second to about 5 per second.

Why cladm() is being called 75 times each time an Oracle process starts - I don't know; at first glance it looks like some bug in the nss_cluster.so.1 library. It doesn't matter for now, as removing the keyword is a good workaround. Nevertheless it will have to be raised with Sun's support.

The other issue is that the system hasn't been patched recently and the nsswitch.conf file hasn't been changed. So while it is good to see some performance improvement, there must be something else which caused the listener to process new connections more slowly. I manually confirmed that the change made a difference - now the sqlplus client connects in less than 2s and the CGI script executes in a shorter time. Putting the cluster keyword back, the numbers got worse again. Removing it, and the numbers improve. Good.

If you look closely at the timings again you will notice that, with cladm() out of the picture, the majority of the time the listener now spends is in waiting for the two fork1()'s and the execve() to complete. Now why would they be slower than before? I was almost certain it wasn't due to CPU, but it could be due to memory, and it rather had to be something related only to the listener, as once the oracle process is up it processes requests as fast as usual.
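A quick way to check that suspicion (15697 is the listener PID from the truss output above) is to look at the process' size and RSS:

# ps -o pid,vsz,rss,args -p 15697
# pmap -x 15697 | tail -1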

I checked how big the listener process was and it was over 400MB in size (RSS). We quickly set up another listener on the box, listening on a different TCP port, and it turned out to be only about 23MB in size. We tried to connect to the database using the new listener and it took much less than a second. We switched some traffic over to the new listener and it was able to process more requests per second. We could also verify this by writing a simple DTrace script which measures the number of accept()'s per second, broken down by PID.


# cat accept_rate.d
#!/usr/sbin/dtrace -qs

syscall::accept:entry
{
        setopt("aggsortkey");
        setopt("aggsortkeypos", "1");
        @[execname, pid] = count();
}

tick-1s
{
        printf("%Y\n", walltimestamp);
        printa("process: %-20s [%-5d] %@5d\n", @);
        clear(@);
        printf("\n");
}


We left the 2nd listener running and reconfigured some clients' tnsnames.ora files so that they load-balance between the two listeners when establishing new connections to the database. That way we could compare both listeners and further convince ourselves that a restart should help the performance.
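The client-side change is roughly the following tnsnames.ora entry (host, ports and service name are made up); with LOAD_BALANCE=on the client picks one of the listed addresses at random for each new connection:

PRODDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost)(PORT = 1522))
    )
    (CONNECT_DATA = (SERVICE_NAME = PRODDB))
  )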

Knowing that the listener basically loops between pollsys() syscalls while handling new connections, I wrote a simple script showing the distribution of time (real and on-CPU) it takes the listener to handle a connection:


# cat listener_loop_times.d
#!/usr/sbin/dtrace -qs

syscall::pollsys:return
/execname == "tnslsnr"/
{
        self->t = timestamp;
        self->vt = vtimestamp;
}

syscall::pollsys:entry
/self->t/
{
        /* real and on-CPU time between pollsys() calls, in 10ms buckets */
        @time[execname, pid] = lquantize((timestamp - self->t) / 10000000, 0, 30);
        @vtime[execname, pid] = lquantize((vtimestamp - self->vt) / 10000000, 0, 10);

        self->t = 0;
        self->vt = 0;
}

tick-5s
{
        printf("%Y\n", walltimestamp);
        printa(@time);
        printa(@vtime);
}


Below you can see a sample output, where 25351 is the PID of the new listener and 15697 is the PID of the old listener which grew to over 400MB in size. The first two distributions show real time, the last two on-CPU time, both in 10ms buckets.


18 92927 :tick-5s
tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 162
1 | 0
2 |@@@@@@@@@ 71
3 |@@@@@@@@@@@ 87
4 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 206
1 | 0
2 | 0
3 | 0
4 | 0
5 | 0
6 | 0
7 | 0
8 | 1
9 |@ 10
10 |@@@@@@ 67
11 |@@@@@@@@@ 90
12 |@@@@ 41
13 | 3
14 | 1
15 | 0


tnslsnr 25351
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 320
1 | 0

tnslsnr 15697
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@ 207
1 | 0
2 |@@@@@@@ 76
3 |@@@@@@@@@@@ 118
4 |@@ 17
5 | 0
6 | 1
7 | 0



We did restart the original listener and, as expected, its performance was much better.

It turned out there is an Oracle bug (5576565) - a memory leak in the listener causing a performance problem. Oracle provides patch 5576565, which should fix the issue.

Some quick conclusions:
  • The standard (non-MTS) listener architecture is far from highly performant
  • Use connection pooling (re-using connections + a query queue) if you can
  • If you need Oracle to handle a higher connection rate, deploy multiple listeners and set up clients to load-balance between them
  • Restarting a listener won't affect already established TCP/IP connections to the database, so for example applications using connection pooling or reports that are already running shouldn't be affected by a listener restart
  • MTS performance characteristics should be different
  • Choose an operating system with good observability tools

Wednesday, November 12, 2008

Fishworks

When you think about Open Solaris, you have many really unique features there which you can't find on other platforms. Then there is a big market of customers who don't have, or don't need, the skills required to use these tools - what they want is an appliance approach, where you put something in your network and it just works for you. It would be great if one could get all the benefits of Open Solaris and yet didn't have to be a senior sysadmin in order to use them. When it comes to NAS appliances, Fishworks is just that. It's like a NetApp filer but better - more flexible, more scalable, with better performance, and cheaper. You can find a good overview of the new appliances here.

It is really interesting to learn how it all started three years ago. About a year later Bryan asked me if I wanted to participate in the eval program for Fishworks - I didn't even know what it was back then, but hey... it's Bryan, it has to be something exciting! So I said yes right away :) :) :)

Now, in the environments I used to work in, I'm not that big an enthusiast for appliances; nevertheless I can appreciate them in the environments they belong to - and I really believe that what they have come up with will be a disruptive technology in a market where NetApp has almost a monopoly and is charging insane money for their proprietary technology. Don't get me wrong, NetApp does provide very good technology - it's just that it has always been overpriced and there was no real competition for them. Thanks to open source technologies like Open Solaris, ZFS, DTrace, etc. that's no longer the case. Actually, in the environments I used to work in, NetApp usually is no longer an option at all thanks to Open Solaris/ZFS. Now, thanks to Fishworks, you have an alternative to NetApp, and I believe it is a much better alternative. And for everyone for whom it is important - Fishworks is built on top of open source software, with the file system being the most important building block. NetApp's file system is proprietary and closed, and you end up being locked in by them to some degree.

Almost two years ago I wrote one of my emails with observations about an early Fishworks prototype, and I was so excited by Analytics that I went wild and proposed to create Doctor D! - a virtual advisor which would help you tell what's wrong with your system and what you should do to improve the situation. Of course we can't deliver something like this in the foreseeable future. Nevertheless, Analytics gives you so much insight into a system that in essence you become Doctor D! All you need to do is correlate all the forensic data presented to you in a very attractive manner. Of course it uses DTrace underneath - that's where the D! comes from - but because we are talking about an appliance you don't have to know DTrace or any other OS tools at all, and yet you can still harness the power of these tools and fix your issue.


You can get more information on the official Fishworks blog.
Also check the links below for more interesting info.

News stories I got from here:


Wall Street Journal: Sun Expands 'Open' Storage Line
http://online.wsj.com/article/SB122627611222612077.html?mod=googlenews_wsj

Forbes: Sun's Flash of Hope
http://www.forbes.com/cionetwork/2008/11/08/sun-storage-memory-tech-enter-cx_ag_1110sun.html

San Jose Mercury News: Sun hopes new line of storage servers will help revive its fortunes
http://www.mercurynews.com/breakingnews/ci_10929798?nclick_check=1

eWeek: Sun Discards RAID in Its First-ever Storage Appliance
http://www.eweek.com/c/a/Data-Storage/Sun-Unveils-Its-Firstever-Storage-Appliance/

IDG News Service: Sun Rolls out Its Own Storage Appliances
http://www.pcworld.com/businesscenter/article/153566/sun_rolls_out_its_own_storage_appliances.html

InformationWeek: Sun Unveils 'Open Storage' Appliances
http://www.informationweek.com/news/storage/systems/showArticle.jhtml?articleID=212001591

The Register: Sun trumpets radically simple open storage boxes
http://www.theregister.co.uk/2008/11/10/suns_amber_road_storage/

CRN: Sun Intros Solid State Drives, Analytics In Storage Line
http://www.crn.com/storage/212001321

CNET News: Sun expands its open storage line, hopes for accelerated growth
http://news.cnet.com/8301-13505_3-10092362-16.html

ZDNet: Sun claims to revolutionise storage with new array
http://news.zdnet.co.uk/hardware/0,1000000091,39547411,00.htm

TheStreet: Sun to Unveil New Open Storage Hardware
http://www.thestreet.com/story/10446870/1/sun-to-introduce-new-open-storage-hardware.html?puc=googlefi&cm_ven=GOOGLEFI&cm_cat=FREE&cm_ite=NA