Monday, June 18, 2007

Solving a Performance Issue of a Multi-threaded Application on Solaris, Linux & Windows

On Solaris 10 ::
A colleague approached me with the following problem.
Brief background:: His application is multi-threaded and runs on a T2000, a multi-core, multi-threaded server with an UltraSPARC T1 processor. He asked:
How do I know which process is running on a given CPU? Why is it running on that CPU? And what is it doing there?
To start with, he had already used tools such as "mpstat" to narrow the problem down to a particular CPU. In this case CPU 0 was busy almost all the time while the other CPUs were mostly idle.
mpstat also showed that the "ithr" column (interrupts handled as threads, which here were mostly network interrupts) was large on only one CPU. That gave us a starting point to drill down from: CPU 0 was consuming most of the system resources and all the other CPUs were lightly loaded.
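The mpstat check described above can also be scripted. What follows is only a minimal sketch in Python: the sample text and its numbers are made up for illustration (they are not the figures from our T2000), and the function name parse_mpstat is hypothetical. It assumes the usual Solaris mpstat column layout and keeps only the last report block, since the first block mpstat prints is a summary since boot.

```python
# Hypothetical mpstat output; the numbers are illustrative only.
SAMPLE_MPSTAT = """\
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    1   0   12  9500 9300  410   35   20  150    0  5200   12  80   0   8
  1    0   0    3   120   10  380    5   18   20    0   900    3   2   0  95
  2    0   0    2   115    8  360    4   15   18    0   850    2   2   0  96
"""


def parse_mpstat(text):
    """Return one dict per CPU from the last report block in mpstat output."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            # A new header starts a new report block; drop the older one.
            header = fields
            rows = []
            continue
        if header is None or len(fields) != len(header):
            continue
        try:
            rows.append(dict(zip(header, (int(f) for f in fields))))
        except ValueError:
            # Skip anything that is not a plain numeric data row.
            continue
    return rows


rows = parse_mpstat(SAMPLE_MPSTAT)
hottest = max(rows, key=lambda r: r["ithr"])
share = hottest["ithr"] / sum(r["ithr"] for r in rows)
print(f"CPU {hottest['CPU']} handles {share:.0%} of all ithr")
```

If one CPU owns nearly all of the ithr count, as CPU 0 did in our case, that is the CPU to look at next.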

We dug one step (1) deeper to find out who was running on CPU 0, as follows:
# dtrace -n 'sched:::on-cpu /cpu == 0/ { @[execname, pid] = count() }'
>> This revealed the culprit process name and its PID. The one-liner counts how often each process is scheduled onto CPU 0 during the sampling interval (press Ctrl-C to stop tracing and print the counts), so it can be used to identify whichever process is occupying a CPU most of the time. We learned that this particular application was consuming most of the CPU time.

With the process name and PID in hand, we dug a further step (2) down to find out what the process was doing, as follows:
# dtrace -n 'syscall:::entry /pid == 8876/ { @[probefunc] = count() }'
>> This makes clear which system calls are keeping the system busy, and whether that is expected for this application. Here the answer was yes: it is a network-intensive application that is busy all the time with network reads and writes, and the system calls made most often by the process were read/write operations on the network.

Now for the solution. On Solaris 10u3 and above we can apply the following changes to the kernel parameters in /etc/system so that incoming network requests are handled by all the available CPUs:

set ip:ip_squeue_fanout=1
set ip:ip_squeue_bind=0
* The value below has to be based on the number of CPUs or cores available.
set ip:ip_soft_rings_cnt=16

These changes take effect only after a reboot. After rebooting, the same application delivered almost double the performance, and it kept scaling as we enabled more CPUs, because incoming requests were now handled by all the available CPUs. The gain varies from application to application; this one is a well-written multi-threaded application, so it scaled well. We could now see all the CPUs handling incoming network interrupts and all of them about equally busy.

In a live production environment one can use ndd commands to change the dynamic kernel parameters on the running system and see the effect immediately, without rebooting the node. The following ndd commands set some of the above /etc/system values on the live kernel:

ndd -set /dev/ip ip_squeue_fanout 1
ndd -set /dev/ip ip_squeue_bind 0

To read the current values from the live kernel, use the "ndd -get" option.

Description of the parameters that are set::

ip_squeue_fanout: Controls whether incoming connections from one NIC are fanned out across all CPUs. A value of 0 means incoming connections are assigned to the squeue attached to the interrupted CPU. A value of 1 means the connections are fanned out across all CPUs. The latter is required when the NIC is faster than the CPU (say, a 10 Gb NIC) and multiple CPUs need to service the NIC. Set by way of /etc/system by adding the following line:

set ip:ip_squeue_fanout=1

ip_squeue_bind: Controls whether worker threads are bound to specific CPUs or not. When bound (default), they give better locality. The non-default value (don't bind) should be chosen only when processor sets are to be created on the system. Unset by way of /etc/system by adding the following line:

set ip:ip_squeue_bind=0

ip_soft_rings_cnt: Determines the number of squeues to be used to fan out the incoming TCP/IP connections. The incoming traffic is placed on one of the rings. If the ring is overloaded, packets are dropped. For every packet that gets dropped, the kstat dls counter, dls_soft_ring_pkt_drop, is incremented.
Default: 2
Range: 0 - nCPUs, where nCPUs is the maximum number of CPUs in the system
Dynamic? No. The interface should be plumbed again when changing this parameter.
When to Change? Consider setting this parameter to a value greater than 2 on systems that have 10 Gbps NICs and many CPUs.

set ip:ip_soft_rings_cnt=16

Note:: mpstat alone can tell you that network interrupts are the problem. But a developer who is curious whether his/her own application is the one affected will find this analysis helpful, because it shows that the application is running into the limits of the OS's default settings.
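Since ip_soft_rings_cnt has to stay within 0 - nCPUs, one could generate the three /etc/system lines instead of typing them by hand. This is just a sketch under assumptions: the helper name etc_system_lines is hypothetical, and clamping the requested value into the documented range is my own choice, not something the Solaris documentation prescribes.

```python
def etc_system_lines(ncpus, requested_rings):
    """Build the /etc/system lines, clamping the ring count to 0..ncpus."""
    # Assumption: silently clamp rather than reject an out-of-range request.
    rings = max(0, min(requested_rings, ncpus))
    return [
        "set ip:ip_squeue_fanout=1",
        "set ip:ip_squeue_bind=0",
        f"set ip:ip_soft_rings_cnt={rings}",
    ]


# 32 CPUs, 16 rings requested: the request fits the range and is kept.
for line in etc_system_lines(32, 16):
    print(line)

# 8 CPUs, 16 rings requested: the value is clamped down to 8.
print(etc_system_lines(8, 16)[-1])
```

The output is meant to be reviewed and then added to /etc/system by hand; the sketch does not touch the file itself.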

On Linux (Red Hat/SUSE etc.) & Windows (XP/Vista or any recent server release) ::
If we come across a similar problem on Linux, we would use top to find the process consuming most of the system's resources and drill down from there, probably by running "strace" on that process to see which system calls it is making. We would capture all that output in a file and post-process the file to find out which system calls are made most frequently, and so on. However, the overhead strace adds to the application is high, so one would want to avoid using it in a production environment.
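The post-processing step for a captured strace log could look something like the sketch below. The sample log lines are invented for illustration, the function name count_syscalls is hypothetical, and the regular expression only assumes the common strace line shapes (an optional "[pid N]" or bare PID prefix, then the call name and an opening parenthesis). Signal lines, exit lines and "resumed" continuation lines are ignored.

```python
import re
from collections import Counter

# Hypothetical strace output, as it might be captured to a file.
SAMPLE_STRACE = """\
read(5, "abc", 4096) = 3
write(6, "abc", 3) = 3
[pid 8876] read(5, "", 4096) = 0
8876 poll([{fd=5, events=POLLIN}], 1, 1000) = 1
--- SIGCHLD {si_signo=SIGCHLD} ---
<... read resumed> "x", 1) = 1
+++ exited with 0 +++
"""

# Optional "[pid N] " or "N " prefix, then the syscall name before "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def count_syscalls(text):
    """Count how many times each system call starts in an strace log."""
    counts = Counter()
    for line in text.splitlines():
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


counts = count_syscalls(SAMPLE_STRACE)
for name, n in counts.most_common():
    print(f"{n:6d}  {name}")
```

For a real log one would read the text from the capture file instead of the sample string. strace can also print a per-system-call summary on its own with its -c option, which avoids the post-processing step altogether.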

If we come across a similar problem on Windows, we would start with the "Windows Task Manager" window, which lists the applications consuming the most resources and can be sorted by various parameters such as CPU and memory. To drill down, one can probably use the native Windows performance tools, which give high-level information about what is happening in the system, and third-party tools to profile a given application and understand what it is doing.

Well-known profiling tools for multi-threaded applications on Windows and Linux come from Intel: "Intel® Thread Profiler 3.1 for Windows", Intel Thread Profiler for Linux, etc.

I am open to learning about more tools on the Linux and Windows platforms that can help drill into such problems easily without taxing overall application or system performance.
