There’s one area in which the Linux kernel, the GNU operating system and related tools might be lagging behind, especially with the 2.4 kernel series. I’m talking about I/O accounting: how to know what’s going on with the hard disc or other devices used to read and write data.
The thing is that Linux provides you with a few tools with which you can tell what’s going on with the box and its set of discs. vmstat, for instance, provides a lot of information, and so do various files scattered around the /proc filesystem. But that information only describes the system globally, so it’s good for diagnosing whether a high load on a box is due to some process chewing CPU cycles away or to the hard disc being hammered and being painfully slow. But what if you want to know exactly what is going on, which process or processes are responsible for the situation? The answer is that Linux doesn’t provide you with tools for that, as far as I know (if you know of any, please leave a comment). There’s no such thing as a top utility for per-process I/O accounting. The situation is better in Linux 2.6, provided you activate the taskstats accounting module, with which you can query information about the processes. The user-space utilities are somewhat scarce, but at least there’s something you can start playing with.
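As a rough illustration of what per-process accounting looks like on 2.6: kernels from 2.6.20 onwards built with task I/O accounting expose a /proc/PID/io file for each process. The sketch below assumes that file is present; the function name io_bytes is made up for this example, and reading another user’s process usually needs root.

```shell
#!/bin/sh
# io_bytes (hypothetical helper): parse the "key: value" lines of a
# /proc/PID/io file from stdin and print the bytes actually read from
# and written to storage.
io_bytes() {
    awk -F': *' '
        $1 == "read_bytes"  { r = $2 }
        $1 == "write_bytes" { w = $2 }
        END { print "read_bytes=" r, "write_bytes=" w }
    '
}

# Show the counters of the current shell, if the kernel exposes them.
if [ -r "/proc/$$/io" ]; then
    io_bytes < "/proc/$$/io"
fi
```

Running it twice a few seconds apart against the same PID and comparing the numbers gives a crude idea of how much that process is reading and writing.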
However, there are some tricks you can use to try to find out which process is the culprit when things go wrong. As usual, many of these tricks come from work, where I keep learning from my colleagues (who, by the way, are much more intelligent than I am) when things go wrong and some problem arises that needs immediate action.
So, let’s define the typical scenario in which we could apply these tricks. You’ve got a Linux box which has a high load average, say 15 or 20. As you may know, on Linux the load average counts the processes that are either runnable (running or waiting on the run queue) or in uninterruptible sleep, typically waiting for I/O. So a high load average doesn’t necessarily mean that the CPU is loaded: when processes are blocked on I/O, say a read from a slow disc, the CPU may just sit there idle most of the time. The number only makes sense when you know how many CPUs the box has. If you have a loadavg of 2 on a two-CPU box, then you are just fine, ideally.
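A quick way to put the load average in context is to divide it by the number of CPUs. This is only a sketch: the helper name load_per_cpu is made up, and it assumes /proc/loadavg and getconf are available on the box.

```shell
#!/bin/sh
# load_per_cpu LOAD NCPU (hypothetical helper): print LOAD divided by
# NCPU with two decimals.
load_per_cpu() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

if [ -r /proc/loadavg ]; then
    # The first field of /proc/loadavg is the 1-minute load average.
    read -r load1 rest < /proc/loadavg
    # Number of CPUs currently online; fall back to 1 if getconf fails.
    ncpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
    [ "${ncpu:-0}" -gt 0 ] 2>/dev/null || ncpu=1
    echo "1-minute load per CPU: $(load_per_cpu "$load1" "$ncpu")"
fi
```

A value around 1 or below is the "just fine" case described above. A value well above 1 means processes are queuing up, either for the CPU or for the disc.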
The number one tool for identifying what’s going on is vmstat, which tells you a lot about what is happening in the box, especially when you run it periodically as vmstat 1. If you read the man page (and I do recommend that you read it), you can get an idea of all the information that will be scrolling past on the screen. Click here to see a screenshot of the output of vmstat on 4 different boxes. Almost all of its output is useful for diagnosis, except the last column in the case of a Linux 2.4 box (that value is added to the idle column).
With this tool we can find out whether the system is busy doing I/O and how, for example by looking at the bi and bo columns (blocks read in from and written out to block devices). Swapping, when it’s happening, could also imply that the hard disc is being hammered, but that would also mean that there’s not enough memory in the system for all the processes running at that very moment. Well, all of its output can be useful for identifying what’s going on.
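To watch only the I/O-related columns, the vmstat output can be filtered. The sketch below assumes the usual procps vmstat layout, with a header row starting with "r b"; the function name io_columns is made up. It looks columns up by name because their positions differ between versions.

```shell
#!/bin/sh
# io_columns (hypothetical helper): read vmstat output on stdin and keep
# only the blocked-process, swap and block I/O columns.
io_columns() {
    awk '
        $1 == "r" && $2 == "b" {             # header row with column names
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!printed) { print "b", "si", "so", "bi", "bo"; printed = 1 }
            next
        }
        ("bi" in col) && $1 ~ /^[0-9]+$/ {   # data rows start with a number
            print $col["b"], $col["si"], $col["so"], $col["bi"], $col["bo"]
        }
    '
}

# Usage: vmstat 1 5 | io_columns
```

Consistently high bi/bo values together with a non-zero b column point at the disc rather than at the CPU.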
OK, so back to our problem: how do we start? Well, the first thing to do is to try to find out what’s running that could be causing this. Who are the usual suspects? By looking at the ps output we can get an idea of which processes and/or applications could be causing the disc I/O. The problem is that sometimes an application runs tens or hundreds of processes, each of which is serving a remote client (say Apache prefork), and maybe only some of them are causing the havoc. Killing all the candidate processes is not an option in a production environment, although killing just the ones causing the problem may be feasible, because they might be wedged or something.
Finding the suspects
One way to find which processes are causing the reads and writes is to have a look at the processes in uninterruptible sleep state. As the box has a high load average because of I/O, surely there must be processes in such a state, because they are waiting for the disc to return the data so that they can return from their system calls. These processes are likely to be involved in the high load of the system. If you think that processes in uninterruptible sleep cannot be killed, you are right, but we are assuming that they enter this state briefly, again and again, because they are reading from and writing to the disc non-stop. If you have read the vmstat man page, you will have noticed that the b column tells us the number of processes in such a state.
golan@kore:~$ ps aux | grep " D"
root     27351  2.9  0.2  11992  9160 ?        DN   23:06   0:08 /usr/bin/python /usr/bin/rdiff-backup -v 5 --restrict-read-only /disk --server
mail     28652  0.5  0.0   4948  1840 ?        D    23:11   0:00 exim -bd -q5m
golan    28670  0.0  0.0   2684   804 pts/23   S+   23:11   0:00 grep --color D
Here we can see two processes in such a state (marked by D in the STAT column of the ps output). Normally we don’t get to see many of these at the same time, and if we issue the same command again we are probably not going to see the same ones, unless there is a problem, which is why I’m writing this in the first place.
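Note that grep " D" also matches any command line that happens to contain " D", including the grep itself. A slightly stricter variant, as a sketch: ask ps for the state as the first column and filter on that field only. The function name d_state is made up, and the stat output keyword assumes a procps-style ps.

```shell
#!/bin/sh
# d_state (hypothetical helper): keep the header line plus every row whose
# first column (the process state) starts with "D".
d_state() {
    awk 'NR == 1 || $1 ~ /^D/'
}

# List the processes currently in uninterruptible sleep.
if command -v ps >/dev/null 2>&1; then
    ps -eo stat,pid,user,args | d_state
fi
```

Running it a few times in a row and noting which PIDs keep showing up gives a first list of suspects.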
Examining the suspects
Well, we now need to examine the suspects and filter them, because there might be perfectly valid processes that are in uninterruptible sleep state but are not responsible for the high load. One thing we can do is attach strace to a specific process and see what it’s doing. This can easily be achieved this way:
golan@kore:~$ strace -p 12766
Process 12766 attached - interrupt to quit
write(1, "y\n", 2) = 2
Here we see the output of a process executing yes. So, what does this output tell us? It shows us all the system calls that the process is doing, so we can effectively see if it is reading or writing.
But all this can be very time-consuming if we have quite a few processes to examine. What we could do instead is strace all of them, save the output of each one to a different file and check the files later. If what we are examining is a process called command, we could do it this way:
# mkdir /tmp/strace
# cd /tmp/strace
# for i in $(ps axuwf | grep '[c]ommand' | awk '{ print $2 }'); do (strace -p "$i" > "command-$i.strace" 2>&1) & done
What this would do is create a series of files called command-PID.strace, one for each of the processes that match the regular expression in the grep command. If we leave this running for a while, we can then examine the contents of all the files. Even better, if we list the files ordered by size we get a pointer to the processes that are doing the most system calls. All we need to do then is verify that those system calls are actually read and write system calls. And don’t forget to kill all the strace processes that we sent to the background by issuing a killall strace.
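The size check and the read/write verification can be combined in one step. The sketch below counts the lines in each trace file that look like read or write calls and lists the busiest files first. The function name rank_io_calls and the list of syscall names are assumptions made for this example; adjust the list to whatever shows up in your traces.

```shell
#!/bin/sh
# rank_io_calls DIR (hypothetical helper): for each *.strace file in DIR,
# count the lines that are read/write style system calls and print
# "count file", busiest first.
rank_io_calls() {
    for f in "$1"/*.strace; do
        [ -e "$f" ] || continue      # no matching files at all
        n=$(grep -c -E '^(read|write|pread64|pwrite64|readv|writev)\(' "$f")
        echo "$n $f"
    done | sort -rn
}

# Usage: rank_io_calls /tmp/strace
```

The files at the top of the list belong to the processes worth looking at first.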
So now we have a list of processes that are causing lots of reads and writes on the hard disc. What to do next depends on the situation and on what you want to achieve. You might want to kill the processes, or find out who (the person) started them, in case they were started by someone, or which network connection, IP address, etc. they are serving. There are a bunch of utilities you can use for this, including strace, netstat and lsof. It’s up to you what to do next.
And…
Well, this is me learning from my colleagues and from problems that arise when you don’t expect them. My understanding of the Linux kernel is not that good, but now many of the things that I studied in the Operating Systems class are starting to make a little bit more sense. So please, if you have experience with this or know of other ways to get this kind of information, share it with me (as a comment or otherwise). I’m still learning.

[...] Update, 11 February: I have published an article on my other blog on this subject: Tricks for detecting processes blocked by I/O. [...]
Pingback by Terminus » Blog Archive » Pregunta de examen — February 11, 2008 @ 11:48
Finally some real information on finding the IO culprits. Thanks much!
Comment by roger — February 9, 2009 @ 19:59
thanks man,
Really good post. It will help to find stuck processes and also to increase performance.
Comment by Ravi — September 10, 2009 @ 11:25
Hi,
Very good post, thank you.
One of my production Oracle databases on SUSE Linux has the same issue. Some of the Oracle processes often go into D state, so the database shutdown hangs every time (even root is not able to kill the processes). I tried to strace the PID and that hangs too. The file system is Veritas.
If you have any idea how to trace these processes, or how to tell whether the filesystem is to blame, any suggestions would be appreciated.
Thanks in advance
Pratheep
n.pratheep@gmail.com
Comment by Pratheep — November 6, 2009 @ 08:42
I’m afraid this is difficult to diagnose. Are you using 2.6? 2.4? With 2.6 you have a wider range of I/O schedulers (under Linux) that you can try. Then you have ionice, the Linux utility that lets you change the I/O priority (it doesn’t work with all the I/O schedulers).
Having said that, when a process is in D state, there’s not much you can do about it. Not even root can kill it (uninterruptible sleep it is). This usually boils down to the process waiting for the OS to come back from an I/O operation, normally disk I/O.
Shutdown is probably taking a long time because it’s writing all the buffers and cache back to disk and finishing everything. This is usually normal (I’ve seen MySQL instances taking a good few minutes to shut down) and depends on how the filesystem is configured. There is a lot you can do here, depending on what the filesystem is, how it’s mounted, whether it’s got delayed writes, barriers, journaling, etc. I’m afraid I have no idea about the Veritas FS, so I can’t give you much advice on this front.
Comment by jroncero — November 6, 2009 @ 09:09
thanks for the post. unfortunately, when it happened to me on 2.6.31, even strace/gdb won’t display anything.
Comment by Gong Cheng — November 17, 2009 @ 20:09
Well, if you have 2.6.31 then you can easily find what’s using I/O. Just use iotop to find out which process is using it.
If strace (why gdb?) is not showing anything, then it’s possible that the processes are blocked on I/O and not doing anything at all apart from waiting for the disks or something. Spotting them is easy: just look at the list of processes which are in D state. The difficult part is to actually find out which of your many processes in D state is causing the problems, on Linux 2.4.
Comment by jroncero — November 18, 2009 @ 00:08
Thanks for the tip! My system is not really a desktop but I will look to get iotop running on it.
As for gdb, normally I could attach gdb to a running process to see what it is doing, but I guess I was asking for too much when it is in D state. Will try that when it happens again.
Again, thanks! I googled and found a lot of links related to the D state topic, but not many provided me with useful debugging tips.
Comment by Gong Cheng — November 18, 2009 @ 18:01