Not Really a Blog

February 6, 2010

The effect of temporary tables on MySQL’s replication

Filed under: System Administration — Tags: , — jesus @ 23:05

The other day I needed to set up a new set of MySQL instances at work that would replicate from an existing node. I was setting these up because the master node is running out of disk space and is very slow.

Usually, when you need to restore a database you do it in three parts:

  1. Install the binaries
  2. Load the initial data from the most recent MySQL backup.
  3. Set it replicating from one of the nodes by specifying the binary log file and position from which you want it to replicate (which usually corresponds to the day you took the backup).

Now, because we usually compress the binary logs and because the master didn’t have enough disk space to have all these binary logs uncompressed (such that the new slave could replicate by connecting to the master and talking the MySQL protocol), I needed to transfer them to the new slave and pipe them into MySQL from there. Seems simple, huh?

Everything went fine with steps 1 and 2. But then, while piping the contents of the MySQL binary logs into the new databases, it all went wrong. What I used to pipe them was:

for file in master-bin* ; do echo "processing $file" ;    ../mysql/bin/mysqlbinlog "$file" | ../mysql/bin/mysql -u root -ppassword  ; done

Which is how you usually do these things, but this is what I got:

db@slave:~/binlogs$ for file in master-bin* ; do echo "processing $file" ;    ../mysql/bin/mysqlbinlog "$file" | ../mysql/bin/mysql -u root -ppassword  ; done
processing master-bin.1853
processing master-bin.1854
processing master-bin.1855
processing master-bin.1856
processing master-bin.1857
processing master-bin.1858
processing master-bin.1859
processing master-bin.1860
ERROR 1146 at line 10024: Table 'av.a2' doesn't exist
processing master-bin.1861
ERROR 1216 at line 1378: Cannot add or update a child row: a foreign key constraint fails
processing master-bin.1862
ERROR 1216 at line 22825: Cannot add or update a child row: a foreign key constraint fails
processing master-bin.1863

So, table av.a2 does not exist. WTF?

Investigating a bit about this table, it seems there’s a script which executes the following stuff on it everyday:


if test $ZZTEST -lt 300000; then
 echo "ERROR: Less than 300k"
 exit 1
cat > sql << EOF
create temporary table a1 (mzi char(16) default null, key mzi(mzi));
create temporary table a2 (mzi char(16) default null, key mzi(mzi));
create temporary table a3 (mzi char(16) default null, key mzi(mzi));
cat v | grep ^447.........$ | awk '


Now, about CREATE TEMPORARY TABLE: if you read the MySQL docs you’ll see that temporary tables are only visible to the current connection and are dropped automatically when that connection closes. There are a few known problems with replication and temporary tables, but this could not possibly be one of them, as these were the binary logs from the master itself. So, what’s going on here?

The problem here comes from the binary logs being rotated and the way I was inserting them. It just happened that the three SQL statements:

create temporary table a1 (mzi char(16) default null, key mzi(mzi));
create temporary table a2 (mzi char(16) default null, key mzi(mzi));
create temporary table a3 (mzi char(16) default null, key mzi(mzi));

were written at the end of binary log file master-bin.1859, and then a SQL statement in file master-bin.1860 (inserting data into av.a2) failed because it expected those temporary tables to exist (and they didn’t). This happened because we are using a for loop in bash to replay the binary logs, so there’s one mysql connection for each binary log file. When file master-bin.1859 finished, its connection closed and MySQL automatically dropped the three temporary tables, so on the next connection (file master-bin.1860) these tables were missing.

There are a few ways in which you can work around this.

One approach is to get one big sql file and pipe that into MySQL, something like:

for file in master-bin.1* ; do echo "Using $file" ; ../mysql/bin/mysqlbinlog "$file" >> all.sql; date  ;  done
cat all.sql | ../mysql/bin/mysql -u root -ppassword

Alternatively, doing something like:

(for file in master-bin.1*; do echo "Using $file" 1>&2; ../mysql/bin/mysqlbinlog "$file"; date 1>&2; done) | ../mysql/bin/mysql -u root -ppassword

The second form avoids creating one big fat file. Both should work, because in each case there is only going to be one connection.

But these approaches have an obvious drawback: if something goes wrong, it is extremely difficult to see what failed. It’ll fail on one of the files and then fail on the rest of them.
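
One way to soften that drawback is to put a marker between the files in the single stream, so the client output at least shows which file was being replayed when the error appeared. This is only a sketch: binlog_stream and BINLOG_DECODER are names I made up for the example, and it assumes mysqlbinlog resets the statement delimiter to ; at the end of each file, so a plain SELECT between two files is valid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one continuous SQL stream for all the given binlogs,
# with a marker statement before each file.
# BINLOG_DECODER is an assumption: point it at your own mysqlbinlog binary.
BINLOG_DECODER=${BINLOG_DECODER:-mysqlbinlog}

binlog_stream() {
    local file
    for file in "$@"; do
        # An ordinary SELECT, so the marker shows up in the client output.
        printf "SELECT 'replaying %s' AS marker;\n" "$file"
        # Stop at the first file that cannot be decoded.
        "$BINLOG_DECODER" "$file" || return 1
    done
}

# Usage (one connection for the whole stream, so temporary tables survive):
#   binlog_stream master-bin.1* | ../mysql/bin/mysql -u root -ppassword
```

The last marker printed before the error is the file to look at.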

The better approach, as discussed in High Performance MySQL, is to use a log server, that is, a MySQL server that does not store any data and is only used to replay binary logs. That way you won’t have this problem, and it also lets you interact with the server and its diagnostic messages in case something goes awry.

Use temporary tables?

My advice here would be not to use CREATE TEMPORARY TABLE, because it can break replication in mysterious ways, but that could be too harsh. There are a few workarounds that can be done at the application level, and I think they are worth considering.

Any experience with these problems?

November 17, 2009

XFS and barriers

Filed under: Linux, System Administration — Tags: , , — jesus @ 09:33

Lately at work, we’ve been trying to figure out what the deal with barriers is, either for XFS or EXT3, the two filesystems we like most. If you don’t know what barriers are, go and read a bit on the XFS FAQ. Short story: XFS comes with barriers enabled by default, EXT3 does not. Barriers make your system a lot safer against data corruption, but they degrade performance a lot. Given that EXT3 does not do checksumming of the journal, you could also have lots of corruption if barriers are not enabled. Go and read about it on wikipedia.

If you google a bit you’ll see that there are lots of people talking about it, but I definitely haven’t found an answer to what is best and under which scenarios. So, starting with a little test on XFS, here are the results, totally arbitrary, on a personal system of mine. The system is an Intel Core 2 Duo CPU e4500 at 2.2 GHz, with 2GB of RAM and a 500GB HD in a single XFS partition. I’m testing it with bonnie++. First, mounting by default (that is, barriers enabled):

# time bonnie -d /tmp/bonnie/ -u root -f -n 1024
Using uid:0, gid:0.
Writing intelligently...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
kore             4G           54162   9 25726   7           60158   4 234.2   4
Latency                        5971ms    1431ms               233ms     251ms
Version  1.96       ------Sequential Create------ --------Random Create--------
kore                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
 files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 1024   260  11 489116  99   299   1   262  11 92672  56   165   0
Latency               411ms    3435us     595ms    2063ms     688ms   23969ms

real    303m50.878s
user    0m6.036s
sys    17m52.591s

Second time, after a few hours, doing a

mount -oremount,rw,nobarrier /

we get these results (barriers not enabled):

# date ;time bonnie -d /tmp/bonnie/ -u root -f -n 1024 ; date
Tue Nov 17 00:43:53 GMT 2009
Using uid:0, gid:0.
Writing intelligently...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
kore             4G           66059  12 30185  10           71108   5 238.6   3
Latency                        4480ms    3911ms               171ms     250ms
Version  1.96       ------Sequential Create------ --------Random Create--------
kore                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
1024  1830  85 490304  99  4234  17  3420  24 124090  78   402   1
Latency               434ms     165us     432ms    1790ms     311ms   26826ms

real    67m21.588s
user    0m5.668s
sys     11m30.123s
Tue Nov 17 01:51:15 GMT 2009
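
To put a number on the gap, here is a small sketch that turns the two real times reported above into seconds and divides one by the other. The function names to_seconds and speedup are made up for the example.

```shell
#!/usr/bin/env bash
# Convert a `time`-style value such as 303m50.878s into seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, part, "m")          # part[1] = minutes, part[2] = seconds plus "s"
        sub(/s$/, "", part[2])
        printf "%.3f\n", part[1] * 60 + part[2]
    }'
}

# Ratio between two wall-clock times, to one decimal place.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

speedup 303m50.878s 67m21.588s   # barriers on vs. barriers off
```

On these two runs the whole benchmark took roughly four and a half times longer with barriers enabled.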

So, I think you can actually tell how differently they behave. I haven’t created graphs from these results, but let me show you some monitoring graphs from these two experiments. The first test was run yesterday afternoon; the second one was run just after midnight. You can see the difference in the CPU and load average graphs.

I’ll try to follow up shortly with more findings, if I find any ;-). Please, feel free to add any suggestion or comments about your own experiences with these problems.

CPU Usage

Load Average

November 15, 2009

Monitoring data with Collectd

Filed under: System Administration — Tags: — jesus @ 18:50

I’ve been using collectd for quite a while just to monitor the performance of my workstation. I’ve tried other solutions (cacti, munin, etc.) but I didn’t like how they worked, the graphs they created, or the amount of work required to get them going. Overall I find collectd a good solution for my monitoring needs (which are basically graphing and getting some alerts). I like it because it generates nice graphs (among other things):

But what I like the most about it is the architecture it’s got for sending data and its low memory footprint.

Today I’ve been playing with the network plugin and I quite like it. The network plugin allows you to have clients send their monitoring results to a central server, just like the picture below. It sends the data using UDP, which is then captured by the server, which stores it in rrd files, so all the data ends up centrally stored. Because UDP is connectionless, the sender is never going to block while sending this data. The flip side is that delivery is not guaranteed, so obviously we want to make sure our connection is reliable.

Collectd Architecture

This means that you can have collectd running on a number of computers, all sending their data to a server which stores it and displays it. The magic of it is that the memory footprint is very small (it’s written in C) and that it can send the data to a single server or to more than one, even to a multicast group, which is very nice.
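
As a rough idea of what the setup looks like, here is a sketch that prints the two configuration fragments: Server on the clients, Listen plus the rrdtool plugin on the central box. The function names and the host monitor.example.org are made up for the example; 25826 is collectd’s default port. Paste the output into your own collectd.conf and adjust it.

```shell
#!/usr/bin/env bash
# Print a client-side collectd.conf fragment that sends values to one server.
# Arguments: server host, optional UDP port (25826 is the collectd default).
collectd_client_conf() {
    local server=$1 port=${2:-25826}
    cat <<EOF
LoadPlugin network
<Plugin network>
  Server "$server" "$port"
</Plugin>
EOF
}

# Print the matching server-side fragment: listen for values and store them as RRD files.
# Arguments: optional listen address, optional UDP port.
collectd_server_conf() {
    local address=${1:-0.0.0.0} port=${2:-25826}
    cat <<EOF
LoadPlugin network
LoadPlugin rrdtool
<Plugin network>
  Listen "$address" "$port"
</Plugin>
EOF
}

# monitor.example.org is a placeholder host name.
collectd_client_conf monitor.example.org
```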

Things that it, allegedly, doesn’t do very well are monitoring and generating alerts (although the latest version claims to support simple thresholds). Also, the web interface, collection3, written in Perl, is a major liability.

So, I’m planning on spending a few more hours playing with this and possibly coming up with an article on how to set it up integrated with my systems and trac, such that I:

  1. Have a plugin to display graphs on a wiki page, on trac possibly.
  2. Have it sending the data via openvpn (although the latest version supports encryption and signing) for clients behind a firewall.
  3. Make the most use of the plugins.

Any suggestions for a better web interface for collectd?

October 15, 2009

Little surprises in HTTP Headers

Filed under: Internet, System Administration — Tags: — jesus @ 22:59

Last week I moved a blog I’ve got in Spanish to a hosted blogging service. I really like it, and I believe it’s really worth it in terms of freeing my time from administering a wordpress installation, keeping up with the security fixes, etc. Today, having a little bit of time, I was tweaking my old website to redirect to the new site using an HTTP permanent redirect header. This is what I found in the HTTP headers:

[golan@mars ~] % HEAD
200 OK
Cache-Control: max-age=260, must-revalidate
Connection: close
Date: Thu, 15 Oct 2009 21:35:09 GMT
Server: nginx
Vary: Cookie
Content-Type: text/html; charset=UTF-8
Last-Modified: Thu, 15 Oct 2009 21:34:29 +0000
Client-Date: Thu, 15 Oct 2009 21:35:09 GMT
Client-Response-Num: 1
Link: ; rel=shortlink
X-Hacker: If you're reading this, you should visit and apply to join the fun, mention this header.
X-Nananana: Batcache

So, apart from various bits of information (nginx), what I really really liked was the X-Hacker header :-). Fancy a job?

February 10, 2008

Tricks to diagnose processes blocked on strong I/O in linux

Filed under: Linux, System Administration — Tags: , , — jesus @ 23:59

There’s one aspect in which the Linux kernel, the GNU operating system and the related tools might be lagging behind, especially with the kernel 2.4 series. I’m talking about I/O accounting, or how to know what’s going on with the hard disk or other devices which are used to write and read data.

The thing is that Linux provides you with a few tools with which you can tell what’s going on with the box and its set of disks. vmstat, say, provides you with a lot of information, and there are various other files scattered around the /proc filesystem. But that information only describes the system globally, so it’s good for diagnosing whether a high load on a box is due to some process chewing CPU cycles away or to the hard disk being hammered and being painfully slow. But what if you want to know exactly what is going on, which process or processes are responsible for the situation? The answer is that Linux doesn’t provide you with tools for that, as far as I know (if you know of any, please leave a comment). There’s no such thing as a top utility for process I/O accounting. The situation is better in Linux 2.6, provided you activate the taskstats accounting module, with which you can query information about the processes. The user-space utilities are somewhat scarce, but at least there’s something with which you can start playing.

However, there are some tricks you can use to try to find out which process is the culprit when things go wrong. As usual, many of these tricks come from work, where I keep learning from my colleagues (who, by the way, are much more intelligent than I am ;-)) when some problem arises that needs immediate action.

So, let’s define the typical scenario in which we could apply these tricks. You’ve got a Linux box which has a high load average, say 15, 20, etc. As you may know, the load average measures the number of processes waiting in the run queue; on Linux it also counts processes in uninterruptible sleep, typically blocked on I/O. So a high load average doesn’t necessarily mean that the CPU is loaded: when processes are blocked on, say, a slow read from the disk, the CPU just sits there idle most of the time. The number only makes sense when you know how many CPUs the box has. If you have a loadavg of 2 on a two-CPU box, then you are just fine, ideally.
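
To make the "loadavg of 2 on a two-CPU box" rule of thumb concrete, here is a tiny sketch that divides the 1-minute load average by the number of CPUs. The name load_per_cpu is made up; a result around 1.00 or below is the "just fine" case.

```shell
#!/usr/bin/env bash
# Print a load average divided by a CPU count, to two decimal places.
# Arguments: load average, CPU count.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On a Linux box: the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    load_per_cpu "$load1" "$(getconf _NPROCESSORS_ONLN)"
fi
```

Remember that a high value here still doesn’t say whether the CPU or the disk is the bottleneck; that’s what the rest of the tricks are for.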

The number one tool for identifying what’s going on is vmstat, which tells you a lot about what is going on in the box, especially when you execute it periodically as vmstat 1. If you read the man page (and I do recommend you read it), you can get an idea of all the information that will be scrolling through the screen :-). Click here to see a screen shot of the output of vmstat on 4 different boxes. Almost all of its output is useful for diagnosis, except the last column in the case of a Linux 2.4 box (that value is added to the idle column).

With this tool we can find out if the system is busy doing I/O and how, for example by looking at the bo and bi columns (blocks written out and blocks read in). Swapping, when it’s happening, could also imply that the hard disk is being hammered, but that would also mean that there’s not enough memory in the system for all the processes running at that very moment. Well, all of its output can be useful for identifying what’s going on.
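
If the full vmstat output is too much to watch, a small filter can keep just the interesting columns. This is only a sketch (vmstat_columns is a made-up name): it looks the columns up by name in the header line, since the layout changes between versions, and silently skips names that are not there.

```shell
#!/usr/bin/env bash
# Read vmstat output on stdin and print only the named columns.
# Column positions are taken from the header line (the one containing "bi").
vmstat_columns() {
    awk -v want="$*" '
        BEGIN { n = split(want, names, " ") }
        /^procs/ { next }                     # skip the banner line above the header
        /(^| )bi( |$)/ {                      # header line: remember where each column is
            for (i = 1; i <= NF; i++) pos[$i] = i
            header = 1
        }
        header {
            line = ""
            for (j = 1; j <= n; j++) {
                if (!(names[j] in pos)) continue
                sep = (line == "") ? "" : " "
                line = line sep $(pos[names[j]])
            }
            print line
        }'
}

# Usage: vmstat 1 | vmstat_columns b bi bo wa
```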

Ok, so back to our problem, how do we start? Well, the first thing to do is to try to find out what is running that could be causing this. Who are the usual suspects? By looking at ps output we could get an idea of which processes and/or applications could be causing the disk I/O. The problem with this is that sometimes an application runs tens or hundreds of processes, each of which is serving a remote client (say Apache prefork), and maybe only some of them are causing the havoc, so killing all possible processes is not an option in a production environment (killing just the processes causing the problem may be feasible, because they might be wedged or something).

Finding the suspects

One way to find which processes are causing the writes and reads is to have a look at the processes in uninterruptible sleep state. As the box has a high load average because of I/O, surely there must be processes in such a state, because they are waiting for the disk to hand back the data so they can return from their system calls. And these processes are likely to be involved in the high load of the system. If you think that processes in uninterruptible sleep cannot be killed you are right, but we are assuming that they are in this state briefly, again and again, because they are reading and writing to the disk non-stop. If you have read the vmstat man page, you must have noticed that the column b tells us the number of processes in such a state.

golan@kore:~$ ps aux | grep " D"
root     27351  2.9  0.2 11992 9160 ?        DN   23:06   0:08 /usr/bin/python /usr/bin/rdiff-backup -v 5 --restrict-read-only /disk --server
mail     28652  0.5  0.0  4948 1840 ?        D    23:11   0:00 exim -bd -q5m
golan    28670  0.0  0.0  2684  804 pts/23   S+   23:11   0:00 grep --color  D

Here we can see two processes in such a state (denoted by D in the STAT column of the ps output). Normally we don’t get to see many of these at the same time, and if we issue the same command again we are probably not going to see the same ones unless there is a problem, which is why I’m writing this in the first place :-).
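
The grep above also matches itself and anything else with " D" in its command line. A slightly stricter sketch is to match on the STAT field only and to sample a few times, so that the processes which keep showing up float to the top. The function names are made up for the example.

```shell
#!/usr/bin/env bash
# Read "PID STAT COMMAND..." lines on stdin and keep those whose state starts
# with D (uninterruptible sleep), looking at the STAT field only.
d_state_only() {
    awk '$2 ~ /^D/'
}

# Sample the process table several times and count how often each process was
# seen in D state. Arguments: number of samples, delay between them in seconds.
sample_d_state() {
    local samples=${1:-5} delay=${2:-1} i
    for ((i = 0; i < samples; i++)); do
        ps -e -o pid= -o stat= -o args= | d_state_only
        sleep "$delay"
    done | awk '{ seen[$1 " " $3]++ } END { for (p in seen) print seen[p], p }' | sort -rn
}

# Usage: sample_d_state 10 1
```

Each output line is a count, a PID and the first word of the command; the higher the count, the more suspicious the process.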

Examining the suspects

Well, we now need to examine the suspects and filter them, because there might be perfectly valid processes that are in uninterruptible sleep state but are not responsible for the high load, so we need to find out. One thing that we could do is attach strace to a specific process and see how it’s doing. This can be easily achieved this way:

golan@kore:~$ strace -p 12766
Process 12766 attached - interrupt to quit
write(1, "y\n", 2)                      = 2

Here we see the output of a process executing yes. So, what does this output tell us? It shows us all the system calls that the process is doing, so we can effectively see if it is reading or writing.

But all this can be very time consuming if we have quite a few processes to examine. What we could do is strace all of them and save their output to different files and then check them later:

If what we are examining is a process called command, we could do it this way:

# mkdir /tmp/strace
# cd /tmp/strace
# for i in `ps axuwf | grep command | awk '{ print $2 }'`; do (strace -p $i > command-$i.strace 2>&1)&  done

What this would do is create a series of files called command-PID.strace, one for each of the processes that match the regular expression in the grep command. If we leave this running for a while, we can then examine the contents of all the files. Even better, if we list the files ordered by size we get a pointer to the processes that are doing the most system calls. All we would need to do is verify that those system calls are actually read and write system calls. And also, don’t forget to kill all the strace processes that we sent to the background by issuing a killall strace :-)
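
Rather than eyeballing file sizes, the check for read and write calls can be folded into the ranking. This is a sketch under the assumption that each line of the files starts with the system call name, as in the strace output shown earlier; rank_strace_files is a made-up name.

```shell
#!/usr/bin/env bash
# For every *.strace file in a directory, count the lines that are read/write
# style system calls and print "count file", busiest first.
rank_strace_files() {
    local dir=${1:-.} file
    for file in "$dir"/*.strace; do
        [ -e "$file" ] || continue            # no matching files at all
        printf '%s %s\n' \
            "$(grep -cE '^(read|write|pread64|pwrite64|readv|writev)\(' "$file")" \
            "$file"
    done | sort -rn
}

# Usage: rank_strace_files /tmp/strace
```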

So now we have a list of processes that are causing lots of reads and writes on the hard disk. What to do next depends on the situation and what you want to do. You might want to kill the processes, or find out who (the person) started them, in case they were started by someone. Or which network connection, IP address, etc. is behind them. There are a bunch of utilities that you can use, including strace, netstat, lsof, etc. It’s up to you what to do next.


Well, this is me learning from my colleagues and from problems that arise when you don’t expect them. My understanding of the Linux kernel is not that good, but now many of the things that I studied in the Operating Systems class start to make a little bit more sense. So please, if you have experience with this, or know of other ways to get this kind of information, share it with me (as a comment or otherwise). I’m still learning :)

January 21, 2008

Installing From Source, The Easy Way

Filed under: Linux, System Administration — Tags: — jesus @ 00:40

Installing software in any unix-like operating system these days has become very easy. Package managers such as dpkg, the one used by Debian or Ubuntu, take most of the hassle by dealing with all the dependencies and intricacies that modern software has nowadays. It’s just a matter of getting the package that some hard-working and/or generous developer has made and install it in our system. It’s straightforward compared with how things were a mere few years ago.

We’ve always had the possibility of installing from the source, provided we have resolved all the needed dependencies. Installing from the source can be handy and useful at times. We might want to change some options at compile time, or we might want to have two versions of the same package installed in different locations, for example.

The problem comes when we want to upgrade the software and we have different versions of it installed. We could end up in a very cluttered scenario, say, with files from different versions installed across the file system. Even worse, we might not have an easy way to track down which files belong to which version, let alone uninstall the software.

The Easy Way ® ;-)

As always there are simple solutions for complex problems. There is a nice piece of software which helps us to keep track of software packages installed from source in a clean way. It doesn’t work for all cases, but it does a pretty good job for most of them. I’m talking about epkg, The Encap Package Manager.

I’ll try to describe how it works in a not very technical or detailed way, just to get you going with it and then I’ll install it on my system so you’ll be able to see how handy it is.

Basically, all you have to do is install each software package under /usr/local/encap, in its own directory named in a package-version.revision fashion. Then we use epkg to create symlinks in the appropriate places, usually under /usr/local.

So, say we’ve got:


epkg would create symlinks such as:

/usr/local/bin/mysoft  ->  /usr/local/encap/mysoft-1.1/bin/mysoft
/usr/local/lib/  -> /usr/local/encap/mysoft-1.1/lib/

and that’s it, pretty much. With more complex packages it can get more difficult, but you get the idea.
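
To show the idea, and only the idea, here is a toy version of such a symlink farm in plain shell. It is not how epkg itself is implemented (epkg does a lot more checking); link_package and unlink_package are names made up for the illustration, and the package directory must be given as an absolute path so the links resolve.

```shell
#!/usr/bin/env bash
# Illustration only: link every file of one package directory into a target tree.
# Arguments: package directory (e.g. /usr/local/encap/mysoft-1.1), target (e.g. /usr/local).
link_package() {
    local pkg=$1 target=$2 file rel
    ( cd "$pkg" && find . -type f ) | while IFS= read -r file; do
        rel=${file#./}
        mkdir -p "$target/$(dirname "$rel")"
        ln -sfn "$pkg/$rel" "$target/$rel"
    done
}

# Remove the links again: delete any symlink in the target that points into the package.
unlink_package() {
    local pkg=$1 target=$2 link
    find "$target" -type l | while IFS= read -r link; do
        case "$(readlink "$link")" in
            "$pkg"/*) rm -f "$link" ;;
        esac
    done
}

# Try it in a scratch directory rather than on the real /usr/local.
```

Because the real files never leave the package directory, removing or switching a version is just a matter of removing and recreating links.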

Let’s just see an example

First of all, we need to install epkg on our system. I will be using an Ubuntu 7.10 system, which, to date, doesn’t have epkg on it. So I will install it from source in the usual way, to /usr/local

root@kore:/usr/local/src# wget
           => `epkg-2.3.9.tar.gz'
23:47:32 (85.66 KB/s) - `epkg-2.3.9.tar.gz' saved [237232]

root@kore:/usr/local/src# tar xfz epkg-2.3.9.tar.gz
root@kore:/usr/local/src# cd epkg-2.3.9/
root@kore:/usr/local/src/epkg-2.3.9# ./configure --prefix=/usr/local
checking for epkg... no
checking for mkencap... no
checking for Encap source directory... /usr/local/encap
checking for Encap target directory... /usr/local
checking for Encap package directory... /usr/local/encap/epkg-2.3.9
checking for gcc... gcc
config.status: creating epkg/Makefile
config.status: creating mkencap/Makefile
config.status: creating mkencap/mkencap_environment
config.status: creating doc/Makefile
config.status: creating config.h

As we see, we’ll install the package with its default options, pointing to /usr/local/encap as the encap directory. Please see the help for more options.

We install it:

root@kore:/usr/local/src/epkg-2.3.9# make && make install
epkg: installing package epkg-2.3.9...
  > reading Encap source directory...
  > installing package epkg-2.3.9
    !  man: not an Encap link
    > executing postinstall script
installing: /usr/local/etc/mkencap_environment
    > installation partially successful

If we have a look at /usr/local/bin and /usr/local/encap it has installed itself as an encapped package :), and now we are ready to use it with a real example.

Installing GLE

Say we wanted to install The Graphics Layout Engine, or GLE on our computer and we don’t have a binary package at hand, or we want to control it, or whatever :), let’s just do it with epkg.

  1. Get the source
    root@kore:/usr/local/src# wget
    root@kore:/usr/local/src# unzip
    root@kore:/usr/local/src# cd gle4/
  2. Configure: We will be configuring the software to make it believe it is going to be installed on /usr/local but we will actually install it on /usr/local/encap/ instead, so epkg can deal with it. This is an important step, so let’s just do it by configuring it with those options and with any other that we might want to use:
    root@kore:/usr/local/src/gle4# aptitude install libpng12-dev libpng12-0 libtiff4-dev libtiff4 libjpeg62-dev libjpeg62
    root@kore:/usr/local/src/gle4# ./configure --with-qt=no --prefix=/usr/local
    root@kore:/usr/local/src/gle4# make

    As you can see, I installed some dev packages (using debian’s aptitude) because they are dependencies for GLE. After that, I configure the package without any graphical environment (based on Qt) and pointing to /usr/local. Then we compile it.

  3. Installing. Now, we will be installing it on /usr/local/encap. Bear with me now and I’ll explain what I did after doing it :)
    root@kore:/usr/local/src/gle4# make DESTDIR=/usr/local/encap/GLE-4.1.1 install
    root@kore:/usr/local/encap/GLE-4.1.1# mv usr/local/* .
    root@kore:/usr/local/encap/GLE-4.1.1# rm -rf usr
    root@kore:/usr/local/encap/GLE-4.1.1# ls
    bin  lib  share

    Ok, what I’ve done is execute make install with the DESTDIR variable (which is supported by GLE’s Makefile) set to /usr/local/encap/GLE-4.1.1. But there, everything ends up inside its own “usr/local” directory, so to make it look as if it were installed in the /usr/local hierarchy, we move it up to the right place so now we have:


    and so on.

  4. Install it with epkg. Now the final step is to call epkg to actually create the proper symlinks and that’s it:
    root@kore:/usr/local/encap/GLE-4.1.1# epkg GLE
    epkg: installing package GLE...
      > reading Encap source directory...
      > installing package GLE-4.1.1
        > installation successful
    root@kore:/usr/local/encap/GLE-4.1.1# gle
    GLE version 4.1.1
    Usage: gle [options]
    More information: gle -help

And that’s it, really. Now, two things,

  • If we install a newer version, say 4.2.0 whenever that’s ready, we just install it on /usr/local/encap/GLE-4.2.0 as we’ve seen before, and simply calling again
    # epkg GLE

    would create the right symlinks (that is, “deinstall” the previous version and install the new one).

  • If we want to uninstall it, that is, remove the symlinks, we simply issue this command:
    # epkg -r GLE

    and that’s all.


  1. epkg lets us install software from source while keeping control over it, i.e. installing it in a clean way and being able to uninstall and upgrade it without cluttering the file system.
  2. All you have to do is install the software on /usr/local/encap/package-version as if it were /usr/local. The DESTDIR variable in Makefiles helps us do it in an easy way. If the software is too simple to support it, you’ll have to do it manually.
  3. Remember /usr/local/encap/package-version/usr/local/bin must end up as /usr/local/encap/package-version/bin.
  4. Execute epkg package to install it and epkg -r package to uninstall it.
  5. If you have problems with the libraries, try executing ldconfig.
  6. Be careful. I usually make mistakes, overwrite things and delete files, so take care with what you do and do it under your own responsibility ;-).
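
Since the mv and rm -rf step is where I tend to make those mistakes, here is a sketch that wraps it in a function and uses rmdir, which refuses to delete anything that is not empty. flatten_destdir is a made-up name, and it assumes the package was configured with --prefix=/usr/local and installed with DESTDIR pointing at the package directory.

```shell
#!/usr/bin/env bash
# After `make DESTDIR=<pkgdir> install` with --prefix=/usr/local, the files sit in
# <pkgdir>/usr/local/...; move them up so that <pkgdir>/bin, <pkgdir>/lib, ... exist.
# Arguments: package directory, optional prefix relative to it (default usr/local).
flatten_destdir() {
    local pkgdir=$1 prefix=${2:-usr/local}
    [ -d "$pkgdir/$prefix" ] || { echo "nothing under $pkgdir/$prefix" >&2; return 1; }
    mv "$pkgdir/$prefix"/* "$pkgdir"/ || return 1
    # Remove the now-empty prefix directories (usr/local, then usr).
    ( cd "$pkgdir" && rmdir -p "$prefix" )
}

# Usage: flatten_destdir /usr/local/encap/GLE-4.1.1
```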

December 19, 2007

Modifying a live linux kernel

Filed under: Linux, Programming, System Administration — Tags: , , — jesus @ 17:22

Before reading this, I just need to say something:

I’ve no idea of linux, I’ve no idea of programming, I’ve no idea of computers… Everything you read here might have been invented, so, please, do not reproduce what I write here. If you do, bear in mind that you do it under your own responsibility. In fact, what is a computer anyway?

The other day we were having issues with a box that was used as a NFS box among other things. These issues appeared on upgrading this box from kernel 2.4 to kernel These issues were related to locking on the NFS server, because of changing the behaviour of the flock system call on linux 2.6.11-18. From the NFS FAQ:

The NFS client in 2.6.12 provides support for flock()/BSD locks on NFS files by emulating the BSD-style locks in terms of POSIX byte range locks. Other NFS clients that use the same emulation mechanism, or that use fcntl()/POSIX locks, will then see the same locks that the Linux NFS client sees.

The problems we had are related to using NFS and Samba for exporting the same file system and locking not working properly.

SMB supports two types of locks – file-wide locks and byte-range locks.

  • File-wide locks
    • Called ‘share modes’ in SMB parlance
    • Also known as ‘BSD-style locks’
    • provided by flock() in Linux
    • provided by a ‘share mode’ flag when opening a file under Win32
    • Supported primarily by samba within samba itself by storing in a TDB – get listed under ‘Locked Files’ at the bottom of smbstatus
    • May also be enforced in the kernel using flock() if HAVE_KERNEL_SHARE_MODES is 1.
  • Byte-range locks
    • Called ‘POSIX-style’ locks.
    • provided by fcntl(fd, F_GETLK) in POSIX.
    • provided by _locking() in Win32
    • lockf() is a Linux wrapper around fcntl() for locking the whole file.
    • Supported by samba by a ‘Windows to POSIX byte-range overlap conversion layer’ and then fcntl().

Windows applications appear to use both share modes and byterange locks.

In Linux, flock() and fcntl() locks are oblivious to each other, as per

NFSv3 (as a protocol) only supports byte-range locks. However, nfsd does flock() locks on files on the server taken out by other processes – although clients cannot set them themselves. See

Unfortunately, linux 2.6.12 adds flock() emulation to the Linux NFS client by translating it into a file-wide fcntl(). This means that flock()s and fcntl()s *do collide* on remote NFS shares, which introduces all the potential application race conditions which Linux avoided by having them oblivious to each other locally. The practical upshot of this is that if you re-share an NFS share via samba, then if a Windows client (e.g. Outlook opening a PST file) opens a file with a share mode, then byte-range locking operations will fail as the lock has already been acquired. (The fact that NFS doesn’t realise the same PID has both locks and allow them both is probably an even bigger problem). The solution for this is to revert bits of the patch responsible: Disabling share modes in samba is not an option, as it also disables the application-layer TDB support for them – and disabling HAVE_KERNEL_SHARE_MODES will stop other programs (e.g. nfsd) on dump being aware of what’s been flock()ed.

So the solution for our server was to revert this patch on 2.6.11-18 and apply this one:

--- fs/nfs/file.c       2007-07-10 19:56:30.000000000 +0100
+++ fs/nfs/file.c.nfs_flock_fix 2007-11-13 13:40:06.000000000 +0000
@@ -543,10 +543,24 @@
         * Not sure whether that would be unique, though, or whether
         * that would break in other places.
-       if (!(fl->fl_flags & FL_FLOCK))
+       /**
+        * Don't simulate flock() using posix locks, as they appear to collide with
+        * legitimate posix locks from the same process.
+        */
+       if (fl->fl_flags & FL_FLOCK)
                return -ENOLCK;

        /* We're simulating flock() locks using posix locks on the server */
+       /* ...except we shouldn't get here, due to the above patch. */
        fl->fl_owner = (fl_owner_t)filp;
        fl->fl_start = 0;
        fl->fl_end = OFFSET_MAX;

So, for us, recompiling the kernel with this patch on the production server fixes all our problems. But what if we wanted to do this live? Could such a subtle change be made without rebooting? You might be thinking right now about the different options under /proc for changing the behaviour of the kernel live, but what if you don’t have such an option? What if we wanted to change something and there was no way to do it because it is not implemented or not possible?

Let’s see.

So, from an academic point of view, we wanted to see whether this could really be done: whether the Linux kernel would let us do it and whether it was feasible. So we set up a testing box on which we would try to modify the running kernel. How hard would it be to do it on the testing box?

It seems there’s a way to do it, which my colleague Matthew came up with (all credit to him, I’m just telling the story). Let’s examine the piece of code that we want to change. The offending code lives in the file fs/nfs/file.c of the Linux kernel:

/*
 * Lock a (portion of) a file
 */
static int nfs_flock(struct file *filp, int cmd, struct file_lock *fl)
{
        dprintk("NFS: nfs_flock(f=%s/%ld, t=%x, fl=%x)\n",
                        fl->fl_type, fl->fl_flags);

        /*
         * No BSD flocks over NFS allowed.
         * Note: we could try to fake a POSIX lock request here by
         * using ((u32) filp | 0x80000000) or some such as the pid.
         * Not sure whether that would be unique, though, or whether
         * that would break in other places.
         */
        if (!(fl->fl_flags & FL_FLOCK))
                return -ENOLCK;

        /* We're simulating flock() locks using posix locks on the server */
        fl->fl_owner = (fl_owner_t)filp;
        fl->fl_start = 0;
        fl->fl_end = OFFSET_MAX;

        if (fl->fl_type == F_UNLCK)
                return do_unlk(filp, cmd, fl);

        return do_setlk(filp, cmd, fl);
}

What we wanted to do was change the behaviour of this function so that the if condition would be:

        if (fl->fl_flags & FL_FLOCK)
                return -ENOLCK;

That means that the change is fairly trivial: it amounts to inverting a single test, which in the machine code means changing the way the branch for the if statement is taken. If we disassemble the object file, file.o, we get something like:

# objdump -d fs/nfs/file.o
00000885 <nfs_flock>:
 885:   57                      push   %edi
 886:   89 d7                   mov    %edx,%edi
 888:   56                      push   %esi
 889:   89 c6                   mov    %eax,%esi
 88b:   53                      push   %ebx
 88c:   89 cb                   mov    %ecx,%ebx
 88e:   83 ec 14                sub    $0x14,%esp
 891:   f6 05 00 00 00 00 40    testb  $0x40,0x0
 898:   74 3b                   je     8d5
 89a:   0f b6 41 2c             movzbl 0x2c(%ecx),%eax
 89e:   89 44 24 10             mov    %eax,0x10(%esp)
 8a2:   0f b6 41 2d             movzbl 0x2d(%ecx),%eax
 8a6:   89 44 24 0c             mov    %eax,0xc(%esp)
 8aa:   8b 56 0c                mov    0xc(%esi),%edx
 8ad:   8b 42 0c                mov    0xc(%edx),%eax
 8b0:   8b 40 20                mov    0x20(%eax),%eax
 8b3:   89 44 24 08             mov    %eax,0x8(%esp)
 8b7:   8b 42 0c                mov    0xc(%edx),%eax
 8ba:   8b 80 9c 00 00 00       mov    0x9c(%eax),%eax
 8c0:   c7 04 24 20 01 00 00    movl   $0x120,(%esp)
 8c7:   05 40 01 00 00          add    $0x140,%eax
 8cc:   89 44 24 04             mov    %eax,0x4(%esp)
 8d0:   e8 fc ff ff ff          call   8d1
 8d5:   f6 43 2c 02             testb  $0x2,0x2c(%ebx)
 8d9:   74 47                   je     922
 8db:   80 7b 2d 02             cmpb   $0x2,0x2d(%ebx)
 8df:   89 73 14                mov    %esi,0x14(%ebx)
 8e2:   c7 43 30 00 00 00 00    movl   $0x0,0x30(%ebx)
 8e9:   c7 43 34 00 00 00 00    movl   $0x0,0x34(%ebx)
 8f0:   c7 43 38 ff ff ff ff    movl   $0xffffffff,0x38(%ebx)
 8f7:   c7 43 3c ff ff ff 7f    movl   $0x7fffffff,0x3c(%ebx)
 8fe:   75 11                   jne    911
 900:   83 c4 14                add    $0x14,%esp
 903:   89 d9                   mov    %ebx,%ecx
 905:   89 f0                   mov    %esi,%eax

Here we can have a look at the disassembled code of the function nfs_flock(). If you look at address 0x8d5, there’s a testb instruction that looks suspiciously similar to the test carried out in the if statement (0x2 is the value of FL_FLOCK), followed at 0x8d9 by a conditional jump. If you know assembly and know how a compiler works, you could work out that this jump instruction (JE) is just the one we want to change, into a JNE instruction. (I am not going to expand here on assembly and compilers and the like. I guess that if you wanted to do this, you would already know this. And, by the way, I’ve no idea of those concepts either, seriously.)

If you are not sure that’s the right instruction, you could recompile the kernel with the change, disassemble the same file.o object again and compare the two listings to see exactly what changed.

Also, if you look at address 0x898, there is another JE instruction which may look like the one we are looking for, but that one belongs to dprintk, as we have debugging enabled on that kernel.

If we look these instructions up in the IA32 manual, we see that the opcodes for the two interesting instructions (in their short-jump form) are:

  • JE: 74
  • JNE: 75

Ok, so right now, we know that we want to change a JE instruction (with opcode 74) into a JNE instruction (with opcode 75) at address 0x8d9 of the object file “file.o”.
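To make the one-byte change concrete, here is a small Python sketch working on a copy of the six bytes taken from the listing above (addresses 0x8d5 to 0x8da). The helper name `invert_short_je` is made up for illustration; nothing here touches a real kernel. It also checks that the displacement byte is left alone, so the jump target does not move.

```python
JE_SHORT = 0x74   # "jump if equal", 8-bit displacement
JNE_SHORT = 0x75  # "jump if not equal", 8-bit displacement

# Bytes copied from the objdump listing:
#   8d5: f6 43 2c 02   testb $0x2,0x2c(%ebx)
#   8d9: 74 47         je    922
SNIPPET_START = 0x8D5
snippet = bytes.fromhex("f6432c02" "7447")


def invert_short_je(code, offset):
    """Return a copy of `code` with the JE opcode at `offset` turned into JNE."""
    if code[offset] != JE_SHORT:
        raise ValueError(
            f"expected JE (0x74) at offset {offset}, found {code[offset]:#04x}"
        )
    patched = bytearray(code)
    patched[offset] = JNE_SHORT
    return bytes(patched)


jump_offset = 0x8D9 - SNIPPET_START
patched = invert_short_je(snippet, jump_offset)

# Target of a short jump = address of the next instruction + displacement.
# The jump is 2 bytes long and its displacement byte (0x47) is unchanged.
target = 0x8D9 + 2 + patched[jump_offset + 1]

print(patched.hex())  # f6432c027547
print(hex(target))    # 0x922
```

The two opcodes differ only in their lowest bit, which is why a single-byte write is enough to invert the condition.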

The problem now is finding out where in kernel memory this piece of code lives. One approach you might think of is grepping the whole memory for a particular sequence of instructions. This is not recommended, and I will explain why later on. First, let’s see how we can get access to the kernel memory, where we could possibly modify the data…

If you have a look at your unix system, you’ll see that you have a /dev/kmem special file, with which you can access the memory from the kernel’s point of view. This is quite useful as you can access it in read mode and, more interestingly, write mode. However, doing stuff with it might be a bit dangerous, as you might have guessed. It even seems that some vendors will disable this special file.

Anyway, as I said before, you cannot and don’t want to read or grep the whole memory at /dev/kmem. It seems the reason is that write-only registers are mapped into memory, so a read would crash the system. (You might think that that message is very old, from a distant time, but believe me, it crashes Linux if you do so. We tried reading the whole of /dev/mem and the network driver crashed, among other little pieces, so don’t bother.)

So basically the thing boils down to:

  1. Finding out exactly where we have to change the kernel (ie, at which memory address)
  2. Opening /dev/kmem and changing it with the right tool
  3. Hoping everything went fine :-)

Finding out where

We need to find out where in the kernel we want to make the change. This can easily be done using tools and information the kernel provides us.

First, we need to find out where the kernel function nfs_flock starts, and that can be done by having a look at the file that is generated every time we compile a kernel. This file helps kernel developers debug their code by mapping kernel functions to memory addresses, so it is actually much easier to find stuff. It contains the kernel symbol table with all symbols.

Ours looks like:

c01a9b3f t do_unlk
c01a9b97 t do_setlk
c01a9c2b t nfs_lock
c01a9d21 t nfs_flock
c01a9dcc T nfs_get_root
c01a9f3c T nfs_write_inode

So now, we know that nfs_flock is located at 0xc01a9d21. We’ll use this in a minute.

We saw that the instruction we are interested in is located at 0x8D9 in the object file file.o (obtained with objdump before). We also know that, in that object file, nfs_flock starts at 0x885, right? That means that the byte we want to change is located exactly:

0x8D9 - 0x885 = 0x54

at nfs_flock + 0x54.

Well, as you might know, those addresses (in the object file) are relative to that file and, when it is linked into the actual kernel, the addresses are all relocated and recalculated. So, basically, the right point is at

0xc01a9d21 + 0x54
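The arithmetic above can be double-checked with a short Python sketch. It parses the symbol lines exactly as quoted earlier (the `symbol_address` helper and the `SYMBOL_TABLE` name are made up for illustration) and adds the offset derived from the objdump listing. As an extra sanity check, it confirms the resulting address still falls before the next symbol, nfs_get_root.

```python
SYMBOL_TABLE = """\
c01a9b3f t do_unlk
c01a9b97 t do_setlk
c01a9c2b t nfs_lock
c01a9d21 t nfs_flock
c01a9dcc T nfs_get_root
c01a9f3c T nfs_write_inode
"""


def symbol_address(table, name):
    """Look up `name` in 'address type symbol' lines and return its address."""
    for line in table.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2] == name:
            return int(fields[0], 16)
    raise KeyError(name)


# Offset of the jump opcode inside nfs_flock, from the objdump listing.
offset_in_function = 0x8D9 - 0x885

patch_address = symbol_address(SYMBOL_TABLE, "nfs_flock") + offset_in_function

# The byte must still belong to nfs_flock, i.e. lie before the next symbol.
inside_function = patch_address < symbol_address(SYMBOL_TABLE, "nfs_get_root")

print(hex(offset_in_function))  # 0x54
print(hex(patch_address))       # 0xc01a9d75
print(inside_function)          # True
```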

Opening /dev/kmem and modifying it

Opening /dev/kmem needs to be done with the right tools. The /dev/kmem special file is a representation of the kernel memory, and the address we need to seek to (0xc01a9d21 + 0x54) is above 2 GB, so it does not fit in a signed 32-bit file offset. We therefore need LARGE_FILE support in whichever tool we use to modify it.

In our system, the easiest way to do it is with perl. First, we double check that perl was compiled with LARGE_FILE support:

root@devbox:~# perl -V  | grep LARGE
    cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
  Compile-time options: USE_LARGE_FILES

and now we can modify the kernel with a one-liner such as:

root@devbox:~# perl -e 'open (KERNEL, "+</dev/kmem") || die $!; seek(KERNEL, 0xc01a9d21 + 0x54, 0); syswrite(KERNEL, chr(0x75)); close(KERNEL);'

where we are writing 0x75 in the right position (calculated previously).
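For comparison, here is a more defensive variant sketched in Python. It is not what we ran: the `patch_byte` helper is hypothetical, and it is only exercised here against a scratch file standing in for /dev/kmem. The one difference from the one-liner is that it reads the byte first and refuses to write unless it finds the expected JE opcode.

```python
import os
import tempfile


def patch_byte(path, address, expected, replacement):
    """Overwrite one byte at `address` in `path`, but only if it currently
    holds `expected`. Returns the byte that was there before the write."""
    fd = os.open(path, os.O_RDWR)
    try:
        current = os.pread(fd, 1, address)
        if current != bytes([expected]):
            raise ValueError(f"unexpected byte {current.hex()!r} at {address:#x}")
        os.pwrite(fd, bytes([replacement]), address)
        return current[0]
    finally:
        os.close(fd)


# Demonstration on a scratch file containing the six bytes from the listing
# (testb at 0x8d5, je at 0x8d9); the JE opcode sits at offset 4.
with tempfile.NamedTemporaryFile() as scratch:
    scratch.write(bytes.fromhex("f6432c027447"))
    scratch.flush()
    patch_byte(scratch.name, 4, 0x74, 0x75)
    scratch.seek(0)
    result = scratch.read()

print(result.hex())  # f6432c027547
```

On the dev box, the equivalent call would presumably be `patch_byte("/dev/kmem", 0xc01a9d21 + 0x54, 0x74, 0x75)`, but whether pread()/pwrite() behave on /dev/kmem the same way as the seek and syswrite above is an assumption I have not tested.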

And that’s all there is to it. If everything went fine, the kernel behaviour has been modified and nothing has crashed.

Again, only do this on a dev box, at your own risk, and if you really really know what you are doing. And only only only if you want to play with a live kernel. Remember, I know nothing about computers.

July 23, 2007


Filed under: Programming, System Administration — Tags: — jesus @ 13:24

Some time ago I discovered Multitail, a tool for displaying any kind of information in a tail-like fashion. It works by splitting the console window into many parts and displaying the info you want in each of them, whether it is tailing a file or the output of a command via an ssh session. It also has colouring support (which you can extend using regular expressions) so you can tailor it to your needs.

I found it really handy when I have to monitor many servers. With some bash power, you can get very nice outputs using something like this:

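#!/bin/bash
# Assumption: the list of servers comes from the command-line arguments.
rest="$*"
if [[ -z $rest ]]; then
  echo "You need to specify at least one server"
  exit 1
fi
# Build one multitail command with a vmstat window per server.
command="multitail -s 2 "
for server in $rest; do
  command="$command -CS vmstat -t $server -l \"ssh $server vmstat 1\" "
done
eval "$command"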
