Modifying a live linux kernel
19/Dec 2007
Before reading this, I just need to say something:
I’ve no idea of linux, I’ve no idea of programming, I’ve no idea of computers… Everything you read here might have been invented, so, please, do not reproduce what I write here. If you do, bear in mind that you do it under your own responsibility. In fact, what is a computer anyway?
The other day we were having issues with a box that was used as an NFS box, among other things. The issues appeared when we upgraded the box from kernel 2.4 to kernel 2.6.22.1. They were related to locking on the NFS server, because the behaviour of the flock() system call over NFS was changed by the linux-2.6.11-18-flock patch, which the NFS FAQ describes as part of 2.6.12. From the NFS FAQ:
The NFS client in 2.6.12 provides support for flock()/BSD locks on NFS files by emulating the BSD-style locks in terms of POSIX byte range locks. Other NFS clients that use the same emulation mechanism, or that use fcntl()/POSIX locks, will then see the same locks that the Linux NFS client sees.
The problems we had are related to using NFS and Samba for exporting the same file system and locking not working properly.
SMB supports two types of locks - file-wide locks and byte-range locks.
File-wide locks
Called ‘share modes’ in SMB parlance
Also known as ‘BSD-style locks’
provided by flock() in Linux
provided by a ‘share mode’ flag when opening a file under Win32
Supported primarily by samba within samba itself by storing in a TDB - get listed under ‘Locked Files’ at the bottom of smbstatus
May also be enforced in the kernel using flock() if HAVE_KERNEL_SHARE_MODES is 1.
Byte-range locks
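Called ‘POSIX-style’ locks
provided by fcntl(fd, F_SETLK, ...) in POSIX (F_GETLK only tests for an existing lock)
provided by _locking() in Win32
lockf() is, on Linux, a wrapper around fcntl() that locks from the current file offset (a length of 0 extends the lock to the end of the file).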
Supported by samba by a ‘Windows to POSIX byte-range overlap conversion layer’ and then fcntl().
Windows applications appear to use both share modes and byterange locks.
In Linux, flock() and fcntl() locks are oblivious to each other, as per http://lxr.linux.no/source/Documentation/locks.txt.
NFSv3 (as a protocol) only supports byte-range locks. However, nfsd does honour flock() locks taken out by other processes on files on the server, although NFS clients cannot set them themselves. See http://nfs.sourceforge.net/#faq_d10
Unfortunately, Linux 2.6.12 adds flock() emulation to the Linux NFS client by translating it into a file-wide fcntl(). This means that flock()s and fcntl()s do collide on remote NFS shares, which introduces all the potential application race conditions that Linux avoided by having them oblivious to each other locally. The practical upshot is that if you re-share an NFS share via samba and a Windows client (e.g. Outlook opening a PST file) opens a file with a share mode, then byte-range locking operations on that file will fail because the lock has already been acquired. (The fact that NFS doesn’t realise that the same PID holds both locks, and so doesn’t allow them both, is probably an even bigger problem.) The solution for this is to revert bits of the patch responsible: http://www.linux-nfs.org/Linux-2.6.x/2.6.11/linux-2.6.11-18-flock.dif. Disabling share modes in samba is not an option, as it also disables the application-layer TDB support for them - and disabling HAVE_KERNEL_SHARE_MODES will stop other programs (e.g. nfsd) on dump being aware of what’s been flock()ed.
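The local behaviour mentioned above, where flock() and fcntl() locks are oblivious to each other, can be checked with a small sketch. This is a minimal Python example, not something taken from our setup: the helper name locks_are_independent is made up for illustration, and it assumes a Linux box and a file on a local (non-NFS) file system.

```python
import fcntl
import os
import tempfile


def locks_are_independent(path):
    """Hypothetical helper: try to hold an exclusive flock() lock and an
    exclusive fcntl() lock on the same file at the same time.

    Returns True if both are granted, False if the second one is refused.
    """
    fd_flock = os.open(path, os.O_RDWR)
    fd_posix = os.open(path, os.O_RDWR)
    try:
        # BSD-style, file-wide lock: flock(2).
        fcntl.flock(fd_flock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            # POSIX byte-range lock covering the whole file. Python's
            # fcntl.lockf() is implemented on top of fcntl(2).
            fcntl.lockf(fd_posix, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False
        return True
    finally:
        # Closing the descriptors drops both locks.
        os.close(fd_flock)
        os.close(fd_posix)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("flock() and fcntl() locks independent:",
              locks_are_independent(tmp.name))
```

On a local Linux file system both locks should be granted. On an NFS mount from a 2.6.12 or later client, the description above suggests the second lock would be refused, although we have not run this sketch there.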
So the solution for our server was to revert that part of the 2.6.11-18 flock patch by applying this patch:
--- fs/nfs/file.c 2007-07-10 19:56:30.000000000 +0100
+++ fs/nfs/file.c.nfs_flock_fix 2007-11-13 13:40:06.000000000 +0000
@@ -543,10 +543,24 @@
      * Not sure whether that would be unique, though, or whether
      * that would break in other places.
      */
-    if (!(fl->fl_flags & FL_FLOCK))
+
+    /**
+     * Don't simulate flock() using posix locks, as they appear to collide with
+     * legitimate posix locks from the same process.
+     */
+    if (fl->fl_flags & FL_FLOCK)
         return -ENOLCK;
     /* We're simulating flock() locks using posix locks on the server */
+    /* ...except we shouldn't get here, due to the above patch. */
     fl->fl_owner = (fl_owner_t)filp;
     fl->fl_start = 0;
     fl->fl_end = OFFSET_MAX;
So, for us, recompiling the kernel with this patch on the production server fixed all our problems. But what if we wanted to do this live? Could such a subtle change be made without rebooting? You might be thinking right now about the options under /proc that change the behaviour of the kernel live, but what if there is no such option? What if we wanted to change something and there was no way to do it, because it is not implemented or not possible?
Let’s see.
So, from an academic point of view, we wanted to see whether this could really be done: whether the Linux kernel would let us do it, whether it was feasible. We set up a testing box on which we would try to modify the running kernel. How hard would it be to do it on the testing box?
It seems there’s a way to do it, which my colleague Matthew came up with (all credit to him, I’m just telling the story). Let’s examine the piece of code that we want to change. The offending code lives in the file fs/nfs/file.c of the Linux kernel:
/*
 * Lock a (portion of) a file
 */
static int nfs_flock(struct file *filp, int cmd, struct file_lock *fl)
{
    dprintk("NFS: nfs_flock(f=%s/%ld, t=%x, fl=%x)\n",
            filp->f_path.dentry->d_inode->i_sb->s_id,
            filp->f_path.dentry->d_inode->i_ino,
            fl->fl_type, fl->fl_flags);
    /*
     * No BSD flocks over NFS allowed.
     * Note: we could try to fake a POSIX lock request here by
     * using ((u32) filp | 0x80000000) or some such as the pid.
     * Not sure whether that would be unique, though, or whether
     * that would break in other places.
     */
    if (!(fl->fl_flags & FL_FLOCK))
        return -ENOLCK;
    /* We're simulating flock() locks using posix locks on the server */
    fl->fl_owner = (fl_owner_t)filp;
    fl->fl_start = 0;
    fl->fl_end = OFFSET_MAX;
    if (fl->fl_type == F_UNLCK)
        return do_unlk(filp, cmd, fl);
    return do_setlk(filp, cmd, fl);
}
What we wanted to do is to change the behaviour of the previous function such that the if condition would be:
    if (fl->fl_flags & FL_FLOCK)
        return -ENOLCK;
That means the change is fairly trivial: in the machine code it amounts to inverting the condition of the conditional jump generated for the if statement. If we disassemble the object file, file.o, we get something like:
# objdump -d fs/nfs/file.o
...
00000885 <nfs_flock>:
885: 57 push %edi
886: 89 d7 mov %edx,%edi
888: 56 push %esi
889: 89 c6 mov %eax,%esi
88b: 53 push %ebx
88c: 89 cb mov %ecx,%ebx
88e: 83 ec 14 sub $0x14,%esp
891: f6 05 00 00 00 00 40 testb $0x40,0x0
898: 74 3b je 8d5
89a: 0f b6 41 2c movzbl 0x2c(%ecx),%eax
89e: 89 44 24 10 mov %eax,0x10(%esp)
8a2: 0f b6 41 2d movzbl 0x2d(%ecx),%eax
8a6: 89 44 24 0c mov %eax,0xc(%esp)
8aa: 8b 56 0c mov 0xc(%esi),%edx
8ad: 8b 42 0c mov 0xc(%edx),%eax
8b0: 8b 40 20 mov 0x20(%eax),%eax
8b3: 89 44 24 08 mov %eax,0x8(%esp)
8b7: 8b 42 0c mov 0xc(%edx),%eax
8ba: 8b 80 9c 00 00 00 mov 0x9c(%eax),%eax
8c0: c7 04 24 20 01 00 00 movl $0x120,(%esp)
8c7: 05 40 01 00 00 add $0x140,%eax
8cc: 89 44 24 04 mov %eax,0x4(%esp)
8d0: e8 fc ff ff ff call 8d1
8d5: f6 43 2c 02 testb $0x2,0x2c(%ebx)
8d9: 74 47 je 922
8db: 80 7b 2d 02 cmpb $0x2,0x2d(%ebx)
8df: 89 73 14 mov %esi,0x14(%ebx)
8e2: c7 43 30 00 00 00 00 movl $0x0,0x30(%ebx)
8e9: c7 43 34 00 00 00 00 movl $0x0,0x34(%ebx)
8f0: c7 43 38 ff ff ff ff movl $0xffffffff,0x38(%ebx)
8f7: c7 43 3c ff ff ff 7f movl $0x7fffffff,0x3c(%ebx)
8fe: 75 11 jne 911
900: 83 c4 14 add $0x14,%esp
903: 89 d9 mov %ebx,%ecx
905: 89 f0 mov %esi,%eax
...
We can now have a look at the disassembled code of the function nfs_flock(). At address 0x8d9 there’s an instruction that looks suspiciously similar to the test carried out by the if statement: the testb just before it checks bit 0x2 of a byte in the lock structure, and FL_FLOCK is defined as 2 in the kernel headers. If you know assembly and how a compiler works, you can work out that this jump instruction (JE) is exactly the one we want to change, into a JNE instruction. (I am not going to expand here on assembly, compilers and the like. I guess that if you wanted to do this, you would already know about them. And, by the way, I’ve no idea of those concepts either, seriously.)
If you are not sure that’s the right instruction, you could recompile the kernel with the patch, disassemble the resulting file.o in the same way, and compare the two listings to see what changed.
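As a rough sketch of that comparison, the two objdump listings can be diffed line by line. This is Python, the function name changed_lines is made up for illustration, and the "patched" listing below is a hypothetical example of what a recompiled file.o might show, not output from our box.

```python
import difflib


def changed_lines(before, after):
    """Return only the removed ('-') and added ('+') lines between two
    disassembly listings given as text."""
    diff = difflib.unified_diff(before.splitlines(), after.splitlines(),
                                lineterm="", n=0)
    # The first two lines of a unified diff are the '---'/'+++' headers.
    body = list(diff)[2:]
    return [line for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    original = (
        "8d5: f6 43 2c 02 testb $0x2,0x2c(%ebx)\n"
        "8d9: 74 47 je 922\n"
        "8db: 80 7b 2d 02 cmpb $0x2,0x2d(%ebx)\n"
    )
    # Hypothetical listing after recompiling with the patch applied.
    patched = original.replace("74 47 je 922", "75 47 jne 922")
    for line in changed_lines(original, patched):
        print(line)
```

In practice the two inputs would be the saved output of objdump -d on the old and the new file.o. A real recompile may also shift addresses elsewhere in the file, so expect more noise than in this toy example.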
Also, at address 0x898 there is another JE instruction which may look like the one we are looking for, but it belongs to dprintk, as we have debugging enabled on that kernel.
If we have a look at the instructions in the IA-32 manual, we see that the opcodes (in hexadecimal) for the short forms of the interesting instructions are:
JE: 74
JNE: 75
OK, so right now we know that we want to change a JE instruction (opcode 74) into a JNE instruction (opcode 75) at offset 0x8d9 of the object file “file.o”.
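To double check the reading of those two bytes, here is a minimal Python sketch that decodes a short JE/JNE and inverts its condition. The function names are made up for illustration and only the two opcodes listed above are handled. The displacement byte of a short jump is signed and relative to the address of the next instruction.

```python
SHORT_JUMPS = {0x74: "je", 0x75: "jne"}


def decode_short_jump(raw, address):
    """Decode a two-byte short JE/JNE located at `address`.

    Returns (mnemonic, target_address).
    """
    opcode, rel8 = raw[0], raw[1]
    if opcode not in SHORT_JUMPS:
        raise ValueError("not a short JE/JNE: opcode 0x%02x" % opcode)
    # rel8 is a signed displacement from the end of this instruction.
    if rel8 >= 0x80:
        rel8 -= 0x100
    return SHORT_JUMPS[opcode], address + 2 + rel8


def invert_short_jump(raw):
    """Return the same jump with the condition inverted (JE <-> JNE)."""
    decode_short_jump(raw, 0)  # validates the opcode
    # 0x74 and 0x75 differ only in the lowest bit.
    return bytes([raw[0] ^ 0x01]) + bytes(raw[1:2])


if __name__ == "__main__":
    original = bytes.fromhex("7447")  # the two bytes at 0x8d9 in the listing
    mnemonic, target = decode_short_jump(original, 0x8d9)
    print(mnemonic, hex(target))
    patched = invert_short_jump(original)
    print(decode_short_jump(patched, 0x8d9)[0], patched.hex())
```

Decoding the bytes at 0x8d9 gives the same target, 0x922, that objdump printed, which is a cheap sanity check that we are reading the listing correctly. Only the first byte changes, so the jump target stays the same.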
The problem now is to find out where in kernel memory this piece of code lives. One approach you might think of is grepping the whole memory for a particular sequence of instructions. This is not recommended, and I will explain why later on. First, let’s see how we can get access to the kernel memory, where we could possibly modify the data…
If you have a look at your unix system, you’ll see that you have a /dev/kmem special file, with which you can access the memory from the kernel’s point of view. This is quite useful as you can access it in read mode and, more interestingly, write mode. However, doing stuff with it might be a bit dangerous, as you might have guessed. It even seems that some vendors will disable this special file.
Anyway, as I said before, you cannot and don’t want to read or grep the whole memory at /dev/kmem. It seems that the reason is having write-only registers mapped into memory, so a read would crash the system. (You might think that that message is very old, from a distant time, but believe me, it crashes linux if you do so. We tried reading the whole of /dev/mem and the network driver crashed among other little pieces, so don’t bother).
So basically the thing boils down to:
Finding out exactly where we have to change the kernel (ie, at which memory address)
Opening /dev/kmem and changing it with the right tool
Hope everything went fine :-)
** Finding out where **
We need to find out where in the kernel we want to make the change. This can easily be done using tools and information the kernel provides us.
First, we need to find out where the kernel function nfs_flock starts, and that can be done by having a look at the System.map file that is generated every time we compile a kernel. The System.map file is a file that helps kernel developers to debug their code by mapping kernel functions to memory addresses, so it is actually much easier to find stuff. It contains the kernel symbol table with all symbols.
Ours looks like:
...
c01a9b3f t do_unlk
c01a9b97 t do_setlk
c01a9c2b t nfs_lock
c01a9d21 t nfs_flock
c01a9dcc T nfs_get_root
c01a9f3c T nfs_write_inode
...
So now, we know that nfs_flock is located at 0xc01a9d21. We’ll use this in a minute.
We saw that the instruction we are interested in is located at 0x8D9 in the object file file.o (as shown by objdump before). We also know that, in that object file, nfs_flock starts at 0x885, right? That means that the byte we want to change is located exactly:
0x8D9 - 0x885 = 0x54
at nfs_flock + 0x54.
Well, as you might know, those addresses (in the object file) are relative to that file, and when the object is linked into the actual kernel the addresses are all relocated and recalculated. The offset within the function stays the same, though, so the right point is at
0xc01a9d21 + 0x54 = 0xc01a9d75
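The lookup and the arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the System.map excerpt is the one shown earlier. On a real box you would read the System.map that belongs to the running kernel, which is an assumption this sketch cannot check for you.

```python
SYSTEM_MAP_EXCERPT = """\
c01a9b3f t do_unlk
c01a9b97 t do_setlk
c01a9c2b t nfs_lock
c01a9d21 t nfs_flock
c01a9dcc T nfs_get_root
c01a9f3c T nfs_write_inode
"""


def symbol_address(system_map_text, name):
    """Return the address of `name` from System.map-style text
    (lines of the form 'address type symbol')."""
    for line in system_map_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[2] == name:
            return int(fields[0], 16)
    raise KeyError(name)


def patch_address(system_map_text, name, insn_offset, func_offset):
    """Address in the running kernel of an instruction, given its offset
    and the offset of its function within the object file."""
    return symbol_address(system_map_text, name) + (insn_offset - func_offset)


if __name__ == "__main__":
    addr = patch_address(SYSTEM_MAP_EXCERPT, "nfs_flock", 0x8D9, 0x885)
    print(hex(addr))
```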
** Opening /dev/kmem and modifying it **
Opening /dev/kmem needs to be done with the right tools. The /dev/kmem special file is a representation of the kernel’s memory, and the offset we need to seek to (0xc01a9d21 + 0x54) is above 2 GiB, which does not fit in a signed 32-bit file offset. So whichever tool we use to modify it needs LARGE_FILE support.
In our system, the easiest way to do it is with perl. First, we double check that perl was compiled with LARGE_FILE support:
root@devbox:~# perl -V | grep LARGE
cc='cc', ccflags ='-fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
Compile-time options: USE_LARGE_FILES
root@devbox:~#
and now we can modify the kernel with a one-liner such as:
root@devbox:~# perl -e 'open(KERNEL, "+</dev/kmem") || die $!; sysseek(KERNEL, 0xc01a9d21 + 0x54, 0) || die $!; syswrite(KERNEL, chr(0x75)) || die $!; close(KERNEL);'
root@devbox:~#
where we are writing 0x75 in the right position (calculated previously).
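A slightly more careful variant would read the byte first and refuse to write unless it finds the expected 0x74. The following is a minimal Python sketch of that idea, not what we actually ran: the function name patch_byte is made up, it has only been exercised against an ordinary scratch file, and it assumes that positioned reads and writes on /dev/kmem behave like the seek plus write above.

```python
import os
import tempfile


def patch_byte(path, offset, expected, replacement):
    """Replace one byte at `offset` in `path`, but only if the byte
    currently there is `expected`."""
    fd = os.open(path, os.O_RDWR)
    try:
        current = os.pread(fd, 1, offset)
        if current != bytes([expected]):
            raise RuntimeError(
                "refusing to patch: found %r at offset 0x%x, expected 0x%02x"
                % (current, offset, expected))
        if os.pwrite(fd, bytes([replacement]), offset) != 1:
            raise RuntimeError("short write at offset 0x%x" % offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demonstration on a scratch file holding the bytes from 0x8d5 to 0x8da
    # of the listing; the JE opcode sits at index 4.
    with tempfile.NamedTemporaryFile() as scratch:
        scratch.write(bytes.fromhex("f6432c027447"))
        scratch.flush()
        patch_byte(scratch.name, 4, 0x74, 0x75)
        scratch.seek(0)
        print(scratch.read().hex())
```

Against the live kernel the call would be patch_byte("/dev/kmem", 0xc01a9d21 + 0x54, 0x74, 0x75), with the same warnings as everything else here. Running it a second time raises an error instead of writing, because the byte is no longer 0x74.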
And that’s all there is to it. If everything went fine, the kernel’s behaviour has been modified and nothing has crashed.
Again, only do this on a dev box, under your own responsibility, and only if you really, really know what you are doing. And only, only, only if you want to play with a live kernel. Remember, I know nothing about computers.