Commit 9c296b46 authored by Jonathan Corbet

docs: sphinxify kmemcheck.txt and move to dev-tools

Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent ca90a7a3
GETTING STARTED WITH KMEMCHECK Getting started with kmemcheck
============================== ==============================
Vegard Nossum <vegardno@ifi.uio.no> Vegard Nossum <vegardno@ifi.uio.no>
Contents Introduction
======== ------------
0. Introduction
1. Downloading
2. Configuring and compiling
3. How to use
3.1. Booting
3.2. Run-time enable/disable
3.3. Debugging
3.4. Annotating false positives
4. Reporting errors
5. Technical description
0. Introduction
===============
kmemcheck is a debugging feature for the Linux Kernel. More specifically, it kmemcheck is a debugging feature for the Linux Kernel. More specifically, it
is a dynamic checker that detects and warns about some uses of uninitialized is a dynamic checker that detects and warns about some uses of uninitialized
...@@ -40,21 +26,20 @@ as much memory as normal. For this reason, kmemcheck is strictly a debugging ...@@ -40,21 +26,20 @@ as much memory as normal. For this reason, kmemcheck is strictly a debugging
feature. feature.
1. Downloading Downloading
============== -----------
As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel.
2. Configuring and compiling Configuring and compiling
============================ -------------------------
kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of
configuration variables must have specific settings in order for the kmemcheck configuration variables must have specific settings in order for the kmemcheck
menu to even appear in "menuconfig". These are: menu to even appear in "menuconfig". These are:
o CONFIG_CC_OPTIMIZE_FOR_SIZE=n - ``CONFIG_CC_OPTIMIZE_FOR_SIZE=n``
This option is located under "General setup" / "Optimize for size". This option is located under "General setup" / "Optimize for size".
Without this, gcc will use certain optimizations that usually lead to Without this, gcc will use certain optimizations that usually lead to
...@@ -63,13 +48,11 @@ menu to even appear in "menuconfig". These are: ...@@ -63,13 +48,11 @@ menu to even appear in "menuconfig". These are:
16 bits. kmemcheck sees only the 32-bit load, and may trigger a 16 bits. kmemcheck sees only the 32-bit load, and may trigger a
warning for the upper 16 bits (if they're uninitialized). warning for the upper 16 bits (if they're uninitialized).
o CONFIG_SLAB=y or CONFIG_SLUB=y - ``CONFIG_SLAB=y`` or ``CONFIG_SLUB=y``
This option is located under "General setup" / "Choose SLAB This option is located under "General setup" / "Choose SLAB
allocator". allocator".
o CONFIG_FUNCTION_TRACER=n - ``CONFIG_FUNCTION_TRACER=n``
This option is located under "Kernel hacking" / "Tracers" / "Kernel This option is located under "Kernel hacking" / "Tracers" / "Kernel
Function Tracer" Function Tracer"
...@@ -80,12 +63,11 @@ menu to even appear in "menuconfig". These are: ...@@ -80,12 +63,11 @@ menu to even appear in "menuconfig". These are:
modifies memory that was tracked by kmemcheck, the result is an modifies memory that was tracked by kmemcheck, the result is an
endless recursive page fault. endless recursive page fault.
o CONFIG_DEBUG_PAGEALLOC=n - ``CONFIG_DEBUG_PAGEALLOC=n``
This option is located under "Kernel hacking" / "Memory Debugging" This option is located under "Kernel hacking" / "Memory Debugging"
/ "Debug page memory allocations". / "Debug page memory allocations".
In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also In addition, I highly recommend turning on ``CONFIG_DEBUG_INFO=y``. This is also
located under "Kernel hacking". With this, you will be able to get line number located under "Kernel hacking". With this, you will be able to get line number
information from the kmemcheck warnings, which is extremely valuable in information from the kmemcheck warnings, which is extremely valuable in
debugging a problem. This option is not mandatory, however, because it slows debugging a problem. This option is not mandatory, however, because it slows
...@@ -95,12 +77,10 @@ Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory ...@@ -95,12 +77,10 @@ Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory
Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows
a description of the kmemcheck configuration variables: a description of the kmemcheck configuration variables:
o CONFIG_KMEMCHECK - ``CONFIG_KMEMCHECK``
This must be enabled in order to use kmemcheck at all... This must be enabled in order to use kmemcheck at all...
o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT - ``CONFIG_KMEMCHECK_``[``DISABLED`` | ``ENABLED`` | ``ONESHOT``]``_BY_DEFAULT``
This option controls the status of kmemcheck at boot-time. "Enabled" This option controls the status of kmemcheck at boot-time. "Enabled"
will enable kmemcheck right from the start, "disabled" will boot the will enable kmemcheck right from the start, "disabled" will boot the
kernel as normal (but with the kmemcheck code compiled in, so it can kernel as normal (but with the kmemcheck code compiled in, so it can
...@@ -125,8 +105,7 @@ a description of the kmemcheck configuration variables: ...@@ -125,8 +105,7 @@ a description of the kmemcheck configuration variables:
time overhead is not incurred, and the kernel will be almost as fast time overhead is not incurred, and the kernel will be almost as fast
as normal. as normal.
o CONFIG_KMEMCHECK_QUEUE_SIZE - ``CONFIG_KMEMCHECK_QUEUE_SIZE``
Select the maximum number of error reports to store in an internal Select the maximum number of error reports to store in an internal
(fixed-size) buffer. Since errors can occur virtually anywhere and in (fixed-size) buffer. Since errors can occur virtually anywhere and in
any context, we need a temporary storage area which is guaranteed not any context, we need a temporary storage area which is guaranteed not
...@@ -147,8 +126,7 @@ a description of the kmemcheck configuration variables: ...@@ -147,8 +126,7 @@ a description of the kmemcheck configuration variables:
will get lost in that way instead. Try setting this to 10 or so on will get lost in that way instead. Try setting this to 10 or so on
such a setup. such a setup.
o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT - ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT``
Select the number of shadow bytes to save along with each entry of the Select the number of shadow bytes to save along with each entry of the
error-report queue. These bytes indicate what parts of an allocation error-report queue. These bytes indicate what parts of an allocation
are initialized, uninitialized, etc. and will be displayed when an are initialized, uninitialized, etc. and will be displayed when an
...@@ -161,8 +139,7 @@ a description of the kmemcheck configuration variables: ...@@ -161,8 +139,7 @@ a description of the kmemcheck configuration variables:
The default value should be fine for debugging most problems. It also The default value should be fine for debugging most problems. It also
fits nicely within 80 columns. fits nicely within 80 columns.
o CONFIG_KMEMCHECK_PARTIAL_OK - ``CONFIG_KMEMCHECK_PARTIAL_OK``
This option (when enabled) works around certain GCC optimizations that This option (when enabled) works around certain GCC optimizations that
produce 32-bit reads from 16-bit variables where the upper 16 bits are produce 32-bit reads from 16-bit variables where the upper 16 bits are
thrown away afterwards. thrown away afterwards.
...@@ -171,8 +148,7 @@ a description of the kmemcheck configuration variables: ...@@ -171,8 +148,7 @@ a description of the kmemcheck configuration variables:
some real errors, but disabling it would probably produce a lot of some real errors, but disabling it would probably produce a lot of
false positives. false positives.
o CONFIG_KMEMCHECK_BITOPS_OK - ``CONFIG_KMEMCHECK_BITOPS_OK``
This option silences warnings that would be generated for bit-field This option silences warnings that would be generated for bit-field
accesses where not all the bits are initialized at the same time. This accesses where not all the bits are initialized at the same time. This
may also hide some real bugs. may also hide some real bugs.
...@@ -184,36 +160,36 @@ a description of the kmemcheck configuration variables: ...@@ -184,36 +160,36 @@ a description of the kmemcheck configuration variables:
Now compile the kernel as usual. Now compile the kernel as usual.
3. How to use How to use
============= ----------
3.1. Booting Booting
============ ~~~~~~~
First some information about the command-line options. There is only one First some information about the command-line options. There is only one
option specific to kmemcheck, and this is called "kmemcheck". It can be used option specific to kmemcheck, and this is called "kmemcheck". It can be used
to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT to override the default mode as chosen by the ``CONFIG_KMEMCHECK_*_BY_DEFAULT``
option. Its possible settings are: option. Its possible settings are:
o kmemcheck=0 (disabled) - ``kmemcheck=0`` (disabled)
o kmemcheck=1 (enabled) - ``kmemcheck=1`` (enabled)
o kmemcheck=2 (one-shot mode) - ``kmemcheck=2`` (one-shot mode)
If SLUB debugging has been enabled in the kernel, it may take precedence over If SLUB debugging has been enabled in the kernel, it may take precedence over
kmemcheck in such a way that the slab caches which are under SLUB debugging kmemcheck in such a way that the slab caches which are under SLUB debugging
will not be tracked by kmemcheck. In order to ensure that this doesn't happen will not be tracked by kmemcheck. In order to ensure that this doesn't happen
(even though it shouldn't by default), use SLUB's boot option "slub_debug", (even though it shouldn't by default), use SLUB's boot option ``slub_debug``,
like this: slub_debug=- like this: ``slub_debug=-``
In fact, this option may also be used for fine-grained control over SLUB vs. In fact, this option may also be used for fine-grained control over SLUB vs.
kmemcheck. For example, if the command line includes "kmemcheck=1 kmemcheck. For example, if the command line includes
slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" ``kmemcheck=1 slub_debug=,dentry``, then SLUB debugging will be used only
slab cache, and with kmemcheck tracking all the other caches. This is advanced for the "dentry" slab cache, and with kmemcheck tracking all the other
usage, however, and is not generally recommended. caches. This is advanced usage, however, and is not generally recommended.
3.2. Run-time enable/disable Run-time enable/disable
============================ ~~~~~~~~~~~~~~~~~~~~~~~
When the kernel has booted, it is possible to enable or disable kmemcheck at When the kernel has booted, it is possible to enable or disable kmemcheck at
run-time. WARNING: This feature is still experimental and may cause false run-time. WARNING: This feature is still experimental and may cause false
...@@ -221,36 +197,36 @@ positive warnings to appear. Therefore, try not to use this. If you find that ...@@ -221,36 +197,36 @@ positive warnings to appear. Therefore, try not to use this. If you find that
it doesn't work properly (e.g. you see an unreasonable amount of warnings), I it doesn't work properly (e.g. you see an unreasonable amount of warnings), I
will be happy to take bug reports. will be happy to take bug reports.
Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: Use the file ``/proc/sys/kernel/kmemcheck`` for this purpose, e.g.::
$ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck
The numbers are the same as for the kmemcheck= command-line option. The numbers are the same as for the ``kmemcheck=`` command-line option.
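The same setting can also be applied from a small program instead of the shell. The following is a minimal sketch, not part of kmemcheck itself: the helper names `kmemcheck_write_mode()` and `kmemcheck_set_mode()` are hypothetical, and the only facts taken from the text above are the sysctl path and the three mode numbers.

```c
#include <stdio.h>

/* Mode numbers as documented for the kmemcheck= option. */
enum kmemcheck_mode {
    KMEMCHECK_DISABLED = 0,
    KMEMCHECK_ENABLED  = 1,
    KMEMCHECK_ONESHOT  = 2,
};

/*
 * Write a mode number to an already-open stream.
 * Returns 0 on success, -1 on an invalid mode or a write error.
 */
int kmemcheck_write_mode(FILE *out, int mode)
{
    if (out == NULL || mode < KMEMCHECK_DISABLED || mode > KMEMCHECK_ONESHOT)
        return -1;
    if (fprintf(out, "%d\n", mode) < 0)
        return -1;
    return fflush(out) == 0 ? 0 : -1;
}

/*
 * Open the given sysctl file and set the mode. The path would normally
 * be /proc/sys/kernel/kmemcheck, and writing to it requires root.
 */
int kmemcheck_set_mode(const char *path, int mode)
{
    FILE *f = fopen(path, "w");
    int ret;

    if (f == NULL)
        return -1;
    ret = kmemcheck_write_mode(f, mode);
    if (fclose(f) != 0)
        ret = -1;
    return ret;
}
```

Calling `kmemcheck_set_mode("/proc/sys/kernel/kmemcheck", KMEMCHECK_DISABLED)` would then correspond to the `echo 0` command shown above. Splitting the stream helper from the path helper keeps the validation logic testable without touching procfs.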
3.3. Debugging Debugging
============== ~~~~~~~~~
A typical report will look something like this: A typical report will look something like this::
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
80000000000000000000000000000000000000000088ffff0000000000000000 80000000000000000000000000000000000000000088ffff0000000000000000
i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
^ ^
Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A
RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002
RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009
RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84
RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000
R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e
R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8
FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
[<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
[<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0
...@@ -261,8 +237,8 @@ The single most valuable information in this report is the RIP (or EIP on 32- ...@@ -261,8 +237,8 @@ The single most valuable information in this report is the RIP (or EIP on 32-
bit) value. This will help us pinpoint exactly which instruction caused bit) value. This will help us pinpoint exactly which instruction caused
the warning. the warning.
If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do If your kernel was compiled with ``CONFIG_DEBUG_INFO=y``, then all we have to do
is give this address to the addr2line program, like this: is give this address to the addr2line program, like this::
$ addr2line -e vmlinux -i ffffffff8104ede8 $ addr2line -e vmlinux -i ffffffff8104ede8
arch/x86/include/asm/string_64.h:12 arch/x86/include/asm/string_64.h:12
...@@ -270,71 +246,73 @@ is give this address to the addr2line program, like this: ...@@ -270,71 +246,73 @@ is give this address to the addr2line program, like this:
kernel/signal.c:380 kernel/signal.c:380
kernel/signal.c:410 kernel/signal.c:410
The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must The "``-e vmlinux``" tells addr2line which file to look in. **IMPORTANT:**
be the vmlinux of the kernel that produced the warning in the first place! If This must be the vmlinux of the kernel that produced the warning in the
not, the line number information will almost certainly be wrong. first place! If not, the line number information will almost certainly be
wrong.
The "-i" tells addr2line to also print the line numbers of inlined functions.
In this case, the flag was very important, because otherwise, it would only The "``-i``" tells addr2line to also print the line numbers of inlined
have printed the first line, which is just a call to memcpy(), which could be functions. In this case, the flag was very important, because otherwise,
called from a thousand places in the kernel, and is therefore not very useful. it would only have printed the first line, which is just a call to
These inlined functions would not show up in the stack trace above, simply ``memcpy()``, which could be called from a thousand places in the kernel, and
because the kernel doesn't load the extra debugging information. This is therefore not very useful. These inlined functions would not show up in
technique can of course be used with ordinary kernel oopses as well. the stack trace above, simply because the kernel doesn't load the extra
debugging information. This technique can of course be used with ordinary
In this case, it's the caller of memcpy() that is interesting, and it can be kernel oopses as well.
found in include/asm-generic/siginfo.h, line 287:
In this case, it's the caller of ``memcpy()`` that is interesting, and it can be
281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) found in ``include/asm-generic/siginfo.h``, line 287::
282 {
283 if (from->si_code < 0) 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from)
284 memcpy(to, from, sizeof(*to)); 282 {
285 else 283 if (from->si_code < 0)
286 /* _sigchld is currently the largest know union member */ 284 memcpy(to, from, sizeof(*to));
287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); 285 else
288 } 286 /* _sigchld is currently the largest know union member */
287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld));
288 }
Since this was a read (kmemcheck usually warns about reads only, though it can Since this was a read (kmemcheck usually warns about reads only, though it can
warn about writes to unallocated or freed memory as well), it was probably the warn about writes to unallocated or freed memory as well), it was probably the
"from" argument which contained some uninitialized bytes. Following the chain "from" argument which contained some uninitialized bytes. Following the chain
of calls, we move upwards to see where "from" was allocated or initialized, of calls, we move upwards to see where "from" was allocated or initialized,
kernel/signal.c, line 380: ``kernel/signal.c``, line 380::
359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info)
360 { 360 {
... ...
367 list_for_each_entry(q, &list->list, list) { 367 list_for_each_entry(q, &list->list, list) {
368 if (q->info.si_signo == sig) { 368 if (q->info.si_signo == sig) {
369 if (first) 369 if (first)
370 goto still_pending; 370 goto still_pending;
371 first = q; 371 first = q;
... ...
377 if (first) { 377 if (first) {
378 still_pending: 378 still_pending:
379 list_del_init(&first->list); 379 list_del_init(&first->list);
380 copy_siginfo(info, &first->info); 380 copy_siginfo(info, &first->info);
381 __sigqueue_free(first); 381 __sigqueue_free(first);
... ...
392 } 392 }
393 } 393 }
Here, it is &first->info that is being passed on to copy_siginfo(). The Here, it is ``&first->info`` that is being passed on to ``copy_siginfo()``. The
variable "first" was found on a list -- passed in as the second argument to variable ``first`` was found on a list -- passed in as the second argument to
collect_signal(). We continue our journey through the stack, to figure out ``collect_signal()``. We continue our journey through the stack, to figure out
where the item on "list" was allocated or initialized. We move to line 410: where the item on "list" was allocated or initialized. We move to line 410::
395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask,
396 siginfo_t *info) 396 siginfo_t *info)
397 { 397 {
... ...
410 collect_signal(sig, pending, info); 410 collect_signal(sig, pending, info);
... ...
414 } 414 }
Now we need to follow the "pending" pointer, since that is being passed on to Now we need to follow the ``pending`` pointer, since that is being passed on to
collect_signal() as "list". At this point, we've run out of lines from the ``collect_signal()`` as ``list``. At this point, we've run out of lines from the
"addr2line" output. Not to worry, we just paste the next addresses from the "addr2line" output. Not to worry, we just paste the next addresses from the
kmemcheck stack dump, i.e.: kmemcheck stack dump, i.e.::
[<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170
[<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390
...@@ -351,36 +329,36 @@ kmemcheck stack dump, i.e.: ...@@ -351,36 +329,36 @@ kmemcheck stack dump, i.e.:
Remember that since these addresses were found on the stack and not as the Remember that since these addresses were found on the stack and not as the
RIP value, they actually point to the _next_ instruction (they are return RIP value, they actually point to the _next_ instruction (they are return
addresses). This becomes obvious when we look at the code for line 446: addresses). This becomes obvious when we look at the code for line 446::
422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info)
423 { 423 {
... ...
431 signr = __dequeue_signal(&tsk->signal->shared_pending, 431 signr = __dequeue_signal(&tsk->signal->shared_pending,
432 mask, info); 432 mask, info);
433 /* 433 /*
434 * itimer signal ? 434 * itimer signal ?
435 * 435 *
436 * itimers are process shared and we restart periodic 436 * itimers are process shared and we restart periodic
437 * itimers in the signal delivery path to prevent DoS 437 * itimers in the signal delivery path to prevent DoS
438 * attacks in the high resolution timer case. This is 438 * attacks in the high resolution timer case. This is
439 * compliant with the old way of self restarting 439 * compliant with the old way of self restarting
440 * itimers, as the SIGALRM is a legacy signal and only 440 * itimers, as the SIGALRM is a legacy signal and only
441 * queued once. Changing the restart behaviour to 441 * queued once. Changing the restart behaviour to
442 * restart the timer in the signal dequeue path is 442 * restart the timer in the signal dequeue path is
443 * reducing the timer noise on heavy loaded !highres 443 * reducing the timer noise on heavy loaded !highres
444 * systems too. 444 * systems too.
445 */ 445 */
446 if (unlikely(signr == SIGALRM)) { 446 if (unlikely(signr == SIGALRM)) {
... ...
489 } 489 }
So instead of looking at 446, we should be looking at 431, which is the line So instead of looking at 446, we should be looking at 431, which is the line
that executes just before 446. Here we see that what we are looking for is that executes just before 446. Here we see that what we are looking for is
&tsk->signal->shared_pending. ``&tsk->signal->shared_pending``.
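A common way to account for this when passing stack-trace addresses to a tool such as addr2line is to subtract one byte from each return address, so that the lookup falls inside the call instruction rather than on the instruction after it. The sketch below illustrates that adjustment only. The function name is hypothetical, and this is a general symbolization technique, not something the kmemcheck report does for you.

```c
#include <stdint.h>

/*
 * Return the address to hand to a symbolizer.
 *
 * The RIP value of a fault already points at the faulting instruction,
 * so it is used unchanged. Addresses taken from the stack are return
 * addresses and point at the instruction after the call, so one byte
 * is subtracted to land inside the call instruction itself.
 */
uint64_t symbolization_address(uint64_t addr, int is_return_address)
{
    return is_return_address ? addr - 1 : addr;
}
```

For the `dequeue_signal+0x8e` frame above, this would turn `ffffffff8104f04e` into `ffffffff8104f04d`, which should resolve to the line containing the call (431) instead of the line after it (446).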
Our next task is now to figure out which function puts items on this Our next task is now to figure out which function puts items on this
"shared_pending" list. A crude, but efficient tool, is git grep: ``shared_pending`` list. A crude, but efficient tool, is ``git grep``::
$ git grep -n 'shared_pending' kernel/ $ git grep -n 'shared_pending' kernel/
... ...
...@@ -390,109 +368,110 @@ Our next task is now to figure out which function that puts items on this ...@@ -390,109 +368,110 @@ Our next task is now to figure out which function that puts items on this
There were more results, but none of them were related to list operations, There were more results, but none of them were related to list operations,
and these were the only assignments. We inspect the line numbers more closely and these were the only assignments. We inspect the line numbers more closely
and find that this is indeed where items are being added to the list: and find that this is indeed where items are being added to the list::
816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
817 int group) 817 int group)
818 { 818 {
... ...
828 pending = group ? &t->signal->shared_pending : &t->pending; 828 pending = group ? &t->signal->shared_pending : &t->pending;
... ...
851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
852 (is_si_special(info) || 852 (is_si_special(info) ||
853 info->si_code >= 0))); 853 info->si_code >= 0)));
854 if (q) { 854 if (q) {
855 list_add_tail(&q->list, &pending->list); 855 list_add_tail(&q->list, &pending->list);
... ...
890 } 890 }
and: and::
1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
1310 { 1310 {
.... ....
1339 pending = group ? &t->signal->shared_pending : &t->pending; 1339 pending = group ? &t->signal->shared_pending : &t->pending;
1340 list_add_tail(&q->list, &pending->list); 1340 list_add_tail(&q->list, &pending->list);
.... ....
1347 } 1347 }
In the first case, the list element we are looking for, "q", is being returned In the first case, the list element we are looking for, ``q``, is being
from the function __sigqueue_alloc(), which looks like an allocation function. returned from the function ``__sigqueue_alloc()``, which looks like an
Let's take a look at it: allocation function. Let's take a look at it::
187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags,
188 int override_rlimit) 188 int override_rlimit)
189 { 189 {
190 struct sigqueue *q = NULL; 190 struct sigqueue *q = NULL;
191 struct user_struct *user; 191 struct user_struct *user;
192 192
193 /* 193 /*
194 * We won't get problems with the target's UID changing under us 194 * We won't get problems with the target's UID changing under us
195 * because changing it requires RCU be used, and if t != current, the 195 * because changing it requires RCU be used, and if t != current, the
196 * caller must be holding the RCU readlock (by way of a spinlock) and 196 * caller must be holding the RCU readlock (by way of a spinlock) and
197 * we use RCU protection here 197 * we use RCU protection here
198 */ 198 */
199 user = get_uid(__task_cred(t)->user); 199 user = get_uid(__task_cred(t)->user);
200 atomic_inc(&user->sigpending); 200 atomic_inc(&user->sigpending);
201 if (override_rlimit || 201 if (override_rlimit ||
202 atomic_read(&user->sigpending) <= 202 atomic_read(&user->sigpending) <=
203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur)
204 q = kmem_cache_alloc(sigqueue_cachep, flags); 204 q = kmem_cache_alloc(sigqueue_cachep, flags);
205 if (unlikely(q == NULL)) { 205 if (unlikely(q == NULL)) {
206 atomic_dec(&user->sigpending); 206 atomic_dec(&user->sigpending);
207 free_uid(user); 207 free_uid(user);
208 } else { 208 } else {
209 INIT_LIST_HEAD(&q->list); 209 INIT_LIST_HEAD(&q->list);
210 q->flags = 0; 210 q->flags = 0;
211 q->user = user; 211 q->user = user;
212 } 212 }
213 213
214 return q; 214 return q;
215 } 215 }
We see that this function initializes q->list, q->flags, and q->user. It seems We see that this function initializes ``q->list``, ``q->flags``, and
that now is the time to look at the definition of "struct sigqueue", e.g.: ``q->user``. It seems that now is the time to look at the definition of
``struct sigqueue``, e.g.::
14 struct sigqueue {
15 struct list_head list; 14 struct sigqueue {
16 int flags; 15 struct list_head list;
17 siginfo_t info; 16 int flags;
18 struct user_struct *user; 17 siginfo_t info;
19 }; 18 struct user_struct *user;
19 };
And, you might remember, it was a memcpy() on &first->info that caused the
warning, so this makes perfect sense. It also seems reasonable to assume that And, you might remember, it was a ``memcpy()`` on ``&first->info`` that
it is the caller of __sigqueue_alloc() that has the responsibility of filling caused the warning, so this makes perfect sense. It also seems reasonable
out (initializing) this member. to assume that it is the caller of ``__sigqueue_alloc()`` that has the
responsibility of filling out (initializing) this member.
But just which fields of the struct were uninitialized? Let's look at But just which fields of the struct were uninitialized? Let's look at
kmemcheck's report again: kmemcheck's report again::
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024)
80000000000000000000000000000000000000000088ffff0000000000000000 80000000000000000000000000000000000000000088ffff0000000000000000
i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u
^ ^
These first two lines are the memory dump of the memory object itself, and the These first two lines are the memory dump of the memory object itself, and
shadow bytemap, respectively. The memory object itself is in this case the shadow bytemap, respectively. The memory object itself is in this case
&first->info. Just beware that the start of this dump is NOT the start of the ``&first->info``. Just beware that the start of this dump is NOT the start
object itself! The position of the caret (^) corresponds with the address of of the object itself! The position of the caret (^) corresponds with the
the read (ffff88003e4a2024). address of the read (ffff88003e4a2024).
The shadow bytemap dump legend is as follows: The shadow bytemap dump legend is as follows:
i - initialized - i: initialized
u - uninitialized - u: uninitialized
a - unallocated (memory has been allocated by the slab layer, but has not - a: unallocated (memory has been allocated by the slab layer, but has not
yet been handed off to anybody) yet been handed off to anybody)
f - freed (memory has been allocated by the slab layer, but has been freed - f: freed (memory has been allocated by the slab layer, but has been freed
by the previous owner) by the previous owner)
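The legend can be applied mechanically when a report is long or when many reports need to be compared. The following is a small sketch under that assumption. The function names `shadow_state_name()` and `shadow_count()` are made up for illustration and are not kernel interfaces.

```c
#include <stddef.h>

/* Map one shadow-bytemap character to its meaning, per the legend above. */
const char *shadow_state_name(char c)
{
    switch (c) {
    case 'i': return "initialized";
    case 'u': return "uninitialized";
    case 'a': return "unallocated";
    case 'f': return "freed";
    default:  return NULL; /* spaces, carets, anything else */
    }
}

/* Count how many bytes in one shadow line are in the given state. */
size_t shadow_count(const char *line, char state)
{
    size_t n = 0;

    for (; *line != '\0'; line++)
        if (*line == state)
            n++;
    return n;
}
```

Applied to the shadow line of the report above, `shadow_count(line, 'i')` gives 12 and `shadow_count(line, 'u')` gives 20, which together account for the 32 bytes shown in the dump.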
In order to figure out where (relative to the start of the object) the In order to figure out where (relative to the start of the object) the
uninitialized memory was located, we have to look at the disassembly. For uninitialized memory was located, we have to look at the disassembly. For
that, we'll need the RIP address again: that, we'll need the RIP address again::
RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
$ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8:
ffffffff8104edc8: mov %r8,0x8(%r8) ffffffff8104edc8: mov %r8,0x8(%r8)
...@@ -513,36 +492,36 @@ RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 ...@@ -513,36 +492,36 @@ RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190
ffffffff8104edf5: mov %r8,%rdi ffffffff8104edf5: mov %r8,%rdi
ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free>
As expected, it's the "``rep movsl``" instruction from the ``memcpy()``
that causes the warning. We know that ``REP MOVSL`` uses the register
``RCX`` to count the number of remaining iterations. By taking a look at the
register dump again (from the kmemcheck report), we can figure out how many
iterations were left::

    RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009

By looking at the disassembly, we also see that ``%ecx`` is being loaded
with the value ``$0xc`` just before (ffffffff8104edd8), so we are very
lucky. Keep in mind that this is the number of iterations, not bytes. And
since this is a "long" operation, we need to multiply by 4 to get the
number of bytes. So this means that the uninitialized value was encountered
at 4 * (0xc - 0x9) = 12 bytes from the start of the object.

We can now try to figure out which field of ``struct siginfo`` was not
initialized. This is the beginning of the struct::

    40 typedef struct siginfo {
    41         int si_signo;
    42         int si_errno;
    43         int si_code;
    44
    45         union {
    ..
    92         } _sifields;
    93 } siginfo_t;

On 64-bit, an int is 4 bytes long, so the three leading ints cover offsets 0
to 11, and it must be the union member that has not been initialized. We can
verify this using gdb::

    $ gdb vmlinux
    ...
    $1 = (union {...} *) 0x10

Actually, it seems that the union member is located at offset 0x10 -- which
means that gcc has inserted 4 bytes of padding between the members ``si_code``
and ``_sifields``. We can now get a fuller picture of the memory dump::

             _----------------------------=> si_code
            /        _--------------------=> (padding)
           |        /        _------------=> _sifields(._kill._pid)
           |       |        /        _----=> _sifields(._kill._uid)
           |       |       |        /
    -------|-------|-------|-------|
    80000000000000000000000000000000000000000088ffff0000000000000000
     i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u

This allows us to realize another important fact: ``si_code`` contains the
value 0x80. Remember that x86 is little endian, so the first 4 bytes
"80000000" are really the number 0x00000080. With a bit of research, we
find that this is actually the constant ``SI_KERNEL`` defined in
``include/asm-generic/siginfo.h``::

    144 #define SI_KERNEL       0x80            /* sent by the kernel from somewhere */

The ``SI_KERNEL`` macro is used in exactly one place in the x86 kernel: in
``send_signal()`` in ``kernel/signal.c``::

    816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
    817                         int group)
    818 {
    ...
    828         pending = group ? &t->signal->shared_pending : &t->pending;
    ...
    851         q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN &&
    852                                              (is_si_special(info) ||
    853                                               info->si_code >= 0)));
    854         if (q) {
    855                 list_add_tail(&q->list, &pending->list);
    856                 switch ((unsigned long) info) {
    ...
    865                 case (unsigned long) SEND_SIG_PRIV:
    866                         q->info.si_signo = sig;
    867                         q->info.si_errno = 0;
    868                         q->info.si_code = SI_KERNEL;
    869                         q->info.si_pid = 0;
    870                         q->info.si_uid = 0;
    871                         break;
    ...
    890 }

Not only does this match with the ``.si_code`` member, it also matches the place
we found earlier when looking for where ``siginfo_t`` objects are enqueued on the
``shared_pending`` list.

So to sum up: It seems that it is the padding introduced by the compiler
between two struct fields that is uninitialized, and this gets reported when
we do a ``memcpy()`` on the struct. This means that we have identified a false
positive warning.

Normally, kmemcheck will not report uninitialized accesses in ``memcpy()`` calls
when both the source and destination addresses are tracked. (Instead, we copy
the shadow bytemap as well). In this case, the destination address clearly
was not tracked. We can dig a little deeper into the stack trace from above::

    arch/x86/kernel/signal.c:805
    arch/x86/kernel/signal.c:871
    arch/x86/kernel/entry_64.S:694

And we clearly see that the destination siginfo object is located on the
stack::

    782 static void do_signal(struct pt_regs *regs)
    783 {
    784         struct k_sigaction ka;
    785         siginfo_t info;
    ...
    804         signr = get_signal_to_deliver(&info, &ka, regs, NULL);
    ...
    854 }

And this ``&info`` is what eventually gets passed to ``copy_siginfo()`` as the
destination argument.

Now, even though we didn't find an actual error here, the example is still a
good one, because it shows how one would go about finding out what the report
was all about.


Annotating false positives
~~~~~~~~~~~~~~~~~~~~~~~~~~

There are a few different ways to make annotations in the source code that
will keep kmemcheck from checking and reporting certain allocations. Here
they are:

- ``__GFP_NOTRACK_FALSE_POSITIVE``

  This flag can be passed to ``kmalloc()`` or ``kmem_cache_alloc()``
  (therefore also to other functions that end up calling one of
  these) to indicate that the allocation should not be tracked
  because it would lead to a false positive report. This is a "big
  hammer" way of silencing kmemcheck; after all, even if the false
  positive pertains to a particular field in a struct, for example, we
  will now lose the ability to find (real) errors in other parts of
  the same struct.

  Example::

      /* No warnings will ever trigger on accessing any part of x */
      x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE);

- ``kmemcheck_bitfield_begin(name)``/``kmemcheck_bitfield_end(name)`` and
  ``kmemcheck_annotate_bitfield(ptr, name)``

  The first two of these three macros can be used inside struct
  definitions to signal, respectively, the beginning and end of a
  bitfield. Additionally, this will assign the bitfield a name, which
  ...
  kmemcheck_annotate_bitfield() at the point of allocation, to indicate
  which parts of the allocation are part of a bitfield.

  Example::

      struct foo {
              int x;
      ...
      /* No warnings will trigger on accessing the bitfield of x */
      kmemcheck_annotate_bitfield(x, flags);

  Note that ``kmemcheck_annotate_bitfield()`` can be used even before the
  return value of ``kmalloc()`` is checked -- in other words, passing NULL
  as the first argument is legal (and will do nothing).


Reporting errors
----------------

As we have seen, kmemcheck will produce false positive reports. Therefore, it
is not very wise to blindly post kmemcheck warnings to mailing lists and
...
available) are of course a great help too.

Happy hacking!


Technical description
---------------------

kmemcheck works by marking memory pages non-present. This means that whenever
somebody attempts to access the page, a page fault is generated. The page
......
@@ -21,3 +21,4 @@ whole; patches welcome!
    kasan
    ubsan
    kmemleak
+   kmemcheck
@@ -6803,7 +6803,7 @@ KMEMCHECK
 M: Vegard Nossum <vegardno@ifi.uio.no>
 M: Pekka Enberg <penberg@kernel.org>
 S: Maintained
-F: Documentation/kmemcheck.txt
+F: Documentation/dev-tools/kmemcheck.rst
 F: arch/x86/include/asm/kmemcheck.h
 F: arch/x86/mm/kmemcheck/
 F: include/linux/kmemcheck.h
......