perf lock contention: Do not try to update if hash map is full

It doesn't delete data in the task_data and lock_stat maps.  The data is
kept there until it's consumed by userspace at the end.  But it calls
bpf_map_update_elem() again and again, and the new data is discarded
once the map is full.  This is not good.

Worse, bpf_map_update_elem() keeps trying to get a new node even if the
map is already full.  I guess that makes sense if some nodes get deleted,
like in the tstamp map (that's why I didn't make the change there).

In a pre-allocated hash map, that means it'd iterate over all CPUs to
check the freelist.  And it has a bad performance impact on large
machines.

I've checked it on my 64 CPU machine with this.

  $ perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 2.825 [sec]

And I used the task mode, so that it can guarantee the map is full.
The default map size is 16K entries and this workload has 40K tasks.

Before:
  $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 11.299 [sec]

   contended   total wait     max wait     avg wait          pid   comm

       19284       3.51 s      3.70 ms    181.91 us      1305863   sched-messaging
         243     84.09 ms    466.67 us    346.04 us      1336608   sched-messaging
         177     66.35 ms     12.08 ms    374.88 us      1220416   node

For some reason, it didn't report the data failures.  But you can see
that the total time of the workload increased a lot (2.8 -> 11.3).
If it fails early when the map is full, it goes back to normal.

After:
  $ sudo ./perf lock con -abt -E3 -- perf bench sched messaging -g 1000
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 1000 groups == 40000 processes run

       Total time: 3.044 [sec]

   contended   total wait     max wait     avg wait          pid   comm

       18743    591.92 ms    442.96 us     31.58 us      1431454   sched-messaging
          51    210.64 ms    207.45 ms      4.13 ms      1468724   sched-messaging
          81     68.61 ms     65.79 ms    847.07 us      1463183   sched-messaging

  === output for debug ===

  bad: 1164137, total: 2253341
  bad rate: 51.66 %
  histogram of failure reasons
         task: 0
        stack: 0
         time: 0
         data: 1164137

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Hao Luo <haoluo@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: bpf@vger.kernel.org
Link: https://lore.kernel.org/r/20230406210611.1622492-2-namhyung@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

222de5e5 · Namhyung Kim · Arnaldo Carvalho de Melo · 0fba2265 · 222de5e5
Commit 222de5e5 authored Apr 06, 2023 by Namhyung Kim Committed by Arnaldo Carvalho de Melo Apr 06, 2023

Showing with 19 additions and 3 deletions

tools/perf/util/bpf_skel/lock_contention.bpf.c +19 -3

--- a/tools/perf/util/bpf_skel/lock_contention.bpf.c
+++ b/tools/perf/util/bpf_skel/lock_contention.bpf.c
@@ -4,6 +4,7 @@
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_tracing.h>
 #include <bpf/bpf_core_read.h>
+#include <asm-generic/errno-base.h>

 #include "lock_data.h"

@@ -126,6 +127,9 @@ int stack_fail;
 int time_fail;
 int data_fail;

+int task_map_full;
+int data_map_full;
+
 static inline int can_record(u64 *ctx)
 {
 	if (has_cpu) {
@@ -177,11 +181,12 @@ static inline int update_task_data(struct task_struct *task)
 		return -1;

 	p = bpf_map_lookup_elem(&task_data, &pid);
-	if (p == NULL) {
+	if (p == NULL && !task_map_full) {
 		struct contention_task_data data = {};

 		BPF_CORE_READ_STR_INTO(&data.comm, task, comm);
-		bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST);
+		if (bpf_map_update_elem(&task_data, &pid, &data, BPF_NOEXIST) == -E2BIG)
+			task_map_full = 1;
 	}

 	return 0;
@@ -370,6 +375,12 @@ int contention_end(u64 *ctx)

 	data = bpf_map_lookup_elem(&lock_stat, &key);
 	if (!data) {
+		if (data_map_full) {
+			bpf_map_delete_elem(&tstamp, &pid);
+			__sync_fetch_and_add(&data_fail, 1);
+			return 0;
+		}
+
 		struct contention_data first = {
 			.total_time = duration,
 			.max_time = duration,
@@ -377,12 +388,17 @@ int contention_end(u64 *ctx)
 			.count = 1,
 			.flags = pelem->flags,
 		};
+		int err;

 		if (aggr_mode == LOCK_AGGR_ADDR)
 			first.flags |= check_lock_type(pelem->lock, pelem->flags);

-		if (bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST) < 0)
+		err = bpf_map_update_elem(&lock_stat, &key, &first, BPF_NOEXIST);
+		if (err < 0) {
+			if (err == -E2BIG)
+				data_map_full = 1;
 			__sync_fetch_and_add(&data_fail, 1);
+		}
 		bpf_map_delete_elem(&tstamp, &pid);
 		return 0;
 	}
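
The fail-early idea in the patch can be illustrated outside of BPF with a small userspace sketch. Everything below is hypothetical and only for illustration: toy_map, toy_map_update(), record() and run_demo() are made-up names, the capacity of 4 stands in for the real map size, and the linear lookup is not how a BPF hash map works. The only things taken from the patch are the sticky "map full" flag and the check for -E2BIG.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAP_CAPACITY 4	/* hypothetical; far smaller than a real map */

/* Toy stand-in for an insert-only map whose entries are never deleted. */
struct toy_map {
	int keys[TOY_MAP_CAPACITY];
	size_t used;
	unsigned long update_calls;	/* times the (costly) update path ran */
};

static bool toy_map_lookup(const struct toy_map *m, int key)
{
	for (size_t i = 0; i < m->used; i++)
		if (m->keys[i] == key)
			return true;
	return false;
}

/* Insert a new key; returns -E2BIG when no free slot is left. */
static int toy_map_update(struct toy_map *m, int key)
{
	m->update_calls++;
	if (m->used == TOY_MAP_CAPACITY)
		return -E2BIG;
	m->keys[m->used++] = key;
	return 0;
}

/* Sticky flag, playing the role of task_map_full / data_map_full. */
static int map_full;
static unsigned long dropped;

static void record(struct toy_map *m, int key)
{
	if (toy_map_lookup(m, key))
		return;			/* already tracked */

	if (map_full) {			/* fail early: skip the update call */
		dropped++;
		return;
	}

	if (toy_map_update(m, key) == -E2BIG) {
		map_full = 1;		/* remember it for later calls */
		dropped++;
	}
}

/* Feed nr_keys distinct keys; return how often the update path ran. */
static unsigned long run_demo(int nr_keys)
{
	struct toy_map m = { .used = 0 };

	map_full = 0;
	dropped = 0;
	for (int key = 0; key < nr_keys; key++)
		record(&m, key);

	return m.update_calls;
}
```

With 10 distinct keys and a capacity of 4, the update path runs only 5 times (4 successful inserts plus the one failure that sets the flag), and the remaining 5 keys are dropped without touching the map again. Without the flag, every one of the 10 keys would go through the update path.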