Commit dd140c87 authored by Ingo Molnar, committed by Jeff Garzik

[PATCH] smptimers, old BH removal, tq-cleanup

This is the smptimers patch plus the removal of old BHs and a rewrite of
task-queue handling.

Basically with the removal of TIMER_BH i think the time is right to get
rid of old BHs forever, and to do a massive cleanup of all related
fields.  The following five basic 'execution context' abstractions are
supported by the kernel:

  - hardirq
  - softirq
  - tasklet
  - keventd-driven task-queues
  - process contexts

I've done the following cleanups/simplifications to task-queues:

 - removed the ability to define your own task-queue; what remains is the
   ability to schedule_task() a given task to keventd, and to flush all
   pending tasks.

This is actually a quite easy transition, since 90% of all task-queue
users in the kernel used BH_IMMEDIATE - which is very similar in
functionality to keventd.
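
To illustrate the reduced interface described above, here is a minimal
userspace sketch. It is only a model under stated assumptions: the names
schedule_task(), flush_scheduled_tasks() and struct tq_struct mirror the
patch, but the bodies, the singly linked queue and the count_call() helper
are hypothetical. In the kernel the tasks run later in keventd's process
context, and the flush waits for keventd rather than running them inline.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: names mirror the patch, bodies are hypothetical. */
struct tq_struct {
	struct tq_struct *next;   /* singly linked here; the kernel uses list_head */
	unsigned long sync;       /* nonzero while the task is queued */
	void (*routine)(void *);  /* function to call */
	void *data;               /* argument to function */
};

static struct tq_struct *pending_head;                  /* stand-in for keventd's queue */
static struct tq_struct **pending_tail = &pending_head;

/* Queue a task once; returns nonzero if it was newly added. */
static int schedule_task(struct tq_struct *task)
{
	if (task->sync)
		return 0;         /* already pending, models test_and_set_bit() */
	task->sync = 1;
	task->next = NULL;
	*pending_tail = task;
	pending_tail = &task->next;
	return 1;
}

/* Run everything queued so far, in FIFO order. */
static void flush_scheduled_tasks(void)
{
	struct tq_struct *p = pending_head;

	/* Detach the whole list first so handlers can queue new work. */
	pending_head = NULL;
	pending_tail = &pending_head;

	while (p) {
		struct tq_struct *next = p->next;
		void (*f)(void *) = p->routine;
		void *data = p->data;

		p->sync = 0;      /* cleared before the call so f() may requeue */
		if (f)
			f(data);
		p = next;
	}
}

/* Hypothetical handler used for illustration: counts its invocations. */
static void count_call(void *data)
{
	++*(int *)data;
}
```

A caller that used to do queue_task(&tq, &tq_immediate) followed by
mark_bh(IMMEDIATE_BH) would, in this model, just call schedule_task(&tq),
and call flush_scheduled_tasks() where it used to run the queue by hand.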

I believe task-queues should not be removed from the kernel altogether.
It's true that they were written as a candidate replacement for BHs
originally, but they do make sense in a different way: it's perhaps the
easiest interface to do deferred processing from IRQ context, in
performance-uncritical code areas.  They are easier to use than
tasklets.

code that cares about performance should convert to tasklets - as the
timer code and the serial subsystem have done already. For extreme
performance softirqs should be used - the net subsystem does this.

and we can do this for 2.6 - there are only a couple of areas left after
fixing all the BH_IMMEDIATE places.

i have moved all the taskqueue handling code into kernel/context.c, and
only kept the basic 'queue a task' definitions in include/linux/tqueue.h.
I've converted three of the most commonly used BH_IMMEDIATE users:
tty_io.c, floppy.c and random.c. [random.c might need more thought
though.]

i've also cleaned up kernel/timer.c over that of the stock smptimers
patch: privatized the timer-vec definitions (nothing needs it,
init_timer() used it mistakenly) and cleaned up the code. Plus i've moved
some code around that does not belong in timer.c, and within timer.c
i've organized data and functions along functionality and further
separated the base timer code from the NTP bits.

net_bh_lock: i have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?

scalable timers: i've further improved the patch ported to 2.5 by wli and
Dipankar. There is only one pending issue i can see, the question of
whether to migrate timers in mod_timer() or not. I'm quite convinced that
they should be migrated, but i might be wrong. It's a 10-line change to
switch between migrating and non-migrating timers, we can do performance
tests later on. The current, more complex migration code is pretty fast
and has been stable under extremely high networking loads in the past 2
years, so we can immediately switch to the simpler variant if someone
proves it improves performance. (I'd say if non-migrating timers improve
Apache performance on one of the bigger NUMA boxes then the point is
proven, no further thought will be needed.)
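
The migration question can be pictured with a small userspace model. This
is a sketch under stated assumptions, not the code in kernel/timer.c: the
names model_timer, timer_base, bases[] and model_mod_timer() are
hypothetical, and the per-CPU base is reduced to a counter. With migration,
modifying a pending timer moves it to the base of the CPU doing the
modification; without it, the timer stays on the base it was already on.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 2  /* hypothetical CPU count for this model */

/* Stand-in for a per-CPU timer base; only counts the timers it holds. */
struct timer_base {
	int nr_timers;
};

struct model_timer {
	unsigned long expires;
	struct timer_base *base;  /* NULL while the timer is not pending */
};

static struct timer_base bases[NR_CPUS];

/*
 * Update the expiry of a timer on behalf of CPU 'this_cpu'.
 * Returns 1 if the timer was already pending, 0 otherwise.
 * 'migrate' selects between the two policies discussed above.
 */
static int model_mod_timer(struct model_timer *t, unsigned long expires,
			   int this_cpu, int migrate)
{
	int was_pending = (t->base != NULL);
	struct timer_base *new_base = &bases[this_cpu];

	if (was_pending) {
		if (!migrate)
			new_base = t->base;  /* stay on the old CPU's base */
		t->base->nr_timers--;        /* detach from the old base */
	}

	t->expires = expires;
	t->base = new_base;
	new_base->nr_timers++;               /* attach to the chosen base */
	return was_pending;
}
```

In this model the difference between the two policies is the single
'if (!migrate)' branch, which gives a feel for why switching between them
is a small change.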
parent 5a5ec729
......@@ -99,18 +99,14 @@ int __verify_write(const void * addr, unsigned long size)
goto bad_area;
}
extern spinlock_t timerlist_lock;
/*
* Unlock any spinlocks which will prevent us from getting the
* message out (timerlist_lock is acquired through the
* console unblank code)
* message out
*/
void bust_spinlocks(int yes)
{
int loglevel_save = console_loglevel;
spin_lock_init(&timerlist_lock);
if (yes) {
oops_in_progress = 1;
return;
......
......@@ -1009,8 +1009,7 @@ static struct tq_struct floppy_tq;
static void schedule_bh( void (*handler)(void*) )
{
floppy_tq.routine = (void *)(void *) handler;
queue_task(&floppy_tq, &tq_immediate);
mark_bh(IMMEDIATE_BH);
schedule_task(&floppy_tq);
}
static struct timer_list fd_timer;
......@@ -4361,7 +4360,7 @@ int __init floppy_init(void)
if (have_no_fdc)
{
DPRINT("no floppy controllers found\n");
run_task_queue(&tq_immediate);
flush_scheduled_tasks();
if (usage_count)
floppy_release_irq_and_dma();
blk_cleanup_queue(BLK_DEFAULT_QUEUE(MAJOR_NR));
......
......@@ -649,7 +649,7 @@ static int __init batch_entropy_init(int size, struct entropy_store *r)
* Changes to the entropy data is put into a queue rather than being added to
* the entropy counts directly. This is presumably to avoid doing heavy
* hashing calculations during an interrupt in add_timer_randomness().
* Instead, the entropy is only added to the pool once per timer tick.
* Instead, the entropy is only added to the pool by keventd.
*/
void batch_entropy_store(u32 a, u32 b, int num)
{
......@@ -664,7 +664,8 @@ void batch_entropy_store(u32 a, u32 b, int num)
new = (batch_head+1) & (batch_max-1);
if (new != batch_tail) {
queue_task(&batch_tqueue, &tq_timer);
// FIXME: is this correct?
schedule_task(&batch_tqueue);
batch_head = new;
} else {
DEBUG_ENT("batch entropy buffer full\n");
......
......@@ -1265,7 +1265,6 @@ static void release_dev(struct file * filp)
/*
* Make sure that the tty's task queue isn't activated.
*/
run_task_queue(&tq_timer);
flush_scheduled_tasks();
/*
......@@ -1876,7 +1875,6 @@ static void __do_SAK(void *arg)
/*
* The tq handling here is a little racy - tty->SAK_tq may already be queued.
* But there's no mechanism to fix that without futzing with tqueue_lock.
* Fortunately we don't need to worry, because if ->SAK_tq is already queued,
* the values which we write to it will be identical to the values which it
* already has. --akpm
......@@ -1902,7 +1900,7 @@ static void flush_to_ldisc(void *private_)
unsigned long flags;
if (test_bit(TTY_DONT_FLIP, &tty->flags)) {
queue_task(&tty->flip.tqueue, &tq_timer);
schedule_task(&tty->flip.tqueue);
return;
}
if (tty->flip.buf_num) {
......@@ -1979,7 +1977,7 @@ void tty_flip_buffer_push(struct tty_struct *tty)
if (tty->low_latency)
flush_to_ldisc((void *) tty);
else
queue_task(&tty->flip.tqueue, &tq_timer);
schedule_task(&tty->flip.tqueue);
}
/*
......
......@@ -1210,9 +1210,6 @@ static void speedo_timer(unsigned long data)
/* We must continue to monitor the media. */
sp->timer.expires = RUN_AT(2*HZ); /* 2.0 sec. */
add_timer(&sp->timer);
#if defined(timer_exit)
timer_exit(&sp->timer);
#endif
}
static void speedo_show_state(struct net_device *dev)
......
......@@ -25,6 +25,9 @@ static LIST_HEAD(free_list);
/* public *and* exported. Not pretty! */
spinlock_t files_lock = SPIN_LOCK_UNLOCKED;
/* file version */
unsigned long event;
/* Find an unused file structure and return a pointer to it.
* Returns NULL, if there are no more free file structures or
* we run out of memory.
......
......@@ -22,25 +22,6 @@ struct irqaction {
struct irqaction *next;
};
/* Who gets which entry in bh_base. Things which will occur most often
should come first */
enum {
TIMER_BH = 0,
TQUEUE_BH = 1,
DIGI_BH = 2,
SERIAL_BH = 3,
RISCOM8_BH = 4,
SPECIALIX_BH = 5,
AURORA_BH = 6,
ESP_BH = 7,
IMMEDIATE_BH = 9,
CYCLADES_BH = 10,
MACSERIAL_BH = 13,
ISICOM_BH = 14
};
#include <asm/hardirq.h>
#include <asm/softirq.h>
......@@ -218,23 +199,6 @@ static void name (unsigned long dummy) \
#endif /* CONFIG_SMP */
/* Old BH definitions */
extern struct tasklet_struct bh_task_vec[];
/* It is exported _ONLY_ for wait_on_irq(). */
extern spinlock_t global_bh_lock;
static inline void mark_bh(int nr)
{
tasklet_hi_schedule(bh_task_vec+nr);
}
extern void init_bh(int nr, void (*routine)(void));
extern void remove_bh(int nr);
/*
* Autoprobing for irqs:
*
......
......@@ -172,7 +172,6 @@ extern unsigned long cache_decay_ticks;
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);
extern void flush_scheduled_tasks(void);
extern int start_context_thread(void);
extern int current_is_keventd(void);
......
......@@ -2,11 +2,15 @@
#define _LINUX_TIMER_H
#include <linux/config.h>
#include <linux/smp.h>
#include <linux/stddef.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/cache.h>
struct tvec_t_base_s;
/*
* In Linux 2.4, static timers have been removed from the kernel.
* Timers may be dynamically created and destroyed, and should be initialized
* by a call to init_timer() upon creation.
*
......@@ -14,22 +18,31 @@
* timeouts. You can use this field to distinguish between the different
* invocations.
*/
struct timer_list {
typedef struct timer_list {
struct list_head list;
unsigned long expires;
unsigned long data;
void (*function)(unsigned long);
};
extern void add_timer(struct timer_list * timer);
extern int del_timer(struct timer_list * timer);
struct tvec_t_base_s *base;
} timer_t;
extern void add_timer(timer_t * timer);
extern int del_timer(timer_t * timer);
#ifdef CONFIG_SMP
extern int del_timer_sync(struct timer_list * timer);
extern int del_timer_sync(timer_t * timer);
extern void sync_timers(void);
#define timer_enter(base, t) do { base->running_timer = t; mb(); } while (0)
#define timer_exit(base) do { base->running_timer = NULL; } while (0)
#define timer_is_running(base,t) (base->running_timer == t)
#define timer_synchronize(base,t) while (timer_is_running(base,t)) barrier()
#else
#define del_timer_sync(t) del_timer(t)
#define sync_timers() do { } while (0)
#define timer_enter(base,t) do { } while (0)
#define timer_exit(base) do { } while (0)
#endif
/*
* mod_timer is a more efficient way to update the expire field of an
* active timer (if the timer is inactive it will be activated)
......@@ -37,16 +50,20 @@ extern int del_timer_sync(struct timer_list * timer);
* If the timer is known to be not pending (ie, in the handler), mod_timer
* is less efficient than a->expires = b; add_timer(a).
*/
int mod_timer(struct timer_list *timer, unsigned long expires);
int mod_timer(timer_t *timer, unsigned long expires);
extern void it_real_fn(unsigned long);
static inline void init_timer(struct timer_list * timer)
extern void init_timers(void);
extern void run_local_timers(void);
static inline void init_timer(timer_t * timer)
{
timer->list.next = timer->list.prev = NULL;
timer->base = NULL;
}
static inline int timer_pending (const struct timer_list * timer)
static inline int timer_pending(const timer_t * timer)
{
return timer->list.next != NULL;
}
......
/*
* tqueue.h --- task queue handling for Linux.
*
* Mostly based on a proposed bottom-half replacement code written by
* Kai Petzke, wpp@marie.physik.tu-berlin.de.
* Modified version of previous incarnations of task-queues,
* written by:
*
* (C) 1994 Kai Petzke, wpp@marie.physik.tu-berlin.de
* Modified for use in the Linux kernel by Theodore Ts'o,
* tytso@mit.edu. Any bugs are my fault, not Kai's.
*
* The original comment follows below.
* tytso@mit.edu.
*/
#ifndef _LINUX_TQUEUE_H
......@@ -18,25 +17,8 @@
#include <linux/bitops.h>
#include <asm/system.h>
/*
* New proposed "bottom half" handlers:
* (C) 1994 Kai Petzke, wpp@marie.physik.tu-berlin.de
*
* Advantages:
* - Bottom halfs are implemented as a linked list. You can have as many
* of them, as you want.
* - No more scanning of a bit field is required upon call of a bottom half.
* - Support for chained bottom half lists. The run_task_queue() function can be
* used as a bottom half handler. This is for example useful for bottom
* halfs, which want to be delayed until the next clock tick.
*
* Notes:
* - Bottom halfs are called in the reverse order that they were linked into
* the list.
*/
struct tq_struct {
struct list_head list; /* linked list of active bh's */
struct list_head list; /* linked list of active tq's */
unsigned long sync; /* must be initialized to zero */
void (*routine)(void *); /* function to call */
void *data; /* argument to function */
......@@ -61,68 +43,13 @@ struct tq_struct {
PREPARE_TQUEUE((_tq), (_routine), (_data)); \
} while (0)
typedef struct list_head task_queue;
#define DECLARE_TASK_QUEUE(q) LIST_HEAD(q)
#define TQ_ACTIVE(q) (!list_empty(&q))
extern task_queue tq_timer, tq_immediate;
/*
* To implement your own list of active bottom halfs, use the following
* two definitions:
*
* DECLARE_TASK_QUEUE(my_tqueue);
* struct tq_struct my_task = {
* routine: (void (*)(void *)) my_routine,
* data: &my_data
* };
*
* To activate a bottom half on a list, use:
*
* queue_task(&my_task, &my_tqueue);
*
* To later run the queued tasks use
*
* run_task_queue(&my_tqueue);
*
* This allows you to do deferred processing. For example, you could
* have a task queue called tq_timer, which is executed within the timer
* interrupt.
*/
extern spinlock_t tqueue_lock;
/*
* Queue a task on a tq. Return non-zero if it was successfully
* added.
*/
static inline int queue_task(struct tq_struct *bh_pointer, task_queue *bh_list)
{
int ret = 0;
if (!test_and_set_bit(0,&bh_pointer->sync)) {
unsigned long flags;
spin_lock_irqsave(&tqueue_lock, flags);
list_add_tail(&bh_pointer->list, bh_list);
spin_unlock_irqrestore(&tqueue_lock, flags);
ret = 1;
}
return ret;
}
/* Schedule a tq to run in process context */
extern int schedule_task(struct tq_struct *task);
/*
* Call all "bottom halfs" on a given list.
*/
extern void __run_task_queue(task_queue *list);
/* finish all currently pending tasks - do not call from irq context */
extern void flush_scheduled_tasks(void);
static inline void run_task_queue(task_queue *list)
{
if (TQ_ACTIVE(*list))
__run_task_queue(list);
}
#endif
#endif /* _LINUX_TQUEUE_H */
......@@ -19,7 +19,7 @@ _INLINE_ void tty_insert_flip_char(struct tty_struct *tty,
_INLINE_ void tty_schedule_flip(struct tty_struct *tty)
{
queue_task(&tty->flip.tqueue, &tq_timer);
schedule_task(&tty->flip.tqueue);
}
#undef _INLINE_
......
......@@ -28,6 +28,60 @@ static DECLARE_WAIT_QUEUE_HEAD(context_task_done);
static int keventd_running;
static struct task_struct *keventd_task;
static spinlock_t tqueue_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
typedef struct list_head task_queue;
/*
* Queue a task on a tq. Return non-zero if it was successfully
* added.
*/
static inline int queue_task(struct tq_struct *tq, task_queue *list)
{
int ret = 0;
unsigned long flags;
if (!test_and_set_bit(0, &tq->sync)) {
spin_lock_irqsave(&tqueue_lock, flags);
list_add_tail(&tq->list, list);
spin_unlock_irqrestore(&tqueue_lock, flags);
ret = 1;
}
return ret;
}
#define TQ_ACTIVE(q) (!list_empty(&q))
static inline void run_task_queue(task_queue *list)
{
struct list_head head, *next;
unsigned long flags;
if (!TQ_ACTIVE(*list))
return;
spin_lock_irqsave(&tqueue_lock, flags);
list_add(&head, list);
list_del_init(list);
spin_unlock_irqrestore(&tqueue_lock, flags);
next = head.next;
while (next != &head) {
void (*f) (void *);
struct tq_struct *p;
void *data;
p = list_entry(next, struct tq_struct, list);
next = next->next;
f = p->routine;
data = p->data;
wmb();
p->sync = 0;
if (f)
f(data);
}
}
static int need_keventd(const char *who)
{
if (keventd_running == 0)
......
......@@ -420,12 +420,9 @@ EXPORT_SYMBOL(probe_irq_off);
EXPORT_SYMBOL(del_timer_sync);
#endif
EXPORT_SYMBOL(mod_timer);
EXPORT_SYMBOL(tq_timer);
EXPORT_SYMBOL(tq_immediate);
EXPORT_SYMBOL(tvec_bases);
#ifdef CONFIG_SMP
/* Various random spinlocks we want to export */
EXPORT_SYMBOL(tqueue_lock);
/* Big-Reader lock implementation */
EXPORT_SYMBOL(__brlock_array);
......
......@@ -29,6 +29,7 @@
#include <linux/security.h>
#include <linux/notifier.h>
#include <linux/delay.h>
#include <linux/timer.h>
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
......@@ -860,6 +861,7 @@ void scheduler_tick(int user_ticks, int sys_ticks)
runqueue_t *rq = this_rq();
task_t *p = current;
run_local_timers();
if (p == rq->idle) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
......@@ -2101,10 +2103,7 @@ __init int migration_init(void)
spinlock_t kernel_flag __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
#endif
extern void init_timervecs(void);
extern void timer_bh(void);
extern void tqueue_bh(void);
extern void immediate_bh(void);
extern void init_timers(void);
void __init sched_init(void)
{
......@@ -2140,10 +2139,7 @@ void __init sched_init(void)
set_task_cpu(current, smp_processor_id());
wake_up_process(current);
init_timervecs();
init_bh(TIMER_BH, timer_bh);
init_bh(TQUEUE_BH, tqueue_bh);
init_bh(IMMEDIATE_BH, immediate_bh);
init_timers();
/*
* The boot idle thread does lazy MMU switching as well:
......
......@@ -3,21 +3,15 @@
*
* Copyright (C) 1992 Linus Torvalds
*
* Fixed a disable_bh()/enable_bh() race (was causing a console lockup)
* due bh_mask_count not atomic handling. Copyright (C) 1998 Andrea Arcangeli
*
* Rewritten. Old one was good in 2.2, but in 2.3 it was immoral. --ANK (990903)
*/
#include <linux/config.h>
#include <linux/mm.h>
#include <linux/kernel_stat.h>
#include <linux/interrupt.h>
#include <linux/smp_lock.h>
#include <linux/init.h>
#include <linux/tqueue.h>
#include <linux/percpu.h>
#include <linux/notifier.h>
#include <linux/percpu.h>
#include <linux/init.h>
#include <linux/mm.h>
/*
- No shared variables, all the data are CPU local.
......@@ -35,7 +29,6 @@
it is logically serialized per device, but this serialization
is invisible to common code.
- Tasklets: serialized wrt itself.
- Bottom halves: globally serialized, grr...
*/
irq_cpustat_t irq_stat[NR_CPUS];
......@@ -115,10 +108,10 @@ inline void cpu_raise_softirq(unsigned int cpu, unsigned int nr)
__cpu_raise_softirq(cpu, nr);
/*
* If we're in an interrupt or bh, we're done
* (this also catches bh-disabled code). We will
* If we're in an interrupt or softirq, we're done
* (this also catches softirq-disabled code). We will
* actually run the softirq once we return from
* the irq or bh.
* the irq or softirq.
*
* Otherwise we wake up ksoftirqd to make sure we
* schedule the softirq soon.
......@@ -267,91 +260,12 @@ void tasklet_kill(struct tasklet_struct *t)
clear_bit(TASKLET_STATE_SCHED, &t->state);
}
/* Old style BHs */
static void (*bh_base[32])(void);
struct tasklet_struct bh_task_vec[32];
/* BHs are serialized by spinlock global_bh_lock.
It is still possible to make synchronize_bh() as
spin_unlock_wait(&global_bh_lock). This operation is not used
by kernel now, so that this lock is not made private only
due to wait_on_irq().
It can be removed only after auditing all the BHs.
*/
spinlock_t global_bh_lock = SPIN_LOCK_UNLOCKED;
static void bh_action(unsigned long nr)
{
if (!spin_trylock(&global_bh_lock))
goto resched;
if (bh_base[nr])
bh_base[nr]();
hardirq_endlock();
spin_unlock(&global_bh_lock);
return;
spin_unlock(&global_bh_lock);
resched:
mark_bh(nr);
}
void init_bh(int nr, void (*routine)(void))
{
bh_base[nr] = routine;
mb();
}
void remove_bh(int nr)
{
tasklet_kill(bh_task_vec+nr);
bh_base[nr] = NULL;
}
void __init softirq_init()
{
int i;
for (i=0; i<32; i++)
tasklet_init(bh_task_vec+i, bh_action, i);
open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
}
void __run_task_queue(task_queue *list)
{
struct list_head head, *next;
unsigned long flags;
spin_lock_irqsave(&tqueue_lock, flags);
list_add(&head, list);
list_del_init(list);
spin_unlock_irqrestore(&tqueue_lock, flags);
next = head.next;
while (next != &head) {
void (*f) (void *);
struct tq_struct *p;
void *data;
p = list_entry(next, struct tq_struct, list);
next = next->next;
f = p->routine;
data = p->data;
wmb();
p->sync = 0;
if (f)
f(data);
}
}
static int ksoftirqd(void * __bind_cpu)
{
int cpu = (int) (long) __bind_cpu;
......
......@@ -14,11 +14,9 @@
#include <linux/wait.h>
#include <linux/vt_kern.h>
extern spinlock_t timerlist_lock;
void bust_spinlocks(int yes)
{
spin_lock_init(&timerlist_lock);
if (yes) {
oops_in_progress = 1;
} else {
......
......@@ -1296,7 +1296,6 @@ int netif_rx(struct sk_buff *skb)
static int deliver_to_old_ones(struct packet_type *pt,
struct sk_buff *skb, int last)
{
static spinlock_t net_bh_lock = SPIN_LOCK_UNLOCKED;
int ret = NET_RX_DROP;
if (!last) {
......@@ -1307,20 +1306,13 @@ static int deliver_to_old_ones(struct packet_type *pt,
if (skb_is_nonlinear(skb) && skb_linearize(skb, GFP_ATOMIC))
goto out_kfree;
/* The assumption (correct one) is that old protocols
did not depened on BHs different of NET_BH and TIMER_BH.
#if CONFIG_SMP
/* Old protocols did not depened on BHs different of NET_BH and
TIMER_BH - they need to be fixed for the new assumptions.
*/
/* Emulate NET_BH with special spinlock */
spin_lock(&net_bh_lock);
/* Disable timers and wait for all timers completion */
tasklet_disable(bh_task_vec+TIMER_BH);
print_symbol("fix old protocol handler %s!\n", (unsigned long)pt->func);
#endif
ret = pt->func(skb, skb->dev, pt);
tasklet_hi_enable(bh_task_vec+TIMER_BH);
spin_unlock(&net_bh_lock);
out:
return ret;
out_kfree:
......