Sat, 02 Sep 2017

Allan twisted my arm in BSDNow episode 209 so I'll have to talk about

First things first: I think Allan misremembered things. The heroic debugging story is PR 219251, which I'll try to write about later. This one is an issue that affected some FreeBSD 11.x systems, where FreeBSD would panic at startup. There were no reports for CURRENT. The panic looked like this:

Crash information

There's very little to go on here, but we do know the cause of the panic ("integer divide fault"), and that the current process was "pf purge". The pf purge thread is pf's housekeeping kernel thread: it cleans up things like old states and expired fragments.

Here's the core loop:
void
pf_purge_thread(void *unused __unused)
{
        u_int idx = 0;

        while (pf_end_threads == 0) {
                sx_sleep(pf_purge_thread, &pf_end_lock, 0, "pftm", hz / 10);

                VNET_FOREACH(vnet_iter) {
                        /*
                         *  Process 1/interval fraction of the state
                         * table every run.
                         */
                        idx = pf_purge_expired_states(idx, pf_hashmask /
                            (V_pf_default_rule.timeout[PFTM_INTERVAL] * 10));

                        /*
                         * Purge other expired types every
                         * PFTM_INTERVAL seconds.
                         */
                        if (idx == 0) {
                                /*
                                 * Order is important:
                                 * - states and src nodes reference rules
                                 * - states and rules reference kifs
                                 */
                                /* ... */
                        }
                }
        }
}


The lack of mention of pf functions in the backtrace is a hint unto itself. It suggests that the error is probably directly in pf_purge_thread(). It might also be in one of the static functions it calls, because compilers often just inline those so they don't generate stack frames.

Remember that the problem is an "integer divide fault". How can integer divisions be a problem? Well, you can try to divide by zero. The most obvious suspect for this is this code:

	idx = pf_purge_expired_states(idx, pf_hashmask /
		(V_pf_default_rule.timeout[PFTM_INTERVAL] * 10));
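To make the arithmetic concrete, here is a minimal userspace sketch of that expression. The thread wakes every hz / 10 ticks, so ten times per second, which is where the factor 10 comes from. The helper name purge_batch_size() is hypothetical and not part of pf, and the values used in the usage note (a hash mask of 32767 and an interval of 10 seconds) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of pf: computes how many hash rows one
 * run of the purge loop would process. Returns false instead of
 * dividing when the divisor is zero.
 */
static bool
purge_batch_size(uint32_t hashmask, uint32_t interval, uint32_t *batch)
{
	assert(batch != NULL);

	/* The thread runs ten times per second, hence interval * 10. */
	uint32_t divisor = interval * 10;

	if (divisor == 0)
		return (false);	/* dividing here is the fault in question */

	*batch = hashmask / divisor;
	return (true);
}
```

With the assumed values, purge_batch_size(32767, 10, &batch) yields 327 rows per run. With an interval of 0 the divisor is 0, and the unguarded kernel expression has nothing to fall back on.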

However, V_pf_default_rule.timeout[PFTM_INTERVAL] is correctly initialised (in pfattach_vnet()), and it can only be modified through the DIOCSETTIMEOUT ioctl() call, which checks for zero.
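The kind of check meant here can be sketched in userspace as follows. This is not the pf ioctl handler: the table, the enum names and set_timeout() are all hypothetical, and whether the real code rejects a zero interval or bumps it to a minimum is not shown in this post. The sketch assumes the latter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for pf's timeout table; names are illustrative. */
enum { TM_TCP_ESTABLISHED, TM_INTERVAL, TM_MAX };

struct timeouts {
	unsigned int value[TM_MAX];
};

/*
 * Store a new timeout. The interval slot is later used as a divisor,
 * so a request for zero is bumped to one second instead of being stored.
 */
static int
set_timeout(struct timeouts *t, int which, int seconds)
{
	assert(t != NULL);

	if (which < 0 || which >= TM_MAX || seconds < 0)
		return (EINVAL);

	if (which == TM_INTERVAL && seconds == 0)
		seconds = 1;	/* assumed policy: never store a zero divisor */

	t->value[which] = (unsigned int)seconds;
	return (0);
}
```

With a setter like this, no caller can put a zero into the interval slot, which is why the ioctl path could be ruled out as the source of the panic.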

At that point I had no idea how this could happen, but because the problem did not affect CURRENT I looked at the commit history and found this commit from Luiz Otavio O Souza:

	Do not run the pf purge thread while the VNET variables are not
	initialized, this can cause a divide by zero (if the VNET initialization
	takes to long to complete).

	Obtained from:	pfSense
	Sponsored by:	Rubicon Communications, LLC (Netgate)

That sounds very familiar, and indeed, applying the patch fixed the problem. Luiz explained it well: it's possible to use V_pf_default_rule.timeout before it's initialised, which caused this panic.
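The idea behind the fix can be modelled in userspace like this. It is only a sketch of the approach described in the commit message (skip the purge work while a vnet is not yet initialised), not the actual patch: struct fake_vnet, its active flag and purge_pass() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vnet state; the real pf keeps this in VNET variables. */
struct fake_vnet {
	bool active;		/* set once initialisation has finished */
	unsigned int interval;	/* zero until initialised */
	unsigned int hashmask;
};

/*
 * One pass of a purge loop over all vnets. Returns the number of vnets
 * actually processed. Vnets that are not yet active are skipped, so the
 * division never sees an uninitialised (zero) interval.
 */
static size_t
purge_pass(const struct fake_vnet *vnets, size_t n, unsigned int *rows)
{
	size_t processed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!vnets[i].active)
			continue;	/* not ready yet, try again next pass */

		assert(vnets[i].interval != 0);
		rows[i] = vnets[i].hashmask / (vnets[i].interval * 10);
		processed++;
	}
	return (processed);
}
```

A vnet whose initialisation has not finished is simply left alone until a later pass, which costs nothing because the thread wakes up again a tenth of a second later.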

To me, this reaffirms the importance of writing good commit messages: because Luiz mentioned both the pf purge thread and the division by zero, I was easily able to find the relevant commit. If I hadn't found it, this fix would have taken a lot longer.

posted at: 19:00 | path: /freebsd




