Opened 19 years ago
Closed 18 years ago
#1720 closed defect (wontfix)
On 2.6.15.4/UDEV085, swapon sometimes fails
Reported by: | Owned by: | ||
---|---|---|---|
Priority: | highest | Milestone: | 6.2 |
Component: | Bootscripts | Version: | SVN |
Severity: | blocker | Keywords: | |
Cc: |
Description
I'm up on 2.6.15.4 and UDEV 085. However, swapon usually fails with a "device not present" error. A "sleep 2" before swapon fixes it, which sort of confirms that swapon is racing the new udev, which has not yet created the device node. However, this is not a particularly elegant solution.
- Marty Jack
Attachments (4)
Change History (31)
comment:1 by , 19 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:2 by , 19 years ago
Let's make sure that the loop indeed works.
1) Compile a simple uevent logger:
cat >bug.c <<"EOF"
/* Simple event recorder */
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <argz.h>

int main(int argc, char * argv[])
{
    char * envz;
    size_t len;
    int bug;

    bug = open("/dev/bug", O_WRONLY | O_APPEND);
    if (bug == -1)
        return 0;
    setenv("_SEPARATOR", "--------------------------------------", 1);
    argz_create(environ, &envz, &len);
    argz_stringify(envz, len, '\n');
    envz[len-1]='\n';
    write(bug, envz, len);
    close(bug);
    free(envz);
    return 0;
}
EOF
gcc -o /lib/udev/bug bug.c
2) Add a logging rule to a separate file:
cat >/etc/udev/rules.d/90-bug.rules <<"EOF"
ACTION=="add", RUN+="bug"
EOF
3) Modify the udev initscript not as Matthew said in the previous comment, but in the following way (that still incorporates his wishes):
At the end of walk_sysfs(), add:
# until we know how to do better, just wait for _all_ events to finish
loop=300
while test -d /dev/.udev/queue; do
    sleep 0.1
    test "$loop" -gt 0 || break
    loop=$(($loop - 1))
done
>/dev/bug
test "$loop" -gt 0
evaluate_retval
sleep 5
if test -s /dev/bug; then
    mv /dev/bug /dev/bugreport
    boot_mesg "Please paste the /dev/bugreport file to" ${WARNING}
    boot_mesg "http://wiki.linuxfromscratch.org/lfs/ticket/1720"
    boot_mesg "Otherwise, the next version of LFS may be unbootable on your system!"
    echo_failure
    sleep 10
else
    rm -f /dev/bug
fi
by , 19 years ago
comment:4 by , 19 years ago
Component: | Book → Bootscripts |
---|
comment:5 by , 19 years ago
Version: | SVN → udev_update |
---|
comment:6 by , 19 years ago
Severity: | normal → blocker |
---|
Archaic,
could you please retry with all three remaining combinations of the following changes:
- Just after "udevd --daemon", add "mkdir -p /dev/.udev/queue"
- Upgrade to Udev-086
and see if the bugreport is still there.
Thanks
comment:7 by , 19 years ago
Moving the code snippet to the end of the function or after the call, using the mkdir, not using the mkdir, upgrading to 086, and combining these changes in every possible way still produces a bugreport file. Not on every boot, though. For each change it might take 2-5 reboots before the bugreport was created. The rest of the time it appeared as if all was well. Dunno where to go from here.
comment:8 by , 19 years ago
Does the bugreport file contain many uevents? Please paste the example with both the mkdir and udev-086. Upstream is going to ignore some kinds of uevent leaks, e.g., anything caused by USB storage.
But that doesn't change the fact that upstream doesn't have a fully working mechanism for waiting for all uevents to be processed.
by , 19 years ago
Attachment: | bugreport-mkdir added |
---|
by , 19 years ago
Attachment: | bugreport-086 added |
---|
comment:9 by , 19 years ago
Yes, all of my bug reports are 32-34 KB. Most are not USB devices, but ttys. 2 more bugreports attached.
comment:10 by , 19 years ago
I will be trying again today with 087. Will upload the bugreport (if there is one) tonight (tomorrow for you, Alex).
comment:11 by , 19 years ago
You have not added a bugreport with both the mkdir and udev-086. If that combination produces the bugreport, read further. Otherwise, ignore.
No need to try udev-087. Even if this "fixes" the bug, please revert to 086, because that's accidental.
I think that the only way to handle this bug is to change the waiting part. But that's a big sledgehammer that only reduces the probability of the failure to something like 10^-10 instead of completely eliminating it.
Instead of
loop=300
while test -d /dev/.udev/queue; do
    sleep 0.1
    test "$loop" -gt 0 || break
    loop=$(($loop - 1))
done
try (completely untested)
loop=300
confirm=0
while true ; do
    sleep 0.1
    test -d /dev/.udev/queue && confirm=0 || confirm=$(( $confirm + 1 ))
    loop=$(( $loop - 1 ))
    test $loop -gt 0 || break
    test $confirm -lt 10 || break
done
This is supposed to exit the loop not when the /dev/.udev/queue directory disappears for a moment, but only when it has also failed to reappear 10 times in succession.
comment:12 by , 19 years ago
The bugreports are indeed what you had asked for. There was very little change across all of the different combinations you asked for. Anyway, with the new loop instructions things became much better. After 7 boots without a USB device plugged in (but with usbcore and hid still loading) there were no failures. Once I plugged in my mouse I got 5 boots without and 4 boots with the error. Every time there was an error, it appears the problem was with the kernel assigning a location for the device. Here's the output:
uhci_hcd 0000:00:10.2: UHCI Host Controller
uhci_hcd 0000:00:10.2: new USB bus registered, assigned bus number 3
uhci_hcd 0000:00:10.2: irq 9, io base 0x00001c40
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 2 ports detected
usb 1-2: new low speed USB device using uhci_hcd and address 2
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:00:10.3[D] -> Link [LNKD] -> GSI 11 (level, low) -> IRQ 11
PCI: Via IRQ fixup for 0000:00:10.3, from 0 to 11
ehci_hcd 0000:00:10.3: EHCI Host Controller
ehci_hcd 0000:00:10.3: new USB bus registered, assigned bus number 4
ehci_hcd 0000:00:10.3: irq 11, io mem 0xd0004800
ehci_hcd 0000:00:10.3: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb 1-2: unable to read config index 0 descriptor/start
usb 1-2: can't read configurations, error -71
After this, the kernel reassigns usb 1-2 to another address successfully. Again, it seems this delay is prompting the bugreport because on the boots which assign the device correctly the 1st time, there is no bugreport. Also, the bugreport is now reduced to just usb stuff, but I am attaching it in case you want to see it.
by , 19 years ago
Attachment: | bugreport-new_loop added |
---|
comment:13 by , 19 years ago
Try replacing "10" with "60" as the limit for $confirm (we should probably make this configurable). Yes, this means that the script will wait at least 6 seconds. This is still better than FreeBSD that waits 15 seconds unconditionally for SCSI devices to settle (even if there are none).
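A minimal sketch of what a configurable version of the loop might look like, assuming the limits are read from shell variables. The function name wait_for_uevents and the variables UDEV_QUEUE, UDEV_LOOP and UDEV_CONFIRM are invented for this illustration and are not part of the LFS bootscripts; fractional "sleep 0.1" is assumed to be supported, as in the loops above.

```shell
# Hypothetical helper, not the bootscript itself: wait until the queue
# directory has been absent for UDEV_CONFIRM consecutive polls, or give up
# after UDEV_LOOP polls. All names here are assumptions for this sketch.
wait_for_uevents() {
    queue=${UDEV_QUEUE:-/dev/.udev/queue}
    loop=${UDEV_LOOP:-300}
    confirm_limit=${UDEV_CONFIRM:-60}
    confirm=0
    while true; do
        sleep 0.1
        if test -d "$queue"; then
            confirm=0                       # queue is back, start over
        else
            confirm=$(( confirm + 1 ))      # one more poll without a queue
        fi
        loop=$(( loop - 1 ))
        test "$loop" -gt 0 || return 1      # overall timeout reached
        test "$confirm" -lt "$confirm_limit" || return 0
    done
}
```

With the defaults this behaves like the loop with a limit of 60; setting UDEV_CONFIRM=10 before the call would give the earlier, shorter wait. The return status tells the caller whether the queue settled (0) or the overall timeout was hit (1).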
comment:14 by , 19 years ago
One more ghost effect: the old loop with
echo -en '*'
just after "sleep 0.1" also reduces the size of the bugreport here.
comment:15 by , 19 years ago
For more than a week now, with udev-087, linux-2.6.15.6, and with the modified code that added the "confirm" tests, but without the latest "echo -en '*'" change, there has not been a bugreport. I have tested with every iteration of the following devices: external HD, USB mouse, USB keyboard, PTP2 camera (not mass storage). The same setup as this produced sporadic bugreports with udev-086.
Alex, please advise what further testing you want me to do. Should I revert udev or add the "echo -en '*'" snippet, or anything like that? And is anyone else testing this?
Matt, you were getting bugreports, too. What about with the newer kernel and udev? Are you still getting them? I will try a new build using the new udev_update bootscripts and report back. I also have a thumbdrive I can add to the mix. If I get a chance, I might even have time to throw a spare SCSI array into my gateway and kick off a build on that to test SCSI devices, but no promises at this point.
comment:16 by , 19 years ago
Disregard my last comment. After looking around at why my new build wasn't having problems, it was because I forgot the rules addition to call the bug binary. The problem still exists, it is only USB leakage that I am seeing. I'm at work and can't do a lot of reboot testing, so I haven't had a chance to get anything but preliminary info.
comment:17 by , 19 years ago
Please edit this line:
test $confirm -lt 60 || break
Replace 60 with some bigger number that makes the bugreport go away reliably, and report that number here. Disregard the "echo -en '*'" comment.
comment:18 by , 19 years ago
FWIW, I've just seen this on 20060311 udev branch (that is, with udev-087 and bootscripts from 20051223), kernel is 2.6.16-rc6. This is on what I regard as a *fast* machine, I don't think I've had swapon fail before, so this would be 1 failure in about 10 or 15 boots.
I'll try adding the mkdir -p /dev/.udev/queue, but I don't expect to keep this build around for very much longer.
comment:19 by , 19 years ago
In fact, the failure happened on the very next boot, then 7 or 8 boots without error, then the next failure. Looks to be sufficiently random to piss people off.
I wondered about moving swap to after checkfs (on the assumptions that nobody will need swap when they run fsck, and that the time to check a clean journalled '/' will allow the device to appear). Seems iffy, but in my case I've got partitions up to sda15, with /home on sda12 and swap on sda13, and checking /home works even when mounting swap failed.
Or does this just mean that something in the udev rules is delaying it too much? The only extra rule I used to have was for my CD/DVD drive, but last night I added a rule for my memory stick (check it is USB, check the serial number). Perhaps this relates to the comment for 088 (ticket 1751): 'Provide "udevtrigger" program to request events on coldplug. The shell script is much too slow with thousands of devices.' I don't have thousands of devices, only 650+ with all those tty variants, but maybe it's a similar problem.
comment:20 by , 19 years ago
The bug is not about rules, but about the initscript. The whole issue is that devices are created asynchronously in the background (no way to change this), and the bootscripts continue without sufficient waiting. Reordering of the bootscripts only hides but doesn't fix the problem.
Comments from people with LFS-Bootscripts < udev_update-20060321 and the book < 20060322 are useless and will be ignored. The bug (if anything remains) is both in our bootscript and in udevd itself (because upstream doesn't provide any other way to wait for udevd to process all uevents).
comment:21 by , 19 years ago
The modified script is in the book, but it only hides the problem by sleeping for at least 6 seconds instead of fixing its origin. A mere "sleep 6" would "work" just as well instead of the whole loop, as indicated by the original report.
Therefore, given that upstream ignores the bug, it is a good candidate for WONTFIX resolution (this implies complete removal of udev from the book).
comment:22 by , 18 years ago
Same problem using bsd-init scripts. Using udev-0.89. Sleeping is pretty ugly but slackware is also using it so I don't know if there's an alternative.
echo "Starting udev"
/sbin/udevd --daemon
/sbin/udevtrigger
mkdir -p /dev/.udev/queue
while test -d /dev/.udev/queue; do
    sleep 0.1
    echo -n '.'
done
comment:23 by , 18 years ago
Milestone: | → 6.2 |
---|---|
Owner: | changed from | to
Priority: | normal → highest |
Status: | assigned → new |
Version: | udev_update → SVN |
comment:24 by , 18 years ago
The bug is about the fact that this doesn't work reliably:
echo "Starting udev"
/sbin/udevd --daemon
/sbin/udevtrigger
mkdir -p /dev/.udev/queue
while test -d /dev/.udev/queue; do
    sleep 0.1
    echo -n '.'
done
So Slackware is buggy too.
comment:25 by , 18 years ago
Yeah, it may be buggy, but is there another way?
This is from gentoo bootscripts:
# loop until everything is finished
# there's gotta be a better way...
ebegin "Letting udev process events"
loop=0
while test -d /dev/.udev/queue; do
    sleep 0.1;
    test "$loop" -gt 300 && break
    loop=$(($loop + 1))
done
Same kind of loop although they admit there's gotta be a better way!
I think this is related to how udev handles child processes and I doubt if it can be fixed.
comment:26 by , 18 years ago
The official LFS loop is:
loop=300
confirm=0
while true ; do
    sleep 0.1
    test -d /dev/.udev/queue && confirm=0 || confirm=$(( $confirm + 1 ))
    loop=$(( $loop - 1 ))
    test $loop -gt 0 || break
    test $confirm -lt 60 || break
done
It differs from Slackware and Gentoo by exiting not when the queue disappears, but when it doesn't reappear after 60 retries. The bug in Gentoo and Slackware is exactly that the queue sometimes disappears for a moment, and their script exits prematurely.
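The difference can be illustrated with a small simulation. This is only a sketch: a scratch directory stands in for /dev/.udev/queue, the simulate_udevd function is a made-up stand-in for the daemon, and the timings (a 0.2 second gap, a confirm limit of 10) are arbitrary values chosen so that the effect shows up quickly. It assumes a sleep that accepts fractional seconds.

```shell
#!/bin/sh
# Simulation only: nothing under /dev is touched.
work=$(mktemp -d)
queue=$work/queue

simulate_udevd() {
    # Hypothetical stand-in for udevd: the queue vanishes briefly,
    # reappears, and is removed for good before "finished" is created.
    rm -f "$work/finished"
    mkdir -p "$queue"
    (
        sleep 0.5; rmdir "$queue"      # momentary gap
        sleep 0.2; mkdir "$queue"      # more events arrive
        sleep 0.8; rmdir "$queue"      # really done now
        : > "$work/finished"
    ) &
}

naive_wait() {                         # returns as soon as the queue is gone
    while test -d "$queue"; do sleep 0.1; done
}

confirm_wait() {                       # returns after 10 misses in a row
    confirm=0
    while test "$confirm" -lt 10; do
        sleep 0.1
        test -d "$queue" && confirm=0 || confirm=$(( confirm + 1 ))
    done
}

simulate_udevd; naive_wait
test -e "$work/finished" && naive=ok || naive=early
wait                                   # let the simulated daemon finish

simulate_udevd; confirm_wait
test -e "$work/finished" && confirmed=ok || confirmed=early
wait

echo "naive: $naive, confirm: $confirmed"
rm -rf "$work"
```

The variables naive and confirmed record whether each loop returned before the simulated daemon had finished. The confirm loop here has no overall timeout, unlike the official one, to keep the sketch short.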
comment:27 by , 18 years ago
Resolution: | → wontfix |
---|---|
Status: | new → closed |
Udev-090 includes the "udevsettle" program that is supposed to do the same as our loop. Upstream has added logic for the case when uevents are sitting in the kernel netlink socket buffer waiting for udevd to grab them and fooling our old loop.
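As a rough sketch of how a bootscript might call the new program, assuming it is installed next to udevd as /sbin/udevsettle: the function name settle_uevents and the SETTLE_CMD variable are invented for this illustration, and no options are passed because none are mentioned in this ticket.

```shell
# Hypothetical wrapper, not taken from the LFS bootscripts: run udevsettle
# if it is present and hand its exit status back to the caller.
settle_uevents() {
    settle=${SETTLE_CMD:-/sbin/udevsettle}   # assumed install location
    if test -x "$settle"; then
        "$settle"                            # blocks until it decides to return
    else
        echo "udevsettle not found at $settle" >&2
        return 1
    fi
}
```

A caller would run settle_uevents right after triggering coldplug events and report whatever status it returns, in place of the polling loop.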
So, any further discussion belongs to Ticket #1769, because this can no longer be classified as a bug in bootscripts.
Thanks for the report, Marty. I've seen this a couple of times too. Alexander mentioned that we need a loop to wait for uevents to be processed before continuing with the boot process at http://www.linuxfromscratch.org/pipermail/lfs-dev/2006-February/055958.html. There's a shell script snippet there that should be placed at the end of the walk_sysfs() function in /etc/rc.d/init.d/udev. Would you be able to test and see if that fixes your problem?