Ticket #1638 (closed defect: wontfix)

Opened 3 years ago

Last modified 2 years ago

LiveCD fails, Hanging at "Starting init..."

Reported by: Pete@drunkpian.org.uk
Assigned to: jhuntwork@linuxfromscratch.org
Priority: high
Milestone: 6.2
Component: CD
Version: x86-6.1.1-3
Keywords: heisenbug
Cc:

Description

I have downloaded the latest LiveCD 6.1.1-3 so I can build my first LFS system. Unfortunately I can't boot from the CD as it hangs at "Starting init".

My system:

AMD K6-2 500 MHz; 256 MB RAM; FIC VA-503+ motherboard; Award BIOS; LITE-ON CD-RW drive

Action taken so far:

1) Checked ISO MD5 - ok!

2) Tried the following cheatcodes:

linux expert
linux nolapci

3) Burnt new CD - Same error.

I have run the following linux distros on this machine without problems:

Slackware 10.2, Debian Woody, Kanotix, Knoppix, Vector Linux 4.3

But had a similar problem with Gentoo that I never got to the bottom of.

Change History

02/12/06 08:14:31 changed by jhuntwork@linuxfromscratch.org

  • owner changed from livecd@linuxfromscratch.org to jhuntwork@linuxfromscratch.org.

The adjusted init.c in trunk seems to have fixed the problem. However, a review of the init.c code in general is probably in order. IIRC, there was also trouble booting the CD from the second occupied CD drive in a system. Leaving this open until we release 6.1.1-4 and have reviewed init.c.

02/23/06 09:08:34 changed by jhuntwork@linuxfromscratch.org

  • milestone set to 6.2.

02/23/06 09:12:34 changed by jhuntwork@linuxfromscratch.org

  • keywords set to init.

03/09/06 06:15:22 changed by alexander@linuxfromscratch.org

  • keywords changed from init to init unionfs.

Change of init.c did not fix the problem.

It will not be fixed until Justin R. Knierim replies to http://archives.linuxfromscratch.org/mail-archives/livecd/2006-March/003162.html

03/09/06 19:52:39 changed by justin@linuxfromscratch.org

I'm working on it. The notebook which has this problem doesn't belong to me (as pointed out by Jeremy who asked who James is ;) ). I promise I'll get to testing this CD with the patch from that message by tomorrow evening.

03/10/06 19:24:45 changed by justin@linuxfromscratch.org

I re-burned the 6.1.1-3 LiveCD after applying the patch to init.c. The same thing happens as before: it prints "Starting init..." and stops. Nothing is printed after it.

While at work this morning, I had my server build a 6.1.1-3 completely from scratch with the patch applied; same result.

To be 100% sure I didn't fsck it up, I used both CDs in an i686 system (where this problem never happens). The CDs booted, went past "Starting init...", printed a ton of "can't write to... read-only file system" errors, and finally "can't open /dev/tty1: no such file or directory" for all ttys, followed by init respawning too fast.

03/10/06 19:48:20 changed by justin@linuxfromscratch.org

I wanted to add some more information to help us find the problem. I created a 6.1.1-4-dm (meaning current /livecd/branches/6.1.1 with the patch available here http://linuxfromscratch.org/~justin/livecd-6.1.1-r1445-dm-1.patch). This LiveCD works fine on any i686 system, I tried it, started Xorg, no problems. It also fails to boot on my i586. This time instead of freezing at "Starting init...", it freezes at "Setting up the loopback devices..." and doesn't continue.

In case I really fscked up the patch, here is the tar/bz2'ed lfs-livecd Makefiles. http://linuxfromscratch.org/~justin/livecd-6.1.1-4-dm.tar.bz2

03/11/06 23:23:40 changed by alexander@linuxfromscratch.org

Please verify that the statements below correctly identify the state of the problem.

  1. 6.1.1-X (X>=2) fails to boot on some old computers
  2. 6.1.1-1 (which, by mistake, contains a 2.6.12.x kernel) also fails to boot
  3. 6.2-preX seems to fix (or hide) the problem for people with unbootable 6.1.1-X CDs
  4. disabling /sbin/hotplug seems to help
  5. replacing unionfs with a bind mount while keeping everything as on the original CD doesn't get past "Starting init..."
  6. switching to dm (which, basically, contains the same init but no unionfs) doesn't help
  7. the problem is reproducible even with a shell-based init, see http://archives.linuxfromscratch.org/mail-archives/livecd/2006-March/003179.html. Need to verify, since Brandon made too many typos during this simple task, and this result is, therefore, not trusted.
  8. There is a different kind of hang with the LiteOn drive here, solved by disabling DMA and waiting 5 minutes for the kernel to recognize the broken CD-ROM drive.

Points 5 and 6 rule out unionfs as the cause of this bug. Point 7 seems to rule out the binary init. Point 4 seems to suggest some kind of OOM situation or filesystem/block layer stress.

03/12/06 10:31:29 changed by justin@linuxfromscratch.org

  1. Yes, these computers tend to be i586 and lower machines. Haven't heard any reports for i686 machines.
  2. Correct, 6.1.1-1 fails to boot as well.
  3. Correct, all the 6.2-pre{1,2,3} worked for me.
  4. I don't remember the outcome of this test; I thought I would have written to the list or something. Sorry, I can check it out again after work.
  5. Correct, the bind mount didn't help the situation.
  6. Correct, the DM CD I made and tested worked on an i686 but pauses on the i586 at "Setting up loopback devices...", even before printing "Starting init..."
  7. I will need to try it again tonight. I didn't examine the Makefile before running it and then going to visit the old computer. 6.1.1 doesn't have LFS-ARCH defined and therefore the initramfs didn't get copied to the CD. I have it remaking an ISO now and can test after work.
  8. Bummer

I'll get back to you tonight about the shell based init.

03/12/06 21:59:57 changed by justin@linuxfromscratch.org

Ok, my test results from the shell-based init as referred to in http://archives.linuxfromscratch.org/mail-archives/livecd/2006-March/003179.html. It does not boot on an i586. This time it stops with the following text:

ip_tables: (C) 2000-2002 Netfilter core team
ipt_recent v0.3.1: Stephen Frost <sfrost@snowman.net>. http://snowman.net/projects/ipt_recent/
arp_tables: (C) 2002 David S. Miller
NET: Registered protocol family 1
NET: Registered protocol family 17
Freeing unused kernel memory: 336k freed

That is all. There is a crlf after 336k freed for the cursor.

The results of booting this same CD on an i686 machine: It boots fine, gets to the prompt, I can start X and use it, shutdown was clean.

03/13/06 06:06:44 changed by alexander@linuxfromscratch.org

Many thanks for the confirmation.

Let's test if this is an OOM-like situation caused by incorrectly detected memory size. Please boot the CDs (original, with bind mount, and with shell-based init) on both i586 and i686 with the following line:

linux noapic pci=noacpi mem=64M

Then try once again with "64M" replaced with the real RAM size.

03/16/06 19:49:49 changed by alexander@linuxfromscratch.org

hlenderk@bcpl.net says that he was able to extract the following oops from the kernel on a failing PC:

Oops: 0002 [#1]
PREEMPT SMP
Modules linked in:
CPU: 0
EIP: 0060:[<00000003>] Not tainted VLI
EFLAGS: 00010002 (2.6.11.12)
EIP is at 0x3
eax:00000001  ebx:c06e8900  ecx:00000000  edx:0000003e
esi:c013b464  edi:00000000  ebp:00000000  esp:c06f3f84
ds:007b  es:007b  ss:0068
Process swapper (pid: 0, threadinfo=c06f2000  task=c05a4ba0)
Stack:
00000000  c051fb10  00000a50  c06f3fb4  c06f3fb4  00000000  c074a140  007ac007
c0105851  00000000  00039100  c010400a  00000000  c06f3fe8  c053c9b8  00039100
c074a140  007ac007  c053c9b8  0000007b  0000007b  ffffff00  c06f4141  00000060
Call Trace:
[<c051fb10>] _spin_unlock_irqrestore+0x10/0x30
[<c0105851>] do_IRQ+0x21/0x30
[<c0104000>] common_interrupt+0x1a/0x20
[<c06f4141>] check_hlt+0x21/0x30
[<c06f4199>] check_bugs+0x19/0x40
[<c06f48b0>] start_kernel+0x180/0x1c0
Code: Bad EIP value 

This suggests that booting with the following command line may be a workaround:

linux noapic pci=noacpi nohlt

If that doesn't help, add: mem=nopentium
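For anyone who wants to compare oopses from several failing machines, the call trace can be pulled apart mechanically. The following is only a rough Python sketch, not part of the LiveCD tooling: the function name parse_call_trace and the regular expression are assumptions made for illustration, and the input is the trace quoted above. It shows that check_hlt is among the frames, which is what motivates trying "nohlt".

```python
import re

# Matches frames such as "[<c06f4141>] check_hlt+0x21/0x30".
# The pattern is an assumption based on the oops quoted in this ticket.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_call_trace(text):
    """Return a list of (address, symbol, offset, size) tuples."""
    frames = []
    for match in FRAME_RE.finditer(text):
        addr, symbol, offset, size = match.groups()
        frames.append((int(addr, 16), symbol, int(offset, 16), int(size, 16)))
    return frames


# Call trace copied from the oops reported in this ticket.
TRACE = """\
[<c051fb10>] _spin_unlock_irqrestore+0x10/0x30
[<c0105851>] do_IRQ+0x21/0x30
[<c0104000>] common_interrupt+0x1a/0x20
[<c06f4141>] check_hlt+0x21/0x30
[<c06f4199>] check_bugs+0x19/0x40
[<c06f48b0>] start_kernel+0x180/0x1c0
"""

frames = parse_call_trace(TRACE)
symbols = [frame[1] for frame in frames]

# check_hlt in the trace means the crash happened during the HLT test.
hlt_involved = "check_hlt" in symbols
```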

03/16/06 20:43:00 changed by justin@linuxfromscratch.org

Just wanted to add: I tried both "linux noapic pci=noacpi nohlt" and the same plus mem=nopentium with a 6.1.1-3 (no debugging stuff, plain LiveCD), with the same result as before. It halts at "Starting init..."

Not sure if you meant your last comment for everyone or just the person with the kernel OOPS.

03/17/06 06:24:11 changed by alexander@linuxfromscratch.org

This issue has something to do with the real hardware, since it is not reproducible in Bochs configured for i586 emulation.

03/24/06 22:21:32 changed by justin@linuxfromscratch.org

Still trying to get this problem solved, although motivation is getting low because there seems to be so little progress. I tried the 6.1-3 LiveCD, and it boots fine on an i586. So somewhere between 6.1-3 and 6.1.1-1 something went wrong, but the diff is 500K, there are version differences in the unionfs and squashfs packages, and probably other differences. I am now building a test 6.1.1 branch CD with the kernel config from the 6.1-3 version (there are a few differences, one being for squashfs, plus an added option to build in unionfs) to see if there is any change. The i586 notebook I have tested on will be with me all weekend, in case there are other suggestions of things to try to hopefully solve this problem. Thanks guys.
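To narrow down which kernel config differences exist between the two versions, the two config files can be compared option by option rather than read through as a raw diff. What follows is a minimal Python sketch under stated assumptions: the helper names parse_config and diff_configs are made up for this example, and the sample configs are invented placeholders (CONFIG_EXAMPLE_A and CONFIG_EXAMPLE_B are not real options), not the actual 6.1-3 or 6.1.1 configs.

```python
import re

# Lines in a kernel .config look like "CONFIG_FOO=y" or
# "# CONFIG_FOO is not set".
SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; unset options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changes[name] = (before, after)
    return changes


# Invented sample data, for illustration only.
OLD_CONFIG = """\
CONFIG_SQUASHFS=m
# CONFIG_EXAMPLE_A is not set
CONFIG_EXAMPLE_B=y
"""
NEW_CONFIG = """\
CONFIG_SQUASHFS=y
CONFIG_EXAMPLE_A=y
CONFIG_EXAMPLE_B=y
"""

changes = diff_configs(OLD_CONFIG, NEW_CONFIG)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

Treating a missing option the same as "is not set" keeps options that exist in only one kernel version from being reported when they are disabled anyway.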

03/25/06 03:41:29 changed by alexander@linuxfromscratch.org

You could also help by adding the following to the kernel config:

CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_KERNEL=y
CONFIG_DETECT_SOFTLOCKUP=y
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_SPINLOCK_SLEEP=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACK_USAGE=y

(note: not all of the above applies to linux-2.6.11)
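Since not every option exists in every kernel version, it may be worth checking which of the requested options actually ended up enabled in the resulting .config. Below is a rough Python sketch of such a check. The function names and the sample config text are hypothetical; only the option list is taken from the comment above.

```python
# Debug options requested in this ticket.
WANTED = [
    "CONFIG_MAGIC_SYSRQ",
    "CONFIG_DEBUG_KERNEL",
    "CONFIG_DETECT_SOFTLOCKUP",
    "CONFIG_DEBUG_SLAB",
    "CONFIG_DEBUG_PREEMPT",
    "CONFIG_DEBUG_MUTEXES",
    "CONFIG_DEBUG_SPINLOCK",
    "CONFIG_DEBUG_SPINLOCK_SLEEP",
    "CONFIG_DEBUG_STACKOVERFLOW",
    "CONFIG_DEBUG_STACK_USAGE",
]


def enabled_options(config_text):
    """Return the set of options that are set to 'y' in a .config."""
    enabled = set()
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.startswith("CONFIG_") and value == "y":
            enabled.add(name)
    return enabled


def missing_options(config_text, wanted=WANTED):
    """Return the wanted options that are not enabled, in request order."""
    enabled = enabled_options(config_text)
    return [name for name in wanted if name not in enabled]


# Invented sample: a config where only the first two options survived.
SAMPLE_CONFIG = """\
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SLAB is not set
"""

missing = missing_options(SAMPLE_CONFIG)
for name in missing:
    print("not enabled:", name)
```

In practice the text would come from reading the real .config file; an option listed as missing is either unknown to that kernel version or was disabled by a dependency.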

03/25/06 03:42:59 changed by alexander@linuxfromscratch.org

When the thing hangs, pressing SysRq+T several times may help debugging: we will know where it sits.

03/25/06 21:38:46 changed by justin@linuxfromscratch.org

Thanks for the tips, Alexander. I added the mentioned kernel config options, ran make oldconfig, and built a CD with that kernel. It failed to boot as well. Pressing SysRq+T showed lots of data, scrolling past the screen buffer. I took pictures with a digital camera and put them on my site below (about 1MB total in pictures) as there was way too much text to type accurately. All the pictures are available here:

http://knierim.org/~justin/debug/

The scrolled-up buffer starts at picture screen01 and ends at screen08. That final picture is here if that is the only relevant picture:

http://knierim.org/~justin/debug/screen08.jpg

03/27/06 19:18:13 changed by alexander@linuxfromscratch.org

Many thanks for the screenshots, almost all of them are relevant. In fact, there are so many suspicious things that there are now too many directions to dig in.

  • on screen01.jpg, you caught call_usermodehelper (i.e., trying to call /sbin/hotplug).
  • on screen02.jpg, you caught blk_unplug_work (some block device disappeared, huh?)
  • on screen05.jpg, there is ret_from_fork (presumably, /sbin/hotplug returned)
  • on screen0{6,7}.jpg, there are three SCSI error handlers. They do match what I have on my linux-2.6.16-ide1 system, so if there is indeed something SCSI-related in the laptop, they are normal. But is SCSI (or USB storage) really there? And why three devices?

So please debug further in the following order:

  • Boot with the "usb-handoff" parameter
  • If that doesn't help, try removing SCSI drivers and USB host controllers from the kernel config one-by-one, rebuilding the kernel and rebooting. The end result should be one of the following:
    • SCSI is a red herring here, removing all of SCSI and USB didn't help
    • Driver FOO kills the boot process, let's remove it.

03/27/06 20:26:13 changed by justin@linuxfromscratch.org

That is great to hear that the screenshots were relevant. Too bad there are too many things that look suspicious. I will try out the debug options you pointed out and get back to you.

To answer the questions from the screenshots:

screen02.jpg: I didn't have any USB storage device or anything else strange plugged in, and didn't detach anything as far as I could tell.

screen0{6,7}.jpg: The notebook is IDE only (hdd and cdrom). There is 1 USB port (USB 1.1) but nothing was plugged into it, and there was no usb-storage device for sure.

I'll give the debug items a try and get back to you. Thanks for your help Alexander.

04/01/06 22:35:46 changed by justin@linuxfromscratch.org

As requested by Alexander, I have done the following:

1) Started the LiveCD with the "usb-handoff" parameter

This didn't help and the LiveCD still froze at 'Starting init...' on an i586 machine.

2) Removed all USB and SCSI items from the kernel.

This resulted in the LiveCD failing to boot even before starting init. The process to get the CD was rm -fr ../iso, then rm prepiso. I then changed the kernel config by removing USB and SCSI support completely, then setting the options for squashfs and unionfs to yes.

The output was:

Mounting unionfs...
Failed to mount unionfs: No such device
Kernel panic - not syncing: Attempted to kill init

This could be a mistake in my iso creation but the kernel config and so on looks fine. I will investigate this problem further tomorrow after work.

04/19/06 21:07:03 changed by justin@linuxfromscratch.org

Alexander requested that I build a LiveCD 6.1.1-4 (from /branches/6.1.1) with the modified initramfs file from a 6.1-3 LiveCD, which still boots fine on an i586. I just tested it, and the results were the same, unfortunately. The system stops before starting init:

Freeing unused kernel memory: 336k freed
LFS Live CD is /dev/hdb

That is all. The search continues...

07/23/06 20:50:12 changed by alexander@linuxfromscratch.org

  • keywords changed from init unionfs to heisenbug.
  • status changed from new to closed.
  • resolution set to wontfix.

Nobody among the developers has both the needed hardware and the skills to diagnose the bug. My own impression is that this bug is caused by broken optimization for size in gcc-3.x in some kernel code that applies to some ancient motherboards. However:

  • my request to rebuild the CD without the -Os flag is still unanswered
  • the 6.1.1 book contains a root hole (LFS ticket 1831, wontfix) and thus should not be used
  • this bug was never present on 6.2-preX CDs

Thus, I am closing this as wontfix and recommend everyone who sees this bug to use the 6.2-pre5 CD (or 6.2-1 when this comes out).