Opened 9 years ago
Closed 9 years ago
#7407 closed enhancement (fixed)
mdadm-3.4
Reported by: | Fernando de Oliveira | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.9 |
Component: | BOOK | Version: | SVN |
Severity: | normal | Keywords: | |
Cc: |
Description
https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.xz
https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.sign
https://www.kernel.org/pub/linux/utils/raid/mdadm/ANNOUNCE
{{{
Subject: ANNOUNCE: mdadm 3.4 - A tool for managing md Soft RAID under Linux

I am pleased to announce the availability of
   mdadm version 3.4

It is available at the usual places:
and via git at
   git://github.com/neilbrown/mdadm
   git://neil.brown.name/mdadm
   http://git.neil.brown.name/git/mdadm

The new second-level version number reflects significant new functionality,
particular support for journalled RAID5/6 and clustered RAID1.  This new
support is probably still buggy.  Please report bugs.

There are also a number of fixes for Intel's IMSM metadata support, and an
assortment of minor bug fixes.

I plan for this to be the last release of mdadm that I provide as I am
retiring from MD and mdadm maintenance.  Jes Sorensen has volunteered to
oversee mdadm for the next while.  Thanks Jes!

NeilBrown
28th January 2016
}}}
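For reference, a hedged sketch of fetching and checking the release from the URLs above. As is usual for kernel.org releases, the .sign file appears to cover the uncompressed tarball, hence the unxz step; the signing key is not shown here, so importing it is left out:
{{{
# download the tarball and its detached signature
wget https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.xz
wget https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.sign

# decompress (keeping the .xz) and verify the .tar against the signature
unxz --keep mdadm-3.4.tar.xz
gpg --verify mdadm-3.4.tar.sign mdadm-3.4.tar
}}}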
Change History (31)
comment:1 by , 9 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:2 by , 9 years ago
comment:4 by , 9 years ago (follow-up: 5)
Owner: | changed from | to
---|---|
Status: | assigned → new |
I give up. The test 12imsm-r5_3d-grow-r5_4d hangs: it did nothing for more than eight hours, after taking less than one hour to reach that point.
Giving back to the book.
I don't like this package at all. I don't understand it, and I need to reboot after each failure because the mdadm processes do not get killed.
I think I'm starting to get tired again.
See you tomorrow.
comment:5 by , 9 years ago
Replying to fo:
I don't like this package at all. I don't understand it, and I need to reboot after each failure because the mdadm processes do not get killed.
I forgot to mention that I cannot soft reboot: I always have to press the power button to hard reset.
I cannot feel safe with this test suite, which runs as root. If it were up to me alone, I would recommend not running it.
I turned the machine on just to add that.
comment:6 by , 9 years ago
For me, it is hanging at tests/01replace - ps aux shows the first invocation (of 3) of ./test was over two hours ago, and the latest was about 9 minutes later. It appears to be hanging in sha1sum (status D+). I agree this is not killable. Looking in /var/tmp, the log was last updated when the last invocation started.
I then typed Ctrl-C, and it reported FAILED and referred me to tests/logs. 01replace.log (and dmesg) show a series of hung-task messages; from dmesg, the relevant part begins:
{{{
[ 1125.612910] --- wd:4 rd:4
[ 1125.612915] disk 0, wo:0, o:1, dev:loop0
[ 1125.612918] disk 1, wo:0, o:1, dev:loop1
[ 1125.612922] disk 2, wo:0, o:1, dev:loop2
[ 1125.612924] disk 3, wo:0, o:1, dev:loop3
[ 1320.265694] INFO: task md0_raid1:9152 blocked for more than 120 seconds.
[ 1320.265702]       Not tainted 4.4.1 #5
[ 1320.265704] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.265707] md0_raid1 D ffff8801fbe27bb0 0 9152 2 0x00080000
[ 1320.265715] ffff8801fbe27bb0 ffff880236972f00 ffff88022e225240 ffffffff810bbd55
[ 1320.265720] ffff8801fbe27b90 ffff8801fbe28000 ffff88022e321988 0000000000000000
[ 1320.265724] ffff88022e321970 ffff88022e321900 ffff8801fbe27bc8 ffffffff8196e93f
[ 1320.265729] Call Trace:
[ 1320.265739] [<ffffffff810bbd55>] ? preempt_count_add+0x85/0xd0
[ 1320.265745] [<ffffffff8196e93f>] schedule+0x3f/0x90
[ 1320.265751] [<ffffffffa003c796>] freeze_array+0x76/0xd0 [raid1]
[ 1320.265755] [<ffffffff810d4bd0>] ? wake_atomic_t_function+0x60/0x60
[ 1320.265760] [<ffffffffa003c831>] raid1_quiesce+0x41/0x50 [raid1]
[ 1320.265768] [<ffffffffa00129ba>] mddev_suspend.part.30+0x7a/0x90 [md_mod]
[ 1320.265772] [<ffffffffa003cf75>] ? print_conf+0x85/0x100 [raid1]
[ 1320.265778] [<ffffffffa000d8f8>] ? md_wakeup_thread+0x28/0x30 [md_mod]
[ 1320.265786] [<ffffffffa00129ec>] mddev_suspend+0x1c/0x20 [md_mod]
[ 1320.265790] [<ffffffffa003d3c9>] raid1_add_disk+0xd9/0x1e0 [raid1]
[ 1320.265797] [<ffffffffa001739d>] remove_and_add_spares+0x26d/0x340 [md_mod]
[ 1320.265805] [<ffffffffa001c8db>] md_check_recovery+0x3fb/0x4c0 [md_mod]
[ 1320.265809] [<ffffffffa003fd05>] raid1d+0x55/0x1010 [raid1]
[ 1320.265813] [<ffffffff8196e949>] ? schedule+0x49/0x90
[ 1320.265817] [<ffffffff8197271e>] ? schedule_timeout+0x19e/0x260
[ 1320.265822] [<ffffffff810bbd55>] ? preempt_count_add+0x85/0xd0
[ 1320.265826] [<ffffffff81973358>] ? _raw_write_unlock_irqrestore+0x18/0x30
[ 1320.265830] [<ffffffff8197337e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 1320.265836] [<ffffffffa0010e22>] md_thread+0x112/0x120 [md_mod]
[ 1320.265840] [<ffffffff810d4bd0>] ? wake_atomic_t_function+0x60/0x60
[ 1320.265846] [<ffffffffa0010d10>] ? find_pers+0x70/0x70 [md_mod]
[ 1320.265851] [<ffffffff810b6dc9>] kthread+0xc9/0xe0
[ 1320.265855] [<ffffffff8197333e>] ? _raw_spin_unlock_irq+0xe/0x10
[ 1320.265859] [<ffffffff810b6d00>] ? kthread_worker_fn+0x170/0x170
[ 1320.265864] [<ffffffff81973c5f>] ret_from_fork+0x3f/0x70
[ 1320.265868] [<ffffffff810b6d00>] ? kthread_worker_fn+0x170/0x170
}}}
and then the next hung-task message shows up. After the Ctrl-C, the third invocation is still running, and the sha1sum status is D.
comment:7 by , 9 years ago
I've contacted the old and new maintainers - I'm not sure about the transitional arrangements, so I haven't reported the bug at github.
comment:9 by , 9 years ago
I don't know - and for me it fails on an earlier test than for Fernando. I prefer to wait for a bit, to see if there is any response from upstream. Hardly anybody runs the tests, so I don't think there will be a catastrophic problem if we leave the instructions for running ./test in the book for a few days (anyone running it must already NOT be using mdadm for real).
I guess that either an extra kernel config entry is (now) required, or perhaps there was a kernel regression (I was going to say 'or a gcc issue', but sha1sum on a small file seems to work ok).
OTOH, I have no objection to recommending NOT to run the tests at the moment because some hang and cannot be killed. I admit to being slightly concerned that some tests in the previous version failed for unexplained reasons (was hoping to look at that if the tests completed), because on my server I rely on mdadm working in RAID-1.
On _this_ test machine I'm tempted to add two small drives to test software RAID, but I've only just moved to one bigger drive. When I added another OS on a second drive, grub and linux disagreed about the primary drive (in linux, sda became sdb, and then when I added a third drive to copy everything to, linux thought the original system was sdc) - so, I'm not looking forward to trying to add drives. Nor do I yet have any thoughts about a practical set of "does it work" tests - I would only be able to do RAID 0 and RAID 1 at the moment (two drives), and anyway I'm hoping to get another (extra) test machine delivered in a few hours. Getting that tested and into use will take time.
comment:10 by , 9 years ago
I got a reply from Neil Brown; this problem has possibly been around for some time. He said the machine can be unblocked by running cat suspend_lo > suspend_hi (in the /sys/block/mdXXX/md directory). That was in the context of the replace test which hung for Fernando; I'm not sure if my problem is identical.
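For anyone who hits the same hang, here is a minimal sketch of that unblock step, assuming the stuck array shows up as md0 (adjust mdXXX to whatever /sys/block actually contains):
{{{
# run as root in the md directory of the stuck array;
# copying the current suspend_lo value into suspend_hi is what
# Neil says releases the blocked array
cd /sys/block/md0/md
cat suspend_lo > suspend_hi
}}}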
I had powered it off in the meantime (couldn't s2ram). He'll try to find some time to look at this later in the month, and he thinks mdadm is losing track of where it is up to.
I'll try to get back to this to see how consistent it is, and perhaps try 3.3.4 for comparison.
comment:11 by , 9 years ago
Replying to bdubbs@…:
Can we disable this test in the test suite?
I tried the test suite too, and a different test hangs for me. ps aux shows that it stopped while running mkfs.ext3.
Thanks to Ken for looking at this.
comment:12 by , 9 years ago
Thanks, Ken and Pierre, for confirming that the problem is not only mine.
It did not hang with 3.3.4, and I considered the results acceptable: 9 out of 122 failing, less than 7.4%:
{{{
Number of succeeded tests: 113
Number of FAILED tests: 9
FAILED tests:
tests/00raid1...                        FAILED - see ...
tests/02r1add...                        FAILED - see ...
tests/07autoassemble...                 FAILED - see ...
tests/07changelevels...                 FAILED - see ...
tests/07revert-grow...                  FAILED - see ...
tests/07revert-inplace...               FAILED - see ...
tests/07revert-shrink...                FAILED - see ...
tests/07testreshape5...                 FAILED - see ...
tests/10ddf-incremental-wrong-order...  FAILED - see ...

real    62m5.807s
user    1m56.335s
sys     0m58.734s

1,9M ../DEST-mdadm-3.3.4
9,4M ../mdadm-3.3.4
12M  total
}}}
(1 SBU = 151 s)
(In the listing above, messages such as "see test-logs/log-00raid1 for details" have been shortened to "see ...".)
Date test was executed:
2015.08.05-09h48m59s
gcc-5.1.0
linux-4.1.4
I'm attaching the test log.
When I finish my work today, I plan to run the mdadm-3.3.4 tests on this new system, following your idea.
by , 9 years ago
Attachment: | mdadm-3.3.4-make-k-test-2015.08.05-09h48m59s.log added |
---|
Tests run on LFS-7.7, at 2015.08.05
comment:13 by , 9 years ago
I gave 3.4 another go; it hangs in what seems to be the same place. So I took a look in /sys/block: the directory there is md0/, BUT both md0/md/suspend_lo and md0/md/suspend_hi are 0!
On the assumption they should have some numeric value, I tried echoing 3000 and then '3000' to them, but those commands hung until I hit Ctrl-C and then I got a message about an interrupted system call.
I also looked at my server which is running 3.3.4, and there both are also 0.
Looks like I need to reboot to try testing 3.3.4. And then the box locked up.
comment:14 by , 9 years ago
I'm running the tests. They just passed through
tests/10ddf-incremental-wrong-order... FAILED - see ...
It failed, but did not hang.
Now executing tests/11spare-migration...
It seems to have fewer failures than on LFS-7.7:
tests/07revert-shrink... succeeded
but above it failed for 3.4 (comment:12).
comment:16 by , 9 years ago
I've just run the tests on 3.3.4: again, they hang for me in 01replace with a dead sha1sum process. LOL, I suspect a kernel .config difference might account for some of this - or else it must be something really weird: do I need to sacrifice kittens to get the tests to run?
I have not tested in the past because my desktop machines did not have RAID enabled. Yesterday, I added almost all the BLFS config options throughout the book, except for nfs v4 and 4.1 (tried those, could not mount my nfs v3 shares in fstab, and it did not like ,nfsver=3 so I took them out again).
For MD and its neighbour DM I have the following:
{{{
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
# CONFIG_MD_LINEAR is not set
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_MQ_DEFAULT is not set
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=m
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
# CONFIG_DM_CACHE is not set
# CONFIG_DM_ERA is not set
CONFIG_DM_MIRROR=m
# CONFIG_DM_LOG_USERSPACE is not set
CONFIG_DM_RAID=m
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set
}}}
I've already noted from dmesg or the system log that /dev/loop0..5 get used, so I suppose those need to be added to the mdadm page if the tests are to be run. Anything in my MD or DM stuff above which looks wrong?
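As an aside, a hedged sketch of a pre-test check for those loop devices (the CONFIG_BLK_DEV_LOOP symbol and the /usr/src/linux path are assumptions here, not anything the test suite documents):
{{{
# confirm loop-device support is built in or available as a module
grep CONFIG_BLK_DEV_LOOP /usr/src/linux/.config
# dmesg/syslog showed /dev/loop0 through /dev/loop5 being used
ls -l /dev/loop*
}}}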
comment:17 by , 9 years ago
I am almost blind at this time, so I will perhaps just give the good news, the first item being that 3.3.4 does not hang.
Just finished the tests:
{{{
real    63m28.510s
user    1m58.382s
sys     1m35.447s

2,0M ../DEST-mdadm-3.3.4
9,8M ../mdadm-3.3.4
12M  total

tests/00raid1...                        FAILED - see ...
tests/02r1add...                        FAILED - see ...
tests/07autoassemble...                 FAILED - see ...
tests/07changelevels...                 FAILED - see ...
tests/07revert-grow...                  FAILED - see ...
tests/07revert-inplace...               FAILED - see ...
tests/10ddf-fail-two-spares...          FAILED - see ...
tests/10ddf-incremental-wrong-order...  FAILED - see ...
}}}
1 SBU = 150s
comment:18 by , 9 years ago
Number of succeeded tests: 114
Number of FAILED tests: 8
Failed ≈ 6.557377 %
comment:19 by , 9 years ago (follow-up: 21)
I have many more CONFIG_DM_* things enabled in my kernel, because I needed them when I tested LVM, back in October or November. I have CONFIG_MD_MULTIPATH set, and my tests hang at a multipath test (I do not know exactly which, but I erased the log, sorry).
Ken, I see you have a lot of CONFIG_ switches enabled as modules. Have you tried to load those modules before starting the tests? I remember I had to do that before starting the tests for LVM: not all tests load the relevant module (or check that it is loaded).
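A hypothetical pre-load along the lines Pierre suggests, using the personalities from Ken's module-built .config shown earlier (whether ./test actually needs this has not been verified here):
{{{
# load the RAID personalities and the device-mapper core before running ./test
modprobe -a raid0 raid1 raid10 raid456 dm-mod
# only needed if loop-device support is also built as a module
modprobe loop
}}}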
comment:20 by , 9 years ago (follow-up: 22)
{{{
$ grep -E 'MD_|DM_' /boot/config-4.2.3 | grep -vE '^\#|AMD|PMD'
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_DM_BUFIO=y
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=y
CONFIG_DM_MIRROR=y
CONFIG_DM_LOG_USERSPACE=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_UEVENT=y
}}}
comment:21 by , 9 years ago
Replying to pierre.labastie:
I have many more CONFIG_DM_* things enabled in my kernel, because I needed them when I tested LVM, back in October or November. I have CONFIG_MD_MULTIPATH set, and my tests hang at a multipath test (I do not know exactly which, but I erased the log, sorry).
Ken, I see you have a lot of CONFIG_ switches enabled as modules. Have you tried to load those modules before starting the tests? I remember I had to do that before starting the tests for LVM: not all tests load the relevant module (or check that it is loaded).
I assumed the kernel would know what it was doing and load them when necessary. Certainly, some were still loaded when I shut down. I'll try turning things on when I come back to this, which might take a day or so - but I suppose I'll not turn on MULTIPATH. Thanks.
comment:22 by , 9 years ago
Replying to fo:
{{{
$ grep -E 'MD_|DM_' /boot/config-4.2.3 | grep -vE '^\#|AMD|PMD'
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_DM_BUFIO=y
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=y
CONFIG_DM_MIRROR=y
CONFIG_DM_LOG_USERSPACE=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_UEVENT=y
}}}
And thanks for those. Again, I'll play with my .config but it might be a day or two - I want to get my LFS scripts up to date (still in December), then use my main monitor to see if the new box works.
comment:25 by , 9 years ago
Hmm, I tried changing my .config last night, and I now have the following for MD:
{{{
ken@deluxe /scratch/ken/linux-4.4 $ grep _MD_ .config
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
}}}
But 3.3.4 still hangs at the sha1sum in 01replace, although the first two failures (00linear, 00names) now pass. For the moment, I'm giving up on this. The status field in /sys implied that whatever part of mdadm had been last used had completed.
comment:26 by , 9 years ago
I'm in the process of getting several new drives for my -dev system so I can properly test mdadm, lvm, btrfs, jfs, xfs, etc. Not sure when it will be ready. I have to figure out how to get power to the new drives (I think I need a custom cable).
Once I get things working, I'll post results to the -dev mailing list.
comment:27 by , 9 years ago
Neil has posted a fix to linux-raid; meanwhile he tells me that the hang in tests/01replace was a kernel problem, fixed in 4.4.2 by:
Commit: 1501efadc524 ("md/raid: only permit hot-add of compatible integrity profiles")
comment:28 by , 9 years ago
Good to know, Ken. I've been looking at the mdadm tests and have made a couple of posts to the linux-raid mailing lists. I was able to work around at least one test failure, but I need to put 4.4.2 into LFS and test. I do plan on using 4.4.2 in lfs-7.9-rc2.
comment:29 by , 9 years ago
I have tested mdadm on a 4.4.2 kernel. The good news is that the hangs are gone. The bad news is that there are a lot of failures: 72 tests succeed, but 51 fail.
At least two tests fail because I do not have strace on the system.
The tests take a little over an hour. I think the test time is only partially processor dependent.
One change I have suggested is:
{{{
--- test.orig	2016-02-20 17:37:53.521197001 -0600
+++ test	2016-02-20 20:02:25.083787550 -0600
@@ -332,7 +332,7 @@
 		echo "FAILED - see $logdir/$log for details"
 		_fail=1
 	fi
-	if [ "$savelogs" == "1" ]; then
+	if [ "$savelogs" == "1" -a -e $targetdir/log ]; then
 		cp $targetdir/log $logdir/$_basename.log
 	fi
 	if [ "$_fail" == "1" -a "$exitonerror" == "1" ]; then
}}}
That eliminates a bogus error from trying to copy a file that has already been mv'ed after a test failure.
I'm going to go ahead and tag this for 7.9 with some additional comments, but will leave the ticket open.
comment:30 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
No new developments here. Marking as fixed.
Sorry, there was a problem with the tests; I will update today, but will leave this open to try the tests again tomorrow.