Opened 9 years ago

Closed 9 years ago

#7407 closed enhancement (fixed)

mdadm-3.4

Reported by: Fernando de Oliveira
Owned by: blfs-book@…
Priority: normal
Milestone: 7.9
Component: BOOK
Version: SVN
Severity: normal
Keywords:
Cc:

Description

https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.xz

https://www.kernel.org/pub/linux/utils/raid/mdadm/mdadm-3.4.tar.sign

https://www.kernel.org/pub/linux/utils/raid/mdadm/ANNOUNCE

{{{
Subject: ANNOUNCE: mdadm 3.4 - A tool for managing md Soft RAID under Linux

I am pleased to announce the availability of

mdadm version 3.4

It is available at the usual places:

http://www.kernel.org/pub/linux/utils/raid/mdadm/

and via git at

git://github.com/neilbrown/mdadm
git://neil.brown.name/mdadm
http://git.neil.brown.name/git/mdadm

The new second-level version number reflects significant new functionality, in particular support for journalled RAID5/6 and clustered RAID1. This new support is probably still buggy. Please report bugs.

There are also a number of fixes for Intel's IMSM metadata support, and an assortment of minor bug fixes.

I plan for this to be the last release of mdadm that I provide as I am retiring from MD and mdadm maintenance. Jes Sorensen has volunteered to oversee mdadm for the next while. Thanks Jes!

NeilBrown
28th January 2016
}}}

Change History (31)

comment:1 by Fernando de Oliveira, 9 years ago

Owner: changed from blfs-book@… to Fernando de Oliveira
Status: new → assigned

comment:2 by Fernando de Oliveira, 9 years ago

Sorry, there is a problem with the tests. I will update today, but will leave the ticket open to try the tests again tomorrow.

comment:3 by Fernando de Oliveira, 9 years ago

Partially fixed at r16865.

Tests not yet run.

comment:4 by Fernando de Oliveira, 9 years ago

Owner: changed from Fernando de Oliveira to blfs-book@…
Status: assigned → new

I give up. Test 12imsm-r5_3d-grow-r5_4d hangs: it did nothing for more than eight hours, after taking less than one hour to reach that point.

Giving back to the book.

I don't like this package at all. I don't understand it, and I need to reboot at each failure because the mdadm processes do not get killed.

I think I'm starting to get tired again.

See you tomorrow.

comment:5 by Fernando de Oliveira, 9 years ago (in reply to comment:4)

Replying to fo:

I don't like this package at all. I don't understand it, and I need to reboot at each failure because the mdadm processes do not get killed.

I forgot to mention that I cannot soft reboot: I always have to press the power button to hard reset.

I cannot feel safe with this test suite, which runs as root. If it depended only on me, I would recommend not running it.

I turned on the machine just to add that.

comment:6 by ken@…, 9 years ago

For me, it is hanging at tests/01replace - ps aux shows that the first invocation (of 3) of ./test started over two hours ago, and the latest about 9 minutes later. It appears to be hanging in sha1sum (status D+). I agree this is not killable. Looking in /var/tmp, the log was last updated when the last invocation started.
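For what it's worth, the stuck process can be spotted with something like this (anything in state D is in uninterruptible sleep and cannot be killed):

# List processes in uninterruptible sleep (state D);
# the hanging sha1sum shows up here.
ps -eo pid,stat,comm | awk '$2 ~ /^D/'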

I then typed Ctrl-C, and it reported FAILED. It then referred me to tests/logs - 01replace.log (and dmesg) shows a series of hung-task messages; from dmesg, the relevant part begins:

[ 1125.612910]  --- wd:4 rd:4
[ 1125.612915]  disk 0, wo:0, o:1, dev:loop0
[ 1125.612918]  disk 1, wo:0, o:1, dev:loop1
[ 1125.612922]  disk 2, wo:0, o:1, dev:loop2
[ 1125.612924]  disk 3, wo:0, o:1, dev:loop3
[ 1320.265694] INFO: task md0_raid1:9152 blocked for more than 120 seconds.
[ 1320.265702]       Not tainted 4.4.1 #5
[ 1320.265704] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1320.265707] md0_raid1       D ffff8801fbe27bb0     0  9152      2 0x00080000
[ 1320.265715]  ffff8801fbe27bb0 ffff880236972f00 ffff88022e225240 ffffffff810bbd55
[ 1320.265720]  ffff8801fbe27b90 ffff8801fbe28000 ffff88022e321988 0000000000000000
[ 1320.265724]  ffff88022e321970 ffff88022e321900 ffff8801fbe27bc8 ffffffff8196e93f
[ 1320.265729] Call Trace:
[ 1320.265739]  [<ffffffff810bbd55>] ? preempt_count_add+0x85/0xd0
[ 1320.265745]  [<ffffffff8196e93f>] schedule+0x3f/0x90
[ 1320.265751]  [<ffffffffa003c796>] freeze_array+0x76/0xd0 [raid1]
[ 1320.265755]  [<ffffffff810d4bd0>] ? wake_atomic_t_function+0x60/0x60
[ 1320.265760]  [<ffffffffa003c831>] raid1_quiesce+0x41/0x50 [raid1]
[ 1320.265768]  [<ffffffffa00129ba>] mddev_suspend.part.30+0x7a/0x90 [md_mod]
[ 1320.265772]  [<ffffffffa003cf75>] ? print_conf+0x85/0x100 [raid1]
[ 1320.265778]  [<ffffffffa000d8f8>] ? md_wakeup_thread+0x28/0x30 [md_mod]
[ 1320.265786]  [<ffffffffa00129ec>] mddev_suspend+0x1c/0x20 [md_mod]
[ 1320.265790]  [<ffffffffa003d3c9>] raid1_add_disk+0xd9/0x1e0 [raid1]
[ 1320.265797]  [<ffffffffa001739d>] remove_and_add_spares+0x26d/0x340 [md_mod]
[ 1320.265805]  [<ffffffffa001c8db>] md_check_recovery+0x3fb/0x4c0 [md_mod]
[ 1320.265809]  [<ffffffffa003fd05>] raid1d+0x55/0x1010 [raid1]
[ 1320.265813]  [<ffffffff8196e949>] ? schedule+0x49/0x90
[ 1320.265817]  [<ffffffff8197271e>] ? schedule_timeout+0x19e/0x260
[ 1320.265822]  [<ffffffff810bbd55>] ? preempt_count_add+0x85/0xd0
[ 1320.265826]  [<ffffffff81973358>] ? _raw_write_unlock_irqrestore+0x18/0x30
[ 1320.265830]  [<ffffffff8197337e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 1320.265836]  [<ffffffffa0010e22>] md_thread+0x112/0x120 [md_mod]
[ 1320.265840]  [<ffffffff810d4bd0>] ? wake_atomic_t_function+0x60/0x60
[ 1320.265846]  [<ffffffffa0010d10>] ? find_pers+0x70/0x70 [md_mod]
[ 1320.265851]  [<ffffffff810b6dc9>] kthread+0xc9/0xe0
[ 1320.265855]  [<ffffffff8197333e>] ? _raw_spin_unlock_irq+0xe/0x10
[ 1320.265859]  [<ffffffff810b6d00>] ? kthread_worker_fn+0x170/0x170
[ 1320.265864]  [<ffffffff81973c5f>] ret_from_fork+0x3f/0x70
[ 1320.265868]  [<ffffffff810b6d00>] ? kthread_worker_fn+0x170/0x170
and then the next hung message shows up.

After the Ctrl-C, the third invocation is still running, and sha1sum status is D.

comment:7 by ken@…, 9 years ago

I've contacted the old and new maintainers - I'm not sure about the transitional arrangements, so I haven't reported the bug at github.

comment:8 by bdubbs@…, 9 years ago

Can we disable this test in the test suite?

comment:9 by ken@…, 9 years ago

Dunno - and for me it fails on an earlier test than for Fernando. I prefer to wait a bit, to see if there is any response from upstream - hardly anybody runs the tests, so I don't think there will be a catastrophic problem if we leave the instructions for running ./test in the book for a few days (anyone running it must already NOT be using mdadm for real).
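If we did decide to disable a single test, something as crude as moving the offending script out of tests/ before running ./test ought to do it (a rough sketch; the test that hangs differs between machines):

# Set the hanging test aside, run the rest, then restore it.
mv tests/01replace ../01replace.disabled
./test
mv ../01replace.disabled tests/01replace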

I guess that either an extra kernel config entry is (now) required, or perhaps there was a kernel regression (I was going to say 'or a gcc issue', but sha1sum on a small file seems to work ok).

OTOH, I have no objection to recommending NOT running the tests at the moment, because some hang and cannot be killed. I admit to being slightly concerned that some tests in the previous version failed for unexplained reasons (I was hoping to look at that if the tests completed), because on my server I rely on mdadm working in RAID-1.

On _this_ test machine I'm tempted to add two small drives to test software RAID, but I've only just moved to one bigger drive. When I added another OS on a second drive, grub and linux disagreed about the primary drive (in linux, sda became sdb, and then when I added a third drive to copy everything to, linux thought the original system was sdc), so I'm not looking forward to trying to add drives. Nor do I yet have any thoughts about a practical set of "does it work" tests - I would only be able to do RAID 0 and RAID 1 at the moment (two drives) - and anyway I'm hoping to get another (extra) test machine delivered in a few hours. Getting that tested and into use will take time.

comment:10 by ken@…, 9 years ago

I got a reply from Neil Brown; this problem has possibly been around for some time. He said the machine can be unblocked by running cat suspend_lo > suspend_hi (in the /sys/block/mdXXX/md directory) - that was in the context of the replace test which hung for Fernando, so I'm not sure if my problem is identical.

I had powered it off in the meantime (couldn't s2ram). He'll try to find some time to look at this later in the month, and he thinks mdadm is losing track of where it is up to.

I'll try to get back to this to see how consistent it is, and perhaps try 3.3.4 for comparison.
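In concrete terms, what he suggested amounts to something like this (untested here, and the md device name will vary):

# Run as root; replace md0 with whatever array the test created.
cd /sys/block/md0/md
cat suspend_lo > suspend_hi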

Last edited 9 years ago by ken@…

comment:11 by Pierre Labastie, 9 years ago (in reply to comment:8)

Replying to bdubbs@…:

Can we disable this test in the test suite?

I tried the test suite too, and a different test hangs for me. ps aux showed that it stopped while running mkfs.ext3.

Thanks to Ken for looking at this.

comment:12 by Fernando de Oliveira, 9 years ago

Thanks, Ken and Pierre, for confirming that the problem is not mine alone.

It did not hang with 3.3.4, and I considered the results acceptable: 9 out of 122 tests failed, which is less than 7.4%:

Number of succeeded tests:
113

Number of FAILED tests:
9

FAILED tests:
tests/00raid1... FAILED - see ...
tests/02r1add... FAILED - see ...
tests/07autoassemble... FAILED - see ...
tests/07changelevels... FAILED - see ...
tests/07revert-grow... FAILED - see ...
tests/07revert-inplace... FAILED - see ...
tests/07revert-shrink... FAILED - see ...
tests/07testreshape5... FAILED - see ...
tests/10ddf-incremental-wrong-order... FAILED - see ...

real	62m5.807s
user	1m56.335s
sys	0m58.734s

1,9M	../DEST-mdadm-3.3.4
9,4M	../mdadm-3.3.4
12M	total

(1 SBU = 151 s)

(In the list above, "see test-logs/log-00raid1 for details" and similar messages were shortened to "see ...".)

Date test was executed:

2015.08.05-09h48m59s

gcc-5.1.0

linux-4.1.4

I'm attaching the test log.

When I finish my work today, I plan to run the mdadm-3.3.4 tests on this new system, following your idea.

by Fernando de Oliveira, 9 years ago

Tests run on LFS-7.7, at 2015.08.05

comment:13 by ken@…, 9 years ago

I gave 3.4 another go; it hangs in what seems to be the same place. So I took a look in /sys/block: the directory there is md0/, BUT both md0/md/suspend_lo and md0/md/suspend_hi are 0!

On the assumption that they should have some numeric value, I tried echoing 3000 and then '3000' to them, but those commands hung until I hit Ctrl-C, and then I got a message about an interrupted system call.

I also looked at my server which is running 3.3.4, and there both are also 0.

Looks like I need to reboot to try testing 3.3.4. And then the box locked up.

comment:14 by Fernando de Oliveira, 9 years ago

I'm running the tests. They just passed through

tests/10ddf-incremental-wrong-order... FAILED - see ...

It failed, but did not hang.

Now executing tests/11spare-migration...

It seems to have fewer failures than on LFS-7.7:

tests/07revert-shrink... succeeded

but above it failed for 3.4 (comment:12)

Last edited 9 years ago by Fernando de Oliveira

comment:15 by Fernando de Oliveira, 9 years ago

Edited previous comment:

s/commeny/comment/

comment:16 by ken@…, 9 years ago

I've just run the tests on 3.3.4: again, they hang for me in 01replace with a dead sha1sum process. LOL, I suspect a kernel .config difference might account for some of this - or else it must be something really weird: do I need to sacrifice kittens to get the tests to run?

I have not tested in the past because my desktop machines did not have RAID enabled. Yesterday, I added almost all the BLFS config options throughout the book, except for nfs v4 and 4.1 (I tried those, could not mount my nfs v3 shares in fstab, and it did not like ,nfsver=3, so I took them out again).

For MD and its neighbour DM I have the following:

CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
# CONFIG_MD_LINEAR is not set
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_MQ_DEFAULT is not set
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=m
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
# CONFIG_DM_DEBUG_BLOCK_STACK_TRACING is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
# CONFIG_DM_CACHE is not set
# CONFIG_DM_ERA is not set
CONFIG_DM_MIRROR=m
# CONFIG_DM_LOG_USERSPACE is not set
CONFIG_DM_RAID=m
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

I've already noted from dmesg or the system log that /dev/loop0..5 get used, so I suppose those need to be added to the mdadm page if the tests are to be run. Anything in my MD or DM stuff above that looks wrong?
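For the loop devices, the relevant kernel option is CONFIG_BLK_DEV_LOOP; I assume something like this is needed, either built in or loaded as a module before ./test runs:

CONFIG_BLK_DEV_LOOP=y
# or, if built as a module:
CONFIG_BLK_DEV_LOOP=m
# and then, before the tests:
modprobe loop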

comment:17 by Fernando de Oliveira, 9 years ago

I am almost blind at this time, so I will perhaps give good news, the first being that 3.3.4 does not hang.

Just finished the tests:

real	63m28.510s
user	1m58.382s
sys	1m35.447s

2,0M	../DEST-mdadm-3.3.4
9,8M	../mdadm-3.3.4
12M	total

tests/00raid1... FAILED - see ...
tests/02r1add... FAILED - see ...
tests/07autoassemble... FAILED - see ...
tests/07changelevels... FAILED - see ...
tests/07revert-grow... FAILED - see ...
tests/07revert-inplace... FAILED - see ...
tests/10ddf-fail-two-spares... FAILED - see ...
tests/10ddf-incremental-wrong-order... FAILED - see ...

1 SBU = 150s

comment:18 by Fernando de Oliveira, 9 years ago

Number of succeeded tests: 114

Number of FAILED tests: 8

Failed ≈ 6.557377 %

comment:19 by Pierre Labastie, 9 years ago

I have many more CONFIG_DM_* things enabled in my kernel, because I needed them when I tested LVM back in October or November. I have CONFIG_MD_MULTIPATH set, and my tests hang at a multipath test (I erased the log, sorry).

Ken, I see you have a lot of CONFIG_ switches enabled as modules. Have you tried loading those modules before starting the tests? I remember I had to do that before starting the tests for LVM: not all tests load the relevant module (or check that it is loaded).
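As a sketch, loading everything up front before starting ./test would look something like this (adjust the list to whatever is built as a module on your kernel):

# Load loop and the md personalities so the tests do not depend
# on automatic module loading.
for m in loop md_mod raid0 raid1 raid10 raid456; do
    modprobe $m
done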

Version 0, edited 9 years ago by Pierre Labastie

comment:20 by Fernando de Oliveira, 9 years ago

$ grep -E 'MD_|DM_' /boot/config-4.2.3 | grep -vE '^\#|AMD|PMD'
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_DM_BUFIO=y
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=y
CONFIG_DM_MIRROR=y
CONFIG_DM_LOG_USERSPACE=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_UEVENT=y

comment:21 by ken@…, 9 years ago (in reply to comment:19)

Replying to pierre.labastie:

I have many more CONFIG_DM_* things enabled in my kernel, because I needed them when I tested LVM, back in October or November. I have CONFIG_MD_MULTIPATH set, and my tests hang at a multipath test (I do not know exactly which, but I erased the log, sorry).

Ken, I see you have a lot of CONFIG_ switches enabled as modules. Have you tried to load those modules before starting the tests? I remember I had to do that before starting the tests for LVM: not all tests load the relevant module (or check that it is loaded).

I assumed the kernel would know what it was doing and load them when necessary. Certainly, some were still loaded when I shut down. I'll try turning things on when I come back to this - it might take a day or so - but I suppose I'll not turn on MULTIPATH. Thanks.

comment:22 by ken@…, 9 years ago (in reply to comment:20)

Replying to fo:

$ grep -E 'MD_|DM_' /boot/config-4.2.3 | grep -vE '^\#|AMD|PMD'
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_DM_BUFIO=y
CONFIG_DM_BIO_PRISON=y
CONFIG_DM_PERSISTENT_DATA=y
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_THIN_PROVISIONING=y
CONFIG_DM_MIRROR=y
CONFIG_DM_LOG_USERSPACE=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
CONFIG_DM_UEVENT=y

And thanks for those. Again, I'll play with my .config but it might be a day or two - I want to get my LFS scripts up to date (still in December), then use my main monitor to see if the new box works.

comment:23 by Fernando de Oliveira, 9 years ago

Resolution: fixed
Status: new → closed

Fixed at r16905.

comment:24 by Fernando de Oliveira, 9 years ago

Resolution: fixed
Status: closed → reopened

Sorry, it was a mistake.

comment:25 by ken@…, 9 years ago

Hmm, I tried changing my .config last night, and I now have the following for MD:

ken@deluxe /scratch/ken/linux-4.4 $ grep _MD_ .config
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set

But 3.3.4 still hangs at the sha1sum in 01replace, although the first two failures (00linear, 00names) now pass. For the moment, I'm giving up on this. The status field in /sys implied that whatever part of mdadm had been last used had completed.

comment:26 by bdubbs@…, 9 years ago

I'm in the process of getting several new drives for my -dev system so I can properly test mdadm, lvm, btrfs, jfs, xfs, etc. I'm not sure when it will be ready. I have to figure out how to get power to the new drives (I think I need a custom cable).

Once I get things working, I'll post results to the -dev mailing list.

Last edited 9 years ago by bdubbs@…

comment:27 by ken@…, 9 years ago

Neil has posted a fix to linux-raid; meanwhile, he tells me that the hang in tests/01replace was a kernel problem, fixed in 4.4.2 by:

Commit: 1501efadc524 ("md/raid: only permit hot-add of compatible integrity profiles")
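A quick way to confirm the backport is present (a sketch, assuming a linux-stable checkout with the 4.4.x tags):

# Show the stable commit that carries the fix, if it landed in 4.4.2:
git log --oneline v4.4.1..v4.4.2 | grep -i 'integrity profiles'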

comment:28 by bdubbs@…, 9 years ago

Good to know, Ken. I've been looking at the mdadm tests and have made a couple of posts to the linux-raid mailing list. I was able to work around at least one test failure, but I need to put 4.4.2 into LFS and test. I do plan on using 4.4.2 in lfs-7.9-rc2.

comment:29 by bdubbs@…, 9 years ago

I have tested mdadm on a 4.4.2 kernel. The good news is that the hangs are gone. The bad news is that there are a lot of failures: 72 tests succeed, but 51 fail.

At least two tests fail because I do not have strace on the system.

The tests take a little over an hour. I think the test time is only partially processor dependent.

One change I have suggested is:

--- test.orig   2016-02-20 17:37:53.521197001 -0600
+++ test        2016-02-20 20:02:25.083787550 -0600
@@ -332,7 +332,7 @@
       echo "FAILED - see $logdir/$log for details"
       _fail=1
     fi
-    if [ "$savelogs" == "1" ]; then
+    if [ "$savelogs" == "1" -a -e $targetdir/log ]; then
       cp $targetdir/log $logdir/$_basename.log
     fi
     if [ "$_fail" == "1" -a "$exitonerror" == "1" ]; then 

That eliminates a bogus error from trying to copy a file that has already been mv'ed after a test failure.
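To try it, the diff above can be saved in the unpacked mdadm directory (test-savelogs.patch is just a placeholder name) and applied in the usual way:

# Apply the change to the ./test driver from the mdadm source tree.
patch -p0 < test-savelogs.patch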

I'm going to go ahead and tag this for 7.9 with some additional comments, but will leave the ticket open.

comment:30 by bdubbs@…, 9 years ago

Resolution: fixed
Status: reopened → closed

No new developments here. Marking as fixed.
