From atp at piskorski.com Wed Mar 4 13:31:50 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Wed, 4 Mar 2009 16:31:50 -0500 Subject: [Warewulf] debian packages for peceus 1.4.4 and warewulf 2.9 In-Reply-To: <1232754136.4319.95.camel@loup.ece.ucsb.edu> References: <1232754136.4319.95.camel@loup.ece.ucsb.edu> Message-ID: <20090304213150.GA38034@piskorski.com> On Fri, Jan 23, 2009 at 03:42:16PM -0800, Kristian Kvilekval wrote: > > I've created a "first draft" of packages for debian of both > perceus-1.4.4 and warewulf.. These have been successfully installed > on our 64 node cluster. Kris, what version of Debian are you running? Do you know of anyone using Perceus or Warewulf on Ubuntu? -- Andrew Piskorski http://www.piskorski.com/ From kris at cs.ucsb.edu Wed Mar 4 14:00:50 2009 From: kris at cs.ucsb.edu (Kristian Kvilekval) Date: Wed, 04 Mar 2009 14:00:50 -0800 Subject: [Warewulf] debian packages for peceus 1.4.4 and warewulf 2.9 In-Reply-To: <20090304213150.GA38034@piskorski.com> References: <1232754136.4319.95.camel@loup.ece.ucsb.edu> <20090304213150.GA38034@piskorski.com> Message-ID: <1236204050.4321.48.camel@loup.ece.ucsb.edu> We (were) running testing (debian lenny) before its release. The node scripts and vnfs modules are for lenny. The original packages were based on some work done for Ubuntu. There were references to Ubuntu that I did not remove. Not sure if anybody is actually using them with Ubuntu. Kris From stefan at mdy.univie.ac.at Wed Mar 18 07:48:34 2009 From: stefan at mdy.univie.ac.at (Stefan Boresch) Date: Wed, 18 Mar 2009 15:48:34 +0100 Subject: [Warewulf] Static vs. dynamic IP addresses Message-ID: <20090318144834.GF22110@loop.mdy.univie.ac.at> [Note: all this is with caos-nsa, so this may actually be a caos issue ???] While searching for something unrelated in the logfiles, I noted that nodes in my cluster regularly (appr. every 20 minutes) renew their IP address from the clusterhead. 
This confuses me since I thought the default in perceus 1.4/1.5 were static addresses (once the node is up -- it's clear that upon boot and provisioning the node has to contact the server!) I use the ipaddr module, and it definitely writes a correct /etc/sysconfig/nics/eth0 file ... On the other hand, I see no dhcp related processes running, but "something" definitely contacts my server regularly ... Mar 18 11:44:10 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 Mar 18 11:44:10 rs02 perceus[3214]: Provisioning 'mdy09-3' now... ==> that one I understand ..., but then [snip] Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 A (possibly) related problem: For testing/debugging purposes I have a machine in my perceus subnet (192.168.40.*), which is *not* under perceus control, i.e., which boots from the hard disk, has its OS on the hard disk and which is configured to have a static interface (192.168.40.211). After booting, ifconfig and ip addr show this address, but during boot, as well as during normal operation, the machine also sends dhcp requests to the (perceus) server, which dutifully assigns it the next free value (192.168.40.30 in my case). Interestingly, this results in a ghost machine 192.168.40.30 which I can ping, but which isn't accessible otherwise. The 192.168.40.211 address functions normally. Nevertheless, I don't consider this very satisfactory, and I don't understand why any dhcp requests got/get sent out in the first place ... Note that the 192.168.40.211 address is not in the range under control of the perceus-dnsmasq daemon ... 
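One way to check how often a given MAC address really renews is to measure the spacing between DHCPREQUEST entries in the server log. The following is only a minimal sketch: it assumes syslog-style lines shaped like the perceus-dnsmasq excerpts above, and the names `LOG_LINES`, `REQUEST_RE` and `renewal_intervals`, as well as the `year` default, are made up for illustration (syslog timestamps carry no year).

```python
import re
from collections import defaultdict
from datetime import datetime

# Sample lines copied from the log excerpt above.
LOG_LINES = [
    "Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2",
    "Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3",
    "Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2",
    "Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3",
]

# Hypothetical pattern: timestamp, host, "<something>dnsmasq[pid]:", then a DHCPREQUEST entry.
REQUEST_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+\S*dnsmasq\[\d+\]:\s+"
    r"DHCPREQUEST\(\w+\)\s+(?P<ip>\S+)\s+(?P<mac>[0-9a-f:]{17})"
)


def renewal_intervals(lines, year=2009):
    """Return {mac: [timedelta, ...]} between consecutive DHCPREQUEST lines."""
    seen = defaultdict(list)
    for line in lines:
        match = REQUEST_RE.match(line)
        if match is None:
            continue  # DHCPACK and unrelated lines are ignored
        stamp = datetime.strptime(f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S")
        seen[match.group("mac")].append(stamp)
    return {
        mac: [later - earlier for earlier, later in zip(times, times[1:])]
        for mac, times in seen.items()
    }


if __name__ == "__main__":
    for mac, gaps in renewal_intervals(LOG_LINES).items():
        print(mac, [str(gap) for gap in gaps])
```

For the two requests quoted above this reports a gap of 0:27:46, so the renewals in that excerpt are closer to 28 minutes apart than 20.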
Thanks, Stefan -- Stefan Boresch Institute for Computational Biological Chemistry University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria Phone: -43-1-427752715 Fax: -43-1-427752790 From darr at ocean.washington.edu Wed Mar 18 10:20:31 2009 From: darr at ocean.washington.edu (david darr) Date: Wed, 18 Mar 2009 10:20:31 -0700 Subject: [Warewulf] adding packages to VNFS capsule Message-ID: <6a8af7e80903181020u13ebb8fch869121914ad3fca7@mail.gmail.com> Hi, I think this is probably straight forward -- I just haven't figured out how to do it. I need gfortran (and associated libs) in my CentOS capsule... which has gcc but not gfortran for some reason. The Caos capsules that I have contain both gcc and gfortran. I tried to install gcc+gfortran into my CentOS mounted vnfs image using "yum --installroot" and it didn't work. I think (but don't know) that this is because the master is running RHEL and it has the wrong repos for CentOS (?). I thought the easiest way to fix this (and possibly useful for other future use) would be to add this to the centOS genroot script. Could someone explain how to do this? I looked at the script and it wasn't clear to me how to add specific packages. thanks, David -- -------------------------------------------------------------------------- David Darr - ph: (206) 616-4953 - email: darr at ocean.washington.edu School of Oceanography - Univ. of Washington - Seattle, WA. 98195 -------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090318/2588dc9a/attachment.html From jal at mdacorporation.com Wed Mar 18 11:40:42 2009 From: jal at mdacorporation.com (John LLOYD) Date: Wed, 18 Mar 2009 11:40:42 -0700 Subject: [Warewulf] adding packages to VNFS capsule In-Reply-To: <6a8af7e80903181020u13ebb8fch869121914ad3fca7@mail.gmail.com> References: <6a8af7e80903181020u13ebb8fch869121914ad3fca7@mail.gmail.com> Message-ID: <57F67688A8D72449AC80164DA982083104DE1288@VMXYVR1.ds.mda.ca> I tried, and failed, to get centos clients to work from redhat, so went to a pure centos environment. # rpm --root /mnt/vnfs-build-3-rev-45-i386 -ivh /path/to/gcc-gfortran.rpm or # vi /mnt/vnfs-build-3-rev-45-i386/etc/yum.conf <--- add your centos repositories there # yum --installroot=/mnt/vnfs-build-3-rev-45-i386 list available '*gfort*' # yum --installroot=/mnt/vnfs-build-3-rev-45-i386 install gcc-gfortran.i386 worked for me. --John _____ From: warewulf-bounces at caoslinux.org [mailto:warewulf-bounces at caoslinux.org] On Behalf Of david darr Sent: Wednesday, March 18, 2009 10:21 AM To: The Warewulf Cluster Toolkit Subject: [Warewulf] adding packages to VNFS capsule Hi, I think this is probably straight forward -- I just haven't figured out how to do it. I need gfortran (and associated libs) in my CentOS capsule... which has gcc but not gfortran for some reason. The Caos capsules that I have contain both gcc and gfortran. I tried to install gcc+gfortran into my CentOS mounted vnfs image using "yum --installroot" and it didn't work. I think (but don't know) that this is because the master is running RHEL and it has the wrong repos for CentOS (?). I thought the easiest way to fix this (and possibly useful for other future use) would be to add this to the centOS genroot script. Could someone explain how to do this? I looked at the script and it wasn't clear to me how to add specific packages. 
thanks, David -- -------------------------------------------------------------------------- David Darr - ph: (206) 616-4953 - email: darr at ocean.washington.edu School of Oceanography - Univ. of Washington - Seattle, WA. 98195 -------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090318/038d535e/attachment.html From david.darr at gmail.com Wed Mar 18 12:08:28 2009 From: david.darr at gmail.com (david darr) Date: Wed, 18 Mar 2009 12:08:28 -0700 Subject: [Warewulf] adding packages to VNFS capsule In-Reply-To: <57F67688A8D72449AC80164DA982083104DE1288@VMXYVR1.ds.mda.ca> References: <6a8af7e80903181020u13ebb8fch869121914ad3fca7@mail.gmail.com> <57F67688A8D72449AC80164DA982083104DE1288@VMXYVR1.ds.mda.ca> Message-ID: <6a8af7e80903181208j442f1d9bxd7161b38b02f962b@mail.gmail.com> 2009/3/18 John LLOYD > I tried, and failed, to get centos clients to work from redhat, so went > to a pure centos environment. > Interesting because I tried and failed to get centos clients to work from centos... and am having much more luck with centos clients under redhat. Thanks for the additional info, will try it.... -------------------------------------------------------------------------- David Darr - ph: (206) 616-4953 - email: darr at ocean.washington.edu School of Oceanography - Univ. of Washington - Seattle, WA. 98195 -------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090318/4393e8f7/attachment.html From lorin at east.isi.edu Wed Mar 18 12:22:48 2009 From: lorin at east.isi.edu (Lorin Hochstein) Date: Wed, 18 Mar 2009 15:22:48 -0400 Subject: [Warewulf] ARP lookups don't work in my net config In-Reply-To: <571f1a060901110750l5a6e4edcqa0d3d1709a6af78d@mail.gmail.com> References: <571f1a060812191150l204dada8r126d4f4fd9e6256b@mail.gmail.com> <571f1a060901110750l5a6e4edcqa0d3d1709a6af78d@mail.gmail.com> Message-ID: Greg, Yup, this seems to work for me in 1.5.0. I can now provision across different subnets. (Took me a couple months, but I finally upgraded!). Thanks, Lorin On Jan 11, 2009, at 10:50 AM, Greg Kurtzer wrote: > Can you try the 1.5.0 release and add "enable-nodeid" to the append > line in /var/lib/perceus/tftp/pxelinux.cfg/default? > > Thanks, > Greg > > On Fri, Dec 19, 2008 at 11:54 AM, Lorin Hochstein > wrote: >> Sure, I'd be happy to do so. My current workaround plan is to >> hardcode >> the MACs into perceusd. Ugly, but it should get me past this hurdle. >> >> Lorin >> >> >> On Dec 19, 2008, at 2:50 PM, Greg Kurtzer wrote: >> >>> Yes, we have a workaround in progress for the 1.5 development tree >>> for >>> exactly this. :) >>> >>> Are you open to test a prerelease when it is available? >>> >>> Thank you, >>> >>> Greg >>> >>> >>> >>> On Fri, Dec 19, 2008 at 11:10 AM, Lorin Hochstein >>> wrote: >>>> Hello, >>>> I'm using Perceus 1.4.4 on a system where the management node is >>>> on a >>>> separate subnet from the compute nodes, and I don't have control >>>> over the >>>> network topology. The network is configured properly so that the >>>> compute >>>> nodes can access the DHCP server running on the management node, so >>>> that >>>> part's OK. 
>>>> The problem is that the management node cannot retrieve the MAC >>>> address of >>>> the compute nodes via an ARP request, so perceusd gives the >>>> following error: >>>> DEBUG [perceusd/342/(eval)]: Remote address: 10.11.72.1 >>>> (168511489) >>>> DEBUG [perceusd/344/(eval)]: Parsing command arguments 'init >>>> ' >>>> DEBUG [perceusd/347/(eval)]: Parsing provisiond's arguments () >>>> DEBUG [perceusd/362/(eval)]: Doing arp lookup on '10.11.72.1' >>>> WARN [perceusd/366/(eval)]: Could not resolve node's ID/MAC addr >>>> from >>>> 10.11.72.1, skipping. >>>> >>>> Is there anything I can do in this situation? >>>> >>>> Lorin >>>> >>>> _______________________________________________ >>>> Warewulf mailing list >>>> Warewulf at caoslinux.org >>>> http://lists.caosity.org/mailman/listinfo/warewulf >>>> >>>> >>> >>> >>> >>> -- >>> Greg Kurtzer >>> http://www.infiscale.com/ >>> http://www.runlevelzero.net/ >>> http://www.perceus.org/ >>> http://www.caoslinux.org/ >>> _______________________________________________ >>> Warewulf mailing list >>> Warewulf at caoslinux.org >>> http://lists.caosity.org/mailman/listinfo/warewulf >> >> _______________________________________________ >> Warewulf mailing list >> Warewulf at caoslinux.org >> http://lists.caosity.org/mailman/listinfo/warewulf >> > > > > -- > Greg Kurtzer > http://www.infiscale.com/ > http://www.perceus.org/ > http://www.caoslinux.org/ > _______________________________________________ > Warewulf mailing list > Warewulf at caoslinux.org > http://lists.caosity.org/mailman/listinfo/warewulf -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 2419 bytes Desc: not available Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090318/46b3c722/attachment.bin From chris.hunter at yale.edu Thu Mar 19 20:11:35 2009 From: chris.hunter at yale.edu (Chris Hunter) Date: Thu, 19 Mar 2009 23:11:35 -0400 Subject: [Warewulf] Static vs. dynamic IP addresses Message-ID: <49C30967.5050103@yale.edu> I think the default DHCP lease time in dnsmasq is one hour. A DHCP client normally sends a DHCPREQUEST to renew its lease once about half of the lease time has elapsed (the server does not initiate renewals), so renewals roughly every 30 minutes would fit that default. You can change the lease time in the dnsmasq options. I use a 30d lease time for static IPs. > Message: 1 > Date: Wed, 18 Mar 2009 15:48:34 +0100 > From: stefan at mdy.univie.ac.at (Stefan Boresch) > Subject: [Warewulf] Static vs. dynamic IP addresses > To: warewulf at caoslinux.org > Message-ID: <20090318144834.GF22110 at loop.mdy.univie.ac.at> > Content-Type: text/plain; charset=us-ascii > > [Note: all this is with caos-nsa, so this may actually be a caos issue ???] > > While searching for something unrelated in the logfiles, I noted > that nodes in my cluster regularly (appr. every 20 minutes) renew their > IP address from the clusterhead. This confuses me since I thought the > default in perceus 1.4/1.5 were static addresses (once the node is up -- > it's clear that upon boot and provisioning the node has to contact the > server!) I use the ipaddr module, and it definitely writes a > correct /etc/sysconfig/nics/eth0 file ... On the other hand, I see > no dhcp related processes running, but "something" definitely contacts > my server regularly ... > > Mar 18 11:44:10 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > Mar 18 11:44:10 rs02 perceus[3214]: Provisioning 'mdy09-3' now... 
> ==> that one I understand ..., but then > [snip] > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > > A (possibly) related problem: For testing/debugging purposes I have a > machine in my perceus subnet (192.168.40.*), which is *not* under > perceus control, i.e., which boots from the hard disk, has its OS on > the hard disk and which is configured to have a static interface > (192.168.40.211). After booting, ifconfig and ip addr show this > address, but during boot, as well as during normal operation, the > machine also sends dhcp requests to the (perceus) server, which > dutifully assigns it the next free value (192.168.40.30 in my case). > Interestingly, this results in a ghost machine 192.168.40.30 which I can > ping, but which isn't accessible otherwise. The 192.168.40.211 address > functions normally. Nevertheless, I don't consider this very > satisfactory, and I don't understand why any dhcp requests got/get sent out > in the first place ... Note that the 192.168.40.211 address is not > in the range under control of the perceus-dnsmasq daemon ... > > Thanks, > > Stefan > > -- Stefan Boresch Institute for Computational Biological Chemistry University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria Phone: -43-1-427752715 Fax: -43-1-427752790 From gmkurtzer at gmail.com Thu Mar 19 21:29:18 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 19 Mar 2009 21:29:18 -0700 Subject: [Warewulf] Static vs. 
dynamic IP addresses In-Reply-To: <20090318144834.GF22110@loop.mdy.univie.ac.at> References: <20090318144834.GF22110@loop.mdy.univie.ac.at> Message-ID: <571f1a060903192129u5511079en8d8a284b528f4b1c@mail.gmail.com> Have you enabled the ipaddr perceus module? On Wed, Mar 18, 2009 at 7:48 AM, Stefan Boresch wrote: > [Note: all this is with caos-nsa, so this may actually be a caos issue ???] > > While searching for something unrelated in the logfiles, I noted > that nodes in my cluster regularly (appr. every 20 minutes) renew their > IP address from the clusterhead. This confuses me since I thought the > default in perceus 1.4/1.5 were static addresses (once the node is up -- > it's clear that upon boot and provisioning the node has to contact the > server!) I use the ipaddr module, and it definitely writes a > correct /etc/sysconfig/nics/eth0 file ... On the other hand, I see > no dhcp related processes running, but "something" definitely contacts > my server regularly ... > > Mar 18 11:44:10 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > Mar 18 11:44:10 rs02 perceus[3214]: Provisioning 'mdy09-3' now... > ==> that one I understand ..., but then > [snip] > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > > A (possibly) related problem: For testing/debugging purposes I have a > machine in my perceus subnet (192.168.40.*), which is *not* under > perceus control, i.e., which boots from the hard disk, has its OS on > the hard disk and which is configured to have a static interface > (192.168.40.211). 
After booting, ifconfig and ip addr show this > address, but during boot, as well as during normal operation, the > machine also sends dhcp requests to the (perceus) server, which > dutifully assigns it the next free value (192.168.40.30 in my case). > Interestingly, this results in a ghost machine 192.168.40.30 which I can > ping, but which isn't accessible otherwise. The 192.168.40.211 address > functions normally. Nevertheless, I don't consider this very > satisfactory, and I don't understand why any dhcp requests got/get sent out > in the first place ... Note that the 192.168.40.211 address is not > in the range under control of the perceus-dnsmasq daemon ... > > Thanks, > > Stefan > > -- > Stefan Boresch > Institute for Computational Biological Chemistry > University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria > Phone: -43-1-427752715 Fax: -43-1-427752790 > _______________________________________________ > Warewulf mailing list > Warewulf at caoslinux.org > http://lists.caosity.org/mailman/listinfo/warewulf > -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From gmkurtzer at gmail.com Thu Mar 19 21:52:00 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 19 Mar 2009 21:52:00 -0700 Subject: [Warewulf] ARP lookups don't work in my net config In-Reply-To: References: <571f1a060812191150l204dada8r126d4f4fd9e6256b@mail.gmail.com> <571f1a060901110750l5a6e4edcqa0d3d1709a6af78d@mail.gmail.com> Message-ID: <571f1a060903192152u2c351863k1d68adc98e9d77a5@mail.gmail.com> Excellent, thanks for following up! Greg 2009/3/18 Lorin Hochstein : > Greg, > > Yup, this seems to work for me in 1.5.0. I can now provision across > different subnets. > > (Took me a couple months, but I finally upgraded!). 
> > Thanks, > > Lorin > > On Jan 11, 2009, at 10:50 AM, Greg Kurtzer wrote: > >> Can you try the 1.5.0 release and add "enable-nodeid" to the append >> line in /var/lib/perceus/tftp/pxelinux.cfg/default? >> >> Thanks, >> Greg >> >> On Fri, Dec 19, 2008 at 11:54 AM, Lorin Hochstein >> wrote: >>> >>> Sure, I'd be happy to do so. My current workaround plan is to hardcode >>> the MACs into perceusd. Ugly, but it should get me past this hurdle. >>> >>> Lorin >>> >>> >>> On Dec 19, 2008, at 2:50 PM, Greg Kurtzer wrote: >>> >>>> Yes, we have a workaround in progress for the 1.5 development tree for >>>> exactly this. :) >>>> >>>> Are you open to test a prerelease when it is available? >>>> >>>> Thank you, >>>> >>>> Greg >>>> >>>> >>>> >>>> On Fri, Dec 19, 2008 at 11:10 AM, Lorin Hochstein >>>> wrote: >>>>> >>>>> Hello, >>>>> I'm using Perceus 1.4.4 on a system where the management node is on a >>>>> separate subnet from the compute nodes, and I don't have control >>>>> over the >>>>> network topology. The network is configured properly so that the >>>>> compute >>>>> nodes can access the DHCP server running on the management node, so >>>>> that >>>>> part's OK. >>>>> The problem is that the management node cannot retrieve the MAC >>>>> address of >>>>> the compute nodes via an ARP request, so perceusd gives the >>>>> following error: >>>>> DEBUG [perceusd/342/(eval)]: Remote address: 10.11.72.1 (168511489) >>>>> DEBUG [perceusd/344/(eval)]: Parsing command arguments 'init >>>>> ' >>>>> DEBUG [perceusd/347/(eval)]: Parsing provisiond's arguments () >>>>> DEBUG [perceusd/362/(eval)]: Doing arp lookup on '10.11.72.1' >>>>> WARN [perceusd/366/(eval)]: Could not resolve node's ID/MAC addr >>>>> from >>>>> 10.11.72.1, skipping. >>>>> >>>>> Is there anything I can do in this situation? 
>>>>> >>>>> Lorin >>>>> >>>>> _______________________________________________ >>>>> Warewulf mailing list >>>>> Warewulf at caoslinux.org >>>>> http://lists.caosity.org/mailman/listinfo/warewulf >>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> Greg Kurtzer >>>> http://www.infiscale.com/ >>>> http://www.runlevelzero.net/ >>>> http://www.perceus.org/ >>>> http://www.caoslinux.org/ >>>> _______________________________________________ >>>> Warewulf mailing list >>>> Warewulf at caoslinux.org >>>> http://lists.caosity.org/mailman/listinfo/warewulf >>> >>> _______________________________________________ >>> Warewulf mailing list >>> Warewulf at caoslinux.org >>> http://lists.caosity.org/mailman/listinfo/warewulf >>> >> >> >> >> -- >> Greg Kurtzer >> http://www.infiscale.com/ >> http://www.perceus.org/ >> http://www.caoslinux.org/ >> _______________________________________________ >> Warewulf mailing list >> Warewulf at caoslinux.org >> http://lists.caosity.org/mailman/listinfo/warewulf > > > _______________________________________________ > Warewulf mailing list > Warewulf at caoslinux.org > http://lists.caosity.org/mailman/listinfo/warewulf > > -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From stefan at mdy.univie.ac.at Fri Mar 20 01:11:58 2009 From: stefan at mdy.univie.ac.at (Stefan Boresch) Date: Fri, 20 Mar 2009 09:11:58 +0100 Subject: [Warewulf] Spurious dhcp requests solved [Re: Static vs. dynamic IP addresses] In-Reply-To: <571f1a060903192129u5511079en8d8a284b528f4b1c@mail.gmail.com> References: <20090318144834.GF22110@loop.mdy.univie.ac.at> <571f1a060903192129u5511079en8d8a284b528f4b1c@mail.gmail.com> Message-ID: <20090320081158.GJ22110@loop.mdy.univie.ac.at> Chris, Greg, thanks for the reply. The issue is solved. The Intel ME (management engine) sends out these dhcp requests, completely bypassing the operating system. 
Once you disable it, there are no more DHCP requests and no more duplicate packets with ping (apparently Intel ME finds it OK to answer pings on top of Linux as well). It only took me 2 days, helped for the last three hours by a much more experienced sysadmin, to eventually figure this out by chance ... by now I can laugh about the whole thing ;-) Best regards -- sorry for the false alarm! Stefan On Thu, Mar 19, 2009 at 09:29:18PM -0700, Greg Kurtzer wrote: > Have you enabled the ipaddr perceus module? > > > > On Wed, Mar 18, 2009 at 7:48 AM, Stefan Boresch wrote: > > [Note: all this is with caos-nsa, so this may actually be a caos issue ???] > > > > While searching for something unrelated in the logfiles, I noted > > that nodes in my cluster regularly (appr. every 20 minutes) renew their > > IP address from the clusterhead. This confuses me since I thought the > > default in perceus 1.4/1.5 were static addresses (once the node is up -- > > it's clear that upon boot and provisioning the node has to contact the > > server!) I use the ipaddr module, and it definitely writes a > > correct /etc/sysconfig/nics/eth0 file ... On the other hand, I see > > no dhcp related processes running, but "something" definitely contacts > > my server regularly ... > > > > Mar 18 11:44:10 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > > Mar 18 11:44:10 rs02 perceus[3214]: Provisioning 'mdy09-3' now... 
> > ==> that one I understand ..., but then > > [snip] > > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > > Mar 18 15:06:00 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPREQUEST(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 > > Mar 18 15:33:46 rs02 perceus-dnsmasq[3199]: DHCPACK(eth1) 192.168.40.28 00:1c:c0:6f:ad:e2 mdy09-3 > > > > A (possibly) related problem: For testing/debugging purposes I have a > > machine in my perceus subnet (192.168.40.*), which is *not* under > > perceus control, i.e., which boots from the hard disk, has its OS on > > the hard disk and which is configured to have a static interface > > (192.168.40.211). After booting, ifconfig and ip addr show this > > address, but during boot, as well as during normal operation, the > > machine also sends dhcp requests to the (perceus) server, which > > dutifully assigns it the next free value (192.168.40.30 in my case). > > Interestingly, this results in a ghost machine 192.168.40.30 which I can > > ping, but which isn't accessible otherwise. The 192.168.40.211 address > > functions normally. Nevertheless, I don't consider this very > > satisfactory, and I don't understand why any dhcp requests got/get sent out > > in the first place ... Note that the 192.168.40.211 address is not > > in the range under control of the perceus-dnsmasq daemon ... > > > > Thanks, > > > > Stefan > > > > -- > > Stefan Boresch > > Institute for Computational Biological Chemistry > > University of Vienna, Waehringerstr. 
17 A-1090 Vienna, Austria > > Phone: -43-1-427752715 Fax: 
-43-1-427752790 > > _______________________________________________ > > Warewulf mailing list > > Warewulf at caoslinux.org > > http://lists.caosity.org/mailman/listinfo/warewulf > > > > > > -- > Greg Kurtzer > http://www.infiscale.com/ > http://www.perceus.org/ > http://www.caoslinux.org/ > _______________________________________________ > Warewulf mailing list > Warewulf at caoslinux.org > http://lists.caosity.org/mailman/listinfo/warewulf > -- Stefan Boresch Institute for Computational Biological Chemistry University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria Phone: -43-1-427752715 Fax: -43-1-427752790 From lorin at east.isi.edu Fri Mar 20 05:56:13 2009 From: lorin at east.isi.edu (Lorin Hochstein) Date: Fri, 20 Mar 2009 08:56:13 -0400 Subject: [Warewulf] ARP lookups don't work in my net config In-Reply-To: <571f1a060903192152u2c351863k1d68adc98e9d77a5@mail.gmail.com> References: <571f1a060812191150l204dada8r126d4f4fd9e6256b@mail.gmail.com> <571f1a060901110750l5a6e4edcqa0d3d1709a6af78d@mail.gmail.com> <571f1a060903192152u2c351863k1d68adc98e9d77a5@mail.gmail.com> Message-ID: <8B7A3C04-959D-4259-91E9-1C7E0F5FD784@east.isi.edu> Quick note: I also had to uncomment the following line in /etc/init.d/provisiond so that ARP lookups don't happen when provisiond checks in with perceusd: #NODEID="nodeid=`ethinfo -aq eth0`" It might be helpful to mention this in the user guide. (Or, if it's there, I missed it). Thanks, Lorin On Mar 20, 2009, at 12:52 AM, Greg Kurtzer wrote: > Excellent, thanks for following up! > > Greg > > 2009/3/18 Lorin Hochstein : >> Greg, >> >> Yup, this seems to work for me in 1.5.0. I can now provision across >> different subnets. >> >> (Took me a couple months, but I finally upgraded!). >> >> Thanks, >> >> Lorin >> >> On Jan 11, 2009, at 10:50 AM, Greg Kurtzer wrote: >> >>> Can you try the 1.5.0 release and add "enable-nodeid" to the append >>> line in /var/lib/perceus/tftp/pxelinux.cfg/default? 
>>> >>> Thanks, >>> Greg >>> >>> On Fri, Dec 19, 2008 at 11:54 AM, Lorin Hochstein >> > >>> wrote: >>>> >>>> Sure, I'd be happy to do so. My current workaround plan is to >>>> hardcode >>>> the MACs into perceusd. Ugly, but it should get me past this >>>> hurdle. >>>> >>>> Lorin >>>> >>>> >>>> On Dec 19, 2008, at 2:50 PM, Greg Kurtzer wrote: >>>> >>>>> Yes, we have a workaround in progress for the 1.5 development >>>>> tree for >>>>> exactly this. :) >>>>> >>>>> Are you open to test a prerelease when it is available? >>>>> >>>>> Thank you, >>>>> >>>>> Greg >>>>> >>>>> >>>>> >>>>> On Fri, Dec 19, 2008 at 11:10 AM, Lorin Hochstein >>>>> wrote: >>>>>> >>>>>> Hello, >>>>>> I'm using Perceus 1.4.4 on a system where the management node >>>>>> is on a >>>>>> separate subnet from the compute nodes, and I don't have control >>>>>> over the >>>>>> network topology. The network is configured properly so that the >>>>>> compute >>>>>> nodes can access the DHCP server running on the management >>>>>> node, so >>>>>> that >>>>>> part's OK. >>>>>> The problem is that the management node cannot retrieve the MAC >>>>>> address of >>>>>> the compute nodes via an ARP request, so perceusd gives the >>>>>> following error: >>>>>> DEBUG [perceusd/342/(eval)]: Remote address: 10.11.72.1 >>>>>> (168511489) >>>>>> DEBUG [perceusd/344/(eval)]: Parsing command arguments 'init >>>>>> ' >>>>>> DEBUG [perceusd/347/(eval)]: Parsing provisiond's arguments () >>>>>> DEBUG [perceusd/362/(eval)]: Doing arp lookup on '10.11.72.1' >>>>>> WARN [perceusd/366/(eval)]: Could not resolve node's ID/MAC >>>>>> addr >>>>>> from >>>>>> 10.11.72.1, skipping. >>>>>> >>>>>> Is there anything I can do in this situation? 
>>>>>> >>>>>> Lorin >>>>>> >>>>>> _______________________________________________ >>>>>> Warewulf mailing list >>>>>> Warewulf at caoslinux.org >>>>>> http://lists.caosity.org/mailman/listinfo/warewulf >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Greg Kurtzer >>>>> http://www.infiscale.com/ >>>>> http://www.runlevelzero.net/ >>>>> http://www.perceus.org/ >>>>> http://www.caoslinux.org/ >>>>> _______________________________________________ >>>>> Warewulf mailing list >>>>> Warewulf at caoslinux.org >>>>> http://lists.caosity.org/mailman/listinfo/warewulf >>>> >>>> _______________________________________________ >>>> Warewulf mailing list >>>> Warewulf at caoslinux.org >>>> http://lists.caosity.org/mailman/listinfo/warewulf >>>> >>> >>> >>> >>> -- >>> Greg Kurtzer >>> http://www.infiscale.com/ >>> http://www.perceus.org/ >>> http://www.caoslinux.org/ >>> _______________________________________________ >>> Warewulf mailing list >>> Warewulf at caoslinux.org >>> http://lists.caosity.org/mailman/listinfo/warewulf >> >> >> _______________________________________________ >> Warewulf mailing list >> Warewulf at caoslinux.org >> http://lists.caosity.org/mailman/listinfo/warewulf >> >> > > > > -- > Greg Kurtzer > http://www.infiscale.com/ > http://www.perceus.org/ > http://www.caoslinux.org/ > _______________________________________________ > Warewulf mailing list > Warewulf at caoslinux.org > http://lists.caosity.org/mailman/listinfo/warewulf -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 2419 bytes Desc: not available Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090320/02b228e9/attachment.bin From zs.myth at gmail.com Mon Mar 23 07:47:12 2009 From: zs.myth at gmail.com (Zsolt Kovacs) Date: Mon, 23 Mar 2009 15:47:12 +0100 Subject: [Warewulf] problems with slurm scheduling jobs Message-ID: <84e2b6660903230747m36eae54aj92416466fbbf318a@mail.gmail.com> Hi, I have problems with slurm scheduling jobs on my cluster. I am a bit stuck; unfortunately my experience with clustering is rather limited, so I need your help to solve this problem. Here is what I have done so far: I have installed the newest CAOS NSA, I used all the default options and tools to configure it, and it seems to work fine. I could get 20 GFlops using HPL after basic tuning; wwtop showed that all nodes were used during the benchmark. I also set up slurm on the head node and on the computing nodes. When I switched on the nodes, Perceus updated the slurm.conf file as expected. I have also checked that sinfo works from all nodes. So when I switch on the cluster, sinfo reports that all nodes are idle. I can run a job on 1 node, but when I want to use 2 or more I get a 'Communication connection failure' message. So in principle I can use only n0000, but if I switch off n0000, n0001 will be used, and if I keep switching off nodes one-by-one, the next node will be used. I have attached a text file, which contains the commands and their output, and also the log from the slurmctld. In spite of this problem I would like to emphasize that Caos NSA and Perceus are very good tools, which enabled me to get a jump start with clustering! Thanks in advance, Zsolt -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/bc4d5445/attachment.html
-------------- next part --------------
==========================================
[kovax at csmith001 ~]$ sinfo -alN
Mon Mar 23 14:48:06 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle  2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
[kovax at csmith001 ~]$ sinfo -lN
Mon Mar 23 14:48:11 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle  2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
[kovax at csmith001 ~]$ sinfo -lN
Mon Mar 23 14:48:13 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle  2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$ sinfo -l
Mon Mar 23 14:48:26 2009
PARTITION AVAIL TIMELIMIT JOB_SIZE   ROOT SHARE GROUPS NODES STATE NODELIST
primary*  up    infinite  1-infinite no   NO    all    2     down* n[0002-0003]
primary*  up    infinite  1-infinite no   NO    all    2     idle  n[0000-0001]
[kovax at csmith001 ~]$ sinfo -N
NODELIST     NODES PARTITION STATE
n[0000-0001] 2     primary*  idle
n[0002-0003] 2     primary*  down*
[kovax at csmith001 ~]$ sinfo -lN
Mon Mar 23 14:48:35 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle  2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$ sinfo -lN
Mon Mar 23 14:49:00 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle  2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
primary*  up    infinite  2     idle* n[0000-0001]
primary*  up    infinite  2     down* n[0002-0003]
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$
[kovax at csmith001 ~]$ sinfo -lN
Mon Mar 23 14:49:20 2009
NODELIST     NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[0000-0001] 2     primary*  idle* 2    1:2:1 3815   256      1      (null)   none
n0002        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
n0003        1     primary*  down* 2    1:2:1 1876   256      1      (null)   Not responding [slur
=====================================
Mar 23 15:22:02 csmith001 slurmctld[2390]: node n0003 returned to service
Mar 23 15:22:04 csmith001 slurmctld[2390]: node n0002 returned to service
Mar 23 15:22:29 csmith001 slurmctld[2390]: node n0001 returned to service
Mar 23 15:23:21 csmith001 slurmctld[2390]: node n0000 returned to service
Mar 23 15:23:37 csmith001 slurmctld[2390]: _slurm_rpc_allocate_resources JobId=102 NodeList=n0000 usec=156
Mar 23 15:23:37 csmith001 slurmctld[2390]: _slurm_rpc_job_step_create: StepId=102.0 n0000 usec=1259
Mar 23 15:23:37 csmith001 slurmctld[2390]: _slurm_rpc_step_complete StepId=102.0 usec=28
Mar 23 15:23:37 csmith001 slurmctld[2390]: completing job 102
Mar 23 15:23:37 csmith001 slurmctld[2390]: job_complete for JobId=102 successful
Mar 23 15:23:40 csmith001 slurmctld[2390]: _slurm_rpc_allocate_resources JobId=103 NodeList=n[0000-0001] usec=155
Mar 23 15:23:40 csmith001 slurmctld[2390]: _slurm_rpc_job_step_create: StepId=103.0 n[0000-0001] usec=1291
Mar 23 15:23:40 csmith001 slurmctld[2390]: completing job 103
Mar 23 15:23:40 csmith001 slurmctld[2390]: job_complete for JobId=103 successful
Mar 23 15:23:40 csmith001 slurmctld[2390]: error: slurm_msg_sendto: Transport endpoint is not connected
Mar 23 15:23:54 csmith001 slurmctld[2390]: _slurm_rpc_allocate_resources JobId=104 NodeList=n[0000,0002-0003] usec=175
Mar 23 15:23:54 csmith001 slurmctld[2390]: _slurm_rpc_job_step_create: StepId=104.0 n[0000,0002-0003] usec=1305
Mar 23 15:23:54 csmith001 slurmctld[2390]: completing job 104
Mar 23 15:23:54 csmith001 slurmctld[2390]: job_complete for JobId=104 successful
Mar 23 15:23:55 csmith001 slurmctld[2390]: Node n0001 now responding
Mar 23 15:24:07 csmith001 slurmctld[2390]: _slurm_rpc_allocate_resources JobId=105 NodeList=(null) usec=97
Mar 23 15:24:34 csmith001 slurmctld[2390]: Resending TERMINATE_JOB request JobId=103 Nodelist=n[0000-0001]
Mar 23 15:25:32 csmith001 slurmctld[2390]: error: Nodes n[0001-0003] not responding
Mar 23 15:25:35 csmith001 slurmctld[2390]: Node n0001 now responding
Mar 23 15:26:34 csmith001 slurmctld[2390]: Job 103 completion process took 174 seconds
Mar 23 15:26:54 csmith001 slurmctld[2390]: completing job 105
Mar 23 15:26:54 csmith001 slurmctld[2390]: job_complete for JobId=105 successful
Mar 23 15:28:55 csmith001 slurmctld[2390]: error: Nodes n[0002-0003] not responding, setting DOWN
Mar 23 15:32:15 csmith001 slurmctld[2390]: Node n0001 now responding

From astevens at infiscale.com Mon Mar 23 09:02:01 2009
From: astevens at infiscale.com (astevens at infiscale.com)
Date: Mon, 23 Mar 2009 16:02:01 +0000
Subject: [Warewulf] problems with slurm scheduling jobs
In-Reply-To: <84e2b6660903230747m36eae54aj92416466fbbf318a@mail.gmail.com>
References: <84e2b6660903230747m36eae54aj92416466fbbf318a@mail.gmail.com>
Message-ID: <1646223784-1237824156-cardhu_decombobulator_blackberry.rim.net-848900756-@bxe1125.bisx.prod.on.blackberry>

Was this issue introduced with a fresh install or an update?
Can you send your perceus and slurm confs for me to take a look at?

You're close, let's get you the rest of the way there :)

Arthur
Sent via BlackBerry from T-Mobile

-----Original Message-----
From: Zsolt Kovacs
Date: Mon, 23 Mar 2009 15:47:12
To:
Subject: [Warewulf] problems with slurm scheduling jobs

_______________________________________________
Warewulf mailing list
Warewulf at caoslinux.org
http://lists.caosity.org/mailman/listinfo/warewulf

From astevens at infiscale.com Mon Mar 23 10:37:45 2009
From: astevens at infiscale.com (Arthur Stevens)
Date: Mon, 23 Mar 2009 10:37:45 -0700
Subject: [Warewulf] problems with slurm scheduling jobs
References: <84e2b6660903230747m36eae54aj92416466fbbf318a@mail.gmail.com>
Message-ID: <68914AA1A2A24EF6BAD366402C798D2A@computer>

OK, we noticed that some config settings changed in a dot release. We will hammer it out and have a new package up ASAP.

Thanks for the heads up,

Arthur

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/8d28f058/attachment.html

From zs.myth at gmail.com Mon Mar 23 11:03:30 2009
From: zs.myth at gmail.com (Zsolt Kovacs)
Date: Mon, 23 Mar 2009 19:03:30 +0100
Subject: [Warewulf] problems with slurm scheduling jobs
In-Reply-To: <1646223784-1237824156-cardhu_decombobulator_blackberry.rim.net-848900756-@bxe1125.bisx.prod.on.blackberry>
References: <84e2b6660903230747m36eae54aj92416466fbbf318a@mail.gmail.com> <1646223784-1237824156-cardhu_decombobulator_blackberry.rim.net-848900756-@bxe1125.bisx.prod.on.blackberry>
Message-ID: <84e2b6660903231103y4b00c15aub169b936f8ebdcc4@mail.gmail.com>

Arthur,

> Was this issue introduced with a fresh install or an update?

I cannot tell you this, unfortunately.

> Can you send your perceus and slurm confs for me to take a look at?

The files are attached.

> You're close, let's get you the rest of the way there :)

Cool and thanks :),
Zsolt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/370cd8ed/attachment.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 1025 bytes
Desc: not available
Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/370cd8ed/attachment.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: defaults.conf
Type: application/octet-stream
Size: 973 bytes
Desc: not available
Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/370cd8ed/attachment-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: dnsmasq.conf
Type: application/octet-stream
Size: 206 bytes
Desc: not available
Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/370cd8ed/attachment-0002.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: perceus.conf
Type: application/octet-stream
Size: 2604 bytes
Desc: not available
Url : http://altruistic.infiscale.org/pipermail/perceus/attachments/20090323/370cd8ed/attachment-0003.obj

From 4ilya.m+warewulf at gmail.com Tue Mar 24 19:00:13 2009
From: 4ilya.m+warewulf at gmail.com (Ilya Malinov)
Date: Tue, 24 Mar 2009 19:00:13 -0700
Subject: [Warewulf] NFS timeouts during provisioning
Message-ID: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>

Hi,

On each of our several small clusters of 30 nodes, I always run into NFS RPC timeouts, both during provisioning and afterwards, for directories exported from the head node.

It seems that under certain conditions, the NFS server on the head node runs out of steam and attempts to close some TCP connections. It puts them into FIN_WAIT and FIN_WAIT2 states, waiting for acknowledgement from the clients. Those never acknowledge but instead continue to poll the server.

The solution for the main OS on the nodes was to mount "hybridized" directories using UDP. However, the mount binary in the initramfs of the provisioning state is in fact a symlink to busybox. It seems that the latter does not support mounting over UDP. So the only thing I was able to do to get provisioning going was to keep restarting the NFS server on the head node every time I saw that some compute nodes were stuck in provisioning.

I did find some references online that confirmed this behavior of NFS on Linux under heavy load. However, provisioning of just 30 nodes does not seem like a big load on the NFS server.

I was wondering if people on this list have experienced this kind of problem and have come up with solutions or workarounds.

We are using Perceus 1.3.7, but I do not think it matters in this case.

Thank you,
Ilya.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090324/be54bc7e/attachment.html

From astevens at infiscale.com Tue Mar 24 20:38:48 2009
From: astevens at infiscale.com (Arthur Stevens)
Date: Tue, 24 Mar 2009 20:38:48 -0700
Subject: [Warewulf] NFS timeouts during provisioning
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
Message-ID: <0B860E8069A74D318FC7A7B8502E25FD@computer>

Hello Ilya,

We have made a lot of changes and updates since that version and are up to 1.5.x now. 30 nodes should not be much load at all for an NFS server on Caos. What version of our NFS are you using?

Thanks,

Arthur

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090324/db16fae1/attachment.html

From gmkurtzer at gmail.com Tue Mar 24 21:05:18 2009
From: gmkurtzer at gmail.com (Greg Kurtzer)
Date: Tue, 24 Mar 2009 21:05:18 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <0B860E8069A74D318FC7A7B8502E25FD@computer>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer>
Message-ID: <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>

One of the changes that Art may be referring to is the inclusion of xget for provisioning. We have seen mixed results with it, but when it works it works very well. You can test it out by upgrading and then enabling it in the perceus.conf.

I am not sure why you are seeing such issues with Linux NFS.
While it isn't known for being the fastest NFS server out there, I have personally seen it handle the boot and hybridization of over 600 nodes booting at exactly the same time with no faults (this was using Caos Linux on a system with 16 cores, 32GB of memory and high performance local storage).

Actually, I was just as impressed that the DHCP server got all the nodes set up correctly in that case.

Good luck!

Greg

--
Greg Kurtzer
http://www.infiscale.com/
http://www.perceus.org/
http://www.caoslinux.org/

From 4ilya.m+warewulf at gmail.com Wed Mar 25 11:48:37 2009
From: 4ilya.m+warewulf at gmail.com (Ilya Malinov)
Date: Wed, 25 Mar 2009 11:48:37 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer> <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
Message-ID: <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>

Hello,

Both the head and the nodes are Debian-based. The head runs a 2.6.28 kernel from kernel.org with 1.0.10 NFS utils. However, the problem existed with previous kernels as well.

We will upgrade to a more recent version of Perceus; however, it cannot happen today or tomorrow. (I need to work out several issues with the new Perceus, e.g., its initramfs does not have kernel modules for SATA, and I have a Perceus module that prepares hard drives during provisioning, etc.)

XGET and even HTTP are nice alternatives to NFS during provisioning, but the problem with NFS remains even after the nodes are provisioned. Switching to UDP in the main OS as I did is a workaround, but not a desirable solution.

Perhaps this list is not exactly the place to discuss NFS issues.
I was just hoping that someone has run into similar issues and could share the solution.

Ilya.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090325/e341b3d0/attachment.html

From astevens at infiscale.com Wed Mar 25 12:08:47 2009
From: astevens at infiscale.com (Arthur Stevens)
Date: Wed, 25 Mar 2009 12:08:47 -0700
Subject: [Warewulf] NFS timeouts during provisioning
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer> <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com> <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
Message-ID: <79F8AD25EF23447480CBE66D8564200C@computer>

I would blame something in the Debian setup, since we have systems with over 1k nodes using NFS without problems. Unless it is an old initramfs, SATA support is there too.

Maybe trying the Ubuntu Perceus would be an option for you. Most of our user base is on Caos or a variant of an Enterprise Linux.

Please let us know what you find so we can fix up anything on our end that might help.

Thanks,

Arthur

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090325/c2a228b9/attachment.html

From griznog at gmail.com Wed Mar 25 12:29:41 2009
From: griznog at gmail.com (John Hanks)
Date: Wed, 25 Mar 2009 15:29:41 -0400
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer> <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com> <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
Message-ID:

2009/3/25 Ilya Malinov <4ilya.m+warewulf at gmail.com>:
> Hello,
>
> Perhaps, this list is not exactly the place to discuss NFS issues. I was
> just hoping that someone has run into similar issues and could share the
> solution.
>

Ilya,

I don't know the Debian method for doing it, but I've found that the default number of NFS threads is always set too low in Red Hat derivatives. I generally set it to at least 8 * number_of_cores and sometimes even higher. I'd suggest bumping that up as something to quickly test.

jbh

From 4ilya.m+warewulf at gmail.com Wed Mar 25 12:42:11 2009
From: 4ilya.m+warewulf at gmail.com (Ilya Malinov)
Date: Wed, 25 Mar 2009 12:42:11 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To:
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer> <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com> <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
Message-ID: <71e2304f0903251242t7c86d3a1mb7ff7c51c2d9d57e@mail.gmail.com>

Hi John,

Thanks for the idea. I also thought about it and bumped the number of NFS processes to 64 some time ago. However, this did not help.

Ilya.
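
The rule of thumb from this exchange (at least 8 NFS server threads per core) can be sketched as below. This is only a minimal illustration and was not posted in the thread: the function names are hypothetical, and reading the running thread count from /proc/fs/nfsd/threads assumes a Linux head node with the nfsd filesystem mounted. On Debian the count is usually set through RPCNFSDCOUNT in /etc/default/nfs-kernel-server, but that should be checked against the local installation.

```python
import os
from pathlib import Path

# Assumption: the Linux nfsd filesystem exposes the running thread count here.
# The file is absent when the NFS server is not loaded, so it is treated as optional.
NFSD_THREADS_FILE = Path("/proc/fs/nfsd/threads")


def suggested_nfsd_threads(cores, per_core=8):
    """Return the rule-of-thumb thread count: per_core threads for each core."""
    if cores < 1 or per_core < 1:
        raise ValueError("cores and per_core must be positive")
    return cores * per_core


def current_nfsd_threads(path=NFSD_THREADS_FILE):
    """Return the running nfsd thread count, or None if it cannot be read."""
    try:
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("cores:", cores)
    print("suggested nfsd threads:", suggested_nfsd_threads(cores))
    print("current nfsd threads:", current_nfsd_threads())
```

For a 16-core head node like the one Greg describes, the heuristic gives 128 threads, twice the 64 already tried here. Since raising the count did not resolve the timeouts in this case, treat it as a quick test rather than a fix.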
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090325/64ceffb7/attachment.html

From 4ilya.m+warewulf at gmail.com Wed Mar 25 12:50:21 2009
From: 4ilya.m+warewulf at gmail.com (Ilya Malinov)
Date: Wed, 25 Mar 2009 12:50:21 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <79F8AD25EF23447480CBE66D8564200C@computer>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com> <0B860E8069A74D318FC7A7B8502E25FD@computer> <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com> <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com> <79F8AD25EF23447480CBE66D8564200C@computer>
Message-ID: <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>

Actually, I did not find SATA drivers in the initramfs of Perceus 1.5.0. I had to recompile the kernel and re-create the initramfs to have those modules there.

As for blaming Debian, maybe you're right. However, the kernel that we run came from kernel.org, not from a Debian package.
Also, we run Debian on all our servers and have never experienced
anything similar to what I'm seeing on the cluster head nodes.

What is "Ubuntu Perceus"? I only see one tarball that I grabbed and
compiled. Is there a specific version for Ubuntu?

Thank you.
Ilya.

2009/3/25 Arthur Stevens:
> I would blame something with the Debian stuff since we have systems over
> 1k nodes using NFS OK. Unless it is an old initramfs, SATA support is there
> too.
>
> Maybe trying the Ubuntu Perceus would be an option for you. Most of our user
> base is Caos or a variant of an Enterprise Linux.
>
> Please let us know what you find so we can fix up anything on our end that
> might help.
>
> Thanks,
>
> Arthur
>
> ----- Original Message -----
> *From:* Ilya Malinov <4ilya.m+warewulf at gmail.com>
> *To:* The Warewulf Cluster Toolkit
> *Sent:* Wednesday, March 25, 2009 11:48 AM
> *Subject:* Re: [Warewulf] NFS timeouts during provisioning
>
> Hello,
>
> Both the head and the nodes are Debian-based. The head runs a 2.6.28 kernel
> from kernel.org with 1.0.10 NFS utils. However, the problem existed with
> previous kernels as well.
>
> We will upgrade to a more recent version of Perceus; however, it cannot
> happen today or tomorrow. (I need to work out several issues with the new
> Perceus, e.g., its initramfs does not have kernel modules for SATA, and I
> have a Perceus module that prepares hard drives during provisioning, etc.)
>
> XGET and even HTTP are nice alternatives to NFS during provisioning, but
> the problem with NFS remains even after the nodes are provisioned. Switching
> to UDP in the main OS as I did is a workaround, but not a desirable
> solution.
>
> Perhaps, this list is not exactly the place to discuss NFS issues. I was
> just hoping that someone has run into similar issues and could share the
> solution.
>
> Ilya.
>
>
>
> On Tue, Mar 24, 2009 at 9:05 PM, Greg Kurtzer wrote:
>
>> One of the changes that Art may be referring to is the inclusion of
>> xget for provisioning.
>> We have seen mixed results with it, but when it
>> works it works very well. You can test it out by upgrading and then
>> enabling it in the perceus.conf.
>>
>> I am not sure why you are seeing such issues with Linux NFS. While it
>> isn't known for being the fastest NFS server out there, I have
>> personally seen it boot and deal with hybridization of over 600 nodes
>> booting at exactly the same time with no faults (this was using Caos
>> Linux on a system with 16 cores, 32GB of memory and high performance
>> local storage).
>>
>> Actually, I was just as impressed that the DHCP server got all the
>> nodes set up correctly in that case.
>>
>> Good luck!
>>
>> Greg
>>
>>
>> 2009/3/24 Arthur Stevens :
>> > Hello Ilya,
>> >
>> > We have made a lot of changes and updates since that version and are up
>> > to 1.5.x now. 30 nodes should not be much load at all for NFS on Caos.
>> > What version of our NFS are you using?
>> >
>> > Thanks,
>> >
>> > Arthur
>> >
>> > ----- Original Message -----
>> > From: Ilya Malinov
>> > To: warewulf at caoslinux.org
>> > Sent: Tuesday, March 24, 2009 7:00 PM
>> > Subject: [Warewulf] NFS timeouts during provisioning
>> > Hi,
>> >
>> > In our several small clusters of 30 nodes each, I always run into a
>> > problem with NFS RPC timeouts, both during provisioning and afterwards,
>> > for directories exported from the head node.
>> >
>> > It seems that under certain conditions, the NFS server on the head node
>> > runs out of steam and attempts to close some TCP connections. It puts
>> > them into FIN_WAIT and FIN_WAIT2 state, waiting for acknowledgement from
>> > clients. Those never acknowledge but instead continue to poll the server.
>> >
>> > The solution for the main OS on the nodes was to mount "hybridized"
>> > directories using UDP. However, the mount binary in the initramfs of the
>> > provisioning state is in fact a symlink to busybox. It seems that the
>> > latter does not support mounting over UDP.
>> > So the only thing I was able to do to get
>> > provisioning going was to keep restarting the NFS server on the head
>> > node every time I saw that some compute nodes were stuck in provisioning.
>> >
>> > I did find some references online that confirmed this behavior of NFS on
>> > Linux under heavy load. However, provisioning of just 30 nodes does not
>> > seem like a big load on the NFS server.
>> >
>> > I was wondering if people on this list have experienced this kind of
>> > problem and have come up with solutions or workarounds.
>> >
>> > We are using Perceus 1.3.7, but I do not think it matters in this case.
>> >
>> > Thank you,
>> > Ilya.
>> >
>> > _______________________________________________
>> > Warewulf mailing list
>> > Warewulf at caoslinux.org
>> > http://lists.caosity.org/mailman/listinfo/warewulf
>> >
>>
>> --
>> Greg Kurtzer
>> http://www.infiscale.com/
>> http://www.perceus.org/
>> http://www.caoslinux.org/
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090325/339686bb/attachment.html

From stefano.bridi at gmail.com Thu Mar 26 02:21:08 2009
From: stefano.bridi at gmail.com (Stefano Bridi)
Date: Thu, 26 Mar 2009 10:21:08 +0100
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
 <0B860E8069A74D318FC7A7B8502E25FD@computer>
 <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
 <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
 <79F8AD25EF23447480CBE66D8564200C@computer>
 <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
Message-ID:

Preface: these past few weeks I have been fighting with my laptop NIC
(a stupid sis191 that refuses to work at gigabit speed)...

Just to be sure: what is the hardware configuration? In particular, I
suppose you are using gigabit: check the cables! Are they at least
Cat5e? Which NIC are you using, and with which kernel module?

stef

From geoff at galitz.org Thu Mar 26 02:28:55 2009
From: geoff at galitz.org (Geoff Galitz)
Date: Thu, 26 Mar 2009 10:28:55 +0100
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To:
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
 <0B860E8069A74D318FC7A7B8502E25FD@computer>
 <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
 <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
 <79F8AD25EF23447480CBE66D8564200C@computer>
 <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
Message-ID: <49961A0A3B3E463DB2BA6BD644806A62@geoffPC>

It would be interesting to see the output of "ifconfig eth0" or whatever
the correct device is. I'd pay particular attention to frame or carrier
errors, as that is where any physical-level problems (like cables or
bad switch ports) would typically show up.

-geoff

---------------------------------
Geoff Galitz
Blankenheim NRW, Germany
http://www.galitz.org/
http://german-way.com/blog/

From 4ilya.m+warewulf at gmail.com Thu Mar 26 15:16:32 2009
From: 4ilya.m+warewulf at gmail.com (Ilya Malinov)
Date: Thu, 26 Mar 2009 15:16:32 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <49961A0A3B3E463DB2BA6BD644806A62@geoffPC>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
 <0B860E8069A74D318FC7A7B8502E25FD@computer>
 <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
 <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
 <79F8AD25EF23447480CBE66D8564200C@computer>
 <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
 <49961A0A3B3E463DB2BA6BD644806A62@geoffPC>
Message-ID: <71e2304f0903261516w12c4d863ocb68b0971acf9f3b@mail.gmail.com>

Hello,

All connections are 1 Gb. We use Cat5e. Below are some statistics and
configuration data from one of the head nodes.

Ilya.
------------------------------
#> ethtool -i eth1
driver: e1000
version: 7.3.20-k3-NAPI
firmware-version: N/A
bus-info: 0000:0a:03.1
------------------------------
#> ethtool -S eth1
NIC statistics:
     rx_packets: 6781572079
     tx_packets: 2694098248
     rx_bytes: 9159427027667
     tx_bytes: 688198510890
     rx_broadcast: 51068479
     tx_broadcast: 26089
     rx_multicast: 1
     tx_multicast: 9
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 1
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 446323
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 7654445
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 9159427027667
     rx_csum_offload_good: 6729564733
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
------------------------------
#> uptime
 14:57:41 up 55 days, 20:41, 22 users, load average: 0.32, 0.25, 0.14
------------------------------
#> sysctl -a|grep net.ipv4|grep tcp|egrep -v "eth0|.lo"
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_fin_timeout = 20
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_fack = 1
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_mem = 3094656 4126208 6189312
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_frto = 2
net.ipv4.tcp_frto_response = 0
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_congestion_control = cubic
net.ipv4.tcp_abc = 0
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_available_congestion_control = cubic reno
net.ipv4.tcp_max_ssthresh = 0
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 0
net.ipv4.netfilter.ip_conntrack_tcp_max_retrans = 3

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://altruistic.infiscale.org/pipermail/perceus/attachments/20090326/078f7d87/attachment.html

From griznog at gmail.com Thu Mar 26 15:27:54 2009
From: griznog at gmail.com (John Hanks)
Date: Thu, 26 Mar 2009 18:27:54 -0400
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <71e2304f0903261516w12c4d863ocb68b0971acf9f3b@mail.gmail.com>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
 <0B860E8069A74D318FC7A7B8502E25FD@computer>
 <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
 <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
 <79F8AD25EF23447480CBE66D8564200C@computer>
 <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
 <49961A0A3B3E463DB2BA6BD644806A62@geoffPC>
 <71e2304f0903261516w12c4d863ocb68b0971acf9f3b@mail.gmail.com>
Message-ID:

Ilya,

I had some issues with some Sun Thumpers once where, during Perceus
provisioning, they'd stall while downloading vnfs.img. I eventually
solved it by having Perceus force the NIC into 100/half instead of
gigabit. The machines worked fine at 1 gigabit after booting a
vnfs/kernel, but for some reason, at 1 gigabit with the Perceus kernel,
network traffic slowly ground to a halt. I never figured out why; once I
stumbled on the 100/half solution I stopped looking.

jbh

2009/3/26 Ilya Malinov <4ilya.m+warewulf at gmail.com>:
> Hello,
>
> All connections are 1 Gb. We use cat5e. Below are some statistics and
> configuration data from one of the head nodes.
>
> Ilya.
>

From jal at mdacorporation.com Mon Mar 30 15:45:10 2009
From: jal at mdacorporation.com (John LLOYD)
Date: Mon, 30 Mar 2009 15:45:10 -0700
Subject: [Warewulf] NFS timeouts during provisioning
In-Reply-To: <71e2304f0903261516w12c4d863ocb68b0971acf9f3b@mail.gmail.com>
References: <71e2304f0903241900w5a883f1cv426a3a992dceba18@mail.gmail.com>
 <0B860E8069A74D318FC7A7B8502E25FD@computer>
 <571f1a060903242105v19b45fe8r5330e15726552081@mail.gmail.com>
 <71e2304f0903251148x8cc7047n7e87739d82a12210@mail.gmail.com>
 <79F8AD25EF23447480CBE66D8564200C@computer>
 <71e2304f0903251250u7c543e89q9d3a9fb3dfe1299f@mail.gmail.com>
 <49961A0A3B3E463DB2BA6BD644806A62@geoffPC>
 <71e2304f0903261516w12c4d863ocb68b0971acf9f3b@mail.gmail.com>
Message-ID: <57F67688A8D72449AC80164DA982083104E2CAB4@VMXYVR1.ds.mda.ca>

Is your switch a managed type, and can you get error statistics from
it? But before you answer that, is the switch supposed to be capable of
wire speed on all ports simultaneously? For example, some Cisco
enterprise switches (4600-series) are not capable of this, surprisingly.

--John
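
Host-side counters can be screened in the same spirit as the switch statistics asked about above. What follows is a minimal sketch, assuming the `ethtool -S` output posted earlier in this thread has been saved as text. The marker list and the function names are illustrative assumptions, not part of ethtool, and a real check may need driver-specific counter names.

```python
# Substrings that usually mark a problem counter in `ethtool -S` output.
# This list is an assumption for illustration, not an official taxonomy.
SUSPECT_MARKERS = ("error", "dropped", "missed", "failed",
                   "collisions", "no_buffer", "timeout")


def parse_counters(text):
    """Parse 'name: value' lines; lines without an integer value are skipped."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name and value.isdigit():
            counters[name] = int(value)
    return counters


def nonzero_suspects(counters, markers=SUSPECT_MARKERS):
    """Return counters that are nonzero and whose name contains a marker."""
    return {name: value for name, value in counters.items()
            if value != 0 and any(marker in name for marker in markers)}


if __name__ == "__main__":
    # A few of the counters posted in the thread, one per line.
    sample = """NIC statistics:
     rx_errors: 0
     tx_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     tx_carrier_errors: 0
     rx_missed_errors: 0
     tx_restart_queue: 446323
    """
    print(nonzero_suspects(parse_counters(sample)))
```

On this excerpt the filter returns an empty dict: every frame, CRC and carrier counter is zero. The one large counter, `tx_restart_queue` (446323), is not matched by these markers, so whether it matters is left open here.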