OrangeFS over InfiniBand

Boyd Wilson boydw at omnibond.com
Fri Dec 7 13:26:11 EST 2018


Lindolfo,
Have you tried scaling up the clients (1, 2, 4, 6, and so on) to see where
throughput levels off? Also, do you have visibility into the links on the
servers, to see whether they are all at the same utilization?
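
One rough way to check (a sketch, assuming mlx4 HCAs and the standard sysfs
counters; the hostnames and the device name mlx4_0 are placeholders) is to
sample the port counters on every server before and after a run:

    # PortXmitData is reported in units of 4 octets
    for h in server1 server2 server3 server4 server5 server6; do
        echo -n "$h xmit: "
        ssh "$h" cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data
    done

If the per-server deltas between the two samples are roughly equal, the load
is balanced; if one or two servers carry most of the traffic, that would
explain the plateau.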

-b

On Fri, Dec 7, 2018 at 8:57 AM Lindolfo Meira <meira at cesup.ufrgs.br> wrote:

> Hi guys.
>
> Following up on my issue, it would be great if I could get some additional
> pointers from you.
>
> I've been able to saturate my 50 Gb/s link when performing I/O from a
> single OrangeFS client, and that's great. But when I scale to 6 clients or
> more I can't get close to the 300 Gb/s I was expecting my 6 servers would
> provide.
>
> I managed to get 154 Gb/s twice, using the following set of parameters
> with MPIIO in IOR: 4 tasks/client, 12 clients, 4MB transfer size and 1GB
> block size. If I increase the number of tasks per client, or the number of
> clients, the performance gets worse.
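>
> A launch along these lines reproduces that configuration (OpenMPI-style
> mpirun; the hostfile and the output path are placeholders, and the flags
> may differ for other MPI launchers):
>
>   mpirun -np 48 --hostfile clients.txt --map-by node \
>       ior -a MPIIO -w -t 4m -b 1g -o /mnt/orangefs/ior.dat
>
> (48 = 12 clients x 4 tasks per client; without -F, IOR writes a single
> shared file.)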
>
> As I said before, I don't think the problem is with the network, but just
> so you know, my switch is a Mellanox MSX6710. And all my 6 OrangeFS
> servers operate with stable 50 Gb/s links (this is measured performance;
> their adapters are 56 Gb/s).
>
> Any input is appreciated.
>
>
> Lindolfo Meira, MSc
> Diretor Geral, Centro Nacional de Supercomputação
> Universidade Federal do Rio Grande do Sul
> +55 (51) 3308-3139
>
> On Thu, 29 Nov 2018, Lindolfo Meira wrote:
>
> > Hi Becky.
> >
> > I thought I'd tested for the optimum number of tasks and found 4 to be
> > the sweet spot. But apparently I was wrong. Following your advice I
> > reran the tests and found 8 tasks to work best. Now I'm getting around
> > 46 Gb/s.
> > Awesome :D
> >
> > Thanks, Becky. And thank you all, guys.
> >
> >
> > Lindolfo Meira, MSc
> > Diretor Geral, Centro Nacional de Supercomputação
> > Universidade Federal do Rio Grande do Sul
> > +55 (51) 3308-3139
> >
> > On Thu, 29 Nov 2018, Becky Ligon wrote:
> >
> > > Try using more tasks.  Most likely, you are not saturating the link.
> > >
> > > Becky Ligon
> > >
> > > On Thu, Nov 29, 2018 at 2:42 PM Lindolfo Meira <meira at cesup.ufrgs.br> wrote:
> > >
> > > > Hello Mike.
> > > >
> > > > I get about 9 Gb/s.
> > > >
> > > > I've been testing the single client with pvfs2-cp because I assumed it
> > > > uses OrangeFS's direct interface and would therefore achieve better
> > > > results. But using the MPIIO interface with IOR, with 4 tasks, I
> > > > managed to achieve 33.6 Gb/s. That's better than the 25 Gb/s I get with
> > > > pvfs2-cp, but still far from the 50 Gb/s the network allows.
> > > >
> > > >
> > > > Lindolfo Meira, MSc
> > > > Diretor Geral, Centro Nacional de Supercomputação
> > > > Universidade Federal do Rio Grande do Sul
> > > > +55 (51) 3308-3139
> > > >
> > > > On Thu, 29 Nov 2018, Mike Marshall wrote:
> > > >
> > > > > Hi Lindolfo
> > > > >
> > > > > I wonder if you'd see different results using
> > > > > "dd if=/dev/zero of=/pvfsmnt/foo count=128 bs=4194304"
> > > > > instead of pvfs2-cp? That blocksize is the same size as the
> > > > > buffer shared by the kernel module and the client...
> > > > >
> > > > > -Mike
> > > > >
> > > > >
> > > > > On Thu, Nov 29, 2018 at 9:19 AM Lindolfo Meira <meira at cesup.ufrgs.br> wrote:
> > > > > >
> > > > > > Hi David.
> > > > > >
> > > > > > Yes, I updated the drivers to version 4.4-2 (the latest available
> > > > > > when I did it), but the error didn't go away. And the adapters also
> > > > > > have the latest firmware available, as does the switch.
> > > > > >
> > > > > >
> > > > > > Lindolfo Meira, MSc
> > > > > > Diretor Geral, Centro Nacional de Supercomputação
> > > > > > Universidade Federal do Rio Grande do Sul
> > > > > > +55 (51) 3308-3139
> > > > > >
> > > > > > On Thu, 29 Nov 2018, David Reynolds wrote:
> > > > > >
> > > > > > > Hi Lindolfo,
> > > > > > >
> > > > > > > Regarding the mlx4_core error, have you tried updating the
> > > > > > > firmware on the IB cards? I don't remember seeing this error
> > > > > > > myself, but I found some posts on the Mellanox forums saying that
> > > > > > > updating the firmware fixed similar issues.
> > > > > > >
> > > > > > > David Reynolds
> > > > > > > Software Engineer
> > > > > > > Omnibond Systems, LLC.
> > > > > > >
> > > > > > > > On Nov 23, 2018, at 2:25 PM, Lindolfo Meira <meira at cesup.ufrgs.br> wrote:
> > > > > > > >
> > > > > > > > Hi Boyd.
> > > > > > > >
> > > > > > > > Yes, that's correct. All 6 servers are exclusively serving;
> > > > > > > > the one client is exclusively "clienting".
> > > > > > > >
> > > > > > > > The kernel version is 4.12.14 (the latest available with
> > > > > > > > OpenSUSE Leap 15), and yes, I'm using the upstream orangefs
> > > > > > > > module.
> > > > > > > >
> > > > > > > >
> > > > > > > > Lindolfo Meira, MSc
> > > > > > > > Diretor Geral, Centro Nacional de Supercomputação
> > > > > > > > Universidade Federal do Rio Grande do Sul
> > > > > > > > +55 (51) 3308-3139
> > > > > > > >
> > > > > > > >> On Fri, 23 Nov 2018, Boyd Wilson wrote:
> > > > > > > >>
> > > > > > > >> Lindolfo,
> > > > > > > >> I am copying the users at lists.orangefs.org list; the older one
> > > > > > > >> was having DNS migration issues. I would have recommended the
> > > > > > > >> latest OFED stack, but I have not seen the issue you describe.
> > > > > > > >> Also, what is the Linux kernel version, and which orangefs kmod
> > > > > > > >> are you using -- the upstream one? And with the single client: it
> > > > > > > >> is not also an OrangeFS server, correct?
> > > > > > > >>
> > > > > > > >> -b
> > > > > > > >>
> > > > > > > >> On Fri, Nov 23, 2018 at 12:31 PM Lindolfo Meira <meira at cesup.ufrgs.br> wrote:
> > > > > > > >>
> > > > > > > >>> Hi.
> > > > > > > >>>
> > > > > > > >>> I've just finished implementing an OrangeFS system and was
> > > > > > > >>> wondering if I could get some pointers from you, since I don't
> > > > > > > >>> think my system is performing at the level it should. I've already
> > > > > > > >>> examined all the OrangeFS documentation and FAQ, but nothing
> > > > > > > >>> helped. The documentation fails to mention any actual community
> > > > > > > >>> mailing lists, so I tried to subscribe to a pvfs2-users list I
> > > > > > > >>> found on the Internet, but it seems to be dead.
> > > > > > > >>>
> > > > > > > >>> My OrangeFS implementation has 1 client and 6 exclusive servers
> > > > > > > >>> (each serving both metadata and data, as the attached config file
> > > > > > > >>> shows). Each of the servers has an XFS-formatted RAID6 volume
> > > > > > > >>> dedicated to OrangeFS. All of the nodes are connected by 56 Gb/s
> > > > > > > >>> InfiniBand adapters.
> > > > > > > >>>
> > > > > > > >>> I've performed a benchmark on the network and achieved a very
> > > > > > > >>> stable mark of 50 Gb/s per link.
> > > > > > > >>>
> > > > > > > >>> When I benchmark the XFS-formatted RAID6 volumes individually
> > > > > > > >>> (logging into the servers and writing directly to the partition --
> > > > > > > >>> sequential writes, using dd), I get a write performance of about
> > > > > > > >>> 11 Gb/s on average. Since there are 6 servers, I'm expecting, at
> > > > > > > >>> the very least, something close to 66 Gb/s of write performance
> > > > > > > >>> from my parallel file system, right? 6 times 11 Gb/s. But for now
> > > > > > > >>> my system has a single client, and its network adapter reaches at
> > > > > > > >>> most 50 Gb/s, so that's my real constraint.
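> > > > > > > >>>
> > > > > > > >>> (For reference, the sequential dd write meant here would look
> > > > > > > >>> something like the following -- the target path, block size and
> > > > > > > >>> count are illustrative, not the exact values used:
> > > > > > > >>>
> > > > > > > >>>   dd if=/dev/zero of=/path/to/raid6/ddtest bs=4M count=8192 oflag=direct
> > > > > > > >>>
> > > > > > > >>> oflag=direct bypasses the page cache, so the number reflects the
> > > > > > > >>> array rather than RAM.)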
> > > > > > > >>>
> > > > > > > >>> However, I've been unable to achieve more than 25 Gb/s when
> > > > > > > >>> writing to this system using pvfs2-cp, and I can't understand why.
> > > > > > > >>> I'm using a buffer size of 1GiB and a stripe size of 2MiB (which
> > > > > > > >>> seem to be the optimum values for me -- anything different from
> > > > > > > >>> these hurts performance).
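> > > > > > > >>>
> > > > > > > >>> (Concretely, the copies are of roughly this shape -- sizes in
> > > > > > > >>> bytes, the source and destination paths are placeholders, and
> > > > > > > >>> pvfs2-cp -h has the exact option names on your build:
> > > > > > > >>>
> > > > > > > >>>   pvfs2-cp -t -b 1073741824 -s 2097152 /local/scratch/bigfile /mnt/orangefs/bigfile
> > > > > > > >>>
> > > > > > > >>> where -b is the client buffer size, -s the strip size, and -t
> > > > > > > >>> prints timing information.)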
> > > > > > > >>>
> > > > > > > >>> Another interesting thing to notice is that when I use a single
> > > > > > > >>> data file (thus a single server) I get a write performance of
> > > > > > > >>> about 9 Gb/s. I'm taking that as pretty good, since writing
> > > > > > > >>> directly to the partition gave me 11 Gb/s. But then, as I increase
> > > > > > > >>> the number of data files, the system doesn't scale as it should.
> > > > > > > >>>
> > > > > > > >>> I'm using OpenSUSE Leap 15.0 with inbox IB drivers. The network
> > > > > > > >>> doesn't seem to be a problem, but I thought maybe OrangeFS would
> > > > > > > >>> do better with the latest Mellanox OFED drivers. So I installed
> > > > > > > >>> them and recompiled the code. Still nothing. I didn't tweak the
> > > > > > > >>> parameters of the IB drivers though.
> > > > > > > >>>
> > > > > > > >>> Lastly, regardless of the driver (SUSE inbox or latest Mellanox
> > > > > > > >>> OFED), I noticed that mlx4_core logs this strange message on the
> > > > > > > >>> client every time I use pvfs2-cp: "mlx4_core command 0x1f failed:
> > > > > > > >>> fw status = 0x2".
> > > > > > > >>>
> > > > > > > >>> What am I missing here? Any help is appreciated.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> Regards,
> > > > > > > >>>
> > > > > > > >>> Lindolfo Meira, MSc
> > > > > > > >>> Diretor Geral, Centro Nacional de Supercomputação
> > > > > > > >>> Universidade Federal do Rio Grande do Sul
> > > > > > > >>> +55 (51) 3308-3139
> > > > > > >
> > > > >
> > >

