
Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy



Ben,

The scope below is with no head-of-line blocking or partial head-of-line
blocking?

Siamack

McAlpine, Gary L wrote:

> Ben,
>
> I'm retransmitting this for two reasons. 1) I'm not sure the first one
> will get to the reflector and 2) I missed your question: "Are you
> prepared to take a stab at bounding this scope?"
>
> We started bounding the scope with objectives in Long Beach but never
> finished. Some were approved but others were deferred. I took a stab in
> a previous email to the reflector, which was essentially the following:
> to support backplane interconnects, a few shelves in a rack, and a few
> racks, all interconnected by one subnet: 1) 100 m per hop max, 2) 5
> stages of switching max (3 layers of switching hierarchy), and 3) 6
> hops max in any path. These are not to be treated as hard limits but to
> bound the scope of the study to something we can get our arms around.
>
> See previous transmission below.
>
> Gary
>
>
> -----Original Message-----
> From: McAlpine, Gary L
> Sent: Tuesday, June 15, 2004 9:00 AM
> To: 'benjamin.brown@IEEE.ORG'; STDS-802-3-CM@listserv.ieee.org
> Subject: RE: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
> Ben,
>
> The number of queues doesn't need to explode with the number of hops
> supported. The number of queues per link can be bounded to a very
> reasonable number and still provide significant benefit.
>
> Having the additional Tx queuing paths, plus feedback that enables
> servicing the queues in an optimum order for the peer component,
> provides a subtle change in dynamics during congestion that increases
> the efficiency throughout the subnet.  The result is higher throughput,
> lower latency, and lower latency variation.
>
> The trick is in defining the optimum granularity (cost vs. performance)
> and what a queue path represents. Prioritization gives
> us one dimension of granularity. Adding another dimension of, say, 4 to
> 16 logical paths per link can significantly increase efficiency without
> significantly increasing cost and complexity. It can also be implemented
> at the MAC level (or above). Since our interest is in enhancing
> Ethernet, our simulations implement the expanded queuing in the MAC.
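>
> (To make this concrete, here is a minimal sketch, not our actual
> simulation model; the path count, names, and scheduling rule are all
> illustrative.)
>
> # Hypothetical sketch: a MAC-level transmitter with a few logical-path
> # queues per link, scheduled using congestion feedback from the peer.
> from collections import deque
>
> NUM_PATHS = 8  # somewhere in the 4 to 16 range discussed above
>
> class LinkTransmitter:
>     def __init__(self, num_paths=NUM_PATHS):
>         self.queues = [deque() for _ in range(num_paths)]
>         self.peer_congested = [False] * num_paths  # feedback from peer
>
>     def enqueue(self, path_id, frame):
>         self.queues[path_id].append(frame)
>
>     def update_feedback(self, path_id, congested):
>         # The peer tells us which of its logical paths are backing up.
>         self.peer_congested[path_id] = congested
>
>     def next_frame(self):
>         # Prefer paths the peer can absorb right now; fall back to
>         # congested paths only if nothing else is pending.
>         for allow_congested in (False, True):
>             for path_id, q in enumerate(self.queues):
>                 if q and self.peer_congested[path_id] == allow_congested:
>                     return path_id, q.popleft()
>         return None  # link idle
>
> tx = LinkTransmitter()
> tx.enqueue(0, "frame-A")
> tx.enqueue(1, "frame-B")
> tx.update_feedback(0, True)   # path 0 is congested at the peer
> print(tx.next_frame())        # -> (1, 'frame-B')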
>
> Gary
>
> P.S. Here is a little puzzle for the people on the reflector. I ran into
> this a few years ago while simulating flow control over 30 km, 10 Gb/s
> links. The answer is one of the subtle changes in dynamics mentioned
> above. (It needs further study.)
>
> I had a 4-port switch with 30 km links on each port and enough buffer to
> support pause-based flow control. I had also been experimenting with
> expanded link-level queuing and congestion avoidance mechanisms, so the
> models already had these mechanisms built in. I knew these mechanisms
> wouldn't provide any benefit over 30 km because of the extreme time
> delay in the feedback (it would be too out-of-date for the decisions
> being made on it). The simulations proved that the pause flow control
> (not XON/XOFF) worked just fine across long links given the appropriate
> buffering in the switch. However, with a bursty workload the
> throughput efficiency of the switch was about 65% during congestion. Out
> of curiosity, I decided to try the congestion avoidance mechanisms under
> the same conditions and was blown away by the results. The efficiency
> went up to 98% throughput, the latency and latency variations went down,
> and the switch buffer utilization went way down. Why?
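>
> (For context on "appropriate buffering": a back-of-the-envelope
> bandwidth-delay calculation, assuming roughly 5 us/km of fiber
> propagation delay and ignoring pause-frame transmission and reaction
> times. Illustrative only; not the simulated configuration.)
>
> link_rate_bps = 10e9               # 10 Gb/s
> distance_km = 30.0
> prop_delay_s = 5e-6 * distance_km  # one way, ~150 us
> rtt_s = 2 * prop_delay_s           # ~300 us pause feedback loop
> in_flight_bits = link_rate_bps * rtt_s
> print(in_flight_bits / 8 / 1024)   # ~366 KiB of buffer per port, minimum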
>
>
>
>  -----Original Message-----
> From: owner-stds-802-3-cm@listserv.ieee.org
> [mailto:owner-stds-802-3-cm@listserv.ieee.org] On Behalf Of Benjamin
> Brown
> Sent: Tuesday, June 15, 2004 5:45 AM
> To: STDS-802-3-CM@listserv.ieee.org
> Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
>
> Gary,
>
> Finer granularity immediately brings to mind a myriad of possibilities.
> However, as others have mentioned in this thread, the number of
> queues required to support finer granularities simply explodes as
> the number of hops in the network increases.
>
> In a previous note to this thread, you mentioned that CMSG may be
> most applicable to a "microcosm" network and
>
> "Bounding the scope of the microcosms, in which we
> are trying to enable the use of Ethernet as the
> local interconnect, will help us define the set of
> assumptions that apply in that space."
> Are you prepared to take a stab at bounding this scope?
>
> Thanks,
> Ben
>
> McAlpine, Gary L wrote:
>
> Siamack,
>
> Excellent summary. I think this is exactly the right direction to
> proceed.
>
> Supporting a finer granularity of flows and flow control at the link
> level can translate to significantly better system characteristics. The
> question is: what granularity and what flow definition provides the
> optimum cost vs. performance trade-offs? I don't think we can answer
> this or the other related questions without further study. Isn't that
> what the CMSG is about?
>
> Gary
>
> -----Original Message-----
> From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
> [mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Siamack
> Ayandeh
> Sent: Friday, June 11, 2004 5:48 AM
> To: STDS-802-3-CM@LISTSERV.IEEE.ORG
> Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
> Ben,
>
> It may help to start from a more limited scope with clear value before
> we venture far into more complex territory.
>
> Clearly there must have been a perceived value in the existing Pause
> mechanism which is part of the standard and widely deployed.  This
> mechanism, or a yet to be defined mechanism, can be improved in the
> following sense:
>
> 1) Scope: Can remain as is, i.e. a single link. Given that Ethernet is a
> technology that is being used in a wide range of applications from
> inter-chip communication to the local loop for Metro Ethernet services,
> a single link would cover a wide range of applications.
>
> 2) Granularity: Needs to be improved. The granularity can be defined by
> introduction of a grain_ID (I am treading carefully here and don't use
> flow-ID). How this is mapped to Class of Service, VLAN tags, etc.
> becomes a local matter over a single link and need not be part of a
> standard. It is application dependent. Sure there are problems to be
> solved here but that's why we need a study group.
>
> The need is to create multiple control loops rather than one. How these
> get mapped is a local decision over a single link (a rough sketch of
> such per-grain loops follows after item 3).
>
> 3) Flow control algorithm:  Currently ON/OFF control is in place. This
> is a simple and effective mechanism. Whether it can be improved using
> the so-called "rate based" algorithms or something else is to be seen
> and is the subject of study for the working group (see the second
> sketch below).
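>
> (Regarding item 2: purely as a strawman, with none of these names or
> thresholds actually being proposed, per-grain control loops at the
> receiving end of one link could be as simple as the sketch below.
> How grain_ID is assigned remains the local matter described above.)
>
> XOFF_THRESHOLD = 48   # frames queued per grain before asserting XOFF
> XON_THRESHOLD = 16    # resume level (hysteresis)
>
> class GrainReceiver:
>     def __init__(self, num_grains):
>         self.depth = [0] * num_grains
>         self.paused = [False] * num_grains
>
>     def frame_received(self, grain_id):
>         self.depth[grain_id] += 1
>         if not self.paused[grain_id] and self.depth[grain_id] >= XOFF_THRESHOLD:
>             self.paused[grain_id] = True
>             return ("XOFF", grain_id)   # control message to send upstream
>         return None
>
>     def frame_drained(self, grain_id):
>         self.depth[grain_id] -= 1
>         if self.paused[grain_id] and self.depth[grain_id] <= XON_THRESHOLD:
>             self.paused[grain_id] = False
>             return ("XON", grain_id)
>         return None
>
> rx = GrainReceiver(num_grains=16)
> for _ in range(XOFF_THRESHOLD):
>     msg = rx.frame_received(grain_id=3)
> print(msg)   # ('XOFF', 3) once grain 3 crosses the threshold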
>
> In this limited context the study group can add value and produce a
> useful extension to the existing Pause flow control mechanism.
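>
> (And regarding item 3: one way to picture a "rate based" alternative,
> again only a sketch, is a receiver that advertises a permitted
> fraction of link rate per grain derived from its queue occupancy,
> rather than toggling between all-or-nothing.)
>
> def permitted_rate_fraction(queue_depth, queue_capacity):
>     """Fraction of link rate the sender may use for this grain (0.0-1.0)."""
>     headroom = max(queue_capacity - queue_depth, 0)
>     return headroom / queue_capacity
>
> for depth in (0, 32, 56, 64):
>     print(depth, permitted_rate_fraction(depth, 64))
> # 0 -> 1.0, 32 -> 0.5, 56 -> 0.125, 64 -> 0.0 (equivalent to XOFF)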
>
> Whether more can be done, e.g. to extend the scope to multiple hops, will
> no doubt arise and be debated in the course of the study. However, the
> ambiguity that is currently floating around this subject should not
> prevent concrete progress in the more limited context.
>
> Regards, Siamack
>
>
>
> Benjamin Brown wrote:
>
>
> Gary,
>
> You say you're seeing promising results from simulations
> but you're not ready to share the data. I certainly hope
> that will change before the presentation deadline for the July meeting
> in 4 weeks.
>
> I don't mean to pick on you but you seem to be the only
> one that is taking up the flag AND at least suggesting that there is
> simulation data to back up your claims.
>
> As chair of this group, I'm trying to stir up discussion in order to
> get all the arguments on the table. If there are flaws in these
> arguments (the "gospels" as you call them) and the exploitation of
> these flaws has broad market potential and is both technically and
> economically feasible, then we need to get this information
> disseminated as soon as possible.
>
> I don't think we can try to go through the July meeting without this
> material and expect to get a continuation of this study group.
>
> Regards,
> Ben
>
> McAlpine, Gary L wrote:
>
>
> Norm,
>
> I agree with you on many of your points below. A higher granularity of
> "flow" than 8 priorities is needed to get any significant improvement
> across multiple stages of switching. I know I'm being vague about
> exactly what granularity of "flow" I want to exert targeted influence
> on (rate control/backpressure). It's not because I don't know,
> it's because any discussions on the subject without data to back the
> proposals will "simply" turn into a big rathole. I am busy developing
> the data.
>
> I understand all your arguments below. I've been listening to the same
> ones for the last 15 years and, until a few years ago, treating them
> as the gospel. It wasn't until I set out to thoroughly understand the
> gory details through simulations that I realized there were some
> interesting flaws in the "old" assumptions that can be very
> effectively exploited in confined networks such as multi-stage cluster
> interconnects.
>
> I guess I don't see as clear a boundary of responsibility between
> 802.1 and 802.3 as you do. I think it's an IEEE problem. And since the
> target link technology is Ethernet, the focus should be on the
> 802.3 support required to enable acceptable Ethernet-based solutions.
> I think 802.1 needs to be part of a complete solution, but only to the
> extent of including support for the 802.3 mechanisms.
>
> Gary
>
>
>
> -----Original Message-----
> From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
> [mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Norman
> Finn
> Sent: Wednesday, June 09, 2004 2:33 AM
> To: STDS-802-3-CM@LISTSERV.IEEE.ORG
> Subject: Re: [8023-CMSG] Server/NIC analogy
>
>
> Gary,
>
> McAlpine, Gary L wrote:
>
> I think this discussion is off on a tangent.
>
> One can reasonably claim that you're the one who's off on a tangent.
> One man's tangent is another man's heart of the argument.  You keep
> saying, "we're just ..." and "we're only ..." and "we're simply ..."
> and failing to acknowledge our "but you're ..." arguments.
> Specifically:
>
> You want back pressure on some level finer than whole links.  The
> heart of the argument, that you are not addressing in your last
> message, is, "On exactly what granularity do you want to exert back
> pressure?"
>
> The answer to that question is, inevitably, "flows".  (I have no
> problem that "flows" are relatively undefined; we dealt with that in
> Link Aggregation.)  Per-flow back pressure is the "but you're ..."
> argument.
>
> Hugh Barrass's comments boil down to exactly this point.  You want to
> have per-flow back pressure.
>
> The "per-something Pause" suggestions have mentioned VLANs and
> priority levels as the granularity.  The use of only 8 priority
> levels, and thus only 8 flows, is demonstrably insufficient in any
> system with more than 9 ports.  For whatever granularity you name, you
> require at least one queue in each endstation transmitter for each
> flow in which that transmitter participates.  Unfortunately, this o(n)
> problem in the endstations is an o(n**2) problem in the switch.  A
> simple-minded switch architecture requires one queue per flow on each
> inter-switch trunk port, which means o(n**2) queues per trunk port.
> The construction of switches to handle back-pressured flows without
> requiring o(n**2) queues per inter-switch port has been quite
> thoroughly explored by ATM and Fibre Channel, to name two.  It is
> *not* an easy problem to solve.
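>
> (To put rough numbers on the o(n**2) point, using nothing more than
> the one-queue-per-flow accounting above; the counts are illustrative,
> and real switch designs vary widely.)
>
> # Naive queue count for a trunk port that keeps one queue per
> # source/destination pair, versus 8 priority classes.
> for endstations in (16, 64, 256):
>     per_flow_queues = endstations * (endstations - 1)  # ordered pairs
>     print(endstations, per_flow_queues, "vs 8 priority queues")
> # 16 -> 240, 64 -> 4032, 256 -> 65280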
>
> At the scale of one switch, one flow per port, and only a few ports,
> as Ben suggests, it is easy and quite convenient to ignore the o(n**2)
> factor, and assume that the per-link back pressure protocol is the
> whole problem.  Unfortunately, as you imply in your e-mail below, the
> trivial case of a one-switch "network" is insufficient.  As soon as
> you scale the system up to even "a few hops", as you suggest, the
> number of ports has grown large enough to stress even a 12-bit tag per
> link. Furthermore, to assume that a given pair of physical ports will
> never want to have multiple flows, e.g. between different processes in
> the CPUs, is to deny the obvious.
>
> In other words, implementing per-flow back pressure, even in networks
> with a very small number of switches, very quickly requires very
> sophisticated switch architectures.
>
> For a historical example, just look at Fibre Channel.  It started with
> very similar goals, and very similar scaling expectations, to what
> you're talking about, here.  (The physical size was different because
> of the technology of the day, but the number of ports and flows was
> quite similar.)  Fibre Channel switches are now quite sophisticated,
> because the problem they are solving becomes extraordinarily difficult
> even for relatively small networks.
>
> Summary:
>
> This project, as described by its proponents, is per-flow switching.
> It is not the job of 802.3 to work on switching based even on MAC
> address, much less per-flow switching.  It is essential that anyone
> who desires to work on per-flow switching in 802 or any forum become
> familiar with what the real problems are, and what solutions exist.
>
> -- Norm
>
>
> ... There are assumptions being
> made here that are off-base. We need to focus our attention on what it
> is we are trying to enable with new standards. (My numbered items are
> responses to Hugh's numbered items.)
>
> 1. If what we are trying to enable is single-stage interconnects for
> backplanes, then wrt the IEEE standards, we're done. We just need to
> get good implementations of NICs and switches using 802.3x (rate
> control, not XON/XOFF) to meet the requirements (e.g. good enough
> throughput, low latency, low latency variation, no loss due to
> congestion). But ... single-stage interconnects are not very
> interesting to people who want to construct larger interconnects to
> tie multiple racks with multiple shelves of blades together into a
> single system.
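>
> (For anyone unfamiliar with the distinction: an 802.3x PAUSE frame
> carries a pause_time in 512-bit-time quanta, so a receiver can meter
> a sender rather than only stopping it. A rough sketch of that usage;
> the thresholds and scaling are invented for illustration.)
>
> MAX_PAUSE_QUANTA = 0xFFFF   # largest pause_time value in 802.3x
>
> def pause_quanta_for(queue_depth, queue_capacity):
>     """Illustrative: request longer pauses as the local queue fills."""
>     fill = queue_depth / queue_capacity
>     if fill < 0.5:
>         return 0                                   # no pause needed
>     return int(MAX_PAUSE_QUANTA * (fill - 0.5) * 2)
>
> print(pause_quanta_for(40, 64))   # partial pause, i.e. rate control
> print(pause_quanta_for(64, 64))   # 65535, effectively XOFF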
>
> 2. (Putting on my server hat) We're NOT asking for IEEE to provide
> end-to-end congestion management mechanisms. If IEEE can simply
> standardize some tweaks to the current 802.3 (& 802.1) standards to
> support better congestion visibility at layer 2 and better methods of
> reacting to congestion at layer 2 (more selective rate control and no
> frame drops), then the rest can be left up to the upper layers. There
> are methods that can be implemented in layer 2 that don't prohibit
> scalability. Scalability may be limited to a few hops, but that is all
> that is needed.
>
> 3. The assumption in item 3 is not entirely true. There are
> relationships (that can be automatically discovered or configured)
> that can be exploited for significantly improved layer 2 congestion
> control.
>
> 4. For backpressure to work, it neither requires congestion to be
> pushed all the way back to the source nor does it require the
> backpressuring device to accurately predict the future. From the layer
> 2 perspective, the source may be a router. So back pressure only needs
> to be pushed up to the upper layers (which could be a source endpoint
> or a router). Also, the backpressuring device simply needs to know its
> own state of congestion and be able to convey clues to that state to
> the surrounding devices. We don't need virtual circuits to be supported
> at layer 2 to get "good enough" congestion control.
>
> 5. From an implementation perspective, I believe the queues can go
> either in the MAC or the bridge, depending on the switch
> implementation. (Am I wrong? I haven't seen anything in the interface
> between the bridge and the MAC that would force the queues to be in
> the bridge.) IMO, where they go should NOT be dictated by either 802.1
> or 802.3. The interface between the bridge and MAC should be defined
> to enable the queues to be placed where most appropriate for the switch
> architecture. In fact, a switch could be implemented such that frame
> payloads bypass the bridge and the bridge only deals with the task of
> routing frame handles from MAC receivers to one or more MAC
> transmitters (Do the 802.1 standards prevent such a design?).
>
> As far as the IETF standards go, they don't seem to rely on layer 2 to
> drop frames (although we don't yet have a clear answer on this). If a
> router gets overwhelmed, it will drop packets. But if it supports ECN,
> it can start forwarding ECN notices before becoming overwhelmed. I
> think the jury is still out on whether the upper layers (in a confined
> network) would work better with layer 2 backpressure or layer 2 drops.
>
> From a datacenter server perspective, there is no doubt in my mind
> that backpressure would be preferable to drops.
>
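> (For readers not following the ECN reference: roughly, a router marks
> packets once its queue passes a threshold instead of waiting until it
> must drop them, along the lines of the sketch below; real AQMs act
> probabilistically, and the thresholds here are invented.)
>
> MARK_THRESHOLD = 100   # packets
> QUEUE_LIMIT = 200      # packets
>
> def handle_packet(queue_depth, ect_capable):
>     if queue_depth >= QUEUE_LIMIT:
>         return "drop"                    # queue full, no choice
>     if queue_depth >= MARK_THRESHOLD:
>         # Simplified to a hard threshold; RED-style AQMs are probabilistic.
>         return "forward, CE-marked" if ect_capable else "drop (early)"
>     return "forward"
>
> print(handle_packet(150, ect_capable=True))    # forward, CE-marked
> print(handle_packet(150, ect_capable=False))   # drop (early)
>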
> Gary
>
> --
> -----------------------------------------
> Benjamin Brown
> 178 Bear Hill Road
> Chichester, NH 03258
> 603-491-0296 - Cell
> 603-798-4115 - Office
> benjamin-dot-brown-at-ieee-dot-org
> (Will this cut down on my spam???)
> -----------------------------------------
>