
Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy



Ben,

I agree, 6 hops would probably cover the vast majority of layer 2
subnets in existence. The way I arrived at that number wasn't very
scientific. I was thinking in terms of how a few racks of blade-based
shelves might be hierarchically interconnected by a single layer 2
subnet: the bottom layer of the hierarchy interconnects the blades in
a shelf (although there could also be a lower layer of switching
integrated into the blades), a second layer interconnects a group of
shelves, and a third layer might be required to interconnect a group of
racks. With 3 layers of hierarchy, the longest path between any two
endpoints is 5 stages of switching and 6 hops (3 hops up and 3 hops
down).
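The hierarchy arithmetic above generalizes, and can be sketched as a quick check (the helper name is invented here for illustration, not something from the thread):

```python
# Quick check of the hop arithmetic: with h layers of switching, the
# worst-case path climbs h switches up and h switches down, sharing
# the top-layer switch, giving 2h - 1 switching stages and 2h hops
# (one more link than switches traversed).
def worst_case_path(layers: int) -> tuple[int, int]:
    stages = 2 * layers - 1   # switches traversed end to end
    hops = 2 * layers         # links traversed between endpoints
    return stages, hops

print(worst_case_path(3))  # 3 layers -> (5, 6): 5 stages, 6 hops
```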

I was simply putting a stake in the ground hoping others from systems
companies would chime in with objections or supporting information. From
the point of view of bounding the space for study, 6 hops isn't that
scary to me. (Correct me if I'm wrong) I assume the objective of a
study is not to solve the problems but to gain enough understanding of
the problem space so that we can determine how much of that space we can
realistically address. Even if we set the bounds at 6 hops, the results
of the study may indicate a reasonable set of solutions will only cover
3 or 4 hops. But if we don't set the study challenges high enough, then
the solution space we consider won't cover a broad enough spectrum to
enable significant enhancements.

Can we poll people on the reflector for the sizes of systems (or
microcosms) they think we should consider in this study, in terms of the
following?

1. Max # of subnet endpoints:

2. Max # of hops between any two endpoints in a subnet:

3. Max length of any link in the subnet:

4. Sweet-spot number of endpoints (that will cover >80% of the market):

Thanks,
Gary


-----Original Message-----
From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
[mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Benjamin
Brown
Sent: Saturday, June 19, 2004 4:50 PM
To: STDS-802-3-CM@LISTSERV.IEEE.ORG
Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy


Gary,

Sorry for the delay in responding. I want to thank you for your
perseverance in this effort. I've asked you some tough questions and put
a lot of pressure on you for the answers. I don't mean to personalize
this level of questioning but you seem to be the only one providing the
answers.

I agree with you and Siamack regarding the desire to bound the scope of
what we're studying. I was also trying to do that when I started this
thread by limiting the scope to a single hop. Expanding this to 6 hops
is really quite an expansion. While I don't make a habit of studying
this too much, when I think about 6 hops and what that might cover, it
occurs to me that this would probably encompass the vast majority of
layer 2 networks. Certainly this should support the Broad Market
Potential criteria. However, reducing the number of hops might still
provide a BMP while simplifying the solution.

Do we have numbers on this new "microcosm" space (to steal
a word from another email of yours) regarding the number of hops that
covers the majority of its applications?

Others that have knowledge of this space are encouraged to respond also!

Thanks,
Ben

McAlpine, Gary L wrote:

>Ben,
>
>I'm retransmitting this for two reasons: 1) I'm not sure the first one
>will get to the reflector, and 2) I missed your question: "Are you
>prepared to take a stab at bounding this scope?"
>
>We started bounding the scope with objectives in Long Beach but never
>finished. Some were approved but others were deferred. I took a stab
>in a previous email to the reflector which was essentially the
>following: To support backplane interconnects, a few shelves in a rack,
>and a few racks all interconnected by one subnet: 1) 100 m per hop max,
>2) 5 stages of switching max (3 layers of switching hierarchy), and
>3) 6 hops max in any path. These are not to be treated as hard limits but
>to bound the scope of the study to something we can get our arms
>around.
>
>See previous transmission below.
>
>Gary
>
>
>-----Original Message-----
>From: McAlpine, Gary L
>Sent: Tuesday, June 15, 2004 9:00 AM
>To: 'benjamin.brown@IEEE.ORG'; STDS-802-3-CM@listserv.ieee.org
>Subject: RE: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
>Ben,
>
>The number of queues doesn't need to explode with the number of hops
>supported. The number of queues per link can be bounded to a very
>reasonable number and still provide significant benefit.
>
>Having the additional Tx queuing paths, plus feedback that enables
>servicing the queues in an optimum order for the peer component,
>provides a subtle change in dynamics during congestion that increases
>the efficiency throughout the subnet. The result is higher throughput,
>lower latency, and lower latency variation.
>
>The trick is in defining the optimum granularity (cost vs. performance)
>and the definition of what a queue path represents. Prioritization
>gives us one dimension of granularity. Adding another dimension of,
>say, 4 to 16 logical paths per link, can significantly increase
>efficiency without significantly increasing cost and complexity. It can
>also be implemented at the MAC level (or above). Since our interest is
>in enhancing Ethernet, our simulations implement the expanded queuing
>in the MAC.
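As a rough illustration of why the queue count stays bounded, a per-link transmit queue map might look like the following (all names and the specific sizes are hypothetical; the thread only fixes the 4-to-16 range for the second dimension):

```python
# Illustrative sketch: per-link Tx queues bounded to
# priorities x logical paths, independent of the number of hops.
NUM_PRIORITIES = 8   # the existing 802.1p priority dimension
NUM_PATHS = 8        # second dimension, anywhere in the 4-16 range

def tx_queue_index(priority: int, flow_hash: int) -> int:
    """Map a frame to one of a fixed set of per-link queues."""
    path = flow_hash % NUM_PATHS           # logical path on this link
    return priority * NUM_PATHS + path     # at most 64 queues per link

# Every frame lands in one of 64 queues, no matter how large the subnet.
assert 0 <= tx_queue_index(7, 123456) < NUM_PRIORITIES * NUM_PATHS
```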
>
>Gary
>
>P.S. Here is a little puzzle for the people on the reflector. I ran
>into this a few years ago while simulating flow control over 30 km 10
>Gb/s links. The answer is one of the subtle changes in dynamics
>mentioned above. (It needs further study.)
>
>I had a 4 port switch with 30 km links on each port and enough buffer
>to support pause-based flow control. I had also been experimenting with
>expanded link-level queuing and congestion avoidance mechanisms, so the
>models already had these mechanisms built in. I knew these mechanisms
>wouldn't provide any benefit over 30 km because of the extreme time
>delay in the feedback (it would be too out-of-date for the
>decisions being made on it). The simulations proved that the pause flow
>control (not XON/XOFF) worked just fine across long links given the
>appropriate buffering in the switch. However, with a bursty workload
>the throughput efficiency of the switch was about 65% during
>congestion. Out of curiosity, I decided to try the congestion avoidance
>mechanisms under the same conditions and was blown away by the results.
>The efficiency went up to 98% throughput, the latency and latency
>variations went down, and the switch buffer utilization went way down.
>Why?
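A back-of-the-envelope calculation shows why the feedback in this scenario is so stale, and how much buffering lossless pause needs over 30 km (assuming roughly 5 us/km of propagation delay in fiber, an approximation):

```python
# Bandwidth-delay product for a 30 km, 10 Gb/s link.
LINK_KM = 30
RATE_BPS = 10e9
PROP_US_PER_KM = 5.0          # ~5 us/km in fiber (approximation)

one_way_us = LINK_KM * PROP_US_PER_KM   # 150 us each way
rtt_us = 2 * one_way_us                 # 300 us round trip
# Bits still in flight by the time a pause takes effect: the minimum
# lossless buffer per port, before adding any response-time margin.
bdp_bytes = RATE_BPS * rtt_us * 1e-6 / 8
print(f"{bdp_bytes / 1000:.0f} KB per port")  # ~375 KB
```

Any feedback signal is at least 300 us old when it is acted on, which is why fine-grained decisions based on it are out-of-date.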
>
>
>
> -----Original Message-----
>From: owner-stds-802-3-cm@listserv.ieee.org
>[mailto:owner-stds-802-3-cm@listserv.ieee.org] On Behalf Of Benjamin
>Brown
>Sent: Tuesday, June 15, 2004 5:45 AM
>To: STDS-802-3-CM@listserv.ieee.org
>Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
>
>Gary,
>
>Finer granularity immediately brings to mind a myriad of possibilities.
>However, as others have mentioned in this thread, the number of queues
>required to support finer granularities simply explodes as the number
>of hops in the network increases.
>
>In a previous note to this thread, you mentioned that CMSG may be most
>applicable to a "microcosm" network and
>
>"Bounding the scope of the microcosms, in which we
>are trying to enable the use of Ethernet as the
>local interconnect, will help us define the set of
>assumptions that apply in that space."
>Are you prepared to take a stab at bounding this scope?
>
>Thanks,
>Ben
>
>McAlpine, Gary L wrote:
>
>Siamack,
>
>Excellent summary. I think this is exactly the right direction to
>proceed.
>
>Supporting a finer granularity of flows and flow control at the link
>level can translate to significantly better system characteristics. The
>question is: what granularity and what flow definition provides the
>optimum cost vs. performance trade-offs? I don't think we can answer
>this or the other related questions without further study. Isn't that
>what the CMSG is about?
>
>Gary
>
>-----Original Message-----
>From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
>[mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Siamack
>Ayandeh
>Sent: Friday, June 11, 2004 5:48 AM
>To: STDS-802-3-CM@LISTSERV.IEEE.ORG
>Subject: Re: [8023-CMSG] {Spam?} Re: [8023-CMSG] Server/NIC analogy
>
>
>Ben,
>
>It may help to start from a more limited scope with clear value before
>we venture far into more complex territory.
>
>Clearly there must have been a perceived value in the existing Pause
>mechanism which is part of the standard and widely deployed.  This
>mechanism, or a yet to be defined mechanism, can be improved in the
>following sense:
>
>1) Scope: Can remain as is, i.e. a single link. Given that Ethernet is a
>technology that is being used in a wide range of applications, from
>inter-chip communication to the local loop for Metro Ethernet services,
>a single link would cover a wide range of applications.
>
>2) Granularity: Needs to be improved. The granularity can be defined by
>introduction of a grain_ID (I am treading carefully here & don't use
>flow-ID). How this is mapped to Class of Service, VLAN tags, etc.
>becomes a local matter over a single link and need not be part of a
>standard. It is application dependent. Sure there are problems to be
>solved here, but that's why we need a study group.
>
>The need is to create multiple control loops rather than one. How these
>get mapped is a local decision over a single link.
>
>3) Flow control algorithm: Currently ON/OFF control is in place. This
>is a simple and effective mechanism. Whether it can be improved using
>the so-called "rate based" algorithms or something else is to be seen
>and is the subject of study for the working group.
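For illustration only, the two loop styles in item 3 can be contrasted in toy form (the thresholds and function names are invented here, not taken from the thread):

```python
# Toy contrast of the two control styles in item 3.
def xon_xoff(queue_depth: int, hi: int = 80, lo: int = 20) -> str:
    """ON/OFF control: a binary stop/go signal per grain_ID."""
    if queue_depth >= hi:
        return "XOFF"   # tell the peer to stop sending this grain
    if queue_depth <= lo:
        return "XON"    # tell the peer it may resume
    return "HOLD"       # leave the peer's current state unchanged

def rate_grant(queue_depth: int, capacity: int = 100) -> float:
    """Rate-based control: grant a fraction of line rate instead
    of a binary signal, so the sender slows before queues fill."""
    return max(0.0, 1.0 - queue_depth / capacity)
```

The ON/OFF loop oscillates between full rate and zero; the rate-based loop can hold a sender at an intermediate rate, which is one reason it is a candidate for study.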
>
>In this limited context the study group can add value and produce a
>useful extension to the existing Pause flow control mechanism.
>
>Whether more can be done, e.g. to extend the scope to multiple hops, will
>no doubt arise and be debated in the course of the study. However the
>ambiguity that currently is floating around this subject should not
>prevent concrete progress in the more limited context.
>
>Regards, Siamack
>
>
>
>Benjamin Brown wrote:
>
>
>Gary,
>
>You say you're seeing promising results from simulations
>but you're not ready to share the data. I certainly hope
>that will change before the presentation deadline for the July meeting
>in 4 weeks.
>
>I don't mean to pick on you but you seem to be the only
>one that is taking up the flag AND at least suggesting that there is
>simulation data to back up your claims.
>
>As chair of this group, I'm trying to stir up discussion in order to
>get all the arguments on the table. If there are flaws in these
>arguments (the "gospels" as you call them) and the exploitation of
>these flaws has broad market potential and is both technically and
>economically feasible, then we need to get this information
>disseminated as soon as possible.
>
>I don't think we can try to go through the July meeting without this
>material and expect to get a continuation of this study group.
>
>Regards,
>Ben
>
>McAlpine, Gary L wrote:
>
>
>Norm,
>
>I agree with you on many of your points below. A higher granularity of
>"flow" than 8 priorities is needed to get any significant improvement
>across multiple stages of switching. I know I'm being vague about
>exactly what granularity of "flow" on which I want to exert targeted
>influence (rate control/backpressure). It's not because I don't know,
>it's because any discussions on the subject without data to back the
>proposals will "simply" turn into a big rathole. I am busy developing
>the data.
>
>I understand all your arguments below. I've been listening to the same
>ones for the last 15 years and, until a few years ago, treating them as
>the gospel. It wasn't until I set out to thoroughly understand the
>gory details through simulations that I realized there were some
>interesting flaws in the "old" assumptions that can be very effectively
>exploited in confined networks such as multi-stage cluster
>interconnects.
>
>I guess I don't see such a clear boundary of responsibility between
>802.1 and 802.3 as you do. I think it's an IEEE problem. And since the
>target link technology is Ethernet, the focus should be on the
>802.3 support required to enable acceptable Ethernet-based solutions. I
>think 802.1 needs to be part of a complete solution, but only to the
>extent of including support for the 802.3 mechanisms.
>
>Gary
>
>
>
>-----Original Message-----
>From: owner-stds-802-3-cm@LISTSERV.IEEE.ORG
>[mailto:owner-stds-802-3-cm@LISTSERV.IEEE.ORG] On Behalf Of Norman Finn
>Sent: Wednesday, June 09, 2004 2:33 AM
>To: STDS-802-3-CM@LISTSERV.IEEE.ORG
>Subject: Re: [8023-CMSG] Server/NIC analogy
>
>
>Gary,
>
>McAlpine, Gary L wrote:
>
>I think this discussion is off on a tangent.
>
>One can reasonably claim that you're the one who's off on a tangent.
>One man's tangent is another man's heart of the argument.  You keep
>saying, "we're just ..." and "we're only ..." and "we're simply ..."
>and failing to acknowledge our "but you're ..." arguments.
>Specifically:
>
>You want back pressure on some level finer than whole links. The heart
>of the argument, that you are not addressing in your last message, is,
>"On exactly what granularity do you want to exert back pressure?"
>
>The answer to that question is, inevitably, "flows". (I have no
>problem that "flows" are relatively undefined; we dealt with that in
>Link Aggregation.) Per-flow back pressure is the "but you're ..."
>argument.
>
>Hugh Barrass's comments boil down to exactly this point.  You want to
>have per-flow back pressure.
>
>The "per-something Pause" suggestions have mentioned VLANs and priority
>levels as the granularity. The use of only 8 priority levels, and thus
>only 8 flows, is demonstrably insufficient in any system with more than
>9 ports. For whatever granularity you name, you
>require at least one queue in each endstation transmitter for each flow
>in which that transmitter participates. Unfortunately, this o(n)
>problem in the endstations is an o(n**2) problem in the switch. A
>simple-minded switch architecture requires one queue per flow on each
>inter-switch trunk port, which means o(n**2) queues per trunk port. The
>construction of switches to handle back-pressured flows without
>requiring o(n**2) queues per inter-switch port has been quite
>thoroughly explored by ATM and Fibre Channel, to name two. It is
>*not* an easy problem to solve.
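The o(n**2) growth Norm describes is easy to make concrete. A hypothetical count, assuming one flow per ordered endstation pair and all of them potentially crossing a single trunk:

```python
# Worst-case per-flow queue count on one inter-switch trunk port,
# assuming every ordered endstation pair is a distinct flow and any
# of them may cross this trunk (the simple-minded architecture).
def trunk_queue_count(n_endstations: int) -> int:
    return n_endstations * (n_endstations - 1)

print(trunk_queue_count(16))   # 240 queues on a single trunk port
print(trunk_queue_count(256))  # 65280 -- already exceeds a 12-bit tag
```

At 256 endstations the flow count is already past the 4096 values a 12-bit tag can name, which is the scaling pressure mentioned below.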
>
>At the scale of one switch, one flow per port, and only a few ports, as
>Ben suggests, it is easy and quite convenient to ignore the o(n**2)
>factor, and assume that the per-link back pressure protocol is the
>whole problem. Unfortunately, as you imply in your e-mail below, the
>trivial case of a one-switch "network" is insufficient. As soon as you
>scale the system up to even "a few hops", as you suggest, the number of
>ports has grown large enough to stress even a 12-bit tag per
>link. Furthermore, to assume that a given pair of physical ports will
>never want to have multiple flows, e.g. between different processes in
>the CPUs, is to deny the obvious.
>
>In other words, implementing per-flow back pressure, even in networks
>with a very small number of switches, very quickly requires very
>sophisticated switch architectures.
>
>For a historical example, just look at Fibre Channel. It started with
>very similar goals, and very similar scaling expectations, to what
>you're talking about here. (The physical size was different because
>of the technology of the day, but the number of ports and flows was
>quite similar.) Fibre Channel switches are now quite sophisticated,
>because the problem they are solving becomes extraordinarily difficult
>even for relatively small networks.
>
>Summary:
>
>This project, as described by its proponents, is per-flow switching. It
>is not the job of 802.3 to work on switching based even on MAC address,
>much less per-flow switching. It is essential that anyone who desires
>to work on per-flow switching in 802 or any forum become familiar with
>what the real problems are, and what solutions exist.
>
>-- Norm
>
>
>... There are assumptions being
>made here that are off-base. We need to focus our attention on what it
>is we are trying to enable with new standards. (My numbered items are
>responses to Hugh's numbered items.)
>
>1. If what we are trying to enable are single stage interconnects for
>backplanes, then wrt the IEEE standards, we're done. We just need to
>get good implementations of NICs and switches using 802.3x (rate
>control, not XON/XOFF) to meet the requirements (e.g. good enough
>throughput, low latency, low latency variation, no loss due to
>congestion). But ... single stage interconnects are not very
>interesting to people who want to construct larger interconnects to tie
>multiple racks with multiple shelves of blades together into a single
>system.
>
>2. (Putting on my server hat) We're NOT asking for IEEE to provide
>end-to-end congestion management mechanisms. If IEEE can simply
>standardize some tweaks to the current 802.3 (& 802.1) standards to
>support better congestion visibility at layer 2 and better methods of
>reacting to congestion at layer 2 (more selective rate control and no
>frame drops), then the rest can be left up to the upper layers. There
>are methods that can be implemented in layer 2 that don't prohibit
>scalability. Scalability may be limited to a few hops, but that is all
>that is needed.
>
>3. The assumption in item 3 is not entirely true. There are
>relationships (that can be automatically discovered or configured) that
>can be exploited for significantly improved layer 2 congestion control.
>
>4. For backpressure to work, it neither requires congestion to be
>pushed all the way back to the source nor does it require the
>backpressuring device to accurately predict the future. From the layer
>2 perspective, the source may be a router. So back pressure only needs
>to be pushed up to the upper layers (which could be a source endpoint
>or a router). Also, the backpressuring device simply needs to know its
>own state of congestion and be able to convey clues to that state to
>the surrounding devices. We don't need virtual circuits to be supported
>at layer 2 to get "good enough" congestion control.
>
>5. From an implementation perspective, I believe the queues can go
>either in the MAC or the bridge, depending on the switch
>implementation. (Am I wrong? I haven't seen anything in the interface
>between the bridge and the MAC that would force the queues to be in the
>bridge.) IMO, where they go should NOT be dictated by either 802.1
>or 802.3. The interface between the bridge and MAC should be defined to
>enable the queues to be placed where most appropriate for the switch
>architecture. In fact, a switch could be implemented such that frame
>payloads bypass the bridge and the bridge only deals with the task of
>routing frame handles from MAC receivers to one or more MAC
>transmitters. (Do the 802.1 standards prevent such a design?)
>
>As far as the IETF standards go, they don't seem to rely on layer 2 to
>drop frames (although we don't yet have a clear answer on this). If a
>router gets overwhelmed, it will drop packets. But if it supports ECN,
>it can start forwarding ECN notices before becoming overwhelmed. I
>think the jury is still out on whether the upper layers (in a confined
>network) would work better with layer 2 backpressure or layer 2
>drops.
>
>From a datacenter server perspective, there is no doubt in my mind that
>backpressure would be preferable to drops.
>
>Gary
>
>
>
>
>--
>-----------------------------------------
>Benjamin Brown
>178 Bear Hill Road
>Chichester, NH 03258
>603-491-0296 - Cell
>603-798-4115 - Office
>benjamin-dot-brown-at-ieee-dot-org
>(Will this cut down on my spam???)
>-----------------------------------------
>
>
>

--
-----------------------------------------
Benjamin Brown
178 Bear Hill Road
Chichester, NH 03258
603-491-0296 - Cell
603-798-4115 - Office
benjamin-dot-brown-at-ieee-dot-org
(Will this cut down on my spam???)
-----------------------------------------