# Multi-Lane PMD Reliability and Partial Fault Protection (PFP)

WB Jiang, Zeng Li, Min Ye, Chiwu Ding Huawei Technologies

IEEE 802.3ba Task Force, 23-25 Jan 2008

HUAWEI TECHNOLOGIES Co., Ltd. IEEE 802.3ba, January, 2008



## **Outline**

- Multi-Lane PMD reliability
- Current network level protection scheme
- Partial fault protection (PFP) proposal
- Conclusions



## **Leading IEEE 802.3ba PMD Options**

- 4x25G xWDM around 1310nm over 10km and 40km SMF
  - Early products will likely be EML based
    - Traditional InGaAsP based cooled EML
    - Uncooled InAlGaAs based EML
      - Cooling required for non-CWDM
  - Uncooled DML could be a low cost option if the distance is reduced to less than 10km to accommodate CWDM
- 4x10G and 10x10G over 100m parallel OM3 MMF
  - Likely candidate will be oxide VCSEL array



## OE Device Failure Rate Characteristics --- Bathtub Curve



- Early failures: Infant mortality normally eliminated through burn-in
- Random failures: constant failure rate throughout the device life due to accidental causes such as mishandling, ESD, EOS, etc.
- Wear-out failures: device aged and close to the end of its life
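
The three regions of the bathtub curve are often modeled with a Weibull hazard function, where the shape parameter decides whether the failure rate falls, stays flat, or rises. The sketch below is only an illustration of that idea: the Weibull model, the shape values, and the scale are assumptions chosen for the example, not figures from this presentation.

```python
import math

def weibull_hazard(t_hours, shape, scale_hours):
    """Instantaneous failure rate h(t) = (k/s) * (t/s)**(k-1) for t > 0."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1)

# Hypothetical scale, used only to make the numbers concrete.
SCALE_HOURS = 1.0e5

def hazard_trend(shape, t1=1.0e3, t2=1.0e4):
    """Compare the hazard at two times and name the trend between them."""
    h1 = weibull_hazard(t1, shape, SCALE_HOURS)
    h2 = weibull_hazard(t2, shape, SCALE_HOURS)
    if math.isclose(h1, h2, rel_tol=1e-12):
        return "constant"
    return "decreasing" if h2 < h1 else "increasing"

# Illustrative shape values for the three regions of the bathtub curve.
for label, shape in [("early", 0.5), ("random", 1.0), ("wear-out", 3.0)]:
    print(label, hazard_trend(shape))
```

A shape below 1 gives the falling early-failure region, a shape of exactly 1 gives the flat random-failure region, and a shape above 1 gives the rising wear-out region.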



## Single Mode Fiber PMD

- Traditional InGaAsP/InP technology has long been proven reliable
- InAlGaAs/InP technology improves high temperature performance by better confining carriers in the active region
  - Al is chemically active and readily traps oxygen
  - Exposed cleaved facets tend to rapidly degrade laser operation and facet degradation is the top cause of failures
  - Good facet protection is required to reduce early and random/sudden failures
  - Field deployment history is not long enough and volume not high enough for high confidence, although lab studies from some vendors have shown good long term reliability potential
- Early deployment will be based on 4 discrete laser components for the four-lane xWDM PMD, so the PMD failure rate will be about four times that of a single laser



## **Multimode Fiber PMD**

- 4-channel oxide VCSEL array for 40GE
- 10-channel oxide VCSEL array for 100GE
- Extension of discrete 10G oxide VCSEL technology
  - Smaller oxide confinement aperture than 1-4G oxide VCSELs



## Oxide VCSEL Structure Schematics



Current confinement aperture diameter

•1-4G VCSEL: 12 – 17 μm

•10G VCSEL: 6 – 10 μm



## **Oxide VCSEL Structure**

- A thin layer of AlGaAs (97-98% Al) oxidized into Al<sub>x</sub>O<sub>y</sub> at 400+ °C in water vapor for current confinement
- Mechanical stress introduced due to materials shrinkage after oxidation
  - Stress increases at lower temperature
  - Source of Dark Line Defects (⟨110⟩ DLD)
  - High temperature burn-in not effective in removing early failures due to this cause
- Al<sub>x</sub>O<sub>y</sub> introduced in the VCSEL structure calls for hermetic packaging
  - A cause for random failure for array VCSEL not hermetically packaged
- 10G VCSELs less reliable: smaller apertures make them more susceptible to mechanical stress
- Smaller devices more susceptible to ESD due to smaller capacitance
  - Latent ESD and point defects as sources of ⟨100⟩ DLD



## Oxide VCSEL Failure Modes --- Wear-out

#### Wear-out failure

- Failure rate increasing over time
- Uniform dimming due to point defect generation across the active emission area
- MTTF several million hours or longer
- Rarely a concern for VCSEL or VCSEL array
  - An array behaves similarly to a discrete in wear-out



## Oxide VCSEL Failure Modes --- Random and ESD

#### Random failure

- Constant failure rate
- Due to process flaws, mishandling, poor packaging, non-hermetic sealing, etc.
- ⟨100⟩ DLD or ⟨110⟩ DLD
  - Activation energy of 0.7eV normally used for wear-out does not apply
  - Lower temperature failure rate could be equivalent to or higher than higher temperature due to extra mechanical stress (flat or negative activation energy)
- Random nature, follow Gaussian statistics
  - An x10 array will be 10 times more likely to fail than a discrete VCSEL

#### ESD failure (HBM, MM, CDM)

- Large black spots in EL topography
- Latent ESD can be a source of ⟨100⟩ DLD, thus a cause of random failure
- 10G VCSEL more susceptible to ESD
- Unlike CMOS IC, VCSEL does not have any built-in ESD protection
- The failure path follows the weakest link
  - An x10 array will be 10 times more likely to fail than a discrete VCSEL
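
The "10 times more likely" figure can be checked with a short calculation. The sketch below assumes independent lanes, each with the same constant failure rate (which implies exponentially distributed lifetimes), and treats the array as failed when any one lane fails. The per-lane rate of 100 FIT is a hypothetical value chosen for illustration, not a measured number.

```python
import math

def prob_any_lane_fails(lane_rate_per_hour, lanes, hours):
    """P(at least one of `lanes` independent lanes fails within `hours`)."""
    return 1.0 - math.exp(-lanes * lane_rate_per_hour * hours)

def array_mttf_hours(lane_rate_per_hour, lanes):
    """MTTF of an array that counts as failed when its first lane fails."""
    return 1.0 / (lanes * lane_rate_per_hour)

# Hypothetical per-lane rate: 100 FIT = 100 failures per 1e9 device-hours.
LANE_RATE_PER_HOUR = 100 * 1e-9
ONE_YEAR_HOURS = 8760

p_discrete = prob_any_lane_fails(LANE_RATE_PER_HOUR, 1, ONE_YEAR_HOURS)
p_array = prob_any_lane_fails(LANE_RATE_PER_HOUR, 10, ONE_YEAR_HOURS)
print(f"x10 array vs discrete, one-year failure probability ratio: {p_array / p_discrete:.2f}")
```

For small failure probabilities the ratio is very close to the lane count, and under these assumptions the array MTTF is exactly the discrete MTTF divided by the lane count.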



## Oxide VCSEL Failure Modes --- Early Failures

### Infant Mortality

- Failure rate decreasing over time
- Badly processed or grown wafers
- Excessive mechanical stress due to oxidation
- ESD during process and packaging
- Burn-in removes most of the early failures, but some could escape from the burn-in screening due to long infant mortality tails of oxide VCSELs
  - Screening effectiveness varies from vendor to vendor
  - Once in a while, bad wafers escape screening
  - High temperature burn-in not effective in removing failures due to mechanical stress in the oxide VCSEL



## Field Failure Returns & Implication to Array

- Field transceiver failures due to VCSELs are not wear-out failures
  - Oxide VCSEL transceiver shipping history is less than 7 years
  - VCSEL wear-out MTTF is at least 10 times longer, even operating at 85°C
- 90%+ VCSEL failures from field returns are due to either DLD or ESD
  - Categorized as either residual early failures, random failures or ESD
  - An x10 VCSEL array will be 10 times more likely to fail than the current field returns based on discrete VCSELs
- Published VCSEL reliability studies and projections by VCSEL suppliers are done at temperatures too high to reveal the failure modes due to oxide layer stress, and are thus not effective in predicting random failure rates under normal operating conditions
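
To see why high temperature studies say little about stress-driven failures, consider the Arrhenius acceleration factor commonly used to scale a high temperature test result down to operating temperature. The sketch below is a generic Arrhenius calculation; the 40 °C use temperature, the 125 °C stress temperature, and the -0.1 eV value are hypothetical choices for illustration. Only the 0.7 eV wear-out activation energy and the notion of a flat or negative activation energy come from the slides.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def acceleration_factor(ea_ev, use_c, stress_c):
    """Arrhenius acceleration factor between a stress and a use temperature.

    Values above 1 mean the stress test ages the device faster than field use.
    """
    use_k = use_c + 273.15
    stress_k = stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k))

# Hypothetical temperatures: 40 C in the field, 125 C in the oven.
for ea in (0.7, 0.0, -0.1):
    print(f"Ea = {ea:+.1f} eV -> acceleration factor {acceleration_factor(ea, 40, 125):.3g}")
```

With 0.7 eV the oven accelerates wear-out by a large factor, with a flat activation energy it provides no acceleration at all, and with a negative one the failure mode is actually slower in the oven than in the field.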

## Field Return FA from a VCSEL Vendor



Ref: "A plot twist: the continuing story of VCSELs at AOC", Technical Publication, www.finisar.com



## **Partial Fault Protection (PFP)**

- Network level and physical layer link redundancy are normally used to protect from single link failure
- Initial 100G applications are most likely to be 10G aggregation, and the cost can be high
- A 4-lane PMD is 4 times more likely to fail and a 10-lane PMD 10 times more likely, so PFP at the PMD/PHY layer will reduce the burden on network layer protection and the need for a redundant physical layer link.
  - A multi-lane PMD provides an opportunity to keep transmitting data on the surviving lanes, without an immediate need for replacement, while reducing the network redundancy requirements



## **Current Line-card Protection Scheme**



- The connection between router and transport devices must be highly reliable
- 1+1 line card protection has been used
- However, it is costly and makes the physical links more complex



## 100GE Implementation per Current (1+1) Protection Scheme



- Two sets of system fabric buses are on for backup
- Redundancy reduces the equipment capacity and increases the system cost



- Path H-A-E-D-F: E to D is a 100GE link over a multi-lane PHY.
- If one lane fails in node E, node E will not be available and a new 100G link needs to be computed
  - Backup with an additional 100G link (e.g. 1+1) is very costly to the network (Capex)
  - If relying on a reroute protection protocol (e.g. FRR, 1:1), it may take a long time to compute new flow paths when there are large numbers of 100G flow streams.

| Number of flows | Recovery time     |
|-----------------|-------------------|
| <200            | hundreds of ms    |
| >200            | > several seconds |

ASON recovery time, depending on how many paths must be recovered



## Proposed Line-card Protection with PFP



- With PFP (partial fault protection), when a laser fails, the link connection will be maintained at a lower data rate
  - Laser has been the weakest link among all components in the system
- Partial fault indication may be sent to inform the higher level protocol (802.1, FRR, 1:1 protection) to decrease the bandwidth to accommodate the remaining lanes
- It will have little effect on the flow in the data network.
- Failed modules will be replaced during scheduled maintenance
- Network level protection kicks in when PFP fails
  - Reduce the need for redundant link card protection



## **PFP for Multi-Lane PMD**



- When a laser or lane fails, the receiver or DDM detects the failure.
- The receiver or SD sends LLF (Local Link Fail) to the higher level and RLF (Remote Link Fail) to the transmitter.
- The transmitter inserts NULL blocks into the failed lane and sends a flow control indication to the higher level protocol to decrease the bandwidth to accommodate the remaining lanes
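
The signalling steps above can be sketched as a small state model. This is a minimal illustration, not a proposed implementation: the class name `PfpLink`, its methods, and the choice to represent indications as a dictionary are all hypothetical, and only the LLF and RLF names and the idea of reducing bandwidth to the surviving lanes come from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class PfpLink:
    """Hypothetical model of one multi-lane link endpoint with PFP."""
    lane_rate_gbps: float
    lane_count: int
    failed_lanes: set = field(default_factory=set)

    def report_lane_failure(self, lane):
        """Receiver side: record a detected lane failure.

        Returns the indications to emit and their destinations.
        """
        if not 0 <= lane < self.lane_count:
            raise ValueError(f"lane {lane} out of range")
        self.failed_lanes.add(lane)
        return {"LLF": "higher level", "RLF": "remote transmitter"}

    def usable_bandwidth_gbps(self):
        """Bandwidth the higher level protocol should throttle down to."""
        return self.lane_rate_gbps * (self.lane_count - len(self.failed_lanes))

link = PfpLink(lane_rate_gbps=10.0, lane_count=10)
print(link.report_lane_failure(3))
print(link.usable_bandwidth_gbps())
```

After a single lane failure on a 10x10G link, the model reports nine surviving lanes, so the flow control indication would ask the higher level to limit traffic to 90G until the module is replaced.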



## Optional Line-card Protection for Fiber Breakage



- Passive splitter and optical switch introduced as 1+1 link protection to protect the link from fiber breakage
- The combination of PFP and 1+1 link protection provides a seamless PHY level protection to alleviate the need for network level protection through data rerouting



## PFP for 10-Lane PMD in PBL



 When informed of a lane failure, TX inserts NULL blocks into the failed lane, keeping valid blocks distributed to the valid lanes.



## PFP for 4-Lane PMD in PBL



 When informed of a lane failure, TX inserts NULL blocks into the failed lane, keeping valid blocks distributed to the valid lanes.
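
The NULL block insertion described for the 10-lane and 4-lane cases can be sketched as follows. The in-order distribution across healthy lanes, the `"NULL"` string standing in for the NULL block, and the function name `distribute_blocks` are assumptions made for illustration; the actual block encoding and distribution order are not specified here.

```python
NULL_BLOCK = "NULL"  # stand-in for the NULL block; real encoding not specified

def distribute_blocks(blocks, lane_count, failed_lanes):
    """Spread blocks over the healthy lanes, one row per transmit round.

    Failed lanes always carry NULL_BLOCK. Healthy lanes left without data
    in the final round are also padded with NULL_BLOCK.
    """
    healthy = [lane for lane in range(lane_count) if lane not in failed_lanes]
    if not healthy:
        raise ValueError("no surviving lanes")
    rounds = []
    for start in range(0, len(blocks), len(healthy)):
        row = [NULL_BLOCK] * lane_count
        for lane, block in zip(healthy, blocks[start:start + len(healthy)]):
            row[lane] = block
        rounds.append(row)
    return rounds

# 4-lane PMD with lane 1 failed: valid blocks use lanes 0, 2 and 3 only.
for row in distribute_blocks(list("ABCDEF"), 4, {1}):
    print(row)
```

Each round then carries three valid blocks instead of four, which is the reduced data rate that the flow control indication asks the higher level to respect.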



## PFP for 4-Lane PMD with CTBI in PBL





## **Conclusions**

- Non-wear-out failures are the major concerns for field PMD degradation
- The field return rate of current single-channel PMDs due to laser failures is not negligible, and the multi-lane PMD failure rate will be four to ten times higher
  - Encourage equipment vendors to share transceiver field failure statistics to help with the decision making
- Multi-lane PMD provides an opportunity for lower cost link fault protection by utilizing the surviving lanes --- PFP
  - PFP is complementary to the current network protection architecture
  - PFP makes line-card protection more economical
    - NULL block is defined to disable the failed lane
    - Laser/PHY lane failure: use PFP
    - Fiber failure: use 1+1 protection
    - Router or node failure: use network protection
- PFP indication may be used by higher layer (802.1) to achieve network wide protection
  - Dialogue needed with 802.1



## Thank You