2025-02-04
18/900,714
2024-09-28
US 12,218,848 B1
2025-02-04
Jung H Park
2044-09-28
Datalink (data link) frames or networking packets contain protocol information. A system and method are disclosed in which part of or all of the protocol information is contained in the same data link frame as the network packet or another datalink frame referred to as STPI. The STPI contains enough protocol information to identify the source of the datalink, the destination and the next hop node or port. An STPI sent in a datalink frame can be a request feed-back to avoid network congestion. The request STPI will be a pause or slow down request and comprise the source, destination and class of the datalink frames that are causing the congestion. There will be one datalink frame or packet for each non-request STPI, called DFoNP, containing data. The creation of STPI and DFoNP is done by the originator of the network packet such as an operating system coupled to an end node.
H04L47/24 » CPC main
Traffic control in data switching networks; Flow control; Congestion control Traffic characterised by specific attributes, e.g. priority or QoS
G06F13/4022 » CPC further
Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units; Information transfer, e.g. on bus; Bus structure; Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
G06F13/4282 » CPC further
Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units; Information transfer, e.g. on bus; Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
H04L45/74 » CPC further
Routing or path finding of packets in data switching networks Address processing for routing
H04L47/125 » CPC further
Traffic control in data switching networks; Flow control; Congestion control; Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
H04L49/25 » CPC further
Packet switching elements Routing or path finding in a switch fabric
H04L69/32 » CPC further
Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass; Definitions, standards or architectural aspects of layered protocol stacks Architecture of open systems interconnection [OSI] 7-layer type protocol stacks, e.g. the interfaces between the data link level and the physical level
H04L69/324 » CPC further
Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass; Definitions, standards or architectural aspects of layered protocol stacks; Architecture of open systems interconnection [OSI] 7-layer type protocol stacks, e.g. the interfaces between the data link level and the physical level; Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions in the data link layer [OSI layer 2], e.g. HDLC
G06F13/40 IPC
Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units; Information transfer, e.g. on bus Bus structure
G06F13/42 IPC
Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units; Information transfer, e.g. on bus Bus transfer protocol, e.g. handshake; Synchronisation
This application is a continuation of U.S. application Ser. No. 18/648,425 filed on Apr. 28, 2024 titled “DMA in PCI Express Network Cluster” which is a continuation of U.S. application Ser. No. 18/600,441, filed on Mar. 8, 2024, entitled “PCI Express network Cluster” which is a continuation of U.S. application Ser. No. 18/201,779, filed on May 25, 2023, entitled “A System for Avoiding Layer 2 Network Congestion”, now U.S. Pat. No. 11,956,154, issued on Apr. 9, 2024, which is a continuation of U.S. application Ser. No. 17/834,097, filed on Jun. 7, 2022, entitled “Delaying Layer 2 Frame Transmission”, now U.S. Pat. No. 11,706,148, issued on Jul. 6, 2023, which is a continuation of U.S. application Ser. No. 17/062,594, filed on Oct. 4, 2020, entitled “Data link Frame Reordering”, now U.S. Pat. No. 11,398,985, issued on Jul. 26, 2022, which is a continuation of U.S. application Ser. No. 16/132,427, filed on Sep. 16, 2018, entitled “Network Congestion and Packet Reordering”, now U.S. Pat. No. 10,841,227, issued on Nov. 17, 2020 which is a continuation of U.S. application Ser. No. 15/268,729, filed on Sep. 19, 2016, entitled “Networking using PCI Express”, now U.S. Pat. No. 10,110,498, issued on Oct. 23, 2018, which is a divisional application of U.S. application Ser. No. 14/120,845, filed on Jul. 1, 2014, entitled “Method for Congestion Avoidance”, now U.S. Pat. No. 9,479,442, issued on Oct. 25, 2016, which is a continuation of U.S. application Ser. No. 13/385,155, filed on Feb. 6, 2012, entitled “Method for Identifying Next Hop”, now U.S. Pat. No. 8,811,400 issued on Aug. 19, 2014, which is a continuation of U.S. application Ser. No. 11/505,788, filed on Aug. 18, 2006, entitled “Creation and Transmission of Part of Protocol Information Corresponding to Network Packets or Data link Frames Separately”, now U.S. Pat. No. 8,139,574 issued on Mar. 20, 2012, all of which are incorporated herein by reference in their entirety.
The present invention relates to efficient transfer of data link frames or network packets in a “custom” network. The network is “custom” as all switches and end nodes need to create or process data link frames or packets of special formats.
The OSI, or Open System Interconnection, model defines a networking framework for implementing protocols in seven layers. Most networking protocols do not implement all seven layers, but only a subset of them. For example, the TCP and IP protocols correspond to layers 4 (TCP) and 3 (IP) respectively. Network packets contain protocol layer information corresponding to the packet. For example, a TCP/IP packet contains a header with both TCP and IP information corresponding to the packet.
The physical layer (layer 1) specifies how a bit stream is created on a network medium and the physical and electrical characteristics of the medium. The data link layer (layer 2) specifies framing, addressing and frame level error detection. For outgoing packets to the network, the data link layer receives network packets from the networking layer (layer 3), creates data link frames by adding data link (layer 2) protocol information, and passes the frames to the physical layer. For incoming packets from the network, the data link layer receives data link frames from the physical layer (layer 1), removes the data link (layer 2) protocol information and passes the network packet to the networking layer. The network layer (layer 3) specifies network addresses and protocols for end to end delivery of packets.
Network packets contain protocol layer information corresponding to the packet. FIG. 1A illustrates a network packet containing 01001 layer 1, 01002 layer 2, 01003 layer 3, 01004 layer 4 headers, 01005 Data and 01008 layer 1, 01007 layer 2, 01006 layer 3 trailers. FIG. 1B illustrates a network packet with 01011 layer 1, 01012 layer 2 (data link), 01013 layer 3 (networking) and 01014 layer 4 (transport) headers and 01017 layer 1 and 01016 layer 2 trailers and 01015 Data. For each layer, the corresponding header and trailer (if present) together contain all the protocol information required to send the packet/frame to the consumer of the data in a remote node.
For example, headers/trailers corresponding to a TCP/IP packet in a 10BaseT Ethernet LAN are:
When parts of a network get congested and end nodes continue transmitting packets to the congested parts, more and more switches can get congested. This can lead to switches dropping large numbers of packets, nodes retransmitting the dropped or lost packets, and the network slowing down.
U.S. Pat. No. 6,917,620 specifies a method and apparatus for a switch that separates the data portion and the header portion. This method has the disadvantage that it requires overhead and logic for separating the data portion and the header portion and then combining them again before transmission. This method also cannot consolidate headers from more than one packet for transmission to the next node, or delay packet arrival if the destination path of the packet is congested, and therefore cannot avoid congestion.
According to claim (1)(c) of U.S. Pat. No. 5,140,582, the header portion of a packet is decoded prior to the receipt of the full packet to determine the destination node. This invention can help with faster processing of the packet within a switch. This method cannot consolidate headers from more than one packet for transmission to the next node, or delay packet arrival if the destination path of the packet is congested, and therefore cannot avoid congestion.
U.S. Pat. No. 6,032,190 specifies an apparatus and method of separating the header portion of an incoming packet, keeping the header portion in a set of registers, and combining the header portion with the data portion before transmitting the packet. This method has the disadvantage that it requires overhead and logic for separating the data portion and the header portion. This method cannot consolidate headers from more than one packet for transmission to the next node, or delay packet arrival if the destination path of the packet is congested, and therefore cannot avoid congestion.
U.S. Pat. No. 6,408,001 improves transport efficiency by identifying a plurality of packets having a common destination node, transmitting at least one control message, assigning a label to these packets and removing part or all of the header. This method has the disadvantage that switches need to identify messages with a common destination node and need additional logic to remove the header and add the label. This method cannot delay packet arrival if the destination path of the packet is congested and therefore cannot avoid congestion.
It is the object of the present invention to create and transmit part of protocol information separately from the Datalink Frame or Network Packet (DFoNP) containing data. The Separately Transmitted Protocol Information is referred to as STPI. Network congestion can be reduced or avoided using STPI.
According to the invention, there should be at least one DFoNP which contains the data and rest of the protocol information not contained in STPI, corresponding to each STPI. Preferably, there will be only one DFoNP corresponding to each STPI. The STPI and DFoNP together contain all the protocol information required to send the packet/frame to the consumer of the data in a remote node.
The creation of STPI and DFoNP is done by the originator of the frame or packet such as an operating system in an end node. The format (contents and location of each information in a frame or packet) of the frame or packet containing STPI and DFoNP should be recognized by the final destination of the frame or packet. The format of STPI and DFoNP should also be recognized by switches in the network. So preferably, all STPIs and DFoNP in a given network should be of fixed formats.
Preferably, one or more STPIs are transmitted in a datalink frame or a network packet. The datalink frame containing STPIs is referred to as STPI Frame. The network packet containing STPIs is referred to as STPI packet. The switches in this case should be capable of extracting each STPI in an incoming STPI Frame or STPI packet and forwarding it to the next node in a different STPI Frame or STPI Packet. The switches can add each STPI from an incoming STPI Frame or STPI Packet into an STPI Frame or STPI Packet it creates. Preferably, the layer 2 address in the datalink frame containing multiple STPIs will be the next hop node address.
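The regrouping step described above, in which a switch extracts each STPI from incoming STPI Frames and places it into an outgoing STPI Frame for its next hop, can be sketched roughly as follows. This is only an illustration: the function name `regroup_by_next_hop` and the `next_hop_of` lookup are hypothetical, and a frame is modelled simply as a list of STPIs.

```python
from collections import defaultdict


def regroup_by_next_hop(incoming_frames, next_hop_of):
    """Build one outgoing STPI list per next hop from incoming STPI frames.

    incoming_frames: iterable of frames, each a list of STPIs.
    next_hop_of: hypothetical routing lookup mapping an STPI to a next hop.
    """
    outgoing = defaultdict(list)
    for frame in incoming_frames:
        for stpi in frame:
            # Each STPI is extracted and appended to the frame under
            # construction for the next hop it should be forwarded to.
            outgoing[next_hop_of(stpi)].append(stpi)
    return dict(outgoing)
```

In this sketch the key of each outgoing entry plays the role of the layer 2 next hop address of the created STPI Frame.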
Optionally, an STPI Frame or STPI Packet contains the number of STPIs or the length of the STPI frame. Optionally, an STPI Frame or STPI Packet contains the offset or position of the STPIs in the STPI frame; this is required only if the STPIs supported by the network are not of fixed length.
Optionally, STPI Frame or STPI Packet does not contain the number of STPIs and switches in the network are capable of identifying the number of STPIs from length of the frame as they are of fixed length.
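As a rough illustration of the fixed-length case, the Python sketch below packs several STPIs into one frame payload and recovers their number from the payload length alone. The field layout in `STPI_FORMAT` and the function names are assumptions made for this example; the text does not prescribe any particular layout.

```python
import struct

# Hypothetical fixed-length STPI layout (an assumption, not from the text):
# destination id, originating node id, sequence number,
# node currently holding the DFoNP, buffer offset of the DFoNP.
STPI_FORMAT = "!HHIHI"
STPI_SIZE = struct.calcsize(STPI_FORMAT)  # 14 bytes for this layout


def pack_stpi_frame(stpis):
    """Concatenate fixed-length STPIs into one STPI frame payload."""
    return b"".join(struct.pack(STPI_FORMAT, *s) for s in stpis)


def unpack_stpi_frame(payload):
    """Recover the STPIs; their count is implied by the payload length."""
    if len(payload) % STPI_SIZE != 0:
        raise ValueError("payload is not a whole number of STPIs")
    count = len(payload) // STPI_SIZE
    return [
        struct.unpack_from(STPI_FORMAT, payload, i * STPI_SIZE)
        for i in range(count)
    ]
```

Because every STPI has the same size, no explicit count or offset field is needed in this sketch; a variable-length format would need one of the optional fields described above.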
Preferably, some protocol information contained in STPI may not be contained in the corresponding DFoNP. But protocol information contained in STPI and the corresponding DFoNP need not be mutually exclusive. In this method, the switches obtain both STPI and the corresponding DFoNP before the STPI and the corresponding DFoNP are forwarded. Optionally, STPI need not be forwarded to the end node if sufficient protocol information is contained in the corresponding DFoNP.
The proposed invention can be employed for data, control and/or RDMA packets in a network.
The proposed method allows switches to read more than one STPI and then delay obtaining the corresponding DFoNPs. The DFoNPs may be read or forwarded in a different order compared to the order in which the STPIs are read or forwarded. This method allows switches to optimize resources and packet/frame forwarding efficiency.
STPI contains temporary information such as the current node or the port number of the node containing the corresponding DFoNP. STPI also contains an address of a buffer containing the corresponding DFoNP, or an offset in a buffer where the corresponding DFoNP is stored, or an index of the corresponding DFoNP in an array. This information helps in associating an STPI with the corresponding DFoNP. The exact information contained in STPI, whether it is an address or an offset or an index or a combination of these, is implementation specific.
Optionally, STPI may contain originating node identifier and a sequence number. Such information can help in reporting errors when STPI or corresponding DFoNP are corrupted or lost.
Optionally, STPI may contain other vendor specific or DFoNP related miscellaneous information.
Optionally, DFoNP may contain some information that helps in associating it with the corresponding STPI, such as the originating node identifier and a sequence number. Preferably, the DFoNP sequence number is the same as the sequence number of the corresponding STPI.
Optionally, DFoNP may contain other vendor specific miscellaneous information.
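One way to picture the STPI and DFoNP contents described above is the following Python sketch. The class and field names are hypothetical, and the choice of a buffer offset (rather than an address or index) is just one of the implementation-specific options the text allows. The `associate` helper pairs each STPI with its DFoNP using the optional originating node identifier and sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stpi:
    destination: int
    origin_node: int     # optional per the text; used here for matching
    sequence: int        # optional per the text; used here for matching
    holder_node: int     # temporary: node currently holding the DFoNP
    buffer_offset: int   # temporary: where the DFoNP sits in that node


@dataclass(frozen=True)
class Dfonp:
    origin_node: int
    sequence: int        # preferably equal to the STPI sequence number
    data: bytes


def match_key(item):
    """Key shared by an STPI and its corresponding DFoNP."""
    return (item.origin_node, item.sequence)


def associate(stpis, dfonps):
    """Pair each STPI with its DFoNP, or None if it has not arrived."""
    by_key = {match_key(d): d for d in dfonps}
    return [(s, by_key.get(match_key(s))) for s in stpis]
```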
The originating node creates an STPI by creating and initializing one or more data structures. Preferably, there is only one data structure containing the STPI.
A switch receives both the frame containing the STPI and the DFoNP before forwarding a frame containing the STPI or the DFoNP to the next switch or node.
Preferably, a switch receives the frame containing the STPI before reading the corresponding DFoNP.
A switch can delay transmitting or reading DFoNP after the corresponding STPI is transmitted or received, allowing the switch to optimize its resource usage and improve efficiency.
A switch can read DFoNPs corresponding to a switch port with minimum outbound traffic ahead of other DFoNPs, thereby improving link efficiency.
The switch modifies temporary information in STPI, such as the node number or port number of the node containing the corresponding DFoNP and the buffer pointer, index or offset of the corresponding DFoNP, when the DFoNP is transmitted to another node.
If the DFoNP and STPI are forwarded to another subnet, the layer 2 information in the STPI and DFoNP should be updated to be compatible with the subnet to which they are forwarded (for example, in an IP network when a packet moves from Ethernet to ATM, the layer 2 protocol information will have to be modified to be made compatible with the ATM network).
If STPI contains a multicast or broadcast destination address, the switch transmits both the DFoNPs and the STPI to all next hop nodes identified by the address.
A switch can delay reading or forwarding the DFoNP after the corresponding STPI is received or forwarded, and vice versa.
A switch may or may not receive or transmit DFoNPs in the same order as the corresponding STPIs are received or transmitted from a switch port.
Optionally, a switch may receive or transmit one or more DFoNP in one frame.
For networks that support layers 5/6/7 (for example, OSI networks), STPI optionally contains part of or all of the layer 5/6/7 information. Preferably, no layer 5/6/7 information is contained in STPI.
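The statement above that a switch modifies the temporary information in an STPI when the DFoNP moves to another node can be sketched as below. This is a minimal illustration with hypothetical names (`Stpi`, `relocate`); only the temporary fields change, while the destination is left untouched.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Stpi:
    destination: int
    holder_node: int     # temporary: node currently holding the DFoNP
    buffer_offset: int   # temporary: DFoNP location inside that node


def relocate(stpi, new_holder, new_offset):
    """Return the STPI updated after its DFoNP was copied to a new node."""
    return replace(stpi, holder_node=new_holder, buffer_offset=new_offset)
```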
FIG. 1A illustrates a network packet containing layer 1, layer 2, layer 3, layer 4 headers, Data and layer 1, layer 2, layer 3 trailers.
FIG. 1B illustrates a network packet with layer 1, layer 2 (datalink), layer 3 (networking) and layer 4 (transport) headers and layer 1 and layer 2 trailers and Data.
FIG. 2A illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2B illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2C illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2D illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2E illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2F illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2G illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2H illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2I illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2J illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2K illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2L illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2M illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 2N illustrates example formats for DFoNP, the corresponding STPI and an STPI frame which contain STPIs.
FIG. 3A illustrates Switch/Node A containing an STPI and the corresponding DFoNP to be transmitted to the Switch/Node B.
FIG. 3B illustrates the Switch/Node A sending an STPI frame containing the STPI.
FIG. 3C illustrates the Switch/Node B deciding to fetch the DFoNP corresponding to the STPI and sending a Read-DFoNP Frame to the Switch/Node A containing the Read-DFoNP request for the DFoNP.
FIG. 3D illustrates the Switch/Node A responding to the Read-DFoNP request for the DFoNP by sending the DFoNP.
FIG. 3E illustrates the STPI being updated with the identifier of the Switch/Node B and the location of the DFoNP in the Switch/Node B.
FIG. 4A illustrates Switch/Node A containing an STPI and the corresponding DFoNP to be transmitted to the Destination Node B.
FIG. 4B illustrates the Switch/Node A transmitting an STPI Frame containing the STPI to the Switch/Node B.
FIG. 4C illustrates the Switch/Node A transmitting the DFoNP to the Switch/Node B.
FIG. 4D illustrates the Switch/Node B updating the STPI with the location of the DFoNP in the Switch/Node B.
FIG. 5A illustrates Switch/Node A containing an STPI and the corresponding DFoNP to be transmitted to the Switch/Node B.
FIG. 5B illustrates Switch/Node A transmitting a frame containing the STPI to the Switch/Node B.
FIG. 5C illustrates the Switch/Node B deciding to fetch the DFoNP corresponding to the STPI and sending a Read-DFoNP Frame to the Switch/Node A containing the Read-DFoNP request for the DFoNP.
FIG. 5D illustrates the Switch/Node A responding to the Read-DFoNP request by transmitting the DFoNP.
FIG. 5E illustrates the STPI being updated with identifier of Switch/Node B and the location of the corresponding DFoNP in the Switch/Node B.
FIG. 6A illustrates Switch/Node A containing an STPI and the corresponding DFoNP to be transmitted to the Switch/Node B.
FIG. 6B illustrates the Switch/Node A responding by sending an STPI frame containing all STPIs to be transmitted to the Switch/Node B.
FIG. 6C illustrates the Switch/Node A transmitting the DFoNP corresponding to the STPI to the Switch/Node B.
FIG. 6D illustrates the STPI being updated with identifier of the Switch/Node B and the location of the corresponding DFoNP in the Switch/Node B.
FIG. 7A illustrates Switch/Node A containing an STPI and the corresponding DFoNP to be transmitted to the Destination End Node B.
FIG. 7B illustrates Switch/Node A transmitting the DFoNP to the Destination End Node B and updating the STPI with the location (DMA address) of the DFoNP in the Destination End Node B.
FIG. 7C illustrates Switch/Node A transmitting the STPI in an STPI frame to the Destination End Node B.
FIG. 7D illustrates that both STPI and DFoNP are received by End Node B.
FIG. 8A illustrates a Read-STPI frame with Frame Type “Read-STPI” and “Number of STPIs” set to 3.
FIG. 8B illustrates a Read-STPI frame in a network where explicit frame type specification is not required.
FIG. 8C illustrates a Read-STPI frame in a network without layer 1 headers or trailers.
FIG. 8D illustrates a Read-STPI frame in a network without layer 1 headers or trailers.
FIG. 9A illustrates a Read-DFoNP frame with Frame Type “Read-DFoNP” and “Number of Read-DFoNP requests” set to 2.
FIG. 9B illustrates a Read-DFoNP frame in a network where explicit frame type specification is not required.
FIG. 9C illustrates a Read-DFoNP frame in a network without layer 1 headers or trailers.
FIG. 9D illustrates a Read-DFoNP frame in a network without layer 1 headers or trailers.
FIG. 10A illustrates a Number-of-STPIs frame with Frame Type “Number-of-STPIs” and “Number of STPIs” set to 3.
FIG. 10B illustrates a Number-of-STPIs frame in a network where explicit frame type specification is not required.
FIG. 10C illustrates a Number-of-STPIs frame in a network without layer 1 headers or trailers.
FIG. 10D illustrates a Number-of-STPIs frame in a network without layer 1 headers or trailers.
FIG. 11A illustrates an example of DFoNP and STPI frames which can be used with Ethernet.
FIG. 11B illustrates a Read-DFoNP frame which can be used with Ethernet.
FIG. 12A illustrates the format of a PCI Express Read Completion containing DFoNP, from a root bridge in response to a Memory Read request from a switch.
FIG. 12B illustrates the format of a PCI Express Read Completion containing STPIs, from a root bridge in response to a Memory Read request from a switch.
FIG. 12C illustrates a PCI Express Memory Write transaction containing DFoNP, from a switch to a root bridge.
FIG. 12D illustrates a PCI Express Memory Write transaction containing STPIs, from a switch to a root bridge.
FIG. 13A illustrates a frame containing both Number-of-STPIs message and Read-DFoNP requests.
FIG. 13B illustrates a frame containing both Read-STPI request and Read-DFoNP requests.
FIG. 14A illustrates Switch A having 3 DFoNPs to be transmitted to Switch B.
FIG. 14B illustrates the switch identifying that STPI [1] and STPI [2] received are for node D and adding STPI [1] and STPI [2] to the queue for the node D.
Network component designers have a very large number of design options with respect to the format of DFoNP, STPI and STPI frame/packet. FIG. 2A, FIG. 2B, FIG. 2C, FIG. 2D, FIG. 2E, FIG. 2F, FIG. 2G, FIG. 2H, FIG. 2I, FIG. 2J, FIG. 2K, FIG. 2L, FIG. 2M and FIG. 2N illustrate some examples of different formats in which the STPI and the corresponding DFoNP can be created adhering to this invention. The layer 2, layer 3, and layer 4 information that may be present in the DFoNP and STPI may or may not be mutually exclusive and depends on the specific format or formats of STPI and DFoNP supported by switches and end nodes. Each network will employ only a few STPI/DFoNP formats (preferably, as few as 1-3), one for each subtype of packet or frame. Preferably, a network employs only one format for STPI and one format for DFoNP to reduce complexity in switches and end nodes. STPI should have enough information for the switch to find the port for the next hop.
Five options for transferring STPI and the corresponding DFoNP from one node to another are described below. One of the first four methods can be used for transferring STPI and the corresponding DFoNP from the originating node or a switch to another switch or end node. The fifth method can be used for transferring STPI and the corresponding DFoNP to a destination end node:
A switch can employ one of the STPI and DFoNP transfer options (strategies) listed above, for each port. Both ports on a point-to-point link must agree to the same frame transmitting option. All ports on a link or bus must follow the same frame transmitting option. Preferably, a network employs only one of the four STPI/DFoNP transfer options listed in FIG. 3A to FIG. 3E, FIG. 4A to FIG. 4D, FIG. 5A to FIG. 5E, FIG. 6A to FIG. 6D. Preferably, a network also employs the STPI/DFoNP transfer option listed in FIG. 7A to FIG. 7D. For the option corresponding to FIG. 7A to FIG. 7D, updating STPI with address (location) of DFoNP in the end node is optional.
If DFoNPs do not contain information (such as originating node identifier, DFoNP identifier, DFoNP address in previous node, etc.) that allows a DFoNP to be mapped to the corresponding STPI, then the DFoNPs must be transmitted in the same order as requested in the Read-DFoNP frame or frames with the design options listed in FIG. 3A to FIG. 3E and FIG. 5A to FIG. 5E. With the design options listed in FIG. 4A to FIG. 4D and FIG. 6A to FIG. 6D, if DFoNPs do not contain information that allows a DFoNP to be mapped to the corresponding STPI, the DFoNPs must be transmitted in the same order as the corresponding STPIs are transmitted. This allows switches to identify the STPI corresponding to a DFoNP that is received.
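The ordering rule above amounts to first-in, first-out matching. The sketch below, with the hypothetical class name `InOrderMatcher`, shows how a receiving switch could pair each arriving DFoNP with the oldest outstanding STPI when the DFoNP itself carries no identifying information; it assumes the sender honours the required order.

```python
from collections import deque


class InOrderMatcher:
    """Pairs DFoNPs with STPIs purely by arrival order."""

    def __init__(self):
        self._pending = deque()

    def expect(self, stpi):
        """Record an STPI whose DFoNP was requested or announced."""
        self._pending.append(stpi)

    def on_dfonp(self, dfonp):
        """Pair an arriving DFoNP with the oldest outstanding STPI."""
        if not self._pending:
            raise LookupError("DFoNP arrived with no outstanding STPI")
        return self._pending.popleft(), dfonp
```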
Network component designers have a very large number of design options with respect to the format of the Read-STPI request and of Read-STPI Frames containing the Read-STPI request. FIG. 8A, FIG. 8B, FIG. 8C and FIG. 8D illustrate some examples of different formats in which the Read-STPI Frames can be created adhering to this invention. Preferably, a given network employs only one format (design option) for the Read-STPI request to keep the design of switches and end nodes simple.
A Read-DFoNP Frame contains one or more Read-DFoNP requests, and each Read-DFoNP request contains the location of the requested DFoNP. Network component designers have a very large number of design options with respect to the format of Read-DFoNP requests and of Read-DFoNP Frames containing Read-DFoNP requests. FIG. 9A, FIG. 9B, FIG. 9C and FIG. 9D illustrate some examples of different formats in which the Read-DFoNP Frame can be created adhering to this invention. Preferably, a given network employs only one format (design option) for the Read-DFoNP request to keep the design of switches and end nodes simple.
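A Read-DFoNP Frame payload of this kind, a count followed by one location per request, might be encoded as in the sketch below. The 16-bit count and 64-bit location widths and the function names are assumptions chosen for illustration; they do not reproduce any of the formats in FIG. 9A to FIG. 9D.

```python
import struct

COUNT_FORMAT = "!H"     # hypothetical: number of Read-DFoNP requests
REQUEST_FORMAT = "!Q"   # hypothetical: 64-bit location of one DFoNP
COUNT_SIZE = struct.calcsize(COUNT_FORMAT)
REQUEST_SIZE = struct.calcsize(REQUEST_FORMAT)


def pack_read_dfonp(locations):
    """Encode a request count followed by one location per request."""
    body = b"".join(struct.pack(REQUEST_FORMAT, loc) for loc in locations)
    return struct.pack(COUNT_FORMAT, len(locations)) + body


def unpack_read_dfonp(payload):
    """Decode the requested DFoNP locations, checking the length."""
    (count,) = struct.unpack_from(COUNT_FORMAT, payload, 0)
    if len(payload) != COUNT_SIZE + count * REQUEST_SIZE:
        raise ValueError("length does not match the request count")
    return [
        struct.unpack_from(REQUEST_FORMAT, payload,
                           COUNT_SIZE + i * REQUEST_SIZE)[0]
        for i in range(count)
    ]
```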
Optionally, a switch or node can send the number of STPIs available for transmission to the next hop node or switch. Network component designers have a very large number of design options with respect to the format of the Number-of-STPIs message and of Number-of-STPIs Frames containing the Number-of-STPIs message. FIG. 10A, FIG. 10B, FIG. 10C and FIG. 10D illustrate some examples of different formats in which the Number-of-STPIs Frame can be created adhering to this invention. Preferably, a given network employs only one format for the Number-of-STPIs message to keep the design of switches and end nodes simple.
The network described in this invention can be connected to an I/O card (in a server or embedded system) or to a PCI bus.
When the destination address contained in an STPI is a Multi-cast or Broadcast address, both STPI and DFoNP are transmitted to all next hop nodes identified by the Multi-cast or Broadcast address.
When STPI or DFoNP frames are corrupted or lost, switches and nodes may employ retransmission of the corrupted or lost frame. The retransmission policy and error recovery are link (for example, PCI) and vendor specific.
Some networks allow more than one type of content to be present in the same frame. The types of contents are STPI, DFoNP, Read-STPI request, Read-DFoNP request and Number-of-STPIs message.
FIG. 14A and FIG. 14B illustrate an example of reading DFoNPs in a different order compared to the order in which STPIs are received. In FIG. 14A, Switch A 14001 has 3 DFoNPs 14004 to be transmitted to Switch B 14002. The Switch A forwards 3 STPIs corresponding to the DFoNPs in an STPI frame 14003 to Switch B. The Switch B has 10 STPIs in its queue 14006 for its link to node D. The Switch B has no STPIs in its queue 14005 for its link to node C. In FIG. 14B, the Switch B identifies that STPI [1] and STPI [2] received are for node D and adds STPI [1] and STPI [2] to the queue 14006 for the node D. The Switch B delays reading DFoNP [1] and DFoNP [2] since there are a large number of STPIs already queued for the node D. The Switch B identifies that STPI [3] received is for the node C and adds STPI [3] to the queue 14005 for the node C. The Switch B sends a Read-DFoNP Frame 14013 to the Switch A containing the address of DFoNP [3].
If STPI contains a priority or QoS field, a switch can use it for controlling the order in which DFoNPs are read. Similarly, a priority or QoS field in STPI or DFoNP could be used by switches or nodes to control the order in which STPIs are transmitted to the next node.
A network corresponding to this invention could be used to connect a server or servers to storage devices (such as disks, disk arrays, JBODs, Storage Tapes, DVD drives etc.). iSCSI and iSER (iSCSI Extensions for RDMA) are examples in which SCSI commands and SCSI data are transmitted using network technologies used for server interconnect.
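The decision made by Switch B in FIG. 14A and FIG. 14B can be mimicked with the short Python sketch below. The class name, the `next_hop_of` lookup and the numeric `busy_threshold` are all assumptions; the text only says that the read is delayed when a large number of STPIs are already queued, without naming a threshold.

```python
from collections import defaultdict, deque


class SwitchB:
    """Queues STPIs per next hop and defers DFoNP reads for busy hops."""

    def __init__(self, busy_threshold):
        self.busy_threshold = busy_threshold   # assumed tuning knob
        self.queues = defaultdict(deque)       # next hop -> queued STPIs
        self.deferred = []                     # STPIs whose DFoNP read waits

    def receive_stpi_frame(self, stpis, next_hop_of):
        """Queue each STPI; return those whose DFoNP should be read now."""
        to_read = []
        for stpi in stpis:
            hop = next_hop_of(stpi)
            congested = len(self.queues[hop]) >= self.busy_threshold
            self.queues[hop].append(stpi)
            if congested:
                self.deferred.append(stpi)
            else:
                to_read.append(stpi)
        return to_read
```

With ten STPIs already queued for node D, none for node C, and an assumed threshold of 8, this sketch returns only STPI [3] for an immediate Read-DFoNP request, matching the figure.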
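Priority-driven read ordering of this kind could be built on a heap, as in the following sketch. The class name `ReadScheduler` and the convention that a lower number means higher priority are assumptions; an arrival counter keeps STPIs of equal priority in the order received.

```python
import heapq
import itertools


class ReadScheduler:
    """Orders DFoNP reads by the priority carried in each STPI."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def add(self, stpi_id, priority):
        # Assumption: a lower number means a higher priority.
        heapq.heappush(self._heap, (priority, next(self._arrival), stpi_id))

    def next_to_read(self):
        """Return the STPI whose DFoNP should be read next."""
        return heapq.heappop(self._heap)[2]
```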
Advantages
A switch can delay receiving DFoNP for paths which are already congested.
A switch can read a DFoNP corresponding to a lightly loaded link ahead of other DFoNPs and transmit the STPI and DFoNP more quickly to the lightly loaded link, improving link efficiency.
A switch can delay reading DFoNPs based on the QoS or priority field in STPI.
A switch can optimize switch resources, memory and frame/packet queues, as congestion is minimized by delaying DFoNPs for ports which are already congested.
The switch can ensure higher throughput on all links by rearranging the order in which DFoNPs are read.
1. A system enabled for congestion reduction in a network, the system comprising:
a plurality of interconnected network nodes;
a plurality of network switches, distributed within the network;
each of the network switches configured for interconnecting network nodes in the network and for forwarding network packets;
the plurality of network nodes comprising at least a plurality of end nodes within the network, one or more of the plurality of end nodes having a capability to create network packets and thereby being creators of network packets;
the plurality of network nodes comprising at least a first network node and at least a second network node;
the at least the first network node comprising a network switch;
the at least a second network node configured as an end node configured as a creator of network packets;
the network switch receiving a plurality of first network packets and second network packets created by the one or more end nodes comprising at least the second network node configured as a creator of network packets;
the first and the second network packets received by the network switch are to be forwarded over the network as data link frames comprising the network packets;
the network switch having a plurality of queues at its network ports;
the plurality of queues of the network switch comprising at least a first queue for storing and forwarding the datalink frames comprising the first network packets;
the network system is configured to implement a method for per queue congestion reduction over the interconnected network nodes; the method comprising:
the network switch receiving a plurality of first network packets to be stored in and forwarded out of the first queue on the at least the network switch;
the at least the network switch experiencing high load at its first queue, the high load being an indicator of congestion;
the at least the second network node being one of the creators of first and second network packets receiving a request in a data link frame;
the second network node responding to the request by sending network packets which are to be forwarded using the plurality of queues of the network switch, other than the first network packets to be stored in the first queue of the at least the network switch, while delaying sending network packets which are to be forwarded using first queue in the network switch;
the method thereby reducing the congestion by reducing the load in the first queue of the at least the network switch.
2. The network system of claim 1, wherein the system is implemented in any of a data-center (DC) or a wide area network (WAN).
3. The system of claim 1, wherein the request received in the datalink frame is a request for reordering of network packets created by the end node by delaying transmission of only the network packets to be stored and forwarded using the first queue of the network switch.
4. The network system of claim 1, wherein the first queue in the network switch is used to forward first network packets from the plurality of end nodes that are creators of network packets.
5. The system of claim 1, wherein the one or more of the plurality of end nodes, having a capability to create network packets and thereby being creators of network packets.
6. A system enabled for congestion reduction in a network, the system comprising:
a plurality of interconnected network nodes;
a network comprising a plurality of network switches, distributed within the network comprising the plurality of interconnected network nodes;
each of the network switches configured for interconnecting network nodes in the network and for forwarding datalink frames;
the data link frames comprising any one of a request, a part of protocol information, a Network Packet containing data or a protocol information with network packet containing data;
the plurality of network switches comprising at least a first network switch at a first node interconnected to a second switch at a second node in the network;
the first switch and the second switch interconnected directly via a first port on the first switch connecting to a second port on the second switch;
each of the network switches having a plurality of queues at its ports for storing and forwarding datalink frames based on at least a priority or a quality of service (QOS) associated with each datalink frame;
the plurality of queues comprising a first queue for storing and forwarding of first datalink frames based on a first priority and at least a second queue for storing and forwarding of second datalink frames, based on priorities other than the first priority, over the network; and
the system is configured to implement a method for per queue congestion reduction over the network:
the method comprising:
the first switch receiving at least a first datalink frame and a plurality of second datalink frames, in a first order, designated to be forwarded to the second switch from the first port of the first switch interconnecting to the second port of the second switch;
the at least the first datalink frame forwarded to the second switch is to be stored and forwarded from the first queue of second switch;
the plurality of second datalink frames forwarded to the second switch are to be stored and forwarded from the second queue of the second switch;
the first switch receiving a third datalink frame as a delay request, from the second switch;
the third datalink frame received indicative of congestion at the first queue of the second switch, wherein the first queue being more loaded than the second queue of the second switch;
the first switch responding to the delay request received as the third datalink frame by delaying, by pausing or stopping for a time, the sending of the at least the first datalink frame to the first queue of the second switch while continuing to send the plurality of second datalink frames to the second queue of the second switch;
the first switch thereby re-arranging and sending the plurality of first datalink frames and the plurality of second datalink frames in a second order different from the first order to the second network switch;
the method thereby enabling a reduction of the congestion by reducing the load on the first queue of the second switch.
7. The system of claim 6, wherein the system is implementable in any one or more of a data-center (DC) or a wide area network (WAN).
8. The system of claim 6 enabled to implement the method for per queue congestion reduction, wherein the delay request is a request to delay the transmission of only the datalink frames having the first priority to be stored in the first queue of the second switch.
9. The system of claim 8 enabled to implement the method for per queue congestion reduction, wherein the delaying of the transmission of the datalink frames to be stored in the first queue of the second switch is by pausing or stopping for a period of time transmission of the datalink frames to be stored in the first queue of the second switch.
10. The system of claim 8 enabled to implement the method for per queue congestion reduction, wherein the delay request received as the third datalink frame is a request to reorder transmission of datalink frames to the second switch by delaying transmission of the datalink frames having a first priority to be stored in the first queue of the second switch while continuing transmission of datalink frames having priorities other than the first priority, to the second switch.
11. A system enabled for congestion elimination at a congestion point (CP) by implementation of a method for per queue elimination in a network implemented in a data center, the system comprising:
a plurality of interconnected nodes;
a plurality of switches enabled for interconnection distributed at the nodes of the network;
the plurality of interconnected nodes comprising at least a first node comprising a first network switch directly connecting via a first port to a second port of a second network switch at a second node in the network;
the second network switch receiving a plurality of datalink frames that comprise at least a first datalink frame and at least a second datalink frame to be sent to the first network switch;
the first network switch comprising a plurality of queues for storing and forwarding datalink frames of differing quality of service (QOS) based on a priority or a QoS field within the datalink frames;
the plurality of queues at the first network switch comprising at least a first queue configured to receive, store and forward first datalink frames received from the second network switch;
the first network switch configured to generate a request as a datalink frame to be sent to the sources of the network packets forming the at least the first datalink frames, when there is high network load indicative of congestion on the first queue of the first network switch;
responsive to the request the sources of the network packets forming the at least the first datalink frames are configured to delay the transmission of the network packets forming the at least the first datalink frames received by the second switch, to be sent to the first queue of the first switch, while continuing to transmit the network packets forming the at least the datalink frames other than the at least the first data link frames to be sent to the first network switch.
12. The system of claim 11, wherein the sources of network packets forming the datalink frames being sent to the second network switch are end nodes in the network configured with a capability to create network packets and thereby being creators of network packets to be converted to datalink frames.
13. The system of claim 12, wherein each of the end nodes configured as the creator of network packets is coupled to at least a processor.
14. The system of claim 11, wherein the request sent as datalink frame to the sources of the at least the creators of network packets forming the first datalink frames is a request to delay transmission of the network packets forming the first datalink frames to be sent to the second switch.
15. The system of claim 12, wherein the sources of the network packets forming the first data link frames are determined from the source identifier within the datalink frames.
16. The system of claim 11, wherein the implemented method for per queue congestion elimination comprises the request generated by the network switch and sent to the sources of the network packets; wherein the request is a delay request to delay the transmission of only the network packets forming the first datalink frames to be stored in the first queue of the second switch.
17. The system of claim 16, wherein the delaying in response to the delay request is by pausing or stopping for a period of time transmission of the network packets forming the first datalink frames to be stored in the first queue of the first switch.
18. The system of claim 17, wherein the request sent in the third data link frame is a request to reorder the transmission of network packets created by the end node by delaying transmission of the network packets to be sent as first data link frames to be stored and forwarded using the first queue of the second network switch.
19. The network system of claim 18, wherein the queues in the first network switch are used to forward data link frames comprising network packets from a plurality of creators of network packets.
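By way of illustration only, the per-queue delay behaviour recited in claims 6 to 10 can be sketched as follows. The class `PerQueueSender`, its method names and the use of abstract time ticks are hypothetical and are not part of the claims; the sketch models only the upstream sender's reaction to a delay request naming one downstream queue.

```python
class PerQueueSender:
    """Hypothetical upstream switch that honours per-queue delay requests."""

    def __init__(self):
        self.pending = []       # (queue_id, frame) tuples in arrival order
        self.paused_until = {}  # downstream queue id -> tick when the pause ends

    def enqueue(self, queue_id, frame):
        """Accept a frame destined for the given downstream queue."""
        self.pending.append((queue_id, frame))

    def on_delay_request(self, queue_id, now, pause_for):
        """Pause sending to one downstream queue for `pause_for` ticks."""
        self.paused_until[queue_id] = now + pause_for

    def next_frame(self, now):
        """Return the oldest frame whose downstream queue is not paused.

        Frames for a paused queue are skipped, not dropped, so frames for
        other queues keep flowing and the send order differs from the
        arrival order. Returns None when nothing is eligible.
        """
        for i, (queue_id, _frame) in enumerate(self.pending):
            if now >= self.paused_until.get(queue_id, float("-inf")):
                return self.pending.pop(i)
        return None


# Example: queue 1 of the downstream switch reports congestion at tick 0.
sender = PerQueueSender()
sender.enqueue(1, "A1")
sender.enqueue(2, "B1")
sender.enqueue(2, "B2")
sender.on_delay_request(queue_id=1, now=0, pause_for=5)
sent = [sender.next_frame(t) for t in (1, 2, 3, 5)]
```

Here the arrival order is A1, B1, B2, while the send order is B1, B2, A1: the frame for the congested first queue is held until the pause expires and the second-queue frames continue to be sent.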