Patent application title:

Model Training Method and Communication Apparatus

Publication number:

US20260121942A1

Publication date:

2026-04-30

Application number:

19/431,549

Filed date:

2025-12-23

Smart Summary: A communication device gets information about a specific layer of an intelligent model, including input data and a set of gradient accumulation factors. It then uses this information to calculate a set of gradients for that layer. A single factor can be used to determine several gradients at once. These gradients are used to update the intelligent model. Overall, the method reduces the amount of data that must be transmitted during joint model training.

Abstract:

A model training method includes: a communication apparatus receives first information, where the first information indicates an input data set of an l-th layer of an intelligent model and a gradient accumulation factor set corresponding to the l-th layer, and l is a positive integer. The communication apparatus determines, based on the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer, a gradient set corresponding to the l-th layer, where one gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in the gradient set, and the gradient set is used to determine the intelligent model.

Inventors:

Jianglei Ma 628 🇨🇦 Ottawa, Canada
Hao Tang 142 🇨🇳 Shanghai, China
Ting Wang 176 🇨🇳 Shanghai, China
Lei DONG 51 🇨🇳 Shanghai, China

Na GAO 13 🇨🇳 Shanghai, China
Dongdong Wei 25 🇨🇳 Shenzhen, China

Assignee:

HUAWEI TECHNOLOGIES CO., LTD. 29,989 🇨🇳 Shenzhen, China

Applicant:

Huawei Technologies Co., Ltd. 🇨🇳 Shenzhen, China

Classification:

H04L41/16 » CPC main

Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence

G06N20/00 » CPC further

Machine learning

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This is a continuation of International Patent Application No. PCT/CN2023/104201 filed on Jun. 29, 2023, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates to the communication field, and more specifically, to a model training method and a communication apparatus.

BACKGROUND

With the continuous development of computing power, algorithms, and data, the three driving forces of artificial intelligence (AI)/machine learning (ML), AI/ML technologies have sparked a new wave of technological revolution in human society. AI/ML technologies implement modeling and learning in complex and unknown environments, and have great application potential in channel prediction, intelligent signal generation and processing, network status tracking and intelligent scheduling, and network deployment optimization. They are expected to promote future communication paradigm evolution and network architecture transformation, and are of great significance and value to the research of future mobile communication technologies.

A training process of an AI model may be jointly completed by multiple devices. For example, in distributed federated learning, multiple terminals may independently perform AI model training by using local sample data, and send model weights or gradients obtained through training to a center server. The center server performs aggregation processing on the received model weights or gradients, and then delivers an aggregation result to the multiple terminals, so that the terminals perform the next round of model training. The process is iterated until the model converges. This avoids the time-consuming data collection required when a single device performs model training independently, enables sharing of device resources (for example, data and processing capability), increases diversity of training data, and improves training performance.
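The aggregation step described above can be illustrated with a minimal sketch. The source does not specify the aggregation rule, so plain element-wise averaging is assumed here, and the function name `aggregate_gradients` is hypothetical.

```python
from typing import List


def aggregate_gradients(client_gradients: List[List[float]]) -> List[float]:
    """Element-wise average of the gradient vectors reported by the terminals.

    Averaging is an assumed aggregation rule; the source only says that the
    center server "performs aggregation processing".
    """
    if not client_gradients:
        raise ValueError("need at least one gradient vector")
    length = len(client_gradients[0])
    if any(len(g) != length for g in client_gradients):
        raise ValueError("all gradient vectors must have the same length")
    count = len(client_gradients)
    # zip(*...) groups the i-th entry of every terminal's vector together.
    return [sum(values) / count for values in zip(*client_gradients)]


# Three terminals, each reporting a two-element gradient vector.
aggregated = aggregate_gradients([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(aggregated)
```

Every terminal uploads a full gradient vector in this baseline, which is the transmission overhead the disclosure aims to reduce.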

However, in a distributed model training process, a model weight/gradient is transferred between devices, causing extremely high data transmission overheads. Especially in the case of application to a wireless communication network, an air interface resource is occupied, which may affect a normal communication service of a terminal.

SUMMARY

Embodiments of this disclosure provide a model training method and a communication apparatus, to reduce transmission overheads and improve resource utilization.

According to a first aspect, a model training method is provided. The method may be performed by a communication device or an apparatus (such as a chip, a chip system, or a processor) configured in (or used in) the communication device, or may be implemented by a logical node, a logical module, or software that can implement a part of or the entire communication device. An example in which a second communication apparatus performs the method is used below for description.

The method includes: a second communication apparatus receives first information, where the first information indicates an input data set of an l-th layer of an intelligent model and a gradient accumulation factor set corresponding to the l-th layer, and l is a positive integer. The second communication apparatus determines, based on the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer, a gradient set corresponding to the l-th layer. One gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in the gradient set, and the gradient set is used to determine the intelligent model.

According to the foregoing solution, when multiple communication apparatuses jointly perform model training, and a gradient set of a neuron layer needs to be transferred between communication apparatuses, an input data set and a corresponding gradient accumulation factor set that are obtained through decomposing the gradient set corresponding to the neuron layer may be transmitted, and a receiving apparatus may determine the gradient set based on the input data set and the gradient accumulation factor set. A combination of each gradient accumulation factor in the gradient accumulation factor set and the input data set may be used to determine multiple gradients in the gradient set. Compared with a manner of directly transmitting the gradient set, transmission overheads can be reduced and resource utilization can be improved for each time of training. A model training process including multiple training iterations can greatly reduce transmission overheads and improve resource utilization.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: the second communication apparatus determines a weight set of the intelligent model based on the gradient set, where the gradient set includes a gradient corresponding to each weight in the weight set; and the second communication apparatus determines an updated intelligent model based on the weight set.
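As an illustration of determining a weight set from a gradient set, the following sketch applies one plain gradient descent step. The update rule, the learning rate, and the name `apply_gradients` are assumptions for illustration; the source only states that the weight set is determined based on the gradient set.

```python
from typing import List

Matrix = List[List[float]]


def apply_gradients(weights: Matrix, gradients: Matrix,
                    learning_rate: float = 0.1) -> Matrix:
    """Return updated weights w - learning_rate * g, one gradient per weight.

    Plain gradient descent is an assumed update rule, not one named in the source.
    """
    return [
        [w - learning_rate * g for w, g in zip(weight_row, gradient_row)]
        for weight_row, gradient_row in zip(weights, gradients)
    ]


# One row of two weights and the matching gradients.
updated = apply_gradients([[1.0, 2.0]], [[0.5, -1.0]], learning_rate=0.5)
print(updated)
```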

According to the foregoing solution, the second communication apparatus may determine the weight set of the intelligent model based on the gradient set, to determine an updated intelligent model obtained after a current training iteration.

With reference to the first aspect, in some implementations of the first aspect, the l-th layer is a fully connected layer, the l-th layer includes N nodes, an (l+1)-th layer of the intelligent model includes M nodes, and N and M are positive integers greater than 1. The input data set of the l-th layer includes N pieces of input data corresponding to the N nodes, and the gradient accumulation factor set corresponding to the l-th layer includes M gradient accumulation factors corresponding to the M nodes, where the gradient set includes T gradients, and T is a product of N and M.

According to the foregoing solution, for a fully connected layer, if a first communication apparatus directly sends the gradient set to the second communication apparatus, M×N parameters (that is, T parameters) need to be transmitted. If the first communication apparatus instead sends the input data set and the gradient accumulation factor set, only M+N parameters need to be transmitted. When M and N are greater than 2, M+N is less than M×N, reducing model parameter transmission overheads. The first communication apparatus may transmit model parameters at each layer of the intelligent model in this manner. Especially when each layer includes a large quantity of nodes and training is performed a large quantity of times, transmission overheads can be reduced to a greater extent.
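The parameter counts in this comparison can be checked with a short sketch. The layer sizes below are arbitrary example values, and the helper name `fully_connected_overhead` is hypothetical.

```python
def fully_connected_overhead(n_nodes: int, m_nodes: int) -> dict:
    """Compare the number of values sent per layer in the two schemes."""
    direct = n_nodes * m_nodes    # sending all T = M x N gradients
    factored = n_nodes + m_nodes  # sending N inputs plus M accumulation factors
    return {"direct": direct, "factored": factored, "saved": direct - factored}


# Example sizes only: N = 512 nodes in the l-th layer, M = 256 in the next layer.
overhead = fully_connected_overhead(n_nodes=512, m_nodes=256)
print(overhead)
```

For N = M = 2 the two counts are equal (4 values each), which is why the saving is stated for M and N greater than 2.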

With reference to the first aspect, in some implementations of the first aspect, that the second communication apparatus determines, based on the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer, the gradient set corresponding to the l-th layer includes: the second communication apparatus determines the T gradients in the gradient set based on input data of each node in the N nodes and a gradient accumulation factor corresponding to each node in the M nodes, where a gradient g_mn in the T gradients is a product of input data of an n-th node in the N nodes and a gradient accumulation factor corresponding to an m-th node in the M nodes, n is a positive integer less than or equal to N, and m is a positive integer less than or equal to M.
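The product rule for g_mn amounts to an outer product of the factor set and the input data set. A minimal sketch follows, with a hypothetical function name and made-up example values.

```python
from typing import List


def reconstruct_gradients(inputs: List[float],
                          factors: List[float]) -> List[List[float]]:
    """Rebuild the M x N gradient set from N inputs and M accumulation factors.

    Entry [m][n] is factors[m] * inputs[n], so each factor is reused for
    N gradients (indices here are 0-based, unlike the 1-based m and n in the text).
    """
    return [[factor * value for value in inputs] for factor in factors]


inputs = [1.0, 2.0, 3.0]  # N = 3 pieces of input data (example values)
factors = [0.5, -1.0]     # M = 2 gradient accumulation factors (example values)
gradients = reconstruct_gradients(inputs, factors)
print(gradients)
```

Here 5 received values yield all 6 gradients; the gap widens quickly as N and M grow.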

With reference to the first aspect, in some implementations of the first aspect, the input data set of the l-th layer includes input data of P channels. The l-th layer includes Q convolution kernels, the Q convolution kernels correspond to Q channels of output data of the l-th layer, and the gradient accumulation factor set corresponding to the l-th layer includes Q gradient accumulation factor subsets corresponding to the Q convolution kernels. The gradient set includes R gradient subsets, the R gradient subsets include P gradient subsets corresponding to each of the Q convolution kernels, the P gradient subsets corresponding to each convolution kernel correspond to the P channels, P and Q are positive integers, and R is a product of Q and P.

According to the foregoing solution, for a convolution layer, the gradient accumulation factor set includes Q gradient accumulation factor subsets, the Q gradient accumulation factor subsets are in one-to-one correspondence with the Q convolution kernels, and the gradients for the P channels of a given convolution kernel may be determined by combining the corresponding gradient accumulation factor subset with the input data set. Instead of transmitting the gradient set, the first communication apparatus transfers the gradient accumulation factor set and the input data set that are obtained through decomposing gradients corresponding to the l-th layer, reducing model parameter transmission overheads. The first communication apparatus may transmit model parameters at each layer of the intelligent model in this manner. Especially when a dimension of input data is small and training is performed a large quantity of times, transmission overheads can be reduced to a greater extent.

With reference to the first aspect, in some implementations of the first aspect, that the second communication apparatus determines the gradient set based on the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer includes: the second communication apparatus determines, by performing a convolution operation, the R gradient subsets in the gradient set based on input data of each of the P channels and a gradient accumulation factor subset corresponding to each of the Q convolution kernels. A gradient subset G_qp in the R gradient subsets is obtained, by performing a convolution operation, based on input data of a p-th channel in the P channels and a gradient accumulation factor corresponding to a q-th convolution kernel in the Q convolution kernels.
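A minimal sketch of how the R = Q×P gradient subsets could be rebuilt is given below. The source says only that a convolution operation is used; stride 1, no padding, two-dimensional data, and the cross-correlation form of the operation are assumptions here, as are the function names.

```python
from typing import List

Grid = List[List[float]]


def cross_correlate_valid(image: Grid, kernel: Grid) -> Grid:
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(image[i + u][j + v] * kernel[u][v]
                for u in range(k_rows) for v in range(k_cols))
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def reconstruct_conv_gradients(input_channels: List[Grid],
                               factor_subsets: List[Grid]) -> List[List[Grid]]:
    """Return G[q][p]: the gradient subset for kernel q and input channel p."""
    return [
        [cross_correlate_valid(channel, factors) for channel in input_channels]
        for factors in factor_subsets
    ]


# P = 1 input channel (3x3) and Q = 1 factor subset (2x2), example values only.
input_channels = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
factor_subsets = [[[1, 0], [0, 1]]]
conv_gradients = reconstruct_conv_gradients(input_channels, factor_subsets)
print(conv_gradients[0][0])
```

Under these assumptions, a 3×3 input and a 2×2 factor subset give a 2×2 gradient subset, matching the size of a 2×2 kernel.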

With reference to the first aspect, in some implementations of the first aspect, the method further includes: the second communication apparatus sends the gradient set, where the gradient set is used to determine the intelligent model.

According to the foregoing solution, after determining the gradient set, the second communication apparatus may send the gradient set to the first communication apparatus, so that the first communication apparatus determines a weight set based on the gradient set, to determine an updated intelligent model, and performs next model training based on the updated intelligent model. Alternatively, the second communication apparatus may send, to the first communication apparatus, a weight set obtained based on the gradient set, so that the first communication apparatus determines an updated intelligent model based on the weight set, and performs next model training.

According to a second aspect, a model training method is provided. The method may be performed by a communication device or an apparatus (such as a chip, a chip system, or a processor) configured in (or used in) the communication device, or may be implemented by a logical node, a logical module, or software that can implement a part of or the entire communication device. An example in which a first communication apparatus performs the method is used below for description.

The method includes: a first communication apparatus performs a model training process of an intelligent model, and determines an input data set of an l-th layer of the intelligent model and a gradient accumulation factor set corresponding to the l-th layer, where l is a positive integer. The first communication apparatus sends first information, where the first information indicates the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer. One gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in a gradient set corresponding to the l-th layer.

With reference to the second aspect, in some implementations of the second aspect, that the first communication apparatus determines the gradient accumulation factor set corresponding to the l-th layer of the intelligent model includes: the first communication apparatus determines, based on a weight set of an (l+1)-th layer of the intelligent model, a gradient accumulation factor set corresponding to the (l+1)-th layer, and an output data set of the l-th layer, the gradient accumulation factor set corresponding to the l-th layer.

With reference to the second aspect, in some implementations of the second aspect, the intelligent model includes L layers, l is less than or equal to L, and that the first communication apparatus determines the gradient accumulation factor set corresponding to the l-th layer of the intelligent model includes: if l is equal to L, the first communication apparatus determines, based on an output data set and a label set of an L-th layer, a gradient accumulation factor set corresponding to the L-th layer, where the label set is used to determine a weight set of the intelligent model.
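One way the factor set of the L-th layer could follow from an output data set and a label set is sketched below. The source does not name a loss function or an output activation; a squared-error loss with a linear output is assumed, under which each factor is simply the output minus the label. The function name is hypothetical.

```python
from typing import List


def output_layer_factors(outputs: List[float], labels: List[float]) -> List[float]:
    """Factor per output node under an assumed loss 0.5 * (output - label) ** 2.

    The derivative of that loss with respect to the output is output - label.
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    return [output - label for output, label in zip(outputs, labels)]


# Example values only.
last_layer_factors = output_layer_factors(outputs=[0.5, 2.0], labels=[1.0, 1.0])
print(last_layer_factors)
```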

With reference to the second aspect, in some implementations of the second aspect, the l-th layer is a fully connected layer, the l-th layer includes N nodes, the (l+1)-th layer of the intelligent model includes M nodes, and N and M are positive integers greater than 1. The input data set of the l-th layer includes N pieces of input data corresponding to the N nodes, and the gradient accumulation factor set corresponding to the l-th layer includes M gradient accumulation factors corresponding to the M nodes, where the gradient set includes T gradients, and T is a product of N and M.

With reference to the second aspect, in some implementations of the second aspect, that the first communication apparatus determines, based on the weight set of the (l+1)-th layer of the intelligent model, the gradient accumulation factor set corresponding to the (l+1)-th layer, and the output data set of the l-th layer, the gradient accumulation factor set corresponding to the l-th layer includes: the first communication apparatus determines, based on the weight set of the (l+1)-th layer of the intelligent model, the gradient accumulation factor set corresponding to the (l+1)-th layer, and the output data set of the l-th layer, the gradient accumulation factor set corresponding to the l-th layer by using a partial derivative operation and/or a multiplication operation.
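The backward recursion described here resembles the standard backpropagation step, which the following sketch illustrates. A sigmoid activation is assumed so that the partial derivative can be written from the layer output a as a × (1 − a); the activation, the indexing convention, and the function name are assumptions and are not taken from the source.

```python
from typing import List


def backpropagate_factors(next_weights: List[List[float]],
                          next_factors: List[float],
                          layer_outputs: List[float]) -> List[float]:
    """Derive one layer's factors from the next layer's weights and factors.

    next_weights[k][j] connects node j of this layer to node k of the next layer.
    Multiplication step: weight each next-layer factor and sum per node j.
    Partial-derivative step: scale by the assumed sigmoid derivative a * (1 - a).
    """
    factors = []
    for j, output in enumerate(layer_outputs):
        weighted_sum = sum(next_weights[k][j] * next_factors[k]
                           for k in range(len(next_factors)))
        factors.append(weighted_sum * output * (1.0 - output))
    return factors


# One next-layer node, two nodes in this layer (example values only).
layer_factors = backpropagate_factors(
    next_weights=[[2.0, 4.0]], next_factors=[1.0], layer_outputs=[0.5, 0.5]
)
print(layer_factors)
```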

With reference to the second aspect, in some implementations of the second aspect, the input data set of the l-th layer includes input data of P channels. The l-th layer includes Q convolution kernels, the Q convolution kernels correspond to Q channels of output data of the l-th layer, and the gradient accumulation factor set corresponding to the l-th layer includes Q gradient accumulation factor subsets corresponding to the Q convolution kernels. The gradient set includes R gradient subsets, the R gradient subsets include P gradient subsets corresponding to each of the Q convolution kernels, the P gradient subsets corresponding to each convolution kernel correspond to the P channels, P and Q are positive integers, and R is a product of Q and P.

According to a third aspect, a communication apparatus is provided. In a design, the apparatus may include a module corresponding to the method/operation/step/action according to the first aspect or any implementation of the first aspect. The module may be implemented by a hardware circuit, software, or a combination of a hardware circuit and software. In a design, the apparatus includes a transceiver unit, configured to receive first information, where the first information indicates an input data set of an l-th layer of an intelligent model and a gradient accumulation factor set corresponding to the l-th layer, and l is a positive integer; and a processing unit, configured to determine, based on the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer, a gradient set corresponding to the l-th layer, where one gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in the gradient set, and the gradient set is used to determine the intelligent model.

With reference to the third aspect, in some implementations of the third aspect, the processing unit is further configured to determine a weight set of the intelligent model based on the gradient set, where the gradient set includes a gradient corresponding to each weight in the weight set. The processing unit is further configured to determine an updated intelligent model based on the weight set.

With reference to the third aspect, in some implementations of the third aspect, the l-th layer is a fully connected layer, the l-th layer includes N nodes, an (l+1)-th layer of the intelligent model includes M nodes, and N and M are positive integers greater than 1. The input data set of the l-th layer includes N pieces of input data corresponding to the N nodes, and the gradient accumulation factor set corresponding to the l-th layer includes M gradient accumulation factors corresponding to the M nodes, where the gradient set includes T gradients, and T is a product of N and M.

With reference to the third aspect, in some implementations of the third aspect, the processing unit is configured to determine the T gradients in the gradient set based on input data of each node in the N nodes and a gradient accumulation factor corresponding to each node in the M nodes, where a gradient g_mn in the T gradients is a product of input data of an n-th node in the N nodes and a gradient accumulation factor corresponding to an m-th node in the M nodes, n is a positive integer less than or equal to N, and m is a positive integer less than or equal to M.

With reference to the third aspect, in some implementations of the third aspect, the input data set of the l-th layer includes input data of P channels. The l-th layer includes Q convolution kernels, the Q convolution kernels correspond to Q channels of output data of the l-th layer, and the gradient accumulation factor set corresponding to the l-th layer includes Q gradient accumulation factor subsets corresponding to the Q convolution kernels. The gradient set includes R gradient subsets, the R gradient subsets include P gradient subsets corresponding to each of the Q convolution kernels, the P gradient subsets corresponding to each convolution kernel correspond to the P channels, P and Q are positive integers, and R is a product of Q and P.

With reference to the third aspect, in some implementations of the third aspect, the processing unit is configured to determine, by performing a convolution operation, the R gradient subsets in the gradient set based on input data of each of the P channels and a gradient accumulation factor subset corresponding to each of the Q convolution kernels, where a gradient subset G_qp in the R gradient subsets is obtained, by performing a convolution operation, based on input data of a p-th channel in the P channels and a gradient accumulation factor corresponding to a q-th convolution kernel in the Q convolution kernels.

With reference to the third aspect, in some implementations of the third aspect, the transceiver unit is further configured to send the gradient set, where the gradient set is used to determine the intelligent model.

According to a fourth aspect, a communication apparatus is provided. In a design, the apparatus may include a module corresponding to the method/operation/step/action according to the second aspect or any implementation of the second aspect. The module may be implemented by a hardware circuit, software, or a combination of a hardware circuit and software. In a design, the apparatus includes a processing unit, configured to perform a model training process of an intelligent model, and determine an input data set of an l-th layer of the intelligent model and a gradient accumulation factor set corresponding to the l-th layer, where l is a positive integer; and a transceiver unit, configured to send first information, where the first information indicates the input data set of the l-th layer and the gradient accumulation factor set corresponding to the l-th layer, and one gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in a gradient set corresponding to the l-th layer.

With reference to the fourth aspect, in some implementations of the fourth aspect, the processing unit is configured to determine, based on a weight set of an (l+1)-th layer of the intelligent model, a gradient accumulation factor set corresponding to the (l+1)-th layer, and an output data set of the l-th layer, the gradient accumulation factor set corresponding to the l-th layer.

With reference to the fourth aspect, in some implementations of the fourth aspect, the intelligent model includes L layers, l is less than or equal to L, and if l is equal to L, the processing unit is configured to determine, based on an output data set and a label set of an L-th layer, a gradient accumulation factor set corresponding to the L-th layer, where the label set is used to determine a weight set of the intelligent model.

With reference to the fourth aspect, in some implementations of the fourth aspect, the l-th layer is a fully connected layer, the l-th layer includes N nodes, an (l+1)-th layer of the intelligent model includes M nodes, and N and M are positive integers greater than 1. The input data set of the l-th layer includes N pieces of input data corresponding to the N nodes, and the gradient accumulation factor set corresponding to the l-th layer includes M gradient accumulation factors corresponding to the M nodes, where the gradient set includes T gradients, and T is a product of N and M.

With reference to the fourth aspect, in some implementations of the fourth aspect, the processing unit is configured to determine, based on the weight set of the (l+1)-th layer of the intelligent model, the gradient accumulation factor set corresponding to the (l+1)-th layer, and the output data set of the l-th layer, the gradient accumulation factor set corresponding to the l-th layer by using a partial derivative operation and/or a multiplication operation.

With reference to the fourth aspect, in some implementations of the fourth aspect, the input data set of the l-th layer includes input data of P channels. The l-th layer includes Q convolution kernels, the Q convolution kernels correspond to Q channels of output data of the l-th layer, and the gradient accumulation factor set corresponding to the l-th layer includes Q gradient accumulation factor subsets corresponding to the Q convolution kernels. The gradient set includes R gradient subsets, the R gradient subsets include P gradient subsets corresponding to each of the Q convolution kernels, the P gradient subsets corresponding to each convolution kernel correspond to the P channels, P and Q are positive integers, and R is a product of Q and P.

According to a fifth aspect, a communication apparatus is provided, including a processor. The processor is coupled to a memory, the memory is configured to store instructions, and the processor is configured to execute the instructions in the memory, to cause the communication apparatus to implement the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect.

Optionally, the communication apparatus further includes a memory, and the processor may be configured to execute instructions in the memory, to implement the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect.

Optionally, the communication apparatus further includes a communication interface, and the processor is coupled to the communication interface. In embodiments of this disclosure, the communication interface may be a transceiver, a pin, a circuit, a bus, a module, or a communication interface of another type. This is not limited.

In an implementation, the communication apparatus is a communication device (for example, a terminal or a network device). When the communication apparatus is a communication device, the communication interface may be a transceiver or an input/output interface.

In another implementation, the communication apparatus is a chip configured in the communication device. When the communication apparatus is a chip configured in the communication device, the communication interface may be an input/output interface.

Optionally, the transceiver may be a transceiver circuit. Optionally, the input/output interface may be an input/output circuit.

According to a sixth aspect, a processor is provided, including an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to: receive a signal by using the input circuit, and transmit a signal by using the output circuit, to cause the processor to perform the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect.

In a specific implementation process, the processor may be one or more chips, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a trigger, any logic circuit, or the like. An input signal received by the input circuit may be received and input by, for example, but not limited to, a receiver, a signal output by the output circuit may be output to, for example, but not limited to, a transmitter and transmitted by the transmitter, and the input circuit and the output circuit may be a same circuit, where the circuit serves as the input circuit and the output circuit at different moments. Specific implementations of the processor and the various circuits are not limited in embodiments of this disclosure.

According to a seventh aspect, a computer program product is provided. The computer program product includes a computer program (which may also be referred to as code or instructions). When the computer program is run, a computer is caused to perform the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect.

According to an eighth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program (which may also be referred to as code or instructions). When the computer program is run on a computer, the computer is caused to perform the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect.

According to a ninth aspect, a communication system is provided, including at least one communication apparatus according to the third aspect and at least one communication apparatus according to the fourth aspect.

According to a tenth aspect, a chip is provided, including at least one processor and a communication interface. The communication interface is configured to receive a signal input to the chip or a signal output from the chip. The processor communicates with the communication interface, and implements the method according to the first aspect or the second aspect and any possible implementation of the first aspect or the second aspect by using a logic circuit or by executing a code instruction.

It may be understood that, for beneficial effects of features in the second aspect to the tenth aspect that correspond to features in the first aspect, refer to the related descriptions in the first aspect. Details are not described again.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1A is a block diagram of a communication system applicable to an embodiment of this disclosure;

FIG. 1B is another block diagram of a communication system applicable to an embodiment of this disclosure;

FIG. 2 is a diagram of fully connected layers of a neural network according to this disclosure;

FIG. 3 is a diagram of a convolution layer of a neural network according to this disclosure;

FIG. 4 is a schematic flowchart of a model training process according to this disclosure;

FIG. 5 is a diagram of joint model training according to this disclosure;

FIG. 6A is a schematic flowchart of a model training method according to this disclosure;

FIG. 6B is a diagram of an intelligent model including a fully connected layer according to this disclosure;

FIG. 7 is another schematic flowchart of a model training method according to this disclosure;

FIG. 8 is a diagram of a structure of a communication apparatus according to this disclosure;

FIG. 9 is another diagram of a structure of a communication apparatus according to this disclosure; and

FIG. 10 is still another diagram of a structure of a communication apparatus according to this disclosure.

DESCRIPTION OF EMBODIMENTS

The following describes technical solutions of this disclosure with reference to accompanying drawings.

In embodiments of this disclosure, “/” may represent an “or” relationship between associated objects. For example, A/B may represent A or B. “And/or” may be used to describe that there are three relationships between associated objects. For example, A and/or B may represent the following three cases: only A exists, both A and B exist, and only B exists, where A and B may be singular or plural. For ease of describing the technical solutions in embodiments of this disclosure, terms such as “first” and “second” may be used for differentiation in embodiments of this disclosure. The terms such as “first” and “second” are not intended to limit a quantity and an execution sequence, and the terms such as “first” and “second” are not intended to limit a definite difference. In embodiments of this disclosure, the term like “example” or “for example” is used to represent an example, evidence, or a description. Any embodiment or design solution described as “example” or “for example” should not be explained as being more preferred or having more advantages than another embodiment or design solution. A word such as “example” or “for example” is used to present a related concept in a specific manner for ease of understanding. In embodiments of this disclosure, “at least one (type)” may alternatively be described as “one (type) or more (types)”, and “multiple (types)” may be two (types), three (types), four (types), or more (types). This is not limited in embodiments of this disclosure.

FIG. 1A is a diagram of an architecture of a communication system 1000 to which this disclosure is applicable. As shown in FIG. 1A, the communication system includes a radio access network (RAN) 100 and a core network (CN) 200. Optionally, the communication system 1000 may further include an internet 300. The RAN 100 may include at least one access network node (for example, 110a and 110b in FIG. 1A), and may further include at least one terminal (for example, 120a to 120j in FIG. 1A). The terminal is connected to the access network node in a wireless communication manner. The access network node is connected to the core network in a wireless communication or wired communication manner. The core network device and the access network node may be different physical devices that are independent, or may be a same physical device that integrates functions of the core network device and functions of the access network node. Alternatively, there may be another possible case: for example, functions of the access network node and a part of functions of the core network device may be integrated into one physical device, and another physical device implements a remaining part of the functions of the core network device. Physical existence forms of the core network device and the access network node are not limited in this disclosure. Terminals may be connected to each other in a wired or wireless manner. Access network nodes may be connected to each other in a wired or wireless manner. FIG. 1A is merely a diagram. The communication system may further include another network device, for example, may further include a wireless relay device and a wireless backhaul device.

The access network node may be an access network device, for example, a base station, a NodeB, an evolved NodeB (eNodeB or eNB), a transmission reception point (TRP), a next generation NodeB (gNB) in a 5th generation (5G) mobile communication system, an access network node in an open radio access network (O-RAN or open RAN), a next generation base station in a 6th generation (6G) mobile communication system, a base station in a future mobile communication system, or an access node in a WI-FI system. Alternatively, the access network node may be a module or unit that implements a part of functions of a base station, for example, may be a central unit (CU), a distributed unit (DU), a central unit-control plane (CU-CP) module, or a central unit-user plane (CU-UP) module. The access network node may be a macro base station (for example, 110a in FIG. 1A), may be a micro base station or an indoor base station (for example, 110b in FIG. 1A), or may be a relay node, a donor node, or the like. A specific technology and a specific device form that are used by the access network node are not limited in this disclosure. The 5G system may also be referred to as a new radio (NR) system.

In this disclosure, an apparatus configured to implement functions of an access network node may be an access network node, or may be an apparatus that can support an access network node in implementing the functions, for example, a chip system, a hardware circuit, a software module, or a combination of a hardware circuit and a software module. The apparatus may be installed in the access network node, or may be used together with the access network node. In this disclosure, the chip system may include a chip, or may include a chip and another discrete component. For ease of description, the following describes the technical solutions provided in this disclosure by using an example in which an apparatus configured to implement functions of an access network node is an access network node, and optionally, the access network node is a base station.

The terminal may alternatively be referred to as a terminal device, user equipment (UE), a mobile station, a mobile terminal, or the like. The terminal may be widely used in various scenarios for communication. For example, the scenario includes but is not limited to at least one of the following scenarios: enhanced mobile broadband (eMBB), ultra-reliable low-latency communication (URLLC), massive machine-type communications (mMTC), device-to-device (D2D) communication, vehicle-to-everything (V2X) communication, machine-type communication (MTC), Internet of things (IoT), virtual reality (VR), augmented reality (AR), industrial control, autonomous driving, telemedicine, smart grid, smart furniture, smart office, smart wearable, smart transportation, smart city, or the like. The terminal may be a mobile phone, a tablet computer, a computer having a wireless transceiver function, a wearable device, a vehicle, an uncrewed aerial vehicle, a helicopter, an airplane, a ship, a robot, a robotic arm, a smart home device, or the like. A specific technology and a specific device form that are used by the terminal are not limited in this disclosure.

In this disclosure, an apparatus configured to implement functions of a terminal may be a terminal, or may be an apparatus that can support a terminal in implementing the functions, for example, a chip system, a hardware circuit, a software module, or a combination of a hardware circuit and a software module. The apparatus may be installed in the terminal or may be used together with the terminal. For ease of description, the following describes the technical solutions provided in this disclosure by using an example in which an apparatus configured to implement functions of a terminal is a terminal, and optionally, the terminal is user equipment (UE).

The base station and/or the terminal may be stationary or mobile. The base station and/or the terminal may be deployed on the land, including an indoor or outdoor device, a handheld device, or a vehicle-mounted device; may be deployed on the water; or may be deployed on an airplane in the air, a balloon, and an artificial satellite. An environment/a scenario in which the base station and the terminal are located is not limited in this disclosure. The base station and the terminal may be deployed in a same environment/scenario or different environments/scenarios. For example, the base station and the terminal are both deployed on the land. Alternatively, the base station is deployed on the land, and the terminal is deployed on the water. Examples are not enumerated.

The roles of the base station and the terminal may be relative. For example, the helicopter or uncrewed aerial vehicle 120i in FIG. 1A may be configured as a mobile base station; and for terminals 120j accessing the RAN 100 through 120i, the terminal 120i is a base station. However, for the base station 110a, 120i may be a terminal, to be specific, 110a and 120i may communicate with each other according to a wireless air interface protocol. Alternatively, 110a and 120i communicate with each other according to an interface protocol between base stations. In this case, for 110a, 120i is also a base station. Therefore, both the base station and the terminal may be collectively referred to as communication apparatuses (or communication devices), 110a and 110b in FIG. 1A may be referred to as communication apparatuses having functions of the base station, and 120a to 120j in FIG. 1A may be referred to as communication apparatuses having functions of the terminal.

Optionally, the protocol layer structure between the access network node and the terminal may include an AI layer, used to transmit data related to an AI function.

In this disclosure, an independent network element (for example, referred to as an AI entity, an AI network element, an AI node, or an AI device) may be introduced to the communication system shown in FIG. 1A, to implement an AI-related operation. The AI network element may be directly connected to a base station, or may be indirectly connected to a base station through a third-party network element. Alternatively, the AI network element may be an independently disposed network element in the communication system, for example, may be a core network element such as an access and mobility management function (AMF) network element or a user plane function (UPF) network element. Alternatively, the AI network element may be built in a network element in the communication system, or an AI entity (or may be referred to as an AI module or another name) is configured in a network element in the communication system to implement an AI-related operation. Optionally, the network element configured with the AI entity may be a base station, a core network device, operation, administration, and maintenance (OAM), or the like. The OAM is configured to operate, administer, and/or maintain a core network device, and/or is configured to operate, administer, and/or maintain an access network node.

Optionally, to match and support the AI, the terminal or the terminal chip may include an AI entity, configured to implement an AI-related function.

Optionally, in this disclosure, the AI entity may also have another name, for example, an AI module or an AI unit, and is mainly configured to implement an AI function (or referred to as an AI-related operation). A specific name of the AI entity is not limited in this disclosure.

In this disclosure, an intelligent model may be referred to as an AI model, and the AI model is a specific method for implementing an AI function. The AI model represents a mapping relationship between an input and an output of the model. The AI model may be a neural network, a linear regression model, a decision tree model, a clustering (singular value decomposition (SVD)) model, or another ML model. The AI model may be referred to as an intelligent model, a model, or another name for short. This is not limited. The AI-related operation may include at least one of the following: data collection, model training, model information release, model test (or referred to as model verification), model reasoning (or referred to as model inference, inference, prediction, or the like), inference result release, or the like.

FIG. 1B is another diagram of an architecture of a communication system to which this disclosure is applicable. As shown in FIG. 1B, network elements in the communication system may be connected through an interface. For example, a core network node and an access network node shown in FIG. 1B may be connected through a next generation (NG) interface, and access network nodes may be connected to each other through an Xn interface. Network elements in the communication system may be connected to each other through an air interface. For example, an access network node may be connected to a terminal through an air interface. One or more AI modules are disposed in one or more of these network element nodes, for example, the core network node, the access network node (RAN node), the terminal, or the OAM (for clarity, FIG. 1B shows only one AI module in each node). The access network node in FIG. 1A may serve as an independent RAN node, or may include multiple RAN nodes, for example, include a CU and a DU. One or more AI modules may also be disposed in the CU and/or the DU. Optionally, the CU may be further split into a CU-CP and a CU-UP. One or more AI models are disposed in the CU-CP and/or the CU-UP.

The AI module is configured to implement a corresponding AI function. AI modules deployed in different network elements may be the same or different. A model of the AI module is configured based on different parameters, and the AI module may implement different functions. The model of the AI module may be configured based on one or more of the following parameters: a structure parameter (for example, at least one of the following: a quantity of layers of a neural network, a width of a neural network, a connection relationship between layers, a weight value of a neuron, an activation function of a neuron, or a bias in an activation function), an input parameter (for example, a type of the input parameter and/or a dimension of the input parameter), or an output parameter (for example, a type of the output parameter and/or a dimension of the output parameter). The bias in the activation function may also be referred to as a bias of a neural network.

One AI module may have one or more models. One output may be obtained through inference by using one model, where the output includes one or more parameters. Learning processes, training processes, or inference processes of different models may be deployed on different nodes or devices, or may be deployed on a same node or device.

The communication apparatus (for example, a first communication apparatus and/or a second communication apparatus) provided in embodiments of this disclosure may be a network device, for example, the access network node, the server, or the base station described above. The communication apparatus (for example, the first communication apparatus and/or the second communication apparatus) provided in embodiments of this disclosure may alternatively be the terminal described above.

The solutions provided in embodiments of this disclosure may be applied to various application scenarios, such as VR, AR, industrial control, industrial wireless sensor network (IWSN), self-driving, telemedicine, smart grid, smart city, video surveillance, smart retail, and/or smart home.

Before the method in this disclosure is described, some knowledge related to AI is first briefly described.

1. AI

The AI enables a machine to have a learning ability and accumulate experience, so that issues such as natural language understanding, image recognition, and/or chess playing that may be addressed by humans through experience can be addressed. For example, a machine can use software and hardware of a computer to simulate some intelligent behavior of a human. To implement AI, an ML method or another method may be used. This is not limited in this disclosure.

2. Training or Learning

The training is a process of processing an AI model (also referred to as a training model). In this processing process, the model is enabled to perform a specific task by optimizing a weighted value (referred to as a weight) in the model.

A training method for the AI model includes but is not limited to supervised learning, unsupervised learning, reinforcement learning, transfer learning, and the like. Unsupervised learning may also be referred to as non-supervised learning.

In terms of supervised learning, based on collected sample values and sample labels, a mapping relationship between the sample values and the sample labels is learned by using an ML algorithm, and the learned mapping relationship is expressed by using an AI model. A process of training an ML model is a process of learning the mapping relationship. In the training process, a sample value is input into the model to obtain a predicted value of the model, and a model parameter is optimized by calculating an error between the predicted value of the model and a sample label (ideal value). After the mapping relationship is learned, a new sample label may be predicted by using the learned mapping. The mapping relationship learned through supervised learning may include linear mapping or non-linear mapping. A learning task may be classified into a classification task and a regression task based on a type of a label.

In terms of unsupervised learning, an internal pattern of a sample is explored autonomously by using an algorithm based on a collected sample value. For a specific type of algorithm of unsupervised learning, a sample is used as a supervisory signal. In other words, a model learns a mapping relationship between samples, which is referred to as self-supervised learning. During training, a model parameter is optimized by calculating an error between a predicted value of the model and the sample. Self-supervised learning may be used for signal compression and decompression restoration. Common algorithms include an autoencoder, a generative adversarial network, and the like.

Reinforcement learning is different from supervised learning, and is an algorithm that learns a problem-solving policy by interacting with an environment. Different from supervised learning and unsupervised learning, reinforcement learning does not have clear “correct” action label data. The algorithm needs to interact with the environment to obtain a reward signal fed back by the environment and adjust a decision action to obtain a larger reward signal value. For example, in downlink power control, a reinforcement learning model adjusts a downlink transmit power of each user based on a total system throughput fed back by a wireless network, with the aim of obtaining a higher system throughput. The goal of reinforcement learning is also to learn a mapping relationship between an environment status and an optimal decision action. However, a label of a “correct action” cannot be obtained in advance. Therefore, a network cannot be optimized by calculating an error between an action and a “correct action”. Reinforcement learning training is implemented through iterative interaction with the environment.

During AI model training, a loss function may be defined. The loss function describes a gap or a difference between an output value of the AI model and an ideal target value. A training process of the AI model is a process of adjusting a weight of the AI model to make a value of the loss function less than a threshold or meet a target requirement.
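The role of the loss function and the threshold described above can be illustrated with a minimal sketch. The scalar model y = w·x, the mean squared error loss, the gradient descent update, the sample values, the learning rate, and the threshold are all assumptions chosen for illustration and are not taken from this disclosure.

```python
# Minimal sketch: adjust a single weight until the loss falls below a threshold.
# Model, loss, data, learning rate, and threshold are illustrative assumptions.

def mse_loss(w, samples):
    """Mean squared error between the model output w * x and the label y."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)

def mse_grad(w, samples):
    """Derivative of the mean squared error with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # labels follow y = 2x
w = 0.0                 # initial weight
learning_rate = 0.05
threshold = 1e-6
max_steps = 10_000      # safety bound so the loop always terminates
steps = 0

# Training stops once the loss is below the threshold.
while mse_loss(w, samples) >= threshold and steps < max_steps:
    w -= learning_rate * mse_grad(w, samples)
    steps += 1

print(w, mse_loss(w, samples), steps)
```

In this sketch the stopping condition corresponds to the value of the loss function being less than a threshold; another target requirement could be substituted in the loop condition.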

3. Inference

The inference refers to performing data processing by using a trained AI model (where the trained model may be referred to as an inference model). Actual data is input into the inference model for processing, to obtain a corresponding inference result. The inference may also be referred to as prediction or decision-making, and the inference result may also be referred to as a prediction result, a decision-making result, or the like.

4. Neural Network (NN)

The neural network is a specific implementation of ML in AI, and the neural network is a network structure that simulates behavior features of an animal neural network for information processing. The structure of the neural network includes a large quantity of nodes (referred to as neurons) that are connected to each other. The neural network is based on a specific operation model, and processes information through learning and training based on input information. According to the universal approximation theorem, the neural network can theoretically approximate any continuous function, so that the neural network has the capability of learning any mapping. Therefore, the neural network can accurately perform abstract modeling for a complex high-dimension problem. That is, the AI model may be implemented by using a neural network.

The neural network usually has a multi-layer structure. Each layer may include one or more neurons, and one layer may be referred to as a neuron layer or a network layer. A depth of the neural network is a quantity of layers included in the neural network, and a quantity of neurons included in each layer may be referred to as a width of the layer. A neural network includes an input layer, a hidden layer, and an output layer. The input layer is responsible for receiving an input signal, the output layer is responsible for outputting a calculation result of the neural network, and the hidden layer is responsible for complex functions such as feature expression. The function of the hidden layer is represented by a weight and a corresponding activation function. FIG. 2 is a diagram of fully connected layers of a neural network. A 1^st layer in the neural network is an input layer, and the input layer includes eight neurons. Starting from a next layer of the input layer, that is, starting from a 2^nd layer, there are three hidden layers: a hidden layer 1, a hidden layer 2, and a hidden layer 3. In FIG. 2, each hidden layer includes six neurons. In general, the quantity of neurons may vary between different hidden layers. This is not limited in this disclosure. A last layer of the neural network is an output layer, which includes three neurons. A fully connected layer means that all neurons between two adjacent layers are connected. Because one neuron at one layer is connected to each neuron at a next layer, a quantity of weights between two adjacent layers is equal to a product of quantities of neurons at the two layers. As shown in FIG. 2, the quantity of weights between the input layer and the hidden layer 1 is 48, the quantity of weights between the hidden layer 1 and the hidden layer 2 is 36, the quantity of weights between the hidden layer 2 and the hidden layer 3 is 36, and the quantity of weights between the hidden layer 3 and the output layer is 18.
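The weight counts given for FIG. 2 can be checked with a short sketch. The layer widths are taken from the description above. The function name is hypothetical, and biases are not counted.

```python
# Layer widths from the FIG. 2 description: input layer, three hidden layers, output layer.
layer_widths = [8, 6, 6, 6, 3]

def fully_connected_weight_counts(widths):
    """Return the quantity of weights between each pair of adjacent fully connected layers."""
    return [n_in * n_out for n_in, n_out in zip(widths, widths[1:])]

weight_counts = fully_connected_weight_counts(layer_widths)
print(weight_counts)       # [48, 36, 36, 18]
print(sum(weight_counts))  # 138
```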

The neural network may further include a convolution layer. The convolution layer extracts an important feature of data in a weight value sharing (or referred to as weight sharing) manner. The convolution layer may implement a partially connected neural network, and the quantity of weights can be reduced compared with the fully connected layer. Input data of one convolution layer may include at least one channel, one convolution layer may include at least one convolution kernel, each convolution kernel corresponds to one channel of output data, and a quantity of channels of the output data of the convolution layer is the same as a quantity of convolution kernels included in the convolution layer. As shown in FIG. 3, input data of a convolution layer includes three channels, for example, a channel 0, a channel 1, and a channel 2 of the input data, and a dimension of data of each channel is 5×5. The convolution layer may include Q convolution kernels, and Q is a positive integer. The Q convolution kernels include a convolution kernel q, and 1≤q≤Q. Each convolution kernel includes three channels, for example, a channel 0, a channel 1, and a channel 2 of the convolution kernel q, that are in one-to-one correspondence with the three channels of the input data. A dimension of each channel is 3×3. After convolution is performed on a channel of the convolution kernel q and a corresponding channel of the input data to obtain a convolution result, convolution results of channels of the convolution kernel q are added to obtain output data of the convolution kernel q. Output data of the convolution layer includes Q channels, the Q channels are in one-to-one correspondence with the Q convolution kernels, and one of the Q channels is the output data of the convolution kernel q.
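The per-channel convolution and summation described for FIG. 3 can be sketched as follows. The sketch assumes a stride of 1, no padding, and the sliding-window product commonly used in neural networks (without kernel flipping). None of these is stated in this disclosure. The value Q = 4, the data values, and the function names are likewise hypothetical.

```python
def conv2d_single(channel, kernel_channel):
    """Slide one kernel channel over one input channel (stride 1, no padding: assumptions)."""
    h, w = len(channel), len(channel[0])
    kh, kw = len(kernel_channel), len(kernel_channel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(channel[i + u][j + v] * kernel_channel[u][v]
                           for u in range(kh) for v in range(kw)))
        out.append(row)
    return out

def conv_layer(inputs, kernels):
    """inputs: list of input channels; kernels: list of Q kernels, each a list of channels."""
    outputs = []
    for kernel in kernels:
        # Convolve each kernel channel with the corresponding input channel.
        per_channel = [conv2d_single(c, k) for c, k in zip(inputs, kernel)]
        rows, cols = len(per_channel[0]), len(per_channel[0][0])
        # Add the per-channel results to obtain the output data of this kernel.
        summed = [[sum(pc[i][j] for pc in per_channel) for j in range(cols)]
                  for i in range(rows)]
        outputs.append(summed)
    return outputs

# Three 5x5 input channels filled with ones (illustrative data).
inputs = [[[1.0] * 5 for _ in range(5)] for _ in range(3)]
Q = 4  # hypothetical quantity of convolution kernels
# Kernel q has three 3x3 channels whose entries all equal q + 1 (illustrative data).
kernels = [[[[float(q + 1)] * 3 for _ in range(3)] for _ in range(3)] for q in range(Q)]

outputs = conv_layer(inputs, kernels)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 4 3 3
print(outputs[0][0][0])  # 27.0
```

The output has Q channels, one per convolution kernel, which matches the one-to-one correspondence described above.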

FIG. 4 is a diagram of a training process of an intelligent model. In the example shown in FIG. 4, the intelligent model is a neural network. The neural network f_θ(⋅) includes four convolution layers and four fully connected layers. An AI entity may obtain training data from a basic data set (for example, a set of channel data). The training data may include a training sample (or referred to as sample data) and a label. A sample x is used as an input and is processed by the neural network f_θ(⋅) to output an inference result. This process is referred to as a forward propagation process of the sample. A loss function calculates an error between the inference result and the label. The AI entity may perform gradient backpropagation by using a backpropagation optimization algorithm (which may be referred to as a model optimization algorithm) based on the error obtained by using the loss function, to optimize a weight θ of the neural network. The neural network is trained by using a large amount of training data, so that training of the neural network is completed after a difference between an output of the neural network and a label is less than a preset value.

It should be noted that the training process shown in FIG. 4 is described by using a training manner of supervised learning as an example. The supervised learning is based on a sample and a label, and model training is implemented by using a loss function. Unsupervised learning may also be used in the training process of the intelligent model, and an algorithm is used to learn an internal pattern of a sample, to complete training of the intelligent model based on the sample. Reinforcement learning may also be used in the training process of the intelligent model, and a reward signal fed back by an environment is obtained through interaction with the environment, to learn a problem-solving policy and optimize the model. A model training method and a model type are not limited in this disclosure.

The trained AI model can execute inference tasks. After actual data is input to the AI model for processing, a corresponding inference result is obtained.

5. Federated Learning

Federated learning is a distributed AI model training method. A training process is jointly performed by multiple apparatuses based on local data, instead of being performed by one apparatus. This can avoid the time-consuming data collection required for centralized AI model training, implement apparatus resource sharing, increase diversity of training data, and improve training performance. In addition, because the apparatus does not need to transmit local data, privacy and security risks for private data can also be reduced.

As shown in FIG. 5, a communication apparatus 0, a communication apparatus 1, a communication apparatus 2, . . . , and a communication apparatus N may jointly perform model training. The communication apparatus 0 may provide another communication apparatus (for example, one or more of the communication apparatuses 1, 2, . . . , and N) with a service for executing an AI task, and/or the communication apparatus 0 may jointly complete an AI task with another communication apparatus. However, this disclosure is not limited thereto.

For example, at least one communication apparatus shown in FIG. 5 may perform federated learning. The communication apparatus 0 may serve as a central apparatus that provides a model parameter (for example, a weight, or a gradient used to determine a weight) for each communication apparatus. The communication apparatus 0, the communication apparatus 1, the communication apparatus 2, . . . , and the communication apparatus N may serve as distributed apparatuses and obtain a model parameter from the central apparatus. After a model is updated based on the model parameter, an updated model is separately trained by using a local data set. For example, the communication apparatus 1 trains the model by using a local data set 1, the communication apparatus 2 trains the model by using a local data set 2, and the communication apparatus N trains the model by using a local data set N. After performing model training, the multiple communication apparatuses send training results of the current training to the central apparatus. For example, the training result includes a model weight, a gradient corresponding to a weight, or the like. The central apparatus obtains the training results from the multiple communication apparatuses, combines the training results of the multiple communication apparatuses, updates the model parameter based on the combined training result, and notifies each communication apparatus. Each communication apparatus performs next model training, and the foregoing process is repeated until a training end condition is met, to complete model training.
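The exchange between the central apparatus and the distributed apparatuses can be sketched as follows. This disclosure does not specify how the central apparatus combines the training results. The sketch assumes a plain average of reported gradients. The scalar model y = w·x, the local data sets, the learning rate, and the fixed quantity of rounds are also illustrative assumptions.

```python
def local_gradient(w, local_data):
    """Gradient of the mean squared error of y = w * x on one apparatus's local data set."""
    return sum(2 * (w * x - y) * x for x, y in local_data) / len(local_data)

# Hypothetical local data sets of communication apparatuses 1 to 3 (labels follow y = 3x).
local_data_sets = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
    [(1.0, 3.0), (3.0, 9.0)],
]

w = 0.0               # model parameter provided by the central apparatus
learning_rate = 0.05
num_rounds = 200      # stand-in for the training end condition

for _ in range(num_rounds):
    # Each distributed apparatus trains on its own local data and reports a gradient.
    reported = [local_gradient(w, data) for data in local_data_sets]
    # The central apparatus combines the training results (plain average: an assumption).
    combined = sum(reported) / len(reported)
    # The central apparatus updates the model parameter and notifies each apparatus.
    w -= learning_rate * combined

print(w)
```

Only gradients are exchanged in this sketch, and the local data sets are never transmitted, which reflects the privacy property described for federated learning.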

The following describes some specific application scenarios of federated learning. For example, multiple image collection devices perform federated learning model training on an image recognition model by using image data collected by the multiple image collection devices respectively. After federated learning is completed, the image recognition model may recognize an item in an image based on an input image. For another example, in the Internet of Vehicles, a camera, a positioning system, and an inertial measurement unit (IMU) of a vehicle separately collect different types of data, and these apparatuses may perform federated learning model training on a traffic condition inference model by using the collected data. After completing the federated learning, the vehicle may input collected environmental data into the traffic condition inference model, to obtain a traffic condition obtained through model inference, to assist the vehicle in driving, and the like.

Applying federated learning to a wireless communication system can significantly improve performance of the communication system. In most scenarios, the network device and the terminal also need to use a matching AI model to improve wireless communication performance. The network device and one or more terminals may complete model training in a federated learning manner. For example, model training may be performed, by using the federated learning, on an encoder model that implements channel state information (CSI) feedback compression, so that a trained model is used for CSI feedback compression.

Applying AI to the field of communication often involves complex nonlinear function fitting. An intelligent model usually has a large scale, for example, a large quantity of model layers and a large quantity of model weights. For example, a convolutional neural network (CNN) of an encoder model that implements CSI feedback compression may include 15 layers of neurons. A residual network (ResNet) widely used in the AI field includes 34 layers of neurons. For example, a quantity of weights/gradients of ResNet-20 reaches 269,722. A visual geometry group (VGG) neural network includes 19 layers of neurons. For example, a quantity of weights/gradients of a VGG-16 model reaches 14,728,266. For a large-scale AI model, a quantity of weights/gradients may reach millions or even tens of millions. When multiple communication apparatuses jointly perform model training (for example, federated learning), a model weight/gradient is transferred between communication apparatuses, causing extremely high data transmission overheads. Especially in the case of application to a wireless communication network, an air interface resource is occupied, which may affect a normal communication service of a terminal. Therefore, to facilitate application of joint model training to a communication network for improved communication performance, a problem of parameter (for example, weight/gradient) transmission overheads further needs to be resolved.

This disclosure proposes that, based on gradient derivation in a model training process, a gradient set corresponding to one layer of an intelligent model may be decomposed into a gradient accumulation factor set and an input data set. One gradient accumulation factor in the gradient accumulation factor set may be combined with the input data set, to determine multiple gradients in the gradient set. When multiple communication apparatuses jointly perform model training, the communication apparatus transfers the gradient accumulation factor set and the input data set, instead of transferring the gradient set, so that transmission overheads can be reduced, and communication resource utilization can be improved.

FIG. 6A is a schematic flowchart of a model training method 600 according to an embodiment of this disclosure. In the method 600, a first communication apparatus and a second communication apparatus may jointly perform intelligent model training.

S601: The first communication apparatus performs a model training process of an intelligent model, and determines an input data set of an l^th layer of the intelligent model and a gradient accumulation factor set corresponding to the l^th layer.

The intelligent model may be referred to as an AI model, an ML model, an AI/ML model, or a neural network. A specific name of the intelligent model is not limited in this disclosure.

The model training process of the intelligent model performed by the first communication apparatus includes forward propagation of a training sample (or referred to as sample data) and backpropagation of a gradient. The forward propagation of the training sample includes: using the training sample as input data, inputting the training sample to an input layer of the intelligent model, and obtaining output data output by an output layer through transfer at each layer of the intelligent model, that is, an output result of current training of the intelligent model. The first communication apparatus calculates an error between the output result and a label by using a loss function, and determines, through gradient backpropagation, a gradient corresponding to each layer.

In an implementation, the l^th layer of the intelligent model is a fully connected layer, the l^th layer includes N nodes, an (l+1)^th layer includes M nodes, and N and M are positive integers greater than 1. An output of forward propagation of the l^th layer may be represented as follows:

$$a^l = \sigma(z^l) = \sigma(W^l a^{l-1} + b^l) \tag{1}$$

A dimension of $a^l$ is M×1, that is, $a^l$ includes M pieces of data corresponding to the M nodes of the (l+1)^th layer, where the M pieces of data are M pieces of output data of the l^th layer, and are M pieces of input data of the (l+1)^th layer. $a^l$ is an output data set of the l^th layer with processing using an activation function σ. $z^l$ is an output data set of the l^th layer without processing using the activation function σ. $a^{l-1}$ is an output data set of an (l−1)^th layer with processing using the activation function σ, that is, $a^{l-1}$ is the input data set of the l^th layer. A dimension of $a^{l-1}$ is N×1, which includes N pieces of data corresponding to the N nodes of the l^th layer. $W^l$ is a weight set between the l^th layer and the (l+1)^th layer, a dimension of $W^l$ is M×N, $b^l$ is a bias of a weighted sum of input data, and a dimension of $b^l$ is M×1.

When the gradient is backpropagated to the l^th layer, the gradient corresponding to the l^th layer may be represented as follows:

$$g^l = \delta^l (\sigma(z^{l-1}))^T = \delta^l (a^{l-1})^T \tag{2}$$

$g^l$ is a gradient set corresponding to the l^th layer, one gradient in the gradient set corresponds to one weight in the weight set of the l^th layer, one gradient may be used to determine a weight corresponding to the gradient, and a dimension of $g^l$ is M×N. $\delta^l$ is the gradient accumulation factor set corresponding to the l^th layer, $\delta^l = \partial L / \partial z^l$ is obtained by calculating a partial derivative of the loss function L with respect to $z^l$, and a dimension of $\delta^l$ is M×1. (x)^T represents a transposition of x. The foregoing Formula (2) may be represented as follows:

$$g^l = \begin{bmatrix} g_{11}^l & g_{12}^l & \dots & g_{1N}^l \\ g_{21}^l & g_{22}^l & \dots & g_{2N}^l \\ \vdots & \vdots & \ddots & \vdots \\ g_{M1}^l & g_{M2}^l & \dots & g_{MN}^l \end{bmatrix} = \begin{bmatrix} \delta_1^l \\ \delta_2^l \\ \vdots \\ \delta_M^l \end{bmatrix} \begin{bmatrix} a_1^{l-1} & a_2^{l-1} & \dots & a_N^{l-1} \end{bmatrix} \tag{2-1}$$

If the first communication apparatus performs model training and transfers a gradient obtained through model training to the second communication apparatus, for the l^th layer, a quantity of gradients to be transmitted by the first communication apparatus is T=M×N. It can be learned from the foregoing Formula (2-1) that a gradient set of each layer may be decomposed into a gradient accumulation factor set and an input data set that correspond to the layer. Each gradient accumulation factor in the gradient accumulation factor set and the input data set may be combined to determine N gradients in the gradient set. A gradient $g_{mn}^l$ is a product of a gradient accumulation factor $\delta_m^l$ corresponding to an m^th node of the (l+1)^th layer and input data $a_n^{l-1}$ corresponding to an n^th node of the l^th layer, where 1≤m≤M, and 1≤n≤N.

If the first communication apparatus transmits the gradient accumulation factor set and the input data set that correspond to the l^th layer instead of transmitting the gradient set, the quantity of parameters to be transmitted by the first communication apparatus is M+N. When M and N are greater than 2, M+N is less than M×N, and model parameter transmission overheads can be reduced. The first communication apparatus may transmit model parameters at each layer of the intelligent model in this manner. Especially when each layer includes a large quantity of nodes and training is performed a large quantity of times, transmission overheads can be reduced to a greater extent.
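The decomposition in Formula (2-1) and the resulting parameter counts can be illustrated with a minimal Python sketch. The function name `outer`, the variable names, and the numeric values are illustrative assumptions and do not come from this disclosure.

```python
# Minimal sketch of Formula (2-1): g^l = delta^l (a^{l-1})^T for a fully connected layer.
# All names and numeric values are illustrative assumptions.

def outer(delta, a_prev):
    """Return the M x N gradient set whose entry (m, n) is delta[m] * a_prev[n]."""
    return [[d * a for a in a_prev] for d in delta]

delta = [0.5, -1.0, 2.0]        # gradient accumulation factor set, M = 3
a_prev = [1.0, 2.0, 3.0, 4.0]   # input data set of the layer, N = 4

grad = outer(delta, a_prev)     # each factor yields N gradients (one row)

sent_as_gradients = len(delta) * len(a_prev)  # T = M x N
sent_as_factors = len(delta) + len(a_prev)    # M + N

print(grad)
print(sent_as_gradients, sent_as_factors)     # 12 7
```

In this toy case, seven values are enough to rebuild all twelve gradients, and the gap widens as M and N grow.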

In another implementation, the l^th layer of the intelligent model is a convolution layer, an input data set of the l^th layer includes input data of P channels, the l^th layer includes Q convolution kernels, and P and Q are positive integers. An output of forward propagation of the l^th layer may be represented as follows:

$$a^l = \sigma(z^l) = \sigma(\mathrm{conv}(W^l, a^{l-1}) + b^l) \tag{3}$$

conv(x, y) represents convolution of x and y. $a^l$ is an output data set of the l^th layer with processing using an activation function σ. $z^l$ is an output data set of the l^th layer without processing using the activation function σ. $a^{l-1}$ is output data of an (l−1)^th layer with processing using the activation function, and is the input data of the l^th layer. $a^{l-1}$ includes input data of P channels. If a dimension of input data of each channel is $h_i^l \times w_i^l$, a dimension of $a^{l-1}$ may be denoted as $P \times h_i^l \times w_i^l$. $W^l$ includes weights of Q convolution kernels, each of the Q convolution kernels includes weights of P channels, and the weights of the P channels are in one-to-one correspondence with the P channels of the input data set. If a dimension of a channel of a convolution kernel is $h_c^l \times w_c^l$, a dimension of $W^l$ may be denoted as $Q \times P \times h_c^l \times w_c^l$.

Convolution is performed on weights of P channels of each of the Q convolution kernels and input data of corresponding P channels, and summation is performed, to obtain output data of one channel corresponding to each convolution kernel, and output data of Q channels corresponding to the Q convolution kernels. Therefore, the output data set of the l^th layer includes the output data of the Q channels. If a dimension of output data of each channel is $h_o^l \times w_o^l$, a dimension of $a^l$ may be denoted as $Q \times h_o^l \times w_o^l$.

For a processing process of a specific convolution layer, refer to the foregoing description of the example shown in FIG. 3. Details are not described herein again.

A dimension $h_o^l \times w_o^l$ of output data of a channel is related to a dimension $h_i^l \times w_i^l$ of input data, amounts $h_p^l$ and $w_p^l$ of data to be padded during convolution, a dimension $h_c^l \times w_c^l$ of a channel of a convolution kernel, and a convolution step s. The amounts $h_p^l$ and $w_p^l$ of data to be padded may be controlled, so that a dimension of input data of a channel of a convolution layer is the same as a dimension of output data of a channel. For example, the parameters meet the following Formula (4) and Formula (5):

$$h_o^l = \left\lfloor \frac{h_i^l + 2 \times h_p^l - h_c^l}{s} \right\rfloor + 1 = h_i^l \tag{4}$$

$$w_o^l = \left\lfloor \frac{w_i^l + 2 \times w_p^l - w_c^l}{s} \right\rfloor + 1 = w_i^l \tag{5}$$

⌊x⌋ represents rounding down x.
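Formula (4) and Formula (5) can be checked numerically with a short Python sketch. The helper name `conv_output_size` and the chosen sizes (a 32×32 channel, a 3×3 kernel, and a step of 1) are illustrative assumptions.

```python
# Sketch of Formula (4)/(5): size of one output dimension of a convolution.
# The helper name and the example sizes are illustrative assumptions.

def conv_output_size(size_in, pad, kernel, stride):
    """Return floor((size_in + 2 * pad - kernel) / stride) + 1."""
    return (size_in + 2 * pad - kernel) // stride + 1

# With a 3x3 kernel and a step of 1, padding 1 keeps a 32x32 channel at 32x32.
h_o = conv_output_size(32, pad=1, kernel=3, stride=1)
w_o = conv_output_size(32, pad=1, kernel=3, stride=1)

# Without padding, the same kernel shrinks the channel.
h_no_pad = conv_output_size(32, pad=0, kernel=3, stride=1)

print(h_o, w_o, h_no_pad)  # 32 32 30
```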

When the gradient is backpropagated to the l^th layer, the gradient corresponding to the l^th layer may be represented as follows:

$$g^l = \mathrm{conv}(\delta^l, (\sigma(z^{l-1}))^T) = \mathrm{conv}(\delta^l, (a^{l-1})^T) \tag{6}$$

$g^l$ is a gradient set corresponding to the l^th layer, one gradient in the gradient set corresponds to one weight in the weight set of the l^th layer, a dimension of $g^l$ is $Q \times P \times h_c^l \times w_c^l$, $\delta^l$ is the gradient accumulation factor set corresponding to the l^th layer, $\delta^l = \partial L / \partial z^l$ is obtained by calculating a partial derivative of the loss function L with respect to $z^l$, and a dimension of $\delta^l$ is $Q \times h_o^l \times w_o^l$.

The gradient accumulation factor set corresponding to the l^th layer includes Q gradient accumulation factor subsets corresponding to the Q convolution kernels. Each gradient accumulation factor subset may be combined with the input data set to determine a gradient corresponding to P channels of a convolution kernel corresponding to the gradient accumulation factor subset. For example, based on a gradient accumulation factor subset $\delta_q^l$ that corresponds to a convolution kernel q and whose dimension is $h_o^l \times w_o^l$, and the input data set $a^{l-1}$, a gradient $g_q^l$ corresponding to the P channels of the convolution kernel q may be determined, where a dimension of $g_q^l$ is $P \times h_c^l \times w_c^l$.

Therefore, after performing model training, instead of transmitting the gradient set $g^l$, the first communication apparatus may transfer, to the second communication apparatus, the gradient accumulation factor set and the input data set that are obtained through decomposing gradients corresponding to the l^th layer, reducing model parameter transmission overheads. The first communication apparatus may transmit model parameters at each layer of the intelligent model in this manner. Especially when a dimension of input data is small and training is performed a large quantity of times, transmission overheads can be reduced to a greater extent.
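The dependence of the convolution-layer saving on the input dimension can be sketched by counting parameters from the dimensions given above. The function names and the layer sizes (P=64, Q=128, a 3×3 kernel) are illustrative assumptions, and the output dimension is assumed to equal the input dimension as in Formula (4) and Formula (5).

```python
# Sketch: parameters to transmit for one convolution layer.
# Function names and layer sizes are illustrative assumptions.

def conv_gradient_count(q, p, h_c, w_c):
    """Size of the gradient set g^l: Q x P x h_c x w_c."""
    return q * p * h_c * w_c

def conv_decomposed_count(q, p, h_o, w_o, h_i, w_i):
    """Size of delta^l (Q x h_o x w_o) plus size of a^{l-1} (P x h_i x w_i)."""
    return q * h_o * w_o + p * h_i * w_i

full = conv_gradient_count(q=128, p=64, h_c=3, w_c=3)

# Small 4x4 channels: the decomposed form is much smaller.
small_input = conv_decomposed_count(128, 64, 4, 4, 4, 4)

# Larger 32x32 channels: the decomposed form exceeds the gradient set.
large_input = conv_decomposed_count(128, 64, 32, 32, 32, 32)

print(full, small_input, large_input)  # 73728 3072 196608
```

Under these assumed sizes, the decomposition reduces overheads only for the small input dimension.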

S602: The first communication apparatus sends first information to the second communication apparatus, where the first information indicates the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer.

Correspondingly, the second communication apparatus receives the first information from the first communication apparatus, and determines, based on the first information, the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, so that the second communication apparatus may determine, in S603 based on the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, the gradient set that corresponds to the l^th layer and that is obtained after the first communication apparatus performs current training on the intelligent model.

The first information may indicate an input data set of each layer of the intelligent model and a gradient accumulation factor set corresponding to each layer, so that the second communication apparatus may determine, in S603 based on the input data set of each layer and the gradient accumulation factor set corresponding to each layer, a gradient set corresponding to each layer obtained after the first communication apparatus performs current training on the intelligent model.

S603: The second communication apparatus determines, based on the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, the gradient set corresponding to the l^th layer.

For example, the second communication apparatus may obtain, according to the foregoing Formula (2) and based on the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, the gradient set that corresponds to the l^th layer and that is obtained after the first communication apparatus performs the current training on the intelligent model.

Optionally, the second communication apparatus may determine the weight set of the l^th layer based on the gradient set corresponding to the l^th layer. The second communication apparatus may determine an updated intelligent model based on the weight set.

The second communication apparatus may obtain, from the first information, the input data set corresponding to each layer of the intelligent model and the gradient accumulation factor set corresponding to each layer, and the second communication apparatus may determine, based on the first information, the gradient set corresponding to each layer. Optionally, the second communication apparatus may determine the weight set of each layer based on the gradient set corresponding to each layer, to determine the updated intelligent model based on the weight set of each layer.
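The receiver-side processing in S603 can be sketched as follows for fully connected layers. The representation of the first information as a list of (gradient accumulation factor set, input data set) pairs, the function names, and the use of a plain gradient descent step with a learning rate are assumptions made for illustration; this disclosure does not specify how the weight set is derived from the gradient set.

```python
# Sketch of S603 for fully connected layers: rebuild per-layer gradient sets
# from the first information, then (optionally) update the weight sets.
# The data layout, names, and gradient descent rule are assumptions.

def rebuild_gradients(first_info):
    """first_info: list of (delta, a_prev) pairs, one pair per layer."""
    return [[[d * a for a in a_prev] for d in delta]
            for delta, a_prev in first_info]

def update_weights(weights, gradients, lr):
    """Apply w <- w - lr * g element-wise to every layer (assumed update rule)."""
    return [[[w - lr * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(w_layer, g_layer)]
            for w_layer, g_layer in zip(weights, gradients)]

# Two layers: a 3x2 weight set followed by a 1x3 weight set.
first_info = [
    ([1.0, 0.0, -1.0], [2.0, 4.0]),
    ([0.5], [1.0, 2.0, 3.0]),
]
weights = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0, 1.0]],
]

gradients = rebuild_gradients(first_info)
new_weights = update_weights(weights, gradients, lr=0.5)
print(gradients)
print(new_weights)
```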

According to the foregoing solution of this disclosure, when multiple communication apparatuses jointly perform model training, and a gradient set or a weight set of a neuron layer needs to be transferred between communication apparatuses, an input data set and a corresponding gradient accumulation factor set that are obtained through decomposing the gradient set corresponding to the neuron layer may be transmitted, and a receiving apparatus may determine the gradient set or the weight set based on the input data set and the gradient accumulation factor set. Transmission overheads can be reduced and resource utilization can be improved by using this manner of model parameter transmission.

As shown in Table 1, an example in which the intelligent model is a neural network including a fully connected layer is used, and quantities of neurons included at five layers of a neural network 1 are 2, 25, 50, 25, and 2. After the first communication apparatus performs model training once, if a gradient set corresponding to each layer obtained through training is to be transmitted, a quantity of gradients (that is, a quantity of parameters) to be transmitted is 2600 (that is, 2×25+25×50+50×25+25×2). If the gradients are decomposed and a gradient accumulation factor set and an input data set that correspond to each layer are transmitted, a quantity of model parameters (including gradient accumulation factors and input data) to be transmitted is 204 (that is, 2+25+25+50+50+25+25+2). It can be learned that transmission overheads after training once are reduced to less than one tenth of the original transmission overheads. For a neural network that includes a large quantity of neurons at each layer, quantities of neurons included at five layers of a neural network 2 shown in Table 1 are 16, 512, 1024, 1024, and 16. After the first communication apparatus performs model training once, if a gradient set corresponding to each layer obtained through training is to be transmitted, a quantity of gradients (that is, a quantity of parameters) to be transmitted is 1597440. If the gradients are decomposed and a gradient accumulation factor set and an input data set that correspond to each layer are transmitted, a quantity of model parameters (including gradient accumulation factors and input data) to be transmitted is 5184. It can be learned that according to the solution provided in this disclosure, transmission overheads can be greatly reduced after model training once, and transmission overheads can be reduced to a greater extent and resource utilization can be improved after model training multiple times.

TABLE 1

Intelligent model	Layer 0 (quantity of neurons)	Layer 1 (quantity of neurons)	Layer 2 (quantity of neurons)	Layer 3 (quantity of neurons)	Layer 4 (quantity of neurons)	Overheads of gradient set transmission (quantity of parameters)	Overheads of gradient accumulation factor set and input data set transmission (quantity of parameters)
Neural network 1	2	25	50	25	2	2600	204
Neural network 2	16	512	1024	1024	16	1597440	5184
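The counting rule spelled out for neural network 1 can be reproduced with a short Python sketch. The function names are illustrative assumptions. Under this per-layer rule, neural network 2 evaluates to 5152 decomposed parameters rather than the 5184 listed in Table 1; the listed value may follow a counting convention that is not spelled out in the text.

```python
# Sketch: parameter counts for a chain of fully connected layers.
# Function names are illustrative assumptions.

def gradient_count(neurons):
    """Sum of (inputs x outputs) over adjacent layer pairs."""
    return sum(n_in * n_out for n_in, n_out in zip(neurons, neurons[1:]))

def decomposed_count(neurons):
    """Sum of (inputs + outputs) over adjacent layer pairs."""
    return sum(n_in + n_out for n_in, n_out in zip(neurons, neurons[1:]))

network_1 = [2, 25, 50, 25, 2]
network_2 = [16, 512, 1024, 1024, 16]

print(gradient_count(network_1), decomposed_count(network_1))  # 2600 204
print(gradient_count(network_2), decomposed_count(network_2))  # 1597440 5152
```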

The foregoing describes that the first communication apparatus may indicate, by using the first information, the input data set of the l^th layer of the intelligent model and the gradient accumulation factor set corresponding to the l^th layer, so that the second communication apparatus can determine, based on the first information, the gradient set or the weight set corresponding to the l^th layer of the intelligent model. In an implementation, the first communication apparatus may indicate, by using the first information, an input data set of each layer of the intelligent model and a gradient accumulation factor set corresponding to each layer, so that the second communication apparatus can determine, based on the first information, a gradient set corresponding to each layer of the intelligent model.

This disclosure further provides another implementation. The first information may indicate an input data set of each layer of the intelligent model and a gradient accumulation factor component set corresponding to each layer. A gradient accumulation factor component in the gradient accumulation factor component set is a component of a gradient accumulation factor in the gradient accumulation factor set, and the gradient accumulation factor component set and the gradient accumulation factor set have a same dimension.

If the l^th layer of the intelligent model is a fully connected layer, the first communication apparatus may determine, based on a gradient accumulation factor set and a weight set that correspond to the (l+1)^th layer, and an input data set of the (l+1)^th layer (that is, the output data set of the l^th layer) by using a partial derivative operation and/or a multiplication operation, the gradient accumulation factor set corresponding to the l^th layer, which, for example, may be represented as follows:

$$\delta^l = (W^{l+1})^T \delta^{l+1} \odot \sigma'(z^l) \tag{7}$$

$\sigma'(z^l)$ represents a derivative of $\sigma(z^l)$, that is, a derivative of the output data set of the l^th layer, and ⊙ represents a Hadamard product. $(W^{l+1})^T \delta^{l+1}$ is the gradient accumulation factor component set of the gradient accumulation factor set $\delta^l$, and a dimension of $(W^{l+1})^T \delta^{l+1}$ is M×1.

For example, the intelligent model is shown in FIG. 6B. An activation function

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

is used as an example, and a derivative σ′(x) of the activation function may be represented as follows:

$$\sigma'(x) = \frac{d}{dx}\sigma(x) = \frac{d}{dx}\left(\frac{1}{1 + e^{-x}}\right) = \sigma(x)[1 - \sigma(x)]$$

The gradient set of the l^th layer may be represented as:

$$g^l = \delta^l (a^{l-1})^T = \begin{bmatrix} \delta_1^l \\ \delta_2^l \\ \vdots \\ \delta_M^l \end{bmatrix} \begin{bmatrix} a_1^{l-1} & a_2^{l-1} & \dots & a_N^{l-1} \end{bmatrix}$$

In the foregoing Formula (7), a dimension of $W^{l+1}$ is H×M, H is a quantity of nodes included in an (l+2)^th layer, a dimension of $\delta^{l+1}$ is H×1, and the following may be obtained:

$$\begin{bmatrix} \delta_1^l \\ \delta_2^l \\ \vdots \\ \delta_M^l \end{bmatrix} = \begin{bmatrix} w_{11}^{l+1} & w_{12}^{l+1} & \dots & w_{1M}^{l+1} \\ w_{21}^{l+1} & w_{22}^{l+1} & \dots & w_{2M}^{l+1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{H1}^{l+1} & w_{H2}^{l+1} & \dots & w_{HM}^{l+1} \end{bmatrix}^T \begin{bmatrix} \delta_1^{l+1} \\ \delta_2^{l+1} \\ \vdots \\ \delta_H^{l+1} \end{bmatrix} \odot \sigma'(z^l)$$

$\sigma(z^l) = a^l$, and

$$\sigma'(z_m^l) = \sigma(z_m^l)[1 - \sigma(z_m^l)] = a_m^l [1 - a_m^l].$$

Therefore, the following is provided:

$$\delta_m^l = a_m^l [1 - a_m^l] \sum_{h=1}^{H} w_{hm}^{l+1} \delta_h^{l+1}$$

m is an integer greater than or equal to 1 and less than or equal to M.
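The element-wise expression above can be sketched in Python for a sigmoid activation. The function names `factor_component` and `hidden_delta` and the numeric values are illustrative assumptions.

```python
# Sketch of Formula (7) with a sigmoid activation:
# delta_m^l = a_m^l (1 - a_m^l) * sum_h w_hm^{l+1} delta_h^{l+1}.
# Names and numeric values are illustrative assumptions.

def factor_component(w_next, delta_next):
    """Return (W^{l+1})^T delta^{l+1}, one component per node m of layer l."""
    m_count = len(w_next[0])
    return [sum(w_next[h][m] * delta_next[h] for h in range(len(delta_next)))
            for m in range(m_count)]

def hidden_delta(w_next, delta_next, a):
    """Multiply each component by the sigmoid derivative a_m (1 - a_m)."""
    component = factor_component(w_next, delta_next)
    return [a_m * (1.0 - a_m) * c_m for a_m, c_m in zip(a, component)]

w_next = [[1.0, 2.0],      # H x M weight set of the next layer (H = 2, M = 2)
          [3.0, 4.0]]
delta_next = [1.0, -1.0]   # gradient accumulation factor set of the next layer
a = [0.5, 0.25]            # output data set of the current layer

component = factor_component(w_next, delta_next)
delta = hidden_delta(w_next, delta_next, a)
print(component)  # [-2.0, -2.0]
print(delta)      # [-0.5, -0.375]
```

The component set and the resulting factor set both have M entries, which is why indicating either one costs the same number of parameters.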

If l is equal to L, that is, if the l^th layer is the output layer of the intelligent model, the first communication apparatus determines, based on an output data set and a label set of an L^th layer, a gradient accumulation factor set corresponding to the L^th layer, where the label set is used to determine a weight set of the intelligent model.

In the example shown in FIG. 6B, a gradient set of the L^th layer may be represented as:

$$g^L = \delta^L (a^{L-1})^T = \begin{bmatrix} \delta_1^L \\ \delta_2^L \\ \vdots \\ \delta_K^L \end{bmatrix} \begin{bmatrix} a_1^{L-1} & a_2^{L-1} & \dots & a_J^{L-1} \end{bmatrix}$$

J is a quantity of nodes included in an (L−1)^th layer.

$$\delta^L = (a^L - t) \odot \sigma'(z^L)$$

t is the label set, and has a dimension K×1. In this case,

$$\sigma'(z_k^L) = \sigma(z_k^L)[1 - \sigma(z_k^L)] = a_k^L [1 - a_k^L].$$

Therefore, the following is provided:

$$\delta_k^L = a_k^L [1 - a_k^L] (a_k^L - t_k)$$

k is an integer greater than or equal to 1 and less than or equal to K.
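The output-layer expression can be sketched in the same way. The function name `output_delta` and the numeric values are illustrative assumptions; the form (a − t) ⊙ σ′(z) is what a squared-error loss with a sigmoid output would produce, which is assumed here because the loss function is not named.

```python
# Sketch of the output-layer factor: delta_k^L = a_k^L (1 - a_k^L) (a_k^L - t_k).
# The function name, the values, and the squared-error loss are assumptions.

def output_delta(a_out, labels):
    """Return the gradient accumulation factor set of the output layer."""
    return [a_k * (1.0 - a_k) * (a_k - t_k) for a_k, t_k in zip(a_out, labels)]

a_out = [0.5, 0.75]    # output data set of the L-th layer, K = 2
labels = [1.0, 0.0]    # label set t

delta_out = output_delta(a_out, labels)
print(delta_out)  # [-0.125, 0.140625]
```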

If the l^th layer of the intelligent model is a convolution layer, the first communication apparatus may determine, based on a gradient accumulation factor set and a weight set that correspond to the (l+1)^th layer, and an input data set of the (l+1)^th layer (that is, the output data set of the l^th layer) by using a partial derivative operation and/or a convolution operation, the gradient accumulation factor set corresponding to the l^th layer, which, for example, may be represented as follows:

$$\delta^l = \mathrm{conv}(\mathrm{rot180}(W^{l+1})^T, \delta^{l+1}) \odot \sigma'(z^l) \tag{8}$$

$\mathrm{conv}(\mathrm{rot180}(W^{l+1})^T, \delta^{l+1})$ is a convolution of the gradient accumulation factor set $\delta^{l+1}$ and the weight $W^{l+1}$. $\mathrm{rot180}(W^{l+1})^T$ represents rotating the weight $W^{l+1}$ by 180 degrees. Rotating by 180 degrees means that rows of a matrix are sorted in a reverse order, and elements in each row are also sorted in a reverse order. For example, for the matrix $W^{l+1}$ whose dimension is H×M:

$$W^{l+1} = \begin{bmatrix} w_{11}^{l+1} & w_{12}^{l+1} & \dots & w_{1M}^{l+1} \\ w_{21}^{l+1} & w_{22}^{l+1} & \dots & w_{2M}^{l+1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{H1}^{l+1} & w_{H2}^{l+1} & \dots & w_{HM}^{l+1} \end{bmatrix}$$

$\mathrm{rot180}(W^{l+1})$ obtained by rotating $W^{l+1}$ by 180 degrees is:

$$\mathrm{rot180}(W^{l+1}) = \begin{bmatrix} w_{HM}^{l+1} & w_{H(M-1)}^{l+1} & \dots & w_{H1}^{l+1} \\ w_{(H-1)M}^{l+1} & w_{(H-1)(M-1)}^{l+1} & \dots & w_{(H-1)1}^{l+1} \\ \vdots & \vdots & \ddots & \vdots \\ w_{1M}^{l+1} & w_{1(M-1)}^{l+1} & \dots & w_{11}^{l+1} \end{bmatrix}$$
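The 180-degree rotation described above can be sketched in a few lines of Python. The function name `rot180` mirrors the notation of Formula (8); the example matrix is an illustrative assumption.

```python
# Sketch of rot180: reverse the row order, then reverse each row.
# The example matrix is an illustrative assumption.

def rot180(matrix):
    """Return the matrix rotated by 180 degrees."""
    return [row[::-1] for row in matrix[::-1]]

w = [[1, 2, 3],
     [4, 5, 6]]   # H = 2 rows, M = 3 columns

rotated = rot180(w)
print(rotated)  # [[6, 5, 4], [3, 2, 1]]
```

Applying the rotation twice returns the original matrix.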

It can be learned from the foregoing Formula (7) and Formula (8) that the gradient accumulation factor set corresponding to the l^th layer may be obtained based on the gradient accumulation factor component set and the input data set of the (l+1)^th layer. Therefore, the first communication apparatus may send the first information to the second communication apparatus. The first information indicates the gradient accumulation factor component set corresponding to each layer of the intelligent model and the input data set of each layer that are obtained by performing model training by the first communication apparatus. After receiving the first information, the second communication apparatus may determine, according to the foregoing Formula (7) or Formula (8), the gradient accumulation factor set corresponding to each layer based on the gradient accumulation factor component set of each layer and the output data set of each layer (that is, an input data set of a next layer), and then determine the gradient set corresponding to each layer according to Formula (2) or Formula (6). Optionally, the weight set corresponding to each layer may be further determined based on the gradient set corresponding to each layer, to determine the updated intelligent model.

In this implementation, because the gradient accumulation factor component set and the gradient accumulation factor set have a same dimension, overheads of indication using the first information are the same as overheads of indicating, by using the first information, the input data set of each layer and a corresponding gradient accumulation factor set. Compared with feeding back a gradient set, transmission overheads can be reduced, and resource utilization can be improved.

FIG. 7 is a schematic flowchart of a communication method 700 according to an embodiment of this disclosure. In the method 700, a communication apparatus 0 to a communication apparatus 2 may correspond to the communication apparatus 0 to the communication apparatus 2 shown in FIG. 5. The communication apparatus 0 to the communication apparatus 2 perform model training in a federated learning manner. The communication apparatus 0 serves as a central apparatus and provides model parameters for the communication apparatus 1 and the communication apparatus 2 that serve as distributed apparatuses. The communication apparatus 1 and the communication apparatus 2 separately perform model training based on local data and feed back training results to the communication apparatus 0 by using the method shown in FIG. 6A. After combining the training results of the communication apparatus 1 and the communication apparatus 2, the communication apparatus 0 provides updated model parameters for the communication apparatus 1 and the communication apparatus 2. The communication apparatus 1 and the communication apparatus 2 determine an updated model based on an updated model parameter set, and perform next model training. The method may further include another communication apparatus. In the method 700, three communication apparatuses are used as an example for description. The method includes but is not limited to the following steps.

S701: The communication apparatus 1 performs an n^th model training process of an intelligent model, and determines an input data set 1 and a gradient accumulation factor set 1.

The communication apparatus 1 may receive, from the communication apparatus 0, indication information indicating a model parameter set corresponding to an (n−1)^th time of model training, and determine an updated intelligent model based on the model parameter set corresponding to the (n−1)^th time of model training. For example, the model parameter set may be obtained by combining, by the communication apparatus 0, results of the (n−1)^th time of model training performed by the communication apparatus 1 and the communication apparatus 2. If n=1, the indication information may indicate an initial model parameter set of the intelligent model. The model parameter set may be a gradient set or a weight set of the intelligent model.

After determining the updated intelligent model based on the model parameter set corresponding to the (n−1)^th time of model training, the communication apparatus 1 performs the n^th model training process by using the updated intelligent model, to obtain the input data set 1 and the gradient accumulation factor set 1. The input data set 1 includes an input data set that corresponds to each layer of the intelligent model and that is obtained in current model training, and the gradient accumulation factor set 1 includes a gradient accumulation factor set that corresponds to each layer of the intelligent model and that is obtained in current model training. For a specific determining manner, refer to the description in the embodiment shown in FIG. 6A. Details are not described herein again.

S702: The communication apparatus 2 performs the n^th model training process of the intelligent model, and determines an input data set 2 and a gradient accumulation factor set 2.

The communication apparatus 2 determines the input data set 2 and the gradient accumulation factor set 2 in a manner the same as that of the communication apparatus 1 in S701.

It should be understood that a sequence of performing the steps in the embodiment shown in FIG. 7 by the communication apparatuses is not limited in this disclosure. A sequence of the steps is determined by a logical relationship between the steps. In specific implementation, a sequence of the steps may be changed when there is no logic conflict. For example, S702 may be performed before S701, and subsequent S705 may be performed before S703, or S704 may be performed after S705.

S703: The communication apparatus 1 sends information 1 to the communication apparatus 0, where the information 1 indicates the input data set 1 and the gradient accumulation factor set 1 of the intelligent model.

Correspondingly, the communication apparatus 0 receives the information 1 from the communication apparatus 1.

S704: The communication apparatus 0 determines a gradient set 1 based on the input data set 1 and the gradient accumulation factor set 1.

The communication apparatus 0 may determine the gradient set 1 based on a type of each layer (for example, a fully connected layer or a convolution layer) of the intelligent model and according to Formula (2) or Formula (6). The gradient set 1 includes a gradient set corresponding to each layer determined by the communication apparatus 1 by performing the n^th time of model training.

S705: The communication apparatus 2 sends information 2 to the communication apparatus 0, where the information 2 indicates the input data set 2 and the gradient accumulation factor set 2 of the intelligent model.

Correspondingly, the communication apparatus 0 receives the information 2 from the communication apparatus 2.

S706: The communication apparatus 0 determines a gradient set 2 based on the input data set 2 and the gradient accumulation factor set 2.

Similarly, S705 and S706 may be implemented with reference to S703 and S704.

S707: The communication apparatus 0 determines, based on the gradient set 1 and the gradient set 2, a model parameter set corresponding to the n^th time of model training.

After the communication apparatus 0 separately obtains, based on the information 1 and the information 2, the gradient set 1 and the gradient set 2 that are determined by the communication apparatus 1 and the communication apparatus 2 by performing the n^th time of model training, the communication apparatus 0 combines the training results of the two apparatuses, to obtain the model parameter set corresponding to the n^th time of model training.

S708: The communication apparatus 0 sends indication information to the communication apparatus 1 and the communication apparatus 2, where the indication information indicates the model parameter set corresponding to the n^th time of model training.

Correspondingly, the communication apparatus 1 and the communication apparatus 2 receive the indication information from the communication apparatus 0, determine an updated intelligent model based on the model parameter set corresponding to the n^th time of model training, and then perform an (n+1)^th time of model training based on the updated intelligent model. After determining that a model training completion condition is met, the communication apparatus 0 notifies the communication apparatus 1 and the communication apparatus 2 to stop model training. Each communication apparatus obtains a trained intelligent model, and the intelligent model may be used for inference.
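One round of S703 to S707 at the communication apparatus 0 can be sketched as follows for fully connected layers. The data layout (a list of per-layer pairs), the function names, and in particular the element-wise averaging used to combine the two gradient sets are assumptions made for illustration; this disclosure states only that the training results are combined.

```python
# Sketch of S704, S706, and S707 at the central apparatus.
# Data layout, names, and the averaging rule are illustrative assumptions.

def rebuild(info):
    """info: list of (delta, a_prev) pairs per layer -> list of gradient sets."""
    return [[[d * a for a in a_prev] for d in delta] for delta, a_prev in info]

def combine(gradient_sets):
    """Element-wise average of gradient sets from several apparatuses."""
    count = len(gradient_sets)
    return [[[sum(values) / count for values in zip(*rows)]
             for rows in zip(*layers)]
            for layers in zip(*gradient_sets)]

# Information 1 and information 2 for a model with a single 2x2 layer.
info_1 = [([1.0, 2.0], [1.0, 0.0])]
info_2 = [([3.0, 0.0], [1.0, 2.0])]

gradient_set_1 = rebuild(info_1)   # S704
gradient_set_2 = rebuild(info_2)   # S706
model_parameter_set = combine([gradient_set_1, gradient_set_2])  # S707

print(model_parameter_set)  # [[[2.0, 3.0], [1.0, 0.0]]]
```

A weighted average, for example by local sample count, would fit the same structure; the choice of combining rule is left open here.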

For example, multiple communication apparatuses perform training of an intelligent model in a federated learning manner. The multiple communication apparatuses may include an access network node and a terminal, or the multiple communication apparatuses may be communication apparatuses in Internet of Things, Internet of Vehicles, or Industrial Internet of Things. Federated learning can be used to improve diversity of model training data and sharing of a processing capability of the apparatus.

After model training, the intelligent model may be used to perform a physical layer inference task. For example, the intelligent model is used to infer compressed CSI based on CSI. The access network node and multiple terminals may jointly train the intelligent model, and the multiple terminals may perform CSI compression based on the intelligent model, to feed back compressed CSI to the access network node. Alternatively, the intelligent model may be used to execute an application layer inference task after model training. For example, the intelligent model is used for speech recognition. If application layers of multiple terminals need to execute a speech recognition task, the access network node may configure the multiple terminals to jointly train the intelligent model. After the training is completed, the multiple terminals separately execute the speech recognition inference task by using the intelligent model.

According to the solution provided in this disclosure, when multiple communication apparatuses jointly perform model training, and a gradient set or a weight set of a neuron layer needs to be transferred between communication apparatuses, an input data set and a corresponding gradient accumulation factor set that are obtained through decomposing the gradient set corresponding to the neuron layer may be transmitted, and a receiving apparatus may determine the gradient set or the weight set based on the input data set and the gradient accumulation factor set. Transmission overheads can be reduced and resource utilization can be improved by using this manner of model parameter transmission.

It may be understood that to implement functions in the foregoing embodiments, the base station and the terminal include corresponding hardware structures and/or software modules for performing various functions. A person skilled in the art should be easily aware that, in this disclosure, the units and method steps in the examples described with reference to embodiments disclosed in this disclosure can be implemented by hardware or a combination of hardware and computer software. Whether a function is performed by hardware or hardware driven by computer software depends on particular application scenarios and design constraint conditions of the technical solutions.

FIG. 8 to FIG. 10 are diagrams of structures of possible communication apparatuses according to embodiments of this disclosure. The communication apparatuses may be configured to implement functions of the communication apparatus in the foregoing method embodiments, and therefore can also implement beneficial effects of the foregoing method embodiments. In this embodiment of this disclosure, when the communication apparatus is configured to implement the functions of the communication apparatus in the foregoing method embodiments, the communication apparatus may be one of the terminals 120a to 120j shown in FIG. 1A. Alternatively, the communication apparatus may be the RAN node 110a or 110b shown in FIG. 1A, or may be a module (for example, a chip or a chip system) used in a terminal or a network device.

The communication apparatus 800 shown in FIG. 8 includes a transceiver unit 820, and the transceiver unit 820 may be configured to receive or send information. The communication apparatus 800 may further include a processing unit 810, and the processing unit 810 may be configured to process instructions or data, to implement corresponding operations.

It should be further understood that when the communication apparatus 800 is a chip configured in (or used in) a communication device, the transceiver unit 820 in the communication apparatus 800 may be an input/output interface or a circuit in the chip, and the processing unit 810 in the communication apparatus 800 may be a processor in the chip.

Optionally, the communication apparatus 800 may further include a storage unit. The storage unit may be configured to store instructions or data. The processing unit 810 may execute the instructions or the data stored in the storage unit, to cause the communication apparatus to implement a corresponding operation.

The communication apparatus 800 may be configured to implement a function of the second communication apparatus or the communication apparatus 0 in the foregoing method embodiments. When the communication apparatus 800 is configured to implement a function of the second communication apparatus or the communication apparatus 0 in the foregoing method embodiments, the transceiver unit 820 is configured to receive first information, where the first information indicates an input data set of an l^th layer of an intelligent model and a gradient accumulation factor set corresponding to the l^th layer, and l is a positive integer. The processing unit 810 is configured to determine, based on the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, a gradient set corresponding to the l^th layer, where one gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in the gradient set, and the gradient set is used to determine the intelligent model.
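For illustration only, the receiving-side determination for a fully connected layer may be sketched as follows. The sketch assumes the relationship recited in the claims, in which a gradient g_mn is the product of the input data of the n^th node and the gradient accumulation factor corresponding to the m^th node. The function name `fc_gradient_set` and the numeric values are hypothetical and do not appear in this disclosure.

```python
from typing import List


def fc_gradient_set(inputs: List[float], factors: List[float]) -> List[List[float]]:
    """Return the M x N gradient set with g[m][n] = factors[m] * inputs[n].

    inputs  : N pieces of input data of the l-th layer (hypothetical values).
    factors : M gradient accumulation factors corresponding to the l-th layer.
    """
    return [[delta_m * x_n for x_n in inputs] for delta_m in factors]


# Hypothetical first information: N = 4 input values and M = 3 factors.
x = [1.0, 2.0, 3.0, 4.0]
delta = [0.5, -1.0, 2.0]

g = fc_gradient_set(x, delta)

# Each factor contributes to N gradients, so T = N * M gradients are
# obtained from only N + M received values.
received_values = len(x) + len(delta)
gradient_count = sum(len(row) for row in g)
```

In this sketch, 7 received values yield 12 gradients, which reflects that one gradient accumulation factor is used to determine multiple gradients in the gradient set.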

The communication apparatus 800 may be configured to implement a function of the first communication apparatus, the communication apparatus 1, or the communication apparatus 2 in the foregoing method embodiments. When the communication apparatus 800 is configured to implement a function of the first communication apparatus, the communication apparatus 1, or the communication apparatus 2 in the foregoing method embodiments, the processing unit 810 is configured to perform a model training process of an intelligent model, and determine an input data set of an l^th layer of the intelligent model and a gradient accumulation factor set corresponding to the l^th layer, where l is a positive integer. The transceiver unit 820 is configured to send first information, where the first information indicates the input data set of the l^th layer and the gradient accumulation factor set corresponding to the l^th layer, and one gradient accumulation factor in the gradient accumulation factor set is used to determine multiple gradients in a gradient set corresponding to the l^th layer.
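For illustration only, the sending-side determination of the gradient accumulation factor set may be sketched as follows. The claims state that the factor set of the l^th layer is determined based on a weight set of the (l+1)^th layer, the factor set of the (l+1)^th layer, and an output data set of the l^th layer, and that the factor set of the L^th layer is determined based on an output data set and a label set. The sigmoid activation, the squared-error loss, the function names, and the numeric values below are assumptions made for this sketch and are not specified in this disclosure.

```python
from typing import List


def output_layer_factors(outputs: List[float], labels: List[float]) -> List[float]:
    """Factors for the L-th layer, ASSUMING sigmoid outputs and squared-error loss."""
    return [(a - y) * a * (1.0 - a) for a, y in zip(outputs, labels)]


def hidden_layer_factors(next_weights: List[List[float]],
                         next_factors: List[float],
                         outputs: List[float]) -> List[float]:
    """Factors for the l-th layer, ASSUMING a sigmoid activation.

    next_weights[k][m] : weight of the (l+1)-th layer from output m to node k.
    next_factors[k]    : gradient accumulation factor of the (l+1)-th layer.
    outputs[m]         : output data of the l-th layer.
    """
    factors = []
    for m, a in enumerate(outputs):
        # Multiplication step: propagate the (l+1)-th layer factors through the weights.
        back = sum(next_weights[k][m] * next_factors[k]
                   for k in range(len(next_factors)))
        # Partial derivative step: sigmoid derivative expressed through the output.
        factors.append(back * a * (1.0 - a))
    return factors


# Hypothetical two-layer example.
last = output_layer_factors(outputs=[0.5], labels=[1.0])
hidden = hidden_layer_factors(next_weights=[[2.0, -4.0]],
                              next_factors=last,
                              outputs=[0.5, 0.5])
```

The sending side would then indicate `hidden`, together with the input data set of the corresponding layer, in the first information. A different activation function or loss function changes only the partial derivative terms.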

For more detailed descriptions of the processing unit 810 and the transceiver unit 820, refer to related descriptions in the foregoing method embodiments.
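For illustration only, the case in which the l^th layer comprises Q convolution kernels and the input data set comprises input data of P channels may be sketched as follows. The claims recite that a gradient subset G_qp is obtained by a convolution operation based on the input data of the p^th channel and the gradient accumulation factor subset corresponding to the q^th convolution kernel. The sketch assumes two-dimensional data, a stride of 1, no padding, and a sliding-window (cross-correlation) form of the convolution operation. These assumptions, the function names, and the values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def correlate_valid(x: Matrix, d: Matrix) -> Matrix:
    """Slide d over x (stride 1, no padding) and sum the element-wise products."""
    out_h = len(x) - len(d) + 1
    out_w = len(x[0]) - len(d[0]) + 1
    return [[sum(d[u][v] * x[i + u][j + v]
                 for u in range(len(d)) for v in range(len(d[0])))
             for j in range(out_w)]
            for i in range(out_h)]


def conv_gradient_subsets(channels: List[Matrix],
                          factor_subsets: List[Matrix]) -> List[List[Matrix]]:
    """Return G[q][p] for Q factor subsets and P channels (R = Q * P subsets)."""
    return [[correlate_valid(x_p, d_q) for x_p in channels]
            for d_q in factor_subsets]


# Hypothetical first information: P = 2 channels of 3 x 3 input data and
# Q = 1 gradient accumulation factor subset of size 2 x 2.
channels = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
]
factor_subsets = [[[1.0, 0.0], [0.0, 1.0]]]

G = conv_gradient_subsets(channels, factor_subsets)
subset_count = sum(len(per_kernel) for per_kernel in G)
```

Here each factor subset is reused for all P channels, so R = Q × P gradient subsets are obtained from P channels of input data and Q factor subsets.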

It should be understood that the transceiver unit 820 in the communication apparatus 800 may be implemented by using a transceiver, a transceiver circuit, an input/output interface, or a pin. When the transceiver unit 820 is a transceiver, the transceiver may include a receiver and/or a transmitter. The processing unit 810 in the communication apparatus 800 may be implemented by using at least one processor. The processing unit 810 in the communication apparatus 800 may alternatively be implemented by using at least one logic circuit. Optionally, the communication apparatus 800 further includes a storage unit, and the storage unit may be implemented by using a memory.

As shown in FIG. 9, the communication apparatus 900 includes a processor 910 and an interface circuit 920. The processor 910 and the interface circuit 920 are coupled to each other. It may be understood that the interface circuit 920 may be a transceiver or an input/output interface. Optionally, the communication apparatus 900 may further include a memory 930, configured to store instructions executed by the processor 910, store input data required by the processor 910 to run instructions, or store data generated after the processor 910 runs instructions.

In an implementation, the memory 930 may alternatively be integrated into the processor 910, or may be independent of the processor 910.

When the communication apparatus 900 is configured to implement the method in the foregoing method embodiments, the processor 910 is configured to implement functions of the processing unit 810, and the interface circuit 920 is configured to implement functions of the transceiver unit 820.

FIG. 10 is another diagram of a structure of a communication apparatus 1100. It may be understood that the communication apparatus 1100 includes components in necessary forms such as module, unit, element, circuit, or interface, and the components are properly configured together to implement functions of the first communication apparatus or the second communication apparatus in the foregoing embodiments of this disclosure. The communication apparatus 1100 may be a RAN node, a terminal, a core network device, or another network device in FIG. 1A, or may be a component (for example, a chip) in these devices, and is configured to implement the method described in the foregoing method embodiments. The communication apparatus 1100 includes one or more processors 1010. The processor 1010 may be a general-purpose processor, a dedicated processor, or the like. For example, the processor may be a baseband processor or a central processing unit. The baseband processor may be configured to process a communication protocol and communication data. The central processing unit may be configured to: control the communication apparatus (for example, a RAN node, a terminal, or a chip), execute a software program, and process data of the software program.

Optionally, in a design, the processor 1010 may include a program 1011 (which may also be referred to as code or instructions in some cases), and the program 1011 may be run on the processor 1010, to cause the communication apparatus 1100 to perform the method described in the foregoing embodiments. In another possible design, the communication apparatus 1100 may include a circuit (not shown in FIG. 10). The circuit is configured to implement a function of receiving a signal and/or sending a signal by the communication apparatus in the foregoing embodiments.

Optionally, the communication apparatus 1100 may include one or more memories 1020, and a program 1021 (which may also be referred to as code or instructions in some cases) is stored in the memory 1020. The program 1021 may be run on the processor 1010, to cause the communication apparatus 1100 to perform the method described in the foregoing method embodiments.

Optionally, the processor 1010 and/or the memory 1020 may include an AI module 1012 and an AI module 1022, and the AI module is configured to implement an AI-related function. The AI module may be implemented by using software, hardware, or a combination of software and hardware. For example, the AI module may include a RAN intelligent controller (RIC) module. For example, the AI module may be a near-real-time RIC or a non-real-time RIC.

Optionally, the processor 1010 and/or the memory 1020 may further store data. The processor and the memory may be separately disposed, or may be integrated together.

Optionally, the communication apparatus 1100 may further include a transceiver 1030 and/or an antenna 1040. The processor 1010 may also be referred to as a processing unit in some cases, and controls the communication apparatus (for example, the RAN node or the terminal). The transceiver 1030 may also be referred to as a transceiver unit, a transceiver machine, a transceiver circuit, or the like in some cases, and is configured to implement receiving and sending functions of the communication apparatus via the antenna 1040.

When the communication apparatus is a chip used in a terminal, the chip may implement functions related to the terminal in the foregoing method embodiments. The chip receives information/data from another module (for example, a radio frequency module or an antenna) in the terminal, where the information/data may be sent by a network device to the terminal; or the chip sends information/data to another module (for example, a radio frequency module or an antenna) in the terminal, where the information/data may be sent by the terminal device to a network device.

When the communication apparatus is a module used in a network device, the module may implement functions related to the network device in the foregoing method embodiments. The module receives information/data from another module (for example, a radio frequency module or an antenna) in the network device, where the information/data may be sent by a terminal to the network device; or the module sends information/data to another module (for example, a radio frequency module or an antenna) in the network device, where the information/data may be sent by the network device to a terminal. The module in the network device herein may be a chip of the network device, or may be a DU or another module (for example, a CU or a radio unit (RU)). The RU may include a remote radio unit (RRU). The DU may be an O-DU in an O-RAN architecture, the CU may be an O-CU in the O-RAN architecture, and the RU may be an O-RU in the O-RAN architecture.

It can be understood that the processor in embodiments of this disclosure may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor, any regular processor, or the like.

The method steps in embodiments of this disclosure may be implemented in hardware, or may be implemented in software instructions that may be executed by the processor. The software instructions may include a corresponding software module. The software module may be stored in a random-access memory (RAM), a flash memory, a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well-known in the art. For example, a storage medium is coupled to a processor, so that the processor can read information from the storage medium and write information into the storage medium. The storage medium may alternatively be a component of the processor. The processor and the storage medium may be disposed in an ASIC. In addition, the ASIC may be located in an access network device or a terminal device. The processor and the storage medium may alternatively exist as discrete components in the access network device or the terminal device.

According to the methods provided in embodiments of this disclosure, embodiments of this disclosure further provide a computer program product. The computer program product includes computer program code. When the computer program code is executed by one or more processors, an apparatus including the processor is caused to perform the method in the foregoing method embodiments.

All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used for implementation, all or some of the embodiments may be implemented in a form of computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed on a computer, the procedures or functions in embodiments of this disclosure are all or partially executed. The computer may be a general-purpose computer, a dedicated computer, a computer network, a network device, user equipment, or another programmable apparatus.

According to the methods provided in embodiments of this disclosure, embodiments of this disclosure further provide a computer-readable storage medium. The computer-readable storage medium stores the computer programs or instructions. When the computer programs or instructions are run by one or more processors, an apparatus including the processor is caused to perform the method in the foregoing method embodiments.

The computer programs or the instructions may be stored in the computer-readable storage medium, or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer programs or the instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device like a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium, for example, a floppy disk, a hard disk, or a magnetic tape; or may be an optical medium, for example, a digital video disc; or may be a semiconductor medium, for example, a solid-state drive. The computer-readable storage medium may be a volatile or non-volatile storage medium, or may include both a volatile storage medium and a non-volatile storage medium.

According to the methods provided in embodiments of this disclosure, embodiments of this disclosure further provide a communication system, including one or more terminals described above. The system may further include one or more network devices described above.

In several embodiments provided in this disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatuses are merely examples. For example, the unit division is merely logical function division and may be other division in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electric, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.

In embodiments of this disclosure, unless otherwise stated or there is a logic conflict, terms and/or descriptions in different embodiments are consistent and may be mutually referenced, and technical features in different embodiments may be combined into a new embodiment based on an internal logical relationship thereof.

The foregoing descriptions are merely specific implementations of this disclosure, but are not intended to limit the protection scope of this disclosure. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this disclosure shall fall within the protection scope of this disclosure. Therefore, the protection scope of this disclosure shall be subject to the protection scope of the claims.

Claims

1. A method, comprising:

receiving first information indicating an input data set of an l^th layer of an intelligent model and a gradient accumulation factor set corresponding to the l^th layer, wherein l is a positive integer; and

determining, based on the input data set and the gradient accumulation factor set, a gradient set corresponding to the l^th layer,

wherein a first gradient accumulation factor in the gradient accumulation factor set is for determining first gradients in the gradient set, and

wherein the gradient set is for determining the intelligent model.

2. The method of claim 1, further comprising:

determining, based on the gradient set, a weight set of the intelligent model, wherein the gradient set comprises a second gradient corresponding to each weight in the weight set; and

determining, based on the weight set, an updated intelligent model.

3. The method of claim 1, wherein the l^th layer is a fully connected layer, wherein the l^th layer comprises N nodes, wherein an (l+1)^th layer of the intelligent model comprises M nodes, wherein N and M are positive integers greater than 2, wherein the input data set comprises N pieces of first input data corresponding to the N nodes, wherein the gradient accumulation factor set comprises M gradient accumulation factors corresponding to the M nodes, wherein the gradient set comprises T gradients, and wherein T is a first product of N and M.

4. The method of claim 3, wherein determining the gradient set comprises determining, based on second input data of each of the N nodes and a second gradient accumulation factor corresponding to each of the M nodes, the T gradients, wherein a gradient g_mn in the T gradients is a second product of third input data of an n^th node in the N nodes and a third gradient accumulation factor corresponding to an m^th node in the M nodes, wherein n is a positive integer less than or equal to N, and wherein m is a positive integer less than or equal to M.

5. The method of claim 1, wherein the input data set comprises first input data of P channels, wherein the l^th layer comprises Q convolution kernels, wherein the Q convolution kernels correspond to Q channels of output data of the l^th layer, wherein the gradient accumulation factor set comprises Q gradient accumulation factor subsets corresponding to the Q convolution kernels, wherein the gradient set comprises R gradient subsets, wherein the R gradient subsets comprise P gradient subsets corresponding to each of the Q convolution kernels, wherein the P gradient subsets correspond to the P channels, wherein P and Q are positive integers, and wherein R is a product of Q and P.

6. The method of claim 5, wherein determining the gradient set comprises:

determining, by performing a first convolution operation and based on second input data of each of the P channels and a gradient accumulation factor subset corresponding to each of the Q convolution kernels, the R gradient subsets; and

obtaining, by performing a second convolution operation and based on third input data of a p^th channel in the P channels and a second gradient accumulation factor corresponding to a q^th convolution kernel in the Q convolution kernels, a gradient subset G_qp in the R gradient subsets.

7. The method of claim 1, further comprising sending the gradient set to determine the intelligent model.

8. A method, comprising:

performing a model training process of an intelligent model;

determining an input data set of an l^th layer of the intelligent model and a first gradient accumulation factor set corresponding to the l^th layer, wherein l is a positive integer; and

sending first information indicating the input data set and the first gradient accumulation factor set,

wherein a first gradient accumulation factor in the first gradient accumulation factor set is for determining first gradients in a gradient set corresponding to the l^th layer.

9. The method of claim 8, wherein determining the first gradient accumulation factor set comprises determining, based on a first weight set of an (l+1)^th layer of the intelligent model, a second gradient accumulation factor set corresponding to the (l+1)^th layer, and a first output data set of the l^th layer, the first gradient accumulation factor set.

10. The method of claim 9, wherein the intelligent model comprises L layers, wherein l is less than or equal to L, wherein determining the first gradient accumulation factor set further comprises determining, based on a second output data set and a label set of an L^th layer, a third gradient accumulation factor set corresponding to the L^th layer when l is equal to L, and wherein the label set is for determining a second weight set of the intelligent model.

11. The method of claim 9, wherein the l^th layer is a fully connected layer and comprises N nodes, wherein the (l+1)^th layer comprises M nodes, wherein N and M are positive integers greater than 2, wherein the input data set comprises N pieces of input data corresponding to the N nodes, wherein the first gradient accumulation factor set comprises M gradient accumulation factors corresponding to the M nodes, wherein the gradient set comprises T gradients, and wherein T is a product of N and M.

12. The method of claim 11, wherein determining the second gradient accumulation factor set comprises determining, by using at least one of a partial derivative operation or a multiplication operation, the second gradient accumulation factor set.

13. The method of claim 9, wherein the input data set comprises input data of P channels, wherein the l^th layer comprises Q convolution kernels, wherein the Q convolution kernels correspond to Q channels of output data of the l^th layer, wherein the first gradient accumulation factor set comprises Q gradient accumulation factor subsets corresponding to the Q convolution kernels, wherein the gradient set comprises R gradient subsets, wherein the R gradient subsets comprise P gradient subsets corresponding to each of the Q convolution kernels, wherein the P gradient subsets corresponding to each of the Q convolution kernels correspond to the P channels, and wherein R is a product of Q and P.

14. The method of claim 13, wherein determining the first gradient accumulation factor set comprises determining, by using at least one of a partial derivative operation or a convolution operation, the first gradient accumulation factor set.

15. An apparatus, comprising:

a memory configured to store instructions; and

one or more processors coupled to the memory and configured to execute the instructions to cause the apparatus to:

receive first information indicating an input data set of an l^th layer of an intelligent model and a gradient accumulation factor set corresponding to the l^th layer, wherein l is a positive integer; and

determine, based on the input data set and the gradient accumulation factor set, a gradient set corresponding to the l^th layer,

wherein a first gradient accumulation factor in the gradient accumulation factor set is for determining first gradients in the gradient set, and

wherein the gradient set is for determining the intelligent model.

16. The apparatus of claim 15, wherein the one or more processors are further configured to execute the instructions to cause the apparatus to:

determine, based on the gradient set, a weight set of the intelligent model, wherein the gradient set comprises a second gradient corresponding to each weight in the weight set; and

determine, based on the weight set, an updated intelligent model.

17. The apparatus of claim 15, wherein the l^th layer is a fully connected layer, wherein the l^th layer comprises N nodes, wherein an (l+1)^th layer of the intelligent model comprises M nodes, wherein N and M are positive integers greater than 2, wherein the input data set comprises N pieces of first input data corresponding to the N nodes, wherein the gradient accumulation factor set comprises M gradient accumulation factors corresponding to the M nodes, wherein the gradient set comprises T gradients, and wherein T is a first product of N and M.

18. The apparatus of claim 17, wherein the one or more processors are further configured to execute the instructions to cause the apparatus to further determine the gradient set by determining, based on second input data of each of the N nodes and a second gradient accumulation factor corresponding to each of the M nodes, the T gradients, wherein a gradient g_mn in the T gradients is a second product of third input data of an n^th node in the N nodes and a third gradient accumulation factor corresponding to an m^th node in the M nodes, wherein n is a positive integer less than or equal to N, and wherein m is a positive integer less than or equal to M.

19. The apparatus of claim 15, wherein the input data set comprises first input data of P channels, wherein the l^th layer comprises Q convolution kernels, wherein the Q convolution kernels correspond to Q channels of output data of the l^th layer, wherein the gradient accumulation factor set comprises Q gradient accumulation factor subsets corresponding to the Q convolution kernels, wherein the gradient set comprises R gradient subsets, wherein the R gradient subsets comprise P gradient subsets corresponding to each of the Q convolution kernels, wherein the P gradient subsets correspond to the P channels, wherein P and Q are positive integers, and wherein R is a product of Q and P.

20. The apparatus of claim 19, wherein the one or more processors are further configured to execute the instructions to cause the apparatus to further determine the gradient set by:

obtaining, by performing a second convolution operation and based on third input data of a p^th channel in the P channels and a gradient accumulation factor corresponding to a q^th convolution kernel in the Q convolution kernels, a gradient subset G_qp in the R gradient subsets.
