US20190251447A1
2019-08-15
16/262,947
2019-01-31
A computing device for training a fully-connected neural network (FCNN) comprises at least one storage device; and at least one processing circuit, coupled to the at least one storage device. The at least one storage device stores, and the at least one processing circuit is configured to execute instructions of: computing a block-diagonal approximation of a positive-curvature Hessian (BDA-PCH) matrix of the FCNN; and computing at least one update direction of the BDA-PCH matrix according to an expectation approximation conjugated gradient (EA-CG) method.
Get notified when new applications in this technology area are published.
G06N3/08 IPC
Computing arrangements based on biological models using neural network models Learning methods
G06F17/16 » CPC further
Digital computing or data processing equipment or methods, specially adapted for specific functions; Complex mathematical operations Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
G06N3/084 » CPC main
Computing arrangements based on biological models using neural network models; Learning methods Back-propagation
G06N3/04 » CPC further
Computing arrangements based on biological models using neural network models Architectures, e.g. interconnection topology
This application claims the benefit of U.S. Provisional Applications No. 62/628,311 filed on Feb. 9, 2018, No. 62/630,278, Filed on Feb. 14, 2018 and No. 62/673,143, filed on May. 18, 2018, which are incorporated herein by reference.
The present invention relates to a device and a method used in a computing system, and more particularly, to a device and a method of training a fully-connected neural network.
Neural networks have been applied to solve problems in several application domains such as computer vision, natural language processing, disease diagnosis, etc. When training a neural network, model parameters of the neural network according to a backpropagation process, stochastic gradient descent (SGD), Broyden-Fletcher-Goldfarb-Shanno and one-step secant are representative algorithms used for realizing the backpropagation process.
SGD minimizes a function by using a function's first derivative, and has been proven to be effective for training large models . However, stochasticity in a gradient slows down convergence for all gradient methods such that none of these gradient methods can be asymptotically faster than simple SGD with Polyak averaging. Besides the gradient methods, second-order methods utilize curvature information of a loss function within neighborhood of a given point to guide an update direction. Since each update becomes more precise, the second-order methods converge faster than first-order methods in terms of update iterations.
To solve a convex optimization problem, a second-order method converges to a global minimum in fewer steps than SGD. However, a problem of training the neural-network can be non-convex, and an issue of a negative curvature occurs. To avoid the issue, a Gauss-Newton matrix with a convex criterion function or a Fisher matrix may be used to measure a curvature, since these matrices are guaranteed to be positive semi-definite (PSD).
Although these matrices can alleviate the issue of the negative curvature, computing the Gauss-Newton matrix or the Fisher matrix even for a modestly-sized fully-connected neural network (FCNN) is intractable. O(N2) complexity is needed for a second derivative, if O(N) complexity is needed for computing a first derivative. Thus, several methods in the prior art are proposed to approximate these matrices. However, none of the methods are computationally feasible and more effective than the first-order methods. Thus, a computationally feasible and effective second-order methods for training the FCNN is needed.
The present invention therefore provides a device and a method for training a FCNN to solve the abovementioned problem.
A computing device for training a FCNN comprises at least one storage device; and at least one processing circuit, coupled to the at least one storage device. The at least one storage device stores, and the at least one processing circuit is configured to execute instructions of: computing a block-diagonal approximation of a positive-curvature Hessian (BDA-PCH) matrix of the FCNN; and computing at least one update direction of the BDA-PCH matrix according to an expectation approximation conjugated gradient (EA-CG) method.
A method for training a FCNN, comprises computing a BDA-PCH matrix of the FCNN; and computing at least one update direction of the BDA-PCH matrix according to an EA-CG method.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
FIG. 1 is a schematic diagram of a computing device according to an example of the present invention.
FIG. 2 is a flowchart of a process according to an example of the present invention.
FIG. 1 is a schematic diagram of a computing device 10 according to an example of the present invention. The computing device 10 includes at least one processing circuit 100 such as a microprocessor or Application Specific Integrated Circuit (ASIC), at least one storage device 110 and at least one communication interfacing device 120. The at least one storage device 110 maybe any data storage device that may store program codes 114, accessed and executed by the at least one processing circuit 100. Examples of the at least one storage device 110 include but are not limited to a subscriber identity module (SIM), read-only memory (ROM), flash memory, random-access memory (RAM), hard disk, optical data storage device, non-volatile storage device, non-transitory computer-readable medium (e.g., tangible media), etc. The at least one communication interfacing device 120 is used to transmit and receive signals (e.g., information, data, messages and/or packets) according to processing results of the at least one processing circuit 100. The at least one communication interfacing device 120 may be at least one transceiver, at least one interfacing circuit or at least one interfacing board, and is not limited herein. An abovementioned communication interfacing device may be Universal Serial Bus (USB), Institute of Electrical and Electronics Engineers (IEEE) 1394, Serial Advanced Technology Attachment (SATA), Integrated Drive Electronics (IDE), Peripheral Component Interconnect (PCI), or Ethernet.
The present invention provides a block-diagonal approximation of a positive-curvature Hessian (BDA-PCH) matrix, which is memory-efficient. The BDA-PCH matrix can be applied to any fully-connected neural network (FCNN), where an activation function and a criterion function are twice differentiable. The BDA-PCH matrix can handle non-convex criterion functions, which cannot be handled by Gauss-Newton methods. In addition, an expectation approximation (EA) is combined with a conjugated gradient (CG) method, which is termed as an EA-CG method, to derive update directions for training the FCNN in a mini-batch setting. The EA-CG method significantly reduces space complexity and time complexity of conventional CG methods.
A second-order method for training a FCNN is proposed in the present invention as follows:
A Newton method is one of second-order minimization methods, and includes two steps: 1) computing a Hessian matrix, and 2) solving a system of linear equations for update directions. A truncated-Newton method applies a CG method with restricted iterations to the second step of the Newton method. In the following description, the truncated-Newton method in context of a convex scenario is first discussed. Then, a non-convex scenario of the truncated-Newton method is discussed, and an important property that lays a foundation of a proposed PCH matrix is provided.
A minimization problem is formulated as follows:
minθƒ(θ), (1)
∇ƒ(θ*)=0. (2)
A quadratic polynomial is used to approximate the equation (Eq. 1) by conducting a Taylor expansion with a given point θj. Then, the equation (Eq. 1) can be expressed as follows:
mindƒ(θj+d)≈ƒ(θj)+∇ƒ(θj)Td+½dT∇2ƒ(θj)d, (3)
∇ƒ(θj0+∇2ƒ(θj)dj=0. (4)
Thus, a Newton direction can be obtained as follows:
dj=−∇2ƒ(θj)−1∇ƒ(θj). (5)
θj+1=θj+ηdj, (6)
For a non-convex scenario, a solution to the equation (Eq. 2) reflects one of three possibilities: a local minimum θmin, a local maximum θmax and a saddle point θsaddle. An important concept is introduced that curvature information of ƒ at a given point θ can be obtained by analyzing the Hessian matrix ∇2ƒ(θ). On the one hand, the Hessian matrix of ƒ at any θmin is positive semi-definite. The Hessian matrix of ƒ at any θmax and θsaddle are negative semi-definite and indefinite, respectively. After establishing the concept, a Property is used to understand how to utilize negative curvature information to resolve the issue of negative curvature.
Property: Let ƒ be a non-convex and twice-differentiable function. With a given point θj, it is supposed that there exist some negative eigenvalues {λ1, . . . , λs} for ∇2ƒ(θj). Moreover, V=span{ν1, . . . , νs} is taken, which is an eigenspace corresponding to {λ1, . . . , λs}. If the following equation is considered
g(k)=ƒ(θj)+∇ƒ(θj)Tν+½νT∇2ƒ(θj)ν, (7)
According to the Property, the equation (Eq. 4) may lead to a local maximum or a saddle point, if ∇2ƒ(θj) has some negative eigenvalues. In order to converge to a local minimum, ∇2ƒ(θj) is replaced with Pos-Eig(∇2ƒ(θj)), where Pos-Eig (A) is conceptually defined as replacing negative eigenvalues of A with non-negative ones as follows:
Post - Eig ( A ) = Q T [ γλ 1 ⋱ γλ s λ s + 1 ⋱ λ n ] Q , ( 8 )
When the number of variables in ƒ is large, the Hessian matrix becomes intractable in terms of space complexity. Alternatively, a CG method may be used to solve the equation (Eq. 4). This alternative only needs calculating Hessian-vector products rather than storing the whole Hessian matrix. Moreover, it is desirable to restrict the iteration number of the CG method, to save computation cost.
For a second-order method, a block Hessian matrix is used to compute curvature information. As a basis of a proposed PCH matrix, in the following description, notations for training a FCNN are described and the block Hessian recursion is formulated with the notations.
A FCNN with k layers takes an input vector hi0=xi, where xi is an ith instance in a training set. For the ith instance, activation values in the other layers can be recursively derived according to: hit=σ(Wthit−1+bt), t=1, . . . , k−1, where σ is an activation function and may be any twice differentiable function, and Wt and bt are weights and biases in the tth layer, respectively. nt is the number of the neurons in the tth layer, where t=0, . . . ,k, and all model parameters including all the weights and biases in each layer is formulated as θ=(Vec(W1), b1, . . . , Vec(Wk), bk), where Vec(A)=[[A.1]T, . . . , [A.n]T]T. By following the above notations, a FCNN output with k layers can be formulated as hik=F(θ|xi)=Wkhik−1+bk.
To train the FCNN, a loss function ξ which can be any twice differentiable function is needed. Training the FCNN can thus be interpreted as solving the following minimization problem:
minθΣi=1lξ(hik|yi)≡min0Σi=1lC(ŷi|yi), (9)
For a lucid exposition of a block Hessian recursion, equations of a backpropagation are formulated according to the notations defined in the previous description. The bias term bt and the weight term Wt are separated, and are treated individually during a backward propagation of gradients. The gradients of ξ with respect to the bias term and the weight term can be derived according to the formulated equations in a layer-wise manner. For the ith instance, the formulated equations are as follows:
∇bk2ξi=∇hikξi, (10)
∇bt−1ξi=diag(hi(t−1)′)WtT∇btξi, (11)
∇Wt2ξi=∇btξi⊗hi(t−1)T, (12)
∇bk2ξi=∇hik2ξi, (13)
∇bt−12ξi=diag(hi(t−1)′)WtT∇bt2ξiWtdiag(hi(t−1)′)+diag(hi(t−1)″{circle around (·)}(WtT∇btξi, (14)
∇Wt2ξi=(hi(t−1)⊗hi(t−1)T)⊗∇bt2ξi, (15)
The idea behind expectation approximation is that a covariance between [hi(t−1)⊗hi(t−1)T]uν and [∇btξi⊗∇btξiT]μν with given indices (u, ν) and (μ, ν) is shown to be tiny, and thus is ignored due to computational efficiency according to the following equation:
i[[hi(t−1)⊗hi(t−1)T]uν·[∇btξi⊗∇btξiT]μν]≈i[[hi(t−1)⊗hi(t−1)T]uν]·i[[∇btξi⊗∇btξiT]μν]. (16)
To explain this concept on the above formulations, cov-t is defined as Ele-Cov((hi(t−1)⊗hi*t−1)T)⊗1nt,nt, 1nt−1,nt−1⊗∇bt2ξi), where “Ele-Cov” is denoted as an element-wise covariance, and 1u,84 is a matrix whose elements are 1 in u×ν, t=1, . . . , k. With the definition of cov-t and the previous equations, the approximation can be interpreted as follows:
i ( ∇ W t 2 ξ i ] = EhhT t - 1 ⊗ i [ ∇ b t 2 ξ i ] + cov - t , ≈ EhhT t - 1 ⊗ i [ ∇ b t 2 ξ i ] , ( 17 )
Then, the following approximation equation can be obtained:
i [ ∇ b t - 1 2 ξ i ] ≈ i [ diag ( h i ( t - 1 ) ′ ) W tT i [ ∇ b t 2 ξ i ] W t diag ( h i ( t - 1 ) ′ ) + diag ( h i ( t - 1 ) ″ ⊙ ( W tT ∇ b t ξ i ) ] , = ( W tT i [ ∇ b t 2 ξ i ] W t ) ⊙ EhhT ( t - 1 ) ′ + i [ diag ( h i ( t - 1 ) ″ ⊙ ( W tT ∇ b t ξ i ) ) ] , ( 18 )
Ele - Cov ( W tT ∇ b t 2 ξ i W t , h i ( t - 1 ) ′ ⊗ h i ( t - 1 ) ′ T F 2 ≤ L 4 Σ μ , v Var ( [ W tT ∇ b t 2 ξ i W t ] μ v ) , ( 19 )
A computationally feasible method for training a FCNN with Newton directions is proposed in the present invention. First, a PCH matrix is constructed. Then, based on the PCH matrix, an efficient CG-based method incorporating the expectation approximation to derive the Newton directions for multiple training instances, call the EA-CG method, is proposed.
Based on the layer-wise equations and the integration of the expectation approximation, block matrices with various sizes are constructed, and are located at a diagonal of the Hessian matrix. This block-diagonal matrix [∇θ2ξi] is represented as diag(i[∇W12ξi], i[∇b12ξi], . . , i[∇Wk2ξi], i[∇bk2ξi]). Please note that [∇θ2ξi] is a block-diagonal Hessian matrix, and is not the complete Hessian matrix. According to the description for the three possibilities of the update directions, [∇θ2ξi] should be modified. Thus, [∇θ2ξi] is replaced with diag (i[ξi], i[ξi], . . . , i[ξi], i[ξi]), and the modified result is denoted as i[ξi], where
i[ξi]=Pos−Eig(i[∇hik2ξi]), (20)
i[ξi]=(WtTi[ξi]Wt){circle around (·)}EhhT(t−1)′+Pos−Eig(diag(i[hi(t−1)″{circle around (·)}(WtT∇btξi])), (21)
i[ξi]=EhhTt−1⊗i[ξi]. (22)
After obtaining the PCH matrix z,51 i[ξi], the update direction is updated by solving the following linear equation:
((1−α)i[ξi]+αI)dθ=−i[∇θξi], (23)
((1−α)i[ξi]+αI)dbt=i[∇btξi], (24)
((1−α)i[ξi]+αI)dWt=−Vec(i[∇Wtξi]), (25)
i [ ξ i ] Vec ( P ) = Vec ( i [ ξ i ] · P · i [ h i t - 1 ⊗ h i ( t - 1 ) T ] ) , ≈ Vec ( i [ ξ i ] · P · i [ h i t - 1 ] ⊗ i [ h i ( t - 1 ) T ] ) , ( 26 )
Based on the equation (Eq. 26), the Hessian-vector products of z,51 i[ξi] via z,51 i[ξi] are obtained, and the space complexity of storing the curvature information is reduced.
The above description can be summarized into a process 20 shown in FIG. 2, and can be compiled into the program codes 114. The process 20 includes the following steps:
Step 200: Start.
Step 202: Compute a BDA-PCH matrix of the FCNN.
Step 204: Compute at least one update direction of the BDA-PCH matrix according to an EA-CG method.
Step 206: End.
Details and variations of the process 20 can be referred to the above illustration, and are not narrated herein.
It should be noted that although the above examples are illustrated to clarify the related operations of corresponding processes. The examples can be combined and/or modified arbitrarily according to system requirements and/or design considerations.
Those skilled in the art should readily make combinations, modifications and/or alterations on the abovementioned description and examples. The abovementioned description, steps and/or processes including suggested steps can be realized by means that could be hardware, software, firmware (known as a combination of a hardware device and computer instructions and data that reside as read-only software on the hardware device) , an electronic system, or combination thereof. An example of the means may be the computing device 10. In the above description, the examples (including related equations) may be compiled into the program codes 214.
To sum up, a PCH matrix and an EA-CG method are proposed to achieve more computationally feasible second-order methods for training a FCNN. The proposed PCH matrix overcomes the problem of training the FCNN with non-convex criterion functions. In addition, the EA-CG method provides another alternative to efficiently derive update directions. Empirical studies show that the proposed PCH matrix performs better than the state-of-the-art curvature approximation, and the EA-CG method converges faster while having a better testing accuracy.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
1. A computing device for training a fully-connected neural network (FCNN), comprising:
at least one storage device; and
at least one processing circuit, coupled to the at least one storage device, wherein the at least one storage device stores, and the at least one processing circuit is configured to execute instructions of:
computing a block-diagonal approximation of a positive-curvature Hessian (BDA-PCH) matrix of the FCNN; and
computing at least one update direction of the BDA-PCH matrix according to an expectation approximation conjugated gradient (EA-CG) method.
2. The computing device of claim 1, wherein the BDA-PCH matrix is computed by performing at least one expectation on a plurality of layer-wise equations.
3. The computing device of claim 2, wherein the plurality of layer-wise equations comprise a gradient of a plurality of loss functions at a plurality of layers with respect to at least one bias.
4. The computing device of claim 2, wherein the plurality of layer-wise equations comprise a gradient of a plurality of loss functions at a plurality of layers with respect to at least one weight.
5. The computing device of claim 1, wherein the BDA-PCH matrix comprises at least one expectation of a Hessian of a loss function with respect to at least one bias.
6. The computing device of claim 1, wherein the instruction of computing the at least one update direction according to the EA-CG method comprises:
computing a linear equation of a weighted average of the BDA-PCH matrix and an identity matrix; and
computing the at least one update direction by solving the linear equation according to the EA-CG method.
7. The computing device of claim 6, wherein the linear equation comprises the weighted average of the BDA-PCH matrix with respect to at least one bias and the identity matrix.
8. The computing device of claim 6, wherein the linear equation comprises the weighted average of the BDA-PCH matrix with respect to at least one weight and the identity matrix.
9. A method for training a fully-connected neural network (FCNN), comprising:
computing a block-diagonal approximation of a positive-curvature Hessian (BDA-PCH) matrix of the FCNN; and
computing at least one update direction of the BDA-PCH matrix according to an expectation approximation conjugated gradient (EA-CG) method.
10. The method of claim 9, wherein the BDA-PCH matrix is computed by performing at least one expectation on a plurality of layer-wise equations.
11. The method of claim 10, wherein the plurality of layer-wise equations comprise a gradient of a plurality of loss functions at a plurality of layers with respect to at least one bias.
12. The method of claim 10, wherein the plurality of layer-wise equations comprise a gradient of a plurality of loss functions at a plurality of layers with respect to at least one weight.
13. The method of claim 9, wherein the BDA-PCH matrix comprises at least one first expectation of a Hessian of a loss function with respect to at least one bias.
14. The method of claim 9, wherein the instruction of computing the at least one update direction according to the EA-CG method comprises:
computing a linear equation of a weighted average of the BDA-PCH matrix; and
computing the at least one update direction by solving the linear equation according to the EA-CG method.
15. The method of claim 14, wherein the linear equation comprises the weighted average of the BDA-PCH matrix with respect to at least one bias and the identity matrix.
16. The method of claim 14, wherein the linear equation comprises the weighted average of the BDA-PCH matrix with respect to at least one weight and the identity matrix.