### Abstract

In previous chapters we discussed various deep learning models for automatic speech recognition (ASR). In this chapter we introduce the computational network (CN), a unified framework for describing a wide range of learning machines, such as deep neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) including their long short-term memory (LSTM) version, logistic regression, and maximum entropy models. All these learning machines can be formulated and illustrated as a series of computational steps. A CN is a directed graph in which each leaf node represents an input value or a parameter and each non-leaf node represents a matrix operation acting upon its children. We describe algorithms to carry out forward computation and gradient calculation in a CN and introduce the most popular computation node types used in a typical CN.

### Computational Network

There is a common property in the key models described in the previous chapters, such as deep neural networks (DNNs): they can all be described as a series of computational steps. For example, a one-hidden-layer sigmoid network can be computed with Algorithm 1.1:

    1: procedure ForwardComputation(X)
    2:   T(1) ← W(1) X
    3:   P(1) ← T(1) + B(1)      ▷ each column of B(1) is the bias b(1)
    4:   S(1) ← σ(P(1))          ▷ σ(.) is the sigmoid function applied element-wise
    5:   T(2) ← W(2) S(1)
    6:   P(2) ← T(2) + B(2)      ▷ each column of B(2) is the bias b(2)
    7:   O ← softmax(P(2))       ▷ apply softmax column-wise to get output O
    8: end procedure

A computational network is a directed graph {V, E}, where V is a set of vertices and E is a set of directed edges. Each vertex, called a computation node, represents a computation operation. Vertices with edges toward a computation node are the operands of the associated computation and are sometimes called the children of the computation node. The order of operands matters for some operations, such as matrix multiplication. Leaf nodes have no children and are used to represent input values or model parameters that are not the result of any computation. A CN can thus be represented as a set of computation nodes n and their children {n : c₁, · · · , c_Kn}, where K_n is the number of children of node n; for leaf nodes K_n = 0.
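As a concrete illustration, the computational steps of such a one-hidden-layer network can be sketched in NumPy; all shapes and values below are illustrative, not taken from the text:

```python
# Sketch of Algorithm 1.1: forward computation of a one-hidden-layer
# sigmoid network. Each column of a matrix is one sample.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))                  # 4 features, 3 samples
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal((5, 1))
W2, b2 = rng.standard_normal((2, 5)), rng.standard_normal((2, 1))

T1 = W1 @ X
P1 = T1 + b1                                     # b1 broadcast: each column of B(1) is b(1)
S1 = 1.0 / (1.0 + np.exp(-P1))                   # sigma applied element-wise
T2 = W2 @ S1
P2 = T2 + b2
E = np.exp(P2 - P2.max(axis=0))                  # column-wise softmax, numerically stabilized
O = E / E.sum(axis=0)
```

Each intermediate matrix (T(1), P(1), S(1), …) corresponds to one computation node of the CN, which is what makes the graph view natural.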
Each computation node knows how to compute its own value given its input operands (children).

### Forward Computation

When the model parameters (i.e., weight nodes) are known, the value of the root node can be computed by evaluating the nodes in an order in which every child is computed before its parents. Such an order can be found with a depth-first traversal over the DAG, as in Algorithm 1.2:

    1: procedure DecideForwardComputationOrder(root, visited, order)
         ▷ enumerate nodes in the DAG in depth-first order;
         ▷ visited is initialized as an empty set, order as an empty queue
    2:   if root ∉ visited then    ▷ the same node may be a child of several nodes
    3:     visited ← visited ∪ {root}
    4:     for each c ∈ root.children do
    5:       call DecideForwardComputationOrder(c, visited, order)
    6:     end for
    7:     order.Enqueue(root)     ▷ children first, then the node itself
    8:   end if
    9: end procedure

The forward computation can also be carried out asynchronously, in which case the order of the computation is determined dynamically. This can be helpful when the CN has many parallel branches and there is more than one computing device to compute these branches in parallel. Algorithm 1.3 describes an example algorithm that carries out the forward computation of a CN asynchronously. In this algorithm, all nodes whose children have not been computed stay in a waiting set and those whose children have been computed stay in a ready set. At the beginning, all non-leaf descendants of root are in the waiting set and all leaf descendants are in the ready set. The scheduler picks a node from the ready set based on some policy, removes it from the ready set, and dispatches it for computation. Popular policies include first come first served, shortest task first, and least data movement. When the computation of a node finishes, the system calls SignalComplete so that any waiting parent whose children have now all been computed can be moved to the ready set.

In many cases, we may need to compute the value for one node and later for another node given the same input values. To avoid duplicate computation of shared branches, we can add a time stamp to each node and only recompute the value of a node if at least one of its children has a newer value. This can be easily implemented by updating the time stamp whenever a new value is provided or computed, and by excluding from the actual computation any node none of whose children has a newer value.
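A minimal sketch of the depth-first ordering of Algorithm 1.2 in Python; the `Node` class and node names are illustrative, not from any particular toolkit:

```python
# Depth-first enumeration of a CN's DAG so that every child appears
# before its parents in the resulting order.
class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def decide_forward_computation_order(root, visited=None, order=None):
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if root not in visited:              # the same node may be a child of several nodes
        visited.add(root)
        for c in root.children:
            decide_forward_computation_order(c, visited, order)
        order.append(root)               # enqueue: children first, then the node itself
    return order

# Diamond-shaped CN: shared leaf x feeds two branches that join at root.
x = Node("x")
a, b = Node("a", [x]), Node("b", [x])
root = Node("root", [a, b])
order = [n.name for n in decide_forward_computation_order(root)]
```

Note that the shared leaf `x` appears exactly once in the order even though it is a child of two nodes, which is what the `visited` set guarantees.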
In both Algorithms 1.2 and 1.3, each computation node needs to know how to compute its value when its operands are known. The computation can be as simple as a matrix summation or an element-wise application of the sigmoid function, or arbitrarily complex. We will describe the evaluation functions for popular computation node types in Section 1.4.

To train a CN, we need to define a training criterion J. Popular criteria include cross entropy (CE) for classification and mean square error (MSE) for regression, as discussed in Chapter ??. Since the training criterion is also the result of some computation, it can be represented as a computation node and inserted into the CN. The model parameters can then be optimized over a training set S = {(xᵐ, yᵐ) | 0 ≤ m < M} using the minibatch-based backpropagation (BP) algorithm similar to that described in Algorithm ??. More specifically, we improve each model parameter W at step t + 1 as

W_{t+1} ← W_t − ε∆W_t,  where  ∆W_t = (1/M_b) Σ_m ∇_{W_t} J(W; xᵐ, yᵐ),

M_b is the minibatch size, and ε is the learning rate. The key problem is therefore computing the gradient of the criterion with regard to each model parameter. A naive solution associates each edge of the CN with a partial derivative and obtains the gradient for a parameter by multiplying the edge derivatives along every path from the criterion to that parameter. This has two problems. First, a large amount of memory is needed to keep all the edge derivatives. Second, there are many duplicated computations, since the derivative of a shared branch is recomputed for every path that passes through it.

A better approach is to associate each node n with a single gradient ∇_n^J, computed once by accumulating the contributions from all parents of n. For example, ∂J/∂V^(1) is computed only once and used twice when computing ∂J/∂W^(1) and ∂J/∂W^(2) with this approach, and it requires significantly less memory than the naive solution. This is analogous to common subexpression elimination in a conventional expression graph, only here the common subexpressions are the parents of the nodes, rather than the children. Automatic differentiation has been an active research area for decades and many techniques have been proposed.

In many cases, not all the gradients need to be computed. For example, the gradient with regard to an input value is never needed. When adapting the model, some of the model parameters do not need to be updated and thus it is unnecessary to compute the gradients with regard to these parameters. We can reduce the gradient computation by keeping a needGradient flag for each node.
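The per-node gradient accumulation described above can be sketched on a tiny scalar CN in plain Python; the graph J = (v + 1) + v·v with the shared branch v = w·x is illustrative:

```python
# Each node stores one gradient, accumulated from all of its parents, so the
# gradient of the shared branch v is computed once and reused for w.
w, x = 3.0, 2.0
v = w * x                  # shared branch, child of both a and b
a = v + 1.0                # first parent of v
b = v * v                  # second parent of v
J = a + b                  # training criterion node

grad = {"J": 1.0, "a": 0.0, "b": 0.0, "v": 0.0, "w": 0.0}
# Process nodes in reverse order of the forward computation: J, b, a, v, w.
grad["a"] += grad["J"] * 1.0          # dJ/da
grad["b"] += grad["J"] * 1.0          # dJ/db
grad["v"] += grad["a"] * 1.0          # contribution through parent a
grad["v"] += grad["b"] * 2.0 * v      # contribution through parent b
grad["w"] += grad["v"] * x            # dJ/dv is reused here, not recomputed
```

With w = 3 and x = 2 we have v = 6, so dJ/dv = 1 + 2v = 13 is accumulated once and dJ/dw = 13 · x = 26 follows with a single extra multiplication.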
Once the flags of the leaf nodes (either input values or model parameters) are known, the flags of the non-leaf nodes can be determined using Algorithm 1.5, which is essentially a depth-first traversal over the DAG. Since Algorithms 1.2 and 1.5 are both essentially depth-first traversals over the DAG, and both only need to be executed once, they may be combined into one function.

Since every instantiation of a CN is task dependent and different, it is critical to have a way to check and verify the gradients computed automatically. A simple technique is to estimate the gradient numerically as

∂J/∂w_ij ≈ (J(w_ij + ε) − J(w_ij − ε)) / (2ε),

where w_ij is the (i, j)-th element of a model parameter W, ε is a small constant typically set to 10⁻⁴, and J(w_ij + ε) and J(w_ij − ε) are the objective function values evaluated with all other parameters fixed and w_ij changed to w_ij + ε and w_ij − ε, respectively. In most cases the numerically estimated gradient should agree with the automatically computed gradient to several significant digits. Note that ε should not be made too small, which would lead to numerical round-off errors.

### Typical Computation Nodes

For the above forward computation and gradient calculation algorithms to work, we have assumed that each type of computation node implements a function Evaluate to evaluate the value of the node given the values of its child nodes, and a function ComputePartialGradient(child) to compute the gradient of the training criterion with regard to the child node child, given the node value V_n (for simplicity we will drop the subscript in the following discussion), the gradient ∇_n^J of the node n, and the values of all its child nodes. In this section we introduce the most widely used computation node types and the related Evaluate and ComputePartialGradient(child) functions. In the following discussion we assume λ is a scalar, X and Y are the matrices of the first and second operand, respectively, d is a column vector that represents the diagonal of a square matrix, V is the value of the current node, ∇_n^J is the gradient of the current node, ∘ is the element-wise product, • is the inner product applied to each row, δ (.)
is the Kronecker delta, 1_{m,n} is an m × n matrix with all 1's, X^α is the element-wise power, and vec(X) is the vector formed by concatenating the columns of X. We treat each minibatch of input as a matrix in which each column is a sample. In all the derivations of the gradients we use the chain-rule identity ∂J/∂x_ij = Σ_{k,l} (∂J/∂v_kl)(∂v_kl/∂x_ij).

#### Computation Node Types with No Operand

The values of a computation node that has no operand are given instead of computed. As a result, both the Evaluate and ComputePartialGradient(child) functions for these computation node types are empty.

- Parameter: used to represent model parameters that need to be saved as part of the model.
- InputValue: used to represent features, labels, or control parameters that are provided by users at run time.

#### Computation Node Types with One Operand

In these computation node types, Evaluate = V(X) and ComputePartialGradient(X) = ∇_X^J.

- Negate: reverse the sign of each element in the operand X:
  V = −X,  ∇_X^J ← ∇_X^J − ∇_n^J.
- Sigmoid: apply the sigmoid function element-wise to the operand X:
  V = 1 / (1 + e^{−X}),  ∇_X^J ← ∇_X^J + ∇_n^J ∘ V ∘ (1 − V),
  where the gradient follows by observing that ∂v_ij/∂x_ij = v_ij (1 − v_ij).
- Tanh: apply the tanh function element-wise to the operand X:
  V = tanh(X),  ∇_X^J ← ∇_X^J + ∇_n^J ∘ (1 − V ∘ V).
- ReLU: apply the rectified linear operation element-wise to the operand X:
  V = max(0, X),  ∇_X^J ← ∇_X^J + ∇_n^J ∘ δ(X > 0),
  where δ(X > 0) is 1 for positive elements of X and 0 otherwise.
- Log: apply the log function element-wise to the operand X:
  V = log(X),  ∇_X^J ← ∇_X^J + ∇_n^J ∘ X^{−1} (element-wise reciprocal).
- Exp: apply the exponential function element-wise to the operand X:
  V = e^X,  ∇_X^J ← ∇_X^J + ∇_n^J ∘ V.
- Softmax: apply the softmax function column-wise to the operand X; each column is treated as a separate sample:
  v_ij = e^{x_ij} / Σ_k e^{x_kj},  ∂J/∂x_ij = v_ij (∂J/∂v_ij − Σ_k (∂J/∂v_kj) v_kj).
- SumElements: sum over all elements in the operand X.
The value is v = Σ_{i,j} x_ij, and the gradient can be derived by noting that v and ∇_n^J are scalars and ∂v/∂x_ij = 1, so ∇_X^J ← ∇_X^J + ∇_n^J 1_{m,n}.

- L1Norm: take the matrix L₁ norm of the operand X:
  v = Σ_{i,j} |x_ij|,  ∇_X^J ← ∇_X^J + ∇_n^J sgn(X).
- L2Norm: take the matrix L₂ norm (Frobenius norm) of the operand X:
  v = √(Σ_{i,j} x_ij²),  ∇_X^J ← ∇_X^J + (∇_n^J / v) X,
  where again v and ∇_n^J are scalars.

#### Computation Node Types with Two Operands

In these computation node types, Evaluate = V(a, Y), where a can be X, λ, or d, and ComputePartialGradient(b) = ∇_b^J, where b can be X, Y, or d.

- Scale: scale each element of Y by λ:
  V = λY,  ∇_λ^J ← ∇_λ^J + vec(∇_n^J)ᵀ vec(Y),  ∇_Y^J ← ∇_Y^J + λ∇_n^J.
- Times: matrix product of the operands X and Y. Must satisfy X.cols = Y.rows:
  V = XY,  ∇_X^J ← ∇_X^J + ∇_n^J Yᵀ,  ∇_Y^J ← ∇_Y^J + Xᵀ∇_n^J.
- ElementTimes: element-wise product of two matrices. Must satisfy X.rows = Y.rows and X.cols = Y.cols:
  V = X ∘ Y,  ∇_X^J ← ∇_X^J + ∇_n^J ∘ Y.
  The gradient ∇_Y^J can be derived exactly the same way due to symmetry.
- Plus: sum of two matrices X and Y. Must satisfy X.rows = Y.rows. If X.cols ≠ Y.cols but one is a multiple of the other, the smaller matrix is expanded by repeating itself: V = X + Y. The gradient ∇_X^J can be derived by observing that when X has the same dimensions as V, ∇_X^J ← ∇_X^J + ∇_n^J; when X has fewer columns than V, the repeated columns of ∇_n^J are summed into the corresponding columns of ∇_X^J.
- Minus: difference of two matrices X and Y. Must satisfy X.rows = Y.rows. If X.cols ≠ Y.cols but one is a multiple of the other, the smaller matrix is expanded by repeating itself: V = X − Y. The derivation of the gradients is similar to that for the Plus node, with the sign reversed for Y.
- DiagTimes: the product of a diagonal matrix (whose diagonal equals d) and an arbitrary matrix Y. Must satisfy d.rows = Y.rows:
  V = diag(d) Y,  ∇_d^J ← ∇_d^J + ∇_n^J • Y,  ∇_Y^J ← ∇_Y^J + diag(d) ∇_n^J,
  where • is the inner product applied to each row.
- Dropout: randomly set a fraction λ of the values of X to zero and scale the rest so that the expectation of the sum is not changed:
  m_ij = 0 with probability λ and m_ij = 1/(1 − λ) otherwise,  V = X ∘ M.
  Note that λ is a given value instead of part of the model, so we only need the gradient with regard to X. If λ = 0 then V = X, which is a trivial case. Otherwise it is equivalent to the ElementTimes node with a randomly generated mask M: ∇_X^J ← ∇_X^J + ∇_n^J ∘ M.
- KhatriRaoProduct: column-wise cross product of two matrices X and Y. Must satisfy X.cols = Y.cols. Useful for constructing tensor networks:
  v_{·j} = x_{·j} ⊗ y_{·j}.
  The gradient ∇_X^J can be derived column by column: the j-th column of ∇_X^J receives G_jᵀ y_{·j}, where G_j is the j-th column of ∇_n^J reshaped into a Y.rows × X.rows matrix. The gradient ∇_Y^J can be derived similarly.
- Cos: column-wise cosine distance of two matrices X and Y. Must satisfy X.cols = Y.cols. The result is a row vector. Frequently used in natural language processing tasks:
  v_j = x_{·j}ᵀ y_{·j} / (‖x_{·j}‖ ‖y_{·j}‖).
  The gradient ∇_X^J can be derived by observing that
  ∂v_j/∂x_{·j} = y_{·j} / (‖x_{·j}‖ ‖y_{·j}‖) − v_j x_{·j} / ‖x_{·j}‖².
  The gradient ∇_Y^J can be derived similarly.
- ClassificationError: compute the total number of columns in which the indexes of the maximum values disagree. Each column is considered as a sample and δ is the Kronecker delta. Must satisfy X.cols = Y.cols:
  v = Σ_j (1 − δ(arg max_i x_ij, arg max_i y_ij)).
  This node type is only used to compute classification errors at decoding time and is not involved in model training. For this reason, calling ComputePartialGradient(b) should just raise an error.
- SquareError: compute the square of the Frobenius norm of the difference X − Y. Must satisfy X.rows = Y.rows and X.cols = Y.cols:
  v = Tr((X − Y)(X − Y)ᵀ),  ∇_X^J ← ∇_X^J + 2∇_n^J (X − Y),  ∇_Y^J ← ∇_Y^J − 2∇_n^J (X − Y).
  Note that v is a scalar.
- CrossEntropy: compute the sum of cross entropies computed column-wise (over samples), where each column of X and Y is a probability distribution. Must satisfy X.rows = Y.rows and X.cols = Y.cols:
  v = −vec(X)ᵀ vec(log Y),  ∇_X^J ← ∇_X^J − ∇_n^J log Y,  ∇_Y^J ← ∇_Y^J − ∇_n^J (X ⊘ Y),
  where ⊘ is element-wise division. Note that v is a scalar.
- CrossEntropyWithSoftmax: same as CrossEntropy except that Y contains values before the softmax operation (i.e., unnormalized):
  v = −vec(X)ᵀ vec(log softmax(Y)).
  The gradient ∇_X^J is the same as in the CrossEntropy node. To derive the gradient ∇_Y^J we note that ∂v/∂Y = softmax(Y) − X and get ∇_Y^J ← ∇_Y^J + ∇_n^J (softmax(Y) − X).

#### Computation Node Types for Computing Statistics

Sometimes we only want to get some statistics of the input values (either input features or labels). For example, to normalize the input features we need to compute the mean and standard deviation of the input features.
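Two of the gradient rules above can be checked against finite differences in NumPy; the shapes, values, and variable names in this sketch are illustrative:

```python
# Numerical check of the Times gradient rule grad_X = grad_V @ Y.T and the
# CrossEntropyWithSoftmax rule grad_Y = softmax(Y) - X (for a unit top gradient).
import numpy as np

rng = np.random.default_rng(1)
eps = 1e-6

# Times node with the toy criterion J = sum(V), so dJ/dV is all ones.
X, Y = rng.standard_normal((2, 3)), rng.standard_normal((3, 4))
grad_V = np.ones((2, 4))
grad_X = grad_V @ Y.T                       # dJ/dX += (dJ/dV) Y^T
Xp = X.copy(); Xp[0, 0] += eps
numeric_X = ((Xp @ Y).sum() - (X @ Y).sum()) / eps

# CrossEntropyWithSoftmax: L holds one-hot labels, S pre-softmax scores.
def softmax_cols(Z):
    E = np.exp(Z - Z.max(axis=0))
    return E / E.sum(axis=0)

L = np.array([[1.0, 0.0], [0.0, 1.0]])
S = np.array([[0.2, -0.5], [1.0, 0.3]])

def ce(S):
    return -np.sum(L * np.log(softmax_cols(S)))

grad_S = softmax_cols(S) - L                # dJ/dY = softmax(Y) - X
Sp = S.copy(); Sp[0, 0] += eps
numeric_S = (ce(Sp) - ce(S)) / eps
```

This mirrors the verification technique from the previous section: the one-sided finite difference should agree with the analytic gradient to several significant digits.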
In speech recognition we need to compute the frequencies (means) of the state labels to convert state posterior probabilities to scaled likelihoods, as explained in Chapter ??. Unlike the other computation node types just described, computation node types for computing statistics do not require a gradient computation function (i.e., this function should not be called for these types of nodes) and often need to be precomputed before model training starts. Here we list the most popular computation node types in this category.

- Mean: compute the mean of the operand X across the whole training set. When the computation is finished, the node is marked so as to avoid recomputation. When a minibatch of input X is fed in, the running mean over the k samples accumulated so far is updated as
  v ← (1 / (k + X.cols)) X 1_{X.cols,1} + (k / (k + X.cols)) v.
  Note here X.cols is the number of samples in the minibatch.
- InvStdDev: compute the inverse standard deviation of the operand X element-wise across the whole training set. In the accumulation step, the running mean and the running mean of squares are accumulated in the same way as in the Mean node; when the computation is finished, s = 1 ⊘ √(E[X ∘ X] − E[X] ∘ E[X]) and the node is marked so as to avoid recomputation.
- PerDimMeanVarNorm: compute the normalized operand X using the mean m and the inverse standard deviation s for each sample. Here X is a matrix whose number of columns equals the number of samples in the minibatch, and m and s are column vectors that need to be expanded (repeated column-wise) before the element-wise operations are applied:
  V = (X − m) ∘ s.

### Convolutional Neural Network

The convolutional neural network (CNN) [19, 5, 17, 18, 2, 6, ?, 22, 1, 8, 21] provides shift invariance and is critical to achieving state-of-the-art performance on image recognition. It has also been shown to improve speech recognition accuracy over pure DNNs on some tasks.

- Convolution: convolve element-wise products of a kernel with an image. An example of the convolution operation is shown in Figure ??. A direct implementation of this evaluation function involves many small matrix operations and can be slow. Chellapilla et al. proposed packing the kernel and the input patches into large matrices so that the whole convolution can be computed as a single large matrix product. Note that this technique enables better parallelization with large matrix operations but introduces additional cost to pack and unpack the matrices.
In most conditions the gain outweighs the cost. By composing the convolution node with plus nodes and element-wise nonlinearity nodes, we can add bias and nonlinearity to the convolution operation.

- MaxPooling: apply the maximum pooling operation to the input X inside a window of size K_r × K_c for each channel. The operation window moves along the input with strides (or subsampling rates) S_r and S_c in the vertical (row) and horizontal (column) directions, respectively. The pooling operation does not change the number of channels, so C_v = C_x. For each output channel and the (i, j)-th input slice X_ij of size K_r × K_c we have