{"title": "Optimizing Correlation Algorithms for Hardware-Based Transient Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 678, "page_last": 684, "abstract": null, "full_text": "Optimizing Correlation Algorithms for \nHardware-based Transient Classification \n\nR. Timothy Edwardsl , Gert Cauwenberghsl, and Fernando J. Pineda2 \n\n1 Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD 21218 \n\n2 Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20723 \n\ne-mail: {tim, gert, fernando}@bach.ece. jhu. edu \n\nAbstract \n\nThe perfonnance of dedicated VLSI neural processing hardware depends \ncritically on the design of the implemented algorithms. We have pre(cid:173)\nviously proposed an algorithm for acoustic transient classification [1]. \nHaving implemented and demonstrated this algorithm in a mixed-mode \narchitecture, we now investigate variants on the algorithm, using time \nand frequency channel differencing, input and output nonnalization, and \nschemes to binarize and train the template values, with the goal of achiev(cid:173)\ning optimal classification perfonnance for the chosen hardware. \n\n1 Introduction \n\nAt the NIPS conference in 1996 [1], we introduced an algorithm for classifying acoustic \ntransient signals using template correlation. While many pattern classification systems use \ntemplate correlation [2}, our system differs in directly addressing the issue of efficient im(cid:173)\nplementation in analog hardware, to overcome the area and power consumption drawbacks \nof equivalent digital systems. In the intervening two years, we have developed analog cir(cid:173)\ncuits and built VLSI hardware implementing both the template correlation and the frontend \nacoustic processing necessary to map the transient signal into a time-frequency representa(cid:173)\ntion corresponding to the template [3, 4]. 
In the course of hardware development, we have been led to reevaluate the algorithm in the light of the possibilities and the limitations of the chosen hardware.

The general architecture is depicted in Figure 1 (a), and excellent agreement between simulations and experimental output from a prototype is illustrated in Figure 1 (b). Issues of implementation efficiency and circuit technology aside, this paper specifically addresses further improvements in classification performance achievable by algorithmic modifications, tailored to the constraints and strengths of the implementation medium.

Figure 1: (a) System architecture of the acoustic transient classifier (an N x M template correlator with shift-and-accumulate output). (b) Demonstration of accurate computation in the analog correlator on a transient classification task: simulated versus measured correlator output over time.

2 The transient classification algorithm

The core of our architecture performs the running correlation between an acoustic input and a set of templates for distinguishing between Z distinct classes. A simple template correlation equation for the acoustic transient classification can be written:

c_z[t] = K_z \sum_{m=1}^{M} \sum_{n=1}^{N} x[t-n, m] \, p_z[n, m]    (1)

where M is the number of frequency channels of the input, N is the maximum number of time bins in the window, and x is the array of input signals representing the energy content in each of the M bandpass frequency channels. The inputs x are normalized across channels using an L-1 normalization so that the correlation is less affected by volume changes in the input.
The matrix p_z contains the template pattern values for pattern z out of a total of Z classes; K_z is a constant gain coefficient for class z, and t is the current time. This formula produces a running correlation c_z[t] of the input array with the template for class z. A signal is classified as belonging to class z when the output c_z exceeds the output for all other classes at a point in time t determined by simple segmentation of the input.

To train and evaluate the system, we used a database of 22 recorded samples of each of 10 different classes of \"everyday\" transients such as the sounds made by aluminum cans, plastic tubs, handclaps, and the like.

Each example transient recording was processed through a thirty-two channel constant-Q analog cochlear filter with output taps spaced on a logarithmic frequency scale [6]. For the simulations, the frontend system outputs were sampled and saved to disk, then digitally rectified and smoothed with a lowpass filter function with a 2 ms time constant. These thirty-two channel outputs representing short-term average energy in each frequency band were decimated to 500 Hz and normalized with the function

x[t, m] = y[t, m] / \sum_{k=1}^{M+1} y[t, k],    (2)

where y[t, M+1] is a constant-valued input added to the system in order to suppress noise in the normalized outputs during periods of silence. The additional output x[t, M+1] becomes maximum during the periods of silence and minimum during presentation of a transient event. This extra output can be used to detect onsets of transients, but is not used in the correlation computation of equation (1).

Template values p_z are learned by automatically aligning all examples of the same class in the training set using a threshold on the normalization output x[t, M+1], and averaging the values together over N samples, starting a few samples before the point of alignment.
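The normalization of equation (2) and one step of the running correlation of equation (1) can be sketched in a few lines of NumPy. The shapes, function names, and random data below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

M, N, Z = 32, 64, 10                     # frequency channels, time bins, classes

def normalize(y, silence_level=0.05):
    """L-1 normalization across channels (eq. 2). A constant (M+1)-th
    input suppresses noise in the normalized outputs during silence;
    its level here is an arbitrary assumption."""
    y_ext = np.append(y, silence_level)  # append the constant y[t, M+1]
    return y_ext / y_ext.sum()

def correlate(x_window, templates, gains):
    """One step of the running correlation c_z[t] of eq. (1).
    x_window: (N, M) most recent normalized inputs x[t-n, m];
    templates: (Z, N, M) values p_z[n, m]; gains: (Z,) values K_z."""
    return gains * np.einsum('nm,znm->z', x_window, templates)

rng = np.random.default_rng(0)
x_window = rng.random((N, M))            # dummy input window
templates = rng.random((Z, N, M))        # dummy templates
c = correlate(x_window, templates, np.ones(Z))
predicted = int(np.argmax(c))            # class with the largest output
```

As in the text, the extra silence output x[t, M+1] is kept out of the correlation itself and would only be used for onset detection and alignment.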
\nClass outputs are normalized relative to one another by mUltiplying each output by a gain \nfactor K z , computed from the template values using the L-2 norm function \n\nKz = \n\nN \n\nM \nL LPz[n,m]2. \nm=l n=l \n\n(3) \n\nWe evaluated the accuracy of the system with a cross-validation loop in which we train the \nsystem on all of the database except one example of one class, then test on that remaining \nexample. repeating the test for each of the 220 examples in the database. The baseline \nalgorithm gives a classification accuracy of 96.4%. \n\n3 Single-bit template values \n\nA major consideration for hardware implementations (both digital and analog) is the mem(cid:173)\nory storage required by the templates, one of which is required for each class. Minimal \nstorage space in terms of bits per template is practical only if the algorithm can be proved \nto perform acceptably well under decreased levels of quantization of the template values. \n\nAt one bit per template location (i.e., M x N bits per template), the complexity of the hard(cid:173)\nware is greatly simplified, but it is no longer obvious what method is best to use for learn(cid:173)\ning the template values, or for calculating the per-class gains. The choice of the method is \nguided by knowledge about the acoustic transients themselves, and simulation to evaluate \nits effect on the accuracy of a typical classification task. \n\n4 Simulations of different zero-mean representations \n\nOne bit per template value is a desirable goal, but realizing this goal requires reevaluating \nthe original correlation equation. The input values to be correlated represent band-limited \nenergy spectra, and range from zero to some maximum determined by the L-l normaliza(cid:173)\ntion. 
To determine the value of a template bit, the averaged value over all examples of the class in the training set must be compared to a threshold (which itself must be determined), or else the input itself must be transformed into a form with zero average mean value. In the latter method, the template value is determined by the sign of the transformed input, averaged over all examples of the class in the training set.

The obvious transformations of the input which provide a vector of zero-mean signals to the correlator are the time derivative of each input channel, and the difference between neighboring channels. Certain variations of these are possible, such as a center-surround computation of channel differences, and zero-mean combinations of time and channel differences. While there is evidence that center-surround mechanisms are common to neurobiological signal processing of various sensory modalities in the brain, including processing in the mammalian auditory cortex [5], time derivatives of the input are also plausible in light of the short time base of acoustic transient events. Indeed, there is no reason to assume a priori that channel differences are even meaningful on the time scale of transients.

Table 1 shows simulation results, where classification accuracy on the cross-validation test is given for different combinations of continuous-valued and binary inputs and templates, and different zero-mean transformations of the input.

Table 1: Simulation results with different architectures.

Method             | Both Cont. | Binary Input | Both Binary | Binary (1, -1) Template | Binary (1, 0) Template
One-to-One         | 96.40%     | -            | -           | -                       | -
Time Difference    | 85.59%     | 65.32%       | 59.46%      | 82.43%                  | 81.98%
Channel Difference | 90.54%     | 53.60%       | 95.05%      | 94.59%                  | 94.14%
Center-Surround    | 92.79%     | 53.60%       | 95.05%      | 92.34%                  | 92.34%

There are several significant points to the results of these classification tasks. The first is to note that in spite of the fact that acoustic transient events are short-term and the time steps between the bins in the template are as low as 2 ms, using time differences between samples does not yield reliable classification when either the input or the template or both is reduced to binary form. However, reliability remains high when the correlation is performed using channel differences. The implication is that even the shortest transient events have stable and reliable structure in the frequency domain, a somewhat surprising conclusion given the impulsive nature of most transients.

Another interesting point is that we observe no significant difference between the use of pairwise channel differences and the more complicated center-surround mechanism (twice the channel value minus the value of the two neighboring channels). The slight decrease in accuracy for the center-surround in some instances is most likely due only to the fact that one less channel contributes information to the correlator than in the pairwise channel difference computation. When accuracy is constant, a hardware implementation will always prefer the simpler mechanism.

Very little difference in accuracy is seen between the use of a binary (1, -1) representation and a binary (1, 0) representation, in spite of the fact that all zero-valued template positions do not contribute to the correlation output.
This lack of difference is a result of the choice of the L-1 normalization across the input vector, which ensures that the part of the correlation due to positive template values is roughly the same magnitude as that due to negative template values, leading to a redundant representation which can be removed without affecting classification results. In analog hardware, particularly current-mode circuits, the (1, 0) template representation is much simpler to implement.

Time differencing of the input can be efficiently realized in analog hardware by commuting the time-difference calculation to the end of the correlation computation and implementing it with a simple switched-capacitor circuit. Taking differences between input channel values, on the other hand, is not so easily reduced to a simple hardware form. To find a reasonable solution, we simulated a number of different combinations of channel differencing and binarization. Table 2 shows a few examples. The first row is our standard implementation of channel differences using binary (1, 0) templates and continuous-valued input. The drawback of this method in analog hardware is the matching between negative and positive parts of the correlation sum. We found two ways to get around this problem without greatly compromising the system performance: The first, shown in the second row of Table 2, is to add to the correlation sum only if the channel difference is positive and the template value is 1 (one-quadrant multiplication). Another (shown in the last row) is to add the maximum of each pair of channels if the template value is 1, which is preferable in that it uses the input values directly and does not require computing a difference at all. Unfortunately, it also adds a large component to the output which is related only to the total energy of the input and therefore is common to all class outputs, reducing the dynamic range of the system.

Table 2: Simulation results for different methods of computing channel differences.

Method                | Accuracy
Channel difference    | 94.14%
One-quadrant multiply | 92.34%
Maximum channel       | 93.69%

5 Optimization of the classifier using per-class gains

The per-class gain values K_z in equation (1) are optimal for the baseline algorithm when using the L-2 normalization. The same normalization applied to the binary templates (when the template value is assumed to be either +1 or -1) yields the same K_z value for all classes. This unity gain on all class outputs is assumed in all the simulations of the previous section. A careful evaluation of errors from several runs indicated the possibility that different gains for each class could improve recognition rates, and simple experiments with values tweaked by hand proved this suspicion to be true.

To automate the process of gain optimization, we consider the templates, as determined by averaging together examples of each class in the training set, to be fixed. Then we compute the correlation between each template and the aligned, averaged inputs for each class which were used to generate the templates. The result is a Z x Z matrix, which we denote C, of expected values for the correlation between a typical example of a transient input and the template for its own class (diagonal elements C_ii) and the templates for all other classes (off-diagonal elements C_ij, i != j). Each column of C is like the correlator outputs on which we make a classification decision by choosing the maximum. Therefore we wish to maximize C_ii with respect to all other elements in the same column. The only degree of freedom for adjusting these values is to multiply the correlation output of each template z by a constant coefficient K_z. This corresponds to multiplying each row of C by K_z.
This per-class gain mechanism is easily transferred to the analog hardware domain.

In the case of continuous-valued templates, an optimal solution can be directly evaluated and yields the L-2 normalization. However, for all binary forms of the template and/or the input, direct evaluation is impossible and the solution must be found by choosing an error function E to minimize or maximize. The error function must assign a large error to any off-diagonal element in a column that approaches or exceeds the diagonal element in that column, but must not force the cross-correlations to arbitrarily low negative values. A minimizing function that fits this description is

E = \sum_{i} \sum_{j \neq i} \exp(K_j C_{ji} - K_i C_{ii}).    (4)

This function unfortunately has no closed-form solution for the coefficients K_i, which must be determined numerically using Newton-Raphson or some other iterative method.

Improvements in the recognition rates of the classification task using this optimization of per-class gains are shown in Table 3, where we have considered only the case of inputs and templates encoding channel differences. Although the database is small, the gains of 2 to 4% for the quantized cases are significant. For this particular simulation we used a different type of frontend section to verify that the performance of the correlation algorithm was not linked to a specific frontend architecture. To generate these performance values, we used sixteen channels with the inputs digitally processed through a constant-Q bandpass filter having a Q of 5.0 and with center frequencies spaced on a mel scale from 100 Hz to 4500 Hz. The bandpass filtering was followed by rectification and smoothing with a lowpass filter function with a cutoff frequency scaled logarithmically across channels, from 60 Hz to 600 Hz. The channel output data were decimated to a 500 Hz rate.
Half of the database was used to train the system, and half used to test. Performance is similar to that reported in the previous section in spite of the fact that the number of channels was cut in half, and the number of training examples was also cut in half. Slight gains in performance are most likely due to the cleaner digital filtering of the recorded data.

Table 3: System accuracy with and without per-class normalization.

Binarization     | Accuracy, optimized | Accuracy, non-optimized
none             | 100%                | 100%
template only    | 93%                 | 91%
template & input | 95%                 | 91%

6 System Robustness

We performed several experiments in addition to those covered in the previous sections. One of these was an evaluation of recognition accuracy as a function of the template length N (number of time bins), to determine what is a proper size for the templates. The result is shown in Figure 2 (a). This curve reaches a reliable maximum at about 50 time bins, from which our chosen size for the hardware implementation of 64 bins provides a safe margin of error. However, it is interesting to note that recognition accuracy does not drop to that of random chance until only two time bins are used (64 bits per template), and accuracy is nearly 50% with only 3 time bins (96 bits per template).

[Figure 2 (a): recognition accuracy as a function of template length N; plot data not recoverable from the source.]
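The per-class gain optimization of Section 5 can be sketched numerically as follows. Plain gradient descent stands in here for the Newton-Raphson iteration mentioned in the text, and the matrix C, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

def gain_error(K, C):
    """Error of eq. (4): E = sum_i sum_{j != i} exp(K_j C_ji - K_i C_ii)."""
    Z = C.shape[0]
    E = 0.0
    for i in range(Z):
        for j in range(Z):
            if j != i:
                E += np.exp(K[j] * C[j, i] - K[i] * C[i, i])
    return E

def optimize_gains(C, lr=0.01, steps=500):
    """Minimize eq. (4) over the gains K by gradient descent
    (a stand-in for the Newton-Raphson method cited in the text)."""
    Z = C.shape[0]
    K = np.ones(Z)                       # start from the unity gains
    for _ in range(steps):
        grad = np.zeros(Z)
        for i in range(Z):
            for j in range(Z):
                if j != i:
                    e = np.exp(K[j] * C[j, i] - K[i] * C[i, i])
                    grad[j] += C[j, i] * e   # dE/dK_j contribution
                    grad[i] -= C[i, i] * e   # dE/dK_i contribution
        K -= lr * grad
    return K

rng = np.random.default_rng(1)
C = 0.3 * rng.random((4, 4))             # dummy cross-class correlations
np.fill_diagonal(C, 1.0)                 # in-class correlations dominate
K_opt = optimize_gains(C)
```

Because E is a sum of exponentials of affine functions of K, it is convex, so a small-enough step size reliably reduces the error from the unity-gain starting point.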