Michael E. Brandt received his B.S. in physics from the Polytechnic Institute of New York, and his M.S. and Ph.D. in biomedical engineering from the University of Houston. He is an assistant professor in the Department of Psychiatry and Behavioral Sciences at the University of Texas Medical School at Houston, and teaches a graduate course in digital signal processing at the University of Houston. His resarch interests are in the analysis of brain electrical signals. Dr. Brandt can be contacted at the University of Texas Medical School, 6431 Fannin, Room 5.240, Houston, TX 77030.
An important problem in systems science is the determination of the relationship between two signals in terms of their primary sources. This problem has implications in diverse fields. In communication engineering, for example, you often want to determine the characteristics of the channel connecting a transmitter and a receiver. In the study of epilepsy, you infer the focus of a brain seizure from the microvolt electrical signals measured from the scalp surface (the EEG).
Two related questions arise:
- Are the signals related by a single source, dependent sources, or independent sources?
- If two signals are related by a common source, is there a time-sequential relationship (a delay) between them?
The relationship that we are seeking between two signals will ultimately be of a causal nature. However, two signals can be highly dependent on one another yet bear no causal relationship. This is the case if the two signals are produced by a primary source. Unless the signal produced by this common source is included in the systems model, one cannot justify inferences concerning the causal nature of the two signals. The best one can hope for is a characterization of the degree of the association between the signals. This may in turn lead to insights into the nature of the sources themselves.
To illustrate the issues, I present here a program that lets you use two different methods for computing the degree of association and the time delay (if any) between two simulated signals.
Linear Correlation
Two signals x and y can be related through a simple linear system in the presence of additive noise by:
(1) y[n] = L{x[n]} + e[n]where L{} is any linear transformation acting on the input signal x[n], e[n] represents white noise independent of x[n], n is the discrete index, and y[n] is the output signal. The relationship between the two signals can be unambiguously represented by the correlation coefficient (r) in which the points of a scattergram of y vs. x are fitted with a straight line using the least mean-square method. The correlation coefficient between x and y can then be expressed as the square root of A/B, where A is the total variance in y minus the amount of variance in y not explainable from its linear dependence on x and B is the total variance in y.
The correlation coefficient takes on continuous values from -1 to +1. Numbers close to +1 or -1 indicate a high linear relationship (either direct or inverse respectively), while coefficients close to 0 indicate a small or nonexistent linear relationship. The square of r is then a measure of the proportion of the total variance in y explainable by its linear dependence on x. It can be estimated computationally as in Equation (2).
Note that the r2 for x vs. y is equal to the r2 for y vs. x.
When a delay is present between y and x it can be estimated by computing the cross-correlation between them. This is done in practice by computing r2 as a function of a shift between the two signals. The maximum r2 represents the linear association between the two signals, and the time shift at which the maximum is found is an estimate of the delay.
Coherence Analysis
Coherence analysis is a frequency-domain method for determing the linear relationship between two signals. The coherence is computed by dividing the estimate of the cross-power spectrum of the two signals by the product of their auto-spectra as a function of frequency. It measures the linear dependence of each frequency component of signal y on signal x. I won't go into further detail here. The cross-phase spectrum measures the phase lag of each frequency component between x and y.
Nonlinear Analysis
If the relationship between the two signals is known to be nonlinear, then in general the linear correlation coefficient cannot be expected to properly quantify the association between them. A generalized model of the system relationship is:
(3) y[n] = N{x[n]} + e[n]where N{} now represents a nonlinear transformation of the input signal, with the other variables as in eq. (1) above. If the form of the functional relationship between x and y is known beforehand then that relationship can be used to perform a least-squares fit of the x vs. y scattergram. A "nonlinear regression coefficient" can then be arrived at analogously to eq. (2) by computing the proportion of the variance in y accounted for by this functional fit. In practice, you approximate the x-y relationship with as low an order polynomial as you can get away with. Once you determine the order, you perform curvilinear regression, again using the least-squares-fit method.
In many signal-analysis problems, curvilinear regression cannot be used effectively for two main reasons:
- the order of the polynomial fit may be too high to implement practically
- the statistical characteristics of the signals change in time (they are "nonstationary") which means that the order of the polynomial fit would have to be recomputed whenever the signals change character.
(total y variance) - (unexplained y variance) ---------------------------------------------- (total y variance)Pijn's approach consists of approximating the regression curve using a piecewise-linear fit through the scattergram. The approximation procedure is a very simple one as follows:
- The x data range is divided up into a number (L) of equal sized bins.
- For each bin i, the midpoint of the x values (p[i]) and the average amplitude of the y values (q[i]) is obtained.
- Finally, the (p[i], q[i]) points are connected by straightline segments to form the piecewise linear approximation to the regression curve (f(x)).
where N is the number of samples, and <y> is the mean value of y[n].
One of the advantages of using h2 over r2 is that if the x-y relationship is linear, then h2 will be equal to r2, whereas if the relationship is nonlinear, r2 will not be close to 1 while h2 will be close to 1.
In general, the h2 of x vs. y will not be equal to the h2 of y vs. x. This will most likely be the case if x is not a single-valued function of y, or y is not a single valued function of x. If, however, the relationship is single-valued, the asymmetry should be small. Also, if the number of bins (L) is equal to the number of data points (N) then h2 will be 1. (All data points will be perfectly predicted.) Therefore, by increasing the number of bins, you can obtain an arbitrarily high fit to the scattergram. However, in practice h2 rapidly converges to its theoretical value as a function of the number of bins, and then increases very slowly thereafter up to 1. Empirically, Pijn found that a maximum of 10 bins for N > 512 was sufficient for characterizing the h2 between two EEG signals.
The h2 can also be used to estimate a delay between two signals that are nonlinearly related. You do this in a manner comparable to the computation of the cross-correlation function. Compute h2 as a function of lag between the two signals. The maximum h2 obtained and the corresponding lag represent the association and delay (respectively) between x and y.
Listing 1 is a test program for computing r2, h2, and the delay for several predetermined functions. These functions, y(x), are as follows:
- a linear function, y[n] = ax[n] + be[n], and four non-linear functions
- y[n] = sin(x[n]) + be[n]
- y[n] -- sign(x[n])(x[n]2) + be[n]
- y[n] = 3x[n]3 +be[n]
- y[n] = 5x[n]5 + be[n]
The function estimate_delay is capable of both advancing or lagging one signal with respect to the other with a user-selected time shift. The sub-routine will then compute both r2 and h2 as a function of increasing lag between the two signals in order to estimate the delay between them. For each of the functions included in the program, the two methods are equally able to find the proper delay.
Conclusion
The h2 measure represents one possible time-domain approach for estimating either the linear or nonlinear relationship between two signals. Other possible approaches include polynomial regression and spline fitting. The non-linvar function provides a starting point for these alternative methodologies.
Reference
Pijn, J.P.M. (1990). "Quantitative Evaluation of EEG Signals in Epilepsy." Ph.D. Thesis, University of Amsterdam.