### Pearson’s correlation coefficient-based analysis

Suppose that the time series data for factors *X* and *Y* with replicates are measured simultaneously. We denote them as *X* = *X*_{[1:n][1:m]} and *Y* = *Y*_{[1:n][1:m]}, where *n* is the number of samples (time points) and *m* is the number of replicates. Let *X*_{i[1:m]} and *Y*_{j[1:m]}, or, in more abbreviated form, *X*_{i} and *Y*_{j}, be the vectors containing the *m* replicates from the *i*-th time spot of *X* and the *j*-th time spot of *Y*, respectively. The application of Pearson’s Correlation Coefficient (PCC) requires taking the profile means, *i.e.*, $\bar{X}_i = \frac{1}{m}\sum_{k=1}^{m} X_{ik}$ and $\bar{Y}_j = \frac{1}{m}\sum_{k=1}^{m} Y_{jk}$. Then the PCC between *X* and *Y* is defined as:

$$ r = \frac{\sum_{i=1}^{n} (\bar{X}_i - \bar{X})(\bar{Y}_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (\bar{X}_i - \bar{X})^2 \sum_{i=1}^{n} (\bar{Y}_i - \bar{Y})^2}}, $$

where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} \bar{X}_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} \bar{Y}_i$ are the means of *X* and *Y*, respectively. The statistical significance of *r* is tested using the fact that $t = r\sqrt{(n-2)/(1-r^2)}$ follows a *t*-distribution (degrees of freedom: *v* = *n* – 2, mean: 0, and variance: *v* / (*v* – 2)) when *m* = 1. For a pair of non-replicated series where *m* = 1, PCC is a straightforward and powerful method to test and identify a linear relationship between two bivariate normally distributed random variables. It is widely adopted in the literature, but it has limitations when the real relationships are more complex: for example, the association between the two factors may occur only in a subinterval of the region of interest, or the change of one factor may respond to the change of the other with a time delay. Several methods, including the original LSA method, have been proposed to overcome such difficulties [11, 16].
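As a concrete illustration, the PCC on profile means and its *t*-statistic can be computed in a few lines; the toy data and function names below are invented for this sketch:

```python
from math import sqrt

def profile_mean(series):
    """Collapse the m replicates at each time spot to their mean."""
    return [sum(reps) / len(reps) for reps in series]

def pcc(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); compare against a
    t-distribution with v = n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# Toy data: n = 4 time points, m = 2 replicates per time spot.
X = [[1.0, 1.2], [2.1, 1.9], [2.9, 3.1], [4.2, 3.8]]
Y = [[2.0, 2.2], [3.9, 4.1], [6.1, 5.7], [7.9, 8.3]]
r = pcc(profile_mean(X), profile_mean(Y))
```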

### Local similarity analysis with replicates

The original LSA method considers only data without replicates. In this paper, we extend the Local Similarity Analysis (LSA) method [11] to samples with replicates. To formulate the algorithm, we suppose each sample has *m* replicates and let *F*(·) be some summarizing function for the repeated measurements. Thus, we extend the original LSA dynamic programming algorithm to data with replicates as follows:

(1) For *i*, *j* in {1,2,…,*n*}^{2}:

*P*_{0,j} = 0, *P*_{i,0} = 0, and *N*_{0,j} = 0, *N*_{i,0} = 0.

(2) For *i*, *j* in {1,2,…,*n*}^{2} with |*i* – *j*| ≤ *D*:

*P*_{i+1,j+1} = max{0, *P*_{i,j} + *S*_{XY}[*F*(*X*_{i}), *F*(*Y*_{j})]} and

*N*_{i+1,j+1} = max{0, *N*_{i,j} – *S*_{XY}[*F*(*X*_{i}), *F*(*Y*_{j})]}.

(3) *P*_{max}(*X*,*Y*) = max_{1≤i,j≤n} *P*_{i,j} and

*N*_{max}(*X*,*Y*) = max_{1≤i,j≤n} *N*_{i,j}.

(4) *S*_{max}(*X*,*Y*) = max{*P*_{max}(*X*,*Y*), *N*_{max}(*X*,*Y*)} / *n* and

*S*_{sgn}(*X*,*Y*) = sgn[*P*_{max}(*X*,*Y*) – *N*_{max}(*X*,*Y*)].

The *S*_{max}(*X*,*Y*) obtained is the maximum local similarity score possible over all configurations of the *m*-replicated time series *X* and *Y* within time delay *D*. In this extended algorithm, the scalars *x*_{i}’s and *y*_{i}’s from the non-replicated series in *Ruan et al.* [11] are replaced by the vector functions *F*(*X*_{i})’s and *F*(*Y*_{j})’s to handle data with replicates. Alternatively, we can also view the *F*(*X*_{i})’s and *F*(*Y*_{j})’s as input data for the original algorithm in *Ruan et al.* [11], except that they are *F*-transformed data. In addition, this extended LSA framework easily accommodates the original version of LSA without replicates as the special case *m* = 1.
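Steps (1)–(4) can be sketched directly in code. The sketch below assumes the inputs are already *F*-summarized and normalized series, and uses the product score *s*_{ij} = *F*(*X*_{i})·*F*(*Y*_{j}) of the original LSA formulation as *S*_{XY}:

```python
def elsa_score(x, y, D):
    """Dynamic programming sketch of the extended LSA score.

    x, y : F-summarized (one value per time spot) and normalized series.
    D    : maximum allowed time delay |i - j|.
    Returns (S_max, sign), where the sign distinguishes positive from
    negative local association.
    """
    n = len(x)
    # P accumulates positively associated runs, N negatively associated ones.
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    N = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            if abs(i - j) > D:
                continue
            s = x[i] * y[j]
            P[i + 1][j + 1] = max(0.0, P[i][j] + s)
            N[i + 1][j + 1] = max(0.0, N[i][j] - s)
    p_max = max(max(row) for row in P)
    n_max = max(max(row) for row in N)
    return max(p_max, n_max) / n, (1 if p_max >= n_max else -1)
```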

### Different ways of summarizing the replicate data

Notice that the only additional component we introduced in the eLSA algorithm is the function *F*. Many reports have suggested different possible forms for *F*, and several computational methods have been proposed for summarizing the additional information available from replicates, including the simple average method (abbreviated as ‘simple’), the Standard Deviation (SD)-weighted average method (abbreviated as ‘SD’), and the multivariate correlation coefficient method [17–19]. However, the result of the multivariate correlation coefficient method from *Zhu et al.* [17] can be shown to be the same as that of the ‘simple’ method. Therefore, in eLSA, we used the first two methods. We also propose the use of the median in place of the average and the Median Absolute Deviation (MAD) in place of the SD when robust statistics are needed to handle outliers [20]. The corresponding methods are named the simple median method (abbreviated as ‘Med’) and the MAD-weighted median method (abbreviated as ‘MAD’), respectively.

The ‘simple’ method is, in spirit, to take the mean profile to represent the replicated series. In practice, we take *F* to be the simple average of the repeated measurements: $F(X_i) = \bar{X}_i = \frac{1}{m}\sum_{k=1}^{m} X_{ik}$. The ‘SD’ method, on the other hand, takes the standard deviation of the replicates into account. Here we take *F* to be the replicate average divided by its standard deviation (SD): $F(X_i) = \bar{X}_i / SD(X_i)$. Importantly, this method utilizes the available variability information, and, as such, it is claimed to be better than the ‘simple’ method at estimating the true correlation [18]. However, for the ‘SD’ method to be effective, a relatively large number of replicates, *m*, is needed, *e.g.*, *m* ≥ 5. For a small number of replicates, the ‘SD’ method may not work well, since the standard deviation may not be reliably estimated. Further, if we replace the average with the median and the SD with the MAD, we obtain the ‘Med’ method, $F(X_i) = Median(X_i)$, and the ‘MAD’ method, $F(X_i) = Median(X_i) / MAD(X_i)$, where $MAD(X_i) = Median(|X_i - Median(X_i)|)$. The two transformations have properties similar to their corresponding average and SD versions, but they are more robust.
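Under the definitions above, the four summarizing functions can be sketched with Python’s standard statistics module (the ‘SD’ variant assumes *m* ≥ 2 and a nonzero standard deviation; ‘MAD’ assumes a nonzero MAD):

```python
from statistics import mean, median, stdev

def f_simple(reps):
    """'simple': average of the m replicates at a time spot."""
    return mean(reps)

def f_sd(reps):
    """'SD': replicate average divided by its standard deviation."""
    return mean(reps) / stdev(reps)

def f_med(reps):
    """'Med': median of the replicates (robust to outliers)."""
    return median(reps)

def mad(reps):
    """Median absolute deviation: Median(|X_i - Median(X_i)|)."""
    m = median(reps)
    return median(abs(r - m) for r in reps)

def f_mad(reps):
    """'MAD': replicate median divided by its MAD."""
    return median(reps) / mad(reps)
```

With an outlying replicate set such as [1, 2, 3, 4, 100], ‘Med’ returns 3 while ‘simple’ is dragged to 22, which illustrates the robustness argument for the median-based variants.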

### Bootstrap confidence interval for the LS score

With replicate data, researchers can study the variation of quantities of interest and give their confidence intervals. Due to the complexity of calculating the LS score, its probability distribution is hard to study theoretically. Thus, we resort to the bootstrap to give a bootstrap confidence interval (CI) for the LS score. The bootstrap is a re-sampling method for studying the variation of an estimated quantity based on the available sample data [21]. For a given type I error α, the 1 – α confidence interval is the estimated range that covers the true value with probability 1 – α. Thus, for a given number, *B*, of bootstraps, we construct the bootstrap sample set $\{(X^{*b}, Y^{*b})\}_{b=1}^{B}$, where each $X_i^{*b}$ and $Y_j^{*b}$ is sampled with replacement from *X*_{i} and *Y*_{j}, respectively. The rest of the calculation is the same as that used for the original data, and we obtain $S_{max}^{*1}, S_{max}^{*2}, \ldots, S_{max}^{*B}$. Without loss of generality, we suppose that these values are sorted in ascending order: $S_{max}^{*(1)} \le S_{max}^{*(2)} \le \cdots \le S_{max}^{*(B)}$. Then, a 1 – α bootstrap CI of *S*_{max} can be estimated by the interval between the (α/2)-th and (1 – α/2)-th empirical quantiles, $[S_{max}^{*(\lfloor B\alpha/2 \rfloor)}, S_{max}^{*(\lceil B(1-\alpha/2) \rceil)}]$, as suggested by *Efron et al.* [21].
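The procedure can be sketched as a percentile bootstrap. Here `score_fn` is a user-supplied placeholder standing in for the *S*_{max} calculation, not the actual eLSA implementation:

```python
import random

def bootstrap_ci(X, Y, score_fn, B=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a score on replicated series.

    X, Y : lists of time spots, each a list of m replicate values.
    At every time spot the replicates are resampled with replacement,
    the score is recomputed, and after B rounds the alpha/2 and
    1 - alpha/2 empirical quantiles of the sorted scores are returned.
    """
    rng = random.Random(seed)

    def resample(reps):
        return [rng.choice(reps) for _ in reps]

    scores = sorted(
        score_fn([resample(xi) for xi in X], [resample(yj) for yj in Y])
        for _ in range(B)
    )
    lo = scores[int(B * alpha / 2)]
    hi = scores[min(B - 1, int(B * (1 - alpha / 2)))]
    return lo, hi
```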

### Data normalization

eLSA analysis requires the series of factors *X* and *Y* to be normally distributed, but this may not be the case for real datasets. Therefore, normality of the data can be enforced through normalization. To accommodate possible nonlinear associations and the variation of scales within the raw data, we apply the following approach [22] to normalize the raw dataset before any LS score calculations. We use *F*(*X*_{i}) to denote the *F*-transformed data of the *i*-th time spot of a variable, *X*_{i}. First, we take

$$ Z_i = \Phi^{-1}\!\left(\frac{R_i}{n+1}\right), $$

where $R_i$ is the rank of $F(X_i)$ among $F(X_1), \ldots, F(X_n)$ and Φ is the cumulative distribution function of the standard normal distribution. We take *Z* = *Z*_{[1:n]} obtained through the above procedure as the normalization of *X*. Therefore, the normalization steps are taken after the *F*-transformation.
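A sketch of this rank-based normal-score transform, using `NormalDist.inv_cdf` from Python’s standard library (ties are broken by sort order here, a simplification):

```python
from statistics import NormalDist

def normal_score(values):
    """Map the value with rank R_i among n values to Phi^{-1}(R_i / (n + 1)),
    pushing an arbitrary series toward standard normality."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    z = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        z[idx] = NormalDist().inv_cdf(rank / (n + 1))
    return z
```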

### Permutation test to evaluate the statistical significance of LSA association

It is important to evaluate the statistical significance of the LS score, measured by the p-value: the probability of observing an LS score no smaller than the observed score when the two factors are not associated, locally or globally. To achieve this objective, a permutation test is used. To perform the test, we fix *Y* and reshuffle all the columns of *X* for each permutation. For a fixed number of permutations *L*, suppose {*X*^{(1)}, *X*^{(2)}, …, *X*^{(L)}} is the permuted set of *X*; then the p-value *P*_{L} is obtained using

$$ P_L = \frac{1}{L} \sum_{l=1}^{L} I\!\left[ S_{max}(X^{(l)}, Y) \ge S_{max}(X, Y) \right], $$

where *I*(·) is the indicator function. With a large enough number of permutations, we can evaluate the p-value to any desired accuracy.
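A sketch of this permutation scheme, with `score_fn` again a user-supplied placeholder for the *S*_{max} calculation:

```python
import random

def permutation_pvalue(X, Y, score_fn, L=1000, seed=0):
    """Permutation p-value: fix Y, reshuffle the time spots (columns)
    of X, and count how often the permuted score reaches the observed
    one. Each element of X is one time spot (e.g. a replicate vector),
    so shuffling the list permutes whole columns."""
    rng = random.Random(seed)
    observed = score_fn(X, Y)
    hits = 0
    for _ in range(L):
        Xp = X[:]            # copy, then permute the time spots
        rng.shuffle(Xp)
        if score_fn(Xp, Y) >= observed:
            hits += 1
    return hits / L
```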

### False discovery rate (FDR) estimation

In most biological studies, a large number of factors need to be considered. If there are *T* factors, there will be *T*(*T* – 1)/2 eLSA pairwise calculations, a number that grows quadratically in *T*. In order to avoid many falsely declared associated pairs of factors, we need to correct for multiple testing. Many methods have been developed for this purpose, and here we use the method of *Storey et al.* [23] to address the issue. In particular, we report the q-value, *Q*, for each pair of factors. The q-value for a pair of factors is the expected proportion of false positives incurred when that particular pair is declared significant.
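The q-value reporting can be sketched as follows. Note this is a simplified stand-in: with π₀ fixed at 1 it reduces to the Benjamini-Hochberg step-up estimate, whereas Storey’s method [23] additionally estimates π₀ from the p-value distribution:

```python
def qvalues(pvalues, pi0=1.0):
    """Step-up q-value estimate: q_(i) = min over j >= i of
    pi0 * p_(j) * m / j, returned in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):     # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pi0 * pvalues[i] * m / rank)
        q[i] = running
    return q
```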

### Computation complexity and implementation

For a single pair of time series, the time complexity of calculating the LS score with the dynamic programming algorithm is *O*(*n*), where *n* is the number of time points. Estimating the bootstrap confidence interval for the LS score with *B* bootstraps requires *O*(*Bn*) calculations, and estimating the statistical significance of each pair with *L* permutations requires *O*(*Ln*) calculations. Thus, a full analysis of each pair of factors requires *O*((*B* + *L*)*n*) calculations. If there are a total of *T* factors, there are *T*(*T* – 1)/2 pairs of factors that need to be compared. Thus, the number of calculations for a full analysis of *T* factors will be on the order of *O*(*T*^{2}(*B* + *L*)*n*), which can be computationally intensive.

In summary, the internal support for replicates and the use of CI estimates are the two major methodological enhancements to LSA. The eLSA software, however, also incorporates other new features, such as faster permutation and false discovery rate evaluations and more options to handle missing values. Other implementation details are available from the software documentation.