
Capture data patterns using running mean / moving average – Part one

An important goal of statistics and machine learning is to find patterns in data and explore potential relationships between variables.

In a previous blog post, I explored polynomial regression and highlighted its limitation regarding complex data patterns.

The objective of this blog post is to explore whether, and how, the simple smoothing method of the running mean (also called the moving average) allows us to capture complex data patterns.

Figure 1: Motor-cycle experiment crash data (little grey circles) together with the running mean (blue dots). \(K\) is the neighbourhood radius (see below).

For illustration, let us consider the dataset corresponding to the little grey circles in Figure 1.
This is the same dataset used in our previous post. It records the value of Acceleration over Time in an experiment on the efficacy of helmets in a motor-cycle crash.

We want to summarise the overall relation between Acceleration (\(y\)) and Time (\(x\)).

Let us denote by \(y_i\) the observed values of Acceleration at Time \(x_i\), \(i=1,…,n\).
A simple way to summarise the relation between these two variables is to consider the following

\[ y_i = f(x_i) + \varepsilon_i, \quad i=1,…,n, \]
where \(f\) is some function to be determined and \(\varepsilon\) is the error term.

In the method of running mean, \(f\) is estimated at each point using the average of the nearest neighbours.
Specifically, the estimate \(\hat f(x_i)\) of \(f(x_i)\) is obtained by averaging the values of  the response (i.e. Acceleration) in the vicinity of \(x_i\).

To implement the method of running mean, an important decision is the choice of the structure of the vicinity/neighbourhood (this choice has a massive impact, as we shall see shortly).

Various neighbourhood structures can be considered. One simple structure consists
of taking the target data point itself, together with \(K\) points on its left and \(K\) points on its right.

For example, if we set \(K=2\), then \(\hat f(x_i) = (y_{i-2}+y_{i-1}+y_{i}+y_{i+1}+y_{i+2})/5\).
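At an interior point, this \(K=2\) average can be written out directly. A minimal sketch (the toy values below are hypothetical, purely for illustration):

```python
y = [2.0, 4.0, 6.0, 8.0, 10.0]     # hypothetical observations y_1, ..., y_5
i = 2                              # target index (0-based), an interior point
f_hat = sum(y[i - 2:i + 3]) / 5    # (y[0] + y[1] + y[2] + y[3] + y[4]) / 5
print(f_hat)                       # 6.0
```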

Clearly, it is not always possible to have \(K\) points on each side when approaching the edges, in which case we take as many points as are available.

We refer to \(K\) as the neighbourhood radius in this post.

In general, for a given neighbourhood radius \(K\),  it can be shown that the running mean  estimator  \(\hat f(x_i)\) of \(f(x_i)\) is given by
\[\hat f(x_i) = \dfrac{\sum_{j=J_0}^{J_1} y_j}{J_1-J_0+1}\]
where \(J_0=\max\{i-K,1\}\) and \(J_1=\min\{i+K, n\}\).
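This estimator is straightforward to implement. A minimal sketch in Python (the function name `running_mean` is our own, and NumPy is used only for convenience; indices are shifted to 0-based):

```python
import numpy as np

def running_mean(y, k):
    """Running mean with neighbourhood radius k, truncating at the edges.

    At each index i (0-based), averages y[j] for j in
    [max(i - k, 0), min(i + k, n - 1)], matching J_0 and J_1 above.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    fitted = np.empty(n)
    for i in range(n):
        j0 = max(i - k, 0)          # J_0, shifted to 0-based indexing
        j1 = min(i + k, n - 1)      # J_1, shifted to 0-based indexing
        fitted[i] = y[j0:j1 + 1].mean()
    return fitted

print(running_mean([1, 2, 3, 4, 5], 1).tolist())  # [1.5, 2.0, 3.0, 4.0, 4.5]
```

Note how the windows at the first and last indices are truncated, so the fitted values there average only two points rather than three.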

The moving blue dots on Figure 1 show the fitted running mean for different values of the neighbourhood radius \(K \in \{0,1,2,\cdots\}\). A number of comments can be made.

For example, this figure shows that the fitted running mean converges to the data as the neighbourhood radius decreases; in particular, the running mean becomes identical to the data when \(K=0\). At the other extreme, the running mean converges to the overall mean of the data as the neighbourhood radius increases.
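These two limiting cases can be checked numerically. A minimal sketch (the helper `running_mean` and the toy data below are our own, not taken from the post):

```python
import numpy as np

def running_mean(y, k):
    # Truncated-window running mean with neighbourhood radius k (our own sketch).
    y = np.asarray(y, dtype=float)
    n = len(y)
    return np.array([y[max(i - k, 0):min(i + k, n - 1) + 1].mean()
                     for i in range(n)])

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])  # toy data, not the crash dataset

# K = 0: each window contains only the point itself, so the fit is the data.
assert np.allclose(running_mean(y, 0), y)

# K >= n - 1: every window covers all n points, so the fit is the overall mean.
assert np.allclose(running_mean(y, len(y) - 1), y.mean())
```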

A great advantage of the running mean is its simplicity. In addition, the running mean estimator is flexible enough to capture some complex data patterns, as Figure 1 illustrates.

In this post, we have concentrated on a specific neighbourhood structure consisting of up to \(K\) points to the left and to the right of the target point. In the upcoming post we shall look at other neighbourhood structures.

 
