# 3. Working with Time-series Data¶

```
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from neighbors import NNMF_sgd, estimate_performance, load_toymat, create_sparse_mask


def plot_mat(mat, vmin=1, vmax=100, cmap="Blues", ax=None, title=None):
    """Quick helper function to nicely plot a user x item matrix"""
    ax = sns.heatmap(mat, cmap=cmap, vmin=vmin, vmax=vmax, square=False, cbar=False, ax=ax)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.yaxis.set_major_locator(ticker.MultipleLocator(5))
    ax.yaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.set(xlabel="Time-point")
    if title:
        ax.set(title=title)


def plot_timeseries(model, dilate_by_nsamples):
    _, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
    arr1 = model.masked_data.iloc[0, :].copy()
    model.dilate_mask(dilate_by_nsamples)
    arr2 = model.masked_data.iloc[0, :].copy()
    ax1.plot(range(len(arr1)), arr1)
    ax1.set(ylabel="Rating", xlabel="Time-point", title="Single User Observed Ratings")
    ax2.plot(range(len(arr2)), arr2)
    ax2.set(ylabel="Rating", xlabel="Time-point", title="Single User Dilated Ratings")
    sns.despine()
```

In the last two tutorials we saw how to work with **dense** and **sparse** data. In this last tutorial we'll demonstrate a unique feature of the toolbox for working with **sparse time-series** data. Such data arise when users provide continuous ratings *over time* rather than over unique items, for example while watching a single movie or listening to a single audio track. However, as before, these ratings may be sparse, such that not every user has a rating for every time-point.

Like before, we'll begin with the `load_toymat` function to generate a sample dataset in which **50 users** provided ratings for **200 time-points** on a scale from 1-100. Additionally, we'll sparsify this dataset by masking out 50% of the ratings using the `create_sparse_mask` function.

Let's plot the mask below:

```
toy_data = load_toymat(users=50, items=200, random_state=0)
mask = create_sparse_mask(toy_data, n_mask_items=.5)
masked_data = toy_data[mask]
plot_mat(mask, vmin=0, vmax=1, cmap='Greys')
```
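Conceptually, a sparse mask is just a boolean user × time-point array marking which ratings were observed. Here's a minimal NumPy sketch of the idea (a hypothetical illustration, not `create_sparse_mask`'s actual implementation), hiding 50% of each user's time-points:

```python
import numpy as np

rng = np.random.default_rng(0)
users, timepoints = 50, 200

# Simulated ratings on a 1-100 scale
ratings = rng.integers(1, 101, size=(users, timepoints)).astype(float)

# For each user, randomly mark 50% of time-points as unobserved
mask = np.ones((users, timepoints), dtype=bool)
for u in range(users):
    hidden = rng.choice(timepoints, size=timepoints // 2, replace=False)
    mask[u, hidden] = False

# Apply the mask: unobserved entries become NaN
masked = np.where(mask, ratings, np.nan)
print(mask.mean())  # fraction of observed entries: 0.5
```

Masking per user (rather than globally) guarantees every user retains some observed time-points to learn from.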

# Dilating a time-series model¶

One way to make higher quality predictions given such data is to leverage the fact that time-series often have intrinsic *autocorrelation*. In other words, successive time-points are more likely to have similar ratings than more distant time-points are. The amount of similarity is dictated by the autocorrelation function of the data, which may be difficult or impossible to estimate from a sparse time-series. Nonetheless, model predictions can often benefit from some initial interpolation or *temporal smoothing*, whereby missing time-points are filled in with values computed from neighboring observed time-points.
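To see why autocorrelation is what makes smoothing useful, compare the lag-1 autocorrelation of a slowly varying signal against white noise. This is a toy NumPy sketch, independent of the toolbox:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# A slowly drifting "rating" signal vs. pure noise
smooth = np.cumsum(rng.normal(size=n))  # random walk: strong autocorrelation
noise = rng.normal(size=n)              # white noise: near-zero autocorrelation

def lag1_autocorr(x):
    """Pearson correlation between x[t] and x[t+1]."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(smooth))  # close to 1
print(lag1_autocorr(noise))   # close to 0
```

For the random-walk signal, a neighbor's value is an excellent guess for a missing time-point; for white noise it tells you almost nothing, which is why over-smoothing low-autocorrelation data hurts.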

Models can use the `.dilate_mask` method, or the `dilate_by_nsamples` argument of their `.fit` method, to perform this operation prior to estimation. Let's see what that looks like using a single user's ratings, where we **dilate** their observed time-series by 20 samples. Dilation occurs by convolving the observed time-series with a box-car function of the requested width (in number of samples). Overlapping dilated samples are simply averaged:

```
model = NNMF_sgd(masked_data, random_state=0)
# This is just a convenience function that plots a single user's time-series
# It makes use of model.dilate_mask under-the-hood
# See the function definition at the top of this tutorial
plot_timeseries(model, dilate_by_nsamples=20)
```

data contains NaNs...treating as pre-masked

Performing this for all users we can now see the difference between the original sparse time-series data and the dilated data. Notice that this has the effect of making the data **dense** by smoothing, which in many cases can dramatically improve model predictions.

```
_, (ax1, ax2) = plt.subplots(1,2, figsize=(10,4))
# Calling dilate_mask with no arguments "undilates" the data
model.dilate_mask()
plot_mat(model.masked_data, ax=ax1, title="Observed Ratings")
# Now dilate by 20 time-points
model.dilate_mask(20)
plot_mat(model.masked_data, ax=ax2, title="Dilated Ratings")
```
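The box-car convolution described above can be sketched in plain NumPy. This is a simplified stand-in for `.dilate_mask`, not the toolbox's exact implementation: each missing time-point is filled with the average of observed values falling inside the box-car window, while observed values are left untouched.

```python
import numpy as np

def dilate(series, nsamples):
    """Fill NaNs by convolving observed values with a box-car of
    width nsamples; overlapping contributions are averaged."""
    observed = ~np.isnan(series)
    kernel = np.ones(nsamples)
    # Windowed sum of observed values, and count of observations per window
    vals = np.convolve(np.where(observed, series, 0.0), kernel, mode="same")
    counts = np.convolve(observed.astype(float), kernel, mode="same")
    out = series.copy()
    # Only fill missing points whose window contains at least one observation
    fillable = np.isnan(series) & (counts > 0)
    out[fillable] = vals[fillable] / counts[fillable]
    return out

series = np.array([np.nan, 2.0, np.nan, np.nan, 6.0, np.nan])
print(dilate(series, 3))  # → [2. 2. 2. 6. 6. 6.]
```

Dividing the windowed sum by the windowed count is what implements the "overlapping dilated samples are averaged" behavior: a gap reachable from two observed points gets their mean, not their sum.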

For convenience, it's possible to request dilation directly from a model's `.fit` method without needing to call `.dilate_mask` first. To verify whether a model's mask has been dilated, you can check its `.is_mask_dilated` attribute:

```
model.fit(dilate_by_nsamples=20)
model.is_mask_dilated
```

True

Because we're dealing with sparse data, we don't have ground-truth observations to evaluate model performance. But as demonstrated in the previous tutorial, we can use the `estimate_performance` function to approximate model performance using cross-validation. In particular, this function takes a `fit_kwargs` argument: a dictionary of arguments passed on to a model's `.fit` method. We can make use of this to request dilation during estimation:

```
group_res_d, user_res_d = estimate_performance(
NNMF_sgd, masked_data, fit_kwargs={"dilate_by_nsamples": 20}
)
```

Data sparsity is 50.0%. Using cross-validation...

# A Few Notes on Dilation¶

- It's critical to note that when using dilation alongside the `estimate_performance` function for **dense data**, we're never "leaking" observed ratings to the held-out observations. The process of dilation simply makes use of the *observed* ratings to fill in missing observations; the values that get filled in make no use of the true ratings at those time-points.
- Likewise, observation splitting for cross-validation of **sparse** data occurs *before* dilation, so that observed ratings treated as test time-points have no influence on the dilation operation.
- The optimal amount of dilation for a particular dataset will vary based on a number of factors, including the type and quality of the data, intrinsic autocorrelation, time-series length, and time-series sparsity.
- Dilation can often be most helpful in extremely sparse cases (e.g. < 50% of ratings are observed) when undilated model estimates are poor. However, over-dilating can actually make model estimates *worse* in some cases.
- It's therefore important to consider the plausible rate-of-change of a user's responses given the sampling rate. For example, dilating emotion ratings collected at 1 Hz (once every second) by 60 samples (1 minute) assumes emotions change over minutes rather than seconds, and may smooth over important temporal dynamics.
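The split-before-dilation point can be sketched in a few lines of NumPy (a hypothetical illustration of the ordering, not `estimate_performance` itself): held-out test entries are removed from the observed set first, so any subsequent smoothing only ever sees training values.

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=20)          # a user's full (unobserved) time-series

# Suppose every other time-point was actually rated
observed = np.arange(20) % 2 == 0
obs_idx = np.flatnonzero(observed)

# 1. Split the *observed* time-points into train/test BEFORE any smoothing
test_idx = rng.choice(obs_idx, size=len(obs_idx) // 5, replace=False)
train = np.where(observed, series, np.nan)
train[test_idx] = np.nan              # hide held-out ratings from dilation

# 2. Dilation/smoothing now runs on `train` only, so the held-out test
#    ratings cannot leak into the interpolated values
print(np.isnan(train).sum())          # → 12 (10 unobserved + 2 held out)
```

Running the split in the opposite order (dilate first, then hold out) would let test ratings contaminate the interpolated values and inflate the apparent performance.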