Install netTS, load some libraries

#if netTS is not yet installed you can use the following code to install it from github:
#devtools::install_github("tbonne/netTS")

library(netTS)
library(lubridate)
library(ggplot2)
library(igraph)
library(reshape2)

Introduction

This vignette provides guidance on how to choose the scale of aggregation for time-aggregated network analysis (i.e., the window size). It will introduce:

  1. How to visualize the effects of changing window sizes.
  2. Bootstrap methods to estimate uncertainty in network measures.
  3. How to estimate the way window size alters variation in network measures through time.

We use grooming data of vervet monkeys as an example:

head(groomEvents)
##     ID PartnerID                date
## 1 Laur      Malc 2015-07-01 12:32:19
## 2 Malc      Laur 2015-07-01 12:33:01
## 3 Ubun      Wall 2015-07-01 16:08:26
## 4 Wall      Ubun 2015-07-01 16:09:52
## 5 Dott      Jasm 2015-07-02 09:59:27
## 6 Jasm      Dott 2015-07-02 10:01:58

And we use two measures, node strength and graph-level mean out-strength, to explore the data:

#create a function that calculates the strength (weighted degree) of each node
strength_nodes <- function(graph){
  #node strength, using edge weights
  my.value <- strength(graph)
  
  #return the value 
  return(my.value)
}

#create a function that calculates the mean out-strength of the graph
strength_graph <- function(graph){
  #mean out-strength across nodes
  my.value <- mean(strength(graph,mode="out"))
  
  #return the value 
  return(my.value)
}
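
To check what these functions return, we can apply them to a small toy graph (the data frame below is made up purely for illustration; it is not part of the vervet data):

#build a tiny weighted, directed toy graph with igraph
g.test <- graph_from_data_frame(
  data.frame(from = c("A","A","B"), to = c("B","C","A"), weight = c(2,1,3)),
  directed = TRUE)

#node-level values: one strength per node
strength_nodes(g.test)

#graph-level value: mean out-strength across nodes
strength_graph(g.test)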

1) Visualize the effects of scale choice

Choosing the window size will affect the time scale at which change is measured. The lower bound of the window size will be set in part by the density of data collection, whereas the upper bound will be constrained by the total extent of time over which the data were collected. Below we look at how changing the window size affects the measured change in the network.
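
To ground these bounds for the example data, a few quick summaries of groomEvents can help (this assumes the columns shown in head(groomEvents) above):

#total extent of time covered by the data
range(groomEvents$date)

#number of individuals recorded as groomers
length(unique(as.character(groomEvents$ID)))

#rough density of data collection: grooming events recorded per day
nrow(groomEvents) / as.numeric(difftime(max(groomEvents$date), min(groomEvents$date), units = "days"))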

#look at the change in the network at different time scales: 10, 30, 60, and 90 days
net.str.10<-graphTS(groomEvents, windowsize = days(10), windowshift = days(10), measureFun = strength_graph, directed=TRUE)
## [1] "50 networks extracted"
net.str.30<-graphTS(groomEvents, windowsize = days(30), windowshift = days(10), measureFun = strength_graph, directed=TRUE)
## [1] "48 networks extracted"
net.str.60<-graphTS(groomEvents, windowsize = days(60), windowshift = days(10), measureFun = strength_graph, directed=TRUE)
## [1] "45 networks extracted"
net.str.90<-graphTS(groomEvents, windowsize = days(90), windowshift = days(10), measureFun = strength_graph, directed=TRUE)
## [1] "42 networks extracted"
#combine them all into one dataset
net.str.10$windowsize <- 10
net.str.30$windowsize <- 30
net.str.60$windowsize <- 60
net.str.90$windowsize <- 90
net.str.all <- rbind(net.str.10,net.str.30,net.str.60,net.str.90)

#plot the results
p.str.all<-ggplot(data=net.str.all,aes(y=measure, x=windowstart, color=factor(windowsize)))+geom_point()+geom_path() + labs(y="Mean out-strength", x="Date") + theme_classic()
p.str.all 

We can see that as the window size gets larger, so does the mean out-strength. We can also see that changes are smoothed out, and that the time series covers a shorter extent, since fewer windows fit within the study period.
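
One rough way to quantify this smoothing (not part of the netTS workflow itself) is to compare the spread of the measure within each window-size series:

#standard deviation of mean out-strength for each window size
aggregate(measure ~ windowsize, data = net.str.all, FUN = sd)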

2) Bootstrapping

Next we try to estimate a lower bound on the window size using an approach based on bootstrapping and subsampling. This approach constructs networks from bootstrapped samples of the events data frame. Once these bootstrapped networks are built, it is possible to compare graph-, node-, or dyad-level values from the bootstrapped networks with those of the observed network. This procedure is repeated for the full observed data frame as well as for subsamples of it, to test for sensitivity to missing data.
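
The core idea can be sketched directly with igraph and base R; check.windowsize (used below) handles this internally, and its similarity metric may differ from the Pearson correlation used here. The 30-day window, the make_graph helper, and the correlation are illustrative assumptions only:

#take one 30-day window of events (illustrative choice)
window.events <- subset(groomEvents, date >= min(date) & date < min(date) + days(30))

#helper: count events per dyad and build a weighted, directed graph
make_graph <- function(events){
  edges <- aggregate(weight ~ ID + PartnerID, data = transform(events, weight = 1), FUN = sum)
  graph_from_data_frame(edges, directed = TRUE)
}

#node strength in the observed network for this window
g.obs <- make_graph(window.events)
obs.strength <- strength_nodes(g.obs)

#one bootstrap replicate: resample events with replacement and rebuild the network
boot.events <- window.events[sample(nrow(window.events), replace = TRUE), ]
boot.strength <- strength_nodes(make_graph(boot.events))[names(obs.strength)]

#similarity between bootstrapped and observed node values (correlation used here for illustration)
cor(obs.strength, boot.strength, use = "complete.obs")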

Compare node level measures between the observed network and bootstrapped networks

#calculate the similarity between node strength in the bootstrapped samples and in the observed network
check.boot.10<-check.windowsize(groomEvents, windowsize = days(10), windowshift = days(10), measureFun = strength_nodes, directed=TRUE, subsamples = c(1,0.8, 0.6), plot=F)
check.boot.30<-check.windowsize(groomEvents, windowsize = days(30), windowshift = days(10), measureFun = strength_nodes, directed=TRUE, subsamples = c(1,0.8, 0.6), plot=F)
check.boot.60<-check.windowsize(groomEvents, windowsize = days(60), windowshift = days(10), measureFun = strength_nodes, directed=TRUE, subsamples = c(1,0.8, 0.6), plot=F)
check.boot.90<-check.windowsize(groomEvents, windowsize = days(90), windowshift = days(10), measureFun = strength_nodes, directed=TRUE, subsamples = c(1,0.8, 0.6), plot=F)

#generate the plots
p.boot.10<-ggplot(data=check.boot.10,aes(y=mean, x=windowstart, color=factor(fracData)))+geom_point()+geom_path() + geom_hline(yintercept = 1, linetype="dashed") + geom_ribbon(aes(ymin=CI.low,ymax=CI.high, fill=factor(fracData)), alpha=0.1, color=NA)+ theme_classic() + ylab("Similarity") + xlab("Date") + ylim(0.5,1)

p.boot.30<-ggplot(data=check.boot.30,aes(y=mean, x=windowstart, color=factor(fracData)))+geom_point()+geom_path() + geom_hline(yintercept = 1, linetype="dashed") + geom_ribbon(aes(ymin=CI.low,ymax=CI.high, fill=factor(fracData)), alpha=0.1, color=NA)+ theme_classic() + ylab("Similarity") + xlab("Date")+ ylim(0.5,1)

p.boot.60<-ggplot(data=check.boot.60,aes(y=mean, x=windowstart, color=factor(fracData)))+geom_point()+geom_path() + geom_hline(yintercept = 1, linetype="dashed") + geom_ribbon(aes(ymin=CI.low,ymax=CI.high, fill=factor(fracData)), alpha=0.1, color=NA)+ theme_classic() + ylab("Similarity") + xlab("Date")+ ylim(0.5,1)

p.boot.90<-ggplot(data=check.boot.90,aes(y=mean, x=windowstart, color=factor(fracData)))+geom_point()+geom_path() + geom_hline(yintercept = 1, linetype="dashed") + geom_ribbon(aes(ymin=CI.low,ymax=CI.high, fill=factor(fracData)), alpha=0.1, color=NA)+ theme_classic() + ylab("Similarity") + xlab("Date")+ ylim(0.5,1)


#plot the results
p.boot.10

p.boot.30

p.boot.60

p.boot.90

We can see from these plots that window sizes of 30 days and below start to become noisy, in terms of the match between the observed network and the bootstrapped samples.
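
To put a rough number on this, we can summarise each similarity series; this assumes check.windowsize returns the mean and fracData columns used in the plots above, with fracData stored numerically:

#mean and spread of the similarity to the observed network, full data only (fracData == 1)
sapply(list(days10 = check.boot.10, days30 = check.boot.30, days60 = check.boot.60, days90 = check.boot.90),
       function(x){
         full <- x$mean[x$fracData == 1]
         c(mean.similarity = mean(full, na.rm = TRUE), sd.similarity = sd(full, na.rm = TRUE))
       })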

3) Choosing a temporal scale

Once you’ve decided on a measure you’d like to look at, and have started generating time series of that measure using something like graphTS, it is possible to look at multiple time scales to see how window size choices alter the time series. For example, you can look at how the magnitude of variability in the time series changes, or at how the magnitude of autocorrelation changes. The idea here is to see how a time series changes depending on the time scale you’ve chosen to aggregate over. By quantifying how the time series changes, it is possible to then look for natural time scales (sensu) while also considering biologically meaningful window sizes.
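
For example, the autocorrelation of the graph-level series generated in section 1 can be compared directly, as a quick check outside of check.timescale:

#autocorrelation of mean out-strength at two of the window sizes used above
acf(net.str.10$measure, na.action = na.pass, main = "window size = 10 days")
acf(net.str.90$measure, na.action = na.pass, main = "window size = 90 days")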

In this example we use a measure of sample entropy to estimate how changing the temporal scale affects the amount of structure in the time series. Here, structure captures how predictable the next value in the time series is, given the preceding sequence of values. Sample entropy is highest when the time series is effectively random, while lower values indicate a more structured, more predictable time series.

library(pracma)
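
Before running the full check, a toy comparison can help build intuition for how sample_entropy behaves (the vectors below are made up; they are not the grooming data):

set.seed(1)

#a purely random series: values are unpredictable, so sample entropy is relatively high
sample_entropy(rnorm(200))

#a highly regular series: values are predictable, so sample entropy is relatively low
sample_entropy(sin(seq(0, 20 * pi, length.out = 200)))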

df.var<-check.timescale(groomEvents, windowsize_min = days(10), windowsize_max = days(200), by=days(10), windowshift = days(1), measureFun = strength_graph, summaryFun = sample_entropy, boot=3 )
## [1] "497 networks extracted"
## [1] "487 networks extracted"
## [1] "477 networks extracted"
## [1] "467 networks extracted"
## [1] "457 networks extracted"
## [1] "447 networks extracted"
## [1] "437 networks extracted"
## [1] "427 networks extracted"
## [1] "417 networks extracted"
## [1] "407 networks extracted"
## [1] "397 networks extracted"
## [1] "387 networks extracted"
## [1] "377 networks extracted"
## [1] "367 networks extracted"
## [1] "357 networks extracted"
## [1] "347 networks extracted"
## [1] "337 networks extracted"
## [1] "327 networks extracted"
## [1] "317 networks extracted"
## [1] "307 networks extracted"
## [1] "497 networks extracted"
## [1] "487 networks extracted"
## [1] "477 networks extracted"
## [1] "467 networks extracted"
## [1] "457 networks extracted"
## [1] "447 networks extracted"
## [1] "437 networks extracted"
## [1] "427 networks extracted"
## [1] "417 networks extracted"
## [1] "407 networks extracted"
## [1] "397 networks extracted"
## [1] "387 networks extracted"
## [1] "377 networks extracted"
## [1] "367 networks extracted"
## [1] "357 networks extracted"
## [1] "347 networks extracted"
## [1] "337 networks extracted"
## [1] "327 networks extracted"
## [1] "317 networks extracted"
## [1] "307 networks extracted"
## [1] "497 networks extracted"
## [1] "487 networks extracted"
## [1] "477 networks extracted"
## [1] "467 networks extracted"
## [1] "457 networks extracted"
## [1] "447 networks extracted"
## [1] "437 networks extracted"
## [1] "427 networks extracted"
## [1] "417 networks extracted"
## [1] "407 networks extracted"
## [1] "397 networks extracted"
## [1] "387 networks extracted"
## [1] "377 networks extracted"
## [1] "367 networks extracted"
## [1] "357 networks extracted"
## [1] "347 networks extracted"
## [1] "337 networks extracted"
## [1] "327 networks extracted"
## [1] "317 networks extracted"
## [1] "307 networks extracted"
## [1] "497 networks extracted"
## [1] "487 networks extracted"
## [1] "477 networks extracted"
## [1] "467 networks extracted"
## [1] "457 networks extracted"
## [1] "447 networks extracted"
## [1] "437 networks extracted"
## [1] "427 networks extracted"
## [1] "417 networks extracted"
## [1] "407 networks extracted"
## [1] "397 networks extracted"
## [1] "387 networks extracted"
## [1] "377 networks extracted"
## [1] "367 networks extracted"
## [1] "357 networks extracted"
## [1] "347 networks extracted"
## [1] "337 networks extracted"
## [1] "327 networks extracted"
## [1] "317 networks extracted"
## [1] "307 networks extracted"
#plot how sample entropy changes with window size
ggplot(df.var, aes(x=windowsize, y=var, color=boot))+geom_line()+theme_classic() + labs(y="Sample entropy", x="Window size (days)") + geom_vline(xintercept = 70)
## Warning: Removed 4 rows containing missing values (geom_path).

We can see here that there is a low point at around 70 days, with the bootstrapped estimates diverging from one another after this point.
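
If df.var contains the windowsize, var, and boot columns used in the plot above, the window size with the lowest average sample entropy can be pulled out directly; the snippet below is a convenience, not a substitute for inspecting the plot:

#average sample entropy per window size, across bootstrap runs
ent.by.window <- aggregate(var ~ windowsize, data = df.var, FUN = mean)

#window size with the lowest (most structured) sample entropy
ent.by.window$windowsize[which.min(ent.by.window$var)]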

Conclusion

By visualizing network measures at multiple scales, comparing bootstrapped networks to the observed network, and looking at how the resulting time series change with window size, it is possible to get a better idea of which scale is appropriate. More generally, the way a particular network measure changes as the time scale changes can be an interesting question in and of itself, apart from finding a single optimal time scale.