In class we talked about how we can think about equations in terms of R code. I wanted to give you a formal example here using the Moran’s I measure we just covered. The equation is shown below. This looks big and complicated, but if we break it down into code, as we discussed, it becomes much more tractable.
\[I=\frac{n}{\sum_{i}(x_i-\overline{x})^2}\frac{\sum_{i}\sum_jw_{ij}(x_i-\overline{x})(x_j-\overline{x})}{\sum_{i}\sum_{j}w_{ij}}\]
\(n\) here simply refers to the number of observations.
\(x_i\) and \(x_j\) refer to the \(i\)th and \(j\)th values of the variable we are considering, and \(\overline{x}\) is the mean of that variable.
\(w\) refers to the matrix of weights: a symmetric matrix of 1s and 0s that determines which observations count as neighbors of one another.
\(\sum_{i}\) and \(\sum_{j}\) indicate that we should sum the results across all values for \(i\) and \(j\) respectively (think of this like our for loops).
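To make that loop analogy concrete, here is a minimal sketch (with made-up numbers, not the Maya data) showing that \(\sum_{i}x_i\) is just a for loop that accumulates a running total:

```r
# a toy vector of made-up values
x <- c(2, 5, 9)

# sum(x) written out as a for loop: start at 0 and add each x[i] in turn
total <- 0
for (i in 1:length(x)) {
  total <- total + x[i]
}

total  # 16, the same answer as sum(x)
```

The double sum \(\sum_{i}\sum_{j}\) works the same way, just with one loop nested inside the other, as we'll see below.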
Let’s explore this example using the Maya terminal dates data from Premo (2004). We will first read in that data file and then create a neighborhood based on a distance cut-off of 75 km.
#read file of Maya site locations and terminal dates
maya <- read.csv('maya.csv', header = TRUE)
#create distance matrix in km for all sites in 'maya'
w <- as.matrix(dist(maya[,3:4]))
#define all distances greater than 75 km as 0 (non-neighbors)
w[w > 75] <- 0
#define the remaining positive distances as 1 (neighbors within 75 km)
w[w > 0] <- 1
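Since the weights matrix drives everything that follows, it's worth a couple of sanity checks. Here's a sketch using made-up coordinates (stand-ins for the columns we pulled from maya) and the same two-step recoding; the `w.toy` name and toy values are mine:

```r
# made-up site coordinates (stand-ins for the maya columns)
coords <- data.frame(east = c(0, 10, 100, 105),
                     north = c(0, 5, 0, 8))

# same recipe as above: distances, then a 75 km cut-off
w.toy <- as.matrix(dist(coords))
w.toy[w.toy > 75] <- 0
w.toy[w.toy > 0] <- 1

isSymmetric(w.toy)     # TRUE: if i is a neighbor of j, j is a neighbor of i
all(diag(w.toy) == 0)  # TRUE: no site is its own neighbor
rowSums(w.toy)         # each site's neighbor count
```

Note that the diagonal works out automatically: a site's distance to itself is 0, and the second recoding step only flips values that are strictly greater than 0.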
Next, we’ll create a variable for \(n\) which is just the number of rows in our data.frame. Then we can create a variable for \(x\) and \(\overline{x}\) based on the “Date” column in the data.frame we just imported.
n <- nrow(maya)
x <- maya$Date
xbar <- mean(x)
The next thing we need is the numerator of the right-hand term, \(\sum_{i}\sum_jw_{ij}(x_i-\overline{x})(x_j-\overline{x})\).
There are a lot of computational shortcuts we could use in R to get this, but I’m going to show you the “expanded” version here to be clear about how to read the equation. I’ll use a nested pair of for loops to iterate over every value of \(i\) and \(j\), which here index the rows and columns of the symmetric matrix of weights (and of the output matrix). To get the result we’re looking for, we’ll then simply sum the entire output matrix.
res <- matrix(0, n, n) # an n x n matrix of 0s to hold each w_ij * (x_i - xbar) * (x_j - xbar) term
## for loop
for (i in 1:n) {
for (j in 1:n) {
res[i, j] <- w[i, j] * (x[i] - xbar) * (x[j] - xbar)
}
}
res.sum <- sum(res)
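As an aside, here is one of those shortcuts: if we let `z` be the centered values `x - xbar`, then `outer(z, z)` builds the full matrix of \((x_i-\overline{x})(x_j-\overline{x})\) products in one step, and the double sum is just `sum(w * outer(z, z))`. A sketch with made-up values and a made-up weights matrix (the `.toy` names are mine, not part of the Maya example):

```r
# made-up values and a made-up binary weights matrix
x.toy <- c(3, 7, 4, 9)
w.toy <- matrix(c(0, 1, 0, 0,
                  1, 0, 1, 0,
                  0, 1, 0, 1,
                  0, 0, 1, 0), nrow = 4, byrow = TRUE)
n.toy <- length(x.toy)
xbar.toy <- mean(x.toy)

# expanded nested-loop version, exactly as in the text
res.toy <- matrix(0, n.toy, n.toy)
for (i in 1:n.toy) {
  for (j in 1:n.toy) {
    res.toy[i, j] <- w.toy[i, j] * (x.toy[i] - xbar.toy) * (x.toy[j] - xbar.toy)
  }
}

# vectorized version: one line, same result
z <- x.toy - xbar.toy
sum(res.toy) == sum(w.toy * outer(z, z))  # TRUE
```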
The next step is to divide this by the denominator of the same part of the equation, \(\sum_{i}\sum_{j}w_{ij}\), which is just the sum of all values in the \(w\) matrix.
spmw <- res.sum/sum(w)
Now we calculate the left-hand term of the equation, \(\frac{n}{\sum_{i}(x_i-\overline{x})^2}\), which is the reciprocal of the variance.
vr <- n/sum((x-xbar)^2)
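To see why this is the reciprocal of the variance: dividing the sum of squared deviations by \(n\) gives the (population) variance, so \(n\) over that sum is its inverse. A quick check with made-up numbers (the `.toy` names are mine):

```r
# made-up values
x.toy <- c(3, 7, 4, 9)
n.toy <- length(x.toy)
xbar.toy <- mean(x.toy)

vr.toy <- n.toy / sum((x.toy - xbar.toy)^2)

# the same thing as 1 over the population variance (divide by n, not n - 1)
all.equal(vr.toy, 1 / (sum((x.toy - xbar.toy)^2) / n.toy))  # TRUE
```

Note this divides by \(n\) rather than \(n-1\), so it is not quite `1 / var()`, which uses the sample variance.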
Now, to calculate Moran’s I, we simply multiply the left-hand and right-hand terms of the equation together. We can also very quickly calculate the expected value of I, which is defined as \(EI=\frac{-1}{(n-1)}\).
I <- spmw*vr
I
## [1] 0.07573934
EI <- -1/(n-1)
EI
## [1] -0.02173913
We could also simplify this by rolling the whole thing into a function.
moran.manual <- function(x, n, w) {
  xbar <- mean(x)  # compute the mean inside the function so it doesn't rely on a global xbar
  res <- matrix(0, n, n)  # an n x n matrix of 0s to hold each term of the double sum
  for (i in 1:n) {
    for (j in 1:n) {
      res[i, j] <- w[i, j] * (x[i] - xbar) * (x[j] - xbar)
    }
  }
  out <- n / sum((x - xbar)^2) * (sum(res) / sum(w))
  return(out)
}
moran.manual(x=x,n=n,w=w)
## [1] 0.07573934
Now, let’s compare the results to the spdep function we used in class. That function also reports a p-value under an assumption of randomization of the values in x, and I’ll show you how to approximate that with an actual resampling procedure once we’ve compared the results.
library(spdep)
moran.test(x, mat2listw(w))
##
## Moran I test under randomisation
##
## data: x
## weights: mat2listw(w)
##
## Moran I statistic standard deviate = 1.4967, p-value = 0.06723
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.075739343 -0.021739130 0.004241697
Great! We got the same results for Moran’s I and the Expected Moran’s I as above.
Now let’s use our custom function to calculate a p-value. What we’ll do is recalculate Moran’s I a large number of times, shuffling the values of x each time, and then count how many of the shuffled datasets produce a Moran’s I larger than our observed value (a one-sided test, matching the “greater” alternative above). That count divided by the number of random runs provides an estimated p-value. Let’s try it.
nsim <- 100
rand.I <- numeric(nsim)  # preallocate the results vector
set.seed(4535)
for (i in 1:nsim) {
  rand.I[i] <- moran.manual(sample(x, replace = FALSE), n, w)
}
hist(rand.I,breaks=15)
abline(v=I,lty=2,lwd=2,col='red')
p.value <- sum(rand.I > I) / nsim
p.value
## [1] 0.07
Great! Our results are slightly different from what the spdep function provides, but in the ballpark. If we wanted to reproduce the results very precisely, I’d dig into the code for moran.test or the original article that the R package cites to see exactly how the p-value is calculated under the hood, but this gets us in the neighborhood.
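One refinement worth knowing about: many permutation tests count the observed statistic itself as one of the possible arrangements, giving a p-value of \((r+1)/(nsim+1)\), where \(r\) is the number of replicates at least as extreme as the observed value. This keeps the estimated p-value from ever being exactly 0. A sketch with made-up replicates (the values and names here are mine, not output from the Maya run):

```r
# a made-up observed statistic and made-up permutation replicates
obs <- 0.35
rand <- c(0.10, -0.05, 0.40, 0.22, -0.18, 0.31, 0.05, -0.02, 0.15, 0.08)
nsim <- length(rand)

# simple proportion, as in the loop above
p.simple <- sum(rand > obs) / nsim

# the (r + 1) / (nsim + 1) convention: the observed value counts as
# one of the arrangements, so p can never be exactly 0
p.plus1 <- (sum(rand >= obs) + 1) / (nsim + 1)

c(p.simple, p.plus1)  # 0.1 and 2/11, about 0.182
```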