1 of 107

Matrix Models

Matrices for Population Studies

Time after Time - Age and Stage Structured Models

Modelling with Markov Chains

The Next Flu Pandemic

2 of 107

Matrix Models

Matrices for Population Studies

Population Matrices and HEP

Vectors

Vector Additions

Multiplication by a Scalar

Dot Product

Matrices

3 of 107

Introduction – Population Matrices

Blue crabs (Callinectes sapidus) are very important to life along the Gulf Coast of the United States. Essential to the complex, estuarine food webs, these animals also represent the second-largest commercial fishery in the area, and thereby provide livelihoods for many—and they are delicious! ( estuarine meaning - the wide part of a river where it nears the sea; fresh and salt water mix)
Because the ecological and economic impact of population fluctuations of this species is immense, our understanding of the dynamics of crab populations is crucial. Human intrusion (oil spills, pollution, overfishing, habitat degradation, etc.) and natural disasters (hurricanes, etc.) in addition to Natural oscillations (climate cycles, dispersal, etc.) all impact populations.
Hence its is important that proper management is necessary for healthy, sustained populations.
We need to understand better the factors critical to the quality and quantity of native populations.
Blue crabs are mainly found in the western Atlantic Ocean, from Nova Scotia to Argentina, in the Gulf of Mexico and in the Caribbean.
Also, blue crabs have been introduced into the North Sea, southwest of France, parts of the Mediterranean, and Japan. Found to depths of 90m, they eat about anything—plants, benthic invertebrates, small fish, detritus, carrion (FAO 2012; Zinski 2006).

4 of 107

Population Matrices

Females mate only one time, right after what is called a “pubertal” molt. As she readies for this molt, she calls male crabs with chemical signals.
Drawn to her allure and attractants, as with many animals, the males may squabble over mating rights.
After mating when the shell of female crab hardens, the female migrates to estuaries, where she buries herself in mud to overwinter.
With the arrival of spring, the female fertilizes and transfers her eggs to form a mass (sponge) attached to her body, which often contains about 2 million individuals (some up to 8 million; Zinski 2006).
Hatching into larvae (zoeae) in about 2 weeks, they are carried out into the open ocean, where some feed, grow, and molt several times over a month or more before they are transformed into megalops.

5 of 107

Population Matrices

Over 1 to 3 weeks, these swimming larvae are transported closer to shore. On shore, they molt into juvenile crabs and then head up the estuaries (primary habitat along the Gulf Coast), where they reside, grow, and undergo numerous molts.
Maturity normally is achieved by the following summer. Adult males tend to stay in the upper estuaries (lower salinity), whereas adult females, after mating, remain in the lower reaches (higher salinity).
Of course, most of the millions of fertilized eggs/larvae never reach adulthood, because they become food for other organisms - including their own kind (Zinski 2006).

6 of 107

Population Matrices

But we still do not know enough to understand the population dynamics of this or any other species that is passively dispersed over large areas.
The countless larval stages are of small sizes and at the mercy of predators, currents, and winds. There is considerable drift of immatures from their birth estuary among other estuaries connecting the adults of different sites.
How is it possible to understand the population dynamics of this species when we so obviously do not understand dispersal?
Gulf coast populations are considered meta-populations, which means that they are spatially fragmented. The extent of connectivity, or exchange of individuals among these populations, is extremely important for population stability and recolonization following local extirpation (eradication) events.

7 of 107

Metapopulation�

Metapopulations are important in fisheries. The local population (1.) serves as a source for hybridization with surrounding subspecies populations (1.a, 1.b, and 1.c).The populations are normally spatially separated and independent but spatial overlap during breeding times allows for gene flow between the populations

8 of 107

Population Matrices

Larval dispersal is very much influenced by mortality, duration of planktonic stages, and behaviour in the water column and upon settling.
To assess connectivity, scientists must quantify the controlling influences for transport, stocks, and maintenance.
Given the scale and complexity of this problem, scientists are turning to computer modelling and simulation to work out spatially explicit models for blue crab populations.
These multifaceted, ecological models are now possible because finely tuned hydrodynamic models of coastal areas are available.
Before they can develop any useful population model, scientists must determine the influence of larval dispersal, settlement, and survival rates on fluctuations in blue crab numbers and also a connectivity matrix for the estuaries, which is a rectangular array of numbers indicating contacts various estuary populations have with one another.

9 of 107

Population Matrices

Over a 3-year period, scientists have collected more than a terabyte (space for 10¹² characters) of data for their study.
On a 2010 sequential computer, they estimate that the simulation time for one larva is 5 min and for 2000 larvae is 1 week.
Thus, averaging the results of 300 simulations involving 2000 larvae each would take about 5.7 years!
With such massive amounts of data and such intensive computations, researchers must use high-performance computing with multiple computer processors to store the data and large matrices and to perform the needed simulations in a reasonable amount of time (Taylor 2010).
In this unit, we examine populations that change with time. To make long-term predictions about these populations, we store their data in structures called vectors and matrices and perform calculations on these structures

10 of 107

Matrix Models - Vectors

11 of 107

Review Question 1�

For b = (15.00, 5.37, 4.84, 6.00, 10.83, 27.43), where indices begin with 1, give the value of b4.

Ans :

The value of b4 is 6.00

Definition : Vectors (x1, x2, . . . , xn) and (y1, y2, . . . , yn) are equal if and only if xi = yi for i = 1, 2, . . . , n.

12 of 107

Vectors – Addition

Suppose the only sharks in the given area are BTS and WTS.
To obtain the total number of sharks each month (vector s), we add corresponding values of the vectors element by element, as follows:

13 of 107

Review Question 2

Suppose two scientists, Drs. Chang and Morris, are leading research teams studying red-footed boobies on two islands in Galapagos. Each group counts the number of eggs, hatchlings, juveniles, and nesting pairs over a 1-week period. Suppose the values for Dr. Chang’s team are 35, 16, 240, and 351 and for Dr. Morris’ team are 18, 10, 103, and 153, respectively.

a. Express the values for Dr. Chang’s team in a vector, c, and for Dr.

Morris’ team in a vector, m

c= (35, 16, 240,351 ) m= (18, 10, 103,153)

14 of 107

b. Compute t = c + m.

t = [53, 26,343,504]

c. What does t represent?

data totals by category

Definition: Let x = (x1, x2, . . . , xn) and y = (y1, y2, . . . , yn) be vectors of n elements each. The sum of x and y is the vector

x + y = (x1 + y1, x2 + y2, . . . , xn + yn).

15 of 107

Vectors - Multiplication by a Scalar

16 of 107

Review Question 3�

The scientists in Quick Review Question 2 estimate that the actual numbers of boobies in each category to be 1.1 as many as they observed. The vector of data for Dr. Chang’s team is c = (35, 16, 240, 351).
Using 1.1 and the variable name c, give the expression for the vector of estimated values.

Ans: 1.1 (36,16,240,351)

b. Give the vector with values rounded to the nearest integers.

(39, 18, 264, 386)

17 of 107

Vectors – Dot Product

Consider the population of green sea turtles.
Here we deal with a different type of multiplication in estimating the number of eggs laid by Hawaiian green sea turtles in one year.
We can consider their life cycle to be in five stages, with egg layers in two stages, novice breeders of age 25 years, and mature breeders from ages 26 through 50 years.
On the average, a novice breeder lays 280 eggs in a year, and a mature breeder lays 70 eggs per year. We can combine these data in a vector

e = (280, 70).

Suppose also that there are 291 novice and 9483 mature breeders, which we store in the vector

b = (291, 9483).

To approximate the total green sea turtle egg production in a year, we multiply together corresponding terms and add the results, as follows:

e · b = (280, 70) · (291, 9483) = 280 · 291 + 70 · 9483 = 81,480 + 663,810

= 745,290 eggs

This type of multiplication, the dot product, involves two vectors of the same size and results in a number.

18 of 107

Vectors – Dot Product

19 of 107

Review Question 4�

The first stage in the life of the Hawaiian green sea turtle, consisting of eggs and

hatch lings, occurs during the first year. Stage 2, juveniles, extends from year 1 to

year 16. Suppose 23% of the hatchlings survive and move to stage 2, while 67.9% of

those in Stage 2 remain in that stage each year. In one year, suppose Stage 1 has

808,988 individuals, and Stage 2 has 715,774 (Green Sea Turtle).

Give a vector, p, with real-number elements representing the percentages.

(0.23, 0.679)

b. Give a vector, s, storing the individuals in Stages 1 and 2.

(808988, 715774)

20 of 107

c. Using variables p and s, not the data, give the vector operation to determine the number of individuals that will be in Stage 2 the following year.

p. s

d. Calculate this value.

0.23 . 808988 + 0.679 . 715774 = 672078

21 of 107

Matrix Models - Matrices

In our discussion on “Vectors,” we considered the data structure of a one-dimensional array, or vector. One example involved vector w, which stored under that one name the simulated number of whitetip sharks from 0 through 5 months.
Quite often, however, more features need to be stored and manipulated. In such cases 2D arrays may be helpful. For example, we can store the data about the number of whitetip sharks (WTS) and blacktip sharks (BTS) in the following 2D array:

22 of 107

Matrices – Scalar Multiplication and Matrix Sums

Similar to vectors, two matrices are equal if they have the same size and corresponding elements are identical.
To compute the sum of two matrices that have the same size, we add corresponding elements.
For the product of a scalar times a matrix, we multiply each element by the scalar.

23 of 107

Matrices – Matrix Multiplication

24 of 107

Matrices – Square Matrices

A number of examples in biology involve square matrices. For example, the hypothetical data presents the distribution of A,B, AB and O blood types for mothers and newborns (multiple births omitted) in a county over a year.
The corresponding matrix is as follows:

According to the datum in the second row, first column, only 37 mothers with type B blood gave birth to a child with type A blood in that county during the year of the study.
The diagonal values, which are in boldface, indicate the numbers of mother-newborn pairs that share the same blood type.
For instance, 1068 type A mothers gave birth to type A children in the county that year.

25 of 107

Matrices – Square Matrices

As another example involving a square matrix, the table presents similarity measures (specifically, Euclidean distances) of the 18S rRNA sequences of pairs of animals, where smaller numbers indicate closer relationships.

Thus, with a Euclidean distance of 0.028, the rRNA sequences of a human and a rabbit are more closely related than that of a human and a frog (distance 0.350).
Because the distance from animal A’s rRNA sequence to animal B’s sequence is the same as the distance from B to A, the table and the corresponding matrix, which follows, are symmetric around the diagonal:

26 of 107

Matrices – Matrices and Systems of Equations

The following are examples of linear equations:

6x = 1

5x + 7y = 3

-2x + πy + 3z = 9

1/2x₁ + 33.2x₂+ 15x₃+ 13x₄= 33/4

The equations derive their name from the fact that when they have only one, two, or three variables, as in the first three examples, their graphs are straight lines. The general linear equation is:

a₁x₁ + a₂x₂ + ∙ ∙ ∙ + a_nx_n = c

where a_i and c are numbers for i = 1, 2, . . . , n.

27 of 107

Matrices – Matrices and Systems of Equations

We can use matrix multiplication for a system of linear equations. Returning to the example involving whitetip sharks (WTS) and blacktip sharks (BTS)

Instead of writing each equation separately, we can employ matrix multiplication to system of six equations, as shown in the figure:

28 of 107

Matrix Models

Time after Time - Age and Stage Structured Models

Age Structured Models

Leslie Matrix

Age Distribution over Time

Projected Population – Growth Rate

Stage Structured Models

Algorithms

Sensitivity Analysis - Age and Stage Structured Models

29 of 107

Introduction – Age and Stage Structured Models

The Problem:

We may want to classify many animals/birds by discrete ages to determine reproduction and mortality.

For example, suppose a bird of a certain kind has a maximum life span of 3 years. During the first year, the animal does not breed. On the average, a typical female of this hypothetical species lays 10 eggs during the second year but only 8 eggs during the third year.

Suppose 15% of the young birds live to the second year of life, while 50% of the birds from age 1 to 2 years live to their third year of life, age 2 to 3 years.

Usually, we consider only the females in the population; and in this example, we assume that half the offspring are female.

30 of 107

Introduction – Age and Stage Structured Models

Under such circumstances, we may be interested in:

Determining the projected population growth rate?
In the case of declining populations, what would be the predicted time of extinction?
As time progresses, does the population reach a stable distribution?
If so, what is the proportion of each age group in such a stable age distribution?
How sensitive is the long-term population growth rate or predicted time of extinction to small changes in parameters?

31 of 107

Age Structured Models

The Figure presents a state diagram for the problem with the states denoting ages (year 1, 2, or 3) of the bird.
The information indicates that an age-structured model might be appropriate.
In age-structured models we ignore the impact of other factors, such as population density and environmental conditions.
We can use such models to answer questions about the rate of growth of the population and the proportion of each age group in a stable age distribution.

5x₂(t) + 4x₃(t) = x₁(t + 1)

For the example in the previous section, three clear age classes emerge, one for each year. Thus, in formulating this deterministic model, we employ the following variables: x_i= number of females of such a bird at the beginning of the breeding season in year i (age i – 1 to i) of life, where i = 1, 2, or 3.
Thus, x₁ is the number of eggs and young birds in their first year of life. Time, t, of the study is measured in years immediately before breeding season, and we use the notation x_i(t) to indicate the number of year i females at time t.
For example, x₂(5) represents the number of females during their second year, ages 1 to 2 yr old, at the start of breeding season 5. Some of these survive to time t + 1 = 6yr and progress to the next class, those females in their third year of life. At that time (at time 6yr of the study), the notation for number of year 3 females is x₃(6).

32 of 107

Age Structured Models

To establish equations, we use these data to project the number of female birds in each category for the following year.
The number of eggs/chicks depends on the number of adult females, x₂ and x₃.
Because on the average a year 2 (ages 1 to 2 years old) mother has 5 female offspring and a year 3 (ages 2 to 3 years old) mother has 4 female offspring, the number of year 1 (ages 0 to 1 year old) female eggs/chicks at time t + 1 is as follows:

5x₂(t) + 4x₃(t) = x₁(t + 1)

However, at time t + 1, the number of year 2 (ages 1 to 2 years old) females, x₂(t + 1), depends only on the number of year 1 (ages 0 to 1 year old) females this year, x₁(t), that live.
The latter survives with a probability of P₁ = 15% = 0.15, so that we estimate next year’s number of year 2 females to be as follows:

0.15x₁(t) = x₂(t + 1)

Similarly, to estimate the number of year 3 (ages 2 to 3 years old) females next year, we need to know only the number of year 2 (ages 1 to 2 years old) females, x₂(t), and their survival rate (here, P₂= 50% = 0.50). Thus, the number of year 2 females next year will be approximately the following:

0.50x₂(t) = x₃(t + 1)

33 of 107

Age Structured Models

34 of 107

Age Structured Models

35 of 107

Age Structured Models - Leslie Matrices

L is an example of a Leslie matrix, which is a particular type of projection matrix, or transition matrix.
Such a square matrix has a row for each of a finite number (n) of equal-length age classes.
Suppose F_i is the average reproduction or fecundity rate of class i; and P_i is the survival rate of those from class i to class (i + 1).
With x_i(t) being the number of females in class i at time t, x₁(t) is the number of females born between time t – 1 and time t and living at time t.
The model has the following system of equations:

36 of 107

Age Structured Models - Leslie Matrices

Therefore, the corresponding n × n Leslie matrix is as follows:

With x(t) being the population female distribution vector at time t, (x₁(t), x₂(t), . . ., x_n(t)), and x(t + 1) being the female distribution vector at time t + 1, (x₁(t + 1), x₂(t + 1), . . ., x_n(t + 1)), both expressed as column vectors, we have the following matrix equivalent of the system of Equations:

Lx(t) = x(t + 1)

37 of 107

Age Structured Models – Age Distribution over Time

39 of 107

Age Structured Models – Age Distribution over Time

40 of 107

Age Structured Models – Projected Population Growth rate

Thus, with each age group ultimately changing by a factor of 1.0216 = 102.16% annually, eventually we estimate the population will increase by 2.16% per year.
Thus, for an initial total population of P₀, the estimated population at time t is

P = P₀(1.0216)^t.

With an annual increase in population of 2.16% per year and, correspondingly, λ = 1.0216 > 1, we expect that this bird population will increase with time. Had the population been projected to decline each year with 0 < λ < 1, we would expect the birds eventually to become extinct.

A value of λ = 1 would signal a stable population in which, on the average, an adult female produces one female offspring that will live to adulthood. Thus, λ is an important concept related to the stability of a population.

41 of 107

We also have the following physical interpretation for λ .
λ < 1 means the population will decline exponentially.
λ > 1 means the population will grow exponentially.
λ = 1 means the population is stable, it does not change.

42 of 107

Matrices are frequently used in all the biological fields as organizers of numerical information since they greatly increase the ease and efficiency of analyzing data.
One of the most common uses of matrix algebra in the biological sciences is in the area of population dynamics where matrix models have been used to simulate population growth since the early 1940s.
In this module we will describe the matrix methods developed by Leslie [l] and Lewis [2] to describe age-dependent population growth.
Using Eigenvectors to Find Steady State Population Flows

45 of 107

In which Lx = lamda x for some vector x # 0.

In other words, if we premultiply some particular nonzero x by lamda we only change the magnitude of the resultant vector and not its direction. The nonzero vector x is called an eigenvector of the matrix.

46 of 107

Age Structured Models – Eigen Value and Eigen Vector

47 of 107

Using mathematics, which we do not show here, or a computational tool, we can obtain three eigenvalues and three corresponding eigenvectors for the 3 × 3 matrix L (Table 13.3.3). Two of the three eigenvalues are imaginary numbers, and the corresponding two eigenvectors contain imaginary numbers.
The eigenvalue with the largest magnitude (for real numbers, the largest absolute value), 1.0216, is the dominant eigenvalue and is the projected annual growth rate associated with the Leslie matrix. Leslie matrices always have such a unique positive eigenvalue.

49 of 107

Age Structured Models – Eigen Value and Eigen Vector

Thus, multiplication of the population distribution vector by the constant 1.0216 is identical to the product of the Leslie matrix by the distribution vector. λ is an eigenvalue for the matrix L, and v is a corresponding eigenvector for L

50 of 107

Stage Structured Models

In Age-structured model, we distinguish life stages by age, is in fact a special case of a stage-structured model, where we divide the life of an organism into stages.

Frequently, it is convenient or necessary to consider the life of a species in stages instead of equally spaced time intervals, such as years.

Perhaps the animal, such as a loggerhead sea turtle, typically lives for a number of years, and we cannot accurately determine the age of an adult.
Conceivably the stages differ greatly in lengths of time.
Also, rates may be strongly associated with developmental stages or animal size..

51 of 107

Stage Structured Models

A lionfish goes through three life stages: larva (L, about 1 month), juvenile (J, about 1 year), and adult (A). With 1 month being the basic time step, the probability that a larva survives and grows to the next stage is G_L= 0.00003, while the probability that a juvenile survives and remains a juvenile in a 1-month period is P_J = 0.777.
In 1 month, G_L= 0.071 of the juveniles mature to the adult stage, while P_A = 0.949 of the adults survive in a month.
Only adults give birth, and the number of female larvae she produces per month is R_A= 35,315. The Figure presents a state diagram for these circumstances.

52 of 107

Stage Structured Models

Thus, if x_L(t), x_J(t), and x_A(t) represent the number of female larvae, juveniles, and adults at time t, respectively, we have the following system of equations for the distribution at time t + 1:

53 of 107

Stage Structured Models

Using these values, the lionfish monthly growth rate (λ) is about 1.13.
Because adult lionfish reproduce monthly over the entire year, adult survivorship has a great impact on the population’s growth rate.
With all else being the same, not until the probability of an adult surviving in a 1-month period is reduced approximately 30%, from P_A = 0.949 to P_A = 0.66 or less, could a negative population growth be achieved.
Harvesting 30% of the adult lionfish each month is quite a challenge. However, simultaneous reductions of 17% for the probabilities of juvenile and adult survivorship could also produce a declining population.
Thus, “results indicate that an eradication program targeting juveniles and adults jointly would be far more effective than one targeting either life stage in isolation” (Morris et al. 2011).

54 of 107

Matrix Models - Algorithms

For an age-structured or a stage-structured problem, we form the appropriate matrix, L, and the vector representing initial female population distribution, x, and determine the distribution at time t by calculating L^tx.
For the projected population growth rate, we calculate the eigenvalue, λ, which is available through a command with many computational tools.
When it is not available, to estimate the projected population growth rate to within m decimal places, we keep calculating the ratio of age distributions, x(t + 1)/x(t), until two subsequent ratios differ by no more than 10^-^m.
For the preceding example with birds, to estimate population growth to within four decimal places, we consider any one of the components, say the first, of x(t + 1)/x(t) = L^t⁺¹x / L^tx.
After repeated calculation, we discover with x(15)/x(14) = L¹⁵x/L¹⁴x = (1.02153, 1.02173, 1.02136) and x(16)/x(15) = L¹⁶x/L¹⁵x = (1.02162, 1.02153, 1.02173) that the first components are sufficiently close to each other:

| 1.02153 – 1.02162 | = 0.00009 < 10^-4 = 0.0001

55 of 107

Sensitivity Analysis – Age Structured Models

We can use sensitivity analysis to examine how sensitive values, such as long-term population growth rate (dominant eigenvalue λ) or predicted time of extinction, are to small changes in parameters, such as survivability and fecundity.
Suppose in the preceding bird example, we wish to examine the sensitivity of the long-term population growth rate to small changes in survivability of year 1 and year 2 birds, P₁ and P₂, respectively.
Adjusting P₁ and P₂ individually by ±10% and ±20%, The table shows the corresponding new values of λ and the change in projected population growth rate, λ_new – λ, as calculated using a computational tool.
Relative sensitivity or sensitivity of λ with respect to P_i measures the numeric impact on λ of a change in P_i, or the instantaneous rate of change of λ with respect to P_i(the partial derivative of λ with respect to P_i, ∂λ/∂P_i).
Thus, to approximate this relative sensitivity, we divide the change in projected population growth rate by the corresponding small change in Pi:

56 of 107

Sensitivity Analysis

Table 13.3.4, we see that λ is most sensitive to changes in survivability of year 1 birds, P1.
This analysis indicates that conservationists might concentrate their efforts on helping eggs and nestlings survive.

57 of 107

Using sensitivity analysis, Morris et al. (2011) determined that lionfish population growth λ is very sensitive to lower-level mortality parameters of larval, juvenile, and adult mortality and is “most sensitive to the lower-level parameter of larval mortality.”

Sensitivity Analysis – Stage Structured Models

Applicability of Leslie and Lefkovitch Matrices

Leslie or Lefkovitch matrices are appropriate to use when we can classify individuals in a species by age or stage, respectively.
The dynamics of the populations are based only on the females, and an adequate number of males for fertilization is assumed.
The models in this module accommodate only population growths that do not depend on the densities of the populations so that the fecundity and survival rates remain constant.

58 of 107

Need for High-Performance Computing

Typically, a Leslie matrix is small enough so that high-performance computing (HPC) is unnecessary to model the long-term situation for one type of animal.
However, one species might be a small part of a much bigger network of other species of animals and plants and their environment. Execution of models of such larger problems involves extensive computation that can benefit from HPC.
For example, PALFISH is a parallel, age-structured population model for freshwater small planktivorous fish and large piscivorous fish, structured by size, in south Florida.
The model contains 111,000 landscape cells, with each cell corresponding to a 500-m by 500-m area and containing an array of floating-point numbers representing individual fish density of various age classes.

59 of 107

Need for High-Performance Computing

Researchers reported a significant improvement in runtime of PALFISH over the corresponding sequential version of the program. The mean simulation time of the sequential model was about 35 hours, while the parallel version with 14 processors and dynamic load balancing was less than 3 hours (Wang et al. 2006).
Another use of HPC can be in parameter sweeping, or executing a model for each element in a set (often a large set) of parameters or of collections of parameters.
We can obtain a better overall picture of the model’s behaviour, determine the relationships among the variables, find variables to which the model is most sensitive, find ranges where small variations in parameters cause large output changes, locate particular parameter values that satisfy certain criteria, and ascertain variables that might be eliminated to reduce model complexity (Luke et al. 2007).

60 of 107

Stage Structured Models

In Age-structured model, we distinguish life stages by age, is in fact a special case of a stage-structured model, where we divide the life of an organism into stages.
Frequently, it is convenient or necessary to consider the life of a species in stages instead of equally spaced time intervals, such as years.
Perhaps the animal, such as a loggerhead sea turtle, typically lives for a number of years, and we cannot accurately determine the age of an adult.
Conceivably the stages differ greatly in lengths of time. Also, rates may be strongly associated with developmental stages or animal size.
Morris, Shertzer, and Rice generated a stage-structured model of the Indo-Pacific lionfish Pterois volitans to explore control of this invasive and destructive species to reef habitats (Morris et al. 2011).
Such consideration is very important because in a Caribbean region study, Albins and Hixon found lionfish reduced recruitment of native fishes (addition of new native fishes) by an average of 79% over a 5-week period (Albins and Hixon 2008).

61 of 107

However, the larvae have venomous spines, probably making them less appealing prey than many of the native reef fish.

62 of 107

Introduction – Markov Chains

To the Navy and shipping companies around the world, barnacles can be a real problem.
How can such a small animal be so despised by so many? Though seemingly insignificant, barnacles are one of the main causes of fouling of ship hulls.
Growing on the submerged hull surfaces, they interfere with the smooth movement of ships through the water, causing ships to use more fuel, which adds up to tremendous costs.
Millions of dollars have been expended to find ways to eliminate or at least greatly inhibit attachment. Ship owners have tried various types of paints, but many of them leach toxic compounds into the water.
Recently, researchers have developed some nontoxic coatings, which help to change the mechanical properties of the hull surface, so that barnacle larvae and other fouling organisms are less likely to attach.

63 of 107

Introduction

Barnacles are filter feeders that occur in large numbers in their communities.
They form hiding places for small animals and serve as food for others. Their role, or niche, is interwoven into the community structure and function, and their loss might have serious ramifications.
Each species is integrated so that it has multiple interactions with other community members and the environment.
The extinction of a barnacle species would certainly affect other constituents of the ecosystem (“Barnacles” 2012; Cornell University 2003; Natural History Museum 2012; Stout 2009).
Therefore, understanding the effects of losses in diversity is and will continue to be critical to the implementation of judicious conservation policies, but that understanding is problematic in the multifarious, natural ecosystems.
Mathematical models may offer us an effective approach to estimating the impact of species losses to a community.
For this type of study, we can employ Markov chain models (MCM), which are based on the probability of passing from one state to another.

64 of 107

Problems - From Psychology to Genetics

Besides predicting effects of species loss to a community, Markov chain models are useful in quite a variety of problems, from predicting the behaviour of animals to locating genes in the DNA.
In this unit, we start with a problem from psychology in which we have observed the various activities of an animal and the likelihood of moving from one pursuit to another.
Using this information, with MCMs we can estimate the average amount of time a typical animal spends performing each endeavour and, given the activities of a group of animals, predict their behaviour in the near future.
We can employ this same to pursue vastly different problems in genetics.

65 of 107

Probability Theory – A Primer

Markov chain models involve matrices in which all the elements are probabilities, so let us start with a brief introduction to probability theory.
The probability of an event, or the occurrence of something, is a number between 0 and 1, inclusive, indicating the chance of the event happening. A probability of 0 means that the event can never occur, while 1 says that that the situation is always true.
As an example, suppose a certain kind of seed has a 50–50 chance of germinating. Thus, the probability or chance of germinating is P(germinating) = ½ = 0.5 = 50%.
For each seed, one of two events can occur, germination or no germination; and the results are equally likely to occur. We expect that if we observe many seeds, about half the seeds will germinate.

66 of 107

The sum of all the possible events for a situation, such as germinating and not germinating, sums to 1.
If a seed has only a 30% chance of germinating, P(germinating)= 0.3, then it has a 70% chance of not germinating:

P(not germinating) = 1 – P(germinating) = 1 – 0.3 = 0.7

Suppose an ant is equally likely to go in any one of eight directions: N, NE, E, SE, S, SW, W, or NW.
For example, P(N) = 1⁄8 and P(S) = 1⁄8. The probability that the ant will move in the north or south direction is P(N or S) = P(N) + P(S) = 1⁄8 + 1⁄8 = 1⁄4.
The ant cannot move in two directions at the same time, so moving to the north and moving to the south are mutually exclusive; the events cannot occur at the same time.
If events E1 and E2 are mutually exclusive, then the probability of E1 or E2 is the sum of the probabilities of the individual events, that is,

P(E1 or E2) = P(E1) + P(E2).

67 of 107

To calculate the probability that the ant will go in a northerly (N, NE, NW) or westerly (W, NW, SW) direction, we must subtract the probability of where the events overlap, going NW, as follows:

P(northerly or westerly) = P({N, NE, NW}) + P({W, NW, SW}) - P(NW)

= 3⁄8 + 3⁄8 - 1⁄8 = 5⁄8

We subtract P(NW) to avoid counting that direction twice. The two events, heading in a northerly direction and heading in a westerly direction, are not mutually exclusive.
If events are not mutually exclusive, then for the probability of one or the other we must subtract the probability of overlap from the sum of the probabilities.
Considering again the seeds that have a 30% chance of germinating, suppose we have two seeds, S1 and S2. Each has 0.3 probability of germinating, and the state of one seed has no bearing on the state of the other. We say these events are independent.
Certainly, the probability of both seeds germinating is even less likely than any one germinating. In fact, the probability of S1 germinating and S2 germinating is the product of their individual probabilities:

P(S1 germinating and S2 germinating) = P(S1 germinating) ∙ P(S2 germinating)

= (0.3)(0.3) = 0.09

Only a 9% chance exists of both seeds germinating.

68 of 107

N NE

W E

SW SE

69 of 107

Frequently, we wish to know the probability of one event, E2, given the occurrence of another event, E1. The notation for such a conditional probability is P(E2|E1).
For example, suppose a public health agency wages an aggressive campaign to stop the spread of a particular disease (maybe COVID-19) by trying to quarantine any individual who has come in contact with someone who has the disease.
The probability that an exposed individual is quarantined can be written as a conditional probability, P(quarantined |exposed), the probability of quarantine given exposure.
This quantity is equal to probability of the individual being quarantined and exposed divided by the probability of being exposed:

P(quarantined | exposed) = P(quarantined and exposed)/P(exposed)

For example, suppose in a group of 100 people, 10 have been exposed and 2 have been exposed and quarantined.
Thus, picking an individual at random from the group of 100, we have a 10/100 = 10% = 0.10 chance of selecting an exposed person and a 2/100 = 2% = 0.02 chance of the person being quarantined and exposed.
However, if our selection is only from the subset of 10 exposed people, then the probability of picking one of the 2 individuals who is also quarantined is 2/100 = 0.02 = 2%; the probability that an exposed individual is quarantined is 0.02/0.10 = 0.2 = 20%.

70 of 107

Transition Matrix

We can employ a matrix of conditional probabilities to estimate the long-term behaviour of an animal. For example, the red howler monkey’s primary food is leaves.
Because leaves are hard to digest, the monkey spends about half of its waking hours resting. Resting requires less energy than other activities and gives time for digestion.
Suppose we consider a simplified system where the monkey is in only two states, eating (E) and resting/sleeping (R); and S = {E, R} is the state space, or set of possible states.
Let us consider some hypothetical data. If one state (X_n) of the monkey is eating, then the probability that the state of the monkey 1 hour later (X_n₊₁) is eating is 0.6.
We express this information as a conditional probability, P(X_n₊₁= E | X_n = E) = 0.6.
Because we assume the monkey is either eating or resting at any time, the probability that the monkey is resting 1 hour after eating is P(X_n₊₁= R | X_n = E) = 1 – 0.6 = 0.4.

74 of 107

Conclusion

In general, λ = 1 is always an eigenvalue for the transition matrix of a Markov chain. Moreover, if each of the components of the corresponding eigenvector x is nonnegative and s is the sum of these components, then (1/s)x is the equilibrium vector for T, and this vector is a probability vector.
If we start with a probability vector, v₀, where each component gives the fraction in each corresponding state, such as eating (E) and resting/sleeping (R), then Tⁿv₀ converges to v = (1/s)x as n becomes larger and larger.
Moreover, each coordinate of this equilibrium vector, v, is the ultimate proportion of the corresponding state.

75 of 107

Matrix Models

The Next Flu Pandemic

76 of 107

A Preamble

In USA, every year one can hear the warnings from public health officials to get flu shots.
Some comply, but many do not. In fact, the CDC reported that less than 40% of the U.S. population was vaccinated during the 2008–2009 flu season (CDC 2009).
Influenza can attack any age, race, or sex. It not only makes us feel miserable, it costs millions of days of lost productivity at school or at work.
Although the highest rates of infection are in children, the most severe, even life-threatening effects are on those over the age of 65.

Charlie Bates is a college sophomore who wakes up one morning feeling really bad. He assumes that it is just a hangover. He had a pretty wild night of drinking at his fraternity’s welcome-back party that traditionally begins the spring semester. His head is pounding, and he is exhausted. Charlie feels that he can sleep this one off, so he decides to cut his 9 o’clock economics class. He resets his alarm and rolls over. Four hours later he is distressed to find that he has also slept through his 11:00 government class. Even more disturbing is that he feels even worse. He has never had a sore throat from a hangover, and he is feeling very achy. So, he gets up, dresses, and stumbles over to the campus infirmary. The nurse finds that he has a temperature of 102.5 °F. She thinks he has the flu.

77 of 107

The Problem

Public health organizations worldwide are trying to find ways of effectively blunting the inevitable epidemics/pandemics of this disease. As part of their efforts, officials are using the results of computational science models, which employ computer science and mathematics along with the science, to make informed decisions on how to combat the menace.
Computational scientists model the spread of disease in a number of ways. System dynamics models consider the changing sizes of complex interrelated systems, such as susceptibles, infecteds, and recovereds, as time progresses.
Cellular automaton simulations model reality with 1D, 2D, or 3D grids that change with time. A grid site has a state, such as susceptible, infected, or recovered. Rules, such as an infected person recovers in 5 days, regulate the behaviour of the system.
The results of cellular automaton simulations are challenging to verify, and system dynamics models do not provide some of the specificity that would be helpful in making public health decisions in the face of an epidemic.

78 of 107

The Problem

As Bisset and Madhav (2009) write, “these modelling approaches were limited in their ability to capture the complexity of human interaction that underlies disease transmission.” Grid-based agent-based models, which simulate and visualize the behaviours and interactions of individuals, can overcome some of these limitations; but the restricted movement of an individual from one cell to a neighbouring cell at a discrete time step and the inability for such systems to handle large datasets hinder their utility.
A related modelling technique, individual-based (or network-based) epidemiology simulation, which employs matrices, tracks the simulated behaviour of individuals in a community. Thus, this method provides the desired specificity and is easier to verify, but its simulations incorporate massive amounts of data that require extensive effort to gather and need massive computing power to process.

79 of 107

The Problem

To help meet this challenge, scientists at Los Alamos National Laboratory developed the Epidemiological Simulation System (EpiSims) to simulate the spread of epidemics in a large city at the individual level, considering realistic contacts among the people and characteristics of disease transmission (VBI 2008).
Employing transportation information, census data, and activities surveys from a sample of about 2000 people, researchers have generated hypothetical data that model the movements and demographics of Portland, Ore.
With EpiSims, scientists can study a number of important issues: the efficacy of prevention measures, such as vaccinating particular segments of the population; the value of early detection measures, such as placing fever sensors at high traffic buildings; the effectiveness of public health interventions, such closing schools; and fundamental questions, such as patterns of the spread of disease.

80 of 107

The Problem

Individual-based epidemiology simulations can estimate some of the following metrics:

A smallest set of locations (minimum dominating set) that a given proportion of the population visits. Such information can be helpful in determining sites for fever sensors or in closing of particular public buildings during an epidemic.
The distribution of the number of contacts people have with other people (degree distribution). Targeted vaccination of individuals who have many contacts can offer significantly better results than vaccinating people at random (Mason and Verwoerd 2007).
The probability that two contacts of a randomly chosen person have contact with one another (clustering coefficient; Newman et al. 2002). A large probability indicates that a disease can spread rapidly through a community.
The average smallest number of contacts for a disease to spread from one arbitrary individual to another (mean shortest path length) A small mean also indicates the probable rapid spread of a disease.

81 of 107

Graph Theory – A Primer

Individual-based epidemiology simulations use graphs that represent the contacts between people to predict the spread of disease and to analyze health-care interventions.
A graph is a set of nodes with undirected or directed edges connecting some of the points, as illustrated in Figure in the next slide.
In that contact network, or social network, which is a type of graph, the nodes represent people or groups of people, such as members of a household that can become infected, and places, where the disease can spread from an infected person to a susceptible individual.
Numbers indicate the households, while the places are school, hospital, work, shop, and cloister. Each edge represents an association that can lead to transmission of the disease.
For example, one or more individuals in household-6 go to work, shop, and school, locations where they can contract or spread the disease.

82 of 107

An undirected graph G = (V, E) consists of a set V of vertices (singular, vertex), or nodes or points, and a set E of edges, or arcs, connecting pairs of points. In the graph of the figure,

V = {1, 2, 3, 4, 5, 6, 7, School, Hospital, Work, Shop, Cloister}

We denote an undirected edge e between nodes u and v as (u, v) or as (v, u). Here, (u, v) is not an ordered pair. In the preceding figure, e = (6, Hospital) = (Hospital, 6) is an edge because there is contact between at least one member of household-6 and people who are at the Hospital.
The size of the graph is its number of nodes, so the graph in figure has size 12.
We need to know several terms to speak the language of graph theory. In the figure points 6 and Hospital are adjacent because there is an edge, e, connecting them. We say that edge e, which can be written as (6, Hospital) or as (Hospital, 6), is incident to points 6 and Hospital.
Vertex Cloister is isolated, having degree 0, or no incident lines, while point 5 has degree 2, or is the endpoint of two edges.

83 of 107

Graph Theory – Biological Networks and Social Networks

Much work on the structural properties of biological networks, such as social networks, has focused on the distribution of degrees. If n is the number of nodes in a network and n_k is the number of nodes of degree k, then the degree distribution is P(k) = n_k/n, which is the proportion of nodes having degree k, for k = 0, 1, 2, ….
For example, in the previous figure, which is a graph of size n = 12, one node (Cloister) has degree 0 and one node (node 7) has degree 1, so P(0) = P(1) = 1/12.
With five nodes (nodes 1, 2, 3, 4, and 5) having degree 2, five-twelfths of the nodes (P(2) = 5/12) are incident to two nodes.
Studies of this and many other biological networks, such as the central metabolic networks of 43 organisms and protein interaction networks for various organisms, have degree distributions that appear to follow power laws. A function f follows a power law if f(x) is proportional to x^b for some constant b; that is,

f(x) ∝ x^b, or f(x) = cx^b, for some constants c and b.

In the case of many biological networks, the degree distribution P(k) is proportional to k^-^r for some constant r.

84 of 107

Specifically for metabolic networks, P(k) ∝ k^–^r for 2 < r < 3, or P(k) = ck^–^r for 2 < r < 3 and some constant c.
A degree distribution following this power law implies that nodes with small degree are extremely common, while nodes with large degree are quite rare.
The figure on the next slide shows the graph of P(k) = k^–2.5with the typical broad-tail, or long, stretched-out portion to the right, of such a power law form.
Networks that follow the power law P(k) ∝ k^–r with r > 1 are also called scale-free networks.
Interestingly, the Internet is a scale-free network.
In a scale-free network, most nodes have relatively low degree, but a few nodes, called hubs, have high degrees.
Removal of hub nodes can easily result in the network being disconnected.

85 of 107

Thus, scale-free networks are particularly vulnerable to attack and failure at the hubs.
Biologists have suggested that in a genetic or protein network, a hub node, which is a gene or protein that participates in a large number of interactions, may be more significant for the survival of an organism than nodes that have small degrees (Mason and Verwoerd 2007).
In a social network that is scale-free, a hub location, which has numerous visitors each day, is a prime site for the spread of disease.

86 of 107

Graph Theory – Paths

Paths through the contact network in the figure can help illuminate the epidemiology of the disease. Suppose initially someone in household-1 has the disease.
One way for someone from household-5 to contract the disease indirectly from household-1 is by the path: 1 – Shop – 6 – School – 5.
Someone from household-1 goes shopping, infecting someone at the shop. Likewise, an individual from household-6 goes shopping and catches the disease. That person or someone in household-6 who becomes ill from contact with that individual goes to school, spreading the disease further.
An individual from household-5 also attends the school and contracts the disease there. With four edges along this path, we say the path length is 4.

87 of 107

Small World Property

In general, for biological networks, such as protein, gene, or metabolic networks, the average length of a path between nodes is small in comparison to the size of the graph. Thus, we say such networks exhibit the small-world property.
Specifically, if such a graph has n nodes, by definition the average shortest-path length is on the order of magnitude of log n or smaller.
For example, metabolic networks have between 200 and 500 metabolites (nodes), but the average path length is between 3 and 5.
Genetic networks contain about 1000 genes (nodes) and 4000 interactions (edges), with an average path length of 3.3.
Because average path length indicates how readily the network can communicate information, biological networks are efficient communicators.
For instance, a metabolic network needs few interactions for one metabolite to influence the behaviour of another metabolite (Mason and Verwoerd 2007).

88 of 107

The graph shown in figure (bottom-left) is a subgraph of the graph (top-left). This is because every node and every edge of the subgraph is present in the graph (top-left).
The subgraph of top-left graph includes every node and edge except Cloister is connected because there is a way to get from any point to any other point in that subgraph by following edges.
Thus, the disease has the potential of spreading to every node in the subgraph. Suppose household-6 of Figure (bottom-left) represents a family of two parents and three children in which every member of the family has contact with every other member shown in graph (right).
The graph on right illustrates the graph of this household-6, which is complete, having every point (vertex or node) connected to every other point directly by exactly one edge. An ill member of the household will expose everyone in the house.
In the graph of Figure on right, each of the five points has four incident edges connecting that node to the remaining nodes. In other words, each node has degree 4.
Therefore, the sum of the degrees of all the points is (5)(4) = 20. Because in summing the degrees we count each edge twice, once for each endpoint, the number of edges in this complete graph is half the sum, (5)(4)/2 = 20/2 = 10.
In general, a complete graph with n points has n(n – 1)/2 edges.

Graph Theory – Clustering

90 of 107

The concept of completeness is central to the calculation of clustering coefficients, a measure of how rapidly a disease can spread.
The clustering coefficient for a vertex, v, is the probability that two nodes adjacent to v are themselves adjacent. That is, the chance that two arbitrary edges incident to v are part of a triangle of edges in the graph is the clustering coefficient for v.
Thus, if A is the set of nodes adjacent to v, then the clustering coefficient is the quotient of the number of edges in the subgraph with points from A and the number of edges in a complete graph with that number of points.
If a node has degree zero or one, then its clustering coefficient is 0. The clustering coefficient of v indicates of how close v and its adjacent nodes are to being a complete graph.
For example, node 2 in the figure is adjacent to 4 nodes, nodes 1, 3, 4, and 5. Three edges appear in the subgraph within these adjacent nodes, V = {1, 3, 4, 5}.
However, a complete graph with four nodes has (4)(3)/2 = 6 edges. Thus, the clustering coefficient for node 2 is 3/6 = 0.5.
A 50% probability exists that two neighbours of node 2 are themselves adjacent. One research project calculated the clustering coefficient of people using the Internet to be 0.1078; but with an edge connecting two actors if they were in a movie together, the clustering coefficient of movie actors is significantly higher, 0.79 (Eggemann and Noble 2008).

Typically, a small-world network not only has a small mean path length, but it also has a large mean clustering coefficient. With these characteristics, disease can spread rapidly in social networks.

91 of 107

Bipartite Graphs

The adjoining figure shows a contact network of Wards A, B, and C in a psychiatric hospital with health-care workers, indicated by numbers.
This bipartite graph has its vertices split into two sets, a set of wards and a set of workers, with edges only between vertices in different sets. As another example, a metabolic network is a directed bipartite graph.
Its nodes are partitioned into the set of metabolites and the set of reactions catalyzed by the metabolism’s enzymes.

92 of 107

Matrix Representation of Graphs

For a graph to be manipulated in a computer, its structure must be stored in some convenient manner.
Often, we use one- and two-dimensional arrays, or vectors and matrices, respectively, to represent graphs. Such arrays, depending on the representation, can specify such information as adjacent nodes, data stored at nodes or along edges, and existence of paths between nodes.
We will be using adjacency matrices and connection matrices, which can store graphs where we are not interested in the values at the nodes but only in the graphs themselves. An associated vector can store nodal values.
In the adjacency matrix, A, with indexing beginning at 1, the element in row i and column j, a_ij, indicates the number of edges between node i and node j for i and j = 1, 2, 3, 4. For instance, because two edges connect points 1 and 4 in the graph, elements a₁₄ and a₄₁ are both 2.
As the following matrix illustrates, the adjacency matrix (left) for an undirected graph is symmetric about the diagonal.
A connection matrix, C, indicates only existence of an edge from one point to another, not the number of such edges. As the following connection matrix (right) illustrates, if at least one edge exists between node i and node j, then the element in row i and column j, c_ij, is 1; while the value is 0 otherwise

93 of 107

People-Location Graphs

Suppose we have a file of activities of people in an area, where each record includes, among other data, an identification number for a person (personID) and an identification number for a location (locationID) that person visited during a day.
For example, suppose the file contains the following person-location pairs:

(7, 2938), (7, 27618), (7, 2938)
(8, 2938), (8, 6270), (8, 21032), (8, 2938), (8, 15370), (8, 2938)
(9, 10628), (9, 29740), (9, 10628)
(18, 2938), (18, 5212), (18, 2938), (18, 19815), (18, 2938)

Thus, the person with ID 7 started out at location with ID 2938, presumably home; travelled to location 27618, perhaps work; and then returned home.
The person with ID 8, who lived in the same home (2938), probably did a couple of errands in the morning (to locations 6270 and 21032), returned home for lunch, and made another trip (to 15370) in the afternoon before returning home for dinner.

95 of 107

People – Location Graphs: Algorithms

96 of 107

People – Location Graphs: Minimal Dominating Set

During an epidemic or pandemic, public health department might want to quickly place fever detectors or vaccination sites in a relatively small set of locations to which a high percentage of people travel.
With realistic contact patterns from a community, determination of such a minimum dominating set is in the realm of graph theory and computer science.
To compute a minimum dominating set, or a smallest set of locations that a given proportion of the population visits, we can use the FastGreedy Algorithm to obtain an approximation.
First, we arrange the locations in decreasing order of degree in the people-location matrix.
With the FastGreedy Algorithm, we keep selecting locations from largest degree down until a given population percentage has visited the set of selected locations (Eubank et al. 2004).
Thus, if we want to “dominate” 75% or less of the people, which in this case is less than or equal to three people, we need to select only the first location, 2938, from our list.
Fast-Greedy is called a greedy algorithm because at each iteration, the most advantageous near-term choice is picked.

97 of 107

The Fast-Greedy Algorithm

For the design of the FastGreedy Algorithm, we break the technique into several smaller functions. First, we define a function, degLocation, to calculate the degree of a location index in the people-location graph. For example, in the graph of with connection matrix, connMat, location 2938 with index 1 has degree 3 because 3 people visit that location. Thus, degLocation(connMat, 1) returns 3.
We must sort the location degrees while continuing to associate the location index with its degree. Thus, we form a list of ordered pairs of location indices and their degrees. A function, sortSecond, sorts this list by its second coordinates.
Depending on the sorting algorithm, the function might return the following list for the graph:

{(1, 3), (9, 1), (8, 1), (7, 1), (6, 1), (5, 1), (4, 1), (3, 1), (2, 1)}

The order of the first coordinates (location indices) is unimportant, but the order of the second coordinates (location degrees) must be nonincreasing.

100 of 107

Degree Distribution

For the distribution of the number of contacts people have with other people (degree distribution), we first generate a connection matrix for a people-people graph associated with a social network, such as the people-location graph.
For simplicity, we ignore timing and assume two people are adjacent if they visited the same location in a day.
Thus, in the adjacent figure is the people-people graph and the associated connection matrix.
We define a function, peopleToPeople, to return the connection matrix (connPeopleMat) for a people-people graph.
Going through every column of the people-location connection matrix, connMat, we go down each column looking for 1s. For every 1, we look through the rest of the column for 1s.
The people corresponding to these indices are adjacent. For example, connMat has 1 in element connMat(1, 1). Because connMat(1, 2) is also 1, we define the 1–2 and 2–1 elements of connPeopleMat, connPeopleMat(1, 2) and connPeopleMat(2, 1), to be 1.
That is, people with indices 1 and 2 are adjacent. Similarly, in the developing symmetric matrix, we assign 1 to connPeopleMat(1, 4) and connPeopleMat(4, 1).
Then, with 1 in the first column, second row of connMat and another 1 in the last row, we define connPeopleMat(2, 4) and connPeopleMat(4, 2) as 1. Because the matrix is symmetric, we have already indicated that people with indices 2 and 1 are adjacent.

101 of 107

To obtain a degree distribution, we have a function, degPersonPPG, to calculate the degree of each node in the person-person graph.
For the connection matrix connPeopleMat in Figure 13.5.11 with graph in Figure 13.5.10, degPersonPPG(connPeopleMat, 1) = degPersonPPG(connPeopleMat, 2) = degPersonPPG(connPeopleMat, 4) = 2, while degPersonPPG(connPeopleMat, 3) = 0.
That is, the people with IDs 7, 8, and 18 and indices 1, 2, and 4, respectively, have degree 2. ID 9 (with index 3) is isolated because that person does not go anywhere the other 3 people visit.
Finally, we define a function, pLst, that returns the a list of ordered pairs, (k, P(k)), of a degree, k, and the corresponding degree distribution value, P(k).
For k = 0, 1, 2, . . ., maximum(degree), P(k) = n_k/n, where n_k is the number of nodes of degree k and n is the total number of nodes in a people-people graph. Thus, for connection matrix, pLst(connPeopleMat) returns {(0, ¼), (1, 0), (2, ¾)}.
For the n = 4 nodes, one-fourth of the nodes have degree 0; no nodes have degree 1; and three-fourths of the nodes have degree 2.

104 of 107

Clustering Coefficient

The clustering coefficient of a person provides an indication of the rapidity with which a disease can spread among his or her associates, and the mean clustering coefficient is a metric on local connectivity of a population.
To determine the clustering coefficient of a node with a given index in a connection matrix, we start by defining a function, adjacentPeople, to return a list of indices of its adjacent nodes.
Another function, numPeopleEdges, returns the number of edges in a subgraph with a given collection of vertex indices.
Thus, for this function, we count the number of ones in the connection matrix with row and column indices in this set. Because each edge is counted twice, the result is divided by 2.
With these two functions and degPersonPPG, we can define a function, clusteringCoeff, to return the clustering coefficient of a vertex, given its index.
Using clusteringCoeff, we determine that the mean clustering coefficient for our very small people-people graph in Figure 13.5.11 is the high value of 0.75.

106 of 107

Assessment of a Model

Clearly, the results involving 1000 randomly selected people and their activities from the NDSSL synthetic data set are more realistic than the illustrative example with 4 people.
Although the FastGreedy Algorithm does not necessarily yield a minimum dominating set even for the larger data set above, Eubank et al. (2004) proved that the technique does provide a fast method for obtaining a good dominating set.
Thus, public health officials can use the results to designate which locations should have fever-detection sensors or should close during epidemics.
The degree distribution of the larger data set with fitted function f(k) = –0.0219242 + 0.259918k^–1.2does approximate the power law, P(k) = ck^–^r, common for scale-free networks.
The shape reveals only a few nodes, in this case 8 out of 1000, or 0.8% of the population, have degree 6 or more. Using further demographic information, public health officials might target such “well-networked” people for immediate vaccination.
Moreover, with simulations, they can investigate the impact of this and other vaccination policies on combating the spread of influenza. The mean clustering coefficient of 0.118524 for our group of 1000 individuals is indicative of small-world networks, which exhibit higher cliquishness of an average neighbourhood.
By contrast, random graphs with about 300 to 5000 nodes have clustering coefficients of approximately 0.05 to 0.005, respectively, exhibiting almost no clustering (Watts and Strogatz 1998).

107 of 107

Computing Power

As Bisset and Marathe (2009) state, “These far more complex network-based models present a new set of computational challenges that require the use of high-performance computing.” Several reasons exist for requiring this power:

Social networks are very large, have unstructured shapes, and change frequently. Thus, computing systems require enormous storage to hold and significant power to manipulate the data.
Because these models are stochastic, involving chance, scientists must execute a simulation with various intervention scenarios and compliance rates and must replicate each simulation run many times to obtain meaningful results on which to base policy recommendations.
Incorporation of the diverse characteristics of people and their activities is important in studying disease propagation in space and time.

1 of 107

2 of 107

3 of 107

4 of 107

5 of 107

6 of 107

7 of 107

8 of 107

9 of 107

10 of 107

11 of 107

12 of 107

13 of 107

14 of 107

15 of 107

16 of 107

17 of 107

18 of 107

19 of 107

20 of 107

21 of 107

22 of 107

23 of 107

24 of 107

25 of 107

26 of 107

27 of 107

28 of 107

29 of 107

30 of 107

31 of 107

32 of 107

33 of 107

34 of 107

35 of 107

36 of 107

37 of 107

38 of 107

39 of 107

40 of 107

41 of 107

42 of 107

43 of 107

44 of 107

45 of 107

46 of 107

47 of 107

48 of 107

49 of 107

50 of 107

51 of 107

52 of 107

53 of 107

54 of 107

55 of 107

56 of 107

57 of 107

58 of 107

59 of 107

60 of 107

61 of 107

62 of 107

63 of 107

64 of 107

65 of 107

66 of 107

67 of 107

68 of 107

69 of 107

70 of 107

71 of 107

72 of 107

73 of 107

74 of 107

75 of 107

76 of 107

77 of 107

78 of 107

79 of 107

80 of 107