The Poisson distribution is the appropriate distribution for the number of events observed if these events occur independently in continuous time at a constant instantaneous probability rate (or incidence rate); see, for example, Clayton and Hills. The variance of the Poisson distribution is not constant, but equal to the mean. Unlike the normal distribution, the Poisson distribution has no separate parameter for the variance, and the same is true of the binomial distribution. For the normal distribution, the deviance is simply the residual sum of squares. These residuals may be used to assess the appropriateness of the link and variance functions.
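The statement that the Poisson variance equals the mean can be checked numerically. The following is a minimal sketch in Python (not part of the original text, which uses Stata); the function names and the truncation point `max_k` are assumptions made for illustration.

```python
import math

def poisson_pmf(k, mu):
    """Probability of observing k events when the expected count is mu (mu > 0)."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

def poisson_mean_and_variance(mu, max_k=200):
    """Approximate the mean and variance by summing over k = 0..max_k.

    max_k is an illustrative truncation point; it must be large relative to mu.
    """
    probs = [poisson_pmf(k, mu) for k in range(max_k + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

for mu in (0.5, 3.0, 12.0):
    mean, variance = poisson_mean_and_variance(mu)
    print(f"mu={mu}: mean={mean:.4f}, variance={variance:.4f}")
```

For each value of `mu` the two printed quantities should agree up to rounding, which is the mean-variance relationship the variance function of a Poisson model encodes.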
A relatively common phenomenon with binary and count data is overdispersion, i.e., the variance is greater than that implied by the assumed distribution. A more pragmatic way of accommodating overdispersion in the model is to assume that the variance is proportional to the variance function, but to estimate the dispersion rather than assuming the value 1 appropriate for the distributions.
This is analogous to the estimation of the residual variance in linear regression models from the residual sum of squares. If the variance is not proportional to the variance function, robust standard errors can be used (see next section). This approach of assuming a variance function that does not correspond to any probability distribution is an example of quasi-likelihood; see also Chapter 9.
In maximum likelihood estimation, the standard errors of the estimated parameters are derived from the Hessian of the log-likelihood. However, these standard errors are correct only if the likelihood is the true likelihood of the data. Another approach to estimating the standard errors without making any distributional assumptions is bootstrapping (Efron and Tibshirani). In Monte Carlo simulation, the required samples are drawn from the assumed distribution.
Bootstrapping works as follows. Draw a sample of the same size as the original data set, with replacement, from the observed data and estimate the parameters from it. Repeat this a number of times to obtain a sample of estimates. From this sample, estimate the variance-covariance matrix of the parameter estimates. See Manley and Efron and Tibshirani for more information on the bootstrap.

The syntax is analogous to logistic and regress except that the options family and link are used to specify the probability distribution of the response and the link function, respectively.
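The bootstrap procedure described above can be sketched in a few lines. This is an illustrative Python version, not the Stata implementation used in the text; the data, the least-squares estimator, and the names `fit_line` and `bootstrap_covariance` are all assumptions made for the example.

```python
import random
import statistics

def fit_line(pairs):
    """Least-squares (intercept, slope) for a list of (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return (y_bar - slope * x_bar, slope)

def bootstrap_covariance(data, estimator, reps=500, seed=1):
    """Variance-covariance matrix of the estimates over bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated exactly
    n = len(data)
    estimates = []
    for _ in range(reps):
        # Resample n observations with replacement, then re-estimate.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    p = len(estimates[0])
    means = [statistics.fmean(e[j] for e in estimates) for j in range(p)]
    return [
        [sum((e[i] - means[i]) * (e[j] - means[j]) for e in estimates) / (reps - 1)
         for j in range(p)]
        for i in range(p)
    ]

# Hypothetical data, invented purely for illustration.
data = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 4.8), (5, 6.1),
        (6, 6.8), (7, 8.3), (8, 8.7), (9, 10.2), (10, 10.9)]
cov = bootstrap_covariance(data, fit_line)
std_errors = [cov[j][j] ** 0.5 for j in range(2)]
print("bootstrap standard errors (intercept, slope):", std_errors)
```

The square roots of the diagonal elements are the bootstrap standard errors. Fixing the seed plays the same role as setting the seed of the pseudorandom number generator before a bootstrap run.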
In Chapter 3, the U. The dispersion parameter represents the residual variance given under Residual MS in the analysis of variance table of the regression analysis in Chapter 3. The data are read using infile cond status resp using slim. The logistic regression analysis of Chapter 6 may also be repeated using glm.
This table suggests that the variance associated with the Poisson distribution is not appropriate here, since squaring the standard deviations to get the variances results in values that are greater than the means, i.e., there is overdispersion. McCullagh and Nelder use the Pearson X² divided by the degrees of freedom to estimate the dispersion parameter for the quasi-likelihood method for Poisson models. These terms will be removed from the model.
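The dispersion estimate based on the Pearson X² can be illustrated with a small Python sketch. It assumes a Poisson model with a single categorical covariate, for which the fitted means are simply the group means; the counts and the function name `pearson_dispersion` are hypothetical and are not taken from the data analysed in the text.

```python
from statistics import fmean

def pearson_dispersion(groups):
    """Pearson X^2 divided by the residual degrees of freedom.

    groups maps a group label to a list of counts. The fitted mean for each
    observation is taken to be its group mean (one-factor Poisson model).
    """
    x2 = 0.0
    n = 0
    for counts in groups.values():
        mu = fmean(counts)
        # Pearson contribution: squared residual over the Poisson variance mu.
        x2 += sum((y - mu) ** 2 / mu for y in counts)
        n += len(counts)
    df = n - len(groups)  # observations minus estimated group means
    return x2 / df

# Hypothetical counts, invented for illustration only.
counts_by_group = {
    "A": [2, 11, 14, 5, 5, 13, 20, 22],
    "B": [6, 6, 15, 7, 14, 28, 0, 5],
}
phi = pearson_dispersion(counts_by_group)
print("estimated dispersion:", round(phi, 3))
```

A value well above 1 indicates overdispersion relative to the Poisson distribution. Under the quasi-likelihood approach, the model-based standard errors would be multiplied by the square root of this estimate.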
We have treated class as a continuous measure. We now look at the residuals for this model. The post-estimation function predict that was used for regress and logistic can be used here as well. There is one large outlier. We now also check the assumptions of the model by estimating robust standard errors. Since glm does not have the robust option, we use the poisson command here. To be on the safe side, we will ask for bootstrap samples using the option reps. First, we set the seed of the pseudorandom number generator using the set seed command so that we can run the sequence of commands again in the future and obtain the same results.
This proneness (called frailty in a medical context) multiplies the rate predicted by the covariates so that some children have higher or lower rates of absence from school than other children with the same covariates. If the frailties are assumed to have a gamma distribution, then the marginal distribution of the counts has a negative binomial distribution. Fit the model using only status as the independent variable, using robust standard errors. How does this compare with a t-test with unequal variances?
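The effect of a gamma-distributed frailty on the counts can be seen in a short simulation. This Python sketch is an illustration only: the base rate, the gamma shape, and the function names are assumptions, and the frailty is scaled to have mean 1 so that it leaves the average rate unchanged.

```python
import math
import random
from statistics import fmean, variance

def poisson_draw(rng, rate):
    """Draw one Poisson count by Knuth's multiplication method (modest rates only)."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_counts(n, base_rate, shape, seed=1):
    """Counts whose Poisson rate is base_rate times a gamma frailty with mean 1."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        # gammavariate(alpha, beta) has mean alpha*beta, so this frailty has
        # mean 1 and variance 1/shape.
        frailty = rng.gammavariate(shape, 1.0 / shape)
        counts.append(poisson_draw(rng, base_rate * frailty))
    return counts

counts = simulate_counts(n=20000, base_rate=4.0, shape=2.0)
print("mean:", round(fmean(counts), 2), "variance:", round(variance(counts), 2))
```

With these settings the mean stays close to the base rate, while the variance is close to the negative binomial value, mean + mean²/shape, and therefore clearly exceeds the mean.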
Sixty-one women with major depression, which began within 3 months of childbirth and persisted for up to 18 months postnatally, were allocated randomly to the active treatment or a placebo (a dummy patch); 34 received the former and the remaining 27 received the latter. The women were assessed pretreatment and monthly for six months after treatment on the Edinburgh postnatal depression scale (EPDS), higher values of which indicate increasingly severe depression.
The data are shown in Table 8.
Table 8. There is a large body of methods which can be used to analyze longitudinal data, ranging from the simple to the complex. Some useful references are Diggle et al. It is useful to begin examination of these data using the summarize procedure to calculate means, variances, etc. To begin, plot the required scatterplot matrix, identifying treatment groups with the labels 0 and 1, using graph pre-dep6, matrix symbol [group] ps. The resulting plot is shown in Figure 8.
The most obvious feature of this diagram is the increasingly strong relationship between the measurements of depression as the time interval between them decreases. This has important implications for the models appropriate for longitudinal data, as seen in Chapter 9. Figure 8. To obtain the other graphs mentioned above, the data need to be restructured from their present wide form to the long form using the reshape command.
Here, the required grouping variable is subj, the y variable is dep and the x variable is visit.
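The restructuring from wide to long form can be illustrated outside Stata. The following Python sketch mimics the idea of the reshape step on a tiny hypothetical data set; the function `reshape_long`, the three-visit layout, and the values are assumptions for illustration, with variable names (subj, group, pre, dep, visit) borrowed from the text.

```python
def reshape_long(wide_rows, stub, time_name):
    """Turn one row per subject into one row per subject and occasion."""
    long_rows = []
    for row in wide_rows:
        # Columns such as dep1, dep2, ... vary over time; all others are fixed.
        varying = {k: v for k, v in row.items()
                   if k.startswith(stub) and k[len(stub):].isdigit()}
        fixed = {k: v for k, v in row.items() if k not in varying}
        for name, value in varying.items():
            occasion = int(name[len(stub):])
            long_rows.append({**fixed, time_name: occasion, stub: value})
    return long_rows

# Hypothetical wide-form data: one row per subject, None marks a missing visit.
wide = [
    {"subj": 1, "group": 0, "pre": 18, "dep1": 17, "dep2": 18, "dep3": 15},
    {"subj": 2, "group": 1, "pre": 25, "dep1": 26, "dep2": 23, "dep3": None},
]
long_data = reshape_long(wide, stub="dep", time_name="visit")

# Sort by treatment group, subject, and visit, as required before plotting profiles.
long_data.sort(key=lambda r: (r["group"], r["subj"], r["visit"]))
for r in long_data:
    print(r)
```

Each subject now contributes one row per visit, with the fixed covariates repeated on every row, which is the layout needed for plotting dep against visit within subj.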
Before plotting, the data need to be sorted by the grouping variable and by the x variable: sort group subj visit. The graph is then produced using graph dep visit, by(group) c(L). The c(L) option connects points only so long as visit is ascending. The remaining points for this subject are, however, connected and so on. Again, the general decline in depression scores in both treatment groups can be seen and, in the active treatment group, there is some evidence of outliers which may need to be examined.
Before printing, we have changed the line thickness of pen 3 to 4 units (click into Prefs in the menu bar, select Graph Preferences, etc.). The resulting diagrams are shown in Figure 8. In some situations more than a single summary measure may be required. The summary measure needs to be chosen prior to the analysis of the data. The most commonly used measure is the mean of the responses over time. Other possible summary measures are listed in Matthews et al.
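A summary measure analysis based on the mean over time can be sketched as follows. This Python example is illustrative only: the scores are hypothetical, the mean is taken over the available visits for each subject, and the unequal-variance (Welch) form of the two-sample t statistic is an assumption made for the sketch rather than the specific test reported in the text.

```python
import math
from statistics import fmean, variance

def mean_over_time(scores):
    """Summary measure: mean of a subject's available (non-missing) responses."""
    observed = [s for s in scores if s is not None]
    return fmean(observed)

def welch_t(a, b):
    """Two-sample t statistic and approximate degrees of freedom, unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical repeated scores, one list per subject; None marks a missed visit.
placebo = [[18, 17, 18, 15], [25, 26, 23, None], [19, 17, 14, 12], [24, 20, 19, 16]]
active = [[21, 13, 12, 9], [27, 8, 17, 15], [24, 8, 12, 10], [28, 11, 9, None]]

placebo_means = [mean_over_time(s) for s in placebo]
active_means = [mean_over_time(s) for s in active]
t, df = welch_t(placebo_means, active_means)
print(f"t = {t:.2f} on about {df:.1f} degrees of freedom")
```

Reducing each subject's profile to a single number removes the within-subject correlation from the problem, so that a standard two-group comparison applies to the summary values.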
How would you produce boxplots corresponding to those shown in Figure 8. Compare the results of the t-tests given in the text with the corresponding t-tests calculated only for those subjects having observations on all six postrandomization visits.
Repeat the summary measures analysis described in the text using now the mean over time divided by the standard deviation over time. See also Exercises in Chapter 9. The number of seizures was counted over four 2-week periods. In addition, a baseline seizure rate was recorded for each patient, based on the eight-week prerandomization seizure count.
The age of each patient was also noted. The data are shown in Table 9. These data also appear in Hand et al. Table 9. During the last decade, statisticians have considerably enriched the methodology available for the analysis of such data (see Diggle, Liang and Zeger), and many of these developments are implemented in Stata. Assuming the residual terms have a multivariate normal distribution with a particular covariance matrix allows maximum likelihood estimation to be used; details are given in Jennrich and Schluchter. If all covariance parameters are estimated independently, giving an unstructured covariance matrix, then this approach is essentially equivalent to multivariate analysis of variance for longitudinal data.
If compound symmetry is assumed, this is essentially equivalent to assuming a split-plot design. Whatever the assumed correlation structure, all models may be estimated by maximum likelihood. The negative binomial model is an example of the model above where there is only one observation per subject, see Chapter 7. The attraction of such models, also known as generalized linear mixed models, is that they correspond to a probabilistic mechanism that may have generated the data and that estimation is via maximum likelihood.
In the generalized estimating equation approach introduced by Liang and Zeger, any required covariance structure and any link function may be assumed and parameters estimated without specifying the joint distribution of the repeated observations. Estimation is via a quasi-likelihood approach (see Wedderburn). Since the parameters specifying the structure of the correlation matrix are rarely of great practical interest (they are what is known as nuisance parameters), simple structures are used for the within-subject correlations, giving rise to the so-called working correlation matrix.
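Liang and Zeger show that the estimates of the parameters of most interest, i.e., the regression coefficients, remain consistent even when the working correlation structure is misspecified. The two types of model are therefore also known as conditional and marginal models, respectively.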
In practice, this distinction is important only if link functions other than the identity or log link are used, for example in logistic regression (see Diggle et al.). A further issue with many longitudinal data sets is the occurrence of dropouts, i.e., subjects who fail to complete all scheduled visits. A taxonomy of dropouts is given in Diggle, Liang and Zeger, where it is shown that it is necessary to make particular assumptions about the dropout mechanism for the analyses described in this chapter to be valid. The xtgee command will often be used with the family(gauss) option, together with the identity link function, giving rise to multivariate normal regression, the multivariate analogue of multiple regression as described in Chapter 3.
We will illustrate this option on the post-natal depression data used in the previous chapter. However, treating the observations as independent is unrealistic and will almost certainly lead to poor estimates of the standard errors. Standard errors for between-subject factors (here group and pre) are likely to be underestimated because we are treating observations from the same subject as independent, thus increasing the apparent sample size; standard errors for within-subject factors (here visit) are likely to be overestimated.
Such a correlational structure is introduced using corr(exchangeable), or the abbreviated form corr(exc).
Since both the link and family correspond to the default options, the same analysis may be carried out using the shorter command xtgee dep group pre visit, corr(exc); see Display 9. The estimated within-subject correlation matrix demonstrates the compound symmetry structure. We also obtain the variance components, with the between-subject standard deviation, sigma_u, estimated as 3.
The correlation between time-points predicted by this model is the ratio of the between-subject variance to the total variance and may be calculated using disp 3. Use option mle to obtain the maximum likelihood solution. This pattern was apparent from the scatterplot matrix given in the previous chapter.
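The ratio described above, between-subject variance over total variance, and the compound symmetry structure it implies can be written out directly. In this Python sketch the standard deviations are hypothetical values chosen for illustration, not the estimates from the display, and the names `intraclass_correlation` and `exchangeable_matrix` are assumptions.

```python
def intraclass_correlation(sigma_u, sigma_e):
    """Between-subject variance divided by total variance.

    sigma_u: between-subject standard deviation
    sigma_e: within-subject (residual) standard deviation
    """
    between = sigma_u ** 2
    return between / (between + sigma_e ** 2)

def exchangeable_matrix(rho, n_times):
    """Compound symmetry: 1 on the diagonal, the same rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n_times)]
            for i in range(n_times)]

# Hypothetical standard deviations, for illustration only.
rho = intraclass_correlation(sigma_u=2.0, sigma_e=1.5)
for row in exchangeable_matrix(rho, n_times=4):
    print(["%.2f" % value for value in row])
```

Every pair of time points receives the same correlation under this structure, however far apart the visits are, which is the assumption that an exchangeable working correlation imposes.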