Stats blog 6 (Distributions, categorical variables and generalized linear models (GLMs))

Helen Li
Mar 26, 2021

Distributions

Distributions are important in statistics because a distribution provides a parameterized mathematical function that can be used to calculate the probability of any individual observation from the sample space. The function that describes how the observations are grouped, or how densely they occur, is called the probability density function.

There are different statistical distributions such as Bernoulli Distribution, Uniform Distribution, Binomial Distribution, Normal Distribution, Poisson Distribution and Exponential Distribution.
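As a quick illustration, here is a minimal Python sketch (assuming SciPy is available) that evaluates probabilities under a few of these distributions; the specific parameter values are arbitrary.

```python
# A minimal sketch using scipy.stats to evaluate the probability
# density/mass of a few common distributions (parameters are arbitrary).
from scipy import stats

# Normal: P(X <= 1.96) for a standard normal
print(stats.norm.cdf(1.96, loc=0, scale=1))   # ~0.975

# Binomial: P(X = 3) for n = 10 trials with success probability 0.5
print(stats.binom.pmf(3, n=10, p=0.5))        # ~0.117

# Poisson: P(X = 2) when the mean rate is 4
print(stats.poisson.pmf(2, mu=4))             # ~0.147

# Exponential: density at x = 1 for rate 1 (scale = 1/rate)
print(stats.expon.pdf(1.0, scale=1.0))        # ~0.368
```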

Categorical variables: tables, odds ratios and relative risks

Calculations with tables

From a contingency table that cross-tabulates two categorical variables, there are three types of proportions we can calculate (a worked sketch in code follows the list below):

Joint:

  1. Joint proportions reflect the proportion of total observations for which given levels of our categorical variables co-occur
  2. General calculation: Cell value over the grand total

Marginal:

  1. Marginal proportions describe one variable on its own by summing across the levels of the other (across rows or columns)
  2. General calculation: Row or column sums over the grand total

Conditional:

  1. Conditional proportions hold one variable's level fixed as given; it is a bit like zooming in on a single row or column
  2. General calculation: Cell value over a row or column sum
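Here is a minimal sketch of these three calculations in Python with pandas, using a hypothetical 2x2 table of exposure versus outcome counts (the numbers are made up for illustration).

```python
# Joint, marginal and conditional proportions from a hypothetical 2x2 table.
import pandas as pd

counts = pd.DataFrame(
    {"Outcome: yes": [20, 10], "Outcome: no": [80, 90]},
    index=["Exposed", "Unexposed"],
)
grand_total = counts.values.sum()

# Joint proportions: each cell over the grand total
joint = counts / grand_total

# Marginal proportions: row or column sums over the grand total
row_marginal = counts.sum(axis=1) / grand_total
col_marginal = counts.sum(axis=0) / grand_total

# Conditional proportions: each cell over its row sum ("zooming in" on a row)
conditional_on_row = counts.div(counts.sum(axis=1), axis=0)

print(joint, row_marginal, col_marginal, conditional_on_row, sep="\n\n")
```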

Risk and odds: “Risk” refers to the probability that an event or outcome occurs. Statistically, risk equals the number of outcomes of interest divided by all possible outcomes. The term “odds” is often used instead of risk. “Odds” refers to the ratio of the probability that an event occurs to the probability that it does not occur. Although the two concepts seem similar and interchangeable at first glance, there are important differences that determine when each is appropriate.

Odds ratios and risk ratios: Risk ratios are also called ‘relative’ risks. Risk ratios and odds ratios are ratios of the risks and odds, respectively, in two groups (for example, exposed versus unexposed).
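As a worked sketch, the snippet below computes risk, odds, the risk ratio and the odds ratio from the same hypothetical 2x2 counts used above.

```python
# Risk, odds, relative risk and odds ratio from a hypothetical 2x2 table
# (rows: exposed/unexposed, columns: event/no event).
a, b = 20, 80   # exposed:   20 events, 80 non-events
c, d = 10, 90   # unexposed: 10 events, 90 non-events

risk_exposed = a / (a + b)       # events / all outcomes in the exposed group
risk_unexposed = c / (c + d)
relative_risk = risk_exposed / risk_unexposed    # RR = 0.20 / 0.10 = 2.0

odds_exposed = a / b             # events / non-events in the exposed group
odds_unexposed = c / d
odds_ratio = odds_exposed / odds_unexposed       # OR = (20/80) / (10/90) = 2.25

print(relative_risk, odds_ratio)
```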

When do we use RR vs OR: In case-control studies, the total number of people at risk in each exposure group is not available to us, so we cannot calculate a relative risk. However, we can calculate odds ratios and comment on the strength of association between the exposure and the outcome. In cohort studies, where we do know the number exposed, we can calculate either or both. In logistic regression, we should report adjusted ORs, not RRs.

Generalized linear models (GLMs)

Generalized linear models are a flexible class of models that let us generalize from the linear model to include more types of response variables, such as count, binary and proportional data.

Assumptions of the Generalized Linear Model

  1. The data Y_1, Y_2, …, Y_n are independently distributed (cases are independent) so that errors are independent, but not necessarily normally distributed.
  2. The dependent variable Y_i does not need to be normally distributed, but it is assumed to follow a distribution, typically from an exponential family.
  3. GLM does not assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response (via the link function) and the explanatory variables.
  4. The homogeneity of variance does not need to be satisfied.
  5. It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and it relies on large-sample approximations (a fitting sketch follows this list).
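To make point 5 concrete, here is a minimal sketch (assuming statsmodels and NumPy are available) that simulates count data and fits a Poisson GLM by maximum likelihood; the true coefficients 0.5 and 1.2 are arbitrary choices for illustration.

```python
# Fitting a Poisson GLM by maximum likelihood: the response is count data,
# so the errors are neither normal nor homoscedastic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=200)
mu = np.exp(0.5 + 1.2 * x)          # true mean on the log-link scale
y = rng.poisson(mu)                 # count response

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson())   # default log link
result = model.fit()                                 # maximum likelihood (IRLS)
print(result.params)                                 # estimates near (0.5, 1.2)
```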

Components of a Generalized Linear Model

Generalized linear models have three parts:

  1. random component: the response and an associated probability distribution
  2. systematic component: explanatory variables and the relationships among them (e.g. interaction terms)
  3. link function, which tells us about the relationship between the systematic component (or linear predictor) and the expected value of the response

It is the link function that lets us generalize the linear model to count, binomial and percent data. It ensures linearity on the link scale and constrains the predicted mean to lie within the range of values possible for the response.
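As a closing sketch (again assuming statsmodels), the example below fits a binomial GLM with the default logit link; the fitted coefficients live on the unbounded log-odds scale, while the predictions are constrained to (0, 1).

```python
# The logit link: an unbounded linear predictor is mapped by the inverse
# link back into (0, 1), the valid range for a probability.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=300)
p = 1 / (1 + np.exp(-(-0.4 + 1.5 * x)))   # inverse logit of the linear predictor
y = rng.binomial(1, p)                    # binary response

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # logit is the default link
print(fit.params)          # coefficients on the log-odds (logit) scale
print(fit.predict(X)[:5])  # predictions constrained to lie in (0, 1)
```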
