SCHOOL OF PUBLIC HEALTH & COMMUNITY MEDICINE

MULTIPLE REGRESSION - Dr Surya Raj Niraula

Regression

Sir Francis Galton first coined the term ‘regression’, as in ‘regression toward the mean’.
Regression Line – a linear relationship between two variables
It also predicts the value of a dependent variable (y) from a known value of an independent variable (x)
The slope gives the expected change in the dependent variable for a unit change in the independent variable

Ỳ = a + bX

a is the intercept; b is the slope (gradient), often referred to as the regression coefficient
The above equation is an estimate of the following equation, which describes the population regression of y on x.
Y = ß0 + ß1X1 + Є
ß0 = y-axis intercept
ß1 = slope of the population regression line
Є = random error term
Y - Ỳ = the difference between the observed and the predicted value (the residual).
The least-squares method is the mathematical procedure that minimizes the sum of the squared errors, ∑(Y - Ỳ)².
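ß0 and ß1 are estimated by the following equations:
Estimate of ß0 = a = ybar - b·xbar
Estimate of ß1 = b = ∑(x - xbar)(y - ybar) / ∑(x - xbar)² = rxy · (sy / sx)

where rxy is the correlation coefficient between x and y, and sx and sy are their standard deviations.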
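
The two forms of the slope estimate can be checked numerically. The sketch below assumes NumPy and uses a small made-up data set (the numbers are hypothetical and are not taken from the lecture); the function name least_squares_line is also just an illustrative choice.

```python
import numpy as np

# Hypothetical illustration data (not from the lecture)
x = np.array([25.0, 32.0, 41.0, 48.0, 55.0, 63.0])        # e.g. age in years
y = np.array([118.0, 121.0, 127.0, 131.0, 138.0, 142.0])  # e.g. blood pressure

def least_squares_line(x, y):
    """Return (a, b) for the fitted line y_hat = a + b*x."""
    x_bar, y_bar = x.mean(), y.mean()
    # b = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # a = ybar - b * xbar
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares_line(x, y)

# Same slope from the correlation form: b = r_xy * s_y / s_x
r_xy = np.corrcoef(x, y)[0, 1]
b_from_r = r_xy * y.std(ddof=1) / x.std(ddof=1)

print(f"a = {a:.3f}, b = {b:.3f}, b from r = {b_from_r:.3f}")
```

The two slope values should agree up to rounding, because the divisor used for the standard deviations cancels in the ratio sy/sx.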

Multiple regression

In laboratory experiments, the investigator can usually control the variables and change one at a time.
But some kinds of experiments require analyzing the joint effect of several variables.
Multiple regression fits an equation that predicts one variable (the dependent variable, Y) from two or more independent (X) variables. For example, you might use multiple regression to predict blood pressure from age, weight and gender.
With the blood pressure example, your goal may be to find out which variable has the largest influence on blood pressure: age, weight or gender.


Or your goal may be to find an equation that best predicts blood pressure from those three variables.
Or your goal may be to assess one variable after adjusting for the others: Does blood pressure vary with age, after correcting for differences in weight and differences between the sexes?
Or you might ask: Does blood pressure differ between men and women, after correcting for differences in age and weight?
Multiple regression is more complicated than most other statistical tests.

 

The multiple regression model and its assumptions

Y = ß0 + ß1X1 + ß2X2 + ß3X3 + ß4X4 + . . . + random scatter
If there is only a single X variable, then the equation is Y = ß0 + ß1X1, and the "multiple regression" analysis is the same as simple linear regression (ß0 is the Y intercept; ß1 is the slope).
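
As a rough check of the single-variable case, the sketch below fits the general model with one X column and compares the result with the simple linear regression formulas. It assumes NumPy; the data values are made up for illustration, and np.linalg.lstsq is used here simply as a convenient least-squares solver.

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: a column of ones (for b0) plus one column per X variable
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
b0, b1 = coef

# Simple linear regression formulas for slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(f"general fit:  b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"simple formulas: a = {intercept:.4f}, b = {slope:.4f}")
```

Adding further X variables only adds columns to the design matrix; the fitting call stays the same.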
Blood pressure = ß0 + ß1*age + ß2*weight + ß3*gender + random scatter
Gender is a dummy variable: it is coded with two numeric values (for example, 0 and 1) so that it can enter the equation.
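
A minimal sketch of fitting this model, assuming NumPy. The data are simulated, not real measurements: the "true" coefficients, the sample size, the ranges of age and weight, and the 0/1 coding of gender are all assumptions chosen only to illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated (not real) subjects
age = rng.uniform(20, 70, n)        # years
weight = rng.uniform(100, 250, n)   # pounds
gender = rng.integers(0, 2, n)      # dummy variable coded 0 or 1 (which group gets 1 is arbitrary)

# Assumed coefficients, used only to generate the simulated data
true_beta = np.array([80.0, 0.5, 0.1, 5.0])   # b0, age, weight, gender
X = np.column_stack([np.ones(n), age, weight, gender])
bp = X @ true_beta + rng.normal(0, 5, n)      # Gaussian random scatter, SD = 5

# Least-squares estimates of b0..b3
beta_hat, *_ = np.linalg.lstsq(X, bp, rcond=None)
for name, est in zip(["b0", "b1 (age)", "b2 (weight)", "b3 (gender)"], beta_hat):
    print(f"{name}: {est:.3f}")
```

With this coding, the estimate of ß3 is the average difference in blood pressure between the group coded 1 and the group coded 0, at the same age and weight.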


On average, blood pressure increases (or decreases) a certain amount (the best-fit value of ß1) for every year of age. This amount is the same for men and women of all ages and all weights.
On average, blood pressure increases (or decreases) a certain amount per pound (the best-fit value of ß2). This amount is the same for men and women of all ages and all weights.
On average, blood pressure differs by a certain amount between men and women (the best-fit value of ß3). This amount is the same for people of all ages and weights.
In mathematical terms, the model is linear and allows for no interaction. Linear means that, holding the other variables constant, the graph of blood pressure vs. age (or vs. weight) is a straight line. No interaction means that the slope of the blood pressure vs. age line is the same for all weights and for both men and women.
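
The "no interaction" property can be illustrated with a few lines of Python. The coefficient values below are hypothetical, chosen only for the example; the point is that the predicted change for one extra year of age does not depend on weight or gender.

```python
# Hypothetical best-fit values, chosen only for illustration
b0, b1, b2, b3 = 80.0, 0.5, 0.1, 5.0

def predicted_bp(age, weight, gender):
    """Predicted blood pressure under the additive (no-interaction) model."""
    return b0 + b1 * age + b2 * weight + b3 * gender

# Effect of one extra year of age at different weights and for both dummy codes
for weight in (120, 180, 240):
    for gender in (0, 1):
        change = predicted_bp(41, weight, gender) - predicted_bp(40, weight, gender)
        print(f"weight={weight}, gender={gender}: change per year = {change:.3f}")
```

Every line reports the same change, equal to b1. A model with an interaction term (for example, age multiplied by gender) would give different slopes for the two groups, which this model does not allow.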

Additionally, the multiple regression procedure makes assumptions about the random scatter. It assumes that the scatter is Gaussian (normally distributed) and that the standard deviation of the scatter is the same for all values of X and Y. Furthermore, the model assumes that the scatter for each subject is random and independent: it should not be influenced by the deviations of other subjects.
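
These assumptions are usually examined through the residuals (observed minus predicted values). The sketch below is an informal check, not a formal test, and assumes NumPy. The data are simulated so that the assumptions hold by construction; the coefficients and sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Simulated (not real) data that satisfies the assumptions by construction
age = rng.uniform(20, 70, n)
weight = rng.uniform(100, 250, n)
X = np.column_stack([np.ones(n), age, weight])
y = X @ np.array([80.0, 0.5, 0.1]) + rng.normal(0, 5, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat
residuals = y - fitted

# Equal scatter: residual SD should be similar for low and high fitted values
order = np.argsort(fitted)
sd_low = residuals[order[: n // 2]].std(ddof=1)
sd_high = residuals[order[n // 2 :]].std(ddof=1)

# Independence: correlation between consecutive residuals should be near zero
# (only meaningful when the row order has a meaning, such as time of measurement)
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

print(f"SD low = {sd_low:.2f}, SD high = {sd_high:.2f}, lag-1 corr = {lag1:.3f}")
```

A histogram or normal probability plot of the residuals is a common way to judge whether the scatter looks roughly Gaussian.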