Published on February 20, 2020 by Rebecca Bevans. Revised on November 15, 2022. Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the
independent variable(s) change.

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know (1) how strong the relationship is between two or more independent variables and one dependent variable, or (2) the value of the dependent variable at a certain value of the independent variables. If, for example, you have two independent variables and one dependent variable, and all your variables are quantitative, you can use multiple linear regression to analyze the relationship between them.

Multiple linear regression makes all of the same assumptions
as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn't change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables. In multiple linear regression, it is possible that some of the independent variables are correlated with one another, so it is important to check for this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

Normality: the data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.

How to perform a multiple linear regression

Multiple linear regression formula

The formula for a multiple linear regression is:

y = β0 + β1X1 + β2X2 + … + βnXn + ε

where y is the predicted value of the dependent variable, β0 is the y-intercept, β1 through βn are the regression coefficients of the independent variables X1 through Xn, and ε is the model error.

To find the best-fit line for each independent variable, multiple linear regression calculates three things: the regression coefficients that lead to the smallest overall model error; the t statistic of the overall model; and the associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the variables were true).
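Once the coefficients are known, the prediction side of the multiple regression formula (y = b0 + b1·x1 + … + bn·xn) can be evaluated directly. Here is a minimal Python sketch using the intercept and coefficients reported later in this article's heart-disease example (15, -0.2 for biking, 0.178 for smoking); the predictor values 30% and 15% are made up purely for illustration:

```python
def predict(intercept, coefficients, values):
    """Predicted y = b0 + b1*x1 + ... + bn*xn (the error term is omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Coefficients from the article's fitted model; biking=30% and smoking=15%
# are hypothetical inputs chosen only for this example.
y_hat = predict(15.0, [-0.2, 0.178], [30.0, 15.0])
```

With these inputs the predicted rate of heart disease comes out to about 11.67%.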
It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself: Dataset for multiple linear regression (.csv). Load the heart.data dataset into your R environment and run the following code:

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
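Under the hood, lm() performs an ordinary least squares fit. For readers curious about the mechanics, here is a rough plain-Python sketch of that fit via the normal equations (X′X)b = X′y. The variable names echo the article's dataset, but the numbers are synthetic and noise-free, not the real heart.data values:

```python
def ols(rows, y):
    """Ordinary least squares via Gaussian elimination on the normal equations."""
    n, p = len(rows), len(rows[0])
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for the intercept
    k = p + 1
    # Build X'X and X'y.
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Solve the k x k system by Gaussian elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b  # [intercept, coef_biking, coef_smoking]

# Synthetic, noise-free data generated from y = 10 - 0.2*biking + 0.18*smoking,
# so OLS should recover those coefficients (almost) exactly.
biking  = [10, 20, 30, 40, 55, 65, 75]
smoking = [5, 25, 10, 30, 15, 28, 8]
heart   = [10 - 0.2 * b + 0.18 * s for b, s in zip(biking, smoking)]
coefs = ols(list(zip(biking, smoking)), heart)
```

On real data with noise, the recovered coefficients would of course differ from the generating values; the point of the noise-free example is that the algorithm visibly finds them.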
This code takes the data set heart.data and estimates the effect that the independent variables biking and smoking have on the dependent variable heart.disease. Learn more by following the full step-by-step guide to linear regression in R.

Interpreting the results

To view the results of the model, you can use the summary() function:

summary(heart.disease.lm)

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

The summary first prints out the formula ('Call'), then the model residuals ('Residuals'). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model ('Coefficients'). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It's helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

heart disease = 15 + (-0.2*biking) + (0.178*smoking) ± e

The most important things to note in this output table are the next two rows – the estimates for the independent variables. The Estimate column is the estimated effect of each independent variable on the dependent variable. The Std. Error column displays the standard error of each estimate. The t value column displays the test statistic (the estimate divided by its standard error). The Pr(>|t|) column displays the p value: how likely it is that a test statistic at least this extreme would have occurred by chance if the null hypothesis of no effect were true. Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

Presenting the results

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means. In our survey of 500 towns, we found significant relationships between the frequency of biking to work and the frequency of heart disease, and between the frequency of smoking and the frequency of heart disease (p < 0.001 for each).
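As a small illustration of how the t value column relates to the Estimate and Std. Error columns, the t value is simply the estimate divided by its standard error. A Python sketch using the estimates and standard errors reported in this article (treating the rounded published values as exact):

```python
def t_value(estimate, std_error):
    """The t statistic R reports for a coefficient: estimate / standard error."""
    return estimate / std_error

# Estimates and standard errors as reported in the article's example.
t_biking  = t_value(-0.2, 0.0014)   # about -142.9
t_smoking = t_value(0.178, 0.0035)  # about 50.9
```

Test statistics this far from zero correspond to the vanishingly small p values (p < 0.001) in the summary table.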
Specifically, we found a 0.2% decrease (± 0.0014) in the frequency of heart disease for every 1% increase in biking, and a 0.178% increase (± 0.0035) in the frequency of heart disease for every 1% increase in smoking.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot. However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis. Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work. To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.

Frequently asked questions about multiple linear regression

What is a regression model?
A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables). A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

How is the error calculated in a linear regression model?
Linear regression most often uses mean squared error (MSE) to calculate the error of the model. MSE is calculated by: measuring the distance of the observed y-values from the predicted y-values at each value of x; squaring each of these distances; and calculating the mean of the squared distances.
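Those three steps (distances, squares, mean) can be sketched in a few lines of Python, using made-up observed and predicted values:

```python
def mse(observed, predicted):
    """Mean squared error: average the squared observed-minus-predicted distances."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical values for illustration: errors of 0.5, -0.5, and 0.
error = mse([3.0, 5.0, 7.0], [2.5, 5.5, 7.0])
```

Here the squared distances are 0.25, 0.25, and 0, so the MSE is their mean, about 0.167.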
Linear regression fits a line to the data by finding the regression coefficients that result in the smallest MSE.
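A toy illustration of "smallest MSE" as the fitting criterion: for noise-free data generated as y = 2x, the candidate slope with the smallest MSE is exactly 2. A grid search over a handful of made-up candidate slopes stands in here for the closed-form least squares solution:

```python
# Noise-free data following y = 2x, so slope 2 should minimize the MSE.
xs = [1, 2, 3, 4, 5]
ys = [2 * x for x in xs]

def mse_for_slope(m):
    """MSE of the line y = m*x against the data above."""
    return sum((y - m * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

best = min([1.0, 1.5, 2.0, 2.5, 3.0], key=mse_for_slope)  # -> 2.0
```

With noisy data the minimizing slope would not land on the generating value exactly, but the principle is the same: the fitted coefficient is the one that makes the MSE as small as possible.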
Is multiple regression a linear regression?
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. Multiple regression is an extension of simple linear (OLS) regression, which uses just one explanatory variable.
What is the difference between simple linear regression and multiple regression?
While simple linear regression establishes the relationship between one dependent variable and one independent variable, multiple regression establishes the relationship between one dependent variable and two or more independent variables.
Is multiple regression more accurate than linear regression?
Multiple linear regression uses several independent variables to predict the outcome of a dependent variable. Because it can account for additional predictors, and for interactions between them, in ways that simple linear regression can't, it often describes the data more accurately, provided the extra variables are genuinely related to the outcome.