Multiple Linear Regression in R: A Powerful Analysis Tool

Multiple Linear Regression In R

Overview

Have you ever wondered how to analyze the relationship between multiple variables using R? If so, this article is for you. After reading it, you will be able to perform multiple linear regression in R.

Multiple linear regression is a popular statistical method for analyzing the relationship between several predictor (independent) variables and one response (dependent) variable. It can be performed in R or Python; this article shows how to do it in R.

In this article, you will learn:

  • What is multiple linear regression?
  • What are the various assumptions of multiple linear regression?
  • How to implement multiple linear regression in R?
  • How to check the validity and accuracy of the regression model?
  • How to use the model for prediction on new (unseen) data?

If you are ready to learn this fascinating topic, dive in!

What Is Multiple Linear Regression?

In multiple linear regression, we try to find the relationship between a response variable (dependent variable) and its predictor variables (independent variables). Simple linear regression has only one predictor, whereas multiple linear regression always has more than one.

Let us try to understand multiple linear regression with the help of a simple example. Suppose you are a business analyst who works for a company. Your job is to understand how the profit of each laptop (profit) depends on various factors: speed of the processor (processor), size of the screen (screen), life of the battery (battery), and customer rating (rating).

 

To do so, you need to collect a lot of data (say, 100 observations) on laptop models and their profits. If you decide to use multiple linear regression as a predictive model, you can express the response (profit of each laptop) as a function of the predictors:

$$profit=b_0+b_1\times processor+b_2\times screen+b_3\times battery+b_4\times rating$$

In the above equation, b0, b1, b2, b3, and b4 are the coefficients that need to be estimated from the data using multiple linear regression.

In a more generic form, we can express multiple linear regression as,

 

$$y=b_0+b_1x_1+b_2x_2+b_3x_3+\dots+b_nx_n$$

Where,

y is the response (dependent) variable, x1, x2, ..., xn are the predictor (independent) variables, b0 is the intercept, b1, b2, ..., bn are the coefficients of the predictors, and n is the total number of predictor variables.
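As a rough illustration of what estimating the coefficients means, the sketch below simulates data from a known equation and recovers the coefficients in two ways: with R's lm function and by solving the least-squares normal equations directly. The variable names (x1, x2, y) and the true coefficients (2, 1.5, -0.7) are made up for this example.

```r
# Simulated data: all names and numbers here are illustrative assumptions.
set.seed(42)
n_obs <- 200
x1 <- rnorm(n_obs)
x2 <- rnorm(n_obs)
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n_obs, sd = 0.5)

# Estimate b0, b1, b2 with lm
fit <- lm(y ~ x1 + x2)
coef_lm <- coef(fit)

# Estimate the same coefficients from the normal equations: (X'X) b = X'y
X <- cbind(1, x1, x2)            # the column of ones corresponds to the intercept b0
coef_ne <- solve(t(X) %*% X, t(X) %*% y)

print(coef_lm)
print(drop(coef_ne))
```

Both approaches should give the same estimates up to numerical precision, and the estimates should land near the values used to simulate the data.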

Assumptions Of Multiple Linear Regression

We need to check several assumptions before relying on a multiple linear regression model. Let us discuss them briefly, one by one.

Linear Relationship

Multiple linear regression assumes a linear relationship between response and predictor variables. We can visually inspect the linear relationship with the help of a scatter plot.

No Multicollinearity

Predictor variables should not be highly correlated with each other. We can use a correlation matrix or variance inflation factors (VIF) to measure multicollinearity.

Independence

The observations, and hence the residuals, should be independent of each other. The Durbin-Watson test is a popular way to check for autocorrelation among the residuals.

Multivariate Normality

The residuals should be normally distributed. We can check normality with a Q-Q plot, the Shapiro-Wilk test, or the Kolmogorov-Smirnov test.

Homoscedasticity

The variance of the residuals should be constant across the values of the predictor variables. We can use residual plots or the Breusch-Pagan test to check for heteroscedasticity.
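The checks above can be sketched in base R. The example below uses the built-in mtcars data as a stand-in (it is not the dataset used later in this article), computes the VIF and the Durbin-Watson statistic by hand from their definitions, and draws the diagnostic plots. Packages such as car (vif) and lmtest (dwtest, bptest) provide ready-made versions of the formal tests; they are left out here so the sketch runs without extra installs.

```r
# Stand-in example on the built-in mtcars data (an assumption for illustration).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
res <- residuals(fit)

# Multicollinearity: correlation matrix of the predictors
pred <- mtcars[, c("wt", "hp", "qsec")]
print(round(cor(pred), 2))

# VIF of predictor j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
vif <- sapply(names(pred), function(v) {
  f  <- reformulate(setdiff(names(pred), v), response = v)
  r2 <- summary(lm(f, data = pred))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))

# Independence: Durbin-Watson statistic (values near 2 suggest
# little first-order autocorrelation in the residuals)
dw <- sum(diff(res)^2) / sum(res^2)
print(dw)

# Normality of residuals: Shapiro-Wilk test and Q-Q plot
print(shapiro.test(res))
qqnorm(res); qqline(res)

# Linearity and homoscedasticity: residuals versus fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

In the residual plot, a curved pattern hints at non-linearity and a funnel shape hints at non-constant variance.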

How To Implement Multiple Linear Regression In R?

In this section, we will implement multiple linear regression in R to find relationships among the variables of a startup dataset. You can find the code and data at the following links: Code and Data

Understanding The Data

We have a dataset containing information on 50 startups. Each row of the table below lists the R&D Spend, Administration, Marketing Spend, State, and Profit of one startup. The aim is to implement multiple linear regression in R to model the relationship between the response variable (Profit) and the predictor variables (R&D Spend, Administration, Marketing Spend, and State).

Visualizing The Dataset On Which We Will Implement Multiple Linear Regression.

Install Libraries

Here, we install caTools, an R package that collects various utility functions. We will use its sample.split function to split the data into training and test sets.

				
    install.packages('caTools')

Import Dataset

				
    dataset = read.csv('StartupData.csv')

Categorical Data Encoding

In categorical data encoding, categorical variables are transformed into a numerical representation so that they can be used for statistical analysis. Common methods include label encoding, one-hot encoding, and frequency encoding.

The dataset we considered here has a categorical variable named “State” that can take values such as “New York”, “California” or “Florida”. We can convert the values of the categorical variable into numerical values using the following code snippet.

				
					
    dataset$State = factor(dataset$State,
                           levels = c('New York', 'California', 'Florida'),
                           labels = c(1, 2, 3))

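The above code snippet uses label encoding to convert the categorical data. As we can see in the table below, the state names (New York, California, and Florida) are replaced with the labels 1, 2, and 3. Because State is stored as a factor, lm treats it as a categorical variable and creates dummy variables for it automatically, rather than treating 1, 2, and 3 as quantities.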
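To see what R does with an encoded factor when it builds a regression model, the sketch below applies the same factor call to a tiny data frame and inspects the resulting design matrix with model.matrix. The rows in toy are made up for illustration and are not taken from the startup dataset.

```r
# Made-up rows, for illustration only
toy <- data.frame(
  State  = c("New York", "California", "Florida", "New York"),
  Profit = c(192, 191, 182, 166)
)
toy$State <- factor(toy$State,
                    levels = c("New York", "California", "Florida"),
                    labels = c(1, 2, 3))

str(toy$State)   # a factor with levels "1", "2", "3", not a numeric column

# The design matrix R builds for a model with State as a predictor
design <- model.matrix(Profit ~ State, data = toy)
print(design)    # columns: (Intercept), State2, State3
```

The first level (New York, labelled 1) acts as the reference category, so the model estimates one coefficient each for State2 and State3 relative to it.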

Visualizing The Data After Categorical Data Encoding

Split the dataset into Train set and Test set

Here we use caTools to split the dataset into training and test sets. With SplitRatio = 0.75, about 75% of the rows go to the training set and the rest to the test set.

				
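    library(caTools)   # provides sample.split
    set.seed(123)      # make the random split reproducible
    # TRUE marks a row for the training set, FALSE for the test set
    split = sample.split(dataset$Profit, SplitRatio = 0.75)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)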

Fitting Multiple Linear Regression to the Training set

Let us fit a multiple linear regression model in R using the lm function. In the formula Profit ~ ., Profit is the response variable, and the dot stands for all the other variables in the dataset, which are used as predictor variables.

				
    regressor = lm(formula = Profit ~ ., data = training_set)
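Once a model is fitted, summary reports the estimated coefficients with their standard errors and p-values. The sketch below shows this on a synthetic stand-in for training_set: the values are simulated, and the column names (R.D.Spend, Marketing.Spend) are an assumption about how read.csv would rename the headers of the startup file.

```r
# Synthetic stand-in for training_set: simulated values, not the article's data.
set.seed(1)
n_obs <- 40
training_set <- data.frame(
  R.D.Spend       = runif(n_obs, 0, 160000),
  Administration  = runif(n_obs, 50000, 180000),
  Marketing.Spend = runif(n_obs, 0, 450000),
  State           = factor(sample(rep(1:3, length.out = n_obs)))
)
training_set$Profit <- 50000 + 0.8 * training_set$R.D.Spend +
  0.03 * training_set$Marketing.Spend + rnorm(n_obs, sd = 8000)

regressor <- lm(Profit ~ ., data = training_set)
fit_summary <- summary(regressor)

print(fit_summary$coefficients)   # estimate, std. error, t value, p-value per term
print(fit_summary$r.squared)      # R-squared on the training data
```

The coefficient table has one row for the intercept, one for each numeric predictor, and one for each non-reference level of State.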
				
			

Predicting the Test set results

Let us use the fitted model to predict the profit for each startup in the test set.

				
    y_pred = predict(regressor, newdata = test_set)
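The same predict function also works on entirely new observations, as long as the new data frame has the same column names and factor levels as the training data. The sketch below illustrates this with a small simulated model; the data, the two new startups, and the reduced set of columns are all made up for the example.

```r
# Simulated training data (illustrative assumption, not the article's dataset)
set.seed(2)
n_obs <- 40
train <- data.frame(
  R.D.Spend = runif(n_obs, 0, 160000),
  State     = factor(sample(rep(1:3, length.out = n_obs)), levels = 1:3)
)
train$Profit <- 50000 + 0.8 * train$R.D.Spend + rnorm(n_obs, sd = 8000)
model <- lm(Profit ~ ., data = train)

# New rows must use the same column names and factor levels as the training data.
new_startups <- data.frame(
  R.D.Spend = c(60000, 120000),
  State     = factor(c(1, 3), levels = levels(train$State))
)

# interval = "prediction" adds lower and upper bounds around each predicted value
pred <- predict(model, newdata = new_startups, interval = "prediction", level = 0.95)
print(pred)   # columns: fit, lwr, upr
```

The prediction interval gives a sense of how uncertain each individual prediction is, which a single predicted number does not convey.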
				
			

Visualize Model Prediction

				
    y_test = test_set$Profit
    # Place predicted and observed profit side by side
    comparison <- cbind(y_pred, y_test)
    print(comparison)
			
  • The table compares the value of the response variable predicted by multiple linear regression with the observed value.
  • The predicted values are quite close to the observed values.
  • In the next step, we will evaluate the performance of the multiple linear regression model using various metrics.

Predicted Value of Response Variable With Observed Value in Tabular Form

Model Evaluation and Validation

We can evaluate the performance of the multiple linear regression model on the test set with the help of various metrics, such as mean squared error (MSE), root mean squared error (RMSE), R-squared, and adjusted R-squared. Let us compute each of them:

Compute Mean Squared Error (MSE)

				
    mse <- mean((y_pred - y_test)^2)
    print(paste("Mean Squared Error (MSE):", mse))

    Mean Squared Error (MSE): 59614884.0721883

Compute Root Mean Squared Error (RMSE)

				
    rmse <- sqrt(mse)
    print(paste("Root Mean Squared Error (RMSE):", rmse))

    Root Mean Squared Error (RMSE): 7721.0675474437

Compute R-squared

				
    rsquared <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)
    cat("R-squared:", rsquared, "\n")

    R-squared: 0.9210359

Compute Adjusted R-squared

				
    n <- nrow(test_set)                        # number of test observations
    p <- length(coefficients(regressor)) - 1   # number of predictors, excluding the intercept
    adjusted_rsquared <- 1 - ((1 - rsquared) * (n - 1)) / (n - p - 1)
    cat("Adjusted R-squared:", adjusted_rsquared, "\n")

    Adjusted R-squared: 0.8223309
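The four metrics above can also be collected in one small helper. The function name regression_metrics and the toy numbers below are hypothetical, chosen so that the result is easy to verify by hand; the formulas are the same ones used in this section.

```r
# Hypothetical helper that gathers the four metrics used in this section.
# actual, predicted: numeric vectors of equal length; p: number of predictors.
regression_metrics <- function(actual, predicted, p) {
  n   <- length(actual)
  mse <- mean((predicted - actual)^2)
  r2  <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  c(MSE   = mse,
    RMSE  = sqrt(mse),
    R2    = r2,
    AdjR2 = 1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Toy values, made up for illustration: every prediction is off by exactly 5.
actual    <- c(100, 120, 140, 160, 180, 200, 220, 240)
predicted <- actual + c(5, -5, 5, -5, 5, -5, 5, -5)
metrics   <- regression_metrics(actual, predicted, p = 2)
print(metrics)   # MSE is 25 and RMSE is 5 by construction
```

Adjusted R-squared is always at most R-squared, and the gap widens as the number of predictors grows relative to the number of observations.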
				
			

Conclusion

This article started with the basics of multiple linear regression and the equation that relates a response variable to multiple predictor variables. We also discussed the assumptions that should be checked before applying multiple linear regression in R.

We then implemented multiple linear regression in R to model the profit of a startup as a function of R&D spending, administration, marketing spending, and state. Fitting the model in R is straightforward: the lm function fits the model, and the predict function applies it to new data.

We evaluated the model on the test set using mean squared error, root mean squared error, R-squared, and adjusted R-squared. Together, these metrics show how well the model captures the relationship between the response variable and the predictor variables.

Multiple linear regression is a powerful method for finding relationships between a response variable and multiple predictor variables. It can be applied in various sectors such as finance, economics, marketing, healthcare, education, and environmental science.

Frequently Asked Questions

Multiple linear regression helps to determine how several independent variables collectively influence a dependent variable. This gives a clear and comprehensive understanding of the factors affecting the outcome.

Multiple linear regression allows us to analyze the impact of multiple independent variables simultaneously. This leads to more accurate predictions and a deeper understanding of the relationships between variables.
