Multiple Linear Regression in R: A Powerful Analysis Tool

Overview

Have you ever wondered how to analyze the relationship between multiple variables using R? If so, this article is for you. After reading it, you will be able to perform multiple linear regression in R.

Multiple linear regression is a popular statistical method for analyzing the relationship between several predictor (independent) variables and one response (dependent) variable. It can be performed in R or Python; this article shows how to do it in R.

In this article, you will learn:

What is multiple linear regression?
What are the various assumptions of multiple linear regression?
How to implement multiple linear regression in R?
How to check the validity and accuracy of the regression model?
How to use the model for prediction on new (unseen) data?

If you are ready to learn this fascinating topic, dive in!

What Is Multiple Linear Regression?

In multiple linear regression, we try to find the relationship between a response variable (dependent variable) and its predictor variables (independent variables). In simple linear regression there is only one predictor, whereas in multiple linear regression there is always more than one.

Let us try to understand multiple linear regression with the help of a simple example. Suppose you are a business analyst who works for a company. Your job is to understand how the profit of each laptop (profit) depends on various factors: speed of the processor (processor), size of the screen (screen), life of the battery (battery), and rating of the customer (rating).

To do so, you need to collect a lot of data (say, 100 observations) on laptop models and their profits. If you decide to use multiple linear regression as a predictive model then you can express the response (profit of each laptop) as a function of predictors as,

$$profit=b_0+b_1\times processor+b_2\times screen+b_3\times battery+b_4\times rating$$

In the above equation, b₀, b₁, b₂, b₃, and b₄ are the coefficients that need to be estimated from the data using multiple linear regression.

In a more generic form, we can express multiple linear regression as,

$$y=b_0+b_1x_1+b_2x_2+b_3x_3+\dots+b_nx_n$$

Where,

y is the response (dependent) variable, x₁, x₂, ..., xₙ are the predictor (independent) variables, b₀ is the intercept, b₁, b₂, ..., bₙ are the regression coefficients, and n is the total number of predictor variables.
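To make the general equation concrete, the following minimal sketch estimates the coefficients in two ways: with R's lm function, and by solving the least-squares normal equations directly. It uses the built-in mtcars dataset purely for illustration (an assumption of this sketch, not the data used in this article), so the variable names mpg, wt, hp, and disp come from mtcars.

```r
# Illustration on the built-in mtcars data (an assumption for this sketch,
# not the article's dataset): model mpg from weight, horsepower, displacement.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Design matrix X: a column of ones for b_0 plus one column per predictor.
X <- model.matrix(~ wt + hp + disp, data = mtcars)
y <- mtcars$mpg

# Solve the normal equations (X'X) b = X'y for the coefficient vector b.
b_manual <- drop(solve(crossprod(X), crossprod(X, y)))

# The two sets of estimates should agree up to floating-point error.
print(cbind(lm = coef(fit), normal_equation = b_manual))
```

In practice lm is the better choice, since it uses a numerically more stable decomposition than the direct solve shown here.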

Assumptions Of Multiple Linear Regression

We need to check several assumptions when performing multiple linear regression. Let us discuss them briefly, one by one.

Linear Relationship

Multiple linear regression assumes a linear relationship between response and predictor variables. We can visually inspect the linear relationship with the help of a scatter plot.

No Multicollinearity

Predictor variables should not be highly correlated with each other. We can use a correlation matrix or variance inflation factors (VIF) to measure multicollinearity.

Independence

The observations should be independent of each other, which means the residuals should not be autocorrelated. A popular method for checking autocorrelation among residuals is the Durbin-Watson test.

Multivariate Normality

The residuals should be normally distributed. Normality can be checked with a Q-Q plot, the Shapiro-Wilk test, or the Kolmogorov-Smirnov test.

Homoscedasticity

The variance of the residuals should be constant across the values of the predictor variables. We can use residual plots or the Breusch-Pagan test to check for heteroscedasticity.
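Several of these checks can be sketched in base R alone. The example below is a rough illustration on the built-in mtcars dataset (an assumption of this sketch, not the article's data). The VIF and the Durbin-Watson statistic are computed by hand from their definitions; packages such as car and lmtest offer ready-made versions with p-values, which this sketch does not use.

```r
# Illustration on the built-in mtcars data (not the article's dataset).
predictors <- c("wt", "hp", "disp")
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
res <- residuals(fit)

# Multicollinearity: pairwise correlations between predictors.
print(round(cor(mtcars[, predictors]), 2))

# Multicollinearity: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictors.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))

# Independence: Durbin-Watson statistic computed from its definition.
# Values near 2 suggest little first-order autocorrelation.
dw <- sum(diff(res)^2) / sum(res^2)
print(dw)

# Normality of residuals: Shapiro-Wilk test and a Q-Q plot.
print(shapiro.test(res))
qqnorm(res)
qqline(res)

# Linearity and homoscedasticity: residuals against fitted values should
# show no curve and no funnel shape.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Note that most of these checks use the residuals, so they are run after a model has been fitted.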

How To Implement Multiple Linear Regression In R?

In this section, we will implement multiple linear regression in R to find relationships among the variables of a startup dataset. You can find the code and data in the following links: Code and Data

Understanding The Data

We have a dataset containing information on 50 startups. Each row of the table below records a startup's R&D Spend, Administration, Marketing Spend, State, and Profit. The aim is to implement multiple linear regression in R to model the relationship between the response variable (Profit) and the predictor variables (R&D Spend, Administration, Marketing Spend, and State).

Install Libraries

Here, we will install caTools, a package that collects various utility functions for R. In this article, we use its sample.split function to split the data into training and test sets.

    install.packages('caTools')

Import Dataset

    dataset = read.csv('StartupData.csv')

Categorical Data Encoding

In categorical data encoding, categorical variables are transformed into numerical values so that they can be used for statistical analysis. There are various methods for categorical data encoding in R, such as level encoding, one-hot encoding, and frequency encoding.

The dataset we considered here has a categorical variable named “State” that can take the values “New York”, “California”, or “Florida”. We can convert the values of the categorical variable into numerical labels using the following code snippet.

    dataset$State = factor(dataset$State, levels = c('New York', 'California', 'Florida'), labels = c(1, 2, 3))

The above code snippet uses the level encoding method: as we can see in the table below, the categories New York, California, and Florida are replaced with the labels 1, 2, and 3. Note that the column is stored as an R factor rather than as plain numbers, so the lm function treats State as categorical and creates dummy (indicator) variables for it automatically.
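The sketch below shows what this encoding produces and how a factor enters a regression. The toy data frame is hypothetical and exists only for illustration; only the State values mirror the article. The model.matrix function displays the design matrix that lm would build from the factor.

```r
# Hypothetical toy data for illustration; only the State values mirror the article.
toy <- data.frame(
  State = c("New York", "California", "Florida", "New York", "Florida")
)
toy$State <- factor(toy$State,
                    levels = c("New York", "California", "Florida"),
                    labels = c(1, 2, 3))

# The column is a factor with three levels, not a numeric vector.
print(is.factor(toy$State))
print(levels(toy$State))

# model.matrix shows the design matrix lm would build: an intercept plus
# one 0/1 indicator column for each level except the first (the baseline).
dummies <- model.matrix(~ State, data = toy)
print(dummies)
```

Because the first level (New York, labeled 1) is the baseline, the coefficients for the other two levels are interpreted as differences from New York.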

Split the dataset into Train set and Test set

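Here, we use caTools to split the dataset into training and test sets. With SplitRatio = 0.75, about 75% of the rows go to the training set and the rest go to the test set.

    library(caTools)
    # Fix the random seed so that the split is reproducible
    set.seed(123)
    # split is a logical vector: TRUE for training rows, FALSE for test rows
    split = sample.split(dataset$Profit, SplitRatio = 0.75)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)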
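If caTools is not available, a similar split can be sketched in base R with the sample function. This is an alternative, not what the article's code does: sample draws rows purely at random, whereas sample.split is documented as trying to preserve the relative ratios of the labels it is given. The example uses the built-in mtcars dataset as a stand-in for the startup data.

```r
# Base R alternative to caTools::sample.split, shown on the built-in mtcars
# data (an assumption for this sketch, not the article's dataset).
set.seed(123)                                   # make the split reproducible
n <- nrow(mtcars)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

training_set <- mtcars[train_idx, ]
test_set <- mtcars[-train_idx, ]

# Confirm the sizes: every row lands in exactly one of the two sets.
cat("Training rows:", nrow(training_set), "\n")
cat("Test rows:", nrow(test_set), "\n")
```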

Fitting Multiple Linear Regression to the Training set

Let us fit a multiple linear regression model in R using the lm function. In the formula Profit ~ ., Profit is the response variable, and the dot stands for all the other variables in the dataset, which are used as predictors.

    regressor = lm(formula = Profit ~ ., data = training_set)
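Once a model is fitted, its estimated coefficients can be inspected with summary and confint. The following sketch illustrates this on the built-in mtcars dataset (an assumption of this sketch, not the startup data); the same calls should apply to any object returned by lm.

```r
# Illustration on the built-in mtcars data (not the article's dataset).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
model_summary <- summary(fit)

# One row per coefficient: estimate, standard error, t value, p-value.
print(model_summary$coefficients)

# Goodness of fit on the data used for fitting.
cat("R-squared:", model_summary$r.squared, "\n")
cat("Adjusted R-squared:", model_summary$adj.r.squared, "\n")

# 95% confidence intervals for the coefficients.
print(confint(fit, level = 0.95))
```

A small p-value for a coefficient suggests that the corresponding predictor contributes to the model once the other predictors are accounted for.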

Predicting the Test set results

Let us use the fitted model to predict the profit for each row of the test set.

    y_pred = predict(regressor, newdata = test_set)
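The same predict function can be applied to entirely new observations, as long as the new data frame has columns named like the predictors. The sketch below uses the built-in mtcars dataset, and the two rows in new_cars are hypothetical values made up for illustration. Setting interval = "prediction" also returns a lower and an upper bound for each prediction.

```r
# Illustration on the built-in mtcars data (not the article's dataset).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical new observations; column names must match the predictors.
new_cars <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150), disp = c(160, 250))

# Point predictions plus 95% prediction intervals (columns fit, lwr, upr).
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
print(pred)
```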

Visualize Model Prediction

    y_test = test_set$Profit
    comparison <- cbind(y_pred, y_test)
    print(comparison)

The resulting table compares the values predicted by multiple linear regression with the observed values of the response variable. We can see that the predicted values are quite close to the observed ones. In the next step, we will evaluate the performance of the multiple linear regression model using various metrics.
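Before moving on to the metrics, the same comparison can also be drawn as a scatter plot of predicted against observed values. The sketch below is one possible way to do this in base R; it uses the built-in mtcars dataset and a simple random split, both assumptions made so that the example is self-contained.

```r
# Illustration on mtcars with a simple random split (assumptions of this sketch).
set.seed(123)
train_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
fit <- lm(mpg ~ wt + hp + disp, data = mtcars[train_idx, ])

test_rows <- mtcars[-train_idx, ]
y_test <- test_rows$mpg
y_pred <- predict(fit, newdata = test_rows)

# Points close to the dashed 45-degree line are well predicted.
plot(y_test, y_pred,
     xlab = "Observed value", ylab = "Predicted value",
     main = "Predicted vs. observed (test set)")
abline(a = 0, b = 1, lty = 2)
```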

Model Evaluation and Validation

We can evaluate the performance of the multiple linear regression model with the help of various metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared. Let us compute them:

Compute Mean Squared Error (MSE)

    mse <- mean((y_pred - y_test)^2)
    print(paste("Mean Squared Error (MSE):", mse))

    Mean Squared Error (MSE): 59614884.0721883

Compute Root Mean Squared Error (RMSE)

    rmse <- sqrt(mse)
    print(paste("Root Mean Squared Error (RMSE):", rmse))

    Root Mean Squared Error (RMSE): 7721.0675474437

Compute R-squared

    rsquared <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)
    cat("R-squared:", rsquared, "\n")

    R-squared: 0.9210359

Compute Adjusted R-squared

    # n: number of observations in the test set
    n <- nrow(test_set)
    # p: number of predictor terms (all coefficients except the intercept)
    p <- length(coefficients(regressor)) - 1
    adjusted_rsquared <- 1 - ((1 - rsquared) * (n - 1)) / (n - p - 1)
    cat("Adjusted R-squared:", adjusted_rsquared, "\n")

    Adjusted R-squared: 0.8223309
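The four metrics above can be gathered into one small helper. The function regression_metrics below is a hypothetical name introduced for this sketch, and the two short vectors are made-up numbers used only to exercise it; they are not results from the startup data.

```r
# Hypothetical helper that gathers the metrics above in one place.
# actual, predicted: numeric vectors of equal length; p: number of predictors.
regression_metrics <- function(actual, predicted, p) {
  stopifnot(length(actual) == length(predicted))
  n <- length(actual)
  mse <- mean((predicted - actual)^2)
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  c(MSE = mse,
    RMSE = sqrt(mse),
    R2 = r2,
    Adjusted_R2 = 1 - (1 - r2) * (n - 1) / (n - p - 1))
}

# Small made-up vectors, used only to exercise the function.
actual <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 5)
metrics <- regression_metrics(actual, predicted, p = 1)
print(metrics)
```

With the article's objects, the call would take the form regression_metrics(y_test, y_pred, p), with p computed as in the adjusted R-squared snippet above.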

Conclusion

This article started with the basics of multiple linear regression and the equation that relates a response variable to multiple predictor variables. We also discussed the assumptions that should be checked when applying multiple linear regression in R to find relationships among variables.

We then implemented multiple linear regression in R to model the profit of a startup based on R&D spending, administration, marketing spending, and state. Applying multiple linear regression in R is quite straightforward thanks to its built-in lm function and extensive statistical libraries.

The performance of the model was evaluated using metrics such as mean squared error, root mean squared error, R-squared, and adjusted R-squared. Together, these metrics indicate how well multiple linear regression captures the relationship between a response variable and multiple predictor variables.

Multiple linear regression is a powerful method for finding relationships between a response variable and multiple predictor variables. We can apply it in various sectors such as finance, economics, marketing, healthcare, education, and environmental science.

Frequently Asked Questions

What is one reason for performing multiple regression analysis?

Multiple linear regression helps to determine how several independent variables collectively influence a dependent variable. This gives a clear and comprehensive understanding of the factors affecting the outcome.

What is the advantage of using a multiple regression design?

Multiple linear regression allows us to analyze the impact of multiple independent variables simultaneously. This helps us make more accurate predictions and gain a deeper understanding of the relationships between variables.


paravisionlab.co.in

Dr. Partha Majumder is a distinguished researcher specializing in deep learning, artificial intelligence, and AI-driven groundwater modeling. With a prolific track record, his work has been featured in numerous prestigious international journals and conferences. Detailed information about his research can be found on his ResearchGate profile. In addition to his academic achievements, Dr. Majumder is the founder of Paravision Lab, a pioneering startup at the forefront of AI innovation.
