Introduction to Linear Regression
Linear regression models the relationship between two variables by fitting a linear equation to observed data. One variable is called the independent variable, and the other is the dependent variable. Linear regression is commonly used for predictive analysis. The main idea of regression is to examine two things. First, does a set of predictor variables do a good job of predicting an outcome (dependent) variable? Second, which variables are significant predictors of the outcome variable? In this article, we will discuss the linear regression equation, its formula, and the properties of linear regression.
Examples of Linear Regression
The weight of a person is linearly related to their height, so this shows a linear relationship between height and weight: as height increases, weight tends to increase as well. It is not necessary that one variable depends on or causes the other, but there is some meaningful relationship between the two. In such cases, we use a scatter plot to assess the strength of the relationship between the variables. If there is no relationship between the variables, the scatter plot does not show any increasing or decreasing pattern, and a linear regression model is not appropriate for the data.
Linear Regression Equation
The strength of the relationship between two variables is measured by the correlation coefficient, which ranges between -1 and +1. This coefficient shows how strongly the observed values of the two variables are associated.
Linear Regression Equation is given below:
Y=a+bX
where X is the independent variable and it is plotted along the x-axis
Y is the dependent variable and it is plotted along the y-axis
Here, the slope of the line is b, and a is the intercept (the value of y when x = 0).
Linear Regression Formula
As we know, linear regression shows the linear relationship between two variables. The equation of linear regression is similar to that of the slope formula. We have learned this formula before in earlier classes such as a linear equation in two variables. Linear Regression Formula is given by the equation
Y= a + bX
We will find the value of a and b by using the below formula
a= \[\frac{\left ( \sum Y \right )\left ( \sum X^{2} \right )-\left ( \sum X \right )\left ( \sum XY \right )}{n\left ( \sum X^{2} \right )-\left ( \sum X \right )^{2}}\]
b= \[\frac{n\left ( \sum XY \right )-\left ( \sum X \right )\left ( \sum Y \right )}{n\left ( \sum X^{2} \right )-\left ( \sum X \right )^{2}}\]
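As a sketch of how these sum formulas translate into code, the function below computes a and b directly from the raw sums. The data values here are hypothetical, chosen so that the fit is exact (y = 2x):

```python
# Compute the least-squares intercept (a) and slope (b) from raw sums,
# following the formulas a = (ΣY·ΣX² − ΣX·ΣXY)/D and b = (nΣXY − ΣX·ΣY)/D,
# where D = nΣX² − (ΣX)².

def linear_regression(xs, ys):
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b

# Hypothetical data lying exactly on y = 2x
a, b = linear_regression([1, 2, 3], [2, 4, 6])
# a = 0.0, b = 2.0
```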
Simple Linear Regression
Simple linear regression is the most straightforward case, having a single scalar predictor variable x and a single scalar response variable y. The equation for this regression is y = a + bx.
The extension to multiple and vector-valued predictor variables is known as multiple linear regression, also called multivariable linear regression. The equation for this regression is again of the form Y = a + bX. Almost all real-world regression models involve multiple predictors, so basic descriptions of linear regression are often phrased in terms of multiple regression. Note that, in these cases, the dependent variable y is still a scalar.
Least Square Regression Line or Linear Regression Line
The most popular method to fit a regression line in the XY plot is least squares. This process determines the best-fitting line for the given data by minimising the sum of the squares of the vertical deviations from each data point to the line. If a point lies exactly on the fitted line, its vertical deviation is 0. Because the deviations are first squared and then summed, positive and negative deviations do not cancel. Linear regression determines this straight line, known as the least-squares regression line or LSRL. Suppose Y is a dependent variable and X is an independent variable; then the population regression line is given by the equation:
Y= B0+B1X
Where
B0 is a constant
B1 is the regression coefficient
When a random sample of observations is given, then the regression line is expressed as;
ŷ = b0+b1x
where b0 is a constant
b1 is the regression coefficient,
x is the independent variable,
ŷ is known as the predicted value of the dependent variable.
Properties of Linear Regression
For the regression line where the regression parameters b0 and b1 are defined, the following properties are applicable:
The regression line reduces the sum of squared differences between observed values and predicted values.
The regression line passes through the mean of X and Y variable values.
The regression constant b0 is equal to the y-intercept of the linear regression.
The regression coefficient b1 is the slope of the regression line. Its value is equal to the average change in the dependent variable (Y) for a unit change in the independent variable (X)
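These properties can be checked numerically. The sketch below, on hypothetical data, fits a least-squares line and verifies that the residuals sum to zero and that the line passes through the point of means (x̄, ȳ):

```python
# Verify two properties of the least-squares line on hypothetical data:
# (1) the residuals sum to zero, (2) the line passes through (x̄, ȳ).

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b1 = S_xy / S_xx, intercept b0 = ȳ − b1·x̄
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

residual_sum = sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))
on_line = b0 + b1 * mean_x  # prediction at x̄; should equal ȳ
```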
Regression Coefficient
The regression coefficient is given by the equation :
Y= B0+B1X
Where
B0 is a constant
B1 is the regression coefficient
Given below is the formula to find the value of the regression coefficient.
B1 = b1 = \[\frac{\sum \left ( x_{i}-\bar{x} \right )\left ( y_{i}-\bar{y} \right )}{\sum \left ( x_{i}-\bar{x} \right )^{2}}\]
Where xi and yi are the observed data values,
and x̄ and ȳ are their mean values.
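The deviation form of the coefficient gives the same slope as the raw-sums form shown earlier. The sketch below checks this agreement on hypothetical data:

```python
# The deviation form b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² gives the same
# slope as the raw-sums form b = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²).

# Hypothetical data:
xs = [2, 4, 6, 8]
ys = [1, 4, 4, 7]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Deviation form
b_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)

# Raw-sums form
b_sums = (n * sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys)) \
         / (n * sum(x * x for x in xs) - sum(xs) ** 2)
# both evaluate to 0.9 for this data
```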
Importance of Regression Line
A regression line describes the behaviour of a set of data: a logical approach that helps us study and analyse the relationship between two continuous variables. It is used in machine learning models, mathematical analysis, statistics, forecasting, and other quantitative applications. In the financial sector, analysts use linear regression to predict stock and commodity prices and to perform valuations for different securities. Several well-known companies use linear regression to predict sales, inventories, and so on.
Key Ideas of Linear Regression
Correlation explains the interrelation between variables within the data.
Variance is the degree of the spread of the data.
Standard deviation is the square root of the variance; it measures the dispersion of the data about the mean.
Residual (error term) is the actual value in the dataset minus the value predicted by the linear regression.
Important Properties of Regression Line
Regression coefficients are unchanged by a shift of origin but are changed by a change of scale. The property says that if the variables x and y are transformed to u and v respectively, with u = (x-a)/p and v = (y-c)/q, where p and q are constants, then byx = (q/p)·bvu and bxy = (p/q)·buv.
If there are two lines of regression for the variables x and y, both lines intersect at the point (x̄, ȳ), the means of x and y. According to this property, the point of intersection is the solution of the two regression equations in x and y.
The correlation coefficient between the two variables x and y is the geometric mean of the two regression coefficients, and its sign is the common sign of those coefficients. So, if the regression coefficients are byx = b and bxy = b', then the correlation coefficient is r = ±√(byx · bxy). This is why, when both coefficients are negative, r is also negative, and when both coefficients are positive, r is positive.
The regression constant (a0) is equal to the y-intercept of the regression line and also a0 and a1 are the regression parameters.
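The geometric-mean property above can be verified numerically: the product of the two regression coefficients equals r². A minimal sketch on hypothetical data:

```python
# Check the property r = ±√(byx · bxy) on hypothetical data:
# the product of the two regression coefficients equals r².
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 3.0, 5.0, 4.0, 6.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
s_xx = sum((x - mx) ** 2 for x in xs)
s_yy = sum((y - my) ** 2 for y in ys)

b_yx = s_xy / s_xx                  # slope of the regression of y on x
b_xy = s_xy / s_yy                  # slope of the regression of x on y
r = s_xy / math.sqrt(s_xx * s_yy)   # Pearson correlation coefficient
```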
Regression Line Formula:
A linear regression line equation is written as-
Y = a + bX
where X is plotted on the x-axis and Y is plotted on the y-axis. X is an independent variable and Y is the dependent variable. Here, b is the slope of the line and a is the intercept, i.e. value of y when x=0.
Multiple Regression Line Formula: y= a +b1x1 +b2x2 + b3x3 +…+ btxt + u
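A multiple regression of this form can be fitted by least squares on a design matrix. The sketch below uses NumPy's `lstsq`; the data is hypothetical, constructed so the true coefficients (a = 1, b1 = 2, b2 = 3) are known and should be recovered:

```python
# Multiple linear regression y = a + b1·x1 + b2·x2 fitted by least squares.
# The hypothetical data is noise-free, so the true coefficients are recovered.
import numpy as np

x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2   # exact relationship, no error term

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coef
```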
Assumptions made in Linear Regression
The dependent/target variable is continuous.
The independent variables should not be correlated with one another (no multicollinearity).
There should be a linear relationship between the dependent and explanatory variables.
Residuals should follow a normal distribution.
Residuals should have constant variance.
Residuals should be independently distributed/no autocorrelation.
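Some of these assumptions can be checked directly from the residuals of a fit. A minimal sketch on hypothetical data, checking that the residuals average zero and computing their lag-1 autocorrelation (which should be near 0 if the residuals are independent):

```python
# Residual checks on a fitted line (hypothetical data): residuals should
# average zero, and show little lag-1 autocorrelation if independent.

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

res = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_res = sum(res) / n   # exactly 0 for OLS with an intercept

# Lag-1 autocorrelation of the residuals
num = sum(res[i] * res[i - 1] for i in range(1, n))
den = sum(e * e for e in res)
lag1 = num / den if den > 0 else 0.0
```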
Solved Examples
1. Find a linear regression equation for the following two sets of data:
Sol: To find the linear regression equation, we need to find the values of Σx, Σy, Σx², and Σxy.
Construct the table and find these values. From the table, n = 4, Σx = 20, Σy = 25, Σx² = 120, and Σxy = 144.
The formula of the linear equation is y=a+bx. Using the formula we will find the value of a and b
a= \[\frac{\left ( \sum Y \right )\left ( \sum X^{2} \right )-\left ( \sum X \right )\left ( \sum XY \right )}{n\left ( \sum X^{2} \right )-\left ( \sum X \right )^{2}}\]
Now put the values in the equation
\[a=\frac{25\times 120-20\times 144}{4\times 120-400}\]
a= \[\frac{120}{80}\]
a=1.5
b= \[\frac{n\left ( \sum XY \right )-\left ( \sum X \right )\left ( \sum Y \right )}{n\left ( \sum X^{2} \right )-\left ( \sum X \right )^{2}}\]
Put the values in the equation
\[b=\frac{4\times 144-20\times 25}{4\times 120-400}\]
b=\[\frac{76}{80}\]
b=0.95
Hence we get a = 1.5 and b = 0.95.
The linear equation is given by
Y = a + bx
Now put the value of a and b in the equation
Hence equation of linear regression is y = 1.5 + 0.95x
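The arithmetic in this example can be reproduced directly from the given sums:

```python
# Reproduce the worked example from its sums:
# n = 4, Σx = 20, Σy = 25, Σx² = 120, Σxy = 144.
n, sum_x, sum_y, sum_x2, sum_xy = 4, 20, 25, 120, 144

denom = n * sum_x2 - sum_x ** 2                  # 4·120 − 400 = 80
a = (sum_y * sum_x2 - sum_x * sum_xy) / denom    # 120 / 80 = 1.5
b = (n * sum_xy - sum_x * sum_y) / denom         # 76 / 80 = 0.95
```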
FAQs on Linear Regression
1. Why are regression lines considered to be important?
Regression lines are used in the financial sector by companies where various financial analysts implement linear regressions to predict stock prices, commodity prices and to perform valuations for many different securities.
2. How do you define Slope?
Slope tells you how much the target variable changes for a unit change in the independent variable.
In the line equation y = mx + b, the slope is m.
3. How will you explain the difference between Linear regression and multiple regression?
The main difference between linear and multiple linear regression is that linear regression contains only one independent variable whereas multiple regression contains two or more independent variables.
4. How will you define cost function in linear regression?
Cost function is the calculation of the error obtained between the predicted values and actual values, which is represented as a single number called an error.
5. What are some examples of linear regression?
Predicting total sales from advertising spend, estimating the effect of fertilizer on total crop yield (as agricultural scientists do), and modelling the effect of drug dosage on blood pressure are all examples of linear regression.
6. What are the Types of Linear Regression?
Different types of linear regression are:
Simple linear regression
Multiple linear regression
Logistic regression
Ordinal regression
Multinomial regression
Discriminant Analysis
7. What are the Differences Between Linear and Logistic Regression?
Linear regression is used to predict the value of a continuous dependent variable with the help of independent variables. Logistic regression is used to predict a categorical dependent variable with the help of independent variables.
8. How Does a Linear Regression Work?
Linear regression is the process of finding a line that best fits the data points available on the plot. It is then used to predict output values for inputs that are not present in the data set; generally, those predicted outputs fall on the line.