## Regression analysis is a statistical tool used for decision making and business forecasting.

According to Ya-Lun Chou, "Regression analysis attempts to establish the nature of the relationship between variables, that is, to study the functional relationship between the variables and thereby provide a mechanism for prediction and forecasting".

Regression analysis studies the nature of the relationship between two variables so that we can estimate one variable from the other.

**Regression Example:**

In a chemical process, suppose that the yield of product is related to the process operating temperature. Regression analysis can be used to build a model that expresses yield as a function of temperature. The regression model can be **used to predict** yield at a given temperature level. It could also be **used for process optimization** or process control purposes. There are some basic concepts of regression to understand before applying a regression line.

1. **Nature of Relationship:** Regression means the act of returning. It is a mathematical measure which shows the average relationship between two variables.

2. **Forecasting:** Regression analysis enables us to make predictions. For example, by using the regression equation of x on y we can estimate the most probable value of x on the basis of the given value of y.

3. **Cause and Effect:** Regression analysis expresses the relationship between two variables in cause and effect form. It establishes the functional relationship, in which one variable is treated as the dependent variable and the other as the independent variable.

4. **Absolute and Relative Measure:** The regression coefficient is an absolute measure. If we know the value of the independent variable, we can estimate the value of the dependent variable.

5. **Effect of change in origin and scale:** Regression coefficients are independent of a change in origin but not of a change in scale.

6. **Symmetry:** In regression analysis it must be clearly identified which variable is the dependent variable and which one is independent. In general b(xy) is not equal to b(yx), so regression coefficients are not symmetric.
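The last two properties can be checked numerically. The sketch below uses a small made-up data set and a helper named `slope` (both are illustrative assumptions, not taken from the text) that computes the regression coefficient of one variable on another as the sum of cross-deviations divided by the sum of squared deviations of the independent variable.

```python
import numpy as np

def slope(dep, indep):
    """Regression coefficient of `dep` on `indep` (hypothetical helper)."""
    dep = np.asarray(dep, dtype=float)
    indep = np.asarray(indep, dtype=float)
    dev_dep = dep - dep.mean()
    dev_indep = indep - indep.mean()
    return float((dev_dep * dev_indep).sum() / (dev_indep ** 2).sum())

# Illustrative data (made up for this example)
x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 9.0])

b_yx = slope(y, x)  # regression coefficient of y on x
b_xy = slope(x, y)  # regression coefficient of x on y

# Change of origin: adding or subtracting constants leaves the coefficient unchanged
b_yx_shifted = slope(y + 100, x - 7)

# Change of scale: multiplying x by 10 divides b_yx by 10
b_yx_scaled = slope(y, 10 * x)

print("b_yx:", b_yx, "b_xy:", b_xy)
print("after change of origin:", b_yx_shifted)
print("after change of scale:", b_yx_scaled)
```

On this data b(yx) and b(xy) come out different, which is the asymmetry described in point 6.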

## Regression Line

A regression line shows the average relationship between two variables. There are two regression lines, namely the regression line of x on y and the regression line of y on x. On the basis of a regression line, we can predict the value of the dependent variable from a given value of the independent variable. Because it shows the average relationship between two variables, it is called the **Line of Best Fit**. Here we can estimate the value of x on the basis of a given value of y, or the value of y on the basis of a given value of x. If there is perfect correlation between the two variables x and y, so that r = +1 or r = -1, then there will be only one regression line. If the correlation is positive, the regression line slopes upward from left to right. In the case of negative correlation, the line slopes downward from left to right, as shown in the figure.

The important fact to note here is that both lines slope in the same direction: either both rise from left to right or both fall from left to right. It is not possible for one line to rise while the other falls.

If the two regression lines intersect each other at 90 degrees, there is no correlation between the two variables, so r = 0, as is clear from the diagram given below.

The two regression lines intersect each other at the point given by the means of x and y. If we draw a perpendicular from the point of intersection to the x-axis we get the mean value of x, and if we draw a perpendicular from the point of intersection to the y-axis we get the mean value of y.
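These geometric facts can be verified on a small example. The sketch below, with made-up data and variable names chosen only for illustration, fits both regression lines, solves for their point of intersection, and compares it with the means. It also checks the standard identity that the product of the two regression coefficients equals r squared.

```python
import numpy as np

# Illustrative data (made up for this example)
x = np.array([1.0, 2.0, 4.0, 5.0, 8.0])
y = np.array([2.0, 3.0, 5.0, 4.0, 9.0])

dx, dy = x - x.mean(), y - y.mean()
b_yx = (dx * dy).sum() / (dx ** 2).sum()  # slope of the line of y on x
b_xy = (dx * dy).sum() / (dy ** 2).sum()  # slope of the line of x on y

# Line of y on x:  y = a1 + b_yx * x
a1 = y.mean() - b_yx * x.mean()
# Line of x on y:  x = a2 + b_xy * y
a2 = x.mean() - b_xy * y.mean()

# Intersection: solve  -b_yx * x + y = a1  and  x - b_xy * y = a2
A = np.array([[-b_yx, 1.0], [1.0, -b_xy]])
x_star, y_star = np.linalg.solve(A, np.array([a1, a2]))

r = np.corrcoef(x, y)[0, 1]
print("intersection:", (x_star, y_star))
print("means:       ", (x.mean(), y.mean()))
print("b_yx * b_xy =", b_yx * b_xy, " r^2 =", r ** 2)
```

Both slopes carry the sign of the same cross-deviation sum, which is why the two lines always slope in the same direction.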

## Uses of Regression Analysis

Regression analysis is **used in economics, business research and all the scientific** disciplines. In economics, it is used to estimate the relation between variables such as price and demand, or income and saving. If we know income, we can estimate the probable saving. In business, we know that the quantity of sales is affected by expenditure on advertisement, for example on social media. So wherever a problem is based on a cause and effect relationship, regression analysis is very useful.

Regression analysis is used to find the **standard error of estimate** for regressions, which is a measure of the unexplained variation in the dependent variable. It helps to determine the reliability of predictions made using the regression. **Regression analysis is used** to estimate the value of the dependent variable for a given value of the independent variable. It describes the average relationship existing between two variables x and y. For example, a businessman can estimate the probable fall in demand if he decides to increase the price of a commodity.
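As a rough illustration of the price and demand example, the sketch below fits a line of demand on price and computes the standard error of estimate. The figures are made up, and dividing the residual sum of squares by n - 2 is one common textbook convention (some texts divide by n), so treat this as an assumption rather than the only definition.

```python
import numpy as np

# Illustrative data (made up): price of a commodity and quantity demanded
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
demand = np.array([95.0, 88.0, 84.0, 75.0, 70.0, 62.0])

# Fit the line of demand on price by least squares
slope, intercept = np.polyfit(price, demand, deg=1)
predicted = intercept + slope * price

# Standard error of estimate: typical size of the unexplained variation
residuals = demand - predicted
n = len(price)
std_error = np.sqrt((residuals ** 2).sum() / (n - 2))

# Estimated fall in demand if the price is raised from 16 to 18
fall = (intercept + slope * 16) - (intercept + slope * 18)

print("slope:", slope, "intercept:", intercept)
print("standard error of estimate:", std_error)
print("estimated fall in demand:", fall)
```

A smaller standard error of estimate means the observed points lie closer to the fitted line, so predictions from it are more reliable.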

Regression analysis is used to make predictions in policy making. On the basis of regression analysis we can make predictions which provide a sound basis for policy formulation in the socio-economic field. We will see some examples that use software such as **Python and R**.

**Regression analysis** is **used in machine learning, artificial intelligence**, data science, cloud computing, security and so on. Interconnectivity and the data explosion are realities that open a world of new opportunities for every business that can read and interpret data in real time. Coming from a long and glorious past in the fields of statistics and econometrics, linear regression and its derived methods can provide you with simple, reliable and effective tools, in R or Python, to learn from data and act on it. If carefully trained with the right data, linear regression can compete well against the most complex artificial intelligence technologies, offering you unbeatable ease of implementation and scalability for increasingly large problems.

Regression analysis is **used in data science** to serve thousands of customers using your company's website every day, drawing on the information about customers available in your data warehouse. In the advertising business, an example is an application delivering targeted advertisements. Regression analysis is **used in e-commerce**, for example in a batch application filtering customers to make more relevant commercial offers, or in an online app recommending products to buy on the basis of ephemeral data such as navigation records. Regression analysis is also **used in the credit and insurance business**, for example in an application deciding whether to proceed with online inquiries from users, basing its judgement on their credit rating and past relationship with the company.

Regression analysis is **used in Python for data science**. Python can work better than other platforms with in-memory data because of its minimal memory footprint and excellent memory management. The memory garbage collector will often save the day when you load, transform, dice, slice, save or discard data during the various iterations of data wrangling.

## Regression Line Equation

Regression line equations are the algebraic expressions of the regression lines. As there are two regression lines, there are two regression line equations:

1. Regression line equation of x on y. This equation is used to estimate the most probable value of 'x' for a given value of the variable 'y'. Here 'x' is the dependent variable and 'y' is the independent variable.

2. Regression line equation of y on x. This equation is used to estimate the most probable value of 'y' for a given value of the variable 'x'. Here 'y' is the dependent variable and 'x' is the independent variable.
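For reference, the two equations are usually written in the following textbook form. The text above does not write them out, so this is the common form rather than a quotation: bars denote means, r is the correlation coefficient, σ_x and σ_y are the standard deviations, and b_xy, b_yx correspond to the b(xy), b(yx) notation used earlier.

```latex
\begin{aligned}
x - \bar{x} &= b_{xy}\,(y - \bar{y}), & b_{xy} &= r\,\frac{\sigma_x}{\sigma_y} && \text{(x on y)}\\
y - \bar{y} &= b_{yx}\,(x - \bar{x}), & b_{yx} &= r\,\frac{\sigma_y}{\sigma_x} && \text{(y on x)}
\end{aligned}
```

Both equations are satisfied by the point of means, and multiplying the two coefficients gives b_xy · b_yx = r².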

## Simple Linear Regression

Simple linear regression studies the relationship between a single regressor variable x and a response variable y. The regressor variable x is assumed to be a continuous mathematical variable, controllable by the experimenter. Suppose that the true relationship between y and x is a straight line, and that the observation y at each level of x is a random variable. Then the expected value of y for each value of x is

E(y|x) = B0 + B1x

where the intercept B0 and the slope B1 are unknown constants.

### Least Squares Lines

Experimental data produce points (x1, y1), ..., (xn, yn) that, when graphed, seem to lie close to a line. We want to determine the parameters B0 and B1 that make the line as "close" to the points as possible.

Suppose B0 and B1 are fixed, and consider the line y = B0 + B1x. Corresponding to each data point (xj, yj) there is a point (xj, B0 + B1xj) on the line with the same x-coordinate. We call yj the observed value of y and B0 + B1xj the predicted y-value. The difference between an observed y-value and a predicted y-value is called a **residual**. There are several ways to measure how "close" the line is to the data. The usual choice, primarily because the mathematical calculations are simple, is to add the squares of the residuals. The least-squares line is the line y = B0 + B1x that minimizes the sum of the squares of the residuals. This line is also called a line of regression of y on x, because any errors in the data are assumed to be only in the y-coordinates. The coefficients B0, B1 of the line are called regression coefficients.

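If the data points were on the line, the parameters B0 and B1 would satisfy the equations

$$
\begin{array}{ccc}
\text{Predicted } y\text{-value} & & \text{Observed } y\text{-value} \\
B_0 + B_1 x_1 & = & y_1 \\
B_0 + B_1 x_2 & = & y_2 \\
\vdots & & \vdots \\
B_0 + B_1 x_n & = & y_n
\end{array}
$$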

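We can write this system as

$$
XB = y, \quad \text{where} \quad
X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad
B = \begin{bmatrix} B_0 \\ B_1 \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$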

If the data points don't lie on a line, then there are no parameters B0, B1 for which the predicted y-values in XB equal the observed y-values in y, and XB = y has no solution. This is a least-squares problem, Ax = b, with different notation!

The square of the distance between the vectors XB and y is precisely the sum of the squares of the residuals. The B that minimizes this sum also minimizes the distance between XB and y. Computing the least-squares solution of XB = y is equivalent to finding the B that determines the least-squares line.

**Question.**

**Find the equation y = B0 + B1x of the least-squares line that best fits the data points (2,1), (5,2), (7,3), (8,3).**
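One way to work this question is through the normal equations (XᵀX)B = Xᵀy, a standard route to the least-squares solution of XB = y. The sketch below is a minimal NumPy version; the variable names are chosen for illustration.

```python
import numpy as np

# Data points from the question
x = np.array([2.0, 5.0, 7.0, 8.0])
y = np.array([1.0, 2.0, 3.0, 3.0])

# Design matrix X: a column of ones (for B0) and the x-values (for B1)
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) B = X^T y
B = np.linalg.solve(X.T @ X, X.T @ y)
B0, B1 = B

# Residuals: observed y-values minus predicted y-values
residuals = y - X @ B

print("B0 =", B0, " B1 =", B1)
print("sum of squared residuals =", (residuals ** 2).sum())
```

Done by hand, XᵀX has rows (4, 22) and (22, 142) and Xᵀy = (9, 57), which gives B0 = 2/7 and B1 = 5/14, so the least-squares line is y = 2/7 + (5/14)x.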

We assume that each observation y can be described by the model

y = B0 + B1x + e

where e is a random error with mean zero and variance σ². The errors {e} are also assumed to be uncorrelated random variables. The regression model given by this equation, involving only a single regressor variable x, is often called the **simple linear regression** model.

Analysis of residuals is frequently helpful in checking the assumption that the errors are NID(0, σ²), that is, normally and independently distributed with mean zero and variance σ², and in determining whether additional terms in the model would be useful. The experimenter can construct a frequency histogram of the residuals or plot them on normal probability paper. If the errors are NID(0, σ²), then approximately 95% of the standardized residuals should fall in the interval (-2, 2). Residuals far outside this interval may indicate the presence of an outlier, that is, an observation that is atypical of the rest of the data. Various rules have been proposed for discarding outliers.
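The residual check described above can be sketched as follows. The data are simulated, the residuals are standardized by the square root of the residual mean square, and the cutoff of 3 for flagging possible outliers is only an illustrative assumption, not a rule from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Simulated data (assumption): true line y = 3 + 2x with NID(0, 1.5^2) errors
n = 500
x = rng.uniform(0.0, 10.0, size=n)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)

# Least-squares fit of y = B0 + B1*x
X = np.column_stack([np.ones(n), x])
B, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ B

# Standardize by the square root of the residual mean square (n - 2 degrees of freedom)
mse = (residuals ** 2).sum() / (n - 2)
standardized = residuals / np.sqrt(mse)

inside = np.mean(np.abs(standardized) < 2)
possible_outliers = np.flatnonzero(np.abs(standardized) > 3)

print("fraction of standardized residuals in (-2, 2):", inside)
print("indices with |standardized residual| > 3:", possible_outliers)
```

With normally distributed errors the printed fraction should be close to 0.95. A much smaller value, or many flagged indices, would suggest the model assumptions deserve a closer look.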

## Multiple Regression

Suppose an experiment involves two independent variables, say x1 and x2, and one dependent variable y. A simple equation to predict y from x1 and x2 has the form

y = B0 + B1x1 + B2x2 ..........(1)

A more general prediction equation might have the form

y = B0 + B1x1 + B2x2 + B3x1^2 + B4x1x2 + B5x2^2 ..........(2)

This equation is used in geology, for instance, to model erosion surfaces, glacial cirques, soil pH, and other quantities. In such cases, the least-squares fit is called a trend surface. Both equations (1) and (2) lead to a linear model because they are linear in the unknown parameters. In general, a linear model will arise whenever y is to be predicted by an equation of the form

y = B0f0(x1, x2) + B1f1(x1, x2) + ............. + Bkfk(x1, x2)

with f0, ..., fk any sort of known functions and B0, ..., Bk unknown weights.
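To make the point about linearity in the parameters concrete, the sketch below fits the trend surface of equation (2) by ordinary least squares. The data and the weights in `true_B` are simulated assumptions, and `design_matrix` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated survey points (assumption): coordinates x1, x2 and a measured quantity y
x1 = rng.uniform(-1.0, 1.0, size=30)
x2 = rng.uniform(-1.0, 1.0, size=30)
true_B = np.array([1.0, 0.5, -2.0, 0.3, 1.5, -0.7])  # made-up weights B0..B5

def design_matrix(x1, x2):
    """Columns are the known functions 1, x1, x2, x1^2, x1*x2, x2^2 of equation (2)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

X = design_matrix(x1, x2)
y = X @ true_B  # noise-free, so least squares should recover true_B

# The model is nonlinear in x1, x2 but linear in B, so ordinary least squares applies
B_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(B_hat, 6))
```

Each known function fj simply becomes one column of the design matrix, which is why the same least-squares machinery handles both equation (1) and equation (2).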

A regression model that involves more than one regressor variable is called a **multiple regression** model. As an example, suppose that the effective life of a cutting tool depends on the cutting speed and the tool angle. A multiple regression model that might describe this relationship is:

y = B0 + B1x1 + B2x2 + e

where y represents the tool life, x1 represents the cutting speed and x2 represents the tool angle. This is a multiple linear regression model with two regressors. The term linear is used because the equation is a linear function of the unknown parameters B0, B1, and B2. The model describes a plane in the two-dimensional x1, x2 space. The parameter B0 defines the intercept of the plane. We sometimes call B1 and B2 partial regression coefficients, because B1 measures the expected change in y per unit change in x1 when x2 is held constant, and B2 measures the expected change in y per unit change in x2 when x1 is held constant.
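The meaning of the partial regression coefficients can be seen in a short sketch. The tool-life numbers below are simulated purely for illustration, and `predict` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated tool-life data (assumption): x1 = cutting speed, x2 = tool angle
n = 40
x1 = rng.uniform(100.0, 200.0, size=n)
x2 = rng.uniform(10.0, 30.0, size=n)
y = 80.0 - 0.25 * x1 + 1.2 * x2 + rng.normal(0.0, 1.0, size=n)

# Fit y = B0 + B1*x1 + B2*x2 by least squares
X = np.column_stack([np.ones(n), x1, x2])
(B0, B1, B2), *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(speed, angle):
    """Predicted tool life from the fitted plane."""
    return B0 + B1 * speed + B2 * angle

# Raise x1 by one unit while holding x2 constant: the prediction changes by B1
change_x1 = predict(151.0, 20.0) - predict(150.0, 20.0)
# Raise x2 by one unit while holding x1 constant: the prediction changes by B2
change_x2 = predict(150.0, 21.0) - predict(150.0, 20.0)

print("B1 =", B1, " change per unit x1 =", change_x1)
print("B2 =", B2, " change per unit x2 =", change_x2)
```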

### Multiple Regression Analysis

Multiple regression analysis is a logical extension of simple regression analysis. In this method, two or more independent variables are used to estimate the values of the dependent variable. Here the average relationship among three or more variables is computed and used to forecast the value of the dependent variable on the basis of given values of the independent variables. For example, we can estimate the value of X1 when the values of X2 and X3 are given. There are three main objectives of multiple correlation and regression analysis:

a) To derive an equation which provides estimates of the dependent variable from values of two or more independent variables.

b) To measure the error of estimate involved in using this regression equation as a basis of estimation.

c) To measure the proportion of variation in the dependent variable which is explained by the independent variables.
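The three objectives map onto three short computations. The sketch below uses simulated values of X1, X2 and X3 (an assumption for illustration), and the degrees of freedom n - 3 reflect the three fitted constants.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (assumption): X1 depends on X2 and X3 plus random error
n = 60
X2 = rng.uniform(0.0, 10.0, size=n)
X3 = rng.uniform(0.0, 5.0, size=n)
X1 = 4.0 + 1.5 * X2 - 2.0 * X3 + rng.normal(0.0, 1.0, size=n)

# (a) Derive the estimating equation X1 = b0 + b2*X2 + b3*X3
A = np.column_stack([np.ones(n), X2, X3])
coef, *_ = np.linalg.lstsq(A, X1, rcond=None)
fitted = A @ coef

# (b) Standard error of estimate, with n - 3 degrees of freedom
sse = ((X1 - fitted) ** 2).sum()
std_error = np.sqrt(sse / (n - 3))

# (c) Proportion of the variation in X1 explained by X2 and X3 (R squared)
sst = ((X1 - X1.mean()) ** 2).sum()
r_squared = 1.0 - sse / sst

print("coefficients:", coef)
print("standard error of estimate:", std_error)
print("R squared:", r_squared)
```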

**Regression Analysis in Python**

    import pandas as pd
    from sklearn.datasets import load_iris

    # Load the iris measurements into a DataFrame with named columns
    data = load_iris()
    iris = pd.DataFrame(data.data, columns=data.feature_names)
    iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |

    iris['sepal length (cm)']

Output:

    0      5.1
    1      4.9
    2      4.7
    3      4.6
    4      5.0
          ...
    145    6.7
    146    6.3
    147    6.5
    148    6.2
    149    5.9
    Name: sepal length (cm), Length: 150, dtype: float64

Code:

    iris['sepal width (cm)'] > 1

Output:

    0      True
    1      True
    2      True
    3      True
    4      True
           ...
    145    True
    146    True
    147    True
    148    True
    149    True
    Name: sepal width (cm), Length: 150, dtype: bool

Code:

    iris['petal width (cm)'] > 3

Output:

    0      False
    1      False
    2      False
    3      False
    4      False
           ...
    145    False
    146    False
    147    False
    148    False
    149    False
    Name: petal width (cm), Length: 150, dtype: bool

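    from matplotlib import pyplot as plt
    import seaborn as sns

    # Scatter plot of petal length against sepal length
    sns.scatterplot(x='sepal length (cm)', y='petal length (cm)', data=iris)
    plt.show()

    # Model 1: predict sepal length from sepal width alone
    y = iris[['sepal length (cm)']]
    x = iris[['sepal width (cm)']]

    from sklearn.model_selection import train_test_split

    # Hold out 30% of the rows for testing (the split is random, so results vary between runs)
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    x_test.head()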

| | sepal width (cm) |
|---|---|
| 17 | 3.5 |
| 112 | 3.0 |
| 26 | 3.4 |
| 31 | 3.4 |
| 80 | 2.4 |

Code:

    x_train.head()

| | sepal width (cm) |
|---|---|
| 134 | 2.6 |
| 51 | 3.2 |
| 120 | 3.2 |
| 19 | 3.8 |
| 70 | 3.2 |

    from sklearn.linear_model import LinearRegression

    # Fit model 1 on the training rows
    lr = LinearRegression()
    lr.fit(x_train, y_train)

Output:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Code:

    # Predict sepal length for the held-out rows
    y_pred = lr.predict(x_test)
    y_test.head()

| | sepal length (cm) |
|---|---|
| 17 | 5.1 |
| 112 | 6.8 |
| 26 | 5.0 |
| 31 | 5.4 |
| 80 | 5.5 |

Code:

    y_pred[0:5]

Output:

    array([[5.68319154],
           [5.84032627],
           [5.71461848],
           [5.71461848],
           [6.02888794]])

Code:

    from sklearn.metrics import mean_squared_error

    # Mean squared error of model 1 on the test rows
    mean_squared_error(y_test, y_pred)

Output:

    0.9060450498564653

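    # Model 2: predict sepal length from the other three measurements
    y = iris[['sepal length (cm)']]
    x = iris[['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]

    # Draw a fresh random 70/30 split for the second model
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
    x_train.head()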

| | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|
| 99 | 2.8 | 4.1 | 1.3 |
| 82 | 2.7 | 3.9 | 1.2 |
| 69 | 2.5 | 3.9 | 1.1 |
| 148 | 3.4 | 5.4 | 2.3 |
| 135 | 3.0 | 6.1 | 2.3 |

Code:

    # Fit model 2 on the training rows
    lr2 = LinearRegression()
    lr2.fit(x_train, y_train)

Output:

    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Code:

    # Mean squared error of model 2 on its test rows
    y_pred = lr2.predict(x_test)
    mean_squared_error(y_test, y_pred)

Output:

    0.0813243348427029

Model 2 is better than model 1, because a lower mean squared error means the predictions are closer to the observed values.
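The two error values above were computed on different random splits, since `train_test_split` was called once per model without a fixed seed. A slightly fairer comparison scores both models on the same held-out rows. The sketch below is one way to do that; the `random_state=0` seed, the `feature_sets` dictionary and the use of `load_iris(as_frame=True)` are choices made here for illustration, not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True).data  # DataFrame with the four measurement columns
y = iris['sepal length (cm)']

feature_sets = {
    'model 1': ['sepal width (cm)'],
    'model 2': ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
}

# Split the row positions once so both models see the same training and test rows
positions = np.arange(len(iris))
train_idx, test_idx = train_test_split(positions, test_size=0.3, random_state=0)

scores = {}
for name, columns in feature_sets.items():
    model = LinearRegression()
    model.fit(iris.iloc[train_idx][columns], y.iloc[train_idx])
    predictions = model.predict(iris.iloc[test_idx][columns])
    scores[name] = mean_squared_error(y.iloc[test_idx], predictions)

print(scores)
```

Fixing the seed also makes the numbers repeatable from run to run, which the unseeded split does not.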
