CORRELATION DEFINITIONS

Correlation in Statistics


Correlation means that two quantitative variables, often linked by cause and effect, vary simultaneously in the same or in opposite directions. The term also covers the measurement of such variation.

Correlation definition according to L.R. Conner: "If two or more quantities vary in sympathy so that movements in the one tend to be accompanied by corresponding movements in the other, then they are said to be correlated".

Correlation definition according to King: "Correlation means that between two series or groups of data there exists some causal connection".

Correlation definition according to Croxton and Cowden: "When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation".

Correlation definition according to W.A. Neiswanger: "Correlation analysis contributes to the understanding of economic behaviour, aids in locating critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread and suggest to him the paths through which stabilising forces may become effective".

Perfect Correlation

When two related variables move in the same direction and in the same proportion, the correlation is perfectly positive, and the coefficient of correlation (r) is +1. If the changes are proportional but in opposite directions, the correlation is perfectly negative, and r is -1.
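As a small illustration of both cases, the sketch below builds two series that are exact linear functions of x and computes r with NumPy. The numbers and the variable names (x, y_pos, y_neg) are made up for this example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2.0 * x + 3.0     # same direction, fixed proportion
y_neg = -0.5 * x + 10.0   # opposite direction, fixed proportion

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]

print("perfect positive:", round(r_pos, 6))
print("perfect negative:", round(r_neg, 6))
```

Up to floating-point rounding, the two values come out as +1 and -1.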


Absence of Correlation

If no interdependence is found between two variables, that is, deviations in one variable bear no relationship to the corresponding deviations in the other, there is an absence of correlation, and the coefficient of correlation will be zero.
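The sketch below shows two situations in which r is zero or close to it. The first uses two independently generated random samples. The second is a reminder that r measures only linear association: y is fully determined by x, yet r is still zero. The seed, sample size, and variable names are arbitrary choices for this example.

```python
import numpy as np

# Case 1: two independent random samples (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x_indep = rng.normal(size=10_000)
y_indep = rng.normal(size=10_000)
r_indep = np.corrcoef(x_indep, y_indep)[0, 1]

# Case 2: y depends on x exactly, but not linearly.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_sq = x_sym ** 2
r_sq = np.corrcoef(x_sym, y_sq)[0, 1]

print(f"independent samples:      r = {r_indep:.3f}")
print(f"y = x**2 on symmetric x:  r = {r_sq:.3f}")
```

In other words, a coefficient of zero rules out a straight-line relationship, not every possible relationship.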

                


The correlation coefficient is a single-number summary expressing the utility of linear regression. It is a dimensionless number between -1 and +1. The slope of the regression line and the correlation have the same sign. This single number conveys the strength of a linear relationship: values closer to -1 or +1 indicate greater fidelity to a straight-line relationship.

The correlation is standardized in the sense that its value does not depend on the means or standard deviations of the x or y values.
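The text does not state a formula for r. Assuming the coefficient meant here is the standard Pearson product-moment coefficient, it can be written for n pairs (x_i, y_i) as below. The symbols \bar{x}, \bar{y} (sample means), s_x, s_y (sample standard deviations) and b (least-squares slope of y on x) are notation introduced for this sketch, not taken from the original.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
b = r \, \frac{s_y}{s_x}
```

Subtracting the means and dividing by the spreads is what makes r independent of the means and standard deviations. Because s_x and s_y are positive, the slope b and r always share the same sign.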

If we add or subtract the same value from all the data (and thereby change the means), the correlation remains the same. If we multiply all the xs (or the ys) by some positive value, the correlation remains the same. If we multiply either the xs or the ys by a negative number, the sign of the correlation will reverse.
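These three properties can be checked numerically. The sketch below uses simulated data; the seed, the 0.6 coefficient, and the helper name r are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(size=200)

def r(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

base = r(x, y)
shifted = r(x + 100.0, y - 7.0)   # adding constants changes only the means
scaled = r(3.0 * x, y)            # positive multiplier: r unchanged
flipped = r(-2.0 * x, y)          # negative multiplier: sign of r reverses

print(f"base={base:.4f} shifted={shifted:.4f} scaled={scaled:.4f} flipped={flipped:.4f}")
```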

As with any oversimplification of a complex situation, the correlation coefficient has its benefits, but also its shortcomings. A variety of values of the correlation are illustrated in separate graphs, each consisting of 50 simulated pairs of observations. A correlation of 0, in the upper left, gives no indication of a linear relationship between the plotted variables. A correlation of 0.4 does not indicate much strength either. A correlation of either 0.8 or -0.9 indicates a rather strong linear trend.
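One possible way to generate comparable data is sketched below. It is not necessarily how the original graphs were produced: the function name simulate_pairs, the normal distributions, and the seed are assumptions. Each call draws 50 pairs whose population correlation equals the target; the sample correlation of only 50 points will scatter around that target.

```python
import numpy as np

def simulate_pairs(rho, n=50, seed=0):
    """Draw n (x, y) pairs whose population correlation is rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    noise = rng.normal(size=n)
    # Mixing x with independent noise gives corr(x, y) = rho in the population.
    y = rho * x + np.sqrt(1.0 - rho ** 2) * noise
    return x, y

for rho in (0.0, 0.4, 0.8, -0.9):
    x, y = simulate_pairs(rho)
    sample_r = np.corrcoef(x, y)[0, 1]
    print(f"target {rho:+.1f} -> sample r = {sample_r:+.2f}")
```

Passing the returned arrays to a scatter plot gives a picture of how tight or loose each level of correlation looks.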

Importance of correlation

More Reliable Forecasting: Business forecasting is an important part of the decision-making process in the business world, and correlation helps make such forecasts more reliable.
Useful in Research: The technique of correlation proves very useful in making analyses, drawing conclusions, and developing hypotheses and theories in research and investigation.
Study of Economic Activity: Correlation is also useful in the analytical study of economic activities. For example, it is very useful in studying the impact of a price change on the change in demand, of a change in the supply of money on the price of money, and of variation in the production of cotton on the production of cloth.


Correlation analysis in Python

The data is taken from GitHub.

from urllib.request import urlretrieve
# medical_charges_url must hold the URL of the CSV file; it is not given in this post.
urlretrieve(medical_charges_url, 'medical-charges.csv')
('medical-charges.csv', <http.client.HTTPMessage at 0x7f159ec4ec50>)
import pandas as pd
medical_df = pd.read_csv('medical-charges.csv')
medical_df

      age     sex     bmi  children smoker     region      charges
0      19  female  27.900         0    yes  southwest  16884.92400
1      18    male  33.770         1     no  southeast   1725.55230
2      28    male  33.000         3     no  southeast   4449.46200
3      33    male  22.705         0     no  northwest  21984.47061
4      32    male  28.880         0     no  northwest   3866.85520
...   ...     ...     ...       ...    ...        ...          ...
1333   50    male  30.970         3     no  northwest  10600.54830
1334   18  female  31.920         0     no  northeast   2205.98080
1335   18  female  36.850         0     no  southeast   1629.83350
1336   21  female  25.800         0     no  southwest   2007.94500
1337   61  female  29.070         0    yes  northwest  29141.36030

1338 rows × 7 columns


medical_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       1338 non-null   int64
 1   sex       1338 non-null   object
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64
 4   smoker    1338 non-null   object
 5   region    1338 non-null   object
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
medical_df.describe()
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
# Exploratory Analysis and Visualization
import plotly.express as pt
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size']=14
matplotlib.rcParams['figure.figsize']=(20,4)
matplotlib.rcParams['figure.facecolor']='#00000000'
medical_df.age.describe()
count    1338.000000
mean       39.207025
std        14.049960
min        18.000000
25%        27.000000
50%        39.000000
75%        51.000000
max        64.000000
Name: age, dtype: float64
fig = pt.histogram(medical_df,
                   x='age',
                   marginal='box',
                   nbins=47,
                   title='Distribution of Age')
fig.update_layout(bargap=0.1)
fig.show()
Figure: Distribution of Age
fig= pt.histogram(medical_df,
                  x='bmi',
                  marginal='box',
                  color_discrete_sequence=['red'],
                  title='Distribution of BMI (Body Mass Index)')
fig.update_layout(bargap=0.1)
fig.show()
Figure: Distribution of BMI (Body Mass Index)
fig = pt.histogram(medical_df,
                 x='charges',
                 marginal='box',
                 color='smoker',
                 color_discrete_sequence=['green','grey'],
                 title='Annual Medical Charges' )
fig.update_layout(bargap=0.1)
fig.show()
Figure: Annual Medical Charges
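# Count of records by sex, split by region. The title is corrected (this plot does not
# show charges), the box marginal is dropped because x is categorical, and the default
# palette is used so that each of the four regions gets its own color.
fig = pt.histogram(medical_df,
                   x='sex',
                   color='region',
                   title='Sex by Region')
fig.update_layout(bargap=0.1)
fig.show()
Figure: Sex by Region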
medical_df.smoker.value_counts()
no     1064
yes     274
Name: smoker, dtype: int64
pt.histogram(medical_df, x='smoker', color='sex', title='smoker')
Figure: smoker
# Visualize the relationship between 'age' and 'charges'
fig =pt.scatter(medical_df,
                x='age',
                y='charges',
                color='smoker',
                opacity=0.8,
                hover_data=['sex'],
                title='Age vs Charges')
fig.update_traces(marker_size=5)
fig.show()
Figure: Age vs Charges
fig =pt.scatter(medical_df,
                x='bmi',
                y='charges',
                color='smoker',
                opacity=0.8,
                hover_data=['sex'],
                title='BMI vs Charges')
fig.update_traces(marker_size=5)
fig.show()
Figure: BMI vs Charges
pt.scatter(medical_df, x='children', y='charges')
Figure: children vs charges (scatter plot)
pt.violin(medical_df, x='children', y='charges')
Figure: children vs charges (violin plot)


# Correlation
medical_df.charges.corr(medical_df.age)
0.2990081933306476
medical_df.charges.corr(medical_df.bmi)
0.19834096883362895
# Encode the categorical 'smoker' column as 0/1 so that a correlation can be computed
smoker_values = {'no': 0, 'yes': 1}
smoker_numeric = medical_df.smoker.map(smoker_values)
medical_df.charges.corr(smoker_numeric)
0.787251430498478
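To see what Series.corr does with the encoded column, the sketch below computes the Pearson coefficient by hand and compares it with the pandas result. The six rows in toy_df are invented for illustration and are not drawn from medical-charges.csv; pearson_r is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up rows, only to illustrate the calculation.
toy_df = pd.DataFrame({
    "smoker": ["no", "yes", "no", "yes", "no", "no"],
    "charges": [2000.0, 21000.0, 3500.0, 30000.0, 4200.0, 2800.0],
})
smoker_numeric = toy_df["smoker"].map({"no": 0, "yes": 1})

def pearson_r(a, b):
    """Pearson correlation from deviations about the means."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

manual = pearson_r(smoker_numeric, toy_df["charges"])
builtin = toy_df["charges"].corr(smoker_numeric)  # pandas default is Pearson
print(f"manual={manual:.6f} builtin={builtin:.6f}")
```

The two values agree, which shows that mapping 'no'/'yes' to 0/1 simply lets the ordinary Pearson formula be applied to a two-valued column.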
pt.scatter(medical_df,x='age',y='age')


Figure: age vs age
pt.scatter(medical_df, x='age', y='children')
Figure: age vs children
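# numeric_only=True skips the text columns (sex, smoker, region); pandas 2.x raises an error without it
medical_df.corr(numeric_only=True)
               age       bmi  children   charges
age       1.000000  0.109272  0.042469  0.299008
bmi       0.109272  1.000000  0.012759  0.198341
children  0.042469  0.012759  1.000000  0.067998
charges   0.299008  0.198341  0.067998  1.000000
sns.heatmap(medical_df.corr(numeric_only=True), cmap='Reds', annot=True)
plt.title('Correlation Matrix');
Figure: Correlation Matrix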
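The matrix above leaves out the text columns. If a yes/no column such as smoker should appear in the matrix as well, one option is to encode it first and then keep only the numeric columns. The sketch below uses a small invented frame (toy_df) and a hypothetical column name (smoker_code); it is not the real dataset.

```python
import pandas as pd

# Made-up rows, only to illustrate the steps.
toy_df = pd.DataFrame({
    "age": [19, 28, 33, 45, 52, 61],
    "bmi": [27.9, 33.0, 22.7, 30.1, 26.4, 29.1],
    "smoker": ["yes", "no", "no", "yes", "no", "yes"],
    "charges": [16900.0, 4400.0, 5200.0, 24000.0, 9800.0, 29100.0],
})

# Encode the yes/no column so that it can take part in the matrix.
toy_df["smoker_code"] = toy_df["smoker"].map({"no": 0, "yes": 1})

# select_dtypes keeps only the numeric columns, so the text column is skipped.
corr_matrix = toy_df.select_dtypes("number").corr()
print(corr_matrix.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to sns.heatmap in the same way as above.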
