Statistics Yup (Correlation Coefficient)

If you have ever done a lab experiment, you have probably tried “To Obtain A Graphical And Mathematical Relationship Between” two different values (TOAGAMRB, as my Physics teacher says). Often your experiment will result in a sort of scatter plot, as shown below:

This data, however, is useless unless you can use it to predict the outcome of future experiments. This is where a “line of best fit” comes in handy. The closer your data is to this line, the more accurate your derived relationship is. See the example below:

As you can see, Graph 1 has data values where the y-values are closer to the line. Conceptually, this line would better predict outcomes at different x-values. Graph 2 has data values that vary more from the line, so its line of best fit would not do as good a job of predicting future values. Both cases, however, use the plotted data to predict future outcomes and determine relationships.
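To make the idea of a best-fit line concrete, here is a minimal sketch of an ordinary least-squares straight-line fit. The data points are made up for illustration (they are not the values plotted in the graphs), and `fit_line` is just a helper name chosen here.

```python
# Hypothetical data for illustration only, not the values from the graphs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-products and sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# How far each measured y-value sits above (+) or below (-) the line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Use the line to predict an outcome at an x-value that was never measured.
predicted_at_6 = slope * 6 + intercept

print(slope, intercept, predicted_at_6)
```

The smaller the residuals, the closer the points hug the line, which is the difference between Graph 1 and Graph 2.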

On the graph, you probably also notice an “R^2” value. Graph 1’s value is closer to 1 while Graph 2’s value is farther from 1. Strictly speaking, R^2 is called the coefficient of determination; for a straight-line fit it is the square of the correlation coefficient, r. Both are measures of how well the line describes the relationship between x and y. In the two cases above, a quadratic equation is used to find the best fit, but for the sake of explanation, let’s use a linear graph.
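For a straight-line fit, the R^2 that graphing tools usually report (one minus the unexplained variation divided by the total variation) works out to the square of the correlation coefficient r. The sketch below checks that numerically on made-up data; all variable names are illustrative.

```python
# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Sums of squared deviations and of cross-products.
sxx = sum((x - x_mean) ** 2 for x in xs)
syy = sum((y - y_mean) ** 2 for y in ys)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

# Least-squares line y = slope * x + intercept.
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# R^2 = 1 - (variation left over after the fit) / (total variation in y).
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - ss_res / syy

# Correlation coefficient r computed from the same sums.
r = sxy / (sxx * syy) ** 0.5

print(r_squared, r ** 2)  # the two agree up to floating-point rounding
```

Note that this equality holds for a linear fit; for a quadratic fit like the ones in the graphs, R^2 is still computed the same way but is no longer the square of r.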

Think of all of these values as collected data. To find the correlation coefficient, we can use the equation below:
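Going by the description that follows (a 1/(n-1) factor, a summation, the two means, and the standard deviations in the denominators), the equation is the standard formula for the sample correlation coefficient. The symbols s_x and s_y for the standard deviations are notation assumed here:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```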

First, let’s explain what this equation means. The first part is pretty simple: 1/(n-1), or 1/(number of values - 1). In the case above, there are five data points, so this would be 1/4.

The next part of this equation uses summation. The x with a bar over it is the mean (the average) of the x-values, and the y with a bar over it is the mean of the y-values. Here, the x-values have a mean of 3 and the y-values have a mean of 6.2. The variables in the denominators are the standard deviations of each data set. The standard deviation is a measure of the dispersion of data from its mean. What is interesting is that each factor in this summation is also the equation for a data point’s “z-score.”
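As a quick sketch, the means and standard deviations can be computed with Python's built-in `statistics` module. The data here is hypothetical (the y-values are not the ones from the example, so their mean is not 6.2). `statistics.stdev` is the sample standard deviation, which divides by n-1 and so matches the 1/(n-1) factor in the equation.

```python
import statistics

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_mean = statistics.mean(xs)  # "x-bar"
y_mean = statistics.mean(ys)  # "y-bar"

# Sample standard deviations (divide by n - 1, not n).
s_x = statistics.stdev(xs)
s_y = statistics.stdev(ys)

print(x_mean, y_mean, s_x, s_y)
```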

The z-score is a standardized value assigned to each data point. It is found by taking (value - mean)/standard deviation. In other words, the z-score is how many standard deviations the point is from the mean.
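Here is a minimal sketch of the z-score calculation, again on made-up values. `z_scores` is a helper name chosen for this example, and it assumes the sample standard deviation.

```python
import statistics


def z_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / s for v in values]


# Hypothetical x-values for illustration only.
xs = [1, 2, 3, 4, 5]
zx = z_scores(xs)

# The middle value equals the mean, so its z-score is 0; values below
# the mean get negative z-scores and values above it get positive ones.
print(zx)
```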

In the correlation coefficient equation, the z-scores of each paired x- and y-value are multiplied together. For example, the z-score for the x-value 1 would be multiplied by the z-score for its y-value, 2. The same process is repeated for every pair, and the products are added. Finally, this sum is multiplied by 1/4, yielding an r-value of roughly 0.969.
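The whole procedure (z-scores, products, sum, then the 1/(n-1) factor) can be sketched in a few lines. The data is hypothetical, so the result is not the 0.969 from the example, and `correlation` is simply a name picked for this sketch.

```python
import statistics


def correlation(xs, ys):
    """Sample correlation coefficient r, computed from z-score products."""
    n = len(xs)
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    # Multiply the z-score of each x by the z-score of its paired y.
    products = [
        ((x - x_mean) / s_x) * ((y - y_mean) / s_y)
        for x, y in zip(xs, ys)
    ]
    # Add the products and scale by 1 / (n - 1).
    return sum(products) / (n - 1)


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
print(round(r, 3))
```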

All of that seems kinda complicated and hard to visualize when only working with numbers. Using the same variables, an easier-to-understand picture is shown below:

The less each point deviates from the line, the more accurate the best-fit line is. Since the slope of this line is positive, there is a positive correlation between x and y. A negative value for r indicates an inverse relationship between x and y.
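To see the sign of r in action, the sketch below computes r for one made-up data set that rises and one that falls. Both data sets and the helper name `pearson_r` are illustrative assumptions.

```python
def pearson_r(xs, ys):
    """Correlation coefficient from sums of deviations (equivalent to the z-score form)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]   # y tends to go up as x goes up
falling = [5, 4, 5, 4, 2]  # same y-values in reverse order

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)

# Positive r for the rising data, negative r for the falling data.
print(r_rising, r_falling)
```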


Shinnick