This question may or may not have caused me to go on a statistical regression and assessment binge over the last week...
As some of you know, I began a Google Sheets document a year ago to plot out not only my rankings of roller coasters, but also their respective statistics, manufacturers, designers, etc. The original reason for this exercise was to add up how many miles, inversions, etc. I had travelled while riding roller coasters. But I realized that, with all these data points, it would be possible to build a statistical regression to see which statistics are most closely correlated with my ranking. In other words - are there certain statistics, such as height, speed, or length, that I find most appealing in a roller coaster?
So, while I am still working on generating the statistical regression and showing which factors have the most statistical significance, I give you a preview of the findings! The charts that follow are scatter plots of each roller coaster's ranking against the respective statistic. You'll notice a trendline projected through these data points - this trendline has been logged (fit on a logarithmic scale), to help account for uneven spread in the data (otherwise known as heteroskedasticity) and smooth out the trend. I've also attached an r-squared value to each trendline, which is a statistical measure of how much of the variation in one variable is explained by the other (in this case, how closely coaster ranking is linked to the individual statistic). Here, the higher the r-squared value, the better, as it indicates a closer linkage between the two.
As you can see, the r-squared values (in order of greatest to least) are:
- Top Speed - 0.368
- Max Vertical Angle - 0.312
- Height - 0.278
- Cost - 0.258
- Length - 0.238
- Inversions - 0.052
- Duration - 0.021
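For anyone who wants to check this kind of number outside of Sheets, here is a minimal sketch of the calculation. It assumes the trendline has the form rank = a * ln(stat) + b, which is my understanding of what a logarithmic trendline fits; the function name and the sample numbers are made up for illustration and are not from my sheet.

```python
import math

def log_trendline(xs, ys):
    """Least-squares fit of y = a * ln(x) + b; returns (a, b, r_squared)."""
    lx = [math.log(x) for x in xs]  # every x must be > 0
    n = len(lx)
    mean_lx = sum(lx) / n
    mean_y = sum(ys) / n
    sxx = sum((u - mean_lx) ** 2 for u in lx)
    sxy = sum((u - mean_lx) * (y - mean_y) for u, y in zip(lx, ys))
    a = sxy / sxx                   # slope on the log scale
    b = mean_y - a * mean_lx        # intercept
    # r-squared = 1 - (unexplained variation / total variation)
    ss_res = sum((y - (a * u + b)) ** 2 for u, y in zip(lx, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - ss_res / ss_tot

# Made-up numbers purely for illustration (top speed in mph, my ranking)
speeds = [35, 50, 62, 70, 85, 95]
ranks = [40, 31, 22, 18, 9, 4]
a, b, r2 = log_trendline(speeds, ranks)
print(f"rank = {a:.2f} * ln(speed) + {b:.2f}, r-squared = {r2:.3f}")
```

An r-squared of 0.368 for top speed would mean that roughly 37% of the variation in ranking is accounted for by that one trendline, with the rest left unexplained.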
I'll continue tinkering, and report back to the group as I keep building out the regression model. If there's anything to take away from this assessment at this point, it is:
- Statistics alone are a very weak indicator of how much you will like a roller coaster.
- Low r-squared values (ideally, you look for r-squared values in the 0.5-0.6 range and up for modeling) mean there are a lot of other factors, such as aesthetics, airtime, etc., that are difficult to quantify but have a bearing on how much you like a ride.
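As a rough starting point for the fuller model, here is a sketch of an ordinary least squares regression that uses several statistics at once instead of one chart per statistic. This is not what my sheet currently does; the function names and sample numbers are hypothetical, and it is written in plain Python so it runs without extra libraries. It also reports adjusted r-squared, which penalizes adding predictors that don't pull their weight.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def multiple_regression(rows, ys):
    """OLS of y on several predictors.

    Returns (coefficients, r_squared, adjusted_r_squared);
    coefficients[0] is the intercept.
    """
    X = [[1.0] + list(r) for r in rows]  # prepend intercept column
    n, k = len(X), len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    Xty = [sum(X[i][p] * ys[i] for i in range(n)) for p in range(k)]
    beta = solve(XtX, Xty)
    preds = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(ys) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)  # k counts the intercept
    return beta, r2, adj_r2

# Made-up rows: (top speed mph, height ft, inversions) and my ranking
stats = [(35, 60, 0), (50, 110, 2), (62, 150, 0),
         (70, 200, 3), (85, 230, 1), (95, 300, 0)]
ranks = [40, 31, 22, 18, 9, 4]
beta, r2, adj_r2 = multiple_regression(stats, ranks)
print("coefficients:", [round(v, 3) for v in beta])
print(f"r-squared = {r2:.3f}, adjusted r-squared = {adj_r2:.3f}")
```

One caveat with this approach: speed and height tend to move together, so their individual coefficients can be unstable even when the overall fit looks reasonable.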
Those of you who are stat wonks: I would also appreciate your input on how I can further build out this model and better capture correlation.