Goals of Analysis

Upon completion of data collection, our team moved into an analysis phase, breaking up the data and interpreting the results. We performed three different analyses on the data: Aaron completed the DOE analysis, Noah did the correlation and regression analysis, and Ethan did the statistical analysis. Each analysis helped our team determine the overall best location to study based on the measured response variables. The design of experiments (DOE) analysis helped us understand how each output variable responded to our chosen input factors and their corresponding levels. It also displayed possible interactions between factors, which helped us understand the significance of each level. The correlation and regression analysis gave an overview of which trendlines fit the relationships in the data, as well as the confidence in those results. The statistical analysis was vital in understanding which relationships among the input factors are statistically significant and meaningful, tying together the overall validity of the data collection and the resulting conclusions. Together, the analyses helped us answer our main project question based on our findings.

Analysis: Text

Correlation & Regression Analysis

By Noah Thrush

Introduction

The goal of running linear regression tests is to see whether a linear relationship exists in the data. If it does, that relationship could allow us to predict the outcomes of this experiment. The goal of running correlation tests is to see whether we can detect a relationship that shows how one set of data changes with the other. Knowing the linear relationship between the time of day and the sound levels would let us find the best place to study more effectively. It is possible that the best place on campus to study changes with the time of day, and if so, finding these relationships will help us pinpoint the times at which it changes.

Methods

I began the analysis by setting up the tables that would hold my data and equations, as seen in Figure 1.

Figure 1: Table that was used to calculate the linear regression of temperature over the course of the data collection process

In the first two columns, I entered the time elapsed (x) and the temperature of the surrounding environment (y). Once the data was entered and x*y, x^2, and y^2 were calculated, I graphed the x and y data and added a trendline, as seen in Figure 2.

Figure 2: Graph of temperature vs. time with the line equation and R^2 value.

After the graph and trendline were made, I used the trendline equation that Excel gave me to calculate the predicted y values (y_predicted), and then used those to calculate (y - y_predicted)^2. I took the r^2 value directly from Excel. The trendline equations were different for each graph but were either in the form y = mx + b or y = ax^2 + bx + c. Then I calculated the mean of x (x_mean) in order to calculate (x - x_mean)^2. Once all of these values were calculated and entered into the Excel document, I was able to calculate Sy, Sxx, the swing term, and the confidence intervals using the equations shown in Figure 3.

Figure 3: Equations that calculated Sy, Sxx, swing terms, and confidence intervals
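
The quantities in this step can also be sketched in Python. This is a minimal sketch under assumptions, not the spreadsheet itself: it covers only the straight-line case y = mx + b, and it uses the common textbook forms Sy = sqrt(sum((y - y_predicted)^2) / (n - 2)), Sxx = sum((x - x_mean)^2), and swing = t * Sy * sqrt(1/n + (x - x_mean)^2 / Sxx), which may differ from the exact equations in Figure 3. The function name and the t_crit argument (a critical t value supplied by the caller) are hypothetical.

```python
import math

def linear_fit_with_swing(x, y, t_crit):
    """Least-squares line plus a swing term at each x (assumed textbook forms)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    y_pred = [slope * xi + intercept for xi in x]

    # Sum of squared residuals, (y - y_predicted)^2
    sse = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    sy = math.sqrt(sse / (n - 2))   # standard error of the fit
    r2 = 1 - sse / syy              # coefficient of determination

    # Half-width of the confidence interval for the fitted line at each x
    swing = [t_crit * sy * math.sqrt(1 / n + (xi - x_mean) ** 2 / sxx) for xi in x]
    return {"slope": slope, "intercept": intercept, "r2": r2,
            "sy": sy, "sxx": sxx, "swing": swing}
```

The swing values are smallest at x_mean and grow toward the ends of the data, which is why error bars built from them widen away from the middle of the graph.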

Finally, the swing terms were used to draw the confidence intervals as error bars on the graph. Once all of these values were calculated, I performed some further linear regression analysis. I used the Data Analysis feature in Excel to break down the data, as seen in Figure 4.

Figure 4: Linear regression analysis performed using the Regression option of the Data Analysis tool.

Once the regression analysis was completed, I did my correlation analysis. I used the CORREL function in Excel to get an r-value describing how strong the correlation was between the data. Once I obtained the r-value, I calculated a t-value for each of the data sets using the equation t = (r*SQRT(n-2))/SQRT(1-r^2), where n is the number of data points. After I obtained the t-value, I plugged it into the T.DIST.2T function in Excel to get a p-value that shows the significance of the correlation.
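
The correlation steps above can be sketched in Python as follows. This is a rough sketch under assumptions: it takes the degrees of freedom to be n - 2, and it replaces T.DIST.2T with a numerical integration of the Student's t density so that only the standard library is needed. All function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value of Student's t via Simpson's rule (steps must be even)."""
    t = abs(t)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda v: c * (1 + v * v / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3            # P(0 < T < |t|)
    return max(0.0, 1 - 2 * area)

def correlation_significance(x, y):
    """Return (r, t, p) using t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 degrees of freedom."""
    n = len(x)
    r = pearson_r(x, y)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # undefined when |r| == 1
    return r, t, t_two_tailed_p(t, n - 2)
```

A p-value below 0.05 from `correlation_significance` would correspond to the significance threshold used in this analysis.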

Results

Figure 5: X-Y experiment that our group ran that measured temperature every hour

Directly above are the results of our correlation and linear regression analyses. These three graphs in particular did a very good job of representing the other data sets in the temperature vs. time and sound vs. time data. Though these are not all of the graphs that were created during these analyses, they are the ones that best represent the others. After analyzing this data, we did not find a consistent correlation between any of the above factors. About one-third of the temperature vs. time graphs had a p-value of less than 0.05, but a significant correlation did not exist in our X-Y data. We were also able to observe a jump in temperature at 5:00 PM and 6:00 PM, which could have been a result of how our temperature probe reacted to the sun shining directly upon it at those times. However, there was a jump in temperature between. In addition to some of the temperature vs. time graphs, a correlation did exist in the sound level vs. time graphs. The correlation analysis produced a p-value below 0.05, showing that there was a correlation between the two. This is a very interesting result, since none of these data sets produced an r-value anywhere close to one.

After testing the correlation of the data sets, I began the linear regression analysis. Though it was hard to see, and much of the data did not fall within the bounds of the confidence intervals, there was some statistical significance in the linear relationship for the sound level vs. time data set. When looking at linear regression, the r^2 value is the biggest indicator of whether a relationship exists in the data. The r^2 value tells us how well the trendline fits the data: the closer r^2 is to 1, the more accurately the trendline fits, and the more accurately you can predict the outcomes of the experiment. For the X-Y data in Figure 5, I found an r^2 value of 0.2, which is very low and tells us that there is no meaningful linear relationship. Also, the regression analysis did not give us a Significance F or p-value below 0.05. The second data set, whose graph can be seen in Figure 6, was an experimental run that took place in the Library on a weekend morning. It gave an r^2 value of 0.076, which is extremely low, as well as Significance F and p-values that were not below 0.05. Finally, Figure 7 shows the graph of sound level vs. time in the Library on a weekday morning. The linear regression analysis gave an r^2 value of 0.03. Though this is a very small r^2 value, further analysis showed that the relationship was statistically significant: both the p-value and the Significance F from the regression analysis were below 0.05. This means the linear trend is statistically detectable, although with such a low r^2 the trendline explains only a small part of the variation in sound level. After analyzing all this data, we can say that a linear relationship between temperature and time does not exist, but we can see a slight relationship between time and sound level.

Conclusions

In the end, the goal of our project was to determine which spot on campus that serves food is the best place to study. Observing a relationship between time and temperature, or between time and sound level, could allow us to predict two of the factors that help determine that. Weaknesses in our analysis could have come from the short periods over which data was collected or from the small sample size. For example, for our X-Y experiment, we took one temperature reading every hour. We could have measured the change in temperature over time more accurately by checking the temperature more frequently, to see whether there was significant fluctuation between the hourly recordings. In the end, we did not find statistical significance in a relationship between temperature and time, though we did see one between sound level and time. Though no strong correlation was detected, we could somewhat predict what the sound level would look like over time. With these things in mind, we conclude that the best study location is not dependent on the time of day. There is no consistent correlation of temperature or sound level with the time of day, and there is no strong linear relationship between these variables. This means that the best study location on campus does not depend on the time of day, because these factors do not change meaningfully with it.

DOE Data Analysis

By Aaron Bashore

Introduction

In this DOE analysis, I hope to assist the reader in interpreting the findings from this experiment and in drawing conclusions about the ideal place on campus for them to munch and crunch. Raw data by itself can be extremely confusing, so presenting the data in a DOE analysis highlights the connections that were discovered and leaves the reader with the right information to make an informed decision. The DOE analysis shows how our output variables (temperature, foot traffic, and noise level) change as we change the levels of our factors (location, time of day, and day of the week). Also, through interaction plots, DOE analysis can highlight the connection between two particular factors. I did this type of analysis because I believe it presents the evidence clearly and intuitively, without confusing calculations. It is a very visual form of data presentation, so the emphasis is on the presentation rather than the behind-the-scenes statistics. Overall, this DOE analysis displays the connections between our chosen factors and output variables and allows us to draw a conclusion about the best place on campus to munch and crunch.

Methods

Before we could begin the DOE analysis, we formatted our raw data so that we could easily reference it from other sheets in our Excel document. We found the easiest way to do this was to format each sheet identically, so that a given cell held the same kind of information on every sheet. By formatting our sheets in this manner, we were able to create summary tables easily by changing only the sheet reference in the formula. The summary table of averages for sound level is shown below in Table 1.

Table 1: Summary Table of Sound Level from Raw Data (units of dB).

From this summary table, I used the AVERAGE and STDEV.S Excel functions to calculate the mean and standard deviation for each column of data. In addition, I formatted the calculations table identically to the summary table, so that I was able to drag the functions across the rows and efficiently create the calculations table. The data from the calculations table, in conjunction with the baseline data, was used to create the response distribution plots.

Table 2: Calculations Table of Sound Level from Raw Data (units of dB).
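
For readers working outside Excel, the per-column mean and sample standard deviation can be sketched in Python. `statistics.mean` and `statistics.stdev` correspond to AVERAGE and STDEV.S (both use the n - 1 sample formula for the standard deviation). The function name and the readings are placeholders made up for illustration, not values from our data.

```python
from statistics import mean, stdev

def summarize_columns(table):
    """Return {column name: (mean, sample standard deviation)} for each column."""
    return {name: (mean(values), stdev(values)) for name, values in table.items()}

# Placeholder readings in dB, invented for illustration only.
example = {
    "U, WD, M": [60.0, 62.0, 64.0],
    "Lo, WE, A": [50.0, 51.0, 52.0],
}
summary = summarize_columns(example)
for column, (avg, sd) in summary.items():
    print(f"{column}: mean={avg:.1f} dB, stdev={sd:.1f} dB")
```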

To create the sound level versus time graphs, I used nine sets of raw data that were all collected on the same day between 9:00 and 10:30 AM. Because we collected our data over multiple days, I decided that a subset collected over a single span of time would more accurately show any potential drift in our measurements. I created the sound level versus time plots by averaging the 300 sound level readings in each set of data and taking this average as one point on the plot.

Then, to create the response vs. factor level plots, I created a factor level response table that averaged the data for each factor level. For example, to create the Weekday data, I took the average and standard deviation of each column of “Weekday” data from Table 1. I did this for each response variable and placed all the “Weekday” data in the factor level response table below. Then I repeated this process for each factor level.

Table 3: Factor Level Response Table

From Table 3, I was able to easily create response versus factor level bar graphs to show the difference in average response between factor levels. I then used Table 3 to create the DOE means plots by using all of the mean data from one row.

Finally, to create the interaction plots, I chose interactions that would be important to show and I created interaction summary tables for each one. To create these tables, I took the averages of each corresponding factor level combination from Table 1. For example, to create the interaction table between the day of the week and location, I averaged the two columns corresponding to “Weekday” and the two columns corresponding to “Weekend” for each location. This resulting table is shown below.

Table 4: Interaction Summary Table for the day of the week and location.

I repeated this process for the time of the day and location interaction, and the time of the day and day of the week interaction. I showed these interactions for both sound level and temperature response variables.
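
The averaging behind the factor level response table and the interaction summary tables can be sketched in Python. This is only an illustration of the grouping idea: the `level_means` helper, the dictionary layout, and the response values are hypothetical and are not our measurements.

```python
from collections import defaultdict
from statistics import mean

def level_means(runs, factors):
    """Average the response over every run that shares the given factor levels.

    One factor gives main-effect means (as in a factor level response table);
    two factors give the cell means used for an interaction plot.
    """
    groups = defaultdict(list)
    for run in runs:
        key = tuple(run[f] for f in factors)
        groups[key].append(run["response"])
    return {key: mean(values) for key, values in groups.items()}

# Placeholder sound levels in dB, invented for illustration only.
runs = [
    {"location": "Library", "day": "Weekday", "time": "Morning", "response": 40.0},
    {"location": "Library", "day": "Weekday", "time": "Afternoon", "response": 42.0},
    {"location": "Library", "day": "Weekend", "time": "Morning", "response": 38.0},
    {"location": "Library", "day": "Weekend", "time": "Afternoon", "response": 36.0},
    {"location": "Union", "day": "Weekday", "time": "Morning", "response": 60.0},
    {"location": "Union", "day": "Weekday", "time": "Afternoon", "response": 64.0},
    {"location": "Union", "day": "Weekend", "time": "Morning", "response": 62.0},
    {"location": "Union", "day": "Weekend", "time": "Afternoon", "response": 66.0},
]

by_location = level_means(runs, ["location"])                  # main effect
by_location_and_day = level_means(runs, ["location", "day"])   # interaction cells
```

If the lines connecting the two-factor cell means are not parallel when plotted, that suggests an interaction between the two factors.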

Results

A) Response Distribution Scatterplots

In part A of the DOE analysis, I analyzed the distribution of our response variables from trial to trial. Figures 8-10 show the variation in responses we recorded over the course of the experiment. In the plots, Trials 1 to 4 were taken in the Union, Trials 5 to 8 in Lottie, and Trials 9 to 12 in the Library. These plots highlight the consistency between our pre- and post-baselines, which have nearly identical responses and standard deviations. The sound level response distribution plot also shows the differences between the three locations: each location grouping has a distinct sound range, with the Library readings much lower.

Figure 8: Response Distribution Plot of Temperature

B) Response vs. Time Scatterplots

In part B of the DOE analysis, I looked at the response vs. time scatterplots for one section of our data. I chose this portion because it consists of nine consecutive runs at one location, which let me track any drift in our data over time. As can be seen in Figures 11-13 below, there is no noticeable drift over time, with all of our data staying relatively consistent except for a couple of outliers.

Figure 11: Sound Level vs. Time Scatterplot for Library

C) Response vs. Factor Levels

In part C of the DOE analysis, we can see the connections between the factor levels and each response variable. Figures 14-22 show how each factor level affects the mean and the standard deviation of each response variable. One insight from Figure 16 is that the Library is much quieter than both the Union and Lottie. This insight is not especially surprising, but the bar graph helps visualize the degree of difference between the factor levels. An additional insight from Figure 19 is that, on average, Lottie is nearly 2 degrees cooler than the Library, so if you are one who enjoys munching and crunching in the warmth, you may want to stay away from Lottie.

Figure 14: Sound Level with Varying Days

D) DOE Means Plot

In part D of the DOE analysis, we looked at the DOE means plot for each response variable. These plots, shown in Figures 23-25, present the same information as the bar graphs in section C, but display the information for each factor as connected lines instead of bars. In these plots, each line represents one factor, and each point on the line represents a different factor level. From these graphs, we can determine the most important factor as well as the best setting for each response variable. The most important factor is the one whose line has the steepest slope, and the best setting is the factor level that is closest to the ideal. For sound level, the most important factor is definitely the location, and the best setting is the Library. For temperature, the most important factor is also the location; the best setting is a bit subjective, as people prefer different temperatures, but assuming people want it as warm as possible, the best setting is the Library. Finally, for foot traffic, all the factors have nearly the same importance, but the day of the week is slightly more important than the other two. The best setting for foot traffic is also very close, but weekday is the best setting.

Figure 23: Sound Level DOE Means Plot

E) Interaction Plots

In part E of the DOE analysis, we look at specific interactions between two factors. In our data, the most interesting interaction effects were between Location and another factor, so Figures 26-29 show the interactions between Location and another factor, as seen through the response variable mean. By presenting the data in this manner, we can isolate two factors and see whether one factor changes the effect of the other. One particular interaction is between location and time of day on the sound level. We are not able to say flat out that it is quieter to study in the morning than in the afternoon, as it is actually quieter to study in the afternoon in the Library. Another interesting interaction is between location and time of day as it affects temperature. Both the Union and Lottie are cooler in the afternoon; however, the Library actually warms up in the afternoon. These interactions are important to consider, as they provide valuable evidence that is missed in the DOE means plots.

Figure 26: Interaction between Location and Day of the Week on Sound Level

Conclusions:

This DOE analysis highlights some of the connections between our different factors and response variables, as well as the consistency in our measurements. From the response distribution plots, we can confirm consistency in our baseline measurements and conclude that there was no noticeable drift in our data. This is important because, had there been significant drift, we might have had to re-collect some of our data or been unable to draw meaningful conclusions from it. From the DOE analysis, we are also able to give our reader a clear answer regarding the sound level of the various locations on campus. We can conclude that, overall, the Library is the quietest location, with Lottie and the Union close together but the Union slightly louder. Our findings partly match our hypothesis, as we believed that the Union in the afternoon on a weekday would be the loudest. From our data, we can conclude that the Union in the afternoon is the loudest, but it appears to be louder on the weekend than on weekdays. One potential weakness in this analysis is that it only highlights the averages of our data. Showing only the average does not allow the reader to understand the complete context of the data, especially when some data collections have a large standard deviation. A good extension of this analysis would be to look at the peaks of each data collection, as these peaks can be extremely valuable for readers when they are choosing a study location.

Statistical Analysis

By Ethan Barnes

Introduction

To better understand the significance of the location, time of day, and day of the week on our response variables, I chose to use analysis of variance (ANOVA). ANOVA in Excel was used to determine whether there was a statistically significant difference in the resulting temperature, foot traffic, and noise level based on the input factors. From there, because our sample size was less than 30, each response variable was run independently through two-tailed t-tests to identify which combinations of factors (if any) are significantly different. Afterward, confidence intervals were used to display the certainty of the collected data’s averages.

Methods

To obtain the ANOVA for every response variable, the raw data was compiled into organized columns, as seen in Table 5, so that Excel could determine whether the results were significantly different. A p-value less than alpha = 0.05 was taken to indicate a significant difference between columns of data in our experiment. If ANOVA indicated a significant difference across the data for a response variable, I then performed individual t-tests for every combination of input factors and levels (Tables 5-7). These indicate, for every pair of columns, which averages were significantly different.

Note: Each factor and level is abbreviated for each column. 'U, WD, M' represents Union, Weekday, Morning. 'Lo, WE, A' represents Lottie, Weekend, Afternoon, and so on...

Table 5: Temperature data collected
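
As a rough illustration of what the single-factor ANOVA computes from columns like these, the F statistic can be sketched in Python. This sketch stops at the F statistic and its degrees of freedom; the p-value that Excel reports also needs the F distribution, which is not in the Python standard library. The function name and the sample groups are hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA.

    `groups` is a list of columns, one list of readings per factor combination.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the column means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the readings around their own column mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up columns for illustration only.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

A large F means the column averages differ by more than the scatter within each column would explain.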

To calculate the p-values, the T.TEST function was used in Excel. I used a two-tailed test because I wanted to look for a difference in either direction. I also used a type 3 test (two-sample, unequal variance) due to the lack of correlation between the data sets analyzed. Any resulting p-value less than alpha = 0.05 was marked as significantly different and highlighted with a green box.
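
A two-tailed, unequal-variance t-test of this kind (often called Welch's t-test) can be sketched in Python. This is a sketch under assumptions rather than a reproduction of Excel's internals: the degrees of freedom come from the Welch-Satterthwaite formula, and the p-value is found by numerically integrating the Student's t density. The function names and sample values are hypothetical.

```python
import math
from statistics import mean, variance

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value of Student's t via Simpson's rule (steps must be even)."""
    t = abs(t)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda v: c * (1 + v * v / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1 - 2 * (total * h / 3))

def welch_t_test(a, b):
    """Return (t, df, p) for two independent samples with unequal variances."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, t_two_tailed_p(t, df)

# Made-up columns for illustration only.
t_stat, dof, p_value = welch_t_test([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
significant = p_value < 0.05
```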

From there, I calculated the confidence intervals to show the range of uncertainty in the averages of our collected data. To do so, I assumed that our data is normally distributed. Note also that although all the raw data was analyzed for ANOVA, for the confidence intervals the sound level data was averaged for each of the 3 columns of data across 3 trials, resulting in a uniform sample size across all three response variables. Knowing that n = 9, I used the Excel function T.INV.2T, which takes the selected alpha value and the degrees of freedom, as shown in Figure 30. The value it returns was multiplied by the Standard Error of the Mean (SEM) to calculate the swing term, also shown in Figure 30. The SEM was found using the standard deviation and sample size (n).

Figure 30: Equations used to determine confidence interval
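
The confidence interval calculation can be sketched in Python as a check on the spreadsheet. This is a minimal sketch assuming SEM = s / sqrt(n), n - 1 degrees of freedom, and swing = t_critical * SEM; the critical value that T.INV.2T would supply is found here by bisection on a numerically integrated t distribution. The function names are hypothetical, and the sample values are made up.

```python
import math
from statistics import mean, stdev

def t_two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value of Student's t via Simpson's rule (steps must be even)."""
    t = abs(t)
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    pdf = lambda v: c * (1 + v * v / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1 - 2 * (total * h / 3))

def t_inv_2t(alpha, df):
    """Critical t whose two-tailed p-value equals alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while t_two_tailed_p(hi, df) > alpha:   # widen until the root is bracketed
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_two_tailed_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def confidence_interval(values, alpha=0.05):
    """Return (lower, upper) bounds: mean -/+ t_critical * SEM."""
    n = len(values)
    sem = stdev(values) / math.sqrt(n)       # standard error of the mean
    swing = t_inv_2t(alpha, n - 1) * sem     # swing term
    m = mean(values)
    return m - swing, m + swing

# Made-up sample with n = 9, for illustration only.
lower, upper = confidence_interval([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

With n = 9 the critical value is about 2.306, so the swing term is roughly 2.3 times the SEM.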

Lastly, a power analysis was conducted to determine the sample size that would be required for each combination that did not yield significantly different results. To find these values, an online sample size power analysis calculator by DSS Research was used. Given the average standard deviation and the desired alpha error level (95% confidence), the calculator returned the sample size required to achieve a significant difference.
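
To give a sense of how such a sample size depends on the inputs, here is a rough Python sketch using the common normal-approximation formula for comparing two means, n = 2 * (sigma * (z_alpha/2 + z_power) / delta)^2 per group. This is an assumption for illustration: the online calculator may use a different formula, and the function name, parameter names, and default values below are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a mean difference `delta`.

    sigma: assumed common standard deviation of the two groups
    delta: difference between the two averages that should be detectable
    alpha: two-sided significance level
    power: desired probability of detecting the difference
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_power) / delta) ** 2)

# Illustrative inputs only: a difference equal to one standard deviation.
n_needed = sample_size_per_group(sigma=1.0, delta=1.0)
```

The required sample size grows with the square of sigma / delta, which is why a response with a large standard deviation relative to the differences between averages needs many more observations.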

Results

After running through the methods described above, the results from the analysis can be compiled and conclusions can be drawn from the calculations. The single-factor ANOVA described earlier showed that there was indeed a significant difference between the different factors for sound, temperature, and foot traffic, as shown in Figures 32-34.

Figure 32: Single-factor ANOVA for sound levels

Figure 35: Sound level p-values of every combination of input factor with significantly different data sets highlighted green (p < 0.05)

Based on the individual t-tests, with the majority of comparisons in Figures 35-36 showing significant differences, our selected input variables have a large effect on sound and temperature levels. As for foot traffic (Figure 37), only around 40% of the combinations showed a significant difference, indicating that the input factors had a less pronounced effect.

Figure 38: P-values for afternoon sound levels across three locations

Now that individual p-values have been determined, comparisons can be made based on the significance of different data sets. For sound, I chose to highlight Figure 38 because of its particularly distinct set of average dB values for the afternoon. From this graph, it can be concluded that there is indeed a significant difference in the sound levels across the three locations represented on the y-axis. It can also be concluded that the day of the week caused a significant change in the Union data (p=4.20E-05) as well as the Lottie data (p=1.12E-04), shown by the differing blue and orange bar heights. The day of the week did not significantly impact the sound level results in the Library (p=0.226), where the observations were similar. This graph also shows that the Union had a significantly greater sound level in the afternoon, while the Library was the quietest.

Figure 39: P-values for Union temperature comparing morning and afternoon

As for Figure 39, I focused on the Union temperature when comparing morning to afternoon results. For both the morning and afternoon values, it can be concluded that data recorded on the weekend was significantly different from data collected on weekdays, with both p-values (shown at the bottom of the bar graphs) less than alpha. More importantly, Figure 39 shows that the afternoon temperatures were significantly greater than the morning temperatures in both cases. Similar results were found for the other two locations, which are not shown in this graph.

Figure 40: P-values for afternoon foot traffic across three different locations

In terms of foot traffic, I noticed that this response variable had the greatest variance. The individual p-values gave little indication of significant differences across the collected data, and in many cases the error bars were larger than the average values themselves. In short, this can be attributed to the large standard deviation and resulting swing term, which produce wide 95% confidence intervals. With that said, Figure 40 displays the foot traffic during the afternoon across the three locations for both the weekday and the weekend. Based on this graph, it can be concluded that Lottie is the only location in which the foot traffic remains relatively the same (p=0.704) regardless of the day of the week. It can also be concluded that the Union on the weekend experienced a significantly greater mean foot traffic than any other location. Because of the large uncertainty and error bars, foot traffic was the variable for which I ran a power analysis. The online sample size power analysis calculator produced the numbers seen below in Figure 41. The yellow cells show the n-values found by the power analysis using β=0.75 along with the sample averages and standard deviations. The boxes marked N/A represent data for which the average or standard deviation was either identical or equal to zero. On average, a sample size of at least 178 would be required in order to show a significant difference between the different factors.

Figure 41: Example of the calculated sample size required to achieve a significant difference for foot traffic (in yellow)

Conclusions:

Because our goal was to determine which food-accessible location on campus is also the optimal place to study, we had to validate that our input factors did in fact have an effect that contributes toward a final decision. Based on the ANOVA results and p-values above, it can be concluded that the selected inputs have a significant impact on the sound level and temperature data collected. As for foot traffic, I would conclude that further data collection is required. The low confidence shows a potential weakness in our DOE. Changes to prevent this in the future could include counting the foot traffic in and out of each location rather than within a radius, collecting a greater sample size to compensate for potential variation, and selecting different input factors altogether to better represent changes in location-based population density (which acts as a distraction while studying).

Based on the observations above, it can be concluded that the Union experienced significantly louder decibel levels than the other two locations, while the Library had significantly lower decibel levels. For temperature, it can be concluded that the afternoon mean temperatures were significantly warmer than those of the morning. Lastly, for foot traffic, it can be concluded that the Union and Lottie experienced significantly greater levels than the Library. With that said, the conclusions based on foot traffic in our study are less meaningful due to the high error and wide confidence intervals. This is potentially due to a poor experimental design, in which foot traffic was counted within a radius of 15 feet. Future studies may want to observe the number of people entering or exiting the location at different exits, as well as the total number of people present at a given time.
