The scatterplot is one of the most widely used graphs in science articles because of its proven usefulness for finding correlations (SCATTER BRAINED, 2018). Residual plots can be used to check whether the linear model fitted to a scatterplot is valid. This report uses these tools to investigate whether rainfall affects the tonnes of sugar cane and sugar produced, and whether there is any correlation at all. Appropriate data from Government websites will be used to make graphs, which will then be analysed. The mill that will be examined is Rocky Point (Douglas Shire).
The weather station chosen for rainfall data was 031062 Whyanbeel Valley QLD, which is 6.2 km from Rocky Point, making it the closest weather station and the best one available. Even so, it is still a fair distance from the town, which makes a strong correlation between rainfall and tonnes of sugar cane and sugar produced less likely. Fifteen data pairs were gathered which, according to Venugopal. G, is enough if there is a strong relationship between X and Y. The same source goes on to say that 40 or more data pairs should be collected for a precise estimate. Although 15 data pairs were gathered, the sites the data came from had gaps. It is unknown why, but this led to some data pairs being discarded because rainfall, tonnes of sugar, or both were missing.
2.1 Observations and Assumptions
Before data was collected and analysed, observations and assumptions were made.
Observations that were made:
- There are many weather stations around Rocky Point. 031062 Whyanbeel Valley QLD was chosen as it was the closest out of all the weather stations to the town of Rocky Point.
- The weather station is not in the town. This will somewhat affect the rainfall readings, as the chosen station measures rainfall 6.2 km away from one side of the town. The town covers 20.2 km², so if there is sugar cane on the far side of the town, which there presumably is, the rainfall data will be even less representative of conditions there.
- Other factors will also affect sugar cane and sugar production. These cannot be controlled, but they could be looked into to identify what else affects production.
- For some months, no monthly rainfall total was given. This was found to be because, usually, one day of rainfall data was missing. This was fixed by calculating the monthly total from the daily readings.
Assumptions that were made:
- I will assume the rainfall at Rocky Point is the same as, or similar to, the rainfall recorded at the weather station 6.2 km away.
- I will assume nothing else, such as fertilizer or crop rotation, is affecting the production of sugar cane and sugar. This is because I will assume the farmers keep these constant.
- I will assume that if temperature affects the production of sugar cane and sugar, the temperature change is linked to the rainfall, so it is ultimately rainfall affecting the production of both.
- For the months without a rainfall total, usually because one day of data was missing, the available daily rainfalls were added up. It is assumed that the few days not included in the total will not change the outcome in any way, because most days with no rainfall data were surrounded by days that recorded 0 mm of rainfall.
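The treatment of months with a missing daily reading can be sketched in Python as below. This is only an illustration of the approach described above: the function names and the example month are hypothetical and do not come from the Bureau of Meteorology data.

```python
from typing import List, Optional, Sequence


def monthly_total(daily_mm: Sequence[Optional[float]]) -> float:
    """Add up the daily rainfall readings for a month, skipping missing days (None)."""
    return sum(reading for reading in daily_mm if reading is not None)


def missing_days(daily_mm: Sequence[Optional[float]]) -> List[int]:
    """Return the 1-based day numbers that have no reading."""
    return [day for day, reading in enumerate(daily_mm, start=1) if reading is None]


# Hypothetical readings (mm) for part of a month; day 3 has no data
# and sits between days that recorded 0 mm.
example_month = [0.0, 0.0, None, 0.0, 12.4, 30.2, 5.0]

total = monthly_total(example_month)
gaps = missing_days(example_month)
print(f"Total from available days: {total:.1f} mm, missing days: {gaps}")
```

Listing the missing days alongside the total makes it easy to check the assumption that the gaps fall next to dry days.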
In order to investigate how rainfall affects the production of sugar cane and sugar, many steps had to be taken.
Firstly, data was collected. Data on the chosen sugar cane variables was obtained from the CANEGROWERS annual reports website. The rainfall data was collected from the Bureau of Meteorology (BOM) Climate Data Online, an online compilation of weather station data.
Scatterplots were the chosen graph used to represent the collected data. Rainfall was first graphed against sugar cane production then rainfall was graphed against sugar production.
A regression equation was found once a roughly linear association had been identified.
A regression line was added to the scatterplot by selecting the trendline function in the Excel spreadsheet. A trendline is the line of best fit, calculated so that it runs through the middle of the trend in the data. The coefficient of determination was also displayed on the scatterplot through the more trendline options menu.
The R2 value shows how strong the linear relationship between two variables is, meaning how well one variable can predict the other linearly. R2 is also known as the coefficient of determination, which is the correlation coefficient squared. Residual values were then calculated as the difference between each observed value and the value predicted by the regression equation.
Residual plots were made for rainfall against sugar cane produced and for rainfall against sugar produced.
An analysis was performed on the statistical information and conclusions were made.
2.3 Use of Technology
The spreadsheet program, Excel, was used during this investigation to make scatterplots, find the regression equations, the coefficient of determination and to calculate residuals. Excel was also used to find other measures of the data including means, standard deviations and the correlation coefficient. From these a least-squares regression equation was made.
3 Developing a Solution
The data for rainfall, sugar cane production and sugar production can be found in the Appendix.
Each year a different amount of sugar cane was harvested. This could have been a result of farmers not being able to harvest their crops because there was too much rain. Since the reason for the fluctuation is unknown, Hectares Harvested and Rainfall were graphed on a scatterplot to see if there was any correlation.
- Graph 1: The Effect Rainfall Has on Hectares Harvested 2004-2015
Many data points were left out of this graph because they appeared to be outliers. While removing this much data may not seem statistically sound, the outlying points can be attributed to error from the weather station being much closer to one side of the town than the other. With these points removed, Graph 1 shows a much higher correlation between Rainfall and Hectares Harvested, with an r2 of 0.4572, meaning that about 46% (rounded) of the variation in Hectares Harvested is explained by Rainfall. A moderate to strong positive correlation is shown in Graph 1.
Although much of the variation in Hectares Harvested per year can be put down to rainfall, not all of it can. This is why, when examining how rainfall affects the production of Sugar Cane and Sugar, the Hectares Harvested were set to the same number for every year. The yearly average of Hectares Harvested was found and rounded to the nearest 500, giving 3,500. The corresponding tonnes of Sugar Cane and Sugar produced each year were then rescaled using the equation: tonnes of Sugar Cane (or Sugar) ÷ hectares harvested × 3,500 (refer to Appendix 3). Doing this meant that a fairer, like-for-like representation of the effect rainfall has on tonnes of Sugar Cane and Sugar produced could be examined.
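The rescaling step can be sketched in Python as follows. The function names are hypothetical, and the hectares listed are only the values plotted in Graph 1 (the years kept after outliers were removed), so the average over the full data set may differ slightly. The tonnage in the usage line is a made-up figure for illustration.

```python
from statistics import mean


def reference_area(hectares, step=500):
    """Average hectares harvested, rounded to the nearest multiple of `step`."""
    return round(mean(hectares) / step) * step


def scale_to_reference(tonnes, hectares_harvested, reference=3500):
    """Rescale a year's tonnage to what it would be on `reference` hectares."""
    return tonnes / hectares_harvested * reference


# Hectares harvested as plotted in Graph 1 (outlier years excluded).
hectares_plotted = [4194, 3516, 4193, 3282, 3308, 3764, 3586, 3495, 3568, 3754]

area = reference_area(hectares_plotted)
print(f"Reference area: {area} ha")

# Hypothetical year: 40,000 t of sugar from 4,000 ha harvested.
print(f"Rescaled tonnage: {scale_to_reference(40000, 4000, area):.1f} t")
```

Dividing by hectares first gives a yield per hectare, so multiplying by the common area puts every year on the same footing regardless of how much land was actually harvested.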
- Graph 2: The Effect Rainfall Has on Sugar Produced 2003-2018
Graph 2 shows a weak to moderate positive linear association between rainfall and sugar produced. This led to a linear regression line being fitted by developing a linear regression equation using the least-squares regression method.
The formula below gives the least-squares regression line:
y = a + bx
where b = r × sy/sx and a = ȳ - b x̄, r is Pearson's correlation coefficient, sx and sy are the sample standard deviations, and x̄ and ȳ are the sample means.
Using the Excel function CORREL, r was found to be 0.7092249291.
The Excel function AVERAGE was used to find that x̄ = 3,269.2 and ȳ = 34,385.
The standard deviations were found using the Excel function STDEV.S: sy = 3,186.439865 and sx = 740.3759.
Refer to Appendix 4 and 5 for the spreadsheet functions used.
b = r × sy/sx
= 0.7092249291 × 3,186.439865 / 740.3759
= 3.052371893
a = ȳ - b x̄
= 34,385 - 3.052371893 × 3,269.2
= 24,406.18581
The linear regression equation for the data is given by:
y = 24406.18581 + 3.052371893x
y = 24406.1858 + 3.0524x (rounded to four decimal places)
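As a cross-check on the hand calculation, the same least-squares steps can be sketched in Python using only the standard library. The data are the eleven rainfall and sugar pairs plotted for 2004-2014, with tonnes rounded to two decimal places; the function names are hypothetical. Because the hand calculation used rounded means, small differences in the later decimal places are to be expected.

```python
from math import sqrt
from statistics import mean, stdev

# Rainfall (mm) and rescaled sugar produced (tonnes), 2004-2014 chart data.
rainfall_mm = [3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893,
               4279.9, 4094.8, 2898.8, 2858.8, 3425]
sugar_tonnes = [37263.23, 30706.63, 41923.21, 31238.73, 31741.23, 32781.35,
                34475.71, 34040.57, 35502.72, 33271.58, 35289.82]


def pearson_r(xs, ys):
    """Pearson's correlation coefficient (what Excel's CORREL computes)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Return (a, b, r) for y = a + bx, with b = r * sy / sx and a = y_bar - b * x_bar."""
    r = pearson_r(xs, ys)
    b = r * stdev(ys) / stdev(xs)  # stdev is the sample standard deviation, like STDEV.S
    a = mean(ys) - b * mean(xs)
    return a, b, r


a, b, r = least_squares_line(rainfall_mm, sugar_tonnes)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"y = {a:.4f} + {b:.4f}x")
```

Running this should give a slope, intercept and r close to the values worked out above.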
Using the trendline function in Excel, the regression line was added to the scatterplot, as shown in Graph 4. Excel also calculated the regression equation when the display equation option was selected.
According to the r2 value of 0.503, calculated by squaring 0.7092249291, the data plotted had a moderate positive linear relationship.
- Graph 3: The Effect Rainfall Has on Sugar Cane Produced 2003-2018 with regression line
A correlation coefficient of 1 indicates a perfect positive linear association. Graph 3 shows sugar produced increasing as rainfall increases, with the trend line showing a weak positive correlation. The r2 value for Graph 3 is very low at 0.2029, meaning that about 20% of the variation in Sugar Produced is explained by Rainfall. The r2 value may be improved if the outlier of 38634.6 (rounded) tonnes of sugar produced from around 2189.8 mm (rounded) of rainfall is removed.
One factor that strongly affected the correlation shown in the graph is that the weather station is 6.2 km away, meaning the rainfall data was not collected in the town where the sugar cane is growing. The graph already shows a positive correlation, so it can be assumed that a much stronger correlation would occur if the weather data were collected in the town. It can also be assumed that removing the outlier of 38634.6 tonnes of sugar produced from around 2189.8 mm of rainfall would strengthen the correlation.
- Graph 4: The Effect Rainfall Has on Sugar Produced 2004-2014
Graph 4 shows a moderate positive correlation between the amount of Rainfall and Sugar Produced, with an r2 value of 0.503.
- Graph 5: The Effect Rainfall Has on Sugar Cane Produced 2003-2018
Graph 5 shows a weak positive correlation with an r2 value of 0.1634. This means that about 16% of the variation in Sugar Cane Produced is explained by Rainfall, with the remaining 84% due to other factors. We know this is not due to Hectares Harvested, since that was already controlled for. There do not seem to be any drastic outliers that would change the correlation.
4 Evaluation to Verify Results
Residual Plot 1: Residual Plot for The Effect Rainfall Has on Sugar Produced Without Outlier 2004-2014
Residual plot 1 shows the residuals for Graph 4. The points are scattered without a pattern, which indicates that a linear model is appropriate for the relationship between Rainfall and Sugar.
Residual Plot 2: Residual Plot for The Effect Rainfall Has on Sugar Cane Produced 2003-2018
Residual plot 2 shows the residuals for Graph 5. The points are scattered without a pattern, which indicates that a linear model is appropriate for the relationship between Rainfall and Sugar Cane.
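The residuals behind a plot like Residual Plot 1 can be sketched in Python as below, taking each residual as the observed value minus the value predicted by the least-squares line. The data are the eleven 2004-2014 rainfall and sugar pairs (tonnes rounded to two decimal places) and the function names are hypothetical. For a least-squares line with an intercept the residuals should sum to approximately zero, which is a useful sanity check before looking for patterns.

```python
from statistics import mean

# Rainfall (mm) and rescaled sugar produced (tonnes), 2004-2014 chart data.
rainfall_mm = [3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893,
               4279.9, 4094.8, 2898.8, 2858.8, 3425]
sugar_tonnes = [37263.23, 30706.63, 41923.21, 31238.73, 31741.23, 32781.35,
                34475.71, 34040.57, 35502.72, 33271.58, 35289.82]


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + bx."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys):
    """Observed y minus predicted y for every data pair."""
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


res = residuals(rainfall_mm, sugar_tonnes)
for x, e in zip(rainfall_mm, res):
    print(f"{x:8.1f} mm   residual {e:10.2f} t")
print(f"Sum of residuals: {sum(res):.6f}")
```

Plotting these residuals against rainfall and seeing no curve or funnel shape is what supports the use of a straight-line model.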
4.1 Improving the Model
To improve the study, a different cane farming town could be chosen that has a weather station either in the town or just outside it. This would give more accurate rainfall data to compare with the production of Sugar Cane and Sugar, and may lead to a higher correlation.
4.2 Strengths and Limitations
A strength of this investigation is that the data used was sourced from Government websites.
A limitation of this investigation was the small sample size. This could not be helped, because only a limited amount of data could be found on the production of Sugar Cane and Sugar, and some Rainfall data was missing.
Other limitations were that the weather station which collected the Rainfall data was 6.2 km away from the town of Rocky Point, and that it was not known how often each cane farm is rotated, what the ground nutrients of each cane farm are, or where exactly each cane farm is located in the town and how much Sugar Cane and Sugar each one produced.
Appendix 1: Raw Data
Appendix 2: Months Missing From Data and Their Corresponding Missing Rainfall Data
Appendix 3: Converting the Hectares Harvested to The Same Amount
Appendix 4: Finding the Average and Standard Deviation
Appendix 5: The Average and Standard Deviation for Sugar, Sugar Cane and Rainfall
7 Reference List
- CANEGROWERS. 2019. CANEGROWERS Annual Reports. Available at: http://www.canegrowers.com.au/page/about/publications
- Australian Government, Bureau of Meteorology. 2019. Climate Data Online. Available at: http://www.bom.gov.au/climate/data/
- SCATTER BRAINED. 2018. A brief history of the scatter plot – data visualization’s greatest invention. By Dan Kopf. March 31. Available at: https://qz.com/1235712/the-origins-of-the-scatter-plot-data-visualizations-greatest-invention/
- OriginLab. 2019. Interpreting Regression Results. 15.4.4 Residual Plot Analysis. Available at: https://www.originlab.com/doc/Origin-Help/Residual-Plot-Analysis
- ISIXSIGMA. 2010. Minimum Data Points Required for Regression. By Venugopal. G. Available at: https://www.isixsigma.com/topic/minimum-data-points-required-for-regression/
The Effect Rainfall Has on Sugar Cane Produced 2003-2018
Rainfall (mm): 2189.8, 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4094.8, 2898.8, 2858.8, 3425, 3259.6, 2827.2, 4405.6
Sugar Cane Produced (tonnes): 296755.57, 250362.66, 256431.74, 318587.65, 301650.52, 243461.61, 228573.99, 261249.72, 232504.15, 227413.68, 256945.44, 357861.48, 345088.10, 402276.33
Residual Plot 1
Rainfall (mm): 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4279.9, 4094.8, 2898.8, 2858.8, 3425
Residual (tonnes): 1406.49, -17.69, 4579.30, -2189.47, -282.22, -455.15, -2994.27, -2864.39, 2248.51, 139.47, 429.41
Residual Plot 2
Rainfall (mm): 2189.8, 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4279.9, 4094.8, 2898.8, 2858.8, 3425, 3259.6, 2827.2, 4405.6
Residual (tonnes): 38594.73, -44831.76, 1111.97, 11839.20, 25323.87, -21951.33, -46263.35, -39851.98, -42088.48, -42470.74, -46612.60, -30508.36, 74330.17, 71811.22, 91567.46
The Effect Rainfall Has on Hectares Harvested 2004-2015
Rainfall (mm): 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4094.8, 2898.8, 2858.8, 3259.6
Hectares Harvested: 4194, 3516, 4193, 3282, 3308, 3764, 3586, 3495, 3568, 3754
The Effect Rainfall Has on Sugar Produced 2003-2018
Rainfall (mm): 2189.8, 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4279.9, 4094.8, 2898.8, 2858.8, 3425, 4405.6
Sugar Produced (tonnes): 38634.55, 37263.23, 30706.63, 41923.21, 31238.73, 31741.23, 32781.35, 34475.71, 34040.57, 35502.72, 33271.58, 35289.82, 0
The Effect Rainfall Has on Sugar Produced 2003-2014
Rainfall (mm): 2189.8, 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4279.9, 4094.8, 2898.8, 2858.8, 3425
Sugar Produced (tonnes): 38634.55, 37263.23, 30706.63, 41923.21, 31238.73, 31741.23, 32781.35, 34475.71, 34040.57, 35502.72, 33271.58, 35289.82
The Effect Rainfall Has on Sugar Produced 2004-2014
Rainfall (mm): 3751.4, 2070, 4238.6, 2955.8, 2495.6, 2893, 4279.9, 4094.8, 2898.8, 2858.8, 3425
Sugar Produced (tonnes): 37263.23, 30706.63, 41923.21, 31238.73, 31741.23, 32781.35, 34475.71, 34040.57, 35502.72, 33271.58, 35289.82