Thanks a lot for all the feedback and discussion regarding the final third analysis. Here are a few follow-ups on the feedback.
Feedback: The correlation between goals scored and passes in the final third is driven by the top 5 goal scoring clubs. If they are removed from the data set, the correlation might be weak.
This was brought up by @WillTGM & @Chumolo
- The correlation is indeed much weaker if the top 5 clubs by goals scored are removed. However, 5 teams make up 25% of the sample. If we cherry-pick the top 5, it is not surprising that the correlation becomes much weaker.
- I ran an experiment choosing 15 clubs at random from the 20. Across several such runs, the correlation remained strong: R² varied between 0.56 and 0.87, and the regression was significant (F-test).
- On a similar note, if outliers like Liverpool and Newcastle are excluded, the correlation becomes much stronger.
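
For anyone who wants to try the random-subset check themselves, here is a minimal sketch in Python. It is not the exact script I used: the club numbers are synthetic placeholders (the real totals are not reproduced here), and the function names `r_squared`, `f_statistic` and `subsample_r2` are made up for illustration. It draws 15 of the 20 clubs at random, fits a one-predictor regression, and reports R² together with the F statistic on (1, 13) degrees of freedom.

```python
import random


def r_squared(xs, ys):
    """R^2 of a simple linear regression of ys on xs (squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def f_statistic(r2, n):
    """F statistic for a one-predictor regression, with (1, n - 2) d.o.f."""
    return r2 / (1.0 - r2) * (n - 2)


def subsample_r2(xs, ys, k, trials, rng):
    """Repeat the regression on `trials` random subsets of size k."""
    results = []
    for _ in range(trials):
        idx = rng.sample(range(len(xs)), k)  # k clubs, no repeats
        sub_x = [xs[i] for i in idx]
        sub_y = [ys[i] for i in idx]
        r2 = r_squared(sub_x, sub_y)
        results.append((r2, f_statistic(r2, k)))
    return results


# ASSUMPTION: synthetic stand-in data for 20 clubs, NOT the real season totals.
rng = random.Random(0)
passes = [rng.uniform(2500, 5500) for _ in range(20)]
goals = [0.012 * p + rng.gauss(0, 8) for p in passes]

results = subsample_r2(passes, goals, k=15, trials=10, rng=rng)
for r2, f in results:
    print(f"R^2 = {r2:.2f}, F(1, 13) = {f:.1f}")
```

Swapping the synthetic lists for the real final third completions and goals per club would reproduce the experiment. Each F value would then be compared against the F distribution with (1, 13) degrees of freedom to judge significance.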
Feedback: Significance of the regression
@rui_xu made a great point: the significance of the regression matters, and R² alone might not tell the whole story.
I ran an F-test for each of the regressions, with the following results:
- For the overall final third completions vs. total goals scored over a season analysis – the regression was significant
- For the game level final third completions vs. goals scored by Manchester City analysis, the regression is not very significant. This is most likely due to the small sample size (38 games).
- However, when I did the same analysis using data from all 380 games of last season (760 samples, one per team per game), the correlation was still weak (as observed for the 38 Man City games), but the regression was significant with the larger sample.
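
To see why a weak correlation can still give a significant regression, it helps to look at how the F statistic depends on sample size. For a regression with one predictor, F = R² / (1 - R²) × (n - 2). The sketch below uses a purely hypothetical R² of 0.05 (not a value from my analysis) and compares n = 38 with n = 760.

```python
def f_statistic(r_squared, n):
    """F statistic for a one-predictor regression, with (1, n - 2) d.o.f."""
    return r_squared / (1.0 - r_squared) * (n - 2)


# ASSUMPTION: R^2 = 0.05 is a made-up "weak correlation" used for illustration.
r2 = 0.05

f_small = f_statistic(r2, 38)   # one club's season: 38 games
f_large = f_statistic(r2, 760)  # whole league: 380 games x 2 teams

print(f"n = 38:  F(1, 36)  = {f_small:.2f}")
print(f"n = 760: F(1, 758) = {f_large:.2f}")
```

With the same R², the F statistic grows in proportion to n - 2. The 5% critical value for F with one numerator degree of freedom is roughly 4 at both sample sizes, so in this hypothetical case the 38-game regression falls short of it while the 760-sample regression clears it comfortably, even though the relationship is equally weak in both.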
Please keep the feedback coming!