en:learning:schools:s01:lecture-notes:ba-ln-07 [2015/09/22 16:22] (current) |

====== L07: Descriptive statistics ======

“I don't even know what I'm doing here.”

Chrom, Tron

==== Things we cover in this session ====
  * Describing and visualizing model results by boxplots and simple statistics

==== Things you need for this session ====
  * [[en:learning:schools:s01:worksheets:ba-ws-06-1|W06-1 Leave-one-out validation]]

==== Things to take home from this session ====
At the end of this session you should be able to
  * Calculate characteristics of the model output
  * Create boxplots
  * Interpret model results based on boxplots

===== Descriptive statistics: min/max/mean/median/sd =====
To interpret the success or the characteristics of your model, there are more measures besides the p value and R² you learned about in [[en:learning:schools:s01:lecture-notes:ba-ln-04| LN04-1 Regressions]].

The **minimum and maximum** values of a prediction indicate how well a model is able to predict extreme values (either low or high).

Comparing the mean and median values of a prediction to those of the observed values tells you whether the model generally over- or underestimates.

The **mean value** is calculated by summing up all values of the dataset of interest and dividing the sum by the number of observations. Though the mean value is widely used to characterize datasets, it has the major disadvantage of being highly affected by outliers. The **median**, in contrast, is the value located in the middle of the ordered dataset (for an even number of observations, the mean of the two middle values). Thus it is robust to outliers.

The **standard deviation (sd)** describes the spread of the data. Roughly speaking, it is the typical deviation of the values from the mean value of the distribution; formally, it is the square root of the average squared deviation from the mean.
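To make the definitions above concrete, the following minimal sketch recomputes the mean, the median and the standard deviation by hand and checks how an outlier affects them. The vectors ''clean'' and ''with_outlier'' are made-up illustration data and not part of the course data set. Note that R's ''sd()'' divides by n - 1 (sample standard deviation), which is what the sketch mimics.

<code rsplus>
# Made-up illustration data (assumption, not from the course data set)
clean        <- c(1, 2, 3, 4, 5)
with_outlier <- c(1, 2, 3, 4, 100)   # same data, but the last value is an outlier

# Mean: sum of all values divided by the number of observations
mean_by_hand <- sum(with_outlier) / length(with_outlier)

# Median: middle value of the ordered data (odd number of observations here)
median_by_hand <- sort(with_outlier)[(length(with_outlier) + 1) / 2]

# Standard deviation: square root of the summed squared deviations from the
# mean, divided by n - 1 (this is what sd() computes)
sd_by_hand <- sqrt(sum((with_outlier - mean_by_hand)^2) / (length(with_outlier) - 1))

# The outlier pulls the mean upwards, while the median does not move
mean(clean);   mean(with_outlier)
median(clean); median(with_outlier)
</code>

Comparing the two pairs of printed values shows why the median is the more robust measure when a prediction contains a few extreme values.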


==== Descriptive statistics: Do it in R ====
Luckily, as an R user you don't have to calculate these measures by hand. Applied to a numeric vector ''x'', the functions
<code rsplus>
x <- c(2, 4, 4, 5, 10)  # any numeric vector, e.g. your predicted values

mean(x)    # arithmetic mean
max(x)     # largest value
min(x)     # smallest value
median(x)  # middle value of the ordered data
sd(x)      # sample standard deviation
</code>
will do it for you!
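As a hedged sketch of how these functions can be combined to compare a prediction with the observations, the example below collects all five measures in one table. The helper name ''describe'' and the vectors ''observed'' and ''predicted'' are hypothetical and only serve as an illustration; replace them with your own model output.

<code rsplus>
# Hypothetical helper: returns the five measures as a named vector
describe <- function(x) {
  c(min = min(x), max = max(x), mean = mean(x), median = median(x), sd = sd(x))
}

# Made-up observed and predicted values (assumption, for illustration only)
observed  <- c(12.1, 15.3,  9.8, 20.4, 17.6)
predicted <- c(13.0, 14.9, 11.2, 18.1, 16.8)

# One row per dataset, one column per measure
comparison <- rbind(observed = describe(observed),
                    predicted = describe(predicted))
print(comparison)

# Negative values indicate a general underestimation, positive values a
# general overestimation by the model
mean_difference <- comparison["predicted", "mean"] - comparison["observed", "mean"]
</code>

In this made-up case the predicted minimum is higher and the predicted maximum is lower than the observed one, which would suggest that the model has trouble reproducing extreme values.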


===== Boxplots =====

A boxplot is a useful visualization of the measures shown in the section above. It is therefore often used to depict the differences between distributions, e.g. between predicted and observed values.

<html>
<a title="Jhguch at en.wikipedia [CC-BY-SA-2.5 (http://creativecommons.org/licenses/by-sa/2.5)], from Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3ABoxplot_vs_PDF.svg" target="_blank"><img width="512" alt="Boxplot vs PDF" src="//upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/512px-Boxplot_vs_PDF.svg.png"/></a>
</html>

(Chen-Pan Liao [CC_BY_SA] via wikimedia.org)

A boxplot shows several components:

  - The **box** spans the values located in the second and third quarter of the ordered data, i.e. the middle 50% of the values around the median. Its edges are the first quartile (25th percentile) and the third quartile (75th percentile).
  - The **median** is depicted by the line in the box. The whiskers and the representation of outliers show the spread of the values.
  - The **whiskers** mark the remaining values which don't fall into the box. The length of the whiskers is not standardized. Often they extend to the most extreme value that lies within 1.5 times the interquartile range (IQR) from the edge of the box.
  - The **interquartile range** is the range between the first and the third quartile, i.e. the length of the box.
  - All values which lie more than 1.5*IQR below or above the box are considered **outliers** and are usually marked by points under or over the whiskers, respectively.
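The components listed above can be sketched in R as follows. The data in ''values'', ''observed'' and ''predicted'' are made up for illustration. Be aware that ''boxplot()'' and ''boxplot.stats()'' use hinges for the box edges, which can differ slightly from the quartiles returned by ''quantile()''; for this small example both approaches flag the same outlier.

<code rsplus>
# Made-up values with one obvious outlier (assumption, for illustration only)
values <- c(2, 3, 4, 4, 5, 5, 6, 7, 8, 20)

# Box edges (first and third quartile) and interquartile range
quartiles <- quantile(values, c(0.25, 0.75))
iqr       <- IQR(values)

# Values more than 1.5 * IQR below or above the box count as outliers
lower_limit <- quartiles[1] - 1.5 * iqr
upper_limit <- quartiles[2] + 1.5 * iqr
outliers    <- values[values < lower_limit | values > upper_limit]

# The same information as computed for a boxplot:
# $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
# $out   = values drawn as outlier points
box_info <- boxplot.stats(values)

# Draw boxplots of two made-up datasets side by side
observed  <- c(12.1, 15.3,  9.8, 20.4, 17.6)
predicted <- c(13.0, 14.9, 11.2, 18.1, 16.8)
boxplot(list(observed = observed, predicted = predicted), ylab = "value")
</code>

A shorter box and shorter whiskers for the predicted values than for the observed ones would indicate that the model underestimates the spread of the data.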


===== Time for practice =====
[[en:learning:schools:s01:worksheets:ba-ws-07-1|W07-1 Descriptive data set properties]]

en/learning/schools/s01/lecture-notes/ba-ln-07.txt · Last modified: 2015/09/22 16:22 (external edit)

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International