Author: Gaël Varoquaux
Requirements
- Standard scientific Python environment (numpy, scipy, matplotlib)
To install Python and these dependencies, we recommend that you download Anaconda Python or Enthought Canopy, or preferably use the package manager if you are under Ubuntu or another Linux distribution.
See also
- Bayesian statistics in Python: This chapter does not cover tools for Bayesian statistics. Of particular interest for Bayesian modelling is PyMC, which implements a probabilistic programming language in Python.
- Read a statistics book: The Think Stats book is available as a free PDF or in print and is a great introduction to statistics.
Tip
Why Python for statistics?
R is a language dedicated to statistics. Python is a general-purpose language with statistics modules. R has more statistical analysis features than Python, and specialized syntaxes. However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.
Contents
- Data representation and interaction
- Hypothesis testing: comparing two groups
- Linear models, multiple factors, and analysis of variance
- More visualization: seaborn for statistical exploration
Tip
In this document, the Python inputs are represented with the sign '>>>'.
Disclaimer: Gender questions
Some of the examples of this tutorial are chosen around gender questions. The reason is that on such questions controlling the truth of a claim actually matters to many people.
The setting that we consider for statistical analysis is that of multiple observations or samples described by a set of different attributes or features. The data can then be seen as a 2D table, or matrix, with columns giving the different attributes of the data, and rows the observations. For instance, the data contained in examples/brain_size.csv:
Tip
We will store and manipulate this data in a pandas.DataFrame, from the pandas module. It is the Python equivalent of the spreadsheet table. It is different from a 2D numpy array as it has named columns, can contain a mixture of different data types by column, and has elaborate selection and pivot mechanisms.
Creating dataframes: reading data files or converting arrays¶
Reading from a CSV file: Using the above CSV file that gives observations of brain size and weight and IQ (Willerman et al. 1991), the data are a mixture of numerical and categorical values:
Warning
Missing values
The weight of the second individual is missing in the CSV file. If we don't specify the missing value (NA = not available) marker, we will not be able to do statistical analysis.
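A minimal sketch of reading such a file while declaring the missing-value marker. The semicolon separator, the '.' marker and the three inline rows are assumptions made up for illustration; they are not the real contents of examples/brain_size.csv.

```python
import io
import pandas as pd

# Hypothetical excerpt in the style of brain_size.csv (illustrative values only).
csv_text = """Gender;FSIQ;VIQ;PIQ;Weight;Height
Female;133;132;124;118;64.5
Male;140;150;124;.;72.5
Male;139;123;150;143;73.3
"""

# sep and na_values are assumptions about the file layout; adjust them to your file.
data = pd.read_csv(io.StringIO(csv_text), sep=";", na_values=".")

print(data)
print(data["Weight"].isna().sum())  # number of missing weights
```

With the marker declared, the Weight column is parsed as numbers with one missing entry rather than as text.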
Creating from arrays: A pandas.DataFrame can also be seen as a dictionary of 1D 'series', e.g. arrays or lists. If we have 3 numpy arrays:
We can expose them as a pandas.DataFrame:
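As a sketch of this construction, three arrays can be wrapped in a dictionary and handed to pandas. The array names and values below are illustrative choices, not taken from the text.

```python
import numpy as np
import pandas as pd

# Three 1D arrays of equal length (illustrative values).
t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)

# Each dictionary key becomes a named column of the dataframe.
df = pd.DataFrame({"t": t, "sin": sin_t, "cos": cos_t})
print(df.head())
```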
Other inputs: pandas can input data from SQL, Excel files, or other formats. See the pandas documentation.
Manipulating data¶
data is a pandas.DataFrame that resembles R's dataframe:
Note
For a quick view on a large dataframe, use its describe method: pandas.DataFrame.describe().
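The basic manipulations can be sketched on a small stand-in table. The four rows below are hypothetical; only the column names mimic the brain-size data.

```python
import pandas as pd

# Hypothetical toy data; column names mimic the brain-size table.
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "VIQ": [132, 150, 123, 129],
    "Weight": [118.0, None, 143.0, 136.0],
})

print(data.shape)           # (number of rows, number of columns)
print(list(data.columns))   # column names
print(data["Gender"])       # columns are addressed by name

# Boolean-mask selection: mean VIQ over the female rows only.
female_viq_mean = data[data["Gender"] == "Female"]["VIQ"].mean()
print(female_viq_mean)

print(data.describe())      # summary statistics of the numerical columns
```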
groupby: splitting a dataframe on values of categorical variables:
groupby_gender is a powerful object that exposes many operations on the resulting group of dataframes:
Tip
Use tab-completion on groupby_gender to find more. Other common grouping functions are median, count (useful for checking the amount of missing values in different subsets) or sum. Groupby evaluation is lazy: no work is done until an aggregation function is applied.
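A short sketch of splitting and aggregating, again on hypothetical toy rows rather than the real file:

```python
import pandas as pd

# Hypothetical toy data; column names mimic the brain-size table.
data = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "VIQ": [132, 150, 123, 129],
    "Weight": [118.0, None, 143.0, 136.0],
})

# Nothing is computed yet: the groupby object only records the split.
groupby_gender = data.groupby("Gender")

# Iterating yields (group label, sub-series) pairs.
for gender, values in groupby_gender["VIQ"]:
    print(gender, values.mean())

print(groupby_gender.mean())   # per-group mean of every numerical column
print(groupby_gender.count())  # non-missing entries per group and column
```

The count output makes the missing weight visible: the group holding it reports one entry fewer for Weight than for VIQ.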
Exercise
What is the mean value for VIQ for the full population?
How many males/females were included in this study?
Hint: use 'tab completion' to find out the methods that can be called, instead of 'mean' in the above example.
What is the average value of MRI counts expressed in log units, for males and females?
Note
groupby_gender.boxplot is used for the plots above (see this example).
Plotting data¶
Pandas comes with some plotting tools (pandas.plotting, called pandas.tools.plotting in old pandas versions, using matplotlib behind the scenes) to display statistics of the data in dataframes:
Scatter matrices:
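A sketch of a scatter matrix on simulated columns. The data are random numbers generated here for illustration, and the "Agg" backend is only an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, drop this line for interactive use
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Simulated stand-in for three numerical columns of the brain-size table.
rng = np.random.default_rng(0)
height = rng.normal(68, 4, size=40)
data = pd.DataFrame({
    "Height": height,
    "Weight": 2.0 * height + rng.normal(0, 10, size=40),
    "MRI_Count": rng.normal(9e5, 7e4, size=40),
})

# One panel per pair of columns, histograms on the diagonal.
axes = scatter_matrix(data[["Weight", "Height", "MRI_Count"]])
print(axes.shape)
```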
Exercise
Plot the scatter matrix for males only, and for females only. Do you think that the 2 sub-populations correspond to gender?
For simple statistical tests, we will use the scipy.stats sub-module of scipy:
See also
Scipy is a vast library. For a quick summary of the whole library, see the scipy chapter.
1-sample t-test: testing the value of a population mean¶
scipy.stats.ttest_1samp() tests if the population mean of data is likely to be equal to a given value (technically if observations are drawn from a Gaussian distribution of given population mean). It returns the T statistic and the p-value (see the function's help):
Tip
With a p-value of 10^-28 we can claim that the population mean for the IQ (VIQ measure) is not 0.
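A sketch of the call on simulated scores. The values are drawn at random here, so the statistic and p-value will not match the 10^-28 quoted for the real data.

```python
import numpy as np
from scipy import stats

# Simulated IQ-like scores (not the real VIQ column).
rng = np.random.default_rng(0)
viq = rng.normal(loc=110, scale=20, size=40)

# Null hypothesis: the population mean equals 0.
t_stat, p_value = stats.ttest_1samp(viq, 0)
print(t_stat, p_value)
```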
2-sample t-test: testing for difference across populations¶
We have seen above that the mean VIQ in the male and female populations were different. To test if this is significant, we do a 2-sample t-test with scipy.stats.ttest_ind():
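A sketch with two simulated, independent groups. The group sizes and means are arbitrary assumptions, not the study's values.

```python
import numpy as np
from scipy import stats

# Two simulated, independent samples standing in for female and male VIQ.
rng = np.random.default_rng(0)
female_viq = rng.normal(110, 20, size=20)
male_viq = rng.normal(115, 20, size=20)

# Null hypothesis: both groups share the same population mean.
t_stat, p_value = stats.ttest_ind(female_viq, male_viq)
print(t_stat, p_value)
```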
PIQ, VIQ, and FSIQ give 3 measures of IQ. Let us test if FSIQ and PIQ are significantly different. We can use a 2-sample test:
The problem with this approach is that it forgets that there are links between observations: FSIQ and PIQ are measured on the same individuals. Thus the variance due to inter-subject variability is confounding, and can be removed, using a 'paired test', or 'repeated measures test':
This is equivalent to a 1-sample test on the difference:
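The equivalence can be checked on simulated paired measurements. The two score vectors below share a per-subject component by construction; they are made-up data, not the FSIQ and PIQ columns.

```python
import numpy as np
from scipy import stats

# Simulated paired scores: a shared per-subject level plus measurement noise.
rng = np.random.default_rng(0)
subject = rng.normal(110, 20, size=40)
fsiq = subject + rng.normal(0, 5, size=40)
piq = subject + rng.normal(2, 5, size=40)

unpaired = stats.ttest_ind(fsiq, piq)          # ignores the pairing
paired = stats.ttest_rel(fsiq, piq)            # paired / repeated-measures test
one_sample = stats.ttest_1samp(fsiq - piq, 0)  # 1-sample test on the difference

print(unpaired)
print(paired)
print(one_sample)
```

The last two results coincide, while the unpaired statistic is smaller in magnitude because the inter-subject variance inflates its denominator.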
T-tests assume Gaussian errors. We can use a Wilcoxon signed-rank test, which relaxes this assumption:
Note
The corresponding test in the non-paired case is the Mann–Whitney U test, scipy.stats.mannwhitneyu().
Exercise
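Both non-parametric tests can be sketched on simulated data. All values are random draws made for illustration, and the two-sided alternative for the unpaired test is passed explicitly as an assumption about the question being asked.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Paired case: two simulated measurements on the same 40 subjects.
subject = rng.normal(110, 20, size=40)
fsiq = subject + rng.normal(0, 5, size=40)
piq = subject + rng.normal(2, 5, size=40)
w_stat, w_p = stats.wilcoxon(fsiq, piq)

# Unpaired case: two simulated independent groups.
female_viq = rng.normal(110, 20, size=20)
male_viq = rng.normal(115, 20, size=20)
u_stat, u_p = stats.mannwhitneyu(female_viq, male_viq, alternative="two-sided")

print(w_stat, w_p)
print(u_stat, u_p)
```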
- Test the difference between weights in males and females.
- Use non-parametric statistics to test the difference between VIQ in males and females.
Conclusion: we find that the data does not support the hypothesis that males and females have different VIQ.
A simple linear regression¶
Given two sets of observations, x and y, we want to test the hypothesis that y is a linear function of x. In other terms:
y = x * coef + intercept + e
where e is observation noise. We will use the statsmodels module to:
- Fit a linear model. We will use the simplest strategy, ordinary least squares (OLS).
- Test that coef is non zero.
First, we generate simulated data according to the model:
Then we specify an OLS model and fit it:
We can inspect the various statistics derived from the fit:
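The three steps can be sketched as follows. The true slope (3), intercept (-5), noise level and random seed are arbitrary choices for the simulation.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulate data from y = x * coef + intercept + e with coef=3, intercept=-5.
x = np.linspace(-5, 5, 20)
rng = np.random.RandomState(1)
y = -5 + 3 * x + 4 * rng.normal(size=x.shape)
data = pd.DataFrame({"x": x, "y": y})

# Specify the model with a formula and fit it by ordinary least squares.
model = ols("y ~ x", data).fit()

# The summary lists coefficients, standard errors, t statistics and p-values.
print(model.summary())
```

In the summary, the row for x gives the estimated coef and the p-value of the test that it is zero.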
Terminology:
Statsmodels uses a statistical terminology: the y variable in statsmodels is called 'endogenous' while the x variable is called 'exogenous'. This is discussed in more detail here.
To simplify, y (endogenous) is the value you are trying to predict, while x (exogenous) represents the features you are using to make the prediction.
Exercise
Retrieve the estimated parameters from the model above. Hint: use tab-completion to find the relevant attribute.
Categorical variables: comparing groups or multiple categories¶
Let us go back to the data on brain size:
We can write a comparison between IQ of male and female using a linear model:
Tips on specifying model
Forcing categorical: 'Gender' is automatically detected as a categorical variable, and thus each of its different values is treated as a different entity.
An integer column can be forced to be treated as categorical using:
Intercept: We can remove the intercept using - 1 in the formula, or force the use of an intercept using + 1.
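These formula options can be sketched on simulated data. The VIQ values are random, and the integer column Group is a hypothetical stand-in for any integer-coded variable.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated data; "Group" is a hypothetical integer-coded column.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Gender": ["Female", "Male"] * 20,
    "VIQ": rng.normal(110, 20, size=40),
    "Group": [1, 2, 3, 4] * 10,
})

# String column: detected as categorical, intercept requested explicitly.
model = ols("VIQ ~ Gender + 1", data).fit()
print(model.params)

# Integer column forced to be categorical with C().
forced = ols("VIQ ~ C(Group)", data).fit()
print(forced.params)

# Intercept removed with "- 1": one coefficient per gender instead.
no_intercept = ols("VIQ ~ Gender - 1", data).fit()
print(no_intercept.params)
```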
Tip
By default, statsmodels treats a categorical variable with K possible values as K-1 'dummy' boolean variables (one reference level being absorbed into the intercept term). This is almost always a good default choice. However, it is possible to specify different encodings for categorical variables (http://statsmodels.sourceforge.net/devel/contrasts.html).
Link to t-tests between different FSIQ and PIQ
To compare different types of IQ, we need to create a 'long-form' table, listing IQs, where the type of IQ is indicated by a categorical variable:
We can see that we retrieve the same values for the t-test and corresponding p-values for the effect of the type of IQ as in the previous t-test:
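A sketch of the long-form construction and of the match with the 2-sample t-test. The FSIQ and PIQ columns are simulated here, so only the agreement between the two approaches carries over to the real data, not the numbers.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.formula.api import ols

# Simulated wide-form table with two IQ measures per subject.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "FSIQ": rng.normal(113, 24, size=40),
    "PIQ": rng.normal(111, 22, size=40),
})

# Long form: one row per (subject, measure), with the measure name as a category.
data_fsiq = pd.DataFrame({"iq": data["FSIQ"], "type": "fsiq"})
data_piq = pd.DataFrame({"iq": data["PIQ"], "type": "piq"})
data_long = pd.concat((data_fsiq, data_piq))

model = ols("iq ~ type", data_long).fit()
t_stat, p_value = stats.ttest_ind(data["FSIQ"], data["PIQ"])

print(model.tvalues["type[T.piq]"], model.pvalues["type[T.piq]"])
print(t_stat, p_value)
```

The two t values differ only in sign, because the model coefficient measures PIQ minus FSIQ while the t-test call measures FSIQ minus PIQ.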
Consider a linear model explaining a variable z (the dependent variable) with 2 variables x and y:
Such a model can be seen in 3D as fitting a plane to a cloud of (x, y, z) points.
Example: the iris data (examples/iris.csv)
Tip
Sepal and petal size tend to be related: bigger flowers are bigger!But is there in addition a systematic effect of species?
In the above iris example, we wish to test if the petal length is different between versicolor and virginica, after removing the effect of sepal width. This can be formulated as testing the difference between the coefficients associated with versicolor and virginica in the linear model estimated above (it is an Analysis of Variance, ANOVA). For this, we write a vector of 'contrast' on the parameters estimated: we want to test 'name[T.versicolor] - name[T.virginica]', with an F-test:
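A sketch of such a contrast test. The table is simulated rather than read from examples/iris.csv, and the formula 'petal_length ~ name + sepal_width' is an assumption chosen to follow the sentence above; the column names are likewise assumed.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated iris-like table (assumed column names, made-up values).
rng = np.random.default_rng(0)
names = np.repeat(["setosa", "versicolor", "virginica"], 30)
species_offset = np.repeat([0.0, 1.5, 2.0], 30)
sepal_width = rng.normal(3.0, 0.4, size=90)
petal_length = 1.0 + 0.6 * sepal_width + species_offset + rng.normal(0, 0.3, size=90)
data = pd.DataFrame({
    "name": names,
    "sepal_width": sepal_width,
    "petal_length": petal_length,
})

model = ols("petal_length ~ name + sepal_width", data).fit()
print(model.params)

# Contrast over (Intercept, name[T.versicolor], name[T.virginica], sepal_width):
# tests name[T.versicolor] - name[T.virginica] = 0.
result = model.f_test([0, 1, -1, 0])
print(result)
```

The contrast vector must follow the order of model.params, which is why the parameters are printed first.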
Is this difference significant?
Exercise
Going back to the brain size + IQ data, test if the VIQ of male and female are different after removing the effect of brain size, height and weight.
Seaborn combines simple statistical fits with plotting on pandas dataframes.
Let us consider data giving wages and much other personal information on 500 individuals (Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley).
Tip
The full code loading and plotting of the wages data is found in the corresponding example.
We can easily have an intuition on the interactions between continuous variables using seaborn.pairplot() to display a scatter matrix:
Categorical variables can be plotted as the hue:
Look and feel and matplotlib settings
Seaborn changes the default of matplotlib figures to achieve a more 'modern', 'excel-like' look. It does that upon import. You can reset the default using:
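A minimal sketch of the reset. The manual change to the line width is a stand-in for whatever a styling library may have modified.

```python
from matplotlib import pyplot as plt

# Stand-in for a style change made elsewhere (e.g. by a styling library).
plt.rcParams["lines.linewidth"] = 5.0

# Restore matplotlib's built-in default settings.
plt.rcdefaults()
print(plt.rcParams["lines.linewidth"])
```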
Tip
To switch back to seaborn settings, or understand better styling in seaborn, see the relevant section of the seaborn documentation.
A regression capturing the relation between one variable and another, e.g. wage and education, can be plotted using seaborn.lmplot():
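A sketch of such a regression plot on simulated data. The column names WAGE and EDUCATION and all values are assumptions, and the "Agg" backend is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, drop this line for interactive use
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated wage/education pairs (assumed column names).
rng = np.random.default_rng(0)
education = rng.integers(8, 19, size=80).astype(float)
data = pd.DataFrame({
    "EDUCATION": education,
    "WAGE": 0.5 + 0.05 * education + rng.normal(0, 0.2, size=80),
})

# Scatter plot with a fitted regression line and its confidence band.
grid = sns.lmplot(x="EDUCATION", y="WAGE", data=data)
print(grid.ax.get_xlabel(), grid.ax.get_ylabel())
```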
Robust regression
Tip
Given that, in the above plot, there seem to be a couple of data points that are outside of the main cloud to the right, they might be outliers, not representative of the population, but driving the regression.
To compute a regression that is less sensitive to outliers, one must use a robust model. This is done in seaborn using robust=True in the plotting functions, or in statsmodels by replacing the use of the OLS by a 'Robust Linear Model', statsmodels.formula.api.rlm().
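The effect of a robust fit can be sketched with statsmodels on simulated data. The true slope of 0.1 and the three injected outliers are assumptions chosen to make the contrast visible.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated wages with a true slope of 0.1 (illustrative values).
rng = np.random.default_rng(0)
education = rng.integers(8, 19, size=100).astype(float)
wage = 0.5 + 0.1 * education + rng.normal(0, 0.2, size=100)

# Inject three outliers on the right-hand side of the cloud.
education[:3] = 18.0
wage[:3] -= 3.0
data = pd.DataFrame({"education": education, "wage": wage})

ols_fit = smf.ols("wage ~ education", data).fit()
rlm_fit = smf.rlm("wage ~ education", data).fit()  # robust linear model

print(ols_fit.params["education"])  # pulled down by the outliers
print(rlm_fit.params["education"])  # stays closer to the true slope
```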
Do wages increase more with education for males than females?
Tip
The plot above is made of two different fits. We need to formulate a single model that tests for a difference in slope across the two populations. This is done via an 'interaction'.
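A sketch of a model with an interaction term. The data are simulated with a known extra slope of 0.03 for one group; the column names and all coefficients are assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a known interaction: the slope is 0.03 larger for males.
rng = np.random.default_rng(0)
n = 200
education = rng.integers(8, 19, size=n).astype(float)
gender = np.array(["female", "male"] * (n // 2))
is_male = (gender == "male").astype(float)
wage = (0.5 + 0.05 * education + 0.1 * is_male
        + 0.03 * education * is_male + rng.normal(0, 0.1, size=n))
data = pd.DataFrame({"wage": wage, "education": education, "gender": gender})

# "education * gender" adds the interaction term education:gender.
result = smf.ols("wage ~ education + gender + education * gender", data).fit()
print(result.summary())
```

The row education:gender[T.male] estimates the difference in slope between the two groups, and its p-value tests whether that difference is zero.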
Can we conclude that education benefits males more than females?
Take home messages
- Hypothesis testing and p-values give you the significance of an effect / difference.
- Formulas (with categorical variables) enable you to express rich links in your data.
- Visualizing your data and fitting simple models give insight into the data.
- Conditioning (adding factors that can explain all or part of the variation) is an important modeling aspect that changes the interpretation.
Code examples for the statistics chapter.