










Lectures on Statistical Data Analysis (G. Cowan)

A series of lecture notes on statistical data analysis, focusing on probability theory, Bayes' theorem, random variables, probability distributions, hypothesis testing, and the chi-square test. The notes cover topics such as functions of random variables, expectation values, error propagation, the Monte Carlo method, p-values, and the significance of a peak.
Outline:
1. Probability, Bayes' theorem, random variables, pdfs
2. Functions of r.v.s, expectation values, error propagation
3. Catalogue of pdfs
4. The Monte Carlo method
5. Statistical tests: general concepts
6. Test statistics, multivariate methods
7. Significance tests
8. Parameter estimation, maximum likelihood
9. More maximum likelihood
10. Method of least squares
11. Interval estimation, setting limits
12. Nuisance parameters, systematic uncertainties
13. Examples of Bayesian approach
14. tba
Suppose hypothesis H predicts a pdf f(x|H) for a set of observations x. We observe a single point in this space, x_obs. What can we say about the validity of H in light of the data? Decide what part of the data space represents less compatibility with H than does the point x_obs, dividing it into a region "less compatible with H" and a region "more compatible with H". (This division is not unique!)
Example: hypothesis H: the coin is fair (p = 0.5). The probability to observe n heads in N coin tosses is binomial:

f(n; N, p) = [N! / (n! (N − n)!)] p^n (1 − p)^(N − n)

Suppose we toss the coin N = 20 times and get n = 17 heads. The region of data space with equal or lesser compatibility with H relative to n = 17 is: n = 17, 18, 19, 20 and n = 0, 1, 2, 3. Adding up the probabilities for these values gives p = 0.0026, i.e. the probability of obtaining such a bizarre result (or more so) "by chance", under the assumption of H.
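As a quick check, the binomial sum for this example can be evaluated directly (a short sketch using the numbers above):

```python
from math import comb

N, p = 20, 0.5  # fair-coin hypothesis H
# Region with equal or lesser compatibility with H than n = 17:
region = list(range(17, N + 1)) + list(range(0, 4))
p_value = sum(comb(N, n) * p**n * (1 - p)**(N - n) for n in region)
print(f"{p_value:.4f}")  # 0.0026
```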
Suppose we observe n events; these can consist of: n_b events from known processes (background) and n_s events from a new process (signal). If n_s, n_b are Poisson r.v.s with means s and b, then n = n_s + n_b is also a Poisson r.v. with mean s + b.
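The fact that the sum of independent Poisson variables is again Poisson can be checked with a quick simulation (the values of s and b here are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
s, b = 2.0, 3.2  # hypothetical signal and background means (illustrative)

n_s = rng.poisson(s, size=1_000_000)
n_b = rng.poisson(b, size=1_000_000)
n = n_s + n_b

# For a Poisson r.v. the mean equals the variance; both should be ~ s + b = 5.2
print(n.mean(), n.var())
```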
Suppose we measure a value x for each event and histogram the values. Each bin count (observed) is a Poisson r.v. whose mean is given by the expected (dashed) histogram. In the two bins containing the peak, 11 entries are found with an expected background of b = 3.2. The p-value for the s = 0 hypothesis is p = P(n ≥ 11 | b = 3.2) ≈ 5.0 × 10^−4.
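This p-value can be computed directly by summing the Poisson probabilities for n = 0, ..., 10 and subtracting from 1:

```python
from math import exp, factorial

b = 3.2      # expected background in the two peak bins
n_obs = 11   # observed entries

# p-value for the s = 0 hypothesis: P(n >= 11 | mean = b)
p = 1.0 - sum(exp(-b) * b**n / factorial(n) for n in range(n_obs))
print(f"{p:.2e}")  # ~ 5.0e-04
```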
But... did we know where to look for the peak? → quote P(n ≥ 11) in any 2 adjacent bins. Is the observed width consistent with the expected x resolution? → take the x window several times the expected resolution. How many bins × distributions have we looked at? → look at a thousand of them and you will find a p-value of 10^−3 somewhere just by chance.
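The arithmetic behind that last point (the trials factor): if k independent searches are performed, the chance that at least one yields a p-value below p0 is 1 − (1 − p0)^k. A sketch:

```python
# Probability that at least one of k independent searches gives p <= p0
p0, k = 1.0e-3, 1000
p_any = 1.0 - (1.0 - p0) ** k
print(f"{p_any:.3f}")  # ~ 0.632
```

So with a thousand independent looks, a "10^−3 effect" appears somewhere more often than not.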
G. Cowan
The p-value is a function of the data, and is thus itself a random variable with a given distribution. Suppose the p-value of H is found from a test statistic t(x) as

p_H = ∫_{t_obs}^∞ f(t|H) dt.

The pdf of p_H under the assumption of H is then g(p_H|H) = 1, i.e. in general, for continuous data and under the assumption of H, p_H ~ Uniform[0, 1], while g(p_H|H′) is concentrated toward zero for some (broad) class of alternatives H′. So the probability to find the p-value of H0, p_0, less than or equal to α is P(p_0 ≤ α | H0) = α.
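A minimal simulation illustrating this uniformity (the exponential test statistic below is an arbitrary illustrative choice, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)

# Take t ~ Exponential(1) under H0 (purely illustrative choice).
# The p-value of an observed t is then p = P(t' >= t | H0) = exp(-t).
t = rng.exponential(size=1_000_000)
p = np.exp(-t)

# Under H0 the p-value is uniform on [0, 1], so P(p <= alpha) = alpha:
for alpha in (0.05, 0.5):
    print(alpha, np.mean(p <= alpha))  # each close to alpha
```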
We started by defining the critical region in the original data space (x), then reformulated this in terms of a scalar test statistic t(x). We can take this one step further and define the critical region of a test of H0 with size α as the set of data space where p_0 ≤ α. Formally the p-value relates only to H0, but the resulting test will have a given power with respect to a given alternative H1.
Pearson's χ² statistic:

χ² = Σ_{i=1}^{N} (n_i − ν_i)² / ν_i

If the n_i are independent Poisson r.v.s with means ν_i, i.e. n_i ~ Poisson(ν_i), and the ν_i are large enough, then the Poisson distribution becomes Gaussian (with variance σ_i² = ν_i) and therefore Pearson's χ² follows the chi-square pdf for N degrees of freedom (set z = χ²):

f(z; N) = z^{N/2 − 1} e^{−z/2} / (2^{N/2} Γ(N/2)),   z ≥ 0.

The χ² value obtained from the data then gives the p-value:

p = ∫_{χ²_obs}^∞ f(z; N) dz.
Recall that for the chi-square pdf with N degrees of freedom, E[z] = N. This makes sense: if the hypothesized ν_i are right, the rms deviation of n_i from ν_i is √ν_i, so each term in the sum contributes ~ 1. One often sees χ²/N reported as a measure of goodness-of-fit. But... it is better to report χ² and N separately. Consider, e.g., a fit with a large number of degrees of freedom: a χ² per dof only a bit greater than one can imply a small p-value, i.e., poor goodness-of-fit.
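To see the N-dependence concretely, here is a sketch comparing p-values at the same χ² per dof for different N. The helper below is not from the notes; it uses the closed-form chi-square survival function, valid for even N:

```python
from math import exp

def chi2_sf(x, ndof):
    """P(chi2 > x) for EVEN ndof: exp(-x/2) * sum_{k<ndof/2} (x/2)^k / k!"""
    assert ndof % 2 == 0
    term, total = 1.0, 1.0
    for k in range(1, ndof // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

# Same chi2 per dof (1.2), very different p-values as N grows:
for ndof in (10, 100, 1000):
    print(ndof, chi2_sf(1.2 * ndof, ndof))
```

The p-value shrinks dramatically with N even though χ²/N is fixed, which is why χ² and N should be reported separately.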
In the example fit, the data give χ² = 29.8 for N = 20 dof. Now we need to find the p-value, but... many bins have few (or no) entries, so we do not necessarily expect χ² to follow the chi-square pdf.
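For reference, the p-value that the chi-square pdf would assign to χ² = 29.8 with N = 20 dof can be checked directly (a sketch using the closed-form survival function for even N; not code from the notes):

```python
from math import exp

def chi2_sf(x, ndof):
    """P(chi2 > x) for EVEN ndof: exp(-x/2) * sum_{k<ndof/2} (x/2)^k / k!"""
    assert ndof % 2 == 0
    term, total = 1.0, 1.0
    for k in range(1, ndof // 2):
        term *= (x / 2) / k
        total += term
    return exp(-x / 2) * total

print(f"{chi2_sf(29.8, 20):.3f}")  # 0.073
```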
The χ² statistic still reflects the level of agreement between data and prediction, i.e., it is still a "valid" test statistic. To find its sampling distribution, simulate the data with a Monte Carlo program. Here the data sample was simulated 10^6 times. The fraction of times we find χ² ≥ 29.8 gives the p-value. If we had used the chi-square pdf we would find p = 0.073.
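A sketch of such a Monte Carlo determination. The bin means ν_i below are hypothetical stand-ins with few entries per bin; the actual histogram from the notes is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical bin means (20 bins, ~1.5 expected entries each)
nu = np.full(20, 1.5)

# Simulate the histogram many times and compute chi2 for each trial
n_trials = 100_000
n = rng.poisson(nu, size=(n_trials, nu.size))
chi2 = ((n - nu) ** 2 / nu).sum(axis=1)

# Empirical p-value: fraction of simulated chi2 at or above the observed value
chi2_obs = 29.8
p_mc = np.mean(chi2 >= chi2_obs)
print(p_mc)
```

With few entries per bin the simulated distribution generally has a heavier tail than the asymptotic chi-square pdf, so the Monte Carlo p-value can differ noticeably from the chi-square-pdf value.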