Suppose there is a large number of predictors, p. What is the best approach to find out whether any of the p predictors is helpful in predicting the response y?

Suppose the null hypothesis (H0) is:

β₁ = β₂ = … = βₚ = 0

And the alternative hypothesis (Ha) is:

At least one of the βᵢ is not equal to 0

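Best approach: To find out whether any of the p predictors is helpful in predicting y, use the F-statistic, which tests all p coefficients jointly. (This approach works when p < n. When p > n, the least squares model cannot be fit, so the F-statistic cannot be used and other high-dimensional methods are needed.)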
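
For reference, the F-statistic for a least squares fit with n observations and p predictors is commonly written as below. This is the standard textbook form, not something specific to this post; TSS (total sum of squares) and RSS (residual sum of squares) are the usual definitions. Note that the denominator uses n − p − 1 degrees of freedom, which is why the approach needs p < n.

```latex
F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)},
\qquad
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2,
\qquad
\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
```

If H0 is true and the errors are normally distributed, F follows an F-distribution with p and n − p − 1 degrees of freedom, so a value of F well above 1 is evidence against H0.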

Side note: Individual t-statistics can be misleading in this scenario

Suppose p is large, say p = 200, and none of the variables (x1, …, xp) is predictive of the response variable y (i.e. the null hypothesis above is true). Even so, about 5% of the p-values associated with the individual variables, roughly 10 out of 200, will fall below 0.05 purely by chance, so we are almost guaranteed to see at least one. In reality, these variables with low p-values have no predictive power. Therefore, if we rely on individual t-statistics and p-values to conclude that variables have predictive power, we are very likely to draw the wrong conclusion.
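
The arithmetic behind this can be sketched in a few lines of Python. This is an illustration only, under two simplifying assumptions that are not in the original text: the 200 tests are treated as independent, and each p-value is uniformly distributed on (0, 1) when the null hypothesis is true. The function names are made up for this example.

```python
import random

ALPHA = 0.05          # significance threshold used in the text
NUM_PREDICTORS = 200  # p = 200, as in the example above


def expected_false_positives(p, alpha):
    """Expected number of p-values below alpha when every null is true."""
    return p * alpha


def prob_at_least_one(p, alpha):
    """Chance that at least one of p tests falls below alpha.

    Assumes the p tests are independent (a simplification).
    """
    return 1.0 - (1.0 - alpha) ** p


def simulate_null_pvalues(p, alpha, seed):
    """Draw p uniform(0, 1) p-values and count how many fall below alpha."""
    rng = random.Random(seed)
    pvalues = [rng.random() for _ in range(p)]
    return sum(1 for pv in pvalues if pv < alpha)


print("Expected false positives:", expected_false_positives(NUM_PREDICTORS, ALPHA))
print("P(at least one false positive):", prob_at_least_one(NUM_PREDICTORS, ALPHA))
print("False positives in one simulated run:",
      simulate_null_pvalues(NUM_PREDICTORS, ALPHA, seed=0))
```

With p = 200 the expected count is 10 and the chance of at least one false positive is above 99.99%, which is why a handful of small individual p-values says very little on its own.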

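Because the F-statistic accounts for the number of predictors, it does not suffer from this problem: if H0 is true, there is only a 5% chance that the F-statistic yields a p-value below 0.05, regardless of the number of predictors or observations.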
