Multiple regression question

  • 23-09-2017 12:15pm
    #1
    Registered Users Posts: 1,595 ✭✭✭


    I have an intriguing question about how best to use data to make a regression-based prediction.

    Scenario:

    Quantity Y is known to have a linear association with quantities X1 and X2.
    I have 10,000 observations of pairs (X1, Y).
    For 50 of these observations, I also have an observation for X2.

    I want to get as good an estimate of an unknown Y as possible from a new observation of X1 and X2.

    I could ignore X2 entirely and use all 10,000 data points to fit a regression of Y on X1, then estimate Y from that.

    Or I could ignore 9,950 of my observations and fit a regression of Y on X1 and X2 using only the 50 complete observations.

    In each case, I seem to be ignoring useful information.
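To make the two options above concrete, here is a minimal sketch in Python with NumPy. The data are simulated purely for illustration: the coefficients, the noise levels, the dependence of X2 on X1, and the names `fit_ols`, `coef_a` and `coef_b` are all assumptions, not part of the real problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (hypothetical coefficients and noise).
n_full, n_sub = 10_000, 50
x1 = rng.normal(size=n_full)
x2 = 0.5 * x1 + rng.normal(size=n_full)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n_full)

# Pretend X2 was only recorded for the first 50 observations.
sub = np.arange(n_sub)


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


# Option A: Y on X1 alone, all 10,000 observations.
coef_a = fit_ols(x1, y)

# Option B: Y on X1 and X2, only the 50 complete observations.
coef_b = fit_ols(np.column_stack([x1[sub], x2[sub]]), y[sub])

# Estimates of Y for one new (hypothetical) observation.
new_x1, new_x2 = 0.3, -1.2
pred_a = coef_a[0] + coef_a[1] * new_x1
pred_b = coef_b[0] + coef_b[1] * new_x1 + coef_b[2] * new_x2

print("Option A coefficients:", coef_a, "estimate:", pred_a)
print("Option B coefficients:", coef_b, "estimate:", pred_b)
```

In this simulated setup the X1 slope from option A absorbs part of the X2 effect, because X2 was generated to depend on X1; whether that happens in the real data depends on how X1 and X2 are related.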

    Any ideas?

    One option I'm considering is getting two separate estimates from simple regression - one using Y on X1 with 10,000 data points, the other using Y on X2 alone based on 50 data points, and combining these by weighting in proportion to the respective squared correlations.
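A rough sketch of that weighting idea, again in Python with NumPy. The function names `simple_fit` and `combined_estimate` are made up for illustration, and normalising the two squared correlations so the weights sum to 1 is my reading of "in proportion to", not an established method.

```python
import numpy as np


def simple_fit(x, y):
    """Simple linear regression of y on x; returns (intercept, slope, r squared)."""
    slope, intercept = np.polyfit(x, y, 1)
    r_squared = np.corrcoef(x, y)[0, 1] ** 2
    return intercept, slope, r_squared


def combined_estimate(x1_all, y_all, x2_sub, y_sub, new_x1, new_x2):
    """Weight two simple-regression estimates by their squared correlations."""
    a1, b1, r2_1 = simple_fit(x1_all, y_all)  # Y on X1, all observations
    a2, b2, r2_2 = simple_fit(x2_sub, y_sub)  # Y on X2, the 50 observations

    est1 = a1 + b1 * new_x1
    est2 = a2 + b2 * new_x2

    # Weights proportional to the squared correlations, summing to 1.
    w1 = r2_1 / (r2_1 + r2_2)
    return w1 * est1 + (1.0 - w1) * est2
```

Note that these weights take no account of how correlated X1 and X2 are with each other, nor of the fact that the second fit rests on only 50 points.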

