**********************************************************************************; ** PROGRAM: mod_fit_web_final.sas **; ** AUTHOR: Christopher R. Bilder **; ** Department of Statistics **; ** University of Nebraska-Lincoln **; ** chris@chrisbilder.com **; ** DATE: Last edited in 2009 **; ** PURPOSE: Fit model to two MRCVs for the NHANES survey data **; ** Rao-Scott adjustments are used to perform inference for model **; ** comparison tests **; ** Standardized residuals and odds ratios are also found **; ** NOTES: This program is for the paper, **; ** **; ** Bilder, C. R. and Loughin, T. M. (2009). Modeling Multiple-Response **; ** Categorical Data from Complex Surveys. The Canadian Journal of **; ** Statistics, to appear **; ** **; ** Work on this research was partially supported by the National Science **; ** Foundation grants SES-0418688 and SES-0418632 and by the National **; ** Science and Engineering Research Council of Canada. **; **; ** **; ** Also cited in this program is the paper, **; ** **; ** Bilder, C. R. and Loughin, T. M. (2007). Modeling Association **; ** between Two or More Categorical Variables that Allow for Multiple **; ** Category Choices. Communications in Statistics: Theory and Methods **; ** 36(2), 433-451. **; ** **; ** because some segments of code used for it were used here. See **; ** www.chrisbilder.com/mrcv/Comm for that paper's website. **; ** **; ** THE ONLY ITEM THAT NEEDS TO BE CHANGED FOR THIS PROGRAM TO **; ** TO RUN ON YOUR COMPUTER IS THE "libname supersub "C:\ ..." STATEMENT **; ** ON LINE #73. THIS NEEDS TO INCLUDE THE CORRECT FOLDER LOCATION **; ** OF THE mult_response_data AND item_response_table DATA FILES. **; ** **; ** Throughout the program, Z is considered before Y for the two MRCVs. **; ** This is a little different than in the paper where Y is considered **; ** before Z (for example, see p. 6) **; ** **; **********************************************************************************; dm 'log;clear;output;clear;'; options ps=50 ls=70 pageno=1; options mprint symbolgen mlogic; goptions reset=global border ftext=swiss gunit=cm htext=0.4 htitle=0.5; goptions display noprompt; **********************************************************************************; * Settings and read in data; %let alpha = 0.05; *Ho model; %let model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp); *SPMI model; *%let model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj ; *Homogenous association model; *%let model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(smoke); *Smoke main effect, Z main effect model; *%let model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(resp); *Resp main effect, Y main effect model; *%let model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(smoke) smokei*respj(resp); *Z and Y-main effects; *Ha model; * BE CAREFUL WITH SPECIFICATION - Ha model format below needs to have the variables in the EXACT same; * order as in the Ho model (X_reduced is the first ncol(X_reduced) columns of X_full); *%let full_model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj; *Homogenous association model; *%let full_model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(smoke); *Smoke main effect, Z main effect model; *%let full_model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(resp) smokei*respj(smoke); *Z and Y-main effects - CAN USE WHEN Ho:Y main effects model - notice order of smokei*respj(resp) smokei*respj(smoke); *%let full_model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(smoke) smokei*respj(resp); *Z and Y-main effects - CAN USE WHEN Ho:Z main effects model; %let full_model = smoke resp smoke*resp smokei(smoke*resp) respj(smoke*resp) smokei*respj smokei*respj(smoke) smokei*respj(resp) smokei*respj(smoke*resp); *Saturated model; %let numbZ = 5; *Number of smoking items (This is I); %let numbY = 4; *Number of response items (This is J); *Location of SAS data sets on computer - this will need to change for the user; libname supersub 'C:\chris\UNL\research\MRCV_complex_surv\survey_data'; data set2; *Rename the data set here; set supersub.item_response_table; *item_response_table is a SAS data set containing the pairwise item response table with estimated population counts; run; /* Description of the data in set2: smoke = 1, 2, 3, 4, and 5 corresponding to the order of the lifetime tobacco use levels presented in Table 2 of the paper. resp = 1, 2, 3, and 4 corresponding to the order of the respiratory problem levels presented in Table 2 of the paper. smokei and respj are 0 and 1 where a 0 response means negative response and a 1 means a positive response; wtcount is the estimated population totals for a particular combination of items count is the number of observations for a particular combination of items */ data smokeresp2; *Rename the data set here; set supersub.mult_response_data; *The multinomial-like version of the data; run; /* Description of the data in smokeresp2: This data set was created directly from the demo.XPT, smq.xpt, and rdq.xpt data sets available on the NHANES 1999-2000 website at http://www.cdc.gov/nchs/about/major/nhanes/demo99_00.htm and http://www.cdc.gov/nchs/about/major/nhanes/quest99_00.htm. Data sets were merged and put into a form that can be used for this program. Note that individuals with missing observations are not considered in this analysis and no adjustments are made to the sampling weights. This resulted in 26 observations being removed from the data set. It was decided that the statistical methods needed to be first developed in a complete data setting rather than add another layer of complexity to the problem. Future research will consider the incomplete data setting. Some of the variables in the data set not discussed above previously include: WTINT2YR = survey weights WTIREP01 = survey weights used with the jackknife when the "01" set of observations were removed from the data WTIREP52 = survey weights used with the jackknife when the "52" set of observations were removed from the data */ ****************************************************************************************; * Find estimate for Cov(N^) using the jackknife 52 replciates provided in the survey; %global Z Y; *Names need for PROC FREQ, this makes these macro variables global; *Write out the item combinations; %MACRO GENERALIZE_TABLE_STATEMENT; %let Z = Z1*; %do i = 2 %to &numbZ; %let Z = &Z.Z&i.*; %end; %let Y = Y1; %do j = 2 %to &numbY; %let Y = &Y.*Y&j; %end; %MEND GENERALIZE_TABLE_STATEMENT; %GENERALIZE_TABLE_STATEMENT; *Find N_tilde with jackknife weights for all 52 replicates; * Note: N_tilde here is a (2^(I+J) x 1) vector of estimated population counts; %MACRO REPEAT_FREQ(final_set); *do loop is split up due to how the WTIREP variables are numbered in the data set; %do k = 1 %to 9; proc freq data=smokeresp2 noprint; table &Z &Y / sparse out=out_data; weight WTIREP0&k; run; data out_data; set out_data; rep = &k; drop percent; run; data &final_set; set &final_set out_data; run; %end; %do k = 10 %to 52; proc freq data=smokeresp2 noprint; table &Z &Y / sparse out=out_data; weight WTIREP&k; run; data out_data; set out_data; rep = &k; drop percent; run; data &final_set; set &final_set out_data; run; %end; %MEND REPEAT_FREQ; *Initialize save data set for the 52 N_tilde jackknife replicate values; data final_set; set _null_; run; *Find the 52 N_tilde jackknife replicate values; %REPEAT_FREQ(final_set); *Find observed N_tilde; proc freq data=smokeresp2 noprint; table &Z &Y / sparse out=out_data; weight WTINT2YR; run; data out_data; set out_data; drop percent; run; *Find the estimated Cov(N_tilde) with the usual jackknife covariance matrix formula; proc iml; use final_set; read all var{count} into Nhat_jack; *52 N_tilde values for jackknife replicates in a 52*2^(I+J) x 1 vector; use out_data; read all var{count} into Nhat; *Observed N_tilde; save2 = 0; R = 52; *Number of replicates; do i = 1 to R; *do loop to find the sum part; save = (Nhat-Nhat_jack[(1+(i-1)*2**(&numbZ+&numbY)):(i*2**(&numbZ+&numbY)),]) *t(Nhat-Nhat_jack[(1+(i-1)*2**(&numbZ+&numbY)):(i*2**(&numbZ+&numbY)),]); save2 = save + save2; end; cov_Nhat = (R-1)**2 / R**2 * save2; *Estimated covariance matrix; create cov_matrix2 from cov_Nhat; append from cov_Nhat; *Send covariance matrix out to a data set; quit; **************************************************************************************; * Ho (reduced) model calculations; title2 "Ho model"; proc genmod data=set2; class smoke resp smokei respj; model wtcount = &model / dist=poi link=log type3 obstats noint; ods listing exclude parameterinfo; ods listing exclude parameterestimates; ods listing exclude obstats; ods output obstats=obstats_orig modelfit=modfit_orig ParameterEstimates=est_orig; run; title2; *TURN OFF PRINTING; proc printto print="nul:"; run; *Find X matrix; proc glmmod data=set2 OUTDESIGN=X_set; class smoke resp smokei respj; model wtcount = &model / noint; run; *TURN ON PRINTING; proc printto; run; ***********************************************************************************; *More Ho model calculations; * A lot of this code is for getting everything in the correct form for PROC IML; * This includes getting the X matrix in the correct order needed (see Appendix A of Bilder and Loughin (2007)); *MACRO to find number of observations in a data set, adapted from a SAS website sample program; %MACRO NUMOBS(dsn, name); %global &name; data _null_; call symput("&name", left(put(count2,8.))); stop; set &dsn nobs=count2; run; %MEND NUMOBS; *Count the current number of columns in X_set by looking at the est_orig data set; * ncol(X) = nrow(est_orig) - 2, the -2 is due to the intercept and scale parameter rows in est_orig (not estimated); %NUMOBS(est_orig, col_in_X); %put &col_in_X; *This is a check; *Transpose X matrix for upcoming merge; proc transpose data=X_set out=Xtran name=col label=label; var col1-col%eval(&col_in_X-2); run; *Merge with parameter estimates to see which columns of X used; data merg_set; merge est_orig Xtran; if DF ne 0; run; *Transpose back - 4 is in the eval statement since their are 4 cells to a sub-table; proc transpose data=merg_set out=X_mat(drop=_name_); var col1-col%eval(4*&numbZ*&numbY); run; *Get in order needed for PROC IML; data merg_set2; merge obstats_orig X_mat; run; *Get # of parameters estimated - correct # of columns in the X matrix; %NUMOBS(merg_set, row_in_merg_set); %put &row_in_merg_set; *This is a check; *Get in order needed for PROC IML; proc sort data=merg_set2 out=merg_set3(keep= col1-col&row_in_merg_set); by descending smokei descending respj; run; *Get in order needed for PROC IML; proc sort data=obstats_orig out=obstats_orig2; by descending smokei descending respj; run; *PURPOSE: Creates G and H matrices needed for PROC IML; * G contains the 2^I possible item response combinations for Z and * H contains the 2^J possible item response combinations for Y * (see Appendix A of Bilder and Loughin (2007)); %MACRO G_H(numb_items, matrix); *Iterate out all possible combinations; %let X = do item1 = 0 to 1%str(;); %let endit = end%str(;); *Generalize; %do j = 2 %to &numb_items; %let X = &X.do item&j=0 to 1%str(;); %let endit = &endit end%str(;); %end; *Create matrix of all items; data itemvec; &X; output; &endit; run; *Transpose the matrix to be in the correct form; proc transpose data=itemvec out=&matrix.itemvec2; var item1-item&numb_items; run; %MEND G_H; *Find G and H matrices; %G_H(&numbZ, G); %G_H(&numbY, H); *************************************************************************; * Ha (full) model calculations; *TURN OFF PRINTING; proc printto print="nul:"; run; proc genmod data=set2; class smoke resp smokei respj; model wtcount = &full_model / dist=poi link=log type3 obstats noint; ods listing exclude parameterinfo; ods listing exclude parameterestimates; ods listing exclude obstats; ods output obstats=obstats_orig_Ha ParameterEstimates=est_orig_Ha; run; *Find X matrix; proc glmmod data=set2 OUTDESIGN=X_set_Ha; class smoke resp smokei respj; model wtcount = &full_model / noint; run; *TURN ON PRINTING; proc printto; run; ***********************************************************************************; *More Ha model calculations; * A lot of this code is for getting everything in the correct form for PROC IML; * This includes getting the X matrix in the correct order needed (see Appendix A of Bilder and Loughin (2007)); %NUMOBS(est_orig_Ha, col_in_X_Ha); %put &col_in_X_Ha; *Transpose for merge; proc transpose data=X_set_Ha out=Xtran_Ha name=col label=label; var col1-col%eval(&col_in_X_Ha-2); run; *Merge with parameter estimates to see which columns of X used; data merg_set_Ha; merge est_orig_Ha Xtran_Ha; if DF ne 0; run; *Transpose back; proc transpose data=merg_set_Ha out=X_mat_Ha(drop=_name_); var col1-col%eval(4*&numbZ*&numbY); run; *Get in order needed for PROC IML; data merg_set2_Ha; merge obstats_orig_Ha X_mat_Ha; run; *Get # of parameters estimate - correct # of columns in the X matrix; %NUMOBS(merg_set_Ha, row_in_merg_set_Ha); %put &row_in_merg_set_Ha; proc sort data=merg_set2_Ha out=merg_set3_Ha(keep= col1-col&row_in_merg_set_Ha); by descending smokei descending respj; run; proc sort data=obstats_orig_Ha out=obstats_orig_Ha; by descending smokei descending respj; run; **************************************************************************************; * Find n (sample size) and N_tilde (population size estimate, not 2^(I+J) vector of estimated population counts); *Ouput data set contains n and N_tilde; proc means data = set2 sum noprint; class smoke resp; var count wtcount; output out = out_n sum = sum_n_sample sum_N_pop; run; *Get n and N_tilde row of data set; *NOTE: BE CAREFUL - this is hard coded to find n for the (1,1) sub-table; * it would need to be generalized for cases where sub-table total counts are not the same; data out_n2; set out_n; if smoke ne . and resp ne .; if smoke = 1 and resp = 1; run; *Create macro variables for n and N_tilde that can be used later; data _null_; set out_n2; call symput("n_sample", sum_n_sample); call symput("n_pop", sum_N_pop); run; **************************************************************************************; *Calculate X^2, covariance matrices, Rao-Scott adjustments, standardized residuals, odds ratios, ... ; proc iml; *Get information from earlier in SAS program; use cov_matrix2; read all into V; *Var(N_tilde); use Gitemvec2; read all into G; *Need for B; use Hitemvec2; read all into H; *Need for B; use merg_set3; read all into X; *Ho (reduced) model X; use obstats_orig2; read all var{pred} into mu_hat; *M_hat - model based predictions; use obstats_orig2; read all var{wtcount} into wtcount; *M_tilde - empirical estimates (non-model based predictions); use obstats_orig2; read all var{smoke resp smokei respj} into order; *Provides the order of elements in M_tilde; *show names; *Matrices of ones; Jr = repeat(1, &numbZ, 2**&numbZ); *r means row variable; Jc = repeat(1, &numbY, 2**&numbY); *c means column variable; *M = B*N, see Appendix A of Bilder and Loughin (2007) for its construction; * This B uses a different format than given in the MRCV complex paper.; * In the end, both produce the correct results.; B = G@H // G@(Jc-H) // (Jr-G)@H // (Jr-G)@(Jc-H); *Covariance matrix in terms of the marginal population estimated counts; cov_M = B*V*t(B); *Matrix used in distributional approximation for beta_hat; A = Diag(sqrt(mu_hat))*X; *Covariance matrix for beta_hat; *cov_beta = inv(t(A)*A) * t(A) * Diag(1/sqrt(mu_hat)) * cov_m * t(inv(t(A)*A) * t(A) * Diag(1/sqrt(mu_hat))); *Just another way; cov_beta = inv(t(X)*Diag(mu_hat)*X) * t(X) * cov_M * X * inv(t(X) * Diag(mu_hat) * X); *Covariance matrix for M_hat; cov_mu = Diag(mu_hat)*X * cov_beta * t(X)*Diag(mu_hat); *Covariance matrix for residuals - 4=2*2 is for the two MRCVs where each has a item is a 0 or 1 response; cov_e = (I(4*&numbZ*&numbY) - Diag(mu_hat)*X*inv(t(X)*Diag(mu_hat)*X)*t(X)) * cov_M * t(I(4*&numbZ*&numbY) - Diag(mu_hat)*X*inv(t(X)*Diag(mu_hat)*X)*t(X)); *Pearson GOF statistic; e = wtcount - mu_hat; *M_tilde - M_hat = residual; *Form: Xnew^2 = n*SUM{ (P~ - P^)^2 / P^ } = n/N~ * SUM{ (M~ - M^)^2 / M^}; Xsq = &n_sample / &n_pop * t(e) * Diag(1/mu_hat) * e; *Standardized Pearson residuals; stand_err = repeat(0, 4*&numbZ*&numbY, 1); do i = 1 to 4*&numbZ*&numbY; *4=2*2 is for the two MRCVs where each has a item is a 0 or 1 response; stand_err[i,1] = sqrt(cov_e[i,i]); end; stand_resid = e/stand_err; send_out = num(order) || stand_resid || mu_hat || wtcount; col = {"Z" "Y" "zi" "yj" "stand_resid" "M_hat" "M_tilde"}; create stand_resid from send_out[colname=col]; append from send_out; *Put the above information into a data set; **********************************************************************************; *Rao-Scott calculations; *Find eigenvalues for RS adjustment; mat = &n_sample / &n_pop * Diag(1/mu_hat) * cov_e; eig_mat = eigval(mat); sum_eig = trace(mat); sum_eig_sq = sum(eig_mat[,1]##2); *print sum_eig sum_eig_sq; *RS adjusted Pearson GOF statistic; RS_first_df = 4*&numbZ*&numbY-ncol(X); *4=2*2 is for the two MRCVs where each has a item is a 0 or 1 response; RS_first = RS_first_df*Xsq/sum_eig; RS_first_p = 1-CDF('CHISQUARED', RS_first, RS_first_df); RS_second = sum_eig*Xsq/sum_eig_sq; RS_second_df = sum_eig**2 / sum_eig_sq; RS_second_p = 1-CDF('CHISQUARED', RS_second, RS_second_df); print 'GOF for Ho model using X^2', Xsq RS_first RS_second RS_first_df RS_second_df RS_first_p RS_second_p; *Adjustment put into notation on p. 90 of Chambers and Skinner (2003); d = RS_first_df; a_hat_sq = d*sum_eig_sq/sum_eig**2 - 1; d_star = d/(1+a_hat_sq); F_adj = RS_first/d; *NOTE: (&numbz*&numby-1) may change with different complex sampling designs; * Because the exact sampling design is not readily available for the NHANES, these degrees of freedom; * were used; F_first_p = 1-CDF('F', F_adj, d, d*(4*&numbz*&numby-1)); *4*&numbz*&numby = # of cells in item response table; F_second_p = 1-CDF('F', F_adj, d_star, d_star*(4*&numbz*&numby-1)); *nu = d*(4*&numbz*&numby-1); print 'GOF for Ho model using F' F_adj F_first_p F_second_p; *****************************************************************; * Find odds ratios and corresponding confidence intervals; * Note: Under the SPMI model, an error message will be produced in the LOG window due to; * a variance equal to -1.3E-19. Remember that it does not make sense to look at these ORs; * under SPMI anyway (OR all equal to 1 so variance should be 0); alpha = α *Find the derivative needed for the delta method - code taken from Bilder and Loughin (2007) program; save_logOR_der = repeat(0, &numbZ*&numbY, 4*&numbZ*&numbY); save_logOR_der_obs = repeat(0, &numbZ*&numbY, 4*&numbZ*&numbY); do i = 1 to &numbZ*&numbY; *Evaluation using M_hat; delta = repeat(0, 1, i-1) || 1/mu_hat[i,1] || repeat(0, 1, &numbZ*&numbY-1) || -1/mu_hat[&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1) || -1/mu_hat[2*&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1) || 1/mu_hat[3*&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1-(i-1)); save_logOR_der[i,] = delta; *Evaluation using M_tilde; delta_obs = repeat(0, 1, i-1) || 1/wtcount[i,1] || repeat(0, 1, &numbZ*&numbY-1) || -1/wtcount[&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1) || -1/wtcount[2*&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1) || 1/wtcount[3*&numbZ*&numbY+i, 1] || repeat(0, 1, &numbZ*&numbY-1-(i-1)); save_logOR_der_obs[i,] = delta_obs; end; *Covariance matrix for log(estimated OR); var_LogOR = save_logOR_der*cov_mu*t(save_logOR_der); var_LogOR_obs = save_logOR_der_obs*cov_M*t(save_logOR_der_obs); *Find log(ORs) and C.I.s; LogOR = repeat(0, &numbZ*&numbY, 3); LogOR_obs = repeat(0, &numbZ*&numbY, 3); var_LogOR_save = repeat(0, &numbZ*&numbY, 1); var_LogOR_obs_save = repeat(0, &numbZ*&numbY, 1); do i = 1 to &numbZ*&numbY; *Model predicted, log(OR_hat); LogOR[i,1] = log(mu_hat[i,1]) - log(mu_hat[&numbZ*&numbY+i, 1]) - log(mu_hat[2*&numbZ*&numbY+i, 1]) + log(mu_hat[3*&numbZ*&numbY+i, 1]); var_LogOR_save[i,1] = var_LogOR[i,i]; LogOR[i,2] = LogOR[i,1] - probit(1-alpha/2)*sqrt(var_LogOR[i,i]); LogOR[i,3] = LogOR[i,1] + probit(1-alpha/2)*sqrt(var_LogOR[i,i]); *Empirical, log(OR_tilde); LogOR_obs[i,1] = log(wtcount[i,1]) - log(wtcount[&numbZ*&numbY+i, 1]) - log(wtcount[2*&numbZ*&numbY+i, 1]) + log(wtcount[3*&numbZ*&numbY+i, 1]); var_LogOR_obs_save[i,1] = var_LogOR_obs[i,i]; LogOR_obs[i,2] = LogOR_obs[i,1] - probit(1-alpha/2)*sqrt(var_LogOR_obs_save[i,1]); LogOR_obs[i,3] = LogOR_obs[i,1] + probit(1-alpha/2)*sqrt(var_LogOR_obs_save[i,1]); end; *Find ORs; OR = exp(LogOR); OR_obs = exp(LogOR_obs); *Indices for the sub-tables; save_indices = repeat(0, &numbZ*&numbY, 2); count = 1; do i = 1 to &numbZ; do j = 1 to &numbY; save_indices[count,] = i || j; count = count + 1; end; end; print 'Indices: col. #1 = Z item #, col. #2 = Y item #', "OR: col. #1 = model predicted OR, col. #2-#3 = (1-&alpha)100% C.I.", "OR_obs: col. #1 = empirical OR, col. #2-#3 = (1-&alpha)100% C.I."; print save_indices OR OR_obs; **********************************************************************************; *Different form for X^2 that gives the same results after Rao-Scott adjustments; *Form: X^2 = SUM{ (M~ - M^)^2 / M^}; Xsq_diff = t(e) * Diag(1/mu_hat) * e; *Find eigenvalues for RS adjustment; mat = Diag(1/mu_hat) * cov_e; eig_mat = eigval(mat); sum_eig = trace(mat); sum_eig_sq = sum(eig_mat[,1]##2); *print sum_eig sum_eig_sq; *RS adjusted Pearson GOF statistic; RS_first_df = 4*&numbZ*&numbY-ncol(X); RS_first = RS_first_df*Xsq_diff/sum_eig; *&n_sample/&n_pop part falls out here providing the same RS_first as before; RS_first_p = 1-CDF('CHISQUARED', RS_first, RS_first_df); RS_second = sum_eig*Xsq_diff/sum_eig_sq; *&n_sample/&n_pop part falls out here again providing the same RS_first as before; RS_second_df = sum_eig**2 / sum_eig_sq; RS_second_p = 1-CDF('CHISQUARED', RS_second, RS_second_df); *print 'GOF using Xsq_diff', Xsq Xsq_diff RS_first RS_second RS_first_df RS_second_df RS_first_p RS_second_p; ***********************************************************************************************; * Compare Ho and Ha models using the Pearson statistic and apply Rao-Scott adjustments; use merg_set3_Ha; read all into X_f; *Full model X matrix; X_r = X; *Reduced model X matrix; ncol_X_f = ncol(X_f); ncol_X_r = ncol(X_r); *X_F-R - Assumes X_f has additional columns for the extra betas AFTER X_r (X_r is the first ncol(X_r) columns of X_f); X_f_r = X_f[ ,(ncol_X_r+1):ncol_X_f]; Q = (I(nrow(mu_hat)) - X_r*inv(t(X_r)*diag(mu_hat)*X_r) *t(X_r)*diag(mu_hat)) * X_f_r; *Find eigenvalues for RS adjustment; mat = &n_sample / &n_pop * t(Q)*cov_M*Q * inv(t(Q)*diag(mu_hat)*Q); eigen = eigval(mat); *print eigen; *notice number of them is ncol_X_f - ncol_X_r; sum_eig = trace(mat); sum_eig_sq = sum(eigen[,1]##2); *M_hat^(1); use obstats_orig_Ha; read all var{pred} into mu_hat_Ha; e_Ho_Ha = mu_hat_Ha - mu_hat; *M_hat^(1) - M_hat^(0) = residual; Xsq = &n_sample / &n_pop * t(e_Ho_Ha) * Diag(1/mu_hat) * e_Ho_Ha; *RS adjusted modified Pearson statistic; RS_first = (ncol_X_f - ncol_X_r)*Xsq/sum_eig; RS_second = sum_eig*Xsq/sum_eig_sq; RS_first_df = ncol_X_f - ncol_X_r; RS_second_df = sum_eig**2 / sum_eig_sq; *check; RS_first_p = 1-CDF('CHISQUARED', RS_first, RS_first_df); RS_second_p = 1-CDF('CHISQUARED', RS_second, RS_second_df); print 'Full vs. reduced model'; print "Reduced: &model"; print "Full: &full_model"; print Xsq RS_first RS_second RS_first_df RS_second_df RS_first_p RS_second_p; *Adjustment put into notation on p. 90 of Chambers and Skinner (2003); d = ncol_X_f - ncol_X_r; a_hat_sq = d*sum_eig_sq/sum_eig**2 - 1; d_star = d/(1+a_hat_sq); *F_adj = X_sq_first/d; F_adj = RS_first/d; *NOTE: (&numbz*&numby-1) may change with different complex sampling designs; * Because the exact sampling design is not readily available for the NHANES, these degrees of freedom; * were used; F_first_p = 1-CDF('F', F_adj, d, d*(4*&numbz*&numby-1)); F_second_p = 1-CDF('F', F_adj, d_star, d_star*(4*&numbz*&numby-1)); print 'Full vs. reduced model using F' F_adj F_first_p F_second_p; quit; title2 'Standardized residuals and estimated M'; proc print data = stand_resid; run;