ITEM LEVEL DIAGNOSTICS AND MODEL-DATA FIT IN ITEM RESPONSE THEORY (IRT) USING BILOG-MG V3.0 AND IRTPRO V3.0 PROGRAMMES

Item response theory (IRT) is a framework for modeling and analyzing item response data. Item-level modeling gives IRT advantages over classical test theory. The fit of item score patterns to an IRT model is a necessary condition that must be assessed before the items, and the model that best fits the data, are put to further use. The study investigated item-level diagnostic statistics and model-data fit with one- and two-parameter models using IRTPRO V3.0 and BILOG-MG V3.0. An ex post facto design was adopted. The population for the study consisted of the responses of 11,538 candidates who took the Type L 2014 Unified Tertiary Matriculation Examination (UTME) Mathematics paper in Akwa Ibom State, Nigeria. A sample of 5,192 (45%) responses was randomly selected through a stratified sampling technique. BILOG-MG V3.0 and IRTPRO V3.0 computer software were used to calibrate the candidates’ responses. Two research questions were raised to guide the study. Pearson’s χ² and S-χ² statistics were used as item fit indices for dichotomous IRT models. The outputs from the two programmes were used to answer the questions. The findings revealed that only 1 item fitted the 1-parameter model in BILOG-MG V3.0 and IRTPRO V3.0. Furthermore, 26 items fitted the 2-parameter model in BILOG-MG V3.0, while five items fitted the 2-parameter model in IRTPRO. It was recommended that more than one IRT software programme be used, since this offers more useful information for choosing the model that fits the data.


INTRODUCTION
The crucial benefits of IRT models are realized to the degree that the data fit the different models (1-, 2- and 3-parameter). Model-data fit is a major concern when applying item response theory (IRT) models to real test data. Although it has been argued that evaluating fit in IRT modeling is challenging, model checking and item fit statistics are crucial to the effective use of IRT in psychometrics, because they inform decisions about items and model selection (Reise, 1990; Embretson & Reise, 2000).
Obtaining evidence of model-data fit when an IRT model is used to make inferences from a data set is recommended in the Standards for Educational and Psychological Testing by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014). Failure to meet this requirement invalidates the application of IRT to real data sets. Research (Orlando and Thissen, 2000, 2003) indicated that model checking remains a major hurdle to the effective implementation of item response theory. According to Liu and Maydeu-Olivares (2014), item-level and model-data fit statistics must be assessed before any inferences can be drawn from the fitted model, because conclusions derived from poorly fitting models are potentially misleading. Effective assessment of model-data fit is therefore imperative for choosing the model that adequately fits the data.
Cyrinus B. Essen, Department of Educational Foundations, University of Calabar, Calabar, Nigeria.
Idaka E. Idaka, Department of Educational Foundations, University of Calabar, Calabar, Nigeria.
Recent studies of item fit statistics and model selection have extended beyond dichotomous IRT models to polytomous IRT models, including the generalized partial credit model and the rating scale model (Chon, Lee & Ansley, 2007; Kang & Chen, 2011). Various item-fit statistics for dichotomous IRT models have been proposed (Orlando, 1997; Orlando & Thissen, 1997, 2000, 2003) to assess the appropriateness of the chosen IRT model and calibration procedure in terms of model-data fit for the 1-, 2- and 3-parameter logistic models. Wells, Wollack, and Serlin (2005) stressed that a model that fits the data must accurately portray the true relationship between ability and performance on the item. They held that model misfit has dire consequences, leading to violation of the invariance property. Kose (2014) emphasized that the invariance of item and ability parameters is the mainstay of IRT and distinguishes it from classical test theory (CTT): item parameters do not depend on the ability distribution of the examinees, and ability parameters do not depend on the characteristics of the set of test items. Hence, Bolt (2002) held that it is imperative for test developers to establish that a particular model fits the data before operationalizing an item. Orlando and Thissen (2003) noted that the appropriate use of IRT models is predicated on a number of assumptions about the nature of the data, which ensure that the model accurately represents the data. When these assumptions are not met, inferences regarding the nature of the items and tests can be erroneous, and the potential advantages of using IRT are not gained. In addition, Sinharay (2005) held that failure to ensure adequate model-data fit carries the risk of drawing incorrect conclusions.
According to Hambleton and Swaminathan (1985, cited in McAlphine, 2002), the assessment of model-data fit should be based on three types of evidence. The first is the validity of the model's assumptions for the data set: (a) unidimensionality, (b) the test is not speeded, (c) guessing is minimal for the 1PL and 2PL models, and (d) all items are of equal discrimination for the 1PL model. The second is the extent to which the expected properties of the model, namely invariance of item and ability parameter estimates, are obtained. The third is the accuracy of the model's predictions, which should be assessed through the analysis of item residuals.
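Assumptions (c) and (d) above describe how the three logistic models nest within one another. A minimal Python sketch of that nesting follows. It assumes the logistic metric without the 1.7 scaling constant, and the function name and parameter names (a for discrimination, b for difficulty, c for the lower asymptote) are illustrative rather than taken from BILOG-MG or IRTPRO.

```python
import math


def irf(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    Setting c = 0 gives the 2PL model; additionally holding a equal
    across all items gives the 1PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Illustrative values only, not estimates from the UTME data.
p_1pl = irf(0.5, a=1.0, b=0.5)         # ability equals difficulty
p_2pl = irf(0.5, a=1.7, b=0.5)         # steeper item, same location
p_3pl = irf(0.5, a=1.7, b=0.5, c=0.2)  # adds a guessing floor
```

When ability equals difficulty, the 1PL and 2PL curves both pass through one half, while a non-zero lower asymptote lifts the 3PL curve above it; this is why minimal guessing is an assumption of the two simpler models.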
In addition, Sijtsma and Hemker (2000) and Sheng (2005) (2000), Stone (2000), Glas and Suarez-Falcon (2003), Stone and Zhang (2003), Dodeen (2004) and Sinharay (2005) developed a number of item-level fit statistics for use with dichotomous item response theory models. The common procedure for constructing item fit indices for the 2PL and 3PL models groups respondents based on their estimated standing on the latent variable being measured by the test and obtains the observed frequencies of correct and incorrect responses for these groups. Dodeen (2004) Orlando and Thissen (2003) used MULTILOG software to study the utility of S-X² as an item fit index for dichotomous item response theory models. Results were based on simulated data generated and calibrated for 100 tests under each of 27 conditions: (3 bad items) × (3 test lengths) × (3 sample sizes). The three nonlogistic (bad) items were created and embedded in otherwise 3PL tests of 10, 40, and 80 items for samples of 500, 1,000, and 2,000. The item fit indices S-X² and Q1-X² were calculated for each item. The conclusion was that the performance of S-X² improved with test length and was superior to that of Q1-X² under most but not all conditions. Results from the study implied that S-X² is a useful tool for detecting the misfit of one item contained in an otherwise well-fitted test, lending additional support to the utility of the index for use with dichotomous item response theory models.
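The grouping procedure described above for traditional item fit indices can be sketched in Python as follows. This is a simplified illustration, not the computation performed by MULTILOG, BILOG-MG or IRTPRO: it assumes a 2PL item, uses simulated abilities in place of estimated ones, and forms ten equal-sized ability groups. All function and variable names are hypothetical. S-X² differs in that it groups examinees by summed score, which requires an additional step to obtain the expected frequencies and is not shown here.

```python
import math
import random


def p_2pl(theta, a, b):
    """2PL probability of a correct response (logistic metric)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grouped_chi_square(thetas, responses, a, b, n_groups=10):
    """Pearson-type item-fit statistic computed over ability groups."""
    # Sort examinees by ability and cut them into equal-sized groups.
    pairs = sorted(zip(thetas, responses))
    n = len(pairs)
    chi2 = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not group:
            continue
        n_g = len(group)
        # Observed versus model-expected proportion correct in the group.
        observed = sum(u for _, u in group) / n_g
        expected = sum(p_2pl(t, a, b) for t, _ in group) / n_g
        chi2 += n_g * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2


# Simulated data for one item with a = 1.2 and b = 0.0 (assumed values).
rng = random.Random(2014)
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
responses = [1 if rng.random() < p_2pl(t, 1.2, 0.0) else 0 for t in thetas]

chi_fit = grouped_chi_square(thetas, responses, a=1.2, b=0.0)     # correct model
chi_misfit = grouped_chi_square(thetas, responses, a=1.2, b=1.5)  # wrong difficulty
```

Evaluating the statistic with the generating parameters yields a much smaller value than evaluating it with a deliberately wrong difficulty, which is the behaviour an item fit index is expected to show.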
Also, Mokobi and Adedoyin (2014) used MULTILOG to assess item-level and model fit statistics in a 3-parameter logistic model using the 2010 Botswana Junior Certificate Examination Mathematics Paper One. A chi-square goodness-of-fit statistic was employed to assess item fit to the 1PL, 2PL and 3PL models. The results revealed that 10 items fitted the 1PL model, 11 items fitted the 2PL model and 24 items fitted the 3PL model. Therefore, the 3PL model was used for the analysis.
Furthermore, Dodeen (2004) used BILOG 3.11 software for fitting the 3PL model to the generated data sets and for computing the values of the χ² G statistic. The statistics S-χ² and S-G² were computed using the GOODFIT programme. The proportions of significant S-χ² and χ² statistics were low and close to the nominal level for all the test conditions. The statistics χ² and G² were computed using the IRTFIT RESAMPLE programme. The average item fit statistics, the proportion of item fit statistics significant at the 1 percent level, and the correlations between the generating item parameters and the average item fit statistics over the 100 replications were computed under each of the nine test conditions. Furthermore, Essen (2015) examined model-data fit in the 2014 50-item dichotomously scored JAMB Mathematics data, using chi-square goodness-of-fit statistics with the BILOG-MG 3.0 software programme. No item fitted the 1-parameter model and 26 items fitted the 2-parameter IRT model, while the 3-parameter model displayed some irregularities. Therefore, the 2-parameter logistic model was best for the data.
In another study, Kang and Chen (2007) used an item-fit index, S-X², proposed by Orlando and Thissen (2000, 2003) for dichotomous item response theory (IRT) models, which has performed better than traditional item-fit statistics. The study extended the utility of S-X² to polytomous IRT models, including the generalized partial credit model, partial credit model, and rating scale model. The performance of the generalized S-X² in assessing item-model fit was studied in terms of empirical Type I error rates and power, compared with results obtained for G² provided by the computer programme PARSCALE. The results showed that the generalized S-X² was a promising item-fit index for polytomous items in educational and psychological testing programmes.
In addition, Chon, Lee and Ansley (2007) examined various model combinations and calibration procedures for mixed format tests under different item response theory (IRT) models and calibration methods. Using real data sets that consisted of both dichotomous and polytomous items, nine possible IRT model mixtures and two calibration procedures were compared based on traditional and alternative goodness-of-fit statistics. Three dichotomous models and three polytomous models were combined to analyze the mixed format tests using both simultaneous and separate calibration methods. To assess goodness of fit, PARSCALE's G² was used. In addition, two fit statistics proposed by Orlando and Thissen (2000) were extended to more general forms to enable the evaluation of fit for mixed format tests. The results indicated that, among the various IRT model combinations, the three-parameter logistic model combined with the generalized partial credit model led to the best fit to the given data sets, while the one-parameter logistic model had the largest number of misfitting items. In the comparison of the three fit statistics, some inconsistencies were found between the traditional and new indices for assessing the fit of IRT models to data. The new indices indicated considerably better model fit than the traditional indices.
This study investigated item-level diagnostics and model-data fit in IRT using BILOG-MG V3.0 and IRTPRO V3.0 software. The 2014 Unified Tertiary Matriculation Examination (UTME) Mathematics items were used for the analysis. The Joint Admissions and Matriculation Board (JAMB), which is vested with sole responsibility for conducting examinations for admission into Nigerian universities, polytechnics and colleges of education, has shifted from the CTT to the IRT paradigm in test construction and development, in line with global best practice of item and person independence in educational assessment. However, the extent to which the items fit the various IRT models is the concern of this study, since fit is an essential condition for the use of the data.

Purpose of the study
The study investigated the extent to which the 2014 UTME Mathematics items fitted the 1-, 2- and 3-parameter models, using the S-X² and Pearson X² statistics in the BILOG-MG V3.0 and IRTPRO V3.0 software programmes. The study specifically examined: 1. The

Method
The research design for this study was ex post facto. This design was chosen because the researcher had no intention of manipulating the characteristics of the participants or the variables involved. The population for the study consisted of 11,538 candidates who took the Type L 2014 UTME Mathematics paper in Akwa Ibom State. Four thousand, five hundred and forty-six were females and 6,994 were males. A stratified sampling procedure was used to select 5,192 candidates' response data, comprising 2,596 males and 2,596 females and representing 45 per cent of the candidates who took the 2014 UTME Mathematics items. The 5,192 candidates' response data were calibrated under the 1-, 2- and 3-parameter models using the BILOG-MG V3.0 and IRTPRO V3.0 computer software. The outputs were used for analysis.
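The sampling step can be illustrated with a short Python sketch. It assumes equal allocation with simple random sampling inside each stratum, which is one way of obtaining 2,596 responses per sex; the study does not state the exact mechanism. The candidate identifiers and the function name are hypothetical, and only the stratum sizes, the per-stratum sample size and the population total come from the text.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=0):
    """Draw the same number of units, without replacement, from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}


# Hypothetical candidate identifiers; stratum sizes as reported in the text.
strata = {
    "female": [f"F{i}" for i in range(4546)],
    "male": [f"M{i}" for i in range(6994)],
}

sample = stratified_sample(strata, n_per_stratum=2596)
sample_size = sum(len(units) for units in sample.values())

# Share of the reported population of 11,538 that the sample represents.
sampling_fraction = sample_size / 11538
```

Rounded to the nearest whole per cent, the sampling fraction reproduces the 45 per cent figure quoted for the 5,192 sampled responses.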

Results
The results of the data analysis are presented in Tables 1 and 2 according to the research questions.

Research question 1
Which of the IRT fit statistics, S-X² in IRTPRO V3.0 or X² in BILOG-MG V3.0, best diagnoses the model-data fit of the 2014 UTME Mathematics items?
Table 1 shows the results obtained from the two software programmes, IRTPRO V3.0 and BILOG-MG V3.0. It shows the S-X² and X² diagnostic indices of each 2014 Mathematics item as calibrated under the different IRT models. Both programmes calibrated the data under the 1PL and 2PL models, but only IRTPRO calibrated the 3-parameter logistic model. The 3-parameter model in BILOG-MG V3.0 displayed inconsistencies that did not allow the calibrated model to be used. In IRTPRO, 48 items were significant at less than .05 under the 1-parameter model, the exception being item 10 with a non-significant value of .1216. Under the 2-parameter model, 44 items were significant at less than .05, with 5 items: 10 (.1808), 12 (.1023), 18 (.0549), 19 (.1714)

no item fits the 1-parameter model; 26 items fit the 2-parameter model. Therefore, the 2-parameter model in BILOG-MG V3.0 best fits the 2014 UTME Mathematics items.
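The decision rule applied in this section treats an item as fitting a model when its fit statistic is non-significant at the .05 level. A minimal Python sketch of that rule follows. It uses the four IRTPRO 2-parameter p-values reported in the text, plus one hypothetical misfitting item added purely for contrast; the function and variable names are illustrative.

```python
ALPHA = 0.05  # significance level used in the study


def fitting_items(p_values, alpha=ALPHA):
    """Return the labels of items whose fit statistic is non-significant."""
    return [item for item, p in p_values if p > alpha]


irtpro_2pl = [
    ("10", 0.1808),
    ("12", 0.1023),
    ("18", 0.0549),
    ("19", 0.1714),
    ("hypothetical", 0.0010),  # made-up misfitting item, for contrast only
]

flagged = fitting_items(irtpro_2pl)
```

Item 18, with a p-value of .0549, sits just above the threshold, so its classification as fitting is sensitive to the choice of significance level.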

Research question 2
Which of the IRT software programmes (BILOG-MG V3.0 and IRTPRO V3.0) is appropriate for the 2014 UTME Mathematics items?
The results in Table 2 reveal that, although the number of fitting items improves in IRTPRO from 1 under the 1PL model to 5 under the 2PL model and 7 under the 3PL model, IRTPRO does not prove very suitable, because few items are identified as fitting. By comparison, BILOG-MG V3.0 shows a remarkable improvement, identifying 26 items that fit the 2-parameter model, though no item fits the 1-parameter model. The results show that the choice of software programme depends on which programme identifies the larger number of items that fit a particular model.

Discussion
The results from research question 1 revealed that IRTPRO V3.0 and BILOG-MG V3.0 differed in the S-X² and X² diagnostic indices of each item under the different IRT models. Both programmes calibrated the data under the 1PL and 2PL models, but only IRTPRO calibrated the 3-parameter logistic model. The findings agree with the study by Orlando and Thissen (2003) on the utility of S-X² as an item fit index for dichotomous item response theory models, in which the item fit indices S-X² and Q1-X² were calculated for each item. That study concluded that the performance of S-X² improved with test length and was superior to that of Q1-X² under most but not all conditions. Its results implied that S-X² is a useful tool for detecting the misfit of one item contained in an otherwise well-fitted test, lending additional support to the utility of the index for use with dichotomous item response theory models. Also, Mokobi and Adedoyin (2014) found that 24 items fitted the 3PL model, compared with 10 for the 1PL model and 11 for the 2PL model. Therefore, the 3PL model was used for the analysis. Kose (2014) found that, among the 1-, 2- and 3-parameter models assessed for model-data fit, the 2PL model fitted significantly better than the 3PL model when the -2 log likelihood ratio X² was used. However, when Chon, Lee and Ansley (2007) evaluated model-data fit for mixed format tests using extensions of the statistics proposed by Orlando and Thissen (2000), the results indicated that, among the various IRT model combinations, the three-parameter logistic model combined with the generalized partial credit model led to the best fit to the given data sets. The one-parameter logistic model had the largest number of misfitting items. In the comparison of the three fit statistics, some inconsistencies were found between the traditional and new indices for assessing the fit of IRT models to data. The new indices indicated considerably better model fit than the traditional indices. The finding implied that conducting item-level diagnostics and assessing model-data fit is imperative when using IRT models in analysis.
Results from research question 2 indicated that the BILOG-MG V3.0 computer programme displayed greater efficiency in detecting items that fit the various IRT models than the IRTPRO programme. The results indicated the need to use more than one software programme to examine model-data fit. Various IRT software programmes are used to examine model-data fit, such as BILOG, BILOG-MG, MULTILOG, IRTPRO and PARSCALE, among others. These programmes provide different information about model fit, so comparing more than one programme improves the assessment of model-data fit. Although many studies have not considered the use of more than one software programme in comparing model-data fit, this study provides grounds for more studies in this respect.

CONCLUSION
The study examined item diagnostic statistics and model-data fit in item response theory using the BILOG-MG V3.0 and IRTPRO V3.0 programmes. The results indicated that the χ² and S-χ² statistics identified some items that fitted the 1-, 2- and 3-parameter logistic IRT models. Also, BILOG-MG V3.0 and IRTPRO V3.0 differed in the degree to which they located items that fitted the various IRT models. Based on these results, the study concluded that assessing model-data fit using various statistical indices and multiple IRT programmes is imperative in choosing an IRT model.

RECOMMENDATIONS
From the findings and conclusion reached, the following recommendations were made:
1. That the selection of the best IRT model should depend on assessing item fit statistics, as the first step to applying IRT with confidence.
2. That various item fit statistics should be used, so that comparisons can be made for informed judgment and a variety of diagnostic evidence.
3. That more than one IRT programme should be used, to allow the choice of the programme that provides the most useful information about the real data set.