One critical yet currently under-addressed question in high-stakes, large-scale English Language Proficiency (ELP) assessment is whether performance on given items varies across test takers from different language backgrounds after controlling for test takers’ ability. More specifically, the issue is whether the relative advantage of a language group on a specific item can be attributed to the linguistic similarity of that language to English (Alderman & Holland, 1981; Chen & Henning, 1985; Kim, 2001; Lee et al., 2005; Ryan & Bachman, 1992). Differential item functioning (DIF) is typically examined within an exploratory framework, with an effort to interpret the results through a post hoc content review (Gierl, 2005). However, this approach has not been notably successful in providing substantive and consistent interpretations of statistical results or in identifying the underlying causes of DIF (Standards, 1999). Within an exploratory framework, researchers cannot isolate the role of language, and its potential interaction with the linguistic features of test items, from other confounding and theoretically relevant factors such as contextual features of a given test item and sociocultural characteristics of test takers. To overcome these interpretive challenges, Douglas, Roussos, and Stout (1996) proposed Differential Bundle Functioning (DBF) as a framework for interpreting group differences in item performance. A clear advantage of this framework is that researchers may use either statistical methods or content analysis to define dimensionally homogeneous item bundles before conducting DIF analysis, permitting direct tests of theoretically important hypotheses about the interaction of particular item features with home language (Douglas, Roussos, & Stout, 1996; Gierl, 2005; Gierl, Bisanz, Bisanz, & Boughton, 2003).
The current study uses this confirmatory DBF approach to identify how the linguistic features of items on a high-stakes, large-scale ELP reading assessment may elicit differential performance among students from different native language backgrounds. Home language and test data from three states that participate annually in the assessment are used for analysis. Data from students in Grades 1-12 are divided into four grade-level clusters: Grades 1-2, Grades 3-5, Grades 6-8, and Grades 9-12. Home languages are classified as either “Close” or “Distant,” depending on their linguistic proximity to English: Close languages include Germanic and Romance languages, and Distant languages include other Indo-European and non-Indo-European languages. Confirmatory DIF analysis will be conducted using item bundles identified by two linguistic experts through linguistic analysis of the test items. These experts will examine the linguistic features of the item stems and answer options, as well as the parts of the reading passages required to answer each item, and will judge whether characteristics of test takers’ home languages may give them an advantage in answering the items correctly. Given the complexity of assessing ELP, statistical methods are unlikely to produce meaningful item bundles that delineate the similarities and differences between Close and Distant languages; using expert human judgment to identify item bundles is therefore the more appropriate method. After the item bundles are identified, DIF analysis will be conducted on them using both Mantel-Haenszel statistics (Holland & Thayer, 1988) and SIBTEST (Stout & Roussos, 1995).
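To make the Mantel-Haenszel approach concrete, the sketch below computes the MH common odds ratio and its ETS delta transformation for a single item from score-stratified 2×2 tables. The counts are entirely hypothetical and the code is an illustration of the statistic, not the operational analysis described in this study.

```python
import math

# Hypothetical counts for one item. Each stratum matches reference ("Close")
# and focal ("Distant") examinees on total test score.
strata = [
    # (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    (40, 10, 30, 20),
    (60, 20, 50, 30),
    (80, 10, 70, 20),
]

def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio pooled across score strata."""
    num = den = 0.0
    for a, b, c, d in strata:      # a, b: reference group; c, d: focal group
        n = a + b + c + d          # stratum total
        num += a * d / n
        den += b * c / n
    return num / den

alpha_mh = mh_odds_ratio(strata)
# ETS delta scale: negative values indicate the item favors the reference group.
delta_mh = -2.35 * math.log(alpha_mh)
print(round(alpha_mh, 3), round(delta_mh, 3))
```

In ETS practice, the magnitude of delta (together with a significance test) is what places an item in the negligible, moderate, or large DIF categories that correspond to the “weak” and “strong” DIF labels used here.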
Results will be compared to those obtained using exploratory DIF analysis techniques typical of operational test environments. An initial exploratory DIF analysis using a subset of these data indicated that, while exploratory analysis with home language as the grouping variable was effective at identifying items exhibiting DIF, some items produced different results when ethnicity was used as the grouping variable. One item moved from evidence of weak DIF to evidence of strong DIF when home language was compared with ethnicity, and another moved from a strong DIF categorization to a weak one across the two grouping strategies. These differences were not readily explained in a post hoc analysis, highlighting the need to identify potentially influential linguistic features in a principled way that might, in turn, inform future test development procedures.