Discrimination in grading (2012)
Venue: American Economic Journal: Economic Policy
BibTeX
@ARTICLE{Hanna12discriminationin,
author = {Rema N. Hanna and Leigh L. Linden},
title = {Discrimination in grading},
journal = {American Economic Journal: Economic Policy},
year = {2012},
pages = {146--68}
}
Abstract
We report the results of an experiment that was designed to test for discrimination in grading in India. We recruited teachers to grade exams. We randomly assigned child "characteristics" (age, gender, and caste)
Numerous studies have documented what is known as the Pygmalion effect, in which students perform better or worse simply because teachers expect them to do so (see, for example, Rosenthal and Jacobson, 1968). In the modern education system, such expectations are set not just by teachers but by a range of evaluators, many of whom have no direct contact with the student, such as admissions officers or the anonymous graders of national and standardized exams. Of particular concern is whether the resulting experiences of students differ systematically based on observable characteristics, like minority status and gender. Such discrimination could have long-lasting effects, by reinforcing erroneous beliefs of inferiority (Steele and Aronson, 1995, 1998; Hoff and Pandey, 2006) and discouraging children from making human capital investments (Mechtenberg, 2009; Tajfel, 1970; Arrow, 1972; Coate and Loury, 1993). Additionally, since such external evaluations are often used to determine access to academic opportunities like competitive schools and higher education, such discrimination could directly block access to these important resources.
Unlike teaching, however, external evaluations take place away from the classroom, making it feasible to restrict the information available to evaluators. Teachers can often deduce the race of a student from physical characteristics observed in the classroom, but this information can be removed from an exam, for example, before it is graded.
Thus, concerns have entered the discussions on grading standards both because the expectations conveyed through them affect student achievement (Figlio and Lucas, 2004) and because more formalized grading strategies may result in less equitable distributions of scores (Brennan et al., 2001; Gallagher, 1998).
Unfortunately, it is difficult to empirically test whether discrimination exists. Disadvantaged minorities, by definition, come from disadvantaged backgrounds with many characteristics that are associated with poor academic performance: few educational resources in schools, low levels of parental education, etc. Thus, it is hard to understand whether children from minority groups perform worse due to discrimination or due to other characteristics. Moreover, as
In this study, we designed an experiment to investigate discrimination in grading. We implemented an exam competition in which we recruited children to compete for a large financial prize (58 USD or 55.5 percent of the parents' monthly income). We then recruited local teachers and provided each teacher with a set of exams. We randomly assigned the child "characteristics" (age, gender, and caste) to the cover sheets of the individual exams that were to be graded by the teachers in order to ensure that there would be no systematic relationship between the characteristics observed by the teachers and the quality of the exams. Therefore, any effect of the randomized characteristics on test scores can be attributed to discrimination.
Within the education literature, our work builds upon a rich body of research in the United States that evaluates teachers' perceptions of African American and female students (see Ferguson, 2003, for a thorough literature review). Our methods closely correspond to recent field experiments that have measured racial discrimination in labor market settings, typically in the hiring of actual applicants.
The researchers either have actual individuals apply for jobs (Fix and Struyk, 1993) or they may submit fictitious job applications to actual job openings (Bertrand and Mullainathan, 2004; Banerjee, Bertrand, Datta, and Mullainathan, 2009; Siddique, 2008). Under both strategies, the "applicants" are statistically identical in all respects, except for race or caste group. Unlike pure laboratory experiments, in which individuals are asked to perform assessments in a consequence-free environment, an advantage of these experiments is that they measure the behavior of actual employers making real employment decisions.
The early literature on discrimination in grading practices focuses on small-scale lab experiments. Subjects were asked to hypothetically evaluate tests, essays, or other student responses for which the researcher has experimentally manipulated the characteristics of the student to whom the work is attributed. Many of these early studies find evidence of discrimination: for example, DeMeis and Turner (1978) find discrimination against African Americans, while Jacobson and Efferts (1974) find evidence of reverse discrimination, with unsuccessful females being criticized less harshly than males when failing a leadership task. However, this literature also finds evidence that discrimination varies by who does the grading (Coates, 1972; Lenney, Mitchell, and Browning, 1983), the type of work being evaluated (Wen, 1979), and the underlying quality of the individual's application (Deaux and Taynor, 1973). Compared to our methodology, many of these older studies have limited sample sizes and ask graders to assign hypothetical grades. Like the labor market studies, our design places graders in an environment in which their grades have a material effect on the well-being of a child because the graders know they determine the awarding of the prizes.
The second, more recent, strand of the literature compares scores obtained from non-blind grading to scores awarded under blind grading using observational data. Outside of the education context, Goldin and Rouse (2000) find that the adoption of blind auditions for symphony orchestras increases the proportion of women hired. Blank (1991) finds no evidence of gender discrimination when submissions to The American Economic Review are refereed with or without knowledge of the author's identity. Much of this literature tends to find results that contradict the earlier experimental evidence from the lab, finding no discrimination for minority students (Shay and Jones, 2006; Dorsey and Colliver, 1995; Baird, 1998; Newstead and Dennis, 1990). Recent exceptions include Lavy (2008), which finds that blind evaluations actually help male students, and Botelho, Madeira, and Rangel (2010), who find evidence of discrimination against black children in Brazil. While these studies provide important evidence, the same exams are usually not graded by the same grader or even using the same grading framework, requiring the researcher to infer differences in grading practices by comparing the distribution of scores between two different measures of student performance. We compare the same exams graded under non-blind grading, holding the individual grader and all but the characteristics of the student constant.
On the whole, we find evidence of discrimination against lower-caste children. Teachers give exams that are assigned to "lower caste" scores that are about 0.03 to 0.08 standard deviations lower than the scores they give exams that are assigned to "high caste." These differences are practically very small. They represent, at most, a difference in exam scores of 1.5 percentage points, and given the observed test score distribution, a reduction in score of this magnitude would only slightly change a student's rank in the distribution.
On average, we do not find any evidence of discrimination by gender or age. The data appear consistent with statistical discrimination. Graders tend to discriminate more against children who are graded early in the evaluation process, suggesting that graders utilize demographic characteristics when the testing instrument or grade distribution is more uncertain. If the graders were purely taste-discriminating, there would be little reason to expect that discrimination would vary by the order in which they graded the exam. We find no evidence that the subjectivity of the test mattered: in fact, graders made "less subjective" subjects, such as math, "more" subjective by being generous with partial credit. Finally, we do not find evidence of in-group bias on average. In fact, we observe the opposite, with discrimination against the low-caste children being driven by low-caste graders, and graders from the high-caste groups appearing not to discriminate at all (even when controlling for the education and age of grader).
Taken together, these findings offer new insights into discrimination in grading. First, the results suggest that if discrimination exists in the subtle grading of an exam, other more blatant forms of discrimination may exist in the educational system as well. Second, we shed light on the channels through which discrimination operates, so that these findings can help inform the design of future anti-discrimination policies. For example, given that the graders appear to statistically discriminate, policies aimed at making graders more confident in the testing techniques may, perhaps, reduce the dependence on child characteristics while grading.
The paper proceeds as follows. Section II provides some background on caste discrimination and education in India, and articulates our conceptual framework. Section III describes the methodology, while Section IV describes the data. We provide the results in Section V. Section VI concludes.
II. BACKGROUND AND CONCEPTUAL FRAMEWORK
A. Caste Discrimination in India
In India, individuals in the majority Hindu religion were traditionally divided into hereditary caste groups that denoted both their family's place within the social hierarchy and their professional occupation. In order of prestige, these castes were the Brahmin, Kshatriya, Vaishya, and Shudra, respectively denoting priests, warriors/nobility, traders/farmers, and manual laborers. In principle, individuals are now free to choose occupations regardless of caste, but like race in the United States, these historical distinctions have created inequities that still exert powerful social and economic influences. Banerjee and Knight (1985), Lakshmanasamy and Madheswaran (1995), and Unni
B. Conceptual Framework
We explore three main theories of discrimination in this paper. First, we aim to distinguish between behaviors that are consistent with taste-based models of discrimination, in which teachers may have particular preferences for individuals of a particular group or characteristic (Becker, 1971), and statistical discrimination, in which teachers may use observable characteristics to proxy for unobservable skills (Phelps, 1972; Arrow, 1972). One might think that the process of grading would limit statistical discrimination in practice, as teachers observe a measure of skill for the child, i.e., the actual performance on the exam. However, this may not be the case. First, the teacher may be lazy and may not carefully study each exam to determine its quality. Instead, he or she may just use the demographic characteristics as a proxy for skill. Second, teachers may statistically discriminate if they are not confident about the testing instrument. In particular, teachers may be unsure as to what the final distribution of grades "should" look like, and therefore, they may not know how much partial credit to give per question.
Thus, teachers may use the demographics, not as a signal of performance, but rather as a signal of where the child should place in the distribution. Our design allows us to test the different implications of these models.
Second, we explore whether discrimination is more likely to occur in subjective subjects. The introduction of objective tests (particularly multiple choice exams) has been championed as a key method for reducing teacher discrimination. However, these types of tests are not without their detractors, particularly because objective exams are limited in their ability to capture certain types of learning (see, for example, Darling-Hammond, 1994; Jae and Cowling, 2008). We explore whether teachers are less likely to discriminate when grading exams in relatively objective subjects (like math and Hindi) than subjective subjects (like art).
There are very few empirical papers that have tested for the presence of statistical and/or taste-based discrimination. These include, but are not limited to: Altonji and Pierret (2001), which finds evidence of statistical discrimination based on schooling, but not race; Han (2004), which performs a test for taste-based discrimination in the credit market and cannot reject the null hypothesis of the non-existence of taste-based discrimination; Levitt (2004), which finds some evidence of taste-based discrimination against older individuals; and List (2004), which finds evidence of statistical discrimination in the sports cards market.
For example, teachers' beliefs about the average characteristics and capabilities of children from different castes may be influenced by their own membership in a particular caste. One might imagine that lower-caste teachers would be less likely to use caste as a proxy for performance given their intimate experience with low-caste status, or alternatively that they might be partial towards people from their own social group.
However, there are arguments against in-group bias: for example, low-caste teachers may have internalized a belief that different castes have different abilities, and thus such teachers may discriminate more against low-status children. In laboratory experiments, subjects often exhibit behaviors that are consistent with in-group bias.4 We explore whether low-caste teachers are more likely to discriminate in favor of low-caste children.
4 A series of experiments in the psychology literature have found that individuals displayed in-group bias even in artificially constructed groups (Vaughn, Tajfel, and Williams, 1981) or groups that were randomly assigned (Billig and Tajfel, 1973). Turner and Brown (1976) studied "in-group bias" when "status" is conferred to the groups, and found that while all subjects were biased in favor of their own group, the groups identified as superior exhibited more in-group bias. More recently, Klein and Azzi (2001) also find that both "inferior" and "superior" groups gave higher scores to people in their own group. In addition, using data from the game show "The Weakest Link," Levitt (2004) finds some evidence that men vote more often to remove other men and women vote more for women.
III. METHODOLOGY AND DATA
A. Experimental Design
The experiment comprises three components: child testing sessions, the creation of grading packets, and teacher grading sessions. Each component is described in depth below.
Child Testing Sessions. In April 2007, we ran exam tournaments for children between seven and 14 years of age. Our project team went door to door to invite parents to allow their children to attend a testing session to compete for a 2,500 INR prize (about 58 USD).5 Families were informed that the prizes would be distributed to the highest scoring child in each of the two age groups (7 to 10 years of age, and 11 to 14 years of age), that the exams would be graded by local teachers after the testing sessions, and that the prize would be distributed after the grading was complete. The prize is relatively large, given that the parents earn an average of 4,500 INR per month (104 USD).6
Over a two-week period, 69 children attended four testing sessions. The sessions were held in accessible locations such as community halls, empty homes, or temples to ensure that they did not conflict with the school day and that parents would be able to accompany their children. During the testing sessions, the project team obtained informed consent and then administered a short survey to the parents in order to collect information on the child and the basic demographic characteristics of the family.
Next, the project team administered the exam. We included questions that tested standard math and language skills, as well as an art section. Math was selected as the most objective section, covering counting, greater than/less than, number sequences, addition, subtraction, basic multiplication, and simple word problems. Language, which was chosen to be the intermediately objective section, included questions on basic vocabulary, spelling, synonyms, antonyms, and basic reading comprehension. Finally, the art section was designed to be the most subjective: children were asked to draw a picture of their family doing their favorite activity and then to explain the activity. The exam took about 1.5 hours. All parents and children were told that they would be contacted with information about the prize when grading was complete.
5 For recruitment, our project team mapped the city, collecting demographic information about each community. To ensure that children of varying castes would be present at each session, the team then recruited from neighborhoods with many caste groups or from several homogenous caste neighborhoods.
6 The formula for awarding the prize affects the probability that a given child will benefit from the competition. If teachers used this information in conjunction with their initial assessments of an exam's quality, the prize structure may even affect the level of discrimination experienced by different students. For example, our mechanism makes the grading of higher quality exams more important than the grading of low quality exams because only the highest quality exams can receive the prize. It is possible that graders may make an initial (though noisy) assessment of an exam and then decide how much effort to spend grading. They may even choose to rely more on stereotypes when grading exams that they believe have no chance of winning. This is an important question for future research.
Randomizing Child Characteristics. Typically, one can only access data on the actual grades teachers assign to students whose characteristics the teachers know. This makes it difficult to identify what grade the teacher would have assigned had another child, with different socioeconomic characteristics, completed the same exam in an identical manner. To solve this problem, we randomized the demographic characteristics observed by teachers on each exam so that these characteristics are uncorrelated with exam quality. (Henceforth, we refer to the characteristics that are randomly assigned as the "assigned characteristics" and the characteristics of the child actually giving the exam as the "actual characteristics".) Thus, any correlation between the assigned characteristics and exam scores is evidence of discrimination. Each teacher was asked to grade a packet of exams.
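The identification argument above can be illustrated with a small simulation. This is a minimal sketch and not the authors' estimation procedure: the binary low-caste label, the penalty of 0.05 standard deviations, the noise level, and the function name `simulate_grading` are all hypothetical choices made for illustration. Because the label is assigned at random, it is unrelated to exam quality in expectation, so the gap in grader scores net of a blind quality measure should recover the penalty.

```python
import random
import statistics


def simulate_grading(n_exams=20000, penalty=0.05, seed=0):
    """Simulate randomly labeled exams (all parameters are hypothetical)."""
    rng = random.Random(seed)
    # Underlying exam quality in standard-deviation units.
    quality = [rng.gauss(0.0, 1.0) for _ in range(n_exams)]
    # Cover-sheet label drawn independently of quality, as in a randomized design.
    low_caste = [rng.random() < 0.5 for _ in range(n_exams)]
    # Grader score: quality, minus a penalty when labeled low caste, plus noise.
    scores = [
        q - (penalty if lc else 0.0) + rng.gauss(0.0, 0.1)
        for q, lc in zip(quality, low_caste)
    ]

    def gap(values):
        """Mean for low-caste-labeled exams minus mean for the rest."""
        low = [v for v, lc in zip(values, low_caste) if lc]
        high = [v for v, lc in zip(values, low_caste) if not lc]
        return statistics.mean(low) - statistics.mean(high)

    quality_gap = gap(quality)  # near zero: the label carries no quality signal
    net_gap = gap([s - q for s, q in zip(scores, quality)])  # near -penalty
    return quality_gap, net_gap


quality_gap, net_gap = simulate_grading()
print(f"quality gap: {quality_gap:+.3f}, score gap net of quality: {net_gap:+.3f}")
```

Subtracting the simulated quality measure is loosely analogous to using a blind score as a control; the raw score gap would target the same quantity, only with more sampling noise.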
To form these packets, each completed test was stripped of identifying information, assigned an ID number, and photocopied. Twenty-five exams were then randomly selected to form each packet, without replacement, in order to ensure that the teacher did not grade the same photocopied test more than once. Each exam in the packet was then given a coversheet, which contained the randomly assigned characteristics: child's first name, last name, gender, caste information, and age.7 Each exam was graded by an average of 43 teachers.
As explained in Section I, one of the main limitations of existing studies that compare blindly and non-blindly graded exams is that they have to compare across different graders using different grading standards. We designed our study to allow for the inclusion of grader fixed effects by stratifying the assignment of the exams and assigned child characteristics to ensure an equal distribution for each grader. Since many last names are caste specific, we randomized the last name and the caste together. Similarly, first name and gender were randomized together.8
The assigned characteristics were each drawn from an independent distribution. Caste was assigned as follows: 12.5 percent of the exams were assigned each to the highest caste (Brahmin) and the next caste (Kshatriya), while 50 percent of the exams were assigned to the Vaishya caste and 25 percent were assigned to the Shudra caste. For each teacher, we sampled the child's name without replacement so that the teacher did not grade two different exams from the same child.9 We randomly selected the ages of the students from a uniform distribution between eight and 14, and ensured that assigned gender was split equally between males and females.
7 We also include caste categories (General, Other Backward Caste, Scheduled Caste, and Scheduled Tribe), which are groupings of the caste. We find small effects of discrimination against the lower categories, but while the magnitude is the same across all coefficients, it is only statistically significant when including the blind test score. Disaggregating by category, the effect is driven by the scheduled caste category. Given the overlap between categories and caste, we cannot isolate different effects between these two groupings.
8 This strategy has the advantage of consistently conveying caste. It does prevent us from identifying the specific channel through which teachers get the information. It may be possible, for example, that the name alone is enough to convey caste.
9 In addition to being classified into the four large castes, Indian citizens can also be assigned to several affirmative action categories. These are Scheduled Tribe, Scheduled Caste, and Other Backward Castes. The purpose of the distribution of castes was to ensure variation in both caste and the caste categories to which children could be assigned. These categories are restricted to the lowest two castes. The result of ensuring equal distribution among each category was that 75 percent of exams were assigned to the lowest two castes.
Teacher Grading Sessions. We next recruited teachers to grade the exams. We obtained a listing of the city's schools from the local government and divided them into government and private schools. For each category, we ranked the schools using a random number generator. The project team began recruitment at the schools at the top of the list and approached schools until they obtained the desired number of teachers.10 The recruitment proceeded as follows: First, the project team talked with the school's headmaster to obtain permission to recruit teachers. Once permission was obtained, the team invited teachers to participate in a study to understand grading practices, where they were told that they would grade twenty-five exams in return for a 250 INR (about 5.80 USD) payment.
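The school ordering step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project team's actual procedure: the function name `recruit_teachers`, the per-category quota, and the mapping from schools to agreeing teachers are hypothetical, since the text does not say how the target number of teachers was split between government and private schools.

```python
import random


def recruit_teachers(schools, quota, agreeing_teachers, rng):
    """Visit schools in random order until `quota` teachers are recruited.

    `agreeing_teachers` maps a school to the teachers there who accept;
    schools missing from the mapping contribute nobody (hypothetical input).
    """
    ranked = list(schools)
    rng.shuffle(ranked)  # random ranking of the schools in this category
    recruited, visited = [], 0
    for school in ranked:
        if len(recruited) >= quota:
            break
        visited += 1
        remaining = quota - len(recruited)
        recruited.extend(agreeing_teachers.get(school, [])[:remaining])
    return recruited, visited


rng = random.Random(0)
government = [f"gov_school_{i}" for i in range(10)]
# Hypothetical: every other school has two teachers who agree to grade.
agreeing = {s: [f"{s}_teacher_a", f"{s}_teacher_b"] for s in government[::2]}
teachers, schools_visited = recruit_teachers(government, 4, agreeing, rng)
print(len(teachers), "teachers recruited after visiting", schools_visited, "schools")
```

Running the same function once per category (government, private) with a separate list and quota would mirror the two-list setup described in the text.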
The team also informed the teachers that the child who obtained the highest overall score would receive a prize worth 2,500 INR (about 58 USD). This prize was designed to ensure that the grades had real effects on the well-being of the children, just as the grades assigned by external graders also have a direct impact on things like the receipt of a scholarship or school admissions. In total, the project team visited about 167 schools to recruit 120 teachers, 67 from government schools and 53 from private schools.11
Each grading session lasted about two hours. The project team provided the teachers with a complete set of answers for the math and language sections of the test, and the maximum points allotted for each question for all three test sections. The team went through the answer set question by question with the teachers. Teachers were told that partial credit was allowed, but the team did not describe how it should be allocated. Thus, the teachers were allowed to allocate partial credit points as they felt appropriate. Next, the teachers each received 25 randomly selected exams, with the randomly assigned cover sheets, to grade, as well as a "testing roster" to fill out. To ensure that teachers viewed the cover sheets, we asked them to copy the cover sheet information onto the grade roster. They were then asked to grade the exam and enter the grades onto the roster. When a teacher finished grading, the project team administered a short survey to the teacher, which was designed to learn their demographic characteristics and teaching philosophy.
After all the grading sessions were complete, we computed the average grade for each child across all teachers who graded his or her exam. We then awarded the prize to the highest scoring child in each of the age categories based on these average grades.
10 Overall, about half of the schools that were approached had teachers that agreed to participate. Generally, teachers cited being busy or a lack of interest as reasons for declining our offer.
11 These results may also have implications for the behavior of teachers in the classroom. However, the incentives in this study are, of course, not identical to those experienced in the classroom. In the classroom, teachers know much more about a child than is available on our cover sheets and teachers have the opportunity to interact repeatedly with students over the course of the school year.
B. Data Description
We collected two sets of exam scores. The first set includes the test scores generated by each teacher. In addition, a member of the research staff graded each exam on a "blind" basis, with no access to the original characteristics of the students taking the exam or any assigned characteristics. This was done to provide an objective assessment of the quality of the individual exam. Note that while the blind grading was meant to mimic the teacher's grading procedures, it was conducted by a project team member who may have graded differently from the teachers.
Finally, note that we normalized the exam scores in the analysis that follows in order to facilitate comparisons with other studies in the literature. Each section and the overall exam score are normalized relative to the distribution of the individual scores for the respective measure.12
In addition, we have data from two surveys. First, we have data from the parent survey, which contains information on the family's caste and the child's gender and age. Second, we have data from the teacher survey, which included basic demographic information, such as the teachers' religion, caste, educational background, age, and gender. In addition, we also collected information on the