Unorthodox Theory’s Rebuttal to the Ultimate Research Document Debunked

Vaush’s Ultimate Research Document and Rose Wrist Response Debunk

This document addresses the sources and arguments from Francis Black’s blog post responding to Vaush’s Ultimate Research Document. The general trend seems to be a mixture of genuine criticisms of methodological limitations in certain studies (such as some of the smaller studies included in much larger reviews or meta-analyses) combined with a whole heaping helping of cherry-picking, confirmation bias, motivated reasoning, and arguments from incredulity. Some of the counterarguments and counterexamples given are hilariously poor and/or only vaguely address the point being made. The sources are heavily outdated as well. Francis Black’s sources will likely not represent the current sociological trends of today. In fact, there are very few sources after 2010. They never include the works cited, making it more difficult to sift through the studies they use, because they either cherry-pick quotes or just lie about the findings. Basically, it’s race realism parading as analysis. I think it’s even more damning to take a look at the author’s bias here:
(I’m going to refer to the author as FB for short.)
(My responses are in red, underlined, and italicized.)

https://www.sentencingproject.org/wp-content/uploads/2015/11/Black-Lives-Matter.pdf
Extensive document on racial biases in our criminal justice system.
Studies seem to indicate about 61-80% of black overrepresentation in prisons can be explained by higher black crime rates, with the unexplained portion largely attributable to racial bias.
Remember – the factors which lead to disproportionate criminality amongst black Americans are also in large part a product of racial bias. Underfunded public programs, redlining, generational poverty, bad schooling, and myriad other factors which influence criminality can also be traced to racial bias.

The documents presented by The Sentencing Project attempt to compare arrests to % of prisoners in prison, or by looking at some measurement of crime and arrest rates. Since there is some unexplained variance, it’s attributed to racial bias without evidence. Let’s take a look at the first cited study in this document, Tonry and Melewski (2008). After comparing UCR data to prisoners by race %s, they found a 38.9% disparity that could not be explained, replicating a previous study that used the same crime database as them.

There is an important issue with this analysis and the other studies cited. The researchers are comparing black presence in prison to arrest rates, especially since they’re attempting to replicate a previous study with the same method. This is important since the number of blacks in prison is influenced by how many blacks go to prison in any given year, and using arrest rates from a particular date may overestimate the disparity. For example, say that in 2020, blacks made up 50% of prisoners, and the number of blacks sentenced to prison in 2021 was 5%. You decided to look at UCR data, you check to see how much of the variance can be explained by crimes when looking at the amounts of blacks in prison. To do this, you use UCR data from 2021 and then run your test. You find that a large percentage of the number of blacks in prison can not be explained by crimes. The error with this is that you’re comparing the presence of blacks in prison who could have been there for years, and then using crime arrest data from a single point in time and leading yourself to find a large disparity. Thus, your results could be heavily biased due to this. The 2nd study cited, Blumstein (1993), found a 20.5% disparity that was left unexplained with the same methodology as the first study. The 3rd study, Langan (1985), looked at NCS data, inmate surveys, and admission census. It was found that the amount of blacks in prison was higher than the number of black offenders reported in the NCS. In 1979 and 1982, 84% and 85% of the variance could be explained by black crime, leaving the rest to possible racial bias. As the paper says, there may be other explanations as to what explains the remaining disparity which is not racism; as the footnote says

“regional differences in the imposition of sanctions may account for these differences since blacks may be concentrated in regions of the country where prison sentences are relatively common among convicted offenders (blacks and whites alike). If that is the case, statistics for the nation could indicate that the probability of imprisonment is higher for blacks than whites even in the absence of racial discrimination in justice administration. Another explanation is that blacks may have on average slightly longer criminal records than whites, thereby increasing their chances of receiving a prison sentence. Other legitimate explanations of these percentages are discussed by Blumstein, supra note 2, at 1268-70.”

As can be seen, the author does not fully endorse racism as an explanation and offers alternative hypotheses. Like the studies cited before this one and the others that follow, the authors do not test to see if racial bias can explain the remaining disparity. All they do is argue that racism could be an explanation, but offer no statistical tests to see if they’re correct. This will be explained later, but let’s continue. The final study cited is Blumstein (1993), but I have been unable to find an online version of this paper to see what data was used. Regardless, none of these cited studies can prove racism is to blame. As was already noted up above, the researchers do not prove that the remaining disparity can be explained by racial bias. All they do is find that there is a disparity left, say it could be because of racism, and that’s it. No models are used to test this hypothesis, but readers gobble it up and say the remaining portion is “largely attributable to racial bias.” No statistical model is run, no test is used to see if this is true, just an interpretation of the data by the author. It’s up to the authors to actually prove that this is the case, not to assume it is. Readers, especially those who watch Vaush, may find what I just said asinine, but it’s the truth. Either run a model to test your racism hypothesis and see if the results confirm your hypothesis, or do not make the claim. Unfortunately, what I said doesn’t matter since the researchers have a fancy degree and they’re authority figures, so degrees matter more than evidence.

I’m pretty sure FB is not familiar with how literature reviews work because he’s describing the purpose of literature reviews. They cite preceding research. Although the source does cite another source that claims criminal histories may play a role which is in favor of the differential involvement hypothesis, the original source acknowledges that:

“The remainder might be caused by racial bias, as well as other factors like differing criminal histories.”

This means that it can be both. Vaush was just highlighting where they say it can be largely attributable to racial bias. FB spends paragraphs saying that attributing the disparity to racial bias is just an interpretation of the data, which no one has claimed otherwise. Interpreting the data is exactly that. If the disparity can't be explained by anything empirical, then that's an argument in favor of hidden bias. It's already bad that FB is childish with:

“Either run a model to test your racism hypothesis and see if the results confirm your hypothesis, or do not make the claim. Unfortunately, what I said doesn't matter since the researchers have a fancy degree and they're authority figures, so degrees matter more than evidence.”

Like, what would a model to test the racism hypothesis even look like? Studies cannot empirically verify intent. Put simply, you can't know from your five senses what another person is thinking. So the most accurate thing to say is that people don't know: they can't conclude whether someone has discriminatory intent or not, even if it can be explicitly stated. Basically, studies control for most things other than race, and so the hypothesis is that racial bias explains the remainder. If FB dismisses this hypothesis, does he simply prefer to stake out a position that completely lacks explanatory power over one that is entirely plausible?

To continue, though, studies looking at the NCVS and official crime databases yield essentially similar racial differences as do official statistics (La Free 1996

La Free only looks at arrest trends in UCR data, and it explicitly says that both UCR and NCVS data have limitations and weaknesses.

, in Hawkins 1995;

FB uses Hawkins 1995 to say that since NCVS and UCR reports may match, they accurately measure crime rates. However, the only time Hawkins 1995 even mentions the National Crime Victimization Survey is when citing Boggess and Bound (1993). Boggess and Bound (1993) literally says this (as Hawkins 1995 points out too):

“the large increase in the incarceration rate is attributable primarily to an increase in the likelihood of incarceration given arrest.”

Hawkins 1995 points out that this means it is not an increase in the crime rate.

Wilbanks 1987).

The thing about Wilbanks is that it cites Blumstein (1982) and its replication, Langan (1985). However, those studies have been countered by a growing body of evidence (Peterson and Hagan 1984; Myers and Talarico 1987; Bridges and Crutchfield 1988; Bridges, Crutchfield, and Simpson 1987). These studies show empirically that racial differences in crime and arrest rates contribute substantially less to racial differences in imprisonment than Blumstein's (1982) and Langan's (1985) work suggests. For example, Bridges and Crutchfield's (1988) multivariate analysis of state differences in imprisonment concludes:

“racial differences at arrest for serious criminal behavior may be coupled with differential treatment of minorities in the legal system.”

There are several limitations when comparing UCR and NCVS data. One shortcoming relates to whether the UCR and the NCVS measure the same phenomenon. A plethora of researchers find sizable differences between relative crime levels reported in the UCR and those obtained from the NCVS for 26 central cities (Booth, Johnson & Choldin 1977; O’Brien 1983; O’Brien, Shichor & Decker 1980).

Longitudinal analyses also show that for a variety of criminal offenses the UCR and NCVS trend in completely opposite directions (see O’Brien 1985).

Even then there are still problems: UCR statistics do not represent the actual amount of criminal activity occurring in the United States. As it relies upon local law enforcement agency crime reports, the UCR program can only measure crime known to police and cannot provide an accurate representation of actual crime rates. The UCR program is focused upon street crime, and does not record information on many other types of crime, such as organized crime, corporate crime or federal crime.

Further, law enforcement agencies can provide inadvertently misleading data as a result of local policing practices. These factors can lead to misrepresentations regarding the nature and extent of criminal activity in the United States. UCR data are capable of being manipulated by local law enforcement agencies. Information is supplied voluntarily to the UCR program, and manipulation of data can occur at the local level. The UCR tracks crime for the racial category of "White" to include both Hispanic and non-Hispanic ethnicities. According to the ACLU, with over 50 million Latinos residing in the United States, this hides the incarceration rates for Latinos vis-à-vis marijuana-related offenses, as they are considered "White" with respect to the UCR. NCVS statistics do not represent verified or evidenced instances of victimization. As it depends upon the recollection of the individuals surveyed, the NCVS cannot distinguish between true and fabricated claims of victimization, nor can it verify the truth of the severity of the reported incidents.

Also, the UCR is affected by the police “unfounding” crimes and by reporting errors. The UCR is not a mandatory program; therefore, not all law enforcement agencies report to the UCR, and not all follow the definition, classification, and reporting guidelines provided by the UCR program. The police practice of “unfounding” a crime may also produce discrepancies between the two programs. A sizeable number of crimes reported by people are deemed unfounded by police because of wrong reporting, misunderstanding of the law, or lack of prima facie evidence.

Further, the NCVS cannot detect cases of victimization where the victim is too traumatized to report. These factors can contribute to deficits in the reliability of NCVS statistics.

The NCVS program is focused upon metropolitan and urban areas, and does not adequately cover suburban and rural regions. This can lead to misrepresentations regarding the nature and extent of victimization in the United States.

The declining quality of the NCVS, especially after the methodological changes of 1992, is also a cause of spurious convergence. The National Academy of Sciences panel report clearly stated that reductions in sample size and in other quality control mechanisms in the survey have led to serious methodological inadequacy, and that the survey could no longer serve its function of providing valid annual estimates of crime victimization (Groves and Cork 2008). The sample size of the survey has been continuously decreasing due to flat or reduced funding. In the 1972 survey, 160,000 persons from 65,000 households were interviewed, and in 2006, when the sample size reached its lowest level, only 67,000 people from 38,000 households participated in the survey. Between 1993 and 2006, the number of households included in the sample decreased by 21% and the number of individuals who participated in the survey decreased by 28%.
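The scale of the shrinkage can be checked directly from the 1972 and 2006 figures quoted above. This is only a quick arithmetic sketch (the helper name `pct_decline` is mine, not from any source), and it uses a 1972 baseline, so the declines it reports are larger than the 21% and 28% figures, which are measured from 1993.

```python
def pct_decline(start: float, end: float) -> float:
    """Percentage decline from start to end (positive means a decrease)."""
    return (start - end) / start * 100

# Figures quoted in the text above (1972 vs. 2006 NCVS samples).
persons_1972, persons_2006 = 160_000, 67_000
households_1972, households_2006 = 65_000, 38_000

persons_drop = pct_decline(persons_1972, persons_2006)
households_drop = pct_decline(households_1972, households_2006)

print(f"Persons interviewed: {persons_drop:.1f}% fewer than in 1972")
print(f"Households sampled: {households_drop:.1f}% fewer than in 1972")
```

In other words, by these numbers the 2006 survey interviewed well under half as many people as the 1972 survey did.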

The UCR also applies the hotel rule and the hierarchy rule, which the NCVS does not follow.
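To see why a counting rule like this matters when comparing two data series, here is a minimal sketch of the hierarchy rule as I understand it: when one incident involves several offenses, only the most serious one is counted. The incidents are made up, and the severity ordering is a simplified assumption for illustration (the actual rule has exceptions that this ignores).

```python
from collections import Counter

# Simplified severity ordering, most serious first (illustrative assumption;
# the actual UCR rule has exceptions that this sketch ignores).
SEVERITY = ["homicide", "rape", "robbery", "aggravated_assault",
            "burglary", "larceny_theft", "motor_vehicle_theft"]
RANK = {offense: i for i, offense in enumerate(SEVERITY)}

# Hypothetical incidents, each listing every offense that occurred.
incidents = [
    ["robbery", "aggravated_assault"],
    ["burglary", "larceny_theft", "motor_vehicle_theft"],
    ["larceny_theft"],
]

def count_with_hierarchy(incidents):
    """Count one offense per incident: the most serious one."""
    return Counter(min(offenses, key=RANK.__getitem__) for offenses in incidents)

def count_all(incidents):
    """Count every offense in every incident."""
    return Counter(o for offenses in incidents for o in offenses)

hierarchy_counts = count_with_hierarchy(incidents)
all_counts = count_all(incidents)

print(sum(hierarchy_counts.values()), "offenses counted with the hierarchy rule")
print(sum(all_counts.values()), "offenses counted without it")
```

The same underlying incidents produce different totals depending on the rule, so two sources that count differently can disagree without either one being "wrong" about what happened.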

There are also sampling and non-sampling errors. The estimates of the NCVS are based on sample data, and confidence intervals are computed (standard error bars for black victimization tend to be ~2x higher), whereas in the case of the UCR, no adjustment is made for unreported data by different agencies or underreporting by police. The biggest source of non-sampling error in the NCVS is the telescoping effect (O’Brien, 1988). Although the bounding process is an effective methodological tool to minimize the effect of telescoping, respondents’ recall of victimization that occurred in the previous 6 months is not free from error. If the bounding process is removed and the reference period is increased from 6 months to one year, the non-sampling error due to telescoping will increase (Rand, 2007).

Also, the way the NCVS measures series victimization is another reason for the discrepancies. Under the series victimization rule, if a victim has experienced six or more similar but separate victimizations in the last 6 months and is unable to describe them separately, one report will be taken for the entire series of victimizations.
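A small sketch of how that rule, as described above, can shrink the recorded count. The respondents are hypothetical and the function name is mine; the only thing taken from the description is the "six or more, cannot describe separately" condition.

```python
SERIES_THRESHOLD = 6  # "six or more" similar victimizations, per the description above

def recorded_reports(n_incidents: int, can_describe_separately: bool) -> int:
    """Number of reports taken for one respondent under the series rule."""
    if n_incidents >= SERIES_THRESHOLD and not can_describe_separately:
        return 1  # the whole series is taken as a single report
    return n_incidents

# Hypothetical respondents: (number of victimizations, can describe each one?)
respondents = [(2, True), (8, False), (6, True), (10, False)]

actual_total = sum(n for n, _ in respondents)
recorded_total = sum(recorded_reports(n, ok) for n, ok in respondents)

print("victimizations experienced:", actual_total)
print("reports recorded:", recorded_total)
```

The gap between the two totals comes entirely from the respondents with the most repeat victimization, which is exactly where the rule applies.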

Also, the victimization survey, which began in 1973, revealed that less than 50% of offenses are reported through the UCR program (Skogan, 1974).

Menard and Covey (1988) used the UCR and NCVS raw data from 1973 to 1982 and conducted trend analysis to see whether the two series were correlated or converging over time.

Additionally, Menard (1992) suggested using test-retest or Cronbach’s alpha to establish reliability of the correlations between the two measures. A reliability score of less than .8 would mean that the two series do not measure the same phenomenon. No study has done this.
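For reference, this is roughly what such a check could look like. It is a minimal sketch of the standard Cronbach's alpha formula applied to two series treated as two "items"; treating the series this way is my assumption about how the suggestion would be applied, and the numbers are invented, not real UCR or NCVS rates.

```python
from statistics import variance

def cronbach_alpha(*series):
    """Cronbach's alpha for k equally long series treated as 'items'."""
    k = len(series)
    if k < 2:
        raise ValueError("need at least two series")
    # Year-by-year totals across the series.
    totals = [sum(values) for values in zip(*series)]
    item_var = sum(variance(s) for s in series)
    return k / (k - 1) * (1 - item_var / variance(totals))

# Hypothetical yearly rates, NOT real UCR or NCVS data.
ucr_like = [1.0, 1.2, 1.5, 1.4, 1.8, 2.0]
ncvs_like = [2.0, 1.9, 1.7, 1.8, 1.6, 1.5]

alpha = cronbach_alpha(ucr_like, ncvs_like)
verdict = "same phenomenon" if alpha >= 0.8 else "not the same phenomenon"
print(f"alpha = {alpha:.2f} -> {verdict} under the 0.8 rule of thumb")
```

Two series that trend in opposite directions, like the made-up ones here, fall far below the .8 threshold, while two identical series give an alpha of exactly 1.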

Convergence as correlated rates has been the most popular definition to use with cross-sectional and time-series data. The first problem associated with this definition is the issue of detrending and differencing, especially when time-series data are used. The second problem is the threshold of the correlation coefficient for determining whether the two series are converging. Since this is the most popular definition of convergence, it is helpful to see how important research that adopted it has explored convergence and what conclusions have been drawn as a result. Studies using the correlated rates definition have used cross-sectional and time-series data to compare the rates of different aggregated and disaggregated crime categories of the UCR and NCVS (McDowall & Loftin, 2007).
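The detrending and differencing problem is easy to show with a toy example. In the sketch below, two invented series both rise over time, so they correlate strongly in levels, but their year-to-year changes move in opposite directions. The data and function names are hypothetical and only meant to illustrate why the choice matters.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def first_difference(series):
    """Year-to-year changes, which removes a shared linear trend."""
    return [b - a for a, b in zip(series, series[1:])]

# Hypothetical series: both trend upward, but their yearly changes alternate
# in opposite directions.
series_a = [10, 13, 14, 17, 18, 21, 22, 25]
series_b = [20, 21, 24, 25, 28, 29, 32, 33]

r_levels = pearson(series_a, series_b)
r_diff = pearson(first_difference(series_a), first_difference(series_b))

print(f"correlation in levels:            {r_levels:.2f}")
print(f"correlation of first differences: {r_diff:.2f}")
```

A study that reports only the correlation in levels would call these two series convergent, while one that differences first would reach the opposite conclusion.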

Black people are also more likely to report anyway (Skogan 1984).

https://repository.library.northeastern.edu/files/neu:1028/fulltext.pdf

For example, after examining data over a 3-year period, O’Brien (2001,^[1]

As I was looking for O'Brien, 2001, I did come across this thing, which seemed pretty informative on the disadvantages of UCR vs NCVS (e.g. NCVS data severely underestimates burglaries, robberies, theft, and vandalism):^[2]
The whole UCR vs NCVS stuff is already making me confused, and I haven't even gotten into the meat of everything else. They seem to claim that NCVS data shows similar racial differences in crimes committed as racial differences in arrest rates from the UCR data, which is supposed to be an argument showing that the racial arrest disparities shown in UCR data are due to higher black criminality, not racial biases in the criminal justice system (since NCVS is survey data). But at the same time, I thought UCR data was used in conjunction with other data to demonstrate that ~20% of the disproportionate incarceration of black individuals is due to racial biases ... So if the NCVS data shows almost identical results to the UCR data ... doesn't this just support what was originally being said in the "Ultimate Research Document"? Rather, at the very least, it doesn't seem to contradict anything. Again, I get it's because they want to show that self-report data shows that black people commit crime at rates almost identical to what the arrest rates would indicate from UCR data, but the point from the research document was about incarceration rates and not arrest rates ... So why did they present this as some kind of rebuttal? These are two separate points.

1. UCR data represents compiled police reports and arrest data from voluntarily participating law enforcement agencies.

2. NCVS data represents a survey requesting information on crimes committed against individuals/households.

3. If the NCVS data lines up with the UCR data, that means that arrest rate data is probably a very good proxy of crime differences between different races as opposed to racist law enforcement. For example (these numbers aren't real, they're just hypothetical): if UCR data indicates that 60% of assaults are committed by black people and NCVS data indicates 61% of assaults are committed by black people, the UCR data acts as a good proxy for crime committed by different races.

4. So if we look at the proportional racial differences in crimes committed from the arrest data (which we just confirmed is a good proxy), then we can see how that matches up with the incarceration data to see if there are any disparities there.

5. As the Harris study's data shows, somewhere between 75-80% of the disproportionate incarceration of black people can be explained by the arrest data (which we just confirmed is a good proxy for racial differences in crimes committed).

C: The NCVS matching up with the UCR implies that arrest data is a good proxy for crimes committed by racial groups, so if there is a disparity between arrest data and incarceration rates, then some proportion of incarceration is not explained by racial differences in crime committed. The NCVS matching up with the UCR does not imply that this unexplained proportion is due to racial differences in crime. In fact, it implies the opposite^[a]: the NCVS data lining up with UCR data seems to directly contradict the idea that "dark crime" could explain the incarceration-arrest disparity, since in theory NCVS should account for at least some of this "dark crime". So if the only potential confounding variable offered by the author of the blog is "dark crime", and the blog states that the unexplained variance being due to systematic racism is unlikely because "official crime reports match up with crime arrests", then wouldn't this all be false from the information we've just discussed? Crime reports matching up with crime arrests is kind of agnostic with respect to racial bias affecting incarceration rates. What it does show is that there is a possibility of post-arrest racial bias. So at best it limits where the racial bias can be occurring to the post-arrest stages.

Also, I took a look at one of the studies that's supposedly arguing for the 61-80% variance statistic, and it didn't solely look at UCR data. It also seemed to factor in NCVS data.^[3]

in Walsh and Ellis 2006) found that crime reports matched up with crime arrests for race, meaning that blacks were being arrested at the same rate that they committed crimes

Walsh & Ellis 2006 cited D’Alessio & Stolzenberg 2003 and the NIBRS as evidence that “crime reports matched up with crime arrests for race”. However, there are several problems with D’Alessio and Stolzenberg 2003 and the NIBRS (as I describe further below). While the examination of solo offending incidents is most analogous to D’Alessio and Stolzenberg 2003’s research on race and arrest, the sample is not necessarily directly comparable to co-offending incidents. That is, co-offending incidents may be substantively distinct in some way from solo offending incidents.

For example, when analyzing racial differences in the likelihood of arrest within co-offending partnerships, the estimated relationship between race and arrest is substantially different. Under this approach, the primary estimate of interest is the within-incident race estimate (i.e., the level-one estimate). The results from this analysis indicate that, within partnerships and net of controls, while other-race offenders do not have a statistically different arrest likelihood, black offenders are actually slightly more likely than their white co-offending partners to be arrested for a violent offense. Put another way, when black and white offenders co-offend, and an arrest is made, the arrest is more likely to involve the black co-offender than the white co-offender.^[4]

Another advantage of this approach is the ability to examine how the relationship between race and arrest varies in the context of a co-offending partnership. So when a black offender co-offends with a white offender, the black co-offender is slightly more likely than the white co-offender to be arrested.

There are also many other problems with D’Alessio & Stolzenberg 2003 utilizing NIBRS data:

1. Incomplete agency coverage is the main disadvantage of the NIBRS data relative to other official statistics such as the UCR Summary System and SHR. When the NIBRS program was first put into practice, law enforcement agencies in South Carolina submitted incident-level data as a pilot program. In 1991, the first year of actual implementation, only two states were certified to have their agencies submit incident-based data (Barnett-Ryan 2007). By 2005, participation grew to include 31 states, though it is still the case that not all police agencies in participating states report data to the NIBRS. For example, the District of Columbia participates in the NIBRS, but only the Transit Police report data, not the DC Metropolitan Police.

Due to the voluntary nature of participation, NIBRS agencies do not represent a random sample of American police agencies, and NIBRS incidents are not a random sample of U.S. crime incidents (Addington 2007b).

According to Chilton and Regoeczi’s (2007) comparison of data from NIBRS and non-NIBRS agencies, southern agencies are overrepresented and western agencies are underrepresented in the NIBRS. Along with this lack of regional representativeness, the distribution of agency size among NIBRS participants is not representative of all police agencies. Criminologists are keenly interested in crime statistics from large police departments representing urban and high crime jurisdictions, but most such departments are absent from the NIBRS. This is not simply because overall participation in the NIBRS is low, but also because large police agencies serving populations over 250,000 are especially underrepresented in the NIBRS (Addington 2008). Chilton and Regoeczi (2007) found that the mean population served by NIBRS agencies is almost 30 percent less than that of non-NIBRS agencies. One reason for the slow adoption of the NIBRS in large agencies may be that such agencies often already have their own incident-based crime recording system, optimized for local needs (Maxfield 1999).

This is all concerning as the non-nationally representative coverage of agencies by the NIBRS results in bias. Addington’s (2008) analysis found that bias in crime rates calculated from the NIBRS sample, measured by differences from rates calculated from the entire sample of agencies participating in the UCR Summary System, can be meaningfully large. Bias seems especially prominent for change estimates over 2 years, and in estimated violent crime rates for the set of large city agencies.

2. Challenges may be inherent in NIBRS in that it remains a secondary data source (Dunn & Zelenock, 1999) that inevitably does not contain the specificity of variables or response categories that many analysts and researchers would like to examine in granular detail for specific topics of interest. This challenge sometimes, at least partially, can be tempered by a careful reading of the documentation for the data being utilized. Unfortunately, often the clarity of documentation for secondary data sources is found lacking. In the case of NIBRS, the documentation and resource literature regarding its proper use and limitations have become more abundant over the years.

3. Another challenge is often encountered when NIBRS data are thought to represent a police process that may not be occurring or may not feasibly occur when responding to calls for service involving criminal victimizations.

While the data are generated from submissions by thousands of agencies across the country, the underlying day to day processes that are the foundation of police record keeping vary widely from jurisdiction to jurisdiction and NIBRS, like its predecessors (UCR and SHR), was not designed as a research database but as a system of records to report information pertaining to the offenses and arrests that police conduct on a monthly and yearly basis nationwide. As such, crucial data for some studies are not explicitly captured in NIBRS.

4. The multiplicity of attributes in NIBRS requires additional tabulations, which combined with a lack of a hierarchy rule in which the most serious attribute (offense, weapon, injury, etc.) would be reported in the first available field results in computational challenges that can often be problematic or overlooked. Analytically, this structural aspect of NIBRS yields a requirement for analysis of all fields of a multiple response variable in order to correctly confirm that all target responses have been captured. This can be complicated and perhaps impact results.

5. NIBRS data are plagued by missing data at three levels: items missing within an incident; agency-level gaps by month, due to some agencies reporting only for a partial year; and agency-level gaps by year, due to some agencies not submitting data to NIBRS for an entire year. This can cause significant bias in statistical estimation and obstructs analysts’ ability to make inferences directly from the data.

6. At present, the NIBRS data have not been collected long enough to support time-series analysis. This is important because analyses of time-series data frequently lead to different conclusions than those of cross-sectional data (like that of D’Alessio & Stolzenberg 2003). In fact, Stolzenberg and D’Alessio (2007), the same authors, argued that time-series data are needed to better determine the causal direction of the possibly recursive relationships between divorce and spousal victimization rates.

7. Although the NIBRS collects richer information on crime incidents than the SHR or the NCVS, this comes at the cost of a considerably more complex data format. The NIBRS uses separate segment files to store information on offenses, victims, property, offenders, arrestees, and the place where the incident occurred. It is relatively easy to use data from a single segment because each segment has a format similar to the SHR and the NCVS. However, researchers interested in obtaining data from two or more segment files need to link the segments; see Akiyama and Nolan (1999) and Dunn and Zelenock (1999) for more detail on the NIBRS structure and linking of separate segments. This leads to significant and sometimes quite challenging data management issues.

For instance, one crime incident could have multiple offenses, victims, offenders, and arrestees. Also, one victim can be linked to multiple offenders or arrestees, and vice versa.^[5] If researchers make decisions that reduce the complexity of the NIBRS data prior to conducting analysis, results could be affected. For example, most clearance research using the NIBRS data has excluded incidents with multiple victims and/or offenders.
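To make the linking problem concrete, here is a minimal sketch with made-up records. The field names (`incident_id`, `victim_seq`, `offender_seq`) are hypothetical stand-ins, not the actual NIBRS segment layout. It shows how joining two segments multiplies rows for multi-victim, multi-offender incidents, and how the common simplification of keeping only single-victim, single-offender incidents drops data.

```python
from collections import defaultdict

# Hypothetical, simplified segment records; real NIBRS segment files have
# many more fields and different names.
victims = [
    {"incident_id": "A", "victim_seq": 1},
    {"incident_id": "A", "victim_seq": 2},
    {"incident_id": "B", "victim_seq": 1},
]
offenders = [
    {"incident_id": "A", "offender_seq": 1},
    {"incident_id": "A", "offender_seq": 2},
    {"incident_id": "B", "offender_seq": 1},
]

def group_by_incident(rows):
    """Index segment rows by the incident they belong to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["incident_id"]].append(row)
    return grouped

v_by_inc = group_by_incident(victims)
o_by_inc = group_by_incident(offenders)

# Linking the segments yields one row per victim-offender pair.
linked = [
    {**v, **o}
    for inc in v_by_inc
    for v in v_by_inc[inc]
    for o in o_by_inc.get(inc, [])
]

# Simplification: keep only incidents with one victim and one offender.
simple = [inc for inc in v_by_inc
          if len(v_by_inc[inc]) == 1 and len(o_by_inc.get(inc, [])) == 1]

print(len(linked), "linked rows from", len(v_by_inc), "incidents")
print("incidents kept after simplification:", simple)
```

Either choice shapes the results: the full join overweights complex incidents, while the simplified sample excludes exactly the co-offending incidents discussed earlier.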

(see Wilson and Herrnstein 1985 for more).^[b]^[c]^[d]^[6] Harris (2009) found that the overrepresentation of non-whites in jail reflects their crime rates. Harris et al. (2009) used annual prison admissions and compared them to the violent crime index. According to Harris et al., “the overall pattern is one of considerable consistency across stages of the criminal justice system of disparities observed in racial proportions of arrests” [emphasis added by the researchers].

The only time there was a disadvantage, meaning that the admission of blacks exceeds their arrest rates, for within-blacks was for rape. For whites, there was a disadvantage for rape, and for Hispanics there was a disadvantage across the board. Turning to between-black-white differences, Harris et al. found a high concordance rate between arrests, admissions, and in-stock prison population. The differences that did exist were small and show that groups of all races are going to jail at the same rate that they commit crimes.

This is only from Pennsylvania, and the study itself says:

“Arrest data may reflect discrimination, especially for rape and aggravated assault. Our analysis does not inform us about possible ‘arrest errors’ or racial bias in arrests (i.e., an arrest bias against minorities).”

Besides, there was another study that summarized the results of the Harris 2009 study (along with other studies) as follows:

“While there are significant data limitations that impede our ability to fully decompose aggregate racial disparities in incarceration rates (Frase, 2010), the available data indicated that during the 1970s and 1980s little of this disparity was due to sentencing differentials and a large majority—75-80%—was due to differential selection into the criminal justice system, namely racial disparities in rates of offending and arrest (Blumstein, 1982, 1993; Harris, Steffensmeier, Ulmer, & Painter-Davis, 2009; Langan, 1985).”

https://www.tandfonline.com/doi/pdf/10.1080/07418825.2012.682602?casa_token=4mMx7hjKnfkAAAAA:9uuBwzePOsHAQs4F3gkL8DpSfMYEd3t-2sIt6tFxzIL4yViTXWwcd1BDw1-6OQhmZjgjlRV-8b8_kzA^[e]
This appears to be consistent with the ultimate research document ... unless this paper is somehow incorrectly citing Harris. I don't know why the blog cited this, when the ultimate research document says that 61-80% of the disproportionate incarceration of black people is due to black people committing more crime, which is completely consistent with that conclusion. I skimmed through the Harris paper itself, and it doesn't seem to specify what proportion of the disproportionate incarceration of black people is due to increased criminality, so they must have taken the raw data from the Harris paper and calculated this 75-80% figure by themselves (unless I just missed that in the original Harris paper).

The whole “incarceration lines up with crime reports” claim is addressed further below.

Rubenstein (2006) also found blacks to be arrested at the same rate that they commit crimes in 2013.

This compares the UCR to the NIBRS but it’s already noted how these have problems.

A more recent study provided by the original Reddit comment noted that these studies still found an unexplained disparity, but as has already been noted above, they do not test whether the remaining disparity can be explained by racism. John may tell Bill that 85% of the reason he got into NYU was because of his grades, but the remaining 15% could be due to his race. Bill is right to say that John cannot prove this, and no amount of hypothetical argument can prove it either without testing the hypothesis. If the remaining disparity is due to racism, then it’s up to the supporters of this view to run a model to see if this is the case; the fact that none have done so is odd. Furthermore, it’s unknown whether the data look only at inmates who went to jail in a certain year and compare them to the crime rates of that specific year, or whether they look at the total number of people in jail and compare that to crime rates. If it’s the latter, then we should expect the data to be inflated, since some people may be in jail for crimes not captured in that year's crime statistics, especially if they have been there for years prior. Regardless, there is a remaining disparity, but this cannot be pinned on racism as the document asserts without actual evidence that this explains the remaining disparity.
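
The stock-versus-flow point in the paragraph above can be sketched with a little arithmetic. This is a minimal illustration, not data from any of the cited studies: the offense categories, admission counts, and sentence lengths are all hypothetical, and it assumes a steady state in which the standing population equals yearly admissions times average time served.

```python
# Hypothetical yearly admissions and average years served, by offense type.
# None of these numbers come from the studies discussed in this document.
admissions_per_year = {"offense_A": 800, "offense_B": 200}
avg_years_served = {"offense_A": 1.0, "offense_B": 8.0}


def shares(counts):
    """Convert raw counts into each category's share of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


# Steady-state assumption: standing population = admissions x time served.
standing_population = {
    key: admissions_per_year[key] * avg_years_served[key]
    for key in admissions_per_year
}

flow_shares = shares(admissions_per_year)   # who enters in a given year
stock_shares = shares(standing_population)  # who is inside on a given day

for key in admissions_per_year:
    print(f"{key}: {flow_shares[key]:.0%} of admissions, "
          f"{stock_shares[key]:.0%} of the standing population")
```

Under these made-up numbers, the long-sentence category is a minority of yearly admissions but a majority of the standing population, which is why it matters whether a study compares a single year's crime figures to admissions or to everyone currently incarcerated.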

Reasons for Black Crime

In the original edition of this response, the issue of why blacks commit more crime was not discussed. When working on this new response, I didn’t want to include it, but on further thought I decided to discuss the reasons for black criminality. According to Vaush, the reason blacks commit more crime is due to “Underfunded public programs, redlining, generational poverty, bad schooling, and myriad other factors which influence criminality.” It’s unknown how some of these things lead to crime (e.g. how does redlining lead to blacks committing crimes?), but most likely there are interaction effects at play. For example, redlining may lead to blacks staying in poverty-ridden locations, and the lack of resources leads to blacks committing crimes. Let’s discuss whether this can explain black crime on closer inspection.

Generational Poverty

In the case of generational poverty, the argument goes that historical sin has led to blacks being poor, and this lack of wealth was passed on to their offspring. Due to this, poverty is a reason why blacks commit more crimes. First, it’s important to discuss whether poverty is leading blacks to commit crimes. In general, it’s agreed upon that poor people commit more crimes. The question, though, is whether poverty shares a causal relationship with crime. Although poor people do commit more crimes, this is not evidence of causality.

Even assuming these studies are accurate and the conclusions drawn from them are sound (an assumption which one should not seriously make outside of a hypothetical, by the way), does it really matter if poverty and crime are causally related, or merely correlated? It’s the same difference when we’re talking about institutional racism. Furthermore, the causal mechanism could simply be indirect, or something not within the scope of the study in question.

Sariaslan et al. (2013) utilized a large sample of Swedish individuals born from 1975-1989. It was found that a 1 SD increase in neighborhood deprivation was associated with a 57% increase in the chances of being convicted of a violent crime. However, controlling for unobserved confounders made the association disappear. After controlling for confounders, the association between neighborhood deprivation (poverty, basically) and violent crime was shown not to be causal. After further controls were adjusted for, their OR was near 1 for violent crime and the OR was > 1 for substance abuse.

The definition of neighborhood deprivation is a poor proxy for poverty. For example, someone who is employed but earning a low wage might still struggle to make ends meet and experience material hardship. Similarly, someone who has completed secondary schooling but lacks access to quality healthcare or safe housing might still experience significant disadvantage.
This study did not specifically investigate underreporting or misclassification of offenses, but there are some clues that these issues may be present. For example, if there are systematic differences in reporting or classification of offenses across different neighborhoods or demographic groups, this could lead to biased estimates of the relationship between neighborhood deprivation and adolescent criminality and substance misuse. Additionally, if reporting or classification practices changed over time, this could affect the comparability of the results with previous studies. To address these issues, the authors used a comprehensive measure of more serious adolescent criminality (conviction data) and adjusted for observed and unobserved confounders in their analysis.
However, as the study notes, when they adjust for confounders, one of the confounders they adjust for is parent income. Parent income plays a huge role in poverty so of course when you adjust for it, you’d see a different outcome. This makes sense as this is a study on children. This is like saying “after adjusting for poverty, the poverty-crime relationship is non causal.” There are also significant limitations not mentioned:

1. “Although conviction data are a comprehensive measure of more serious adolescent criminality, they do not capture less serious offending”

2. “one could argue that official statistics for crime partly reflect policing practices, with the targeting of individuals of lower SES resulting in a greater risk of conviction for individuals of lower SES than for those of higher SES.”

3. “although we were able to adjust for school clustering, we had no access to indicators of school quality.”

4. “the multi-level models used in our study assume no correlation between the fixed and random effects that we included, nor between the random effects.”

5. “measuring neighbourhood membership and deprivation at a single point could lead to attenuation bias Duncan, G. J., & Raudenbush, S. W. (2001) and Subramanian, S. V. (2004)”

6. “endogeneity is a form of selection bias that arises in situations in which individuals can to some degree choose their exposures (i.e. what neighbourhoods to live in). Winkelmann R (2008).”

7. “the sibling-comparison design of our study makes a number of important assumptions (e.g. that exposed siblings do not influence their unexposed siblings, that differentially exposed siblings are generalizable to the population and that siblings share their environment). BB Lahey (2010), M McGue (2010), & T Frisell (2012).”

But even then, the confidence intervals still lend credence towards poverty and crime being strongly related (HCL = 1.12).
Additionally, there are more appropriate models to use:
One alternative statistical method that could have been used in the study is a multilevel regression model with random intercepts and slopes. This approach allows for the estimation of both between-neighborhood and within-neighborhood effects of neighborhood deprivation on adolescent criminality and substance misuse, which is more appropriate for the research question than the fixed-effects model used in the study. Additionally, this approach can account for potential heterogeneity in the effects of neighborhood deprivation across different neighborhoods or subgroups of individuals, which improves the accuracy of the estimates.

Another alternative statistical method that could have been used is a propensity score matching analysis. This approach involves matching individuals from different neighborhoods based on their propensity to live in a certain neighborhood, which can help to control for potential confounding variables that may impact both neighborhood selection and adolescent behavior. This approach is particularly useful if there are unobserved confounding variables that are difficult to measure or control for in other statistical models. For example, an unobserved confounding variable that may be difficult to measure or control for in this study is parental mental health. Parental mental health can impact both neighborhood selection and adolescent behavior, but it is difficult to measure and control for in statistical models. If there are systematic differences in parental mental health across different neighborhoods or demographic groups, this could lead to biased estimates of the relationship between neighborhood deprivation and adolescent criminality and substance misuse. While they were able to adjust for observed confounding variables, such as parental education and income, it is possible that unobserved confounding variables, such as parental mental health, are still present in the data and impact the results to some extent. This highlights the importance of using appropriate statistical methods and conducting sensitivity analyses to test the robustness of the results.

The study did not consider economic conditions over time. Let's say that during the study period, there was an overall improvement in the economy, with more job opportunities and higher wages. This might lead to a decrease in violent crime rates and substance misuse across all neighborhoods, regardless of their level of deprivation. If this trend is not accounted for in the analysis, it might appear that neighborhood deprivation has less of an effect on these outcomes than it actually does. This is because the decrease in violent crime and substance misuse due to the improved economy would be mistakenly attributed to neighborhood-level factors rather than broader economic changes.

So, once you control for familial confounding variables, the correlation between poverty and criminality is not causal as some have posited. So, although people in deprived areas may commit more crime, this is not because of deprivation. Similarly, Sariaslan et al. (2018) looked at a total of 526,167 people in Sweden who were born between 1989-1993. Children of parents in the lowest income percentile were more likely to be convicted of a violent crime when compared to those born in a high income percentile. Like the previous study, controlling for unobserved familial risk factors made the association go away, showing that there is no causal relationship between poverty and criminality, and that the correlation between the two is spurious rather than causal.

As can be seen, for quintile 1, which was the poorest group, their OR went from 6.78 in their first model to 0.95 in their fourth model. This was also the case for quintiles 2, 3, and 4. Thus, family variables cause the correlation between poverty and crime, making it a spurious correlation rather than a causal one.^[7]

These results are unreliable because of the wide confidence intervals. Even then, we see an HCL of 2.03, meaning that these participants could have as much as a twofold increase in the risk of getting convicted of a violent crime.
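
To show why a wide confidence interval makes a near-1 point estimate hard to interpret, here is a minimal sketch using the standard log-based (Woolf) interval for an odds ratio from a 2x2 table. The counts are hypothetical and this is not the study's own method (the study reports hazard ratios from Cox models); it only illustrates how the same point estimate can be compatible with very different conclusions depending on how much data sits behind it.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR); small cell counts inflate it.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical tables with identical proportions but 100x different sizes.
small = odds_ratio_ci(6, 94, 5, 95)
large = odds_ratio_ci(600, 9400, 500, 9500)

for label, (estimate, lower, upper) in (("small", small), ("large", large)):
    print(f"{label}: OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With the small hypothetical table the interval spans both a protective effect and a large harmful one, so the estimate rules out very little; with the large table the same estimate is pinned down much more tightly.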

There should be estimates for the various kinds of violent crime too (i.e., homicide, assault, robbery, threats and violence against an officer, gross violation of a person’s/woman’s integrity, unlawful threats, unlawful coercion, kidnapping, illegal confinement, arson, intimidation, or sexual offences). It’s expected that having an extended family or nuclear family can lead to a decreased incentive to commit certain crimes like sexual offences or arson.

Their measure of poverty/socioeconomic status is shoddy as well. They use a mean of family disposable income rather than a median, which is a problem because income isn't normally distributed: it is skewed by a small number of very high earners.
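
A quick sketch of why the mean is a poor summary of a skewed income distribution. The ten incomes below are made up for illustration and are not taken from the study.

```python
import statistics

# Hypothetical disposable incomes: nine ordinary households, one very rich.
incomes = [18_000, 22_000, 25_000, 27_000, 30_000,
           32_000, 35_000, 40_000, 45_000, 400_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Count how many households fall below the "average" income.
below_mean = sum(1 for income in incomes if income < mean_income)

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
print(f"{below_mean} of {len(incomes)} households earn less than the mean")
```

In this toy sample a single high earner pulls the mean to more than double the median, so nine of the ten households sit below the "average".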

Additionally, there's no warrant for adjusting for income when adjusting for familial risk factors; accordingly, they didn’t do that. Their statistical methodology is also confusing, if not flawed. Why are they calculating hazard ratios, fitting Cox regressions, etc., when those are used exclusively for survival analysis? For example, the study says:

“to account for time at risk, we calculated hazard ratios (HRs) with corresponding 95% confidence intervals for adolescent violent crime or substance misuse by fitting Cox proportional hazards regression models to the data”

However, this relies on the assumption of proportional hazards. Not only is this unlikely to be true in many real world situations, it is untestable in most real world data sets because the sample size required is huge. They excluded 67,960 people from their sample, lowering the power of the analysis, and they didn't report how many people were missing due to panel attrition. There are also significant limitations with this study:

1. “we cannot exclude potential bias from cohort effects that might have affected the associations between childhood family SES and outcome, because the included cohorts were infants or preschool children when Sweden underwent a major economic recession in the mid-1990s with quadrupling unemployment rates and substantially rationalised welfare programmes Å Bergmark (2003). We were unable to explore such bias because we did not have access to yearly parental income data prior to 1990.”

2. “our approach of using nationwide registry data confined our analyses to arguably more severe cases that had been registered by the legal and clinical services for their actions. It is obviously an empirical question whether the results for non-diagnosed cases would be similar.”

3. “the sibling-comparison design makes several important assumptions and requires a large sample size. BM D'Onofrio (2013), BB Lahey (2010), & T Frisell (2012)”.

Recently, Sariaslan et al. (2021) used a sibling analysis to examine the correlation between childhood family income and mental illness, substance abuse, and violent crime arrests. This analysis lets us see whether there is possible causality, since the siblings grew up at different times, during which the family income changed, and this could lead to lower or higher effects depending on the income level at the time. Much like the previous 2 studies, there was no causal relationship between childhood family income and criminality.

This methodology is dog. They are looking at siblings growing up in the same household at different household incomes (it sounded like an adoption study at first). This raises 2 problems:

1. Outcomes are a nonlinear function of income.

2. White collar income increases with age.

To explain (1): Think about the difference between households on $20k and $40k, versus the difference between households on $100k and $110k. Obviously there's a big difference in the former and less of a difference in the latter (particularly for extreme outcomes).

For (2), households likely to experience increasing income are more likely to be professional-class, which probably has more impact on outcomes than income at a specific moment in time. That is, inherent differences aren’t the only other explanation for this result, and the sibling comparison does not remove all other confounders as claimed.
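
Point (1) above can be sketched numerically. The logarithmic "hardship index" below is purely an assumption chosen for illustration (the study specifies no such function); any concave relationship would make the same point, namely that a raise at the bottom of the distribution moves the outcome far more than a raise near the top.

```python
import math


def hardship_index(income):
    """Hypothetical concave outcome: each doubling of income lowers the
    index by the same amount. Not a function taken from the study."""
    return -math.log(income)


def change(low, high):
    """Change in the index when household income moves from low to high."""
    return hardship_index(high) - hardship_index(low)


poor_jump = change(20_000, 40_000)    # a $20k raise at the bottom
rich_jump = change(100_000, 110_000)  # a $10k raise near the top

# Effect per extra dollar, to put the two raises on the same footing.
poor_per_dollar = poor_jump / 20_000
rich_per_dollar = rich_jump / 10_000

print(f"$20k -> $40k:   index change {poor_jump:.3f}")
print(f"$100k -> $110k: index change {rich_jump:.3f}")
print(f"per-dollar ratio: {poor_per_dollar / rich_per_dollar:.1f}x")
```

If siblings mostly differ by income changes of the second kind, a within-family comparison can show almost no effect even when income matters a great deal at the low end.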

This study takes place in Finland which isn't appropriate. Rates of violent crime are different in the US and income differences are smaller. The study even says this:

“the generalizability of our findings remains unclear. Although rates of psychiatric disorders and assaults are similar across Western Europe, there are smaller income differences in Finland than in other high-income countries.”

This was the trend among the trio of studies FB cited in a row.^[8] Another thing worth noting is that it says:

“the analyses of violent crime perpetration were based on arrest data, which had the benefit of including violent perpetrators who were not convicted but at the cost of including individuals who were subsequently acquitted.”

Further study is also warranted to clarify whether the effects persist over time and, importantly, whether they have an impact on adulthood outcomes in the offspring of the recipients. In fact, we see from figure 2 that the adjusted hazard ratios (aHR) start to increase toward 1.0 as age increases but then slowly and gradually decrease.
This suggests that further increases in family income decrease the risk of these negative outcomes as age increases.

Furthermore, blacks and whites in similar economic conditions do not have the same level of crime. In a graph provided by Chetty et al. (2018), race differences in crime persisted even when comparing blacks and whites in the same income group.

Chetty et al. also said that it’s incarceration, not black people committing more crime:

“Although there are large differences in incarceration rates between black and white men, incarceration itself cannot fully explain the black-white gaps in income for men documented in Figure Va. One way to see this is that the income gap remains substantial even among children in the highest-income families, for whom incarceration rates are much lower in absolute terms. Incarceration also cannot explain the sharp disparities observed in outcomes at younger ages, such as high school dropout rates. Moreover, incarcerated individuals have low levels of earnings even prior to incarceration (Looney and Turner 2017). We therefore treat incarceration as an endogenous outcome determined by some of the same processes that shape education and labor market outcomes. We defer consideration of factors that may directly increase incarceration rates for black men and depress their subsequent earnings, such as discrimination in the criminal justice system (Steffensmeier et al. 1998; Pager 2003), to future work.”

In fact, blacks in higher income ranks have similar crime rates as whites in lower income ranks (there was no difference when comparing white females and black females, but this isn’t an issue given that crime is primarily concentrated among men). Zaw, Hamilton, and Darity (2016) found that rich black kids are more likely to go to jail than poor whites:

Chart made by Ehrenfreund (2016)

The Zaw, Hamilton, and Darity study literally just doesn't say what he claims it says. The study concluded that incarceration rates were higher at every level of income for blacks, not that rich blacks commit more crime than poor whites. Like literally it says

“data indicate that although higher levels of wealth were associated with lower rates of incarceration, the likelihood of future incarceration still was higher for blacks at every level of wealth compared to the white likelihood, as well as the Hispanic likelihood, which fell below the white likelihood for some levels of wealth. Further, we find that racial wealth gaps existed among those who would be incarcerated in the future and also among the previously incarcerated.”

It also says

“One explanation for the differential odds of incarceration between races may be that even while having similar wealth levels, individuals still may have disparate economic situations, through income, extended family wealth or differential exposure to discrimination. Personal and family human capital levels such as education, job experience and social connections also may differ greatly among those with similar wealth levels. Therefore, observed racial differences in male incarceration rates despite similar wealth levels may be explained once those factors are taken into account”

Additionally it says

“Although racial disparities in incarceration seem to converge for males in the top decile of wealth, given the small sample sizes, this finding is inconclusive.”

The same study notes that types of crimes weren't reported and that there were limitations about this.

A user on Reddit has responded to some claims I’ve made in this post, discussing the issue of richer blacks committing more crimes than poor whites. He claims: “the Zaw, Hamilton, and Darity study literally just doesn’t say what he claims it says. The study concluded that incarceration rates were higher at every level of income for blacks, not that rich blacks commit more crime than poor whites.” First of all, my claim on what Zaw et al. said is correct. I doubt he read the study since he also said, “If you want to do an actual debunk yourself, just look at the abstracts of the studies he cites.” Just read the study yourself, especially since abstracts can be misleading. Regardless, let’s look at this graph made from the data by Zaw et al.: The Chart from Ehrenfreund (2016) shows rich blacks are more likely to be incarcerated than poor whites. We can use incarceration rates as a proxy for crime since incarceration rates align with crime reports, as discussed in this article above.

https://au.sagepub.com/sites/default/files/upm-assets/109135_book_item_109135.pdf
This explains why the comparison of self-report vs. victimization surveys tends to be advantageous for measuring offending, at least in its more recent iterations. Self-reported offending differences are smaller than what is reflected in NCVS data, which in turn seem to be even smaller than in NIBRS and UCR data. Plus, there are many limitations when comparing NIBRS data. Some individual variables lack adequate delineation of more nuanced aspects of the incident that may be required for analyzing the questions of interest.
https://www.researchgate.net/profile/Brendan-Lantz/publication/333854332_The_co-offender_as_counterfactual_a_quasi-experimental_within-partnership_approach_to_the_examination_of_the_relationship_between_race_and_arrest/links/5d0a453192851cfcc62308cc/The-co-offender-as-counterfactual-a-quasi-experimental-within-partnership-approach-to-the-examination-of-the-relationship-between-race-and-arrest.pdf
This expands on it.
The primary critiques don’t really seem to apply to more expansive and recent measures of offending.
https://www.researchgate.net/profile/Mike-Tapia/publication/258155816_Gang_Membership_and_Race_as_Risk_Factors_for_Juvenile_Arrest/links/5666f44508aea62726ee1fb5/Gang-Membership-and-Race-as-Risk-Factors-for-Juvenile-Arrest.pdf?origin=publication_detail
Here’s an example of a more recent paper with a more expansive measure of delinquency. It shows no evidence of bias per se, so don’t use it as such. I’m talking about offending specifically, though.

He also complains that some studies are uncited for the section talking about the economy and crime. Only one of them was uncited, which was Rubinstein (1992). A citation has been added to that source since I can’t find a version of it online. Wolfgang, Figlio, and Sellin (1972) remarked that in the 20th century, lower class blacks had higher levels of crime than lower class whites.

This paper isn’t even about crime, it’s about delinquency. Even then, the paper itself notes the problems with the research that the author is omitting:

“…it should be clearly understood that the offense histories we have analyzed are derived from police-arrest records. We are aware of the concept and studies of "hidden delinquency," or the "dark numbers" of crime, which refer to illegal acts unknown or unrecorded by official agencies. For certain types of offenses, delinquency status, usually the less serious, racial and socioeconomic disparities found in official police records are often reduced among self-reporting studies from anonymous questionnaires or interviews. There may also be race differentials in police arrests. Therefore, we generally use the phrase "having a police record" or a "police contact.”

This also isn’t even generalizable:

“How representative a single cohort may be for other communities, for different birth cohorts, for females can only be conjectured. Cohort subset comparisons as, for example, delinquents with nondelinquents, on the basis of social, economic, and personality variables, may be representatively valid and reliable beyond the single cohort itself.”

Indeed this is just a cohort so it’s not conclusive nor generalizable whatsoever. If this were to be true, this can be attributed to lower satisfaction with police and neighborhood characteristics which was not controlled for. Yuning Wu, Ivan Y. Sun and Ruth A. Triplett found this to be true:

“The results from the individual-level analysis indicate that both race and class are equally important predictors. African Americans and lower-class people tend to be less satisfied with police. The significant effects of race and class, however, disappear when neighborhood-level characteristics are considered simultaneously. Neighborhood racial composition affects satisfaction with police, with residents in predominantly White and racially mixed neighborhoods having more favorable attitudes than those in predominantly African American communities. Further analyses reveal that African Americans in economically advantaged neighborhoods are less likely than Whites in the same kind of neighborhoods to be satisfied with police, whereas African Americans and Whites in disadvantaged communities hold similar levels of satisfaction with police.”

Furthermore, the correlation between poverty and crime on a national scale is inconsistent: Ellis, Beaver, and Wright (2009) found that most studies show that crime actually rises when the economy improves. 17 found that crime rises when the economy improves, 10 found that crime rises when the economy is doing badly, and 5 found no relationship between the two.

Well, literally in the same book, 3 pages before the section on the state of the economy, they highlight this:

The economy does correlate to poverty. For example, markets can be doing very well, such as what happened under Pinochet, and the standard of living could go down or remain the same. The economy doesn’t actually “improve” for the working class. The economy doing well =/= improved living standards especially among systematically oppressed areas such as majority black neighborhoods. Especially for black people, it doesn’t change for oppressed peoples who had decades upon decades of systemic violence against them specifically. The whole “inclusive capitalism” bullshit is a more recent thing to try to cover for uhh..well the obvious bullshit. Poverty has not fallen despite robust economic growth because this growth did not result in rising wages at the median and below. Dahlquist (2014) even concluded that economic growth isn’t a sufficient tool when the level of extreme poverty is high.

Rubinstein (1992) found that murder and robbery tend to increase when employment increases.

This could be attributed to increased imprisonment, changes in the market for crack cocaine, the aging of the population, tougher gun control laws, the strong economy, increases in the number of police, as well as the crime bill.

In looking at predictors of violent behavior (which cause crime, duh!), poverty, the mother’s lack of education, and the mother’s unemployment were predictive for whites but not blacks (McLeod, Kruttschnitt, and Dornfeld 1994).

Per the study:

“Parenting practices and antisocial behavior are reciprocally related for Whites, but parenting practices do not significantly predict antisocial behavior for Blacks.”

So first of all, these looked at predictors for antisocial behavior, not violent behavior. Second of all, how would this help his case if white children are becoming antisocial?

Ramos (2014) found the beta coefficient between poverty and crime to be statistically insignificant and weak.

This is a thesis, which is not peer-reviewed or published in a reputable journal, and it says itself that its results aren’t conclusive and that it may actually be masking the poverty-crime relationship:

“these results should be taken with caution due to the aggregate- level nature of the study. It is possible that within a state, communities with high levels of poverty experience higher rates of violent crimes while the majority of the state experiences the opposite. The total effect could be what we have seen: less poverty overall but more localized violent crime. A targeted study of smaller communities with high levels of poverty and violent crime could yield completely different conclusions.”

So the results are most likely due to a Simpson’s paradox given the study was only looking at averages. In fact that’s the precise problem with this study. From the regression equation it looks like it’s using linear regression:
The problem with using linear regression is that it assumes that the same ethnic group is as criminal in one region of the country as another which is obviously not true.
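
To make the Simpson's paradox point concrete, here is a minimal sketch with made-up numbers. The two regions, their poverty rates, and their crime rates are hypothetical and are not drawn from the thesis; the slope function is ordinary least squares on a single predictor.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical communities: (poverty rate %, violent crimes per 10k people).
region_a_poverty = [5, 6, 7, 8]
region_a_crime = [40, 42, 44, 46]
region_b_poverty = [15, 16, 17, 18]
region_b_crime = [20, 22, 24, 26]

slope_a = ols_slope(region_a_poverty, region_a_crime)
slope_b = ols_slope(region_b_poverty, region_b_crime)

# Pool everything together, as an aggregate-level model would.
slope_pooled = ols_slope(region_a_poverty + region_b_poverty,
                         region_a_crime + region_b_crime)

print(f"within region A: {slope_a:+.2f}")
print(f"within region B: {slope_b:+.2f}")
print(f"pooled:          {slope_pooled:+.2f}")
```

Within each hypothetical region more poverty goes with more crime, yet the pooled slope is negative because the regions differ in baseline crime for reasons the aggregate model never sees. That is the kind of masking the thesis itself warns about.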
Their use of GDP as a variable is flawed. GDP in 2005 chained dollars does not adequately account for inflation or changes in the value of money over time. The study used data from 2000 to 2012, and the value of money has changed significantly over this time period. This introduces bias into the results of the model, as the relationship between poverty and crime is different depending on the value of money. For example, their data included the period of the 2008 recession. The economic downturn has led to job losses, reduced income, and increased financial strain for many individuals and households. However, the use of GDP per capita in 2005 chained dollars as a measure of economic well-being does not account for changes in the value of money over time or for changes in the economic circumstances of individuals and households. This introduces bias into the results of the model, as the relationship between poverty and crime is different depending on the economic context.
Their definition and methodology of measuring poverty is flawed: First, the Census Bureau's methodology for measuring poverty is based on the poverty threshold, which is a set of income levels that are used to determine whether a household's income is below the poverty line. The poverty threshold is determined based on the size and composition of the household, and is updated annually to account for inflation. However, this methodology may not accurately reflect the true cost of living or the economic well-being of households, as it is based on a national average of the cost of goods and services and does not take into account regional variations in the cost of living or in the income needed to meet basic needs.
Second, the Census Bureau's methodology for measuring poverty is based on the use of pre-tax income as a measure of economic well-being. This may not adequately capture the impact of taxes or government assistance programs on household income, as these factors can significantly affect a household's ability to meet its basic needs. Alternative measures of economic well-being, such as post-tax income or household consumption, may provide a more accurate representation of a household's economic circumstances.
There are several unobserved variables that could be influencing the relationship between poverty and crime that the study's time fixed effects estimation method may not have fully accounted for.
For example, the study may not have adequately controlled for individual differences in risk-taking behavior or criminal tendencies. While these factors may not change over time, they could have a significant impact on the relationship between poverty and crime. Similarly, the study may not have fully accounted for differences in social networks or community characteristics that could also affect the relationship between poverty and crime.
Other unobserved variables include factors related to the local economy, such as the availability of job opportunities or the strength of the local housing market. These factors could be important determinants of both poverty and crime rates, and may not be fully accounted for by the time fixed effects estimation method.
The inclusion of these unobserved variables could introduce bias into the study's results, and could potentially explain the lack of a statistically significant relationship between poverty and crime found in the study.
Additionally, the "black" variable in the linear regression model had a Variance Inflation Factor (VIF) of 3.21 and a Tolerance of 0.311. This is among the highest VIFs in their model.
The VIF of 3.21 indicates that the "black" variable is correlated with one or more of the other predictor variables in the model. Tolerance is simply the reciprocal of VIF (1 / 3.21 ≈ 0.31), so the two figures say the same thing: roughly 69% of the variance in the "black" variable is shared with the other predictors. That is below the usual rule-of-thumb cutoffs for severe multicollinearity (a VIF of 5 or 10), but it is still worth considering the potential impact on the results, since multicollinearity makes the estimates of the model coefficients less precise and less reliable.
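As a quick check on how these two figures relate, here is a minimal sketch assuming the standard definitions (tolerance = 1 - R², VIF = 1 / tolerance), where R² comes from regressing the "black" variable on the other predictors. The function names are my own and are not from the study.

```python
def tolerance_from_vif(vif: float) -> float:
    """Tolerance is the reciprocal of the VIF."""
    return 1.0 / vif


def r_squared_from_vif(vif: float) -> float:
    """Share of a predictor's variance explained by the other predictors."""
    return 1.0 - 1.0 / vif


vif_black = 3.21  # value reported for the "black" variable
tol = tolerance_from_vif(vif_black)
r2 = r_squared_from_vif(vif_black)

print(f"tolerance = {tol:.3f}")
print(f"R^2 with the other predictors = {r2:.3f}")
```

The point is just that VIF and Tolerance are not two independent diagnostics: once you know one, you know the other.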
Also, the variable "lag of poverty rate" was not included in the correlation and multicollinearity tests. This is potentially problematic because the inclusion of this variable could have had significant implications for the results of the model.
The "lag of poverty rate" variable is generated from the poverty rate variable and is therefore likely to have a high correlation coefficient with it. This high correlation could result in a high Variance Inflation Factor (VIF) for the "lag of poverty rate" variable, which could in turn affect the test results of the other variables in the model.
The high VIF of the "lag of poverty rate" variable could prompt the study to incorrectly exclude it from the regression, as the presence of high multicollinearity in the model could make the estimates of the model coefficients unstable and unreliable. This could potentially disadvantage the study by limiting the ability to accurately interpret the results and draw reliable conclusions about the relationship between poverty and crime.
Overall, the exclusion of the "lag of poverty rate" variable from the correlation and multicollinearity tests could potentially introduce bias into the study's results.
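To make the concern about the lagged variable concrete, here is a rough sketch with made-up numbers (not the study's data). It assumes each county's poverty rate this year is last year's rate plus a little noise, and uses the two-predictor shortcut VIF = 1 / (1 - r²).

```python
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


random.seed(0)

# Hypothetical counties: last year's poverty rate, and this year's rate,
# which is assumed to drift only slightly from one year to the next.
lagged_rate = [random.uniform(5.0, 30.0) for _ in range(200)]
current_rate = [rate + random.gauss(0.0, 1.5) for rate in lagged_rate]

r = pearson_r(current_rate, lagged_rate)
vif = 1.0 / (1.0 - r ** 2)  # VIF when these are the only two predictors

print(f"correlation between poverty rate and its lag: {r:.3f}")
print(f"implied VIF: {vif:.1f}")
```

Under these assumptions the lag is almost a copy of the original variable, which is exactly why it needed to be in the multicollinearity tests.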

Finally, race differences in crime persist even after controlling for socioeconomic status (Kornhauser 1978).

Kornhauser (1978) actually opposes the idea that “race differences in crime persist even after controlling for socioeconomic status”. The study says:

“some theorists have argued that blacks and whites in similar social positions experience fundamentally different realities and hence develop unique cultures that account for race differences in non valued outcomes such as violent crime. Our analysis suggests a different conclusion. While we would not deny that the daily realities of life differ for poor blacks and poor whites, and for blacks and whites in single-parent households, the children from such households show similar behavioral responses early in life. Given the clear links between early antisocial behavior and later criminal offending (Loeber & LeBlanc 1990), and the increasing racial divergence in rates of offending over the life course (Reiss & Roth 1993), the subsequent experiences of these children - both inside and outside of the home deserve close attention.”

In conjunction with what was said above, percent black is a better predictor of crime than poverty when put into a regression. Kposowa, Breault, and Harrison (1995) analyzed crime variation across 2,078 U.S counties and found that the proportion of the county that was % black continued to predict crime even after controlling for county differences in poverty, divorce rates, income inequality, religiosity, population density, and age. This was true for both violent crime and property crime.

FB omits a crucial part of the paper:

“Employing a variety of research strategies and techniques, we fail to support the subculture of violence theory as applied to the region of the South or blacks.”

It also says

“Poverty and divorce continue to be the strongest determinants of homicide in rural counties, while population mobility and urbanity are the strongest factors in both rural violent and property crime. Unemployment also plays a strong role in rural property crime.”

It also says

“We conclude that race and ethnic specific explanations of crime are unwarranted.”

This is antithetical to the notion that inherent attributes predispose black people to crime, since that in itself would be a race specific explanation which the authors say is untenable. So how do the authors reconcile the findings that seem to suggest an inherently racial component to crime with the conclusion that race specific explanations are unwarranted? Well, out of their larger sample they conduct regression estimates for counties that were above 25% black. They argue:

“if race specific and ethnic specific explanations of crime are superfluous, then we would expect such variables to be crime factors in areas where these groups are strongly represented.”

They then find that when you look at counties above 25% black, the proportion of the population that is black is no longer a strong predictor of crime, and the authors even note that it is unrelated. Meanwhile, poverty and urbanity become the strongest predictors of crime. This is the exact opposite of what we would expect if black people were more predisposed to crime by nature, unless FB were to argue that the natural propensity of black people to commit crime somehow diminishes as the black share of the population increases, which would be nonsensical, especially since the authors themselves point out that the relationship between crime and percent black is implausible. If the predisposition were real, there would be no reason for percent black to stop predicting crime once the proportion of black people in a county reached a certain point; we would see this relationship no matter what. As for why crime in minority communities may be high, the authors conclude that

“there are more plausible factors that these groups share, notably poverty, divorce and population density in the case of homicide, urbanity and population density for violent crime, and at least urbanity for property crime.”

Also that

“the weight of the evidence suggests that the major causes of crime in areas in which black, Hispanics and Native Americans are strongly represented are the same factors that explain crime elsewhere. Thus these data suggest that race and ethnic specific explanations of crime are not necessary.”

It is also evident that, for property crime, unemployment has a larger and more statistically significant effect than percent black.

And where are the error values for these results (i.e., the standard errors)?

They use OLS regression; however, this doesn’t seem like the correct statistical technique given that the black variable is overdispersed. A better technique would have been negative binomial regression. Their OLS estimates are going to be biased: assuming they corrected for the non-normality of their distributions through log transformations, the undefined values (the log of zero) come from the majority of people who don’t commit crimes, so they’re predicting with lots of missing data. Linear regression also assumes that the same ethnic group is as criminal in one region of the country as in another. There’s no warrant for any of this, so their analysis is flawed.

If they are so certain that “percent black” is causal in crimes being committed, there should be some sort of test for endogeneity (e.g., two-stage least squares regression). Even then, the relationships between percent black and property crimes and violent crimes are rather weak (standardized beta coefficients for percent black are .178 and .369, respectively; 64% unexplained variance for variables other than race, and 52% unexplained variance).

It would be interesting if we could see the intercepts for the variables like poverty and controls for the interaction effects of poverty and urbanity in the models. Indeed the paper says “table III presents OLS regression estimates for property crime.”

“Percent black” is also a vague variable. They should clarify what it means and how they’re quantifying it. Why is it that “percent black” and the other population variables in the models look like they have similar coefficient values?

Also, what they should’ve done for their regression estimates is divide property crime into its respective subgroups as the UCR does (arson, burglary, larceny-theft, and motor vehicle theft). They should’ve done the same for violent crime (murder and nonnegligent manslaughter, rape, robbery, and aggravated assault), because we see that poverty has a positive relationship with homicide (b = .234, standardized b = .228) but not with violent crime overall? This suggests that aggregation is masking the relationship, in the manner of a Simpson’s paradox, for both property crime and violent crime.

Another problem is their measure of income inequality: it simply assumes scale invariance, which is dubious for a measure capturing the distribution of income. There is no a priori reason to assume that summing up the inequality levels of neighborhoods in a city will yield the level of inequality in the city overall. It is quite possible for a city to have low levels of inequality within its neighborhoods but high levels of inequality across them. Indeed, it is logically possible for a city to have a high degree of inequality overall but virtually no inequality within its neighborhoods if there is complete segregation based on income level. Because neighborhood-level inequality does not necessarily aggregate up to the level of inequality in the city as a whole, the geographic distribution of inequality matters for understanding crime rates and how they change over time. So they need to test the overall level of inequality in a city simultaneously with the average level of inequality in its neighborhoods.

The same applies to their measure of poverty, as they’re assuming a linear relationship when it’s actually nonlinear. They’re assuming that the geographic distribution of poverty does not matter for fostering crime. The paper doesn’t posit a specific theoretical mechanism at this higher level of aggregation to explain the relationship; instead it simply assumes a degree of scale invariance in which higher levels of poverty in the neighborhoods of a city will lead to higher levels of crime in each of them, and this effect will then aggregate up to the county level. This linear relationship assumption isn’t reasonable.

As said before, there’s an interaction effect of poverty and urbanity on crime. It therefore may be important to take into account the geographic dispersion of poverty in a city. That is, the degree to which residents in poverty are clustered into particular neighborhoods may lead to higher levels of crime in such cities. A nonlinear relationship between poverty and crime at the neighborhood level would have important implications for the amount of crime in the city overall. This paper fails to test this proposition.^[9]
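The segregation case is easy to show with toy numbers. The sketch below is purely illustrative: the incomes and the two-neighborhood city are hypothetical, and I'm using the Gini coefficient as a stand-in for whatever inequality measure the paper actually used.

```python
def gini(incomes):
    """Gini coefficient via the mean absolute difference (0 = perfect equality)."""
    n = len(incomes)
    mean_income = sum(incomes) / n
    total_abs_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_abs_diff / (2 * n * n * mean_income)


# Hypothetical city that is completely segregated by income:
# everyone inside a neighborhood earns the same amount.
poor_neighborhood = [20_000] * 50
rich_neighborhood = [150_000] * 50

within = [gini(poor_neighborhood), gini(rich_neighborhood)]
average_within = sum(within) / len(within)
citywide = gini(poor_neighborhood + rich_neighborhood)

print(f"average neighborhood Gini: {average_within:.3f}")
print(f"citywide Gini:             {citywide:.3f}")
```

Here the average neighborhood Gini is 0 while the citywide Gini is about 0.38, so the neighborhood figures tell you nothing about the city total. That is the sense in which inequality does not "aggregate up".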

Why on Earth would the author use county-level data? Counties are large! Even Marin County has low income areas (https://bestneighborhood.org/household-income-marin-county-ca/), where the average household income is $36k (fine for some places, but good lord, how does anyone live on that in the Bay Area??). We have this information at the municipality and even neighborhood level, which would give more control over the variables. Changing the level at which you look at the data is going to change your results, so I'm very suspicious of using such a large population grouping, unless you're doing something like looking at governmental policies enacted at the county level.

There's also a theory that local income inequality, not just poverty, raises homicide rates (https://www.scientificamerican.com/article/income-inequalitys-most-disturbing-side-effect-homicide/). If that's the case, counties that are the most destitute should have a lower murder rate than counties in the middle who might have more areas that are on either end of the household income scale. But the author's not considering fine detail like this; their categories are extremely broad.

They also use an average for homicide, which isn’t appropriate because the distribution is skewed rather than normal, so the mean overstates the typical value. Their definition of a rural county also differs from standard definitions such as that of the Economic Research Service, which defines rural areas as follows:

“According to the current delineation, released in 2012 and based on the 2010 decennial census, rural areas comprise open country and settlements with fewer than 2,500 residents.”

Kposowa defines rural counties as those with fewer than 25,000 people.

The coefficient of variation for percent black was also greater than 30, which isn’t considered acceptable.

This uses arrest counts, which are not an accurate reflection of race differences in crime. Criminologists have rightly questioned, and some have abandoned, the practice of using arrests to understand the dynamics and distribution of offending or to generalize to criminals in the population, with one calling it “indefensible” (Elliot 1995, 9). Instead, arrest data tap official responses to offending and the discretion of agencies. Official arrests neither describe adequately the incidence and distribution of offending in the population, which is far more extensive in self-reports, victimization surveys, and crimes known to police, nor adequately capture the individuals who self-report offending. Most offenders are never arrested and most crime is never reported, and the probability of “arrest per self-reported serious violent offense” is shockingly low (2%). Specifically, the correlations between arrests for index crimes and self-reported index offending rates are small, hovering around 0.38, and arrest rates explained just 9 to 14% of the variation in offending based on self-reports (Elliot 1995). Arrest rates are not necessarily accurate predictors of offending patterns, nor do they accurately distinguish offenders from non-offenders. Even the “worst offenders” based on official arrests bear almost no relationship to the worst offenders based on self-reports, with more than 75% of one group missing from the other (Elliot 1995). Offense patterns and estimates of the prevalence of offending by demographic group based on the two sources of data look remarkably different (Pollock et al. 2015). Knowing arrest history, in short, does not allow one to say much of anything about offending in the population, nor do arrest samples come close to being representative of the population of offenders. On these grounds, we follow a growing group of experts who have argued for relying on self-reports of offending as a more suitable method (Weaver 2019).

FB is citing studies from the 70s for all race differences in crime after controlling for poverty. Poverty isn't the only thing that influences crime. There are a number of things that have an impact, e.g. age and urbanity. As it turns out, most black people live in urban areas while rural areas have very few (aside from in the South). Here's a study to consider:

1. This isn't from the 70s

2. It controls for poverty AND urbanity.

It finds that

“Poor urban blacks (51.3 per 1,000) had rates of violence similar to poor urban whites (56.4 per 1,000).”

But even then, those aren't all the relevant factors:
https://fivethirtyeight.com/features/trump-doesnt-know-why-crime-rises-or-falls-neither-does-biden-or-any-other-politician/
The underlying causes of crime are pretty varied and it's hard to get a clear picture by just controlling for one or two things.

Templer and Rushton (2011)^[10] looked at crime across the 50 U.S. states and found that the percent of the population that was black was a stronger correlate than average income for murder rates (.84 v -.40), robbery rates (.77 v .06), and assault rates (.54 vs -.23).

This paper is looking at a correlation between two variables measured in different periods, meaning there's measurement error and they’re not comparable. They also use arrest counts, specifically from the UCR, which has several problems. Average income is not the full picture by any means; in fact, it is worthless here. If you have many black people in poverty and a few very rich people, you have a higher average income but many desperate people, hence more crime. Unless you make a histogram per "race" and overlay them, showing equal distribution, these numbers are not helpful. This is because average income is the income everyone would have if the total of all income combined were equally distributed among everyone. In a geographic area where there happen to be a small number of extremely high income households and a large number of lower income individuals, for example, the average income can become inflated and give the false impression that everyone has more money than they actually have. If the average income is greater than the median income, this means that income is disproportionately concentrated in the wealthiest households. For instance, the average U.S. household income is $87,864, and the median is $61,937. This is because wealth is generally skewed towards the top 1%, which owns more wealth than the bottom 90%. The connections between poverty and crime are more multifaceted than income alone. Black Americans face unique systemic hurdles that go further than income. The study does not control for urban poverty. The reason why black people have higher crime statistics is that redlining and other instances of systemic racism have resulted in the majority of urban poor people being black. Crime is associated with urban poverty, and when you control for that, there is no statistical difference between black and white people in terms of crime rate, as the previously linked study shows.
This study does not address this at all and just repeats the same racist assumption that skin color is a cause of higher crime rates rather than simply a correlate because of racism. Even then, this would just support our claim: black people are the poorest people in the country, so they commit the most crime proportionally. These trends also show up in a lot of different places. Just because black people disproportionately commit crime relative to average income (and black average income would be lower) doesn’t mean that poorer people in general don’t commit more crime.
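The mean-versus-median point above is easy to see with toy numbers. This is a minimal sketch with a hypothetical set of 100 households (not real data): a handful of very high earners is enough to drag the average far above what the typical household makes.

```python
from statistics import mean, median

# Hypothetical area: 95 lower-income households and 5 very high-income ones.
incomes = [30_000] * 95 + [1_000_000] * 5

avg = mean(incomes)
mid = median(incomes)

print(f"average income: ${avg:,.0f}")
print(f"median income:  ${mid:,.0f}")
```

In this made-up area the average comes out to $78,500 even though 95% of households earn $30,000, which is why a state-level "average income" says very little about how many people are actually poor.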

Income was a stronger correlate for rape rates than race, but the coefficients were weak. Rubenstein (2005) found a very strong correlation (r=0.81) between percent black and Hispanic of a state and the violent crime rate of the state. Conversely, the association between poverty and crime and unemployment and crime are 0.36 and 0.35, respectively. All these lines of evidence cast doubt on poverty being a role in black crime, especially when looking at the effect sizes.

However there’s no p-value cited whatsoever.
This is a pretty wide correlation and there may be heteroscedasticity.

The shared variance (r²) is around 66% (0.81² ≈ 0.656).
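For reference, shared variance is just the correlation squared. The short sketch below applies that to the three correlations FB quotes; the helper name and the dictionary labels are my own.

```python
def shared_variance(r: float) -> float:
    """Proportion of variance two variables share under a linear fit (r squared)."""
    return r ** 2


# Correlations as quoted by FB from Rubenstein (2005).
reported = {
    "percent black and Hispanic vs violent crime": 0.81,
    "poverty vs crime": 0.36,
    "unemployment vs crime": 0.35,
}

for label, r in reported.items():
    print(f"{label}: r = {r:.2f}, r^2 = {shared_variance(r):.1%}")
```

Keep in mind that r² from a bivariate state-level correlation only describes how well a straight line fits those 50-odd points; it says nothing about what is causing what.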

The reason there is an .81 correlation is that there is an outlier in the first quadrant that also acts as a high-leverage point on the slope of the least-squares regression line, which always passes through the point (mean of x, mean of y). Since there is little to no relationship among the points in the third quadrant, this single outlier effectively fixes the direction of the line, and it can pull the line away from the pattern in the rest of the data. Because it is an influential point, removing it would leave a correlation even weaker than the already weak relationship among the remaining points.
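To show how much one high-leverage point can do, here is a small sketch with invented numbers (these are not the report's data points): a cloud with no relationship at all, plus a single extreme observation.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Hypothetical cloud of ten points with no linear relationship.
xs = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
ys = [3, 1, 2, 3, 1, 1, 3, 2, 1, 3]

# One hypothetical extreme point far from the rest of the data.
outlier_x, outlier_y = 40, 30

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [outlier_x], ys + [outlier_y])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

The ten-point cloud has a correlation of zero, yet adding the one extreme point pushes r well above 0.9. A scatterplot with a point like that needs to be reported with and without it.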

The residuals are large, meaning there is large error as well.

There is a relative lack of heteroscedasticity for these figures. They also have fairly large residuals, meaning these models carry substantial error.

These also don’t have the same scales meaning they’re not comparable.

Also, where is the control for age in these bivariate models? Obviously, young people are more likely to be unemployed and older people usually commit fewer crimes, so these correlations are severely inflated.

The Pearson coefficient still has problems because

1. Correlation measures only linear relationships.

2. Correlation is an average across subsamples and may not reflect the relationship between any two individuals.
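The first limitation can be demonstrated in a few lines. In this sketch (toy data, my own helper function) y is completely determined by x, but because the relationship is a curve rather than a line, the Pearson coefficient comes out as zero.

```python
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y depends perfectly on x, but not linearly

r = pearson_r(xs, ys)
print(f"Pearson r for y = x^2 on symmetric x: {r:.3f}")
```

So a low r does not mean "no relationship", and a high r does not mean the relationship holds equally everywhere in the data.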

This is a report, not an actual study, and it has no methodology section.

And this data is from the UCR 2002 arrests for violent crime which has several limitations.

Also, the methodology behind the data used to estimate unemployment is flawed. It used the BJS 2002 statistics on unemployment; however, that measure has several problems:

The definition of unemployment only includes individuals who are actively looking for work, which means that it excludes individuals who are not actively seeking employment, such as those who are discouraged from seeking work or who are in school. This can lead to an undercount of the total number of individuals who are unable to find work.
The definition of the labor force only includes individuals who are employed or unemployed, which means that it excludes individuals who are not actively seeking work or who are not able to work, such as homemakers, students, or disabled individuals. This can lead to an undercount of the total number of individuals who are not participating in the labor force.
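The definition of employment only includes individuals who worked in the reference week or who have a job from which they were temporarily absent. Under the CPS definitions, anyone who did any paid work at all in the reference week counts as employed, while unpaid family workers count only if they worked 15 hours or more. This means that people working very few hours are counted the same as full-time workers, which can mask underemployment, and unpaid family workers under the 15-hour mark are not counted as employed at all.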
The CPS is based on a sample survey, which means that the results are estimates and are subject to sampling error. This means that the unemployment rate and labor force participation rate estimates may not be perfectly accurate.
Their methodology: Their method relies on data from multiple sources, including the Current Population Survey (CPS), the Current Employment Statistics (CES), and state unemployment insurance (UI) data. These data sources may have different definitions of employment and unemployment, which can lead to discrepancies in the estimates produced using this method.
The use of estimating equations based on regression techniques involves making assumptions about the relationships between different variables. If these assumptions are not accurate, the estimates produced using this method may not be reliable.
This method relies on data from the CPS, which is based on a sample survey. As with any sample survey, the estimates produced using this method are subject to sampling error, which means that they may not be perfectly accurate.
The estimates produced using this method are based on data from a specific point in time, which means that they may not accurately reflect changes in the labor market or unemployment trends over time.
Seasonal Adjustment: Seasonal adjustment is a statistical method that is used to remove the influence of predictable, regular patterns from data, such as seasonal fluctuations in employment or unemployment. While these adjustments can help to make data more interpretable, they rely on statistical models that may not perfectly capture the underlying patterns in the data. As a result, the adjusted data may not perfectly reflect the actual trends in employment or unemployment.
Seasonal adjustment can be a complex process, and the results can be sensitive to the choice of statistical model and the assumptions used. If the assumptions or models used are not accurate, the adjusted data may not be reliable.
The adjustment process relies on data from the Current Population Survey (CPS), which is based on a sample survey. As with any sample survey, the estimates produced using this method are subject to sampling error, which means that they may not be perfectly accurate.
The adjusted data may not accurately reflect changes in the labor market or unemployment trends over time, especially if there are structural changes in the economy that are not captured by the adjustment process.
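To illustrate the first point in the list above (discouraged workers being left out), here is a minimal sketch with hypothetical counts. The function names and numbers are assumptions for illustration only; the second rate is in the spirit of the broader alternative measures, not a reproduction of any official series.

```python
def unemployment_rate(unemployed: int, employed: int) -> float:
    """Headline-style rate: unemployed / (employed + unemployed)."""
    return unemployed / (employed + unemployed)


def rate_including_discouraged(unemployed: int, employed: int, discouraged: int) -> float:
    """Broader rate: discouraged workers are added to both the numerator and the labor force."""
    return (unemployed + discouraged) / (employed + unemployed + discouraged)


# Hypothetical counts for one area (illustrative only).
employed, unemployed, discouraged = 9_000, 600, 400

headline = unemployment_rate(unemployed, employed)
broader = rate_including_discouraged(unemployed, employed, discouraged)

print(f"headline rate: {headline:.2%}")
print(f"broader rate:  {broader:.2%}")
```

With these made-up counts the headline rate is 6.25% while the broader rate is 10%, so which definition a report uses can change the picture considerably.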

They also use a flawed measure of lack of education:

The ACS is based on a sample survey, which means that the estimates produced using this data are subject to sampling error. This means that the estimates of the percentage of people with a lack of education may not be perfectly accurate.
Their measure only captures the percentage of people who have completed high school, and does not take into account the level of education beyond high school. As a result, it may not accurately reflect the overall level of education in a population.
This measure only considers formal education, and does not take into account other forms of education or learning, such as informal or self-directed learning.
This measure is based on data from a specific point in time, and may not accurately reflect changes in the level of education over time.
This measure may not accurately reflect the level of education in certain subgroups of the population, such as those who have dropped out of high school or who have not had access to education due to financial, social, or other barriers.

It's still misattributing statistics to race rather than to urban poverty. States with higher percentages of black people also have higher urban poverty rates. This is ignorantly or intentionally not controlling for urban poverty which is the biggest factor in crime rates. They instead use the misleading variable of "poverty" with no distinction between (mostly white) rural poverty and (mostly black) urban poverty. Rural poor people have significantly lower crime rates than urban poor people because of population density. More recent data indicates that after accounting for poverty and urbanity, poor urban black people (51.3 per 1,000) have rates of violence similar to poor urban whites (56.4 per 1,000).

These were also similar findings noted in Byrne and Sampson (1986).

The sample is exclusively metropolitan areas, with no prior conviction variables?

This data is only from 1970, so it isn’t longitudinal at all. They admit the data they use is flawed:

“The findings and conclusions of this investigation are thus limited by the well-known validity and reliability problems affecting all studies which utilize ‘offenses known to the police.’”

Additionally, they are analyzing an extremely small sample size (N = 125) so this doesn’t have enough statistical power and the results aren’t applicable to black people.

Additionally, they use OLS regression which is sensitive to outliers. Since there’s always going to be a select few individuals who don’t commit crime and a select few black people that do, this biases their coefficients.

Also, this study didn't account for all forms of crime, only those specific categories, and it never stated what black general crime rates were. Regardless, they say

“Controlling for population size, unemployment rate, and regional location, relative deprivation has significant effects on six of the seven index offenses. While the absence of an effect on robbery rates is puzzling, these findings are generally consistent with the proposition that crime is generated by economic inequality.”

It would make sense that in the 1980s, when black income inequality rates were extremely high, black crime rates would be high as well.

What peer-reviewed journal was this even published in?

Going back to Beaver, Ellis, and Wright’s 2019 review, the majority of studies also find a positive relationship between % black and crime. **None were negative** Thus, not only does poverty not have any causal relationship with crime, but the % of an area that is black positively correlates with an area’s crime rate. Poverty may not cause crime, but why are blacks lagging behind whites in their wealth? Leftists often blame historical sin for this, arguing that discrimination and racism led to blacks not having access to things and thus they were unable to pass on their wealth, leading to generational poverty. As far as I know, there is no study that has shown that if blacks were not victims of historical sin, the black-white income gap would be small or nonexistent.

What the fuck is this. The arbitrary skepticism here is through the roof. No, we should not have to create an entire counterfactual United States universe in which slavery, the KKK, and Jim Crow never existed in order to prove that those things had negative effects on black wealth. Jesus Christ.

Instead, it’s just story time with Leftists and them saying how “since this happened, it led to this and then this, and then that and a bunch of other things!” No piece of evidence is offered to show that if blacks weren’t discriminated against in the U.S., then they’d be at the same level or almost the same level as whites in regards to wealth. If anything, the null-hypothesis that racism is to blame for black poverty doesn’t hold up (see Last 2019).^[11]

Slavery

Sometimes, slavery is offered as an explanation for African American poverty rates. At first glance, this seems like a absurd explanation since without slavery African Americans would simply be Africans, and would no doubt be poorer than they currently are. Of course, people who say that slavery explains Black poverty do not mean “explanation” in the normal sense. They don’t suppose that Black poverty would be lesser had slavery simply not occurred. Rather, what they mean is that Black poverty would be lesser if African Americans had all been transported to America as free people instead of as slaves. Of course, this would not have happened in the real world, but we are encouraged to ignore this fact when discussing the causes of black poverty. This is unjustifiable, but for the sake of argument let’s consider the question of whether African Americans would be significantly richer today had they been brought to America as free people. The mechanisms by which slavery is often said to have retarded black economic progress is through inhibiting black education, by destroying the Black family, and by stopping Black Americans from accumulating wealth which they could transfer to their offspring. None of this stands up to empirical scrutiny. The best evidence on the effects of slavery come from comparisons of the descendants of free blacks and those who were enslaved. Beginning soon after emancipation, those black people who were freed from slavery were less literate, and poorer, than black people who were born free. This difference persisted for some time, but after two generations the descendants of enslaved and free black Americans did not statistically differ in terms of both education and economic success (Sacerdote, 2002).

First of all, just because black people were free, doesn’t mean they were economically/financially stable. While African Americans were the only freed slaves to be granted political rights so soon after emancipation, those rights were limited for a people without land, wealth, or job prospects. Most freed slaves left their former homes with little more than the clothes on their backs. Black men were ostensibly awarded the rights of citizenship, but even that was inconsequential if they were jobless and their families were suffering from hunger and want. Second of all, the study’s abstract says this:

“I find that it took roughly two generations for the descendants of slaves to ‘catch up’ to the descendants of free black men and women.”

So there was a disparity between enslaved and free African Americans, but it dissipated after two generations, suggesting that slavery did play a part in the disparity. However, it also says

“if we instead measure the progress of free blacks and slaves (and their descendants) relative to whites born in the same regions, then we find partial but not complete convergence.”

Also, this is a working paper, so it isn't final yet, which may be important. It also says:

“the persistence of black-white differences could be explained if a new set of discriminatory institutions rose up after emancipation (as in Wright [1986]).”

It also didn’t include things like wages and years of schooling. I also don’t consider these results particularly reliable, given that there was only an 11% success rate in matching fathers. Sacerdote says this is okay because the observations are similar between matched and unmatched fathers; however, table 13 shows t values as large as 15.94, indicating a significant difference. Other studies also find different results: Chandra (2001), Johnson and Neal (1996), and Heckman, Lyons and Todd (2000) do not find a narrowing of the black-white wage gap during the 1980-1990 period.
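As a rough check on how extreme a t value of 15.94 is, the sketch below converts a test statistic into a two-sided p-value. It assumes a large-sample normal approximation (the degrees of freedom behind table 13 are not given here), and the function name is purely illustrative, not something taken from the paper.

```python
import math

def two_sided_p_normal(t_stat: float) -> float:
    """Two-sided p-value under a standard normal approximation.

    Assumption: the sample is large enough that the t distribution
    is close to normal; exact degrees of freedom are not used.
    """
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

# Conventional 5% cutoff, for comparison.
p_threshold = two_sided_p_normal(1.96)
# Largest t value cited from table 13.
p_table13 = two_sided_p_normal(15.94)

print(f"p at t = 1.96:  {p_threshold:.4f}")
print(f"p at t = 15.94: {p_table13:.3e}")
```

Under that approximation, a statistic of this size sits far beyond any conventional significance threshold, which is why it is hard to treat the matched and unmatched fathers as similar on that variable.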

“In contrast to Sacerdote (2005), our parameter estimates of the intergenerational correlation between the labor market status of sons and their fathers exceeds two generations, such that through 1930, having a father who was a slave mattered and was associated with a lower labor market status.”

- Price (2017).

“Unlike Sacerdote, we find that the South’s peculiar institution had a persistent and stable effect across generations.”

- Penny & Reyes (2017).

This suggests that the effects of slavery had largely faded by the time we get to the grandchildren of slaves and makes it highly improbable that such effects still linger today. This may seem surprising, but it is consistent with other data on the intergenerational effects of wealth in 19th century America and in the South. For instance, Bleakley and Ferrie (2013) found that the descendants of those who won Georgia’s land-lottery in the 1830s fared no better for it in terms of their income, wealth, and literacy rates.

Bleakley and Ferrie updated their findings in 2016 and found the same result. They suggest that a pure cash transfer cannot increase intergenerational mobility in the United States.

However, studies have shown that the transitory income component could play a much larger role in intergenerational mobility and that some mobility can be promoted through such cash transfers.

They also show that human capital actually plays a smaller role in intergenerational income elasticity.

They also indicate that, when using a structural model to gain identification from theoretical restrictions on functional forms, approximately two-thirds of the intergenerational income elasticity can be attributed to human capital, with the remaining one-third attributed to the financial component (Cardak et al., 2013; Miller et al., 2019).

Analyzing the opposite sort of event, Ager et al. (2016) looked at data on those whose wealth was destroyed during the civil war due to slave emancipation and war related property destruction. Based on this analysis, they estimate that a person’s wealth being decimated by 10% predicted a 0.4% decrease in their children’s income. By the next generation, this effect probably wouldn’t significantly differ from zero. It certainly would not differ from zero for those born 100 years later.

SL is leaving out a part of this paper that runs contrary to his argument, namely that sons were able to rebound by around age 50, particularly if they remained in agriculture. The study says:

“Our estimate is consistent with previous work documenting high rates of occupational mobility in the nineteenth century US, suggesting that there was a substantial amount of inter-generational mobility even in the more agricultural and class-based US South (Ferrie, 2005; Ferrie and Long, 2007, 2013; Bleakley and Ferrie, 2016).”

The study even references Acemoglu & Robinson, who discuss how, when elites lose official power, they invest in alternate mechanisms to maintain control. For example, owners of large slave plantations simply morph into large landholders of tenant farms. This is corroborated by the fact that half of the planters in a sample of Alabama counties remained in the wealthiest strata of households both before and after the war.

So, if you’re talking about a 10% loss, they still have 90% of their wealth. That’s very important to recognize. They still have upward mobility because they have the resources and means to achieve it. Most of the people who lost property during the war in the South were already wealthy. They had money in assets, whereas slaves and black people had little to none. Also, most southern plantation owners basically re-enslaved their former slaves through sharecropping.

SL using this paper to argue against intergenerational mobility with regard to African Americans is simply not correct. This document looks to be exclusively about how white men lost a portion of their overall wealth after mass emancipation, and I'm not really sure what the point is beyond proving that. It just shows that rich white slave owners lost money and that this affected southern society as a whole, which of course it did. The source is saying that the wealth of the Southern landlords decreased. But it increased the wealth of the nation as a whole, since the Northern bourgeoisie industrialized the South after taking it. The victory of the North helped capitalism expand to the South, which was crucial to the USA. Technically, the slaves had nothing to begin with, so even a slight increase in income would count as high generational mobility. If you think about it for even a fraction of a second, you'll realize that the small effect is a product of the relative flatness of the wage market for the broader population over time. If the average worker loses 10% of their wages due to a bad recession, a Ponzi scheme, or whatever, they're still only making something like 30k, or 50-70k, and provided things don't drastically change, the same will ultimately be true of the average child too. It's just the way the numbers bear out due to the law of large numbers, averages, and so on. It has nothing to do with the actual reality of social mobility in the US.

This paper isn’t conclusive about anything given there’s literally no written conclusion…

- pg. 23.

This isn’t all that different from what modern studies find, especially among African Americans. For instance, Toney (2016) finds that a 10% increase in wealth among an American’s grandparents predicts a 1.8% increase in their own wealth if they are white and a 0.2% increase in wealth if they are Black.

Methodological problems with this study include its sample. It’s a PSID sample, which is longitudinal, yet the study does not report how many participants dropped out of the dataset due to panel attrition, so these results aren’t as representative. Moreover, the PSID has been sampling through telephone interviews since the 70s, so this study suffers from undercoverage, which suggests it is underestimating these results.

Cherry-picking the data from Table 5.5, which covers 1984 to 2007, means nothing because, interestingly enough, when you look at Table 5.6, which is more recent (2007 to 2013), it actually shows that the correlation in net wealth between a white grandparent and grandchild is 0.60, a significant increase.

However for Black people, the estimated correlation coefficient is around 0.09. Such correlation matches the account observed in previous estimates, that wealth in the black family tree still faces very little intergenerational mobility in net wealth. The study even says this:

“Our results indicate that there are racialized legacies in wealth, legacies that are key determinants of intergroup disparity.”

…Literally the last sentence of this graduate paper. The paper argues that there is little intergenerational mobility in net wealth between white grandparents and grandchildren and even less mobility between black grandparents and grandchildren, and that this disadvantage for black people is compounded by having fewer routes of intergenerational wealth transfer open to them.

“Clearly, fewer routes are available for blacks to carry their wealth across generations.”

Is the penultimate sentence of the paper. How does any of this prove that there is no systemic racial component to poverty?

One cannot infer convergence of ethnic differentials for the entire American population, even for white Americans, from a grandfather-grandson multigenerational model with a single intercept.

Regardless, there are a few more things omitted in this study that are actually antithetical to SL’s argument. The study says this:

“There are large and persistent differences in economic standing. Measured in 2005 dollars, the black-white ratio of median net worth is 22.20% [=$22,040/$99,256] for grand-parents in 1984 to 1989. This means that, at the median, black grandparents own about 22% of the wealth that is possessed by the white grandparents. Two generations later, for the adult grandchildren in 2007, the black-white wealth gap is 23.57% [=$18,842/$79,935]. Why are wealth positions of black Americans so far behind the wealth positions of white Americans? Katznelson (2005) provides context by showing how white wealth is boosted through affirmative action over the 20th century. For example, millions of white veterans benefit from the GI Bill of 1944, which allows them to accumulate human capital and financial assets. Through a combination of generous governmental assistance and no barriers in the administration of the policy, the GI Bill allows white veterans to purchase farms, open enterprises, own a home, and obtain grants for higher education. Such preferential treatment was not granted to black veterans.”

The study also says this:

“Indeed, the availability of family wealth declines, particularly for black families, which accelerates racial wealth inequality. In fact, the median wealth ratio drops to 15.21% [=$13,640/$89,688], adjusted in 2013 dollars. This implies that black adult grandchildren own only 15% of the median wealth that is owned by white adult grandchildren in 2013.”
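The bracketed arithmetic in the two quotes above can be reproduced with a few lines. This is only a sketch that recomputes the black-white median wealth ratios from the dollar figures as quoted; the dictionary labels and the helper name are my own, not the study's.

```python
# Median net worth figures as quoted from the study: (black, white).
medians = {
    "grandparents, 1984-1989 (2005 dollars)": (22_040, 99_256),
    "adult grandchildren, 2007 (2005 dollars)": (18_842, 79_935),
    "adult grandchildren, 2013 (2013 dollars)": (13_640, 89_688),
}

def wealth_ratio(black_median: float, white_median: float) -> float:
    """Black median net worth as a fraction of white median net worth."""
    return black_median / white_median

ratios = {label: wealth_ratio(b, w) for label, (b, w) in medians.items()}

for label, ratio in ratios.items():
    print(f"{label}: {100 * ratio:.2f}%")
```

The recomputed ratios match the quoted 22.20%, 23.57%, and 15.21% to within rounding, so the gap is essentially unchanged across two generations and then widens in the 2013 figures.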

In regard to the intergenerational wealth elasticity correlations SL is citing, the study clarifies what this actually means:

“Obviously, black wealth is much less likely to reproduce itself, relative to the higher wealth positions held and carried by white families. These results reveal the hysteresis, a term coined for lagging behind, in wealth across generations.”
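To make the elasticity figures concrete: the claim cited from Toney (2016), that a 10% increase in grandparents' wealth predicts a 1.8% increase for white grandchildren and a 0.2% increase for black grandchildren, can be read as an implied elasticity. The sketch below uses a simple linear reading (percent change out divided by percent change in), which is my simplifying assumption and not the study's estimation method.

```python
def implied_elasticity(pct_change_outcome: float, pct_change_input: float) -> float:
    """Implied elasticity under a simple linear reading (an assumption):
    percent change in the outcome per one percent change in the input."""
    return pct_change_outcome / pct_change_input

# Figures as cited from Toney (2016): a 10% rise in grandparents' wealth.
white = implied_elasticity(1.8, 10.0)
black = implied_elasticity(0.2, 10.0)

print(f"white grandchildren: {white:.2f}")
print(f"black grandchildren: {black:.2f}")
print(f"white-to-black ratio: {white / black:.1f}x")
```

On this reading, wealth carries across generations roughly nine times more strongly in white families than in black families, which is the "hysteresis" the study describes, not evidence that past disadvantage has washed out.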

More strikingly, research shows that black children born to parents in the top fifth of the income distribution are equally likely to occupy the top and bottom fifth of the income distribution when they grow up. By contrast, white children born into the top economic quintile are far more likely to stay there than fall to the bottom (Chetty et al., 2018). The speed with which wealth fades across African American generations today gives us even greater reason to doubt that differences in wealth from 150 years ago are still persisting in effect. This may seem like a surprisingly short period of time in which to expect the effects of major events to vanish. However, this is similar to, or greater than, the amount of time it seemingly took, for instance, the Irish to rebound from extreme repression by the English,

It is not accurate to say that the Irish "rebounded" from extreme repression by the English or that black people have not "recovered" from racism. The history of Ireland and the struggle against racism are complex and ongoing issues. In both cases, there have been significant challenges and obstacles, as well as a number of successes and achievements.
In the case of Ireland, the country faced centuries of oppression and repression under British rule. However, over time, the Irish people fought for and won their independence, establishing the Irish Free State in 1922, which became the Republic of Ireland in 1949. The country has faced many challenges since then, including economic and political instability, as well as ongoing tensions over Northern Ireland. However, Ireland has also made significant progress in recent decades, becoming a thriving and successful democracy.
In the case of racism, it is important to recognize that it is a deeply ingrained and systemic problem that has persisted for centuries. While there have been significant efforts to combat racism and promote equality, there is still much work to be done. Black people continue to face discrimination and inequality in many aspects of their lives, including education, employment, and the criminal justice system. It is crucial for society as a whole to continue to work towards eliminating racism and promoting equality for all people.

Japan to rebound following WW2,

The Japanese economic miracle is just that: a miracle, which was chiefly due to the economic interventionism of the Japanese government and partly due to U.S. aid and assistance to Asia. Black people didn’t even have this. It is not accurate to say that Japan "rebounded" following World War 2 or that black people have not "rebounded" from racism. The history of Japan and the struggle against racism are complex and ongoing issues. In both cases, there have been significant challenges and obstacles, as well as a number of successes and achievements.
In the case of Japan, the country was devastated by World War 2, with much of its infrastructure and economy destroyed. However, with the support of the United States and other countries, Japan was able to rebuild its economy and become one of the most prosperous and successful countries in the world. This was a long and difficult process, but Japan was able to make significant progress through hard work, determination, and a focus on education and innovation.

and countries like Estonia to recover from communism.

Eastern Europe has swung all the way back toward extreme religious fundamentalism, and it's no coincidence that these reactionary beliefs weren’t as prominent for most of the 20th century. But Eastern European people have their own governments, and they largely had their own governments even as satellite states (which is still not a good thing). Slaves have never had their own government; even their freed children and children’s children are ruled by the same power that shackled them.
It is not accurate to say that Estonia "recovered" from communism or that black people have not "recovered" from racism. The transition from communism to a market economy in Estonia, as well as the ongoing struggle against racism, are complex and ongoing issues. In both cases, there have been significant challenges and obstacles, as well as a number of successes and achievements.
In the case of Estonia, the government implemented a number of economic and political reforms in the 1990s, which helped to rebuild the country's economy and political system. However, the process was not without challenges, and the country continues to face a number of economic and social issues.

In general, it seems not unusual for populations to recover from great acts of oppression or exploitation within a few generations of the ending of that oppression or exploitation.

1. Not all oppression is the same, and neither is all recovery. 2. This is a survivorship bias fallacy. It is not unusual to see populations recover from great acts of oppression or exploitation within a few generations of the ending of that oppression or exploitation. Many countries and communities have overcome significant challenges and adversity to rebuild and thrive after experiencing periods of oppression or exploitation. This process can be difficult and complex, and it often requires a combination of hard work, determination, and support from the international community. However, with the right combination of factors, it is possible for a population to make significant progress in a relatively short period of time. It is important to remember, however, that every situation is unique and that the recovery process can vary depending on the specific circumstances and challenges involved.

More broadly, it is the case that environmental effects on people decay fairly quickly once the relevant environmental stimuli are removed. For instance, childhood interventions aimed at improving general cognitive abilities by giving children special educational attention, improved nutrition, and giving their parents classes in how to parent better, improve cognitive ability in the short run but this effect fades out after the intervention ends such that they have no long term effect (Protzko, 2015).

SL is addressing the fadeout effect and is misinterpreting the study. In accepting the "fade out" of IQ gains from interventions, it must be clarified that this argument does not support the hereditarian hypothesis. See Sauce & Matzel (2018):

“Absent the opportunity to assimilate into an environment that is matched to their new cognitive capacity (a forced loss in gene-environment correlation), it would be difficult to maintain or amplify the initial benefits afforded by the early intervention. Thus, much like the inter-generational Flynn effect, increases in IQ might be amplified, or at least sustained, by greater access to opportunities that often are inevitably distributed. In simpler terms, the analysis of Protzko should not lead us to conclude that early intervention programs such as Head Start can have no long-term benefits. Rather, these results highlight the need to provide participants with continuing opportunities that would allow them to capitalize on what might otherwise be transient gains in cognitive abilities.”

There is an idea whereby gains will “fade out” if, and only if, those in interventions return to their old environments and the necessary information is not taught. See Howe (1997), where they measured the effects of a four-year intervention program that emphasized math skills:

“For instance, to score well at the achievement tests used with older children it is essential to have some knowledge of algebra and geometry, but Seitz found that while the majority of middle-class children were being taught these subjects, the disadvantaged pupils were not getting the necessary teaching. For that reason they could hardly be expected to do well. As Seitz perceived, the true picture was not one of fading ability but of diminishing use of it.”

Thus the conclusion regarding the efficacy of interventions is meaningless when those in interventions like Head Start fared better than non-HS children on cognitive and socio-emotive measures and had fewer negative behaviors. See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3050648/pdf/nihms273531.pdf

The same is found for more targeted interventions which aim, for instance, at improving children’s language abilities by reading with them (Noble et al., 2018).

Another misinterpretation. The interventions were only 6-8 weeks long. They measured the effect of shared reading on language skills over... 6-8 WEEKS. The authors themselves also advise including more participants from different socio-economic backgrounds. So it's no surprise the effects are so small.

People often overestimate the persistence of environmental effects because children resemble their parents even well into adulthood and group differences often persist across generations. In the case of families, we know that this is an illusion created by genetics. For instance, Hyytinen et al. (2013) contains a review of 19 previous twin studies. On the basis of this literature, it was shown that variation in income in the United States is 41% due to genetics, 9% due to the childhood home environment, and 50% due to other factors (which technically includes certain special sorts of genetic factors, measurement error, random luck, etc., as well as the non-home environment). Parents and offspring share roughly half of their genes and thus the correlation between parents and offspring in income is mostly explained by genes.

There are classic problems with both Hyytinen et al. (2013) and race realists' interpretations of it. Firstly, ACE and ADE modeling (the latter accounting for dominance effects) represents a flawed understanding of genes and environment. The study tries to account for GxE, but see:
https://developmentalsystem.wordpress.com/2019/10/07/gene-environment-interactions-and-the-statistical-fallacy/
Their coefficient of relatedness is put into question. As stated before:

“Genes and environments combine in statistically heterogeneous and stochastic ways that evade detection without actual developmental research.”

Modeling such as in Purcell 2002 ONLY works when SPECIFICALLY modeling for the environment. Second of all, heritability estimates do not help identify particular genes or ascertain their functions in development or physiology, and thus, by this way of thinking, they yield no causal information (Panofsky). No genes have been “found” for income, and differences by way of race fail to account for the assumptions of the twin studies (assumptions, such as assortative mating and the EEA, that the study conveniently “addresses” in a couple of statements before proceeding). Even then, they must account for more than just a few biases (when dealing with such a topic, the bar ought to be set high for hereditarians to prove their arguments). Third, h2 estimates themselves wouldn’t tell us which traits are “genetic” or not (Bailey et al., 1997).

Moreover, there are several studies showing that people born into rich families end up having higher incomes, while the income of the family a person is adopted into has no effect on their income as an adult (Sacerdote, 2002;

First of all, this study literally says

“I find that adoptive parents' education and income have a modest impact on child test scores but a large impact on college attendance, marital status, and earnings.”

Second of all, the regressions were measuring the effects of the adoptive family’s socioeconomic status on the family income of the children once they became adults; however, the age for “adult” was only 23 years old. So of course you wouldn’t see a statistically significant result, because most 23 year olds don’t even have families yet.

Another reason for the statistically insignificant result is the likely underpowered analysis, given the small sample size, identical means, etc., so it’s likely a type II error.

Third of all, this study’s sample comes from NLSY 1979 data which is of a sample of children born from 1954 to 1964 so it’s incredibly old and outdated.

Fourth of all, the study literally notes that its shoddy, limited measure of family income is the reason why there seems to be no measurable effect:

“This is the weekly income (in British pounds) of the subject and his or her spouse if any. It is difficult to reconcile the results that family environment affects college attendance but not income. Perhaps lifetime incomes are affected, but a snapshot at age 23 does not pick this up.”

Studies like Björklund et al. (2004) analyzed later-life outcomes for adopted and natural children using data from the Swedish population register, with a large sample of 7,500 adopted children aged 26 and above, and found that the intergenerational persistence effects in both earnings and income for adoptees were positive and statistically significant.

Sacerdote 2004).

Sacerdote (2004) found the health, income, and education transmissions from parents to children to be higher for non-adoptees than for adoptees.
However, this study isn’t about income; it’s about college.
Additionally, the abstract literally says:

“For height, obesity, and income, transmission coefficients are significantly higher for non-adoptees than for adoptees.”

So this study is not saying what SL posits.
Before we accept the putative inference that education and income are predominantly genetically transmitted (while smoking and drinking are culturally transmitted), we must question the external validity of the adoptee sample. We know that families who adopt are a distinct social group on observables and, ipso facto, on unobservables, as are the adoptees themselves.

For example, if socialization is weaker among adoptees who do not feel connected to their adoptive parents, the difference in heritability could be weaker by virtue of this fact, not the absence of genetic similarity. Many other dynamics could be at work as well, such as increased (or decreased) parental investment, halo effects or stigma, and truncated genetic variability among adoptees (or adopters), which may work to bias estimates for this population in unpredictable ways. The only adoption study that would avoid such questions would be one in which adoptees were randomly selected from the newborn population and then randomly assigned to parents, with both groups blind to the treatment (i.e., not knowing whether they were adopted or not), all while the prenatal environment was held constant. In other words, it is impossible to reliably estimate genetic heritability using such an approach. This is crucial because of the deflated c2 and inflated h2 that result from certain biases (range restriction, selective placement, and family unit effects), all rightly pointed out by Richardson (2005). From one DSTsquad article, I picked up on a critique of McGue (a paper dedicated to defending the adoption study model by refuting claims of range restriction issues), where, among other critiques, they state: "the second issue here is that their method of testing for variance reduction was to compare the group of adoptive families in their sample to the group of ‘biological families’ in their sample. That adoptive families are selected is well-known: the impact of this selection on correlations is the issue under contention." Further, there is the discrepancy where h2 estimates differ from those of twin studies (often not significant, as seen in McGue).

Thus, there is little, if any, non genetically-mediated transmission of wealth and income even within a single generation.

If we allow for different degrees of genetic linkage of particular genes with other genes by population, then we cannot even plausibly say (for sure) that a given gene is responsible for the outcome in two different populations even if we observe the same marker-phenotype association (never mind GE interactions).

This gives us even greater to doubt that the economic effects of slavery could persist to the current day. Sometime, it is noted that black families were broken when kin were sold to separate slave owners and this is said to explain modern rates of single parenthood among blacks. The idea that black people today leave their children because their ancestors from the 1860s or before, who they have never met, were forcibly separated from their family is totally without evidence. The idea that these forces explain modern black family structure is down right implausible since black rates of single parenthood are far greater today than they were in the 19th century. Ruggles (1994).

I somewhat agree that slave auctions don’t explain fatherlessness today; however, slave auctions occurred before 1880. The source the author references analyzes data from 1880, so it doesn’t make sense to use it to make a claim about slave auctions. Even then, the paper says that during those times, black children were 2 to 3 times more likely to reside without one or both parents than were white children. It’s just that Black women had increased economic autonomy, while black men had less stable employment in the 1940s, partially because of technical work that they never received an education for. Mass incarceration was occurring too.

This brings us to yet another reason to doubt that the effects of slavery are still in the process of being eliminated: if this were true than the effects of slavery should be marginally lesser each generation and so we should see slow and steady economic improvements among African Americans.

With the system being tilted against them, not really. It’s not as if African Americans have a level playing field to develop on; instead, they have mostly been kept in poverty for the sake of capitalism. Slavery wasn't just a phase of capitalism, it was/is a process that is integral to its development. As such, the effects can never be eliminated within capitalism, because the existence of capitalism depends on these effects (structural inequality, poverty, racism, etc.) still being in place. Plus, racism fluctuates according to the intensity of the contradictions in American society.

Nothing like this has taken place for the last half century. A 2017 Federal Reserve report shows that white and Black working women had roughly equal wages in the 1970s and 1980s, but since the 90s a gap has appeared which favors white women. For males, there was already a wage gap present in the 1970s and it is even greater today.

This is because job market participation for women was still very low before the 80s-90s, then came an explosion during those decades of professional women, and this favored white women for the same reason it favored white men in comparison to black men. The wage gap between black and white working women in the 1990s was the result of a complex combination of factors, including systemic racism and discrimination, as well as economic and social inequality. Black women have historically faced significant barriers to education, employment, and other opportunities, which have made it difficult for them to earn the same level of income as white women.
In the 1990s, black women were more likely to work in low-paying jobs and industries, and they were also more likely to face discrimination and unequal treatment in the workplace. In addition, black women were more likely to be single mothers and to face the challenges and added responsibilities that come with raising a family on their own. All of these factors contributed to the wage gap between black and white working women during this period.
It is important to recognize that the wage gap between black and white women persists to this day, and that significant efforts are needed to address systemic racism and inequality in order to close the gap and promote equal pay for all workers.

Now, with respect to males many authors make the mistake of looking at the wages only of working men, neglecting the impact that changes in unemployment and incarceration rates have on average earnings. This is also a problem for the analysis of trends among females, but it is less severe. Bayer and Charles (2017) correct for this and make a striking finding: the difference between the median white and median black male worker in earnings is larger today than it was in 1940.

The difference between the median white and median black male worker in earnings is larger today than it was in 1940 for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which continue to disproportionately impact black workers and limit their access to education, employment, and other opportunities.
In 1940, black workers were more likely to be employed in low-paying jobs and industries, and they faced significant barriers to upward mobility. Over time, these barriers have persisted, and black workers continue to face discrimination and unequal treatment in the workforce. In addition, black workers are more likely to be underemployed or to work in part-time or temporary jobs, which can also impact their earnings.
Another factor that has contributed to the widening wage gap is the increasing income inequality in the United States. In recent decades, the top 1% of earners have seen their incomes rise significantly, while the incomes of middle- and low-income workers have stagnated or declined. This has disproportionately impacted black workers, who are more likely to be in lower-paying jobs, and has contributed to the growing gap between the median white and median black male worker in earnings.

With respect to net wealth, the situation is similar but more extreme.

The difference between the net wealth of white and black male workers is larger today than it was in 1940 for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which have disproportionately impacted black workers and limited their access to education, employment, and other opportunities.
In 1940, black workers were more likely to be employed in low-paying jobs and industries, and they faced significant barriers to upward mobility. Over time, these barriers have persisted, and black workers continue to face discrimination and unequal treatment in the workforce. As a result, black workers are more likely to have lower incomes and less wealth than white workers.
Another factor that has contributed to the widening wealth gap is the increasing inequality in the United States. In recent decades, the top 1% of earners have seen their wealth increase significantly, while the wealth of middle- and low-income families has stagnated or declined. This has disproportionately impacted black families, who are more likely to have lower incomes and less wealth, and has contributed to the growing gap between the net wealth of white and black male workers.

Since the 1960s, the Black-White Wealth gap has increased many times over.

Yes, because capitalism cannot fix inequality when it depends on inequality. The Black-White wealth gap has increased many times over since 1960 for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which have disproportionately impacted black families and limited their access to education, employment, and other opportunities.
In the decades following the Civil Rights movement, black families continued to face significant barriers to upward mobility and wealth accumulation. These barriers included discrimination in the housing and lending markets, as well as a lack of access to quality education and job opportunities. As a result, black families were more likely to have lower incomes and less wealth than white families.
Another factor that has contributed to the widening wealth gap is the increasing inequality in the United States. In recent decades, the top 1% of earners have seen their wealth increase significantly, while the wealth of middle- and low-income families has stagnated or declined. This has disproportionately impacted black families, who are more likely to have lower incomes and less wealth, and has contributed to the growing gap between the net wealth of white and black families.

Turning to employment, as shown in Fairlie and Sundstrom (1999), the Black-White unemployment gap appeared sometime in the 1940s and has massively widened since then.

The Black-White unemployment gap has widened since the 1940s for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which have disproportionately impacted black workers and limited their access to education, employment, and other opportunities.
In the decades following World War II, black workers continued to face significant barriers to upward mobility and job opportunities. These barriers included discrimination in the hiring and promotion process, as well as a lack of access to quality education and job training. As a result, black workers were more likely to be unemployed or underemployed than white workers.
Another factor that has contributed to the widening unemployment gap is the changing nature of the economy and the labor market. In recent decades, many jobs that were traditionally held by black workers, such as manufacturing and industrial jobs, have been automated or outsourced to other countries. This has disproportionately impacted black workers, who are more likely to be in these types of jobs, and has contributed to the growing gap between the unemployment rates of white and black workers.

With respect to home ownership, the trajectory is perhaps even more surprising. Collins and Margo (2011) show that the Black-White home ownership gap is today similar to what it was in 1910.

The Black-White home ownership gap is similar to what it was in 1910 for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which have disproportionately impacted black families and limited their access to education, employment, and other opportunities.
In the decades following the Civil War, black families were largely excluded from the housing and lending markets, and they were unable to access the same opportunities and benefits as white families. This was the result of discriminatory policies and practices, such as redlining and restrictive covenants, which effectively segregated black families into specific neighborhoods and denied them access to credit and other resources.
Despite significant progress in the decades since 1910, including the Civil Rights movement and the passage of fair housing laws, black families continue to face barriers to home ownership. These barriers include discrimination in the lending and real estate markets, as well as a lack of access to quality education and job opportunities. As a result, the Black-White home ownership gap remains similar to what it was in 1910.

And the rate of poverty among African Americans stopped declining in the 1960s, and actually began to rise thereafter if you exclude welfare payments when calculating income (“latent poverty”). Murray (1981).

Latent poverty, as the term is used in Murray (1981), refers to people whose income would fall below the poverty line if government transfer payments such as welfare were not counted. In other words, it measures how many people would be poor without public assistance, rather than how many remain poor after receiving it.
Latent poverty among African Americans increased after 1960 for a number of reasons. One of the main reasons is the ongoing existence of systemic racism and discrimination, which have disproportionately impacted black families and limited their access to education, employment, and other opportunities. In the decades following the Civil Rights movement, black families continued to face significant barriers to upward mobility and wealth accumulation, which made it difficult for them to escape poverty and build a better future for themselves and their families.
Another factor that has contributed to the increase in latent poverty among African Americans is the changing nature of the economy and the labor market. In recent decades, many jobs that were traditionally held by black workers, such as manufacturing and industrial jobs, have been automated or outsourced to other countries. This has disproportionately impacted black workers, who are more likely to be in these types of jobs, and has made it more difficult for them to achieve economic stability and security. As a result, latent poverty among African Americans has increased in recent decades.

Nothing in any of this data looks like African Americans are slowly recovering from the economic effects of slavery.

It is not accurate to say that African Americans are slowly recovering from the economic effects of slavery. The legacy of slavery continues to have a profound and lasting impact on the black community in the United States, and the economic effects of slavery cannot be undone or erased.
Slavery was a barbaric and inhumane institution that stripped African Americans of their labor, their dignity, and their freedom. It was also a key driver of the American economy, and it provided the foundation for many of the country's industries and institutions. The wealth and prosperity that was built on the backs of enslaved people has been passed down through generations, and it continues to benefit white Americans today.
While African Americans have made significant progress in recent decades, they continue to face significant barriers to education, employment, and economic opportunity. These barriers are the result of systemic racism and discrimination, which have persisted despite the end of slavery and the Civil Rights movement. It is crucial for society as a whole to recognize and address these underlying issues in order to promote equality and justice for all people.

This, combined with the direct data comparing the descendants of free and enslaved blacks, and the historical and contemporary data on the intergenerational transfer of wealth, and the historical recovery time for other populations following important hardships, gives us good reason to reject the view that slavery is an important cause of modern Black poverty rates. Some reasons blacks could be lagging behind whites in wealth and have a higher crime rate could be due to racial differences in behavior, something to be discussed below.

Look at the undisguised, seething contempt here. “Story time with Leftists.” As if the long, long history of black oppression in the United States is just some obscure Leftist fetish object and not an incredibly empirically verifiable pattern repeated countless times over. This blithely dismissive attitude and shifting of the burden of proof precludes critical thinking, such as “to what degree might historical oppression influence black behavior, and is that commensurate with the effect size?” or “even assuming there is zero racist oppression today, how long might historical oppression linger, and how much of the current disparity might be explained by these generational effects?”

Redlining

Moving on to redlining, Last (2019) has already discussed this issue at length. According to the available data, loaning is unrelated to the racial composition of an area once further controls are adjusted for.

It is not accurate to say that bank loaning is unrelated to the racial composition of an area once further controls are adjusted for in regression analysis. While it is true that regression analysis can be used to control for a variety of factors that may affect the relationship between bank loaning and the racial composition of an area, it is not always possible to fully control for these factors.
Racial discrimination and bias in the banking and lending industry are complex and multifaceted issues, and they cannot be fully explained or accounted for by regression analysis alone. There is a growing body of evidence that suggests that black and minority communities continue to face significant barriers to accessing credit and other financial services, even when other factors are controlled for.
Additionally, the relationship between bank loaning and the racial composition of an area is likely to be dynamic and changing, and it may be difficult to capture all of the relevant factors in a regression analysis. As a result, it is not accurate to say that bank loaning is unrelated to the racial composition of an area once further controls are adjusted for in regression analysis.

When it comes to public programs, Vaush is very vague about which programs he even means. Because of this, it’s not worth spending too much time on. Redlining, in conjunction with generational poverty, is argued to show how blacks were stuck in impoverished neighborhoods because they were denied loans, which in turn led to criminality and lower wealth. As has already been noted above, the correlation between poverty and criminality is not causal, and controlling for income does not remove race differences in crime.

Controlling for income in regression analysis does not necessarily remove race differences in crime. While it is true that regression analysis can be used to control for a variety of factors that may affect the relationship between crime and other variables, such as income, it is not always possible to fully control for these factors.
There is a growing body of evidence that suggests that black and minority communities continue to face significant disparities and inequalities in the criminal justice system, even when other factors, such as income, are controlled for. For example, studies have shown that black people are more likely to be arrested, charged, and sentenced more harshly than white people for the same crimes. This suggests that race remains an important factor in the criminal justice system, even when income is controlled for in regression analysis.
Additionally, the relationship between crime, race, and income is likely to be dynamic and changing, and regression analysis may not be able to capture these changes over time. As a result, it is important to use caution when making conclusions about race differences in crime based on regression analysis, and to consider other evidence and factors that may be influencing the relationship between these variables.

Racial differences in the ability to acquire a loan are sometimes pointed to as evidence of white privilege or anti-black bias. These differences are said to lead to racial disparities in home ownership rates which in turn have a variety of long-term economic and social consequences. Data from Pew shows that black people are indeed more likely to be denied a mortgage loan. However, even among blacks the rate of denial is only 27%. DeSilver and Bialik (2017).

This means about 1 in 4 black applicants is denied a loan, which is hardly “only”. From the Pew figures, a black mortgage seeker is 2.5 times more likely to be denied than a white mortgage seeker. How does this help SL’s argument against systemic racism? It's nowhere near as “only” as it is for white people. Indeed, to quote Pew,

“they have a much harder time getting approved for conventional mortgages than whites and Asians, and when they are approved they tend to pay higher interest rates.”
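As a quick check on the arithmetic, the two figures cited above (a 27% denial rate for black applicants and a ratio of about 2.5) can be turned into an implied white denial rate and a "1 in N" figure. This is a minimal sketch that uses only those two numbers; the implied white rate is derived from them and is not quoted from Pew.

```python
# Figures taken from the text above; nothing else is assumed.
black_denial_rate = 0.27   # share of black applicants denied
relative_risk = 2.5        # black denial rate divided by white denial rate

# White denial rate implied by the two figures (derived, not quoted from Pew).
implied_white_denial_rate = black_denial_rate / relative_risk


def one_in_n(rate):
    """Express a rate as 'about 1 in N'."""
    return 1 / rate


print(f"Black applicants denied: about 1 in {one_in_n(black_denial_rate):.1f}")
print(f"Implied white denial rate: {implied_white_denial_rate:.1%}")
print(f"White applicants denied: about 1 in {one_in_n(implied_white_denial_rate):.1f}")
```

Framing a 27% rate as "only" hides the comparison that matters, which is the ratio between the two groups.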

Turning to interest rates, it is true that Black people are more than twice as likely as whites to get a mortgage interest rate of 8% or more. But this is very rare even among black mortgage holders. The average interest rate seems to be similar among whites, Hispanics, and blacks, though possibly significantly lower for Asians. Pew (2017). To show that these disparities are due to racial bias, leftists try to show that black people have a harder time getting loans even after controlling for economically relevant variables. For instance, some have pointed out that Whites are more likely than Blacks to get loans approved when comparing people of equal incomes. But it would be fallacious to infer from this that racial bias was at work since Blacks and Whites with equal incomes do not have the same spending behavior. For instance, Borgo (2013) looked at data on 25,820 American households and found that Black homes had lower saving rates than White homes even after controlling for differences in income, age, family size, education, and marital status. Thus, it makes sense for banks to prefer White customers over Black ones even if they have the same incomes. Such differences in behavior explain why Blacks and Whites with equal incomes do not have the same credit scores.

Race differences in saving behavior cannot fully explain why black and white people with equal incomes do not have the same credit scores. While it is true that differences in saving behavior may affect a person's credit score, this is not the only factor that determines a person's credit score.
Credit scores are a measure of a person's creditworthiness, and they are based on a variety of factors, including a person's payment history, the amount of debt they have, the length of their credit history, and the types of credit they have used. While differences in saving behavior may affect some of these factors, they are not the only factor that determines a person's credit score.
Additionally, the relationship between race, income, and credit scores is likely to be influenced by other factors, such as discrimination and bias in the lending and financial services industry. Black people are more likely to face discrimination and unequal treatment when applying for credit and other financial services, and this can make it more difficult for them to access the same opportunities and benefits as white people, even when they have the same income.
Overall, while differences in saving behavior may play a role in the differences between black and white people's credit scores, they are not the only factor, and they cannot fully explain these differences.
Also, what is SL talking about here? The paper literally says

“the gap in savings remains for Mexican American households, but disappears for Black households after controlling for income.”

What journal was this paper even published in? Was it peer reviewed? Because the same author, Dal Borgo, has a paper in 2019 that actually posits the opposite: that there are no racial differences in saving rates. So do Gittleman and Wolff (2004).

Using quantile decompositions, Dal Borgo (2019) finds that after adjusting for socio-economic characteristics — including retirement assets — the gap in savings rates between Black and White households disappears. Based on SCF data, Black families are as likely or even more likely to save regular or irregular amounts as White families in each income quintile. For example, from 2010 to 2016, 70% of both Black and White families in the top income quintile said they saved regular or irregular amounts, whereas the share of savers was five to nine percentage points higher among Black families in lower income quintiles.

In addition, we chose savings rates close to national average saving rates, which, together with our other assumptions, also results in a total Black-White wealth gap at age 65 without policy interventions that resembles that observed in the SCF data for 2016.

Other studies also find that differences in savings behavior cannot explain the racial wealth gap: conditional on income, Black people save slightly more than Whites (Hamilton and Darity, 2010; Dal Borgo, 2019; Darity and Mullen, 2020). Additionally, all else equal, Black households tend to save more than White households (Petach, 2020; Hamilton and Darity, 2017; Dal Borgo, 2019; Darity and Mullen, 2020).

As reported by The Washington Post: “The study found that whites earning less than $25,000 had better credit records as a group than African Americans earning between $65,000 and $75,000. Overall, 48 percent of blacks and 27 percent of whites had bad credit ratings, as defined by Freddie Mac in this study.” – Loose, The Washington Post.

Ok, so what we need to see is 1,000 white people with a 600 credit score compared with 1,000 black people with a 600 credit score. Unless SL has that statistic, he is not proving anything relevant. Black people are more likely to face discrimination and unequal treatment in the lending and financial services industry. This can make it more difficult for them to access credit and other financial services, even when they have higher earnings.
Another reason why white people with lower earnings may have better credit records is that they may have more access to wealth and other resources that can help them to build and maintain a good credit score. For example, white people may have access to financial support and resources from their families and communities, which can help them to pay their bills on time and avoid defaulting on their loans. This can give them an advantage over black people with higher earnings, who may not have the same access to these resources.
Additionally, the relationship between race, income, and credit scores is likely to be influenced by other factors, such as the availability of credit and financial services in different communities. Black and minority communities may have less access to credit and other financial services, which can make it more difficult for them to build and maintain a good credit score, even when they have higher earnings.
Overall, there are many complex factors that can contribute to the differences between black and white people's credit scores, and it is important to consider all of these factors when trying to understand and address these disparities.

That being said, some studies find that racial differences in loan acceptance persist even after adjusting for credit score differences. This is true, but it is also true that the credit scoring system doesn’t work equally well for Blacks and Whites. Consider the following from a report given to Congress by the Federal Reserve on how well loan performance is predicted by credit scores: “Consistently, across all three credit scores and all five performance measures, blacks… show consistently higher incidences of bad performance than would be predicted by the credit scores.” In other words, if you give out a loan to a Black and a loan to a White with equal credit scores, you are more likely to get your money back from the White. So, the normal attempts to control racial differences in loan risk are insufficient and cannot be taken as providing good evidence for the existence of racial bias.

So this report is not conclusive and is being misrepresented. The report says,

“the data assembled for this study can provide only limited insights into the relationship of credit scores to credit performance, availability, and affordability (and essentially no insight into whether the relationship is one of cause and effect). The data do not contain key variables that would need to be taken into account. Missing data include other underwriting factors, such as loan-to-value ratios in the case of mortgages, and the weight given to credit scores relative to these other factors. Missing data also include underlying differences in socioeconomic factors such as wealth and employment experience; only a rough estimate of individual income is available. Moreover, the credit-record data used here are for a brief period in time and therefore cannot reflect changes over time in the relationship between credit scores and the availability or affordability of credit.”

So they admit that the relationships described are not causal, meaning being black doesn’t cause worse loan performance, and they admit that the scope of this report is limited, at least in the time period it covers. Next the report says

“we use information from the Federal Reserve Board’s 2004 Survey of Consumer Finances (SCF) to explore the possibility that differences in, for example, wealth, employment history, and financial experience might explain some, or perhaps all, of the remaining differences in performance, availability, and affordability across groups. Inferences from this analysis are only suggestive because the information cannot be linked either to the individuals in the study sample or to their credit-related performance or loan terms. Assessment of the SCF data shows that younger families differ substantially from older families over a wide variety of financial dimensions. Variations across age groups in income, wealth, and their components and in debt-payment burdens and savings largely reflect the life-cycle pattern of income; that is, income rises as workers progress through their careers and falls sharply upon retirement. Also, younger individuals are more likely to experience recent bouts of unemployment. None of these factors were explicitly accounted for in the multivariate performance analysis conducted with the credit-record data. The SCF data show that income, wealth, and holdings of financial assets are substantially lower for black and Hispanic families than for non-Hispanic white families. Debt-payment burdens and propensities for unemployment are also higher for blacks and Hispanics. These racial patterns generally hold even after accounting for age, income, and family type. Differences in educational attainment and credit-market experience may relate to financial literacy. For example, high-school and college graduation rates among Hispanics are below those for blacks, which, in turn, are lower than those for non- Hispanic whites. Each of these factors, none of which were included in the credit-record analysis, may at least partially explain differences in performance across racial or ethnic groups.”

So they admit that other variables, rather than race itself, may explain the performance gaps.

That being said, there are at least four lines of evidence which suggest that racial differences in loan acceptance, and interest rates, are due to race being accurately used as a proxy for investment risk and not due to racial animus.

It is not good to use race as a proxy for investment risk. Using race as a factor in investment decisions is not only unethical, but it can also be counterproductive and harmful.
Racial discrimination and bias are pervasive and systemic problems, and they can have a profound and lasting impact on black and minority communities. Using race as a proxy for investment risk can reinforce and perpetuate these inequalities, and it can deny black and minority communities access to the same opportunities and benefits as white communities.
Additionally, using race as a proxy for investment risk can lead to incorrect or misleading conclusions about the risks and potential returns of different investments. Race is not a reliable or valid predictor of investment risk, and using it as such can result in poor investment decisions and suboptimal outcomes.
Overall, it is important to recognize that using race as a factor in investment decisions is unethical and harmful, and it should be avoided. Instead, investment decisions should be based on objective, verifiable criteria that are relevant to the potential risks and returns of the investment.

First, this is suggested by the pattern of racial differences in approval rates by credit score. The previously noted report explains that black people being more risky than white people to give loans to after holding credit scores constant is mostly true of those with low credit scores. Among those with high credit scores, there isn’t much of a difference. Now, a study by the Chicago Federal Reserve found no racial bias in loan approval rates among those with a good credit score but a significant bias in favor of whites among those with a bad credit score.

This study has no clear design to follow. The methodology is basically non-existent, and there are no regression outputs, effect sizes, or tables to show how conclusive and reliable these results actually are (it is also prone to type II error, since the sample size leaves it statistically underpowered). It is also odd that the study reports a logistic regression, a method meant for binary outcomes such as approval versus denial, without showing any of the coefficients or fit statistics needed to judge it. It’s hard to conclude anything from a three-and-a-half-page study with no real design.

Similarly, Ross et al. (2004) find that black borrowers have a tougher time getting loans but this is only true among those who don’t have mortgage insurance.

So first of all, their definition of “those that don’t have mortgage insurance” is inflated because they excluded those who were denied PMI from the analysis, which obviously affects these results. The abstract of the paper says:

“This paper suggests that lenders may favor applicants from Community Reinvestment Act (CRA)-protected neighborhoods if they obtain private mortgage insurance (PMI) and that this behavior may mask lender redlining of low-income and minority neighborhoods.”

In other words, it literally says that for those who do have mortgage insurance, the disparity is still there but it’s just concealed. They only “appear to treat applications from these neighborhoods more favorably when the applicant obtains PMI.” In fact, from Table 6 we see this to be the case. Per the study:

“When the threshold defining a low-income neighborhood is raised, the second two specifications in Table 6, the results are consistent with redlining by tract racial composition, and no evidence of redlining by tract income is found. Evidence of redlining by race reappears when the specification includes a tract income measure that is less highly correlated with race. This study cannot distinguish, however, between a situation in which racial redlining is masked by this high correlation and a situation in which redlining only occurs against low-income neighborhoods. The positive effect of PMI on loan approval still arises, however, based on tract income. The evidence points to the fact that CRA has improved credit access based on neighborhood income rather than neighborhood racial composition.”

So this paper isn’t making an argument on race, it’s on income. In fact it supports that race affects the probability of getting a loan.

Thus, lenders are acting exactly as we would expect them to if they were accurately using race as a proxy for investment risk. The second line of evidence deals with default rates by race. Like many studies, Berkovec et al. (1994) find that loans taken out by black people are more likely to end in default. This remains true after controlling for the size and type of loan as well as characteristics of the borrower such as their age, income, and liquid assets value. If you think black people are discriminated against in the loan market you might expect that black people must meet higher standards, and so ensure a lower risk of default, than white people in order to get loans. These results show that this is not true and so this is evidence against racial bias.

The study says that a large portion of default rates is explained by other variables:

“Thus, approximately one-half of the differential in observed default rates between whites and blacks can be explained by differences in other characteristics.”

This study isn’t very longitudinal either, as it only uses data from 1987 to 1989.

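Also, the regressions have large negative intercepts (-4.9410, -4.8248, and -6.8573 for the 1987, 1988, and 1989 cohorts respectively). If these are logit models, that means the baseline probability of default is very low, so the race coefficient should be read as a shift around a small base rate and not as a large absolute difference in default risk. Read that way, the risk of default for black borrowers is being overstated.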
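For context on what intercepts of this size mean: assuming these are logit models of default (an assumption here, not something stated above), an intercept is the log-odds of default when every other regressor equals zero, and the logistic function converts it to a probability. The helper name `logistic` is illustrative.

```python
import math


def logistic(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Intercepts quoted above for the 1987, 1988, and 1989 cohorts.
intercepts = {1987: -4.9410, 1988: -4.8248, 1989: -6.8573}

# Baseline default probability implied by each intercept, ASSUMING a logit
# model and all other regressors set to zero.
baseline = {year: logistic(b0) for year, b0 in intercepts.items()}

for year, p in baseline.items():
    print(f"{year}: baseline default probability = {p:.4%}")
```

Under that assumption each baseline is below 1%, so the coefficients in these models are shifts in log-odds around a very low starting probability.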

Let’s take note of antithetical claims in the paper:

“We also find that the proportion of census tract populations accounted for by blacks is not strongly and consistently related to the likelihood of loan default,”

In fact the study even shows the “CTBLACK” coefficient is negative and statistically insignificant in the 1989 cohort. Let’s also take note that this paper says

“While we have sought to exploit the data set as fully as possible to account for relevant determinants of default likelihoods, we clearly have not accounted for all such determinants, and to the extent that they are correlated with race or ethnicity, biased estimates will result. Also, since the basic prediction depends on a number of assumptions regarding the nature of discrimination, reported results are conditional on those assumptions holding. For example, the model assumes that lending bias takes the form of different standards of creditworthiness for different groups, rather than the form of random rejections of loan applications that differ on average across groups. Other forms of discrimination might lead to different performance effects. For example, if discrimination led lenders to foreclose more quickly on black borrowers than other borrowers, this could result in higher default rates for black borrowers.”

Some may claim that high default rates among black borrowers are to be expected because they are charged far greater interest rates than whites are. This explanation is not compelling though given the size of the unexplained gaps in interest rates between races which remain after controlling for obvious confounds. For instance, Cheng et al. (2014) analyzed data from the U.S. Survey of Consumer Finances from the years 2001, 2004, and 2006 and found that controlling for measures of consumer behavior and debt risk reduced the black-white average interest rate gap to .29%. This remaining gap is small and may itself be explained by variables correlated with race which Cheng et al. did not measure.

So the paper actually says,

“after all variables are controlled for, racial discrepancy in mortgage rates remain remarkably stable and persistent at around 29 – 31 basis points in favor of white borrowers.”

The t statistic was 11.02, meaning the gap between black and white borrowers is highly significant. FB’s framing is misleading because he doesn’t seem to know how basis points work. An interest rate gap of “.29%” means that black borrowers pay an annual rate 29 basis points higher than white borrowers, which is not small and actually matters in the mortgage market. Because that extra 0.29 percentage points is charged on the outstanding balance for the life of the loan, it translates to paying thousands of dollars more over the course of a typical mortgage.
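To make the scale of a 29 basis point gap concrete, here is a rough sketch using the standard fixed-rate amortization formula. The loan size ($200,000), the 30-year term, and the 4.00% baseline rate are hypothetical values picked for illustration; none of them come from Cheng et al. (2014).

```python
def monthly_payment(principal, annual_rate_pct, years=30):
    """Standard fixed-rate amortization payment."""
    r = annual_rate_pct / 100 / 12   # monthly interest rate
    n = years * 12                   # number of monthly payments
    return principal * r / (1 - (1 + r) ** -n)

# Hypothetical inputs (assumptions, not figures from the paper)
principal = 200_000      # loan amount in dollars
base_rate = 4.00         # baseline annual rate in percent
gap_bp = 29              # the gap reported in the paper, in basis points

higher_rate = base_rate + gap_bp / 100   # 100 basis points = 1 percentage point

pay_base = monthly_payment(principal, base_rate)
pay_high = monthly_payment(principal, higher_rate)

extra_per_month = pay_high - pay_base
extra_total = extra_per_month * 30 * 12  # extra paid over the full term

print(f"Extra per month: ${extra_per_month:,.2f}")
print(f"Extra over 30 years: ${extra_total:,.2f}")
```

Under these assumed inputs the extra cost comes to a bit over $30 a month, which adds up to roughly $12,000 over the life of the loan. Different loan sizes or baseline rates will move the number, but not the order of magnitude.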

In any case, it seems unlikely that an interest rate gap of this size could explain racial gaps in default rates. Moving on, in my view, the strongest evidence against racial bias in lending comes from Bhutta and Hizmo (2019). They analyzed a data set consisting of all FHA-insured mortgages that originated in 2014 and 2015. After controlling for lender effects, credit score, and income, they found a black-white interest gap of .03% and a Hispanic-white gap of .015%. This result is similar to what we’ve already seen, but, unlike most research in this area, Bhutta and Hizmo also included data on discount points and this revealed a racial difference in favor of non-whites. Combining this data into a single model, they found no racial bias in borrowers’ expected pay schedules. Even more importantly, it is shown that the expected revenue generated by a loan does not significantly differ by the race of the borrower.

This study needs to do several things in order to be valid:
1. A more granular measure of loan costs, such as the total dollar amount of fees and charges paid by borrowers, rather than relying solely on the APR. This would capture differences in pricing that are not reflected in the APR calculation and provide a more accurate measure of discrimination.
2. Include additional variables in the analysis that could affect interest rates but are not currently controlled for, such as borrower income or employment status.
3. A qualitative analysis of lender practices and borrower experiences in order to better understand the mechanisms underlying any observed differences in interest rates between black and white borrowers.
What about unaccounted-for temporal trends that could affect the interpretation of the study's results? For example, if there were changes in lending practices or market conditions over time that disproportionately affected black borrowers, then this could lead to differences in interest rates that are not captured by the study's controls. Additionally, if there were changes in borrower characteristics or preferences over time that are correlated with race and affect interest rates, then this could also bias the results. One way to address this concern would be to conduct a difference-in-differences analysis that compares changes in interest rates for black and white borrowers before and after a policy change or other event that affects lending practices. This approach would help to isolate the effect of temporal trends from other factors that could affect interest rates and provide a more robust estimate of discrimination over time.
This study is not necessarily generalizable to the broader mortgage market, as it only includes data on FHA-insured home purchase loans and does not include other types of mortgages. Additionally, the time period covered by the study (2014 and 2015) may not be representative of the broader mortgage market at other times. It is always important to carefully consider the specific data and methods used in a study when interpreting its findings and determining its generalizability. I don’t consider these results conclusive, since they come from merging two different datasets (HMDA and FHA), and the merge was unsuccessful for 29% of FHA loans in their analysis sample (31% for HMDA loans).
They also don’t correct for:
1. Data quality and completeness: Both the FHA and HMDA data are subject to various sources of error and incompleteness, which can affect the reliability and validity of the matched data. For example, the FHA data may not include information on all FHA-insured loans, and the HMDA data may not include information on all mortgages originated by financial institutions. As a result, the matched data may not be representative of the entire population of borrowers and lenders, and may not provide a complete picture of mortgage lending patterns.
2. Limited analytical capabilities: The matched FHA and HMDA data can provide valuable information on mortgage lending patterns, but it has limitations in terms of the types of analysis that can be performed. For example, the matched data cannot be used to analyze the reasons behind any disparities in access to mortgage credit that may be identified, or to assess the impact of these disparities on borrowers' outcomes.
The study does not rule out the possibility that lenders may be using race and ethnicity as a proxy for unobserved risk factors and charging minorities more due to an expectation of elevated risk. If they are, that would be an example of systemic racism because it perpetuates and reinforces existing racial disparities in access to credit and wealth accumulation. This practice would result in minorities paying more for credit than similarly situated non-minorities, which could limit their ability to build wealth and achieve financial stability. This could contribute to a cycle of poverty and inequality that disproportionately affects minority communities.
The study used regression to measure the revenue gaps between different groups, but this method has some potential drawbacks. For example, if differences in liquidity needs or preferences lead borrowers from different backgrounds to sort into different locations on point-rate schedules, then interest rate disparities could be underestimated, since the estimates would not account for these systematic differences when looking at total fees charged by lenders. Additionally, APR rises slightly more than 1 for 1 with the interest rate, meaning the estimate of fees may be lower for loans associated with a higher interest rate, which could also contribute to underestimating any existing discrepancies across racial lines in mortgage pricing.
There was also a lack of data from rejected applicants, so further research is necessary for an accurate picture of disparities across the groups applying for mortgages.
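The difference-in-differences comparison suggested above can be sketched in a few lines. This is only a minimal illustration of the arithmetic: the interest rates below are made-up numbers, not data from Bhutta and Hizmo (2019), and the helper names `mean` and `diff_in_diff` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def diff_in_diff(group_a_before, group_a_after, group_b_before, group_b_after):
    """Change in group A's average minus change in group B's average."""
    change_a = mean(group_a_after) - mean(group_a_before)
    change_b = mean(group_b_after) - mean(group_b_before)
    return change_a - change_b

# Made-up annual interest rates (percent) before and after a hypothetical
# policy change. These are illustrative assumptions, not real observations.
black_before = [4.50, 4.60, 4.55]
black_after = [4.40, 4.45, 4.50]
white_before = [4.20, 4.25, 4.30]
white_after = [4.15, 4.20, 4.25]

did = diff_in_diff(black_before, black_after, white_before, white_after)
print(f"Difference-in-differences estimate: {did:+.2f} percentage points")
```

In this toy data both groups' rates fall after the event, so a simple before/after comparison would mix the policy effect with the shared time trend. Subtracting the white borrowers' change removes that shared trend, which is the point of the design. A real analysis would also need controls and a check that the two groups were trending in parallel beforehand.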

This evidence is very hard to reconcile with racial bias. The fact that, once other differences are held constant, races experience the same expected pay schedules directly suggests that no bias exists. The fact that the expected revenue of loans does not differ by race strongly suggests that the differences in the terms of loans given to blacks and whites reflect lenders accurately forecasting the terms which will maximize profit within each race of borrowers. It is hard to see how this result could come about if people were acting on the basis of racial animus rather than economic rationality. Finally, more evidence that racism is not the cause of differences in loan approval rates comes from a study of several thousand banks which found that Black-owned banks discriminated far more harshly against Blacks than did White-owned banks. Specifically, at a White owned bank a Black person was found to have a 78% higher chance of rejection for a loan compared to a White person. At a Black-owned bank, this figure rose to 179%, an increase of 101%.

This paper lacks the information needed to draw any sort of conclusion. For example, it reports no R² or standard errors with which to assess the explanatory power of these models. In fact, the paper even says

“the intent of this paper is not to explain the underwriting decision”.

FB doesn’t understand what this paper is trying to prove. The paper’s authors are just suggesting that their evidence supports the notion that single-equation models are not reliable tests of lending discrimination.

“Most of the traditional research in lending discrimination employs a single-equation model to test for bias. Importantly, while the single-equation method can be employed to test for the absence of discrimination (Rachlis and Yezer 1994), Yezer, Phillips, and Trost (1994) find that this statistical method produces false positive indicators of discrimination. They argue that these false indicators are more likely to be found at institutions that actively lend to minorities. This implies that tests of black-owned institutions may yield results that falsely indicate lending discrimination.”

So in essence, they are trying to show that their own finding of discrimination at black-owned banks is just the result of type I errors, false positives, unreliability, etc.

Thus, racial differences in the riskiness of loans seem to account for why Blacks have a harder time getting loans than White people do, and why their interest rates tend to be slightly higher. A narrative related to racial bias in lending concerns the practice of redlining. Essentially, the idea is that in the 1930s the US government created maps demarcating certain neighborhoods as high risk for investment. One of the variables they utilized when estimating an area’s degree of risk was that area’s racial composition. Lenders then became less likely to give out loans to people in these communities and, through public housing and zoning laws, black people were moved into these same communities making them even blacker than they initially were. Thus, black people are said to be at a disadvantage in the loan markets because of the neighborhoods they live in. Importantly, this bias only impacts race indirectly. The discrimination is directly on neighborhoods and so should apply equally to people of all races who live in these majority black areas. So, that’s the story. It’s been tested many times and shown fairly consistently to be false. This falsification consists of studies showing that the probability of people getting a loan does not relate to the racial composition of their neighborhood once economically relevant confounders are controlled for. In contrast to the standard left-wing argument, racial composition of a neighborhood does not correlate to the probability of someone getting a loan once further variables are adjusted for.

There are problems with the rest of the data in the article. First, I believe the studies discussed only take samples from near the end of redlining. Because of this, it can be argued that the SES factors being controlled for, which produce the apparent equality in housing loans, themselves stem from redlining, so controlling for them is bad practice. Some more recent data find a big disparity in black population between red areas and other areas that doesn’t fit with this analysis.

Looking at data from Pittsburgh, Ahlbrandt (1977) noted that race had no significant independent effect on lending.

First of all, Ahlbrandt (1977) deals with mortgage redlining at a time when race was not required on loan paperwork. Redlining refers to the practice in which lenders such as banks worked from maps sorting areas into 4 categories, the 4th of which was for poor people and “negros”.

Second of all, Ahlbrandt (1977) says

“although racial concentration is not statistically significant and racial change is positively associated with mortgage lending, this does not conclusively refute the hypothesis that racial redlining exists.”

Third of all, the study has limitations that FB does not note:

“Differences between the lending functions estimated for each of the housing market segments may be attributed to variations in the lenders' loan review criteria and/or to the unexplained variance in the regression equation. Given the relative similarity of the statistically significant variables, there is some confidence in assuming that the factors considered in the loan review process are comparable among the two sets of data. However, the weights assigned to these variables are dissimilar. A comparison of equations (10) and (12) shows that the lending function in neighborhoods with greater than 20% black population has a more negative constant term and a steeper slope. The constant term has little meaning from a policy standpoint and should not be interpreted as the equation's intercept in a mathematical sense; rather it represents the mean effect of the left out variables. However, the steeper slope, as evidenced by the larger regression coefficients, may be interpreted as reflecting a more conservative approach to lending in neighborhoods with higher racial concentrations. From the standpoint of the loan review process, this translates into greater emphasis on, or higher weight attached to, income and neighborhood risk factors. To the extent this is true, a marginally qualified borrower in a predominantly black area may be denied a loan that would otherwise be granted in a white neighborhood. Unfortunately, the analysis does not provide sufficient evidence to prove or disprove this scenario. Therefore, racial considerations could be indirectly operating through their influence on a lender's analysis of other variables in the loan review process.”

Data problems include:

“The small sample size for the greater than 20% black tracts introduces its own problems but could not be avoided. There is also a measurement error in the data which occurs as a result of relating a dependent variable for 1973 and 1974 to independent variables for 1970. In addition, measured income is used as an independent variable instead of permanent income. All of the variables are census tract data. A problem could arise in the interpretation of the results if there is not a reasonably good correspondence between the characteristics of the mortgage applicant and those of the census tract in which the loan is made (or turned down). The only two independent variables where potential problems might arise are income and race. It seems reasonable to assume that relative differences in census tract median incomes are maintained in the distribution of median incomes of mortgage applicants across census tracts.”

Again, the approach they use in this study is flawed. One approach is to estimate the equation by census tract; redlining is inferred if loan flows fall as minority population increases. Ahlbrandt (1977) was one of the first published studies of redlining using HMDA data and took this approach. This is a flawed test if redlining occurs by neighborhood: census tracts are too small to qualify as distinct “neighborhoods,” and larger communities are ignored.

LaCour-Little and Green (1998) published a variant of this census-tract-based approach and found that high minority areas were likely to have systematically lower appraisals, a factor that would lead to lower mortgage flows, ceteris paribus.

Once socioeconomic status variations were controlled for, Dingemans (1979) found that ethnicity and age contribute little explanation on getting loans.

First of all, the paragraph immediately following the one that says “once the effects of socioeconomic status variations are accounted for, measures of ethnicity or age of the housing stock contribute little explanation” notes that this judgment is flawed. The author even says

“The lack of data for some major loan sources and the fact that nearly identical patterns [of Home Mortgage Disclosure Act data] are found in Sacramento - a metropolitan area where few complaints about redlining are heard - should act as reminders that the examination of Disclosure Act data above may not be a sufficient basis for making final conclusions about the processes and behaviors that underlie the patterns that are being found.”

Based on his analysis, he should have said, "is not" rather than "may not" be sufficient. The correlation between minority neighbourhoods and lower loan rates can be seen in figs 3 & 5.

So the author gives primacy to overall socioeconomic status of neighborhoods across the entire dataset but also notes that ethnicity affected lending. He then goes on to point out the most important fact, which is that the dataset he was working from was severely limited, only covering the years 1975-78.

So no, “ethnicity and age contribute little explanation on getting loans” and “redlining wasn’t based on race” is very much not what the study is saying.

So, Sacramento lacks the characteristics usually associated with places troubled by redlining. The area is generally prosperous; and because of rapid growth, few parts contain housing built before 1900. There are no areas with concentrations of deteriorated, abandoned housing. Only about 5 percent of the population is black and 10 percent Chicano; while the minorities are concentrated, there is substantial integration. Correlating the mortgage loans with census tract characteristics, Dingemans finds the same kind of patterns that many other studies have found: the number of loans per existing single family residence is strongly and positively correlated with income, negatively with percent black or Chicano. Age of the housing stock is negatively correlated with loans, as is blue collar employment. Distance from minority concentrations and from the CBD itself is positively correlated with lending. The examination of home improvement loans yielded generally similar results, but there was no clear pattern to multifamily loans.

Also, this study is only from Sacramento and it is from 1979. Here is a more up to date one:

https://www.linesbetweenus.org/sites/linesbetweenus.org/files/u5/redlining_revisited.pdf

Avery and Buynak (1981) found that areas which were stably black and stably white did not differ significantly in loan rates once economic variables were held constant.

Except they did differ. Avery and Buynak say

“However, it also appears that the portion of mortgage financing provided by banks and savings and loans is significantly lower in integrated and all-black neighborhoods than in all-white neighborhoods. This is particularly prevalent in changing neighborhoods where the percentage of blacks is rising.”

They even say

“it should be stressed that these findings, like those of previous redlining studies, are based on reduced form regressions. It is difficult to know whether there have been sufficient controls for demand and risk factors such that strong inferences can be drawn about supply. There is also a concern that the seven- year to nine-year gap between the lending data and 1970 census tract demographics may have caused distortions, particularly in changing neighborhoods.”

They also say

“one explanation for this pattern is that, as argued earlier, financial institutions may feel that all-black and/or integrated neighborhoods are more risky than comparable all-white neighborhoods. Because of this higher perceived risk, banks and savings and loans may reason that they cannot offer conventional mortgage loans in these areas at the same rates as in white areas or at rates that can compete with government-insured and sometimes subsidized loans.”

This is precisely what redlining is. They then go on to say

“They could, of course, offer conventional financing but at higher rates. However, there seems to be a reluctance to offer differential interest rates by neighborhood. A more likely alternative would be to offer the same rate but set higher credit standards in risky neighborhoods, thus relegating a higher fraction of the mortgage business to other lenders. Over time, real-estate brokers, recognizing this fact and knowing the high transactions costs involved in mortgage applications, would steer high-risk neighborhood clients to FHA/VA-insured mortgage bankers where applications more likely would be accepted.”

Tootell (1996) looked at data from Boston and found that the racial composition of an area was unrelated to the proportion of loan applications that were rejected in the area.

Well, first of all, Boston did not have this phenomenon to the extent that Chicago, New York, New Orleans, and other major cities of the time did. It’s also worth noting that the Federal Reserve was itself in on redlining those communities, so it was not going to admit it.

Even if the study found “very little” redlining, that still doesn’t change the fact that it was a practice allowed to happen in the banking industry.

Second of all, Tootell actually stated the opposite within the first couple of pages. Like literally in one of the first data points.

This contradicts the claim and actually proves our point: that minorities in white neighborhoods created a ripple effect in how redlining was done. It also clearly shows that minorities were not welcome in white real estate.

These results are also questionable because the data pool was not big enough to substantiate the claim itself. It’s also worth noting that the data pool was about 30% minority communities shared with white people, which is really skewed data.

Even with skewed data, it proves redlining still happened. Tootell says

“The racial composition of the neighborhood does not appear to directly affect the mortgage lending decision, but the race of the applicant does.”

This is in line with what redlining is, as it can be direct or indirect. The reason they can’t detect an effect for racial composition is that, as they say,

“minority applications in the Boston sample were not overly concentrated in minority areas.”

Tootell also says

“some evidence suggests that the decision to require PMI depends on the minority composition of the tract. This indirect form of redlining would increase the price paid by applications from these areas.”

Again, another indirect form of redlining.

Also, the same author co-published another Boston study (Ross and Tootell et al. 1998) which, when accounting for the relationship between redlining and private mortgage insurance, finds redlining against low-income neighborhoods, which in Boston are largely black. They even conclude that their test for redlining based on minority status has little power in the Boston area.

https://www.urban.org/sites/default/files/publication/66151/309090-Mortgage-Lending-Discrimination.PDF

(Bradbury et al. 1989 found a significant effect for race in Boston but their regressions omit more variables). The idea that redlining increased racial inequality also seems unlikely in light of the fact that the black-white home ownership gap today is similar to what it was in the 1920s before redlining began (Collins and Margo 2011).

I mean, this paper is saying home ownership improved for people after the Civil War? Of course it did: they went from legally being property to being able to own property. Redlining persisted from the 30s into the 60s, effectively creating inner cities of concentrated poverty and minorities through race-based lending, while white people were allowed to build generational wealth through real estate in the suburbs and pass that wealth on. Looking at just the gap in home ownership is one small piece of a large puzzle. Home ownership is one thing, but where homes are and how expensive they are is what makes the biggest difference. If an area is redlined and there is home ownership, but the house is appraised for less than the houses white people are getting mortgages for, then both may be home owners but the disparity still exists. The study even says, “The relaxing of credit constraints, along with veterans’ subsidies, would likely have their largest impact at younger ages, which is consistent with a pronounced widening of the racial gap between the ages of 25 and 34. It is also part of the conventional wisdom that the reforms of the 1930s solidified the practice of “red-lining” black neighborhoods, making it more difficult for black families to obtain mortgages.” After 1940, white movement to the suburbs increased metropolitan segregation, but white suburbanization also meant that blacks gained access to owner-occupied housing in urban neighborhoods where whites had previously lived (Leah P. Boustan and Robert A. Margo 2010). So homeownership in urban areas may have been there, but it is a classic example of separate but not equal. Both get a water fountain, but one’s got a cooler and the other one’s a garden hose. The amenities and opportunities in the suburbs, along with property values, kickstarted generational wealth, while urban areas were left to decay with low property values and therefore underfunded schools.

Thus, the idea that black neighborhoods are racially discriminated against in a way that prevents their inhabitants from getting loans, and that this is tied to redlining from the 1930s which increased racial inequality, seems to be false. So does the idea that black individuals are unfairly discriminated against. Instead, lenders seem to be accurately using race at the individual and neighborhood level as proxies for investment risk and that is not racist. So, since race has nothing to do with loans once economic variables are controlled for, it’s doubtful that this can explain why blacks are in poverty.

Very weak argument. The HOLC "redlining" maps documented segregated urban neighborhoods as they existed in the late 1930s. These neighborhoods became segregated through the combination of many different social practices and policies (Rothstein et al. 2017). The HOLC maps, by "redlining" those neighborhoods, compounded the impact of segregation by excluding access to capital. This occurred at an important point in American urban development, right before the boom in suburbanization of the post-war period, when VA and FHA loans became widely available to primarily White families desiring to move out of cities at that time. We cannot lay the entire burden of urban problems and poverty on "redlining," but it made access to capital all the more difficult and is part of a cluster of practices which concentrate disadvantage. It is unlikely that lenders accurately use race at the individual and neighborhood level as proxies for investment risk. While race can be correlated with certain economic and social characteristics that are associated with credit risk, such as income levels and education levels, it is not a reliable or valid measure of investment risk on its own.
Using race as a proxy for investment risk can lead to a number of problems, including:
1. Misclassification of borrowers: By using race as a proxy for investment risk, lenders may classify some borrowers as high-risk based on their race, even if they have other characteristics that suggest they are low-risk borrowers. This can result in some borrowers being unfairly denied credit or charged higher interest rates, which can have a negative impact on their financial well-being and opportunities.
2. Under- or over-estimation of risk: Using race as a proxy for investment risk can also lead to under- or over-estimation of risk, as it does not take into account the full range of factors that can affect credit risk. This can result in lenders making suboptimal lending decisions, which can have negative consequences for both borrowers and lenders.
3. Perpetuation of discrimination: Using race as a proxy for investment risk can also perpetuate discrimination, as it can reinforce existing biases and stereotypes about the creditworthiness of different racial and ethnic groups. This can create barriers to credit for borrowers who are unfairly judged based on their race, and can contribute to unequal access to credit and financial opportunities.
Overall, while race can be correlated with certain economic and social characteristics that are associated with credit risk, it is not a reliable or valid measure of investment risk on its own. Lenders should use a more comprehensive and objective approach to assessing investment risk, in order to avoid misclassification, under- or over-estimation of risk, and perpetuation of discrimination.

Bad Schooling

When it comes to bad schooling, this is based on the assumption that blacks go to bad schools. This is dealt with in Last (2019).

Per Pupil Spending

When discussing why it is that black people are poorer than white people, one variable that is often brought up is educational opportunity. It is alleged that Black Americans have less in the way of educational opportunity than do whites and that this is an important part of the explanation for why it is that Black Americans are, on average, poorer than white Americans. Racial differences in educational outcomes certainly are real. For instance, Asian Americans complete more years of schooling, on average, than do whites, and whites complete more years of schooling than do black and Hispanic Americans. We see the same pattern when we look at data on GPA by race. In fact, the difference in grades between black and white people is significantly greater than is the difference between poor and non-poor people. With respect to SAT scores, Black Students with household incomes of over $100,000 score below white students with family incomes between $20,000 and $25,000. Seeing all this, some people assume that black students must have lower quality schools, and this must explain their poor academic performance. It is often alleged that black schools receive less funding than white schools. Before looking at this issue directly, it’s worth briefly reviewing the literature on whether or not educational spending matters. Lafortune et al. (2015) analyzed the impact on test scores of 68 changes to school spending that took place at the state level between 1990 and 2013. They estimate that a $500 increase in spending led to a .09 SD increase in test scores. Jackson et al. (2018) look at how variation in school funding caused by the Great Recession impacted student performance. They estimate that a 10% reduction in spending led to a .078 SD reduction in test scores and a 2.6% decrease in graduation rates. Miller (2018) analyzed data on changes in school funding due to changes in property valuation across 24 states. 
On this basis, Miller estimates that a 10% increase in spending leads to a .05 to .09 SD increase in test scores and a 2.1% to 4.4% increase in graduation rates. Jackson et al. (2015) analyzed changes to school spending and estimated the following relationships: “a 10 percent increase in per-pupil spending each year for all twelve years of public school leads to 0.27 more completed years of education, 7.25 percent higher wages, and a 3.67 percentage-point reduction in the annual incidence of adult poverty; effects are much more pronounced for children from low-income families.” Thus, several analyses utilizing different data sets suggest that school spending has a moderate effect on test scores and future life outcomes. Turning to race, to establish that black students have fewer resources dedicated to their education, liberals often cite research showing that whiter school districts are better funded: What is ignored in such analyses is the fact that, within school districts, money is disproportionately given to blacker schools.

If you’re comparing primarily white districts with primarily black districts and conceding that primarily white districts get more money, the comparison is going to be suburban vs. urban. Suburban districts have far more advanced and far better-funded schools; it’s not even close.

This was the finding of Ejdemyr and Shores (2017), who concluded that “we find that poor and minority students on average receive 1 to 2 percent more resources than non-poor and white students in the same district”.

The study did not find that minority schools receive more resources. In fact, it found that a large share of districts under-allocate resources to disadvantaged students, including poor and minority students.
This is a working paper so it’s not peer reviewed.
In addition to the fact that this study isn’t very longitudinal (it covers 2008-09, 2011-12, and 2013-14; 2010 was excluded), its results are of limited applicability today because it relies on CRDC data, and the OCR has since removed school expenditures data from the CRDC: https://www2.ed.gov/about/offices/list/ocr/frontpage/faq/crdc.html
This study provides no clear definition of “resources” which leads to the following:
1. Ignoring disparities in quality: If the definition of resources only considers the quantity of resources provided to schools, it may overestimate the amount of resources minority schools receive if those resources are of lower quality than those provided to non-minority schools.
2. Failing to account for hidden costs: If the definition of resources does not account for hidden costs such as transportation or fees for extracurricular activities, it may overestimate the amount of resources minority schools receive if they have higher transportation costs or fewer opportunities for extracurricular activities.
3. Not considering differences in student needs: If the definition of resources does not consider differences in student needs, it may overestimate the amount of resources minority schools receive if they serve a higher proportion of students with greater needs (e.g., English language learners or students with disabilities).
The results are also biased because the 2011-12 reporting process required all schools and districts to respond to each question on the CRDC prior to certification. Null or missing data prevented a school district from completing their CRDC submission to the Office for Civil Rights. Therefore, in cases where a school district did not have complete data, some schools or districts reported a zero value in place of a null value. As such, the item response rates may be positively biased.

When it comes to the CRDC, data from school districts are self-reported.

Despite being drawn from a universe of respondents, the submitted data may differ from their actual values due to non-sampling errors (e.g., definitional difficulties, the inability of respondents to provide accurate data, differences in the interpretation of questions, errors made in collection such as in recording or coding the data, and errors made in estimating values for missing data).

There is no consideration of the caveats for analyzing the state and national estimations included in the CRDC.

They attempt to show that the CRDC is robust by checking it against Texas inequality measures; however, their scatter plots show some heteroskedastic patterns, predictive error, and overestimation given the residuals under the regression line:

They show how stochastic their regression parameters are based on β̂. Based on this, we see that outcomes and year, district enrollment, number of teachers, and number of schools are negatively correlated with CRDC/Texas differences, suggesting that the CRDC under-reports spending shares on Hispanic and black students in larger districts. They don’t correct for this measurement error.

They show lots of factors influencing funding, but they don’t control for these statistically. They ran 102 bivariate correlations instead of using a model with a subgroup analysis to see the effects of race once all the other variables were held constant.

There should be controls for how large these schools and classes are in terms of population since black schools tend to be overcrowded. 1 to 2% more funding doesn’t mean anything when there’s not enough of it to go around to every black school since they are larger, so white schools would still be getting more funding.

The authors also didn’t take cost of living into account. Everything from teacher salaries to the price of land is much greater in one area than in another, which is why an area with more black people spends more in absolute dollars. For example, when compared to proximate districts in the state with a comparable CoL, Philadelphia is next to last in per student spending within its region. Yet Titusville receives a significantly greater portion of its budget from the state than Philadelphia does.

They should also take into account a higher amount of referrals of black students to special ed (especially for emotional disturbances). And as we know, special ed is expensive.

They should also take into account Alaska. Some states seem to show opposite stories. On average, predominantly nonwhite school districts in Alaska have $3,077 MORE per student in funding than predominantly white school districts. That's because of federal funding for Native American schools, and general Alaskan stinginess elsewhere. That doesn't apply in most states, nor to black people. Also, per the study:

“…district racial income inequality, measured as the logged difference in black/Hispanic and white median income, is negatively correlated with spending inequality, meaning that in districts with greater income disparity between ethnic groups, the black and Hispanic school resource share is smaller.”

The thing about Ejdemyr and Shores is that the study didn’t describe a crosswalk for converting a state’s school finance data system to the NCES data system.

They also just measure government funding, which accounts for a small portion of funding for schools, meaning funding is still not anywhere near equally distributed. The federal government chips in about 8 to 9% of school budgets nationally, but much of this is through programs such as Head Start and free and reduced-price lunch programs. States and local governments split the rest, though the method varies depending on the state. One reason why there is still a gap is that schools are funded off property taxes. Nationally, high-poverty districts spend 15.6% less per student than low-poverty districts do, according to the U.S. Department of Education.

It could also be because the private donations of wealthy parents and parent-led school-supporting nonprofits can serve to widen the spending gap (Brown et al., 2017; Nisbet, 2018).

It should also be noted that in figure 4, we still see spending ratios that favor white people:

We also see that in the first 5 deciles of table 2, average per pupil expenditures favor whites over black people.

It should also be noted that the study said this:

“But also that a large share of districts under-allocate resources to disadvantaged students. There is variation among districts, however. We estimate that in the districts in which poor or minority students fare worst (the 10th percentiles on each inequality measure), the spending gap between these students and their non- poor or white counterparts is $300 to $500 per pupil.”

Indeed, there was a lot of variability; in fact, many of the standard deviations implied a coefficient of variation greater than 1. The study makes an asterisk note of this:

“Variation among districts is non-trivial; in districts with the most intra-district inequality, the poor receive up to $500 less per pupil than the non-poor.”

Because of this, there was a lot of heterogeneity reflective of the factors they were measuring. They didn’t control for the heterogeneity, meaning these results aren’t conclusive whatsoever, especially given the low statistical power.
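For reference, the coefficient of variation is just the standard deviation divided by the mean, so a value above 1 means the spread is larger than the average itself. A minimal sketch in Python, using made-up per-pupil spending gaps (the numbers and names below are hypothetical and are not taken from the study):

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the absolute mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(values) / abs(m)


# Hypothetical district-level spending gaps in dollars per pupil (illustrative only).
hypothetical_gaps = [-500, -100, 50, 150, 600]

cv = coefficient_of_variation(hypothetical_gaps)
print(f"mean gap = {mean(hypothetical_gaps)}, CV = {cv:.2f}")
```

When the coefficient of variation is this large, the average gap says very little about what any particular district looks like.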

Because whiter districts get more funding but whiter schools get less funding within a given district, these sorts of statistics cannot be used to derive per pupil spending. To do that, we have to look at school level analyses. Murray and Rueben (2008) calculated spending per pupil for US schools between the years 1972 and 2002. They found the following: “In 1972, the ratio of nonwhite to white spending was .98; this trend had reversed by 1982, as spending per pupil for nonwhite students was slightly higher than for white students in most states and in the United States as a whole and has been for the past 20 years” Thus, since 1982, spending on non-white students has been greater than spending on white students.

SL omits where it says

“The results presented thus far need to be considered with a few caveats. These ratios do not reflect that the costs of educating students of different groups differ and that minority students are often found in urban districts that have higher cost structures. Part of the movement to an adequacy standard in court cases reflects the understanding that equalizing educational attainment or outcomes depends on factors other than money, and it may cost more to reach a given standard for a specific set of students or schools serving different populations. In addition, although spending differences have lessened between districts, it is unclear whether inequities are lessened at the school level. According to a recent study, the 10 largest school districts in California have spending gaps between high- and low-poverty high schools— from $64,000 to $500,000 per school (Education Trust-West 2005). This problem is not limited to California. A study of Baltimore, Cincinnati, and Seattle indicated district funding differences for high- and low-poverty schools ranging from $400,000 to $1 million (Roza and Hill 2004). These studies identified large disparities in school funding within districts, with schools serving high-poverty students receiving substantially less district funding. These spending disparities can undermine existing systems trying to close achievement gaps if it means the most at-risk students are not receiving their fair share of highly qualified teachers. A significant part of the disparities found in spending and staffing across districts is related to staffing rules and the right to transfer and fill jobs districtwide based on seniority or tenure within a district. Districts often allocate a certain number of staff to a school, rather than giving schools a per student amount for staff compensation. 
As teachers gain experience, they often take advantage of seniority rules to move to more affluent schools where students are perceived as easier to teach (Roza and Hill 2004). This can lead to more experienced teachers clustering at low-poverty schools with vacancies at schools serving underserved populations filled by new teachers. As a result, new teachers (who have much lower salaries than experienced teachers) work disproportionately in schools in the poorest neighborhoods. Because of the large range in staff pay, schools with the highest needs within a district often receive substantially less funding because they employ the least experienced teachers. Betts, Rueben, and Danenberg (2000) find that although spending per pupil is largely equalized across districts in California, resources (including experience and qualification levels of teachers) vary dramatically across schools serving high- and low-income (and white and nonwhite) students. Schools serving low-income students typically have a larger percentage of inexperienced and non credentialed teachers, and the variation in teacher qualifications is greater in large urban districts than in the state as a whole. Given that teacher salaries make up about 40 percent of a school district’s budget, this difference in experience levels translates into large differences in money spent at the school rather than the district level.”

This issue was revisited by Richwine (2011) who found that spending on black students was 1% greater than spending on white students, while spending on Asian and Hispanic students was a few percentage points lower.

The report ignores a crucial reality highlighted, ironically, in another Heritage publication. “Schools serving low-income students are often poorly funded,” as put in 2000 by Samuel Casey Carter in “No Excuses: Lessons from 21 High-Performing, High-Poverty Schools.” This reality matters because poverty rates vary among racial and ethnic groups. The student poverty rate tends to increase, on average, with the percentage of African American, Hispanic, or Native American students in a district while it tends to decrease with the percentage of white or Asian students.

Given the tangled relationships among education funding, poverty, and race, the reason Richwine can call this a myth is that his analysis aggregates spending figures to the regional and national level, thus obscuring disparities within states or within districts. Other studies have shown evidence of racial disparities in education funding using state-by-state analysis of district-level data. The Center for American Progress authors also highlight a new and growing body of evidence, which Richwine ignores, of racial funding disparities within districts. They employ Richwine’s basic methodology, which has the potential for shedding light on racial disparities, but they make starkly different analytic choices along the way. The choices flow first from an understanding of the fractured nature of U.S. education funding.

School funding systems certainly don’t set out to create disparities in school funding across racial categories. That would be illegal. But despite a couple of generations of litigation, court action, and legislation, school districts in high-poverty areas are still often funded less generously than districts elsewhere. The problem is that funding for districts still derives substantially from local property taxes, and state-level funding arrangements don’t necessarily level the playing field for high-poverty districts. Some states, notably Massachusetts and New Jersey, deserve credit for progressive efforts to give high-poverty districts more funding from state and local sources than low-poverty districts. But other states including Illinois and New York ignominiously fund wealthy districts better than poor ones.

Richwine’s analysis would make more sense were federal money a big part of the picture, but it’s not. Federal funds account for about a dime of each education dollar in the typical state. Moreover, most federal funds, driven by population and poverty figures, do not contribute to racial disparities. Title I of the Elementary and Secondary Education Act, the largest school program operated by the federal Department of Education, distributes money to bolster spending in areas of concentrated poverty. Part B of the Individuals with Disabilities Education Act, the next biggest program, distributes funding based on shares of children and poverty rates. Let’s focus on non-federal funds devoted to elementary and secondary education because that’s where one finds the funding arrangements contributing, albeit indirectly, to any racial disparities.

Let’s also focus on current expenditures, the funds used to pay the day-to-day expenses of operating school districts. This focus corresponds with federal funding formulas, and with the great majority of work examining resource equity in education, including the Education Trust’s Funding Gap series. Volatility and documented racial inequity in capital expenditures such as school construction financed by the sale of tax-exempt bonds create a burden of explanation for researchers choosing as Richwine does to fold them in with current expenditures. He provides none.

Richwine made do with just one year’s expenditures, but it was scarcely greater effort for the authors to muster three consecutive years’ worth of data. They used Consumer Price Index values to render expenditures from school years ending in 2006 and 2007 in terms of real 2008 dollars. And they copy Richwine in using the Department of Education’s Comparable Wage Index to adjust districts’ expenditures for variation within and between states in the costs of providing education.
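The two adjustments described above can be sketched roughly as follows. This is only an illustration: the index values are made up, the function names are hypothetical, and it assumes the wage index is applied by dividing spending by an index normalized so that 1.0 is the national average, which may not match the report's exact procedure.

```python
def to_real_2008_dollars(nominal, cpi_in_year, cpi_in_2008):
    """Convert nominal dollars from an earlier year into 2008 dollars."""
    return nominal * cpi_in_2008 / cpi_in_year


def adjust_for_regional_cost(spending, wage_index):
    """Deflate spending by a regional cost index (assumed normalized to 1.0)."""
    return spending / wage_index


# Hypothetical inputs: $10,000 per pupil in a school year ending in 2006.
nominal_2006 = 10_000.0
hypothetical_cpi_2006 = 200.0   # made-up index value
hypothetical_cpi_2008 = 210.0   # made-up index value
hypothetical_wage_index = 1.05  # made-up: labor costs 5% above average

real = to_real_2008_dollars(nominal_2006, hypothetical_cpi_2006, hypothetical_cpi_2008)
comparable = adjust_for_regional_cost(real, hypothetical_wage_index)
print(f"real 2008 dollars: {real:.0f}, cost-adjusted: {comparable:.0f}")
```

The point of the second step is that a dollar spent in a high-cost district buys less, so raw dollar comparisons overstate what high-cost urban districts can actually purchase.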

They employ the same procedures that Richwine used to create per pupil expenditures for each racial or ethnic group, but they go state by state. A state’s per pupil expenditures for a particular racial or ethnic group is a ratio incorporating information from three years. The numerator is the sum of the average district per pupil expenditures multiplied by the number of district students representing the racial group in question. The denominator is the total number of students in the group. Results of a state-by-state analysis of school expenditures in 13,458 unique districts appearing in one or more of the school years from 2005-06 to 2007-08 show the “broadly similar” percentages Richwine celebrates are also present in many of the states. In California, Colorado, Connecticut, Delaware, Florida, Georgia, and Idaho, for example, per pupil expenditures for Black and Hispanic students are within 3 percentage points of those corresponding to white students.
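The ratio described above is an enrollment-weighted average of district spending. A minimal sketch in Python, with two invented districts (all names and numbers are hypothetical; in the real analysis each record would be a district in one of the three school years):

```python
from dataclasses import dataclass, field


@dataclass
class DistrictYear:
    """One district in one school year (hypothetical illustrative record)."""
    per_pupil_spending: float
    enrollment: dict = field(default_factory=dict)  # group name -> student count


def group_per_pupil(records, group):
    """Sum of (district per pupil spending x group enrollment) / total group enrollment."""
    numerator = sum(r.per_pupil_spending * r.enrollment.get(group, 0) for r in records)
    denominator = sum(r.enrollment.get(group, 0) for r in records)
    if denominator == 0:
        raise ValueError(f"no students recorded for group {group!r}")
    return numerator / denominator


# Two made-up districts in one made-up state.
records = [
    DistrictYear(12_000.0, {"white": 800, "black": 200}),
    DistrictYear(10_000.0, {"white": 200, "black": 800}),
]

white = group_per_pupil(records, "white")
black = group_per_pupil(records, "black")
print(f"white: {white:.0f}, black: {black:.0f}, ratio: {black / white:.2f}")
```

Note that this method assigns every student the district average, so it cannot see any school-level differences inside a district; it only captures how groups are distributed across districts.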

But racial disparities in education spending clearly exist in a host of other states. In Illinois, New York, and Pennsylvania, per pupil expenditures for black and Hispanic students hover around 90 percent of those for white students. This finding is a reflection of these states’ regressive funding tendencies, and the fact that people of color tend to be more concentrated in high-poverty districts. The flipside of this disturbing evidence comes from states such as Massachusetts and New Jersey in which high-poverty districts receive greater support from state and local sources than low-poverty districts.

The evidence shows that Richwine’s “backgrounder” for the Heritage Foundation misses the trees for the forest. Meaningful levels of racial disparity clearly exist in the provision of school funds in some states. But that does not mean that states with progressive funding or those exhibiting similar rates of expenditure across racial and ethnic groups are in the clear. A growing body of literature documents funding inequity, including racial disparities, within districts. The Education Trust’s Salary Gap series, for example, found differences as large as $10,000 between average teacher salaries in districts’ lowest and highest minority enrollment schools. And analyses of school-level per pupil spending data from Florida reveal a negative relationship between spending rates and the percentages of African American and Hispanic students in schools, even after controlling statistically for student poverty rates.

Richwine’s assertion that intradistrict funding disparities are of special concern in “big-city districts” is agreeable, and so is something else he says: it does little good to focus on funding levels and disparities without also calling attention to how well schools use the funds available to them.

We are sure to learn more about racial disparities in education funding within districts later this year when the Department of Education releases Phase II of its 2009-10 Civil Rights Data Collection featuring, for the first time, school-level expenditure information. Even when these new data are available, advocates and researchers will have to take care to match their analytic approaches to their questions. Abandoning this practice, as Richwine has done in “The Myth of Racial Disparities in Public School Financing” is as sure to please some audiences as it is to slight others. Such behavior is divisive, at best.

Shorter Version
According to the issue brief’s authors Raegen Miller and Diana Epstein, however, Richwine’s analysis aggregates spending figures to the regional and national level, which obscures disparities within states or within districts. Miller and Epstein’s state-by-state analysis of district-level data provides fresh evidence of racial disparities in education funding. They find that racial disparities in education spending clearly exist in a host of states, including Illinois, New York, and Pennsylvania, where per pupil expenditures for black and Hispanic students hover around 90 percent of those for white students. This finding is a reflection of these states’ regressive funding tendencies, and the fact that people of color tend to be more concentrated in high-poverty districts. The flipside of this disturbing evidence comes from states such as Massachusetts and New Jersey in which high-poverty districts receive greater support from state and local sources than low-poverty districts. The evidence shows that Richwine’s “backgrounder” for the Heritage Foundation misses the trees for the forest. Meaningful levels of racial disparity clearly exist in the provision of school funds in some states. But that does not mean that states with progressive funding or those exhibiting similar rates of expenditure across racial and ethnic groups are in the clear. A growing body of literature documents funding inequity, including racial disparities, within districts.

By and large, leftists ignore attempts to accurately calculate school funding by race. An exception to this was provided by a 2011 report from the Center for American Progress. The authors wrote a response to Richwine in which they analyzed three years worth of data and broke their results down by state. Their analysis confirms Richwine’s findings, but is interpreted in a truly bizarre fashion. They write “But racial disparities in education spending clearly exist in a host of other states. In Illinois, New York, and Pennsylvania, per pupil expenditures for black and Hispanic students hover around 90 percent of those for white students. This finding is a reflection of these states’ regressive funding tendencies, and the fact that people of color tend to be more concentrated in high-poverty districts. The flip side of this disturbing evidence comes from states such as Massachusetts and New Jersey in which high-poverty districts receive greater support from state and local sources than low-poverty districts.” Seemingly, the authors found it noteworthy that Richwine’s average statistic was an average. They express dismay at the fact that, in some states, black children receive 10% less funding than white children, but seem relieved that in others black children receive as much as 18% more funding than white children.

Yeah dude, reparations are good and necessary to heal racial divides. It's also quite hypocritical and telling that he thinks black kids receiving 10% less funding on average is admissible but black kids receiving 1-2% more funding on average is noteworthy, and definitely owns the libs. His claim is that within a primarily black district white students will get 1-2% fewer resources per student? That comes nowhere near the difference you see between what big-time high schools in primarily white areas of our country are getting versus inner-city schools.

https://youtu.be/od3s3lZWbWM

Their language seems to imply a sort of anti-white bias on the part of the authors. In any case, if we are trying to explain why, on average, African American life outcomes differ from white life outcomes, and we are talking about national populations, then average spending per pupil across the nation is obviously the correct statistic to look at.

It is incorrect to look at average statistics when looking at average African American life outcomes because doing so can mask significant variations and disparities within the African American population.
Average statistics, such as the average income or the average educational attainment of African Americans, are useful for providing a general overview of the population, but they do not capture the full range of experiences and outcomes among African Americans. For example, the average income of African Americans may not accurately reflect the income of individuals within the population, as there may be large differences in income between individuals with different levels of education, occupation, or location.
Additionally, looking at average statistics can also obscure disparities and inequalities within the African American population. For example, the average income or education level of African Americans may not accurately reflect the experiences of African Americans who are from low-income families, who live in disadvantaged neighborhoods, or who face discrimination and other barriers to opportunity. These individuals may have significantly different life outcomes compared to the average African American, and their experiences may be overlooked or ignored when looking at average statistics.
Overall, while average statistics can provide valuable information on the general characteristics and trends of the African American population, they should not be used as the sole basis for understanding the experiences and outcomes of individual African Americans. Instead, it is important to consider the full range of variation and disparity within the population, and to use a more nuanced and detailed approach to analyzing African American life outcomes.

Thus, spending doesn’t seem to systematically favor white people and, relative to black people, there is actually a slight anti-white bias in funding.

Class Size

Turning to class size, Glass and Smith (1979) analyzed 77 previous papers on the relationship between class size and achievement scores, finding that students in small classes do significantly better when the classes are quite small, but that differences in class size past around 20 students per teacher didn’t make much of a difference.

They didn’t even include ¾ of the documents they obtained.

“Approximately 80 studies on the class size and achievement relationship were included in the analysis.”

This was out of 300 potential ones. They even admit the reviews on the studies selected are inconclusive:

“Even though the corpus of 80 studies exceeds by 50%…and these reviews are narrative and inconclusive.”

They didn’t include studies credited to school districts either. They clarify that very few studies were used out of many potential ones. Half of the studies analyzed weren’t even controlled (49.9%), so this meta analysis is on studies that use crappy methodologies.
The paper says that they adjusted statistically for differences; however, they said only some studies did that. How much is “some”? Let’s take note that it says:

“The curve for the well-controlled studies then, is probably the best representation of the class-size and achievement relationship...A clear and strong relationship between class size and achievement has emerged... There is little doubt, that other things being equal, more is learned in smaller classes.”

In fact, from the meta analysis, 60% of the delta S-L’s were positive. Then, in figure 4, it seems the indicated class sizes do matter past 20 for well controlled studies.

Additionally they say this:

“When all those comparisons for which S = 1 were removed the curve in Figure 4 for well-controlled studies was even steeper than that shown…When all comparisons for which S was less than 6 were removed, the curve for well-controlled studies became less steep; however, it still rose from the 50th distribution at size 40 to the 60th at size 10, the 67th at size 5 and the 74th at size 1.”

But this study is kinda bad anyway because of important exclusions, unsure phrases, assuming real-world distributions are normal when they’re not, etc. Regardless, small class sizes (20 students or less) were associated with improved academic performance. Effects were strongest in the early primary grades and among low-income students.
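The figures quoted above describe a curve rather than a single number, and the gain per student removed differs a lot along it. A rough sketch in Python, using only the four points from the quote (class sizes 40, 10, 5, and 1 at the 50th, 60th, 67th, and 74th percentiles) and assuming, purely for illustration, straight lines between them:

```python
# Points quoted from Glass and Smith: (class size, achievement percentile).
quoted_curve = [(40, 50), (10, 60), (5, 67), (1, 74)]


def gains_per_student_removed(curve):
    """Percentile points gained per student removed, for each adjacent pair of points."""
    gains = {}
    for (big, low_pct), (small, high_pct) in zip(curve, curve[1:]):
        gains[(big, small)] = (high_pct - low_pct) / (big - small)
    return gains


segment_gains = gains_per_student_removed(quoted_curve)

# A single average over the whole range, which is what a one-number summary reports.
overall = (quoted_curve[-1][1] - quoted_curve[0][1]) / (quoted_curve[0][0] - quoted_curve[-1][0])

for (big, small), gain in segment_gains.items():
    print(f"{big} -> {small}: {gain:.2f} percentile points per student")
print(f"single average over 40 -> 1: {overall:.2f}")
```

Under these assumptions the gain per student is roughly four to five times larger below a class size of 10 than between 40 and 10, which a single averaged effect would hide.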

More recently, Hattie (2005) aggregated data from 1,165 effect sizes and produced a meta-analytic effect of .13.

John Hattie’s research on class size reduction has been heavily criticized. He includes Glass and Smith’s meta-analysis; however, Professor Glass has criticized Hattie’s work, saying “Averaging class size reduction effects over a range of reductions makes no sense to me. It's the curve that counts. Reductions from 40 to 30 bring about negligible achievement effects. From 20 to 10 is a different story. But Teacher Workload and its relationship to class size is what counts in my book.” Again, “The result of a meta-analysis should never be an average; it should be a graph.” (Robinson, 2004, p. 29). Hattie has faced intense scrutiny ever since his more recent and larger meta-analysis, Visible Learning. Hattie has a “hinge point” of 0.4. Anything that falls below that hinge point does not have a large effect on student achievement. An effect size of above 1.0 actually is equivalent to a year’s worth of growth. Hattie’s research gives class size an effect of 0.21. However, there are several problems with his analysis.
1. Effect size: The effect size statistic is the cornerstone of Hattie's work. He claimed that the larger the effect size, the greater the impact, "know thy impact", on student learning and this enabled him to rank educational influences and list the top ranked influences as "what works best". However, in 2018, in an interview with Ollie Lovell, he admitted his rankings were misleading and that he does not rank anymore; see: https://soundcloud.com/ollielovell/errr-018-john-hattie-defends-the-meta-analysis#t=1:21:00
Hattie admits if you mix the random and fixed effect size calculation methods up you have significant problems interpreting your data, “combining or comparing the effects generated from the two models may differ solely because different models are used and not as a function of the topic of interest” (VL, p. 12). Slavin (2015), Bergeron & Rivard (2017) also identified this issue with Hattie's work, with Bergeron & Rivard (2017) stating, “These two types of effects are not equivalent and cannot be directly compared... A statistician would already be asking many questions and would have an enormous doubt towards the entire methodology in Visible Learning and its derivatives.” Hattie & Hamilton (2020) write, in a small footnote on page 4, “Note that there are two methods for calculating effect size (pre-post and intervention comparison) and they can lead to different interpretations.” Then, in Wisniewski, Zierer & Hattie (2020), The Power of Feedback Revisited: A Meta-Analysis of Educational Feedback Research, Hattie reverses his view stated in VL on the 'fixed method': “the use of a fixed-effect model may not be appropriate. A meaningful interpretation of the mean of integrated effect with this model is only possible if these effects are homogenous (Hedges and Olkin, 1985). Because previous research on feedback includes studies that differ in variants of treatment, age of participants, school type, etc., it is highly likely that the effect size varies from study to study, which is not taken into account by a fixed-effect model. By contrast, under the random-effects model, we do not assume one true effect but try to estimate the mean of a distribution of effects. The effect sizes of the studies are assumed to represent a random sample from a particular distribution of these effect sizes (Borenstein et al., 2010).” (p. 2). Hattie's inconsistency is concerning as he continues to compare studies using both the fixed and random methods with his commercial partner Corwin. 
A clear example is with Feedback. Originally the effect size for feedback was 0.73, but using this random effects method, Wisniewski, Zierer & Hattie (2020) published a significantly reduced result of 0.48. However, with Corwin, Hattie continues to combine studies using any method of effect size calculation to get 0.62 (Corwin, October, 2021). Once again, indicating Hattie's original claim of comparing effect sizes from disparate studies is neither reliable nor valid. Even worse, Hattie continues to use a 3rd method of converting a correlation to an Effect size. Despite Hattie's claim above of using the 'fixed' method in VL, a number of scholars, Bergeron & Rivard (2017), Blatchford (2016), Wrigley (2018), Bakker et al. (2019) & Kraft (2020) have identified that Hattie most often uses a 3rd method of converting a correlation to an effect size, but there are major problems with this - see https://visablelearning.blogspot.com/p/correlation.html
Bakker et al. (2019) & Kraft (2020) detail this problem, and Kraft warns, "Knowing whether an effect size represents a causal or correlational relationship matters for interpreting its magnitude. Comparing meta-analytic reviews that incorporate effect size estimates from observational studies (e.g., Hattie, 2009; Lipsey & Wilson, 1993) to those that only include experimental studies (e.g., Hill et al., 2008; Lipsey et al., 2012; Lortie-Forgues & Inglis, 2019) illustrates how correlational relationships are, on average, substantially larger than causal effects. It is incumbent on researchers reporting effect sizes to clarify which type their statistic describes, and it is important that research consumers do not assume effect sizes inherently represent causal relationships." (p. 3). Sue Cowley cleverly writes about educational authorities using the high correlation between vocabulary size and achievement to direct teacher practice, and shows it is more complicated than that. The largest and most reputable independent evidence organisations in the world, the USA's What Works Clearinghouse (WWC) and England's Education Endowment Foundation (EEF), consider correlation studies to be of very low quality. Another example is the large English organisation Evidence Based Education. In their Great Teaching Toolkit, they warn about these types of correlation studies, “Much of the available research is based around correlational studies; in these the relationships between two variables is measured. While interesting, the conclusions drawn from them are limited. We cannot tell if the two have a causal relationship – does X cause Y, or does Y cause X? Or might there be a third variable, Z? Therefore, while we may find a positive correlation between a teaching practice and student outcomes, we do not know if the practice caused the outcome.” (p.11).
Gilmore et al. (2021), in their report to the Association of Mathematics Teachers regarding the misuse of research, warn of pedagogical conclusions drawn from correlation studies, “correlational evidence does not tell us the direction of the relationship nor whether other un-measured factors cause the relationship.” (p. 36). This, in part, accounts for the large differences between their evidence and Hattie's; see: https://visablelearning.blogspot.com/p/other-researchers.html
Many peer-reviewed papers, e.g., Lipsey et al. (1993, 2012) and Bakker et al. (2019), detail the problem of converting a correlation into an effect size. Kraft (2018) states, “Effect sizes from studies based on correlations or conditional associations do not represent credible causal estimates.” Many scholars have asked Hattie to remove these low-quality studies. However, Hattie dismisses this with an astonishing caveat: there is “no reason to throw out studies automatically because of lower quality” (VL, p. 11). Snook et al. (2009, p. 2) respond, “Hattie says that he is not concerned with the quality of the research... of course, quality is everything. Any meta-analysis that does not exclude poor or inadequate studies is misleading, and potentially damaging if it leads to ill-advised policy developments. He also needs to be sure that restricting his data base to meta-analyses did not lead to the omission of significant studies of the variables he is interested in.” Terhart (2011) adds, “It is striking that Hattie does not supply the reader with exact information on the issue of the quality standards he uses when he has to decide whether a certain research study meta-analysis is integrated into his meta-meta-analysis or not. Usually, the authors of meta-analyses devote much energy and effort to discussing this problem because the value or persuasiveness of the results obtained are dependent on the strictness of the eligibility criteria” (p. 429). Yet Hattie constantly boasts that he has the largest set of studies, as if quantity somehow overrides the quality issue.
Bergeron & Rivard (2017) on Hattie's huge numbers: “We cannot allow ourselves to simply be impressed by the quantity of numbers and the sample sizes; we must be concerned with the quality of the study plan and the validity of collected data.” Larson (2014), “the megalomaniac additive annexation of all sorts of meta-analyses is not concerned with methodologically critical self-reflections, nor with validity claims, i.e., it does not specify the limits to what can be said and made commensurable. The risk is that knowledge in the collected empirical data piles disappears when it is formalised in a second-, third-, and-fourth-order perspective” (p. 6). Prof Terry Wrigley (2015) in Bullying by Numbers, gives a detailed analysis of this problem. Also, Wrigley (2018, p. 365) in The power of ‘evidence’: Reliable science or a set of blunt tools? highlights the problem of Hattie's use of correlation quoting Hubert and Wainer (2013: 119), “One might go so far to say that if only the value of rXY is provided and nothing else, we have a prima facie case for statistical malpractice.” Bergeron & Rivard (2017) show how r is converted to d: “Hattie confounds correlation and causality when seeking to reduce everything to an effect size. Depending on the context, and on a case by case basis, it can be possible to go from a correlation to Cohen’s d (Borenstein et al., 2009):
d = 2r / √(1 − r²)
but we absolutely need to know in which mathematical space the data is located in order to go from one scale to another. This formula is extremely hazardous to use since it quickly explodes when correlations lean towards 1 and it also gives relatively strong effects for weak correlations. A correlation of .196 is sufficient to reach the zone of desired effect in Visible Learning...It is with this formula that Hattie obtains, among others, his effect of creativity on academic success (Kim, 2005), which is in fact a correlation between IQ test results and creativity tests. It is also with correlations that he obtains the so-called effect of self-reported grades, the strongest effect in the original version of Visible Learning. However, this turns out to be a set of correlations between reported grades and actual grades, a set which does not measure whatsoever the increase of academic success between groups who use self-reported grades and groups who do not conduct this type of self-examination.” An example of the problem with correlation, using a class of 10 students, is available here: https://docs.google.com/spreadsheets/d/1R8EkDv6MFQ8UbpN1T79EVR17xVhGuz-cFXDo0aquS24/edit?usp=sharing
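The conversion that Bergeron & Rivard attribute to Borenstein et al. (2009) appears to be the standard d = 2r / √(1 − r²); that reading is an inference, but it reproduces the figures cited in this section (r = 0.196 gives d ≈ 0.40, r = 0.69 gives d ≈ 1.91). A short sketch shows how quickly the conversion inflates as r approaches 1. The function name `r_to_d` is just an illustrative label.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a correlation r to Cohen's d via d = 2r / sqrt(1 - r^2)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 2.0 * r / math.sqrt(1.0 - r * r)


# The first four correlations are the ones quoted in the text;
# 0.95 and 0.99 are added only to show the blow-up near r = 1.
for r in (0.196, 0.29, 0.67, 0.69, 0.95, 0.99):
    print(f"r = {r:.3f} -> d = {r_to_d(r):.2f}")
```

Note that the weak correlation r = 0.196 already lands on d ≈ 0.40, which is the "zone of desired effect" referred to in the quote.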
A moderate correlation of r = 0.69 gets converted into one of the largest effect sizes in Hattie's book of d = 1.91 - this would rank #1 on Hattie's list. A weak correlation of r = 0.29 gets converted into an effect size of d = 0.61 - this would rank #20. Blichfeldt (2011) on Hattie's correlation, “correlations or correspondence do not provide grounds for causation. Hattie mentions that correlations should not be confused with causal analyzes. It is striking that the book is first and foremost presented so that it is read easily as causal analyzes, of "what works" or leading to good test results and not that he ranks the 138 variables thereafter - as a list of disconnected factors.” DuPaul & Eckert (2012, p. 408) “randomised control trials are considered the scientific "gold standard" for evaluating treatment effects... the lack of such studies in the school-based intervention literature is a significant concern.” Kelley & Camilli (2007, p. 33) - Teacher Training. Studies use different scales (not linearly related) for coding identical amounts of education. This limits confidence in the aggregation of the correlational evidence. Studies inherently involve comparisons of nonequivalent groups; often random assignment is not possible. But, inevitably, this creates some uncertainty in the validity of the comparison (p. 33). The correlation analyses are inadequate as a method for drawing precise conclusions (p. 34). Research should provide estimates of the effects via effect size rather than correlation (p. 33). Breakspear (2014, p. 13) states, “Too often policy makers fail to differentiate between correlation and causation.” Blatchford (2016, p. 94) commenting on Hattie's class size research, “Essentially the problem is the familiar one of mistaking correlation for causality. 
We cannot conclude that a relationship between class size and academic performance means that one is causally related to the other.” We are constantly warned that correlation does not imply causation! Yet, Hattie confesses: “Often I may have slipped and made or inferred causality” (p. 237). Lind (2013, p. 197) also questions Hattie about his use of correlation and accuses Hattie of displaying the correlation and not the effect size when it suits him. This means the effect appears small, but when converted it is large. The example Lind gives is VL, p. 197, where Hattie cites r = 0.67 for kinesthetic learning, but when converted d = 1.81. This is a huge effect! But Hattie rejected this study: “It is difficult to contemplate that some of these single influences... explain more of the variance of achievement that so many of the other influences in this book.” Sue Cowley writes, “It sounds so wonderfully simple doesn’t it? All you have to do to become ‘smarter’ is to know more words. And this ties so perfectly into the learning is memory narrative – memorise more words and hey presto! You are smart....But you can’t work backwards like that from research. It makes a nonsense of the vast complexity of the process. Correlation, as we should never tire of saying, is not causation. Sure, there’s a link, but you can’t put the cart in front of the horse. Knowing more words didn’t happen first – you can’t use it as a substitute for best practice in EYFS because it came as a result of something else. Which, in the case of early child development, is what we call ‘serve and return’ conversations, where loving and attentive caregivers pay careful attention to small children in order to support them, within rich and imaginative environments that enable learning. And there ain’t nothing simple about that.” In his updated version of VL (2012 summary), Hattie once again emphasises he mostly uses the 'fixed' method. Again, he makes no mention of using the weaker methodology of correlation (p. 10).
Yet, having neither revealed nor justified the use of correlation studies, and having admitted there are issues when comparing methods, Hattie ignores the problem and directly compares effect sizes from the 3 different methods without comment or adjustment. Simpson (2017, p. 452) explains, “...while calculating an effect size may be simple enough for a first course in statistics, there are considerable subtleties in understanding it sufficiently well to ensure that the processes of combining effect sizes in meta-analyses allow valid conclusions to be drawn.” There are further variations in the effect size calculation that different researchers use, e.g., Cohen's d, Hedges' g and Glass’s Δ. These methods differ in the standard deviation they use or in how they correct it. This creates more problems when comparing studies. This is best summarised by a report from John Mandrola, “The Year’s Most Important Study Adds to Uncertainty in Science.” Mandrola summarises a large study by Nosek et al. (2018), who recruited 29 teams comprising 61 researchers to analyse the SAME data; the teams came up with 29 totally DIFFERENT effect sizes! The randomised method insists on random assignment of students to a control & experimental group. Note the medical method also insists on "double blindness". That is, neither the participants in the control and experimental groups nor the staff know who is getting the treatment. This is done to remove the effect of confounding variables. Few of the studies that Hattie cites use random allocation. Cheung & Slavin (2016) support the concern about which method is used to calculate the effect size, “...effect sizes are significantly higher in quasi-experiments than in randomized experiments.” Slavin (2015) details the difference, “Matched quasi-experiments did produce inflated effect sizes (ES=+0.23 for quasi-experiments, +0.16 for randomized). 
This difference is not nearly as large as other factors we looked at, such as sample size (small studies greatly exaggerate outcomes), use of experimenter-made measures, and published vs. unpublished sources (experimenter-made tests and published sources exaggerate impacts). But our findings about matched vs. randomized studies are reason for caution about putting too much faith in quasi-experiments.” Berk (2011) concurs with Slavin and Bergeron & Rivard, “when the studies are not randomized experiments, there is a strong likelihood that a collection of biased treatment effect estimates is being combined. How is one then better off? Biased estimates are not random errors and do not cancel out. The result can be just a more precise causal estimate that has the wrong sign and is systematically far too large or far too small.” (p. 199). DuPaul & Eckert (2012) detail their concern, “randomised control trials are considered the scientific gold standard for evaluating treatment effects ... the lack of such studies in the school-based intervention literature is a significant concern” (p. 408). Note that the Education Endowment Foundation (EEF), as part of their quality control, only accept studies that use the randomised method (they would disregard MOST of Hattie's 1400+ studies). Similarly, the What Works Clearinghouse (WWC) reserve their highest rating for studies that use the randomised method (they would also disregard MOST of Hattie's studies). 
Many scholars are critical of Hattie's lack of quality control, e.g., Hattie cites many meta-analyses from Prof Bob Slavin, but Slavin (2017) is very critical of Hattie's method, “Hattie includes literally everything in his meta-meta analyses, including studies with no control groups, studies in which the control group never saw the content assessed by the post-test, and so on.” Slavin (2017) explains that the use of some sort of quality control would remove “…a lot of the awful research that gives Hattie the false impression that everything works, and fabulously.” As a result of all these issues, Slavin (2018) posted that John Hattie is Wrong! Hattie's lack of quality control, in part, explains the vast difference between the conclusions of Hattie and those of the EEF, WWC and others. Hattie also used different tests: standardised tests, specific tests, physical tests, a mother's rating of her child out of 5, IQ, and many measures of something else entirely, like hyperactivity & engagement. Hattie claims all of the studies he used focused on student achievement. But this is clearly NOT the case, as many studies measured something else, like hyperactivity. Wecker et al. (2017, p. 28) confirm this, saying Hattie mistakenly included studies that do not measure academic performance.
Fletcher-Wood (2021) commenting on Hattie's prime Feedback study,
"Kluger and DeNisi focused on the way feedback affects behaviour – not how it affects learning."
Even if all the studies did measure Student Achievement, there is a growing body of evidence showing that the test used can determine the effect size, e.g., standardised tests generate lower effect sizes than specific tests.
Simpson (2017, p. 461) gives examples of specific tests designed for a particular influence, e.g., improving algebra skills, resulting in a 40% higher effect size than a standardised test on the same students!
Kraft (2019) confirms this,
"Even among measures of student achievement, effect sizes for researcher-designed and specialized topic tests aligned with the treatment are often two to four times larger than effects on broad standardized state tests (Lipsey et al., 2012; Cheung & Slavin, 2016)" (p. 8).
Simpson (2017, p. 462) details more problems with tests. He shows tests with more questions give 400% higher effect sizes.
Simpson (2018b, p. 5) also shows problems with different standardised tests for the SAME maths intervention,
"The effect size... for the PIM test was 0.33 and for the SENT-R-B test was 1.11."
Slavin (2019) has also written extensively on this issue and confirms Simpson's analysis.
Hattie just ignores these issues and jumbles any test together.
So comparing these effect sizes is the classic 'apples versus oranges' problem.
The page Student Achievement discusses in more detail the HUGE question of what student achievement is. There is NO consensus on what it is. So, how does one measure it?
Blichfeldt (2011),
"We also get no information about how 'learning outcomes' are defined or measured in the studies at different levels, what tests are used, which subjects are tested and how."
Many of the scholars that Hattie used also comment on this problem,
DuPaul & Eckert (2012),
"It is difficult to compare effect size estimates across research design types. Not only are effect size estimates calculated differently for each research design, but there appear to be differences in the types of outcome measures used across designs." (p. 408).
Kelley & Camilli (2007),
"methodological variations across the studies make it problematic to draw coherent generalisations. These summaries illustrate the diversity in study characteristics including child samples, research designs, measurement, independent and dependent variables, and modes of analysis." (p. 7).
Simpson (2017) & Bergeron & Rivard (2017) give examples of how the same influence, depending on how you define the control and experimental groups, can give effect sizes ranging from 0 to infinity!
As a result, Bergeron & Rivard and Simpson call into question Hattie's entire use of effect size comparisons, e.g., Simpson (2017, p. 463),
"standardised effect size is a research tool for individual studies, not a policy tool for directing whole educational areas. These meta-meta-analyses which order areas on the basis of effect size are thus poor selection mechanisms for driving educational policy and should not be used for directing large portions of a country’s education budget."
Bakker et al. (2019) confirmed these problems (detailed below) with Hattie's work and conclude,
"...his lists of effect sizes ignore these points and are therefore misleading."
Also, Simpson (2017, p. 455) details other problems,
"the experimental condition in some studies and meta-analyses is the comparison condition in others."
You can listen to podcasts - Bergeron (2018) and Simpson (2018p).
Many other scholars also warn of this problem,
Wrigley (2018) also discusses the problem of control groups with regard to Hattie's work,
"should the control group experience the absence of the practice being trialled, or simply ‘business as usual’?
This ambiguity concerning the control group can seriously distort attempts to calculate an ‘effect size’.
We do not learn whether teachers and teaching assistants in the control group had any access to training comparable to that of the treatment group, whether they also taught small classes, or what ‘business as usual’ actually involved" (p. 363).
"Sometimes Hattie uses ‘effect size’ to mean ‘as compared to a control group’ and at other times to mean ‘as compared to the same students before the study started" (p. 368).
Poulsen (2014, p. 3) identifies that Hattie often uses studies that do not have control groups,
"It does not appear if the many effects studies were in general investigations control groups. Control groups mentioned, but in what sense were they actually compatible with the trial groups? If not, much cannot be concluded about learning outcomes" (translated from Danish).
Nielsen and Klitmøller (2017, p. 4) concur,
"The meta-analyses... do not have uniform standards for, how they measure the effect. In many meta-analyses, studies involving the effect are not related to the use of control groups" (translated from Danish).
Also, Lervåg & Melby-Lervåg (2014),
"If you do not have a control group, the effect size will be calculated only on the basis of performance on the mapping before and after the action. The effect size will then be artificially high without this being a correct image. An example of this from Hattie's book is that vocabulary programs come out with a very high effect size."
Hattie's flagship study on Feedback, Kluger & DeNisi (1996), also warns generally of the problem of a lack of control groups in educational studies,
"Without control groups, we may know more about the relative merits of several types of FI messages, but we have no idea if they are better, equal, or inferior to no intervention. This state of affairs is alarming." (p. 276)
Becker (2012) in his critique of Marzano (but relevant for Hattie) states,
"Marzano and his research team had a dependent variable problem. That is, there was no single, comparable measure of 'student achievement' (his stated outcome of interest) that they could use as a dependent variable across all participants. I should note that they were forced into this problem by choosing a lazy research design [a meta-analysis]. A tighter, more focused design could have alleviated this problem."
Hattie Combines Studies that have Totally Different Definitions of Influences, i.e., the Apples vs Oranges problem
This is a major problem with all of Hattie's work. Examples include -
Self Report - Hattie combines peer assessment with self report.
Feedback - Hattie combines background music on an assembly line with monetary rewards, with feedback to teachers and feedback to students.
Snook et al. (2009) was one of the earliest critiques of Hattie's Visible Learning (VL).
Hattie responded to some of their critiques. However, Snook et al. (2010) replied that they were surprised Hattie did not respond to what they consider to be the major problem with Visible Learning, i.e., the lack of consistency in defining variables and the lack of carefully defined concepts. They give this example,
'In education, however, the variables being studied are often poorly conceptualised and the studies often far from rigorous. How does one clearly distinguish for research purposes between a classroom that is “teacher centred” and one which is “student centred”' (p. 96)
Later, Yelle et al. (2016) also summarise the problem with Hattie's combining of studies,
"In education, if a researcher distinguishes, for example, project-based teaching, co-operative work and teamwork, while other researchers do not distinguish or delimit them otherwise, comparing these results will be difficult. It will also be difficult to locate and rigorously filter the results that must be included (or not included) in the meta-analysis. Finally, it will be impossible to know what the averages would be.
It is therefore necessary to define theoretically the main concepts under study and to ensure that precise and unambiguous criteria for inclusion and exclusion are established. The same thing happens when you try to understand how the author chose the studies on e.g., problem-based learning. The word we find is general, because it compiles a large number of researches, dealing with different school subjects. It should be noted that Hattie notes variances between the different school subjects, which calls for even greater circumspection in the evaluation of the indicators attributed to the different approaches.
This is why it is crucial to know from which criteria Hattie chose and classified the meta-analyses retained and how they were constituted. How do the authors of the 800 meta-analyses compiled in Hattie (2009) define, for example, the different approaches by problem? In other words, what are the labels that they attach to the concepts they mobilize?
As for the concepts of desirability and efficiency from which these approaches must be located, they themselves are marked by epistemological and ideological issues. What do they mean? According to what types of knowledge is a method desirable? In what way is it effective? What does it achieve?
Hattie's book does not contain information on these important factors, or when it does, it does so too broadly. This vagueness prevents readers from judging for themselves the stability of so-called important variables, their variance or the criteria and methods of their selection. The lack of clarity in the criteria used for the selection of studies is therefore a problem."
Pant (2014, p. 85) is also critical of Hattie aggregating a wide variety of interventions under one label -
"which calls into question the theoretical relevance of the analysis."
A great example of this is in the studies on class size.
A comparison of the studies shows different definitions for small and normal classes, e.g. one study defines 23 as a small class but another study defines 23 as a normal class. So comparing the effect size is not comparing the same thing!
Schulmeister & Loviscach (2014),
"Even where he has grouped meta-analyses correctly by their independent variables such as instructional interventions, Hattie has in many cases mixed apples and oranges concerning the dependent variables. In some groupings, however, both the independent and the dependent variables do not match easily. For instance, in the group “feedback”, a meta-analysis using music to reinforce behavior is grouped with other studies using instructional interventions that are intended to elicit effects on cognitive processes."
"Many of the meta-analyses do not really match the same effect group (i.e., the influence) in which Hattie refers to them. For instance, in the group 'feedback', studies investigating the effect of student feedback on teachers are mixed with studies that examine the effect of teacher feedback on students."
Nielsen & Klitmøller (2017) discuss in detail the many problems of different definitions of feedback and large versus small class sizes. Blatchford (2016b) also raises this issue about Hattie,
"it is odd that so much weight is attached to studies that don't directly address the topic on which the conclusions are made" (p. 13).
Hattie's defense in Visible Learning (VL) was,
"A common criticism is that it combines 'apples with oranges' and such combining of many seemingly disparate studies is fraught with difficulties. It is the case, however, that in the study of fruit nothing else is sensible" (p. 10).
In his latest defense, Hattie & Hamilton (2020), "Real Gold Vs Fool's Gold", Hattie continues with this "fruit" response and once again does not address the significant issues of disparate studies,
"Any literature review involves making balanced judgements about diverse studies. A major reason for the development of meta-analysis was to find a more systematic way to join studies, in a similar way that apples and oranges can make fruit salad. Meta-analysis can be considered to ask about “fruit” and then assess the implications of combining apples and oranges, and the appropriate weighting of this combination. Unlike traditional reviews, meta-analyses provide systematic methods to evaluate the quality of combinations, allow for evaluation of various moderators, and provide excellent data for others to replicate or recombine the results. The key in all cases is the quality of the interpretation of the combined analyses. Further, as noted above, the individual studies can be evaluated for methodological quality." (p. 3-4).
Finally, Hattie has admitted that these differences, or heterogeneity of definitions, are a major problem. In his recent analysis, Wisniewski, Zierer & Hattie (2020), "The Power of Feedback Revisited", he has changed to Method 1 (The Random Model),
"...the significant heterogeneity in the data shows that feedback cannot be understood as a single consistent form of treatment." (p. 1).
The 3 common effect size calculations are Cohen's d, Hedges' g and Glass’s Δ. The difference is in the SD they use.
Cohen uses the pooled SD, Hedges also uses the pooled SD but adjusts for sample size and Glass uses the control group SD.
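As a rough illustration of how the three calculations differ on identical data, the sketch below computes all of them for two small hypothetical groups. The scores are invented, and the Hedges correction uses the common approximation 1 − 3/(4·df − 1), which is an assumption about which variant a given study applied.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def pooled_sd(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Square root of the sample-size-weighted average of the two group variances."""
    n_t, n_c = len(treatment), len(control)
    var_t, var_c = stdev(treatment) ** 2, stdev(control) ** 2
    return math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))


def cohens_d(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Mean difference divided by the pooled SD."""
    return (mean(treatment) - mean(control)) / pooled_sd(treatment, control)


def hedges_g(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Cohen's d shrunk by a small-sample correction factor."""
    df = len(treatment) + len(control) - 2
    return cohens_d(treatment, control) * (1.0 - 3.0 / (4.0 * df - 1.0))


def glass_delta(treatment: Sequence[float], control: Sequence[float]) -> float:
    """Mean difference divided by the control group's SD only."""
    return (mean(treatment) - mean(control)) / stdev(control)


# Hypothetical scores: the treatment group is far less spread out than the control group.
treatment_scores = [58, 60, 61, 62, 64]
control_scores = [40, 50, 55, 60, 70]

print(f"Cohen's d: {cohens_d(treatment_scores, control_scores):.2f}")
print(f"Hedges' g: {hedges_g(treatment_scores, control_scores):.2f}")
print(f"Glass's Δ: {glass_delta(treatment_scores, control_scores):.2f}")
```

The same 6-point difference in means comes out as three different "effect sizes", which is the comparability problem described here.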
Prof Gene Glass, the inventor of meta-analysis, warned of SD problems in his seminal paper, Integrating Findings: The Meta-Analysis of Research (1977).
Glass shows that since the effect size is calculated by dividing by the standard deviation (see formulas above) the standard deviation that is chosen can change the effect size in a significant way!
Glass gives this example (p. 370):
"The definition of ES appears uncomplicated, but heterogeneous group variances cause substantial difficulties. Suppose that experimental and control groups have means and standard deviations as follows:
The measure of experimental effect could be calculated either by use of Se or Sc or some combination of the two, such as an average or the square root of the average of their squares or whatever. The differences in effect sizes ensuing from such choices are huge:
The third basis of standardization—the average standard deviation—probably should be eliminated as merely a mindless statistical reaction to a perplexing choice. It must be acknowledged that both the remaining 1.00 and 0.20 are correct; neither can be ruled out as false... However, the control group mean is only one-fifth standard deviation below the mean of the experimental group when measured in control group standard deviations; thus, the average experimental group subject exceeds 58 percent of the subjects in the control group. These facts are neither contradictory nor inconsistent; rather they are two distinct features of a finding which cannot be captured by one number."
Note: Cohen (1988) describes another way to calculate the standard deviation, the 'pooled standard deviation', which averages the variances first and then takes the square root. This seems to be the accepted method now; with equal group sizes, it would give d ≈ 0.28 for this example.
As can be seen in this example, the effect size can be 0.20, 0.28, 0.33 or 1.00 for the same data!
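The arithmetic behind these competing figures can be checked with a short sketch. The means and standard deviations below are illustrative values chosen so that the control SD is five times the experimental SD and the mean difference equals the experimental SD, which reproduces the 1.00, 0.20 and 0.33 reported above; they are not guaranteed to be the exact numbers in Glass's table. The pooled figure assumes equal group sizes, under which it comes out at about 0.28; unequal group sizes would give a different value.

```python
import math

# Illustrative values (assumed, not taken from the original table).
mean_e, sd_e = 52.0, 2.0   # experimental group
mean_c, sd_c = 50.0, 10.0  # control group
difference = mean_e - mean_c

effect_sizes = {
    "experimental SD": difference / sd_e,
    "control SD": difference / sd_c,
    "average of the two SDs": difference / ((sd_e + sd_c) / 2),
    # Pooled SD with equal group sizes: average the variances, then take the root.
    "pooled SD (equal n)": difference / math.sqrt((sd_e ** 2 + sd_c ** 2) / 2),
}

for basis, value in effect_sizes.items():
    print(f"{basis:<24} ES = {value:.2f}")
```

Nothing about the data changes between the four lines of output; only the choice of denominator does.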
If comparing effect sizes across studies, as Hattie does, then Gene Glass warns,
"If some attempt is not made to deal with this problem, a source of inexplicable and annoying variance will be left in a group of effect-size measures" (p. 372).
Hattie references this seminal paper from Glass in VL, but once again ignores the problem.
As a general rule, older studies used Glass’s Δ, while newer studies used Cohen's d or Hedges' g. Note that the WWC, a very large evidence organisation, uses Hedges' g.
For example, in the studies Hattie used for feedback, a range of standard deviations was used. Standley (1996, p. 109) used Glass’s Δ while the other studies used Cohen's d:
"The effect sizes of experimental results in this analysis were estimated by contrasting the means of experimental/treatment conditions (Exp) and control/base-line conditions (Con) divided by the standard deviation (SD) of the control/baseline condition, as in the formula below:"
In Problem Based Learning, Gijbels et al. (2005) use Glass’s Δ.
Topphol (2011) also discusses a slight variation of this problem with Hattie's work.
"...in these two cases, the difference between the mean values is the same, D = 20 point. The distribution to the control group is drawn with a solid curve while dotted curve is used for the treatment group. Standard deviations are different, 5 to the left and 17 to the right. This gives different effect sizes, d = 4 and d = 1.18." (p. 464)
Sampling students from small or abnormal populations:
What Topphol displays above is a well-known issue for meta-analyses for a number of reasons: effect sizes are erroneously larger (due to a smaller standard deviation) and moderating variables are exacerbated.
Using such samples makes it invalid to generalise influences to the broader student population.
Professor Dylan Wiliam explains: https://youtu.be/6ajXJ6PbDcg
Simpson (2017) details this problem,
"Researchers can make legitimate design decisions which alter the standard deviation and thus report very different effect sizes for identical interventions. One such design decision is range restriction" (p. 456).
Simpson then insightfully explains that sampling from smaller populations is a major reason why effects for influences such as feedback, meta-cognition, etc are high while effects for whole school influences - class size, summer school, etc are low.
"One cannot compare standardised mean differences between sets of studies which tend to use restricted ranges of participants with researcher designed, tightly focussed measures and sets of studies which tend to use a wide range of participants and use standardised tests as measures" (p. 463).
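Simpson's point about range restriction can be sketched numerically. The example below is a simplified illustration under stated assumptions: the score ranges, the 5-point gain and the function name are all hypothetical, and the intervention is assumed to add the same raw gain to every student.

```python
from statistics import stdev

RAW_GAIN = 5  # hypothetical: the intervention adds 5 points for every student

# Hypothetical score populations (one student at each score).
full_range = list(range(0, 101))   # broad student population, scores 0..100
restricted = list(range(45, 56))   # narrow subgroup, scores 45..55


def standardised_gain(control_scores, raw_gain):
    """Effect size when every student gains `raw_gain` points:
    the raw gain divided by the SD of the untreated (control) scores."""
    return raw_gain / stdev(control_scores)


d_full = standardised_gain(full_range, RAW_GAIN)
d_restricted = standardised_gain(restricted, RAW_GAIN)

print(f"full range:       d = {d_full:.2f}")
print(f"restricted range: d = {d_restricted:.2f}")
```

The identical 5-point gain yields an effect size of about 0.17 in the broad population and about 1.51 in the restricted one, purely because the denominator shrinks.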
Allerup (2015) also identifies this problem: if one distribution has very little spread and, moreover, lies entirely within the outer boundaries of the second, then an effect size is almost impossible to calculate (p. 6).
Kraft (2019) and Bakker et al. (2019) confirm this problem with SD.
But Hattie just ignores these issues and uses meta-analyses from abnormal student populations, e.g., ADHD, hyperactive, emotionally/behaviourally disturbed and English Second Language students.
Also, he uses abnormal subjects from NON-student populations, e.g., doctors, tradesmen, nurses, athletes, sports teams and military groups.
Professor John O'Neill (2012b) wrote a letter to the NZ Education Minister regarding major issues with Hattie's research. One of the issues he emphasises is Hattie's use of students from abnormal populations.
One example from the research Hattie used is Standley (1996), which Hattie used in Feedback. Standley reported effect sizes up to 35.44 and noted that these were based on very small sample sizes (p. 109). Shannahan (2017, p. 751) provides a detailed example,
"What Hattie seems to have done is just take an average of the original effects reported in the various meta-analyses. That sometimes is all right, but it can create a lot of double counting and weighting problems that play havoc with the results.
For example, Hattie combined two meta-analyses of studies on repeated reading. He indicated that these meta-analyses together included 36 studies. I took a close look myself, and it appears that there were only 35 studies, not 36, but more importantly, four of these studies were double counted. Thus, we have two analyses of 31 studies, not 36, and the effects reported for repeated reading are based on counting four of the studies twice each!
Students who received this intervention outperform those who didn't by 25 percentiles, a sizeable difference in learning. However, because of the double counting, I can't be sure whether this is an over- or underestimate of the actual effects of repeated reading that were found in the studies. Of course, the more meta-analyses that are combined, and the more studies that are double and triple and quadruple counted, the bigger the problem becomes."
Shannahan (2017, p. 752) provides another detailed example,
"this is (also) evident with Hattie's combination of six vocabulary meta-analyses, each reporting positive learning outcomes from explicit vocabulary teaching. I couldn't find all of the original papers, so I couldn't thoroughly analyze the problems. However, my comparison of only two of the vocabulary meta-analyses revealed 18 studies that weren't there. Hattie claimed that one of the meta-analyses synthesized 33 studies, but it only included 15, and four of those 15 studies were also included in Stahl and Fairbanks's (1986) meta-analysis, whittling these 33 studies down to only 11. One wonders how many more double counts there were in the rest of the vocabulary meta-analyses.
This problem gets especially egregious when the meta-analyses themselves are counted twice! The National Reading Panel (National Institute of Child Health and Human Development, 2000) reviewed research on several topics, including phonics teaching and phonemic awareness training, finding that teaching phonics and phonemic awareness was beneficial to young readers and to older struggling readers who lacked these particular skills. Later, some of these National Reading Panel meta-analyses were republished, with minor updating, in refereed journals (e.g., Ehri et al., 2001; Ehri, Nunes, Stahl, & Willows, 2002). Hattie managed to count both the originals and the republications and lump them all together under the label Phonics Instruction - ignoring the important distinction between phonemic awareness (children's ability to hear and manipulate the sounds within words) and phonics (children's ability to use letter–sound relationships and spelling patterns to read words). That error both double counted 86 studies in the phonics section of Visible Learning and overestimated the amount of research on phonics instruction by more than 100 studies, because the phonemic awareness research is another kettle of fish. Those kinds of errors can only lead educators to believe that there is more evidence than there is and may result in misleading effect estimates."
Wecker et al (2017, p. 30) also detail examples,
"In the case of papers summarizing the results of several reviews on the same topic, the problem usually arises that a large part of the primary studies has been included in several of the reviews to be summarized (see Cooper and Koenka 2012, p. 450 ff.). In the few meta-analyses available so far, complete first-stage meta-analyses have often been excluded because of overlaps in the primary studies involved (Lipsey and Wilson 1993, p. 1197; Peterson 2001, p. 454), already at overlaps of 25% (Wilson and Lipsey 2001, p. 416) or of three or more primary studies (Sipe and Curlette 1997, p. 624).
Hattie, on the other hand, completely ignores the duplicates problem despite sometimes significantly greater overlaps.
For example, on the subject of web-based learning, 14 of the 15 primary studies from the meta-analysis by Olson and Wisher (2002, p. 11), whose mean effect size of 0.24 is significantly different from the results of the other two meta-analyses on the same topic (0.14 or 0.15), were already covered by one of the two other meta-analyses (Sitzmann et al., 2006, pp. 654 ff.)"
Kelley & Camilli (2007, p. 25) note that many studies use the same data sets. To maintain the statistical independence of the data, only one set of data points from each data set should be included in the meta-analysis.
Hacke (2010, p. 83),
"Independence is the statistical assumption that groups, samples, or other studies in the meta-analyses are unaffected by each other."
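A check for double counting of the kind Shannahan and Wecker et al. describe can be sketched with plain sets. Everything in the snippet is illustrative: the study identifiers are invented, and the split into 20 and 15 studies is a hypothetical arrangement chosen only to match Shannahan's totals (35 reported, four shared, 31 unique). The 25% threshold is the exclusion criterion that Wecker et al. attribute to Wilson and Lipsey.

```python
def overlap_report(meta_a, meta_b):
    """Compare two meta-analyses given as sets of primary-study identifiers."""
    shared = meta_a & meta_b
    return {
        "reported_total": len(meta_a) + len(meta_b),  # naive count, duplicates included
        "shared": len(shared),                        # studies counted twice
        "unique_total": len(meta_a | meta_b),         # independent primary studies
        "share_of_b_in_a": len(shared) / len(meta_b),
    }


# Hypothetical identifiers: studies 1..20 in one review, 17..31 in the other.
meta_a = {f"study_{i:02d}" for i in range(1, 21)}
meta_b = {f"study_{i:02d}" for i in range(17, 32)}

report = overlap_report(meta_a, meta_b)
print(report)

# Web-based learning example: 14 of 15 primary studies already covered elsewhere.
OVERLAP_THRESHOLD = 0.25
web_based_overlap = 14 / 15
print(f"overlap {web_based_overlap:.0%} exceeds threshold: "
      f"{web_based_overlap > OVERLAP_THRESHOLD}")
```

Only the unique studies are statistically independent, so an average taken over the naive total gives the shared studies double weight.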
This is a major problem in Hattie's synthesis, as many of the meta-analyses that Hattie averages use the same data sets, e.g., much of the same data is used in Teacher Training as is used in Teacher Subject Knowledge. Hattie's averaging also hides much of the complexity; for example, Snook et al. (2009), on Homework:
"There is also the difficulty which arises amalgamating a large number of disparate studies. When results of many studies are averaged, the complexity of education is ignored: variables such as age, ability, gender, and subject studied are set aside. An example of this problem can be seen in Hattie’s treatment of homework: does homework improve learning or not?
Overall, Hattie finds that the effect size of homework is 0.29. Thus a media commentator, reading a summary might justifiably report: “Hattie finds that homework does not make a difference.” When, however, we turn to the section on homework we find that, for example, the effect sizes for elementary (primary in our terms) and high schools students are 0.15 and 0.64 respectively.
Putting it crudely, the figures suggest that homework is very important for high school students but relatively unimportant for primary school students.
There were also significant differences in the effects of homework in mathematics (high effects) and science and social studies (both low effects). Results were high for low ability students and low for high ability students. The nature of the homework set was also influential. (pp 234-236). All these complexities are lost in an average effect size of 0.29" (p. 4).
Schulmeister & Loviscach (2014),
'The effect size given per influence is the mean value of a very broad distribution. For instance, in “Inductive Teaching” Hattie combines two meta-analyses with effect sizes of d = 0.06 and d = 0.59 to a mean effect size of d = 0.33 with a standard error of 0.035. This is like saying ”this six-sided dice does not produce numbers from 1 to 6; rather, it produces the number 3.5 in the mean, and we are pretty sure about the first decimal place of this mean value.”'
Dr. Jim Thornton (2018) Professor of Obstetrics and Gynaecology at Nottingham University said,
"To a medical researcher, it seems bonkers that Hattie combines all studies of the same intervention into a single effect size... In medicine it would be like combining trials of steroids to treat rheumatoid arthritis, effective, with trials of steroids to treat pneumonia, harmful, and concluding that steroids have no effect! I keep expecting someone to tell me I’ve misread Hattie."
Another example comes from Nilholm (2013), 'It's time to critically review John Hattie', on Inductive Teaching,
"Hattie reports two meta-analyzes. One is from 2008 and includes 73 studies related to 'inductive teaching', it shows that the work method generally gives a relatively strong effect. According to a meta-analysis from 1983, which includes 24 studies of inductive teaching in natural sciences, the work method gives a weak effect.
Hattie simply takes the mean of these two meta-analyzes and thus "inductive teaching" can be dismissed. A more reasonable conclusion would be that "inductive teaching" in science subjects has weak support but that generally it seems to be a good way of working. Alternatively, it did not appear to work before, but later research gives a much more positive picture" (p. 2).
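The arithmetic behind the Inductive Teaching example is easy to reproduce. The sketch below combines the two effect sizes quoted by Schulmeister & Loviscach with the study counts quoted by Nilholm. Pairing d = 0.59 with the 73-study meta-analysis and d = 0.06 with the 24-study one is my assumption, inferred from Nilholm's "relatively strong" versus "weak" description, and the function names are hypothetical.

```python
def unweighted_mean(effects):
    """Plain average: every meta-analysis counts equally."""
    return sum(effects) / len(effects)


def weighted_mean(effects, weights):
    """Average in which each meta-analysis counts in proportion to its weight."""
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)


effects = [0.59, 0.06]   # effect sizes of the two meta-analyses
n_studies = [73, 24]     # assumed pairing with the number of primary studies

simple = unweighted_mean(effects)
by_studies = weighted_mean(effects, n_studies)

print(f"unweighted mean:          d = {simple:.3f}")
print(f"weighted by study count:  d = {by_studies:.2f}")
```

Under that assumed pairing the unweighted mean is 0.325 (Hattie's 0.33), while weighting by the number of studies gives about 0.46. Neither single number captures the spread between 0.06 and 0.59 that the critics emphasise.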
Nilholm (2013) details another example using "problem-based learning".
This problem is widespread in Hattie's work; other examples include class size, feedback and ability grouping. Also, many of the researchers Hattie cites warn about averaging:
Mabe and West (1982),
"considerable information would be lost by averaging the often widely discrepant correlations within studies" (p. 291).
Wrigley (2018),
"What now stands proxy for a breadth of evidence is statistical averaging. This mathematical abstraction neglects the contribution of the practitioner’s accumulated experience, a sense of the students’ needs and wishes, and an understanding of social and cultural context" (p. 359).
Wrigley (2018) then goes into detail about inappropriate averaging by Hattie and the EEF,
"... quite dissimilar studies are thrown together and an aggregate mean of effect sizes calculated. Although some tolerance is acceptable in meta-analysis, since no two research studies are exactly alike, serious problems can arise from aggregating and averaging studies using different definitions of an issue, and based on different curriculum areas, ages and attainment levels of students, types of school, education systems, and so on...
Indeed, Gene Glass, who originated the idea of meta-analysis, issued this sharp warning about heterogeneity: 'Our biggest challenge is to tame the wild variation in our findings not by decreeing this or that set of standard protocols but by describing and accounting for the variability in our findings. The result of a meta-analysis should never be an average; it should be a graph.'(Robinson, 2004: 29, my italics)" (p. 367).
Wrigley (2018) then quotes Coe,
"One final caveat should be made here about the danger of combining incommensurable results. Given two (or more) numbers, one can always calculate an average. However, if they are effect sizes from experiments that differ significantly in terms of the outcome measures used, then the result may be totally meaningless...
In comparing (or combining) effect sizes, one should therefore consider carefully whether they relate to the same outcomes... One should also consider whether those outcome measures are derived from the same (or sufficiently similar) instruments and the same (or sufficiently similar) populations... It is also important to compare only like with like in terms of the treatments used to create the differences being measured. In the education literature, the same name is often given to interventions that are actually very different. It could also be that... the actual implementation differed, or that the same treatment may have had different levels of intensity in different studies. In any of these cases, it makes no sense to average out their effects. (Coe, 2002, my italics)" (p. 367).
Prof Gene Glass (1977), the inventor of the meta-analysis, who Hattie quotes regularly, warned of this problem in his seminal paper, Integrating Findings: The Meta-Analysis of Research.
"Precisely what weight to assign to each study in an aggregation is an extremely complex question, one that is not answered adequately by suggestions to pool the raw data (which are rarely available) or to give each study equal weight, regardless of sample size. If one is aggregating arithmetic means, a weighting of results from each study according to √N might make sense" (p. 358).
Fixed Methods scholars recommend weighting (Pigott, 2010, p. 9), so that larger studies are given greater weight. If this were done, it would affect all of Hattie's reported effect sizes, and his rankings would totally change.
The range of students numbers in studies that Hattie used is enormous. In the influence 'Comprehensive teaching reforms' Hattie cites Borman & D'Agostino (1996) using nearly 42 Million Students! While in the 'gender - attitudes' influence Hattie cites Cooper, Burger & Good (1980) with 219 students. These have equal weight in Hattie's work.
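To see how much weighting by sample size could matter, here is a minimal sketch of the √N weighting that Glass mentions, applied to the two student counts above. The effect sizes of 0.20 and 0.60 are hypothetical placeholders, not figures from Hattie or from these studies, and "nearly 42 million" is rounded to 42,000,000.

```python
import math


def sqrt_n_weighted_mean(effects, sample_sizes):
    """Average effect size with each study weighted by the square root of its N."""
    weights = [math.sqrt(n) for n in sample_sizes]
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)


sample_sizes = [42_000_000, 219]   # students in the two studies (first is rounded)
effects = [0.20, 0.60]             # hypothetical effect sizes, for illustration only

equal_weight = sum(effects) / len(effects)
sqrt_weighted = sqrt_n_weighted_mean(effects, sample_sizes)

print(f"equal weights:  d = {equal_weight:.2f}")
print(f"sqrt(N) weights: d = {sqrt_weighted:.3f}")
```

With equal weights the small study pulls the average to 0.40. With √N weights the result stays within about 0.001 of the large study's value.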
Shannahan (2017, p. 752) gives more detailed examples,
"when meta-analyses of very different scopes are combined - what if one of the meta-analyses being averaged has many more studies than the others? Simply averaging the results of a meta-analysis based on 1,077 studies with a meta-analysis based on six studies would be very misleading. Hattie combined data from 17 meta-analyses of studies that looked at the effects of students’ prior knowledge or prior achievement levels on later learning. Two of these meta-analyses focused on more than a thousand studies each; others focused on fewer than 50 studies, and one as few as six. Hattie treated them all as equal. Again, potentially misleading."
Pant (2014, p. 95) verifies Shannahan's analysis and provides another detailed example:
"Hattie (2009) aggregates the mean effect sizes of the original meta-analyses without weighting them by the number of studies included. Meta-analyses which are based on many hundreds of individual studies enter the d-barometer with the same weight as meta-analyses with only five primary studies. The consequences of this approach for the substantive conclusions will be briefly demonstrated by a numerical example from Hattie's (2009) data. The effect of the teaching method of direct instruction (Direct Instruction), determined from four meta-analyses, is d = 0.59 according to Hattie (2009, p. 205) and thus falls into the 'desired zone' (d > 0.4).
Direct instruction is a by no means undisputed form of highly structured, teacher-centered teaching. Looking at the processed meta-analyses one by one, it is striking that the analysis that is by far the largest, with 232 primary studies (Borman et al., 2003), is the one with the smallest effect size (d = 0.21). If the three meta-analyses for which information on the standard error was presented were weighted according to their number of primary studies (Hill et al. 2007, Shadish and Haddock 2009), the resulting effect size would be d = 0.39 and thus no longer in the 'desired' zone of action defined by Hattie."
Wecker et al. (2017, p. 31) give an example of using weighted averages:
"This would mean a descent from 26th place to 98th in his ranking."
Professor Peter Blatchford (2016) also warns of this problem,
"unfortunately many reviews and meta-analyses have given them equal weighting" (p. 15).
See (2017) emphasises the issue of quality of evidence & averaging by Hattie, Marzano, and others,
"there are studies which involved only one participant, some had no comparator groups and some involved children with specific learning difficulties or had huge attrition as large as 70%. These may form the majority of studies reporting huge positive effects. On the other hand, the few good quality studies may report small effects.
Averaging effect sizes from across studies of different quality giving equal weights to all can lead to misleading conclusions" (p. 10).
Arnold (2011),
"I was surprised that Hattie has chosen to summarise the effect sizes of the 800 meta-analyses using unweighted averages. Small and large meta-analyses have equal weight, while I would assume that the number of studies on which a meta-analysis is based indicates its validity and importance. Instead I would have opted for weighted averaging by number of studies, students or effect sizes. At a minimum, it would be interesting to see whether the results are robust to the choice of averaging."
Proulx (2017) and Thibault (2017) also question Hattie's averaging.
Example - Visual Perception Programs-
Hattie's effect size is d = 0.55. But if we weight according to the number of students (assuming that studies reporting no student numbers are assigned the lowest reported number of students, 4,400), we get a weighted effect size of d = 0.79, shooting this up from #35 to #7.
Nielsen & Klitmøller (2017) also show this problem in their detailed analysis of Hattie's use of feedback studies- see feedback.
In his latest 2020 defense, Real Gold Vs Fool's Gold, Hattie does not address and simply dismisses the detailed issues presented by all the peer reviews above (p. 2),
Problem 9. Confounding Variables:
Related to problem 1: research designers usually put a lot of thought into controlling other variables. Random assignment and double blinding are the major strategies used. Unfortunately, most of the studies Hattie cites do not use these strategies, which introduces major confounding variables into the studies. Class size is a good example: many studies compare the achievement of small versus large classes in schools, but many schools assign lower-achieving students to smaller classes rather than using random assignment.
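The class size example can be illustrated with a toy calculation. This is a deliberately simplified sketch, not a model of any real study: the prior scores, the true benefit of 5 points and the function name are hypothetical, and alternating assignment is used as a deterministic stand-in for random assignment.

```python
from statistics import mean

TRUE_EFFECT = 5.0  # hypothetical benefit, in points, of being in a small class

# 60 hypothetical students with prior achievement scores 40..99, sorted low to high.
prior_scores = list(range(40, 100))


def observed_difference(small_class, large_class):
    """Mean outcome in small classes minus mean outcome in large classes.
    Outcome = prior score, plus TRUE_EFFECT for students in a small class."""
    small_outcomes = [score + TRUE_EFFECT for score in small_class]
    return mean(small_outcomes) - mean(large_class)


# Non-random: the 20 lowest-achieving students are placed in small classes.
confounded = observed_difference(prior_scores[:20], prior_scores[20:])

# Balanced: every other student goes to a small class.
balanced = observed_difference(prior_scores[0::2], prior_scores[1::2])

print(f"true effect:          {TRUE_EFFECT:+.1f}")
print(f"non-random placement: {confounded:+.1f}")
print(f"balanced placement:   {balanced:+.1f}")
```

With non-random placement the small classes appear to score 25 points lower despite a genuine 5-point benefit. The balanced comparison recovers a difference of 4 points, close to the true value.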
Thibault (2017) gives other examples (English translation),
"a goal of the mega-analyses is to relativize the factors of variation that have not been identified in a study, balancing in some way the extreme data influenced by uncontrolled variables. But by combining all the data as well as the particular context that is associated with each study, we eliminate the specificities of each context, which for many give meaning to the study itself! We then lose the richness of the data and the meaning of what we try to measure.
It even happens that one brings together results that are deeply different, even contradictory in their nature.
For example, the source of the feedback remains risky, as explained by Proulx (2017), given that Hattie (2009) claims to have realized that the feedback comes from the student and not from the teacher, but it is no less certain that his analysis focused on feedback from the teacher. It is right to question this way of doing things since quantitative studies seek to control variables to isolate the effect of each. When combining data from different studies, the attempt to control the variables is annihilated. Indeed, all these studies have not necessarily sought to control the same variables in the same way; they have probably used different instruments and been carried out with populations that are difficult to compare. So these combinations are not just uninformative, but they significantly skew the meaning."
Nielsen & Klitmøller (2017) discuss the problem of Hattie not addressing moderating factors, the interaction of factors and the disparate operational definitions of different studies,
"it is our assessment that in four of the five "heaviest" surveys mentioned in connection with Hattie's coverage of Feedback, it is conceptually unclear whether they operate with a concept of feedback that is identical to Hattie's" (p. 11, translated from Danish).
Blichfeldt (2011),
"to validly put more blurred variables into accurate calculations seems problematic...
...he allows a very low degree of precision as to what variables are included in the calculations as to what may be expected and how results can be understood. At the same time, he uses calculations and statistics that should require precision and control that it is hard to find coverage for. Which does not prevent him from producing results as very precise with two decimal places...
What he studies is summarized statistical relationships between unclear variables and skill tests."
Nilholm (2013) confirms this problem,
"Hattie's major failure is to report summative measurements of meta-analysis without taking into account so-called moderating factors. Working methods can work better for a particular subject, a certain grade, some students and so on. Hattie believes that the significance of such moderating factors is less than one can think. I would argue that they are often very noticeable, as in the examples I reported [see problem-based learning and inductive teaching] Unless such moderating factors are taken into account, direct generalizations will be made directly" (p. 3).
Allerup (2015) in 'Hattie's use of effect size as the ranking of educational efforts', calls for a more sophisticated multivariate analysis,
"it is well known that analyses in the educational world often require the involvement of more dimensional (multivariate) analyses" (p. 8).
Hattie rarely acknowledges this problem now, but in earlier work, Hattie & Clinton (2008, p. 320) stated:
"student test scores depend on multiple factors, many of which are out of the control of the teacher."
Another pertinent example is from Kulik and Kulik (1992) - see ability grouping:
Two different methods produced distinctly different results. Each of the 11 studies with same-age control groups showed greater achievement for the accelerated students; the average effect size in these studies was 0.87.
However, if you use the (usually 1 year older) students as the control group, the average effect size in the 12 studies was 0.02. Hattie uses this figure in the category 'ability grouping for gifted students'.
Hattie does not include the d = 0.87. I think a strong argument can be made that the result d = 0.87 should be reported instead of the d = 0.02 as the accelerated students should be compared to the student group they came from (same age students) rather than the older group they are accelerating into.
The Combination of Influences:
In addition, a study may be measuring the combination of many influences. For example, using class size, how do you remove other influences from the study? For example, time on task, motivation, behaviour, teacher subject knowledge, feedback, home life, welfare, etc.
Nielsen & Klitmøller (2017) discuss this problem in detail.
But, Hattie wavers on this major issue. In his commentary on 'within-class grouping' about Lou et al. (1996, p. 94) Hattie does report some degree of additivity,
"this analysis shows that the effect of grouping depends on class size. In large classes (more than 35 students) the mean effect of grouping is d = 0.35, whereas in small classes (less than 26 students) the mean effect is d = 0.22."
But in his summary, he states,
"It is unlikely that many of the effects reported in this book are additive" (p. 256).
Problem 10. Quality of Studies:
"Extraordinary claims require extraordinary evidence." Carl Sagan
Hattie's constant proclamation (VL 2012 summary, p. 3),
"it is the interpretations that are critical, rather than data itself."
is contrary to the scientific method, as Snook et al. (2009, p. 2) explain:
"Hattie says that he is not concerned with the quality of the research... of course, quality is everything. Any meta-analysis that does not exclude poor or inadequate studies is misleading, and potentially damaging if it leads to ill-advised policy developments. He also needs to be sure that restricting his data base to meta-analyses did not lead to the omission of significant studies of the variables he is interested in."
Professor John O'Neill (2012a) wrote a significant letter to the NZ Education Minister & Hattie regarding the poor quality of Hattie's research, in particular the overuse of studies about university, graduate or preschool students and the danger of making classroom policy decisions without consulting other forms of evidence, e.g., case and naturalistic studies.
"The method of the synthesis and, consequently, the rank ordering are highly problematic" (p. 7).
Hattie ignored O'Neill's critique and constantly proclaims,
"Almost all of this data is based on what happens in real schools with real kids..."
See (2017) emphasises the lack of quality in the evidence used by Hattie,
"there are several problems with relying on such evidence taken from meta-analyses of meta-analyses for policy and practice.
First, much of it is not particularly robust (small-scale, involving non-randomisation of participants, based on summaries of effects across a wide range of subjects and age groups).
Second, no consideration was taken of the quality of research in the synthesis of existing evidence. For example, there are studies which involved only one participant, some had no comparative groups and some involved children with specific learning difficulties or had huge attrition as large as 70%. These may form the majority of studies reporting huge positive effects. On the other hand, the few good quality studies may report small effects. Averaging effect sizes from across studies of different quality giving equal weights to all can lead to misleading conclusions" (p. 10).
Schulmeister & Loviscach (2014),
"Many of the meta-analyses used by Hattie are dubious in terms of methodology. Hattie obviously did not look into the individual empirical studies that form the bases of the meta-analyses, but used the latter in good faith."
Nielsen & Klitmøller (2017) also discuss the problems of quality using examples from VL, p. 75 and 196 - see feedback.
"Hattie does not deal with the potential problems in his own investigation but instead refers to others who have to deal with problems in connection with meta-analyses generally. In other words, Hattie is not directly concerned about the quality of his own investigation.
In some selected contexts, nevertheless, Hattie does throw out studies based on quality, but this is neither consistent nor systematic" (p. 10, translated from Danish).
Nielsen & Klitmøller's criticism is based on Hattie sometimes using the following protocols to justify exclusion of meta-analyses,
"mainly based on doctoral dissertations, ..., with mostly attitudinal outcomes, many were based on adult samples ... and some of the sample sizes were tiny" (VL, p. 196).
Lind (2013) confirms this and also uses more examples from VL, pp. 196 ff., where he accuses Hattie of disregarding studies that do not suit him, e.g. kinesthetic learning.
The Encyclopedia of Measurement and Statistics outlines the problem of quality:
"many experts agree that a useful research synthesis should be based on findings from high-quality studies with methodological rigour. Relaxed inclusion standards for studies in a meta-analysis may lead to a problem that Hans J. Eysenck in 1978 labelled as garbage in, garbage out."
Or in modern terms, Dr. Gary Smith (2014, p. 25),
"garbage in, gospel out."
Most of the researchers whose work Hattie used warn about the quality of studies, e.g., Slavin (1990, p. 477),
"any measure of central tendency in a meta-analysis... should be interpreted in light of the quality and consistency of the studies from which it was derived, not as a finding in its own right.
'best evidence synthesis' of any education policy should encourage decision makers to favour results from studies with high internal and external validity—that is, randomised field trials involving large numbers of students, schools, and districts."
Janson (2018),
"Hattie compiles large numbers of meta-analyses of all kinds for his meta-meta-analyses, without paying too much attention to the meaning or quality of the original studies."
The U.S. Department of Education has set up the National Center for Education Research whose focus is to investigate the quality of educational research. Their results are published in the What Works Clearing House. They also publish a Teacher Practice Guide which differs markedly from Hattie's results - see Other Researchers.
Importantly, they focus on the QUALITY of the research and reserve their highest ratings for research that uses randomised division of students into a control and an experimental group. Where students are non-randomly divided into a control and experimental group for what they term a quasi-experiment, a moderate rating is used. However, the two groups must have some sort of equivalence measure before the intervention. A low rating is used for other research design methods, e.g., correlation studies.
However, once again, Hattie ignores these issues and makes an astonishing caveat, there is,
"no reason to throw out studies automatically because of lower quality" (p. 11).
Problem 11. Time over which each study ran:
Given Hattie interprets an effect size of 0.40 as equivalent to 1 year of schooling, and his polemic related to this figure:
"I would go further and claim that those students who do not achieve at least a 0.40 improvement in a year are going backwards..." (p. 250).
In terms of teacher performance, he takes this one step further by declaring that teachers who don't attain an effect size of 0.40 are 'below average' (Hattie, 2010, p. 87).
This means, as Professor Dylan Wiliam points out, that studies need to be controlled for the time over which they run, otherwise legitimate comparisons cannot be made.
Professor Wiliam, who also produced the seminal research, 'Inside the black box', also reflects on his own research and cautions,
"it is only within the last few years that I have become aware of just how many problems there are. Many published studies on feedback, for example, are conducted by psychology professors, on their own students, in experimental sessions that last a single day. The generalizability of such studies to school classrooms is highly questionable.
In retrospect, therefore, it may well have been a mistake to use effect sizes in our booklet 'Inside the black box' to indicate the sorts of impact that formative assessment might have.
I do still think that effect sizes are useful... If the effect sizes are based on experiments of similar duration, on similar populations, using outcome measures that are similar in their sensitivity to the effects of teaching, then I think comparisons are reasonable. Otherwise, I think effect sizes are extremely difficult to interpret."
Hattie (2015) finally admitted this was an issue:
"Yes, the time over which any intervention is conducted can matter (we find that calculations over less than 10-12 weeks can be unstable, the time is too short to engender change, and you end up doing too much assessment relative to teaching). These are critical moderators of the overall effect-sizes and any use of hinge=.4 should, of course, take these into account."
Yet this has not affected his public pronouncements, nor has it led to additions or removals of studies in his database. He has not made any adjustment to his section on feedback, even though Professor Wiliam states that many of the studies are on university students over 1 DAY. Hattie does not appear to take TIME into account!
The section A YEARS PROGRESS? goes into more detail about this issue.
Problem 12. Researcher Bias:
Wolf et al. (2020) conclude that effect sizes from studies conducted by a program's developers are 80% larger than those from independent evaluators (0.31 vs 0.14), with ~66% of the difference attributable to publication bias.
Problem 13. The assumption of Normality:
Allerup (2015), in 'Hattie's use of effect size as the ranking of educational efforts', shows that deviations from this assumption in the form of skewed or Cauchy distributions, which have wider tails than normal distributions, give very different effect size measures, which makes effect sizes difficult to interpret appropriately (p. 10).
Allerup gives the example that results from the international evaluations run by the OECD (PISA) and the IEA (TIMSS) are not normally distributed (p. 7).
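To make Allerup's point concrete, here is a minimal sketch (not taken from Allerup's paper) of how a single heavy-tail observation can change an effect size. The scores are invented for illustration, and the pooled-standard-deviation form of Cohen's d is an assumption about which effect size formula is being used.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    s1, s2 = stdev(treatment), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(treatment) - mean(control)) / pooled_sd


# Hypothetical test scores, invented purely for illustration.
control = [48, 50, 52, 49, 51]
treatment_clean = [53, 55, 57, 54, 56]
# Same group, but the last score is an extreme (heavy-tail) value.
treatment_heavy = [53, 55, 57, 54, 96]

d_clean = cohens_d(treatment_clean, control)
d_heavy = cohens_d(treatment_heavy, control)

print(f"d without the extreme score: {d_clean:.2f}")
print(f"d with the extreme score:    {d_heavy:.2f}")
```

In this toy data the raw mean difference grows when the extreme score is added, yet d shrinks, because the extreme score inflates the standard deviation in the denominator.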
Problem 14. meta-analysis vs META-meta-analysis
Hattie himself has admitted that there are significant problems with his Synthesis of Meta-analyses, or META-meta-analysis, method. From Wisniewski, Zierer & Hattie (2020), "The Power of Feedback Revisited",
"The question arises, whether synthesizing research on feedback on different levels, from different perspectives and in different directions and compressing this research in a single effect size value leads to interpretable results. In contrast to a synthesis approach, the meta-analysis of primary studies allows to weigh study effects, consider the issues of systematic variation of effect sizes, remove duplets, and search for moderator variables based on study characteristics.
Therefore, a meta-analysis is likely to produce more precise results." (p. 2-3)
To understand this monumental change it is REALLY important to understand the subtle difference between Hattie's and the EEF's approach of a META-meta-analysis versus a simpler meta-analysis approach.
This all leads to significant criticism of VL:
Snook et al. (2009):
"Any meta-analysis that does not exclude poor or inadequate studies is misleading and potentially damaging" (p. 2).
Terhart (2011):
"It is striking that Hattie does not supply the reader with exact information on the issue of the quality standards he uses when he has to decide whether a certain research study meta-analysis is integrated into his meta-meta-analysis or not. Usually, the authors of meta-analyses devote much energy and effort to discussing this problem because the value or persuasiveness of the results obtained are dependent on the strictness of the eligibility criteria" (p. 429).
Rømer (2016):
"...I have demonstrated that there are problems with the dependent variable, the learning yield, i.e., the effect of the intervention. It is weakly understood and there is an unpredictable contradiction between the theory of learning theory and the theory of education theory" (p. 15, translated from Danish).
David Didau gives an excellent overview of Hattie's effect sizes, cleverly using the classic clip from the movie Spinal Tap, where Nigel tries to explain why his guitar amp goes up to 11.
https://learningspy.co.uk/featured/unit-education/
Hooley (2013), in his review of Hattie, talks about the complexity of classrooms and the difficulty of controlling variables,
"Under these circumstances, the measure of effect size is highly dubious" (p. 44).
Neil Brown: https://academiccomputing.wordpress.com/2013/08/05/book-review-visible-learning/
"My criticisms in the rest of the review relate to inappropriate averaging and comparison of effect sizes across quite different studies and interventions."
The USA Government Funded Study on Educational Effect Size Benchmarks - https://ies.ed.gov/ncser/pubs/20133000/
"The usefulness of these empirical benchmarks depends on the degree to which they are drawn from high-quality studies and the degree to which they summarise effect sizes with regard to similar types of interventions, target populations, and outcome measures."
The study also defined the criteria for accepting a research study, i.e., the quality needed (p. 33):
Search for published and unpublished research dated 1995 or later.
Specialised groups such as special education students, etc. were not included.
Studies were restricted to those using random assignment designs (that is, method 1) with practice-as-usual control groups and attrition rates no higher than 20%.
NOTE: using these criteria virtually NONE of the 800+ meta-analyses in VL would pass the quality test!
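The acceptance criteria listed above can be read as a simple screening rule. The sketch below is only an illustration of that reading: the `Study` class, its field names, and the two example studies are hypothetical and do not come from the IES report.

```python
from dataclasses import dataclass


@dataclass
class Study:
    # All field names are hypothetical; they simply mirror the listed criteria.
    year: int
    random_assignment: bool
    practice_as_usual_control: bool
    attrition_rate: float  # fraction of participants lost, e.g. 0.15 for 15%
    specialised_population: bool  # e.g. special education students


def meets_benchmark_criteria(study: Study) -> bool:
    """Return True only if the study satisfies every criterion listed above."""
    return (
        study.year >= 1995
        and not study.specialised_population
        and study.random_assignment
        and study.practice_as_usual_control
        and study.attrition_rate <= 0.20
    )


# Two invented studies, for illustration only.
randomised_trial = Study(2004, True, True, 0.12, False)
one_day_lab_study = Study(1988, False, False, 0.05, False)

print(meets_benchmark_criteria(randomised_trial))   # True
print(meets_benchmark_criteria(one_day_lab_study))  # False
```

A study fails the screen as soon as any one criterion is not met, which is why a pre-1995 or non-randomised study is excluded regardless of how many students it includes.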
The U.S. Department of Education standards:
The intervention must be systematically manipulated by the researcher, not passively observed.
The dependent variable must be measured repeatedly over a series of assessment points and demonstrate high reliability.
Method 1 (random allocation) is the gold standard.
Method 2 is accepted but with a number of caveats. They use the phrase quasi-experimental design, which compares outcomes for students, classrooms, or schools who had access to the intervention with those who did not but were similar in observable characteristics. In this design, the study MUST demonstrate baseline equivalence.
In other words, the students can be broken into a control and experimental group (without randomization), but the two groups must display equivalence at the beginning of the study.
However, the rating of these types of studies is 'Meets WWC Group Design Standards with Reservations.'
So at BEST most of the studies used by Hattie would be classified by The U.S. Department of Education as 'Meets WWC Group Design Standards with Reservations.'
But Hattie uses Millions of students!
The large number of students used in the synthesis seems to excuse Hattie from the usual validity and reliability requirements. For example, Kuncel (2005) has over 56,000 students and reports the highest effect size of d=3.10, but it does not measure what Hattie says it does, a self-reported grade in the future; rather, it measures student honesty with regard to their GPA a year ago. So this meta-analysis is not a valid or reliable study of the influence of self-reported grades. The 56,000 students are totally irrelevant.
Note, many of the controversial influences have only 1 or 2 meta-analyses as evidence.
Bergeron & Rivard (2017) - on Hattie's huge numbers:
"We cannot allow ourselves to simply be impressed by the quantity of numbers and the sample sizes; we must be concerned with the quality of the study plan and the validity of collected data."
Larsen (2014),
"the megalomaniac additive annexation of all sorts of meta-analyses is not concerned with methodologically critical self-reflections, nor with validity claims, i.e., it does not specify the limits to what can be said and made commensurable. The risk is that knowledge in the collected empirical data piles disappears when it is formalised in a second-, third-, and-fourth-order perspective" (p. 6).
Sjøberg (2012) also argues that, instead of supporting a hypothesis, Hattie uses the rhetorical strategy of citing an overwhelming number of meta-analyses to heighten their public impact.
Wrigley (2015) in Bullying by Numbers, gives a detailed analysis of this problem.
David Weston gives a good summary of issues with Effect Sizes:
2min - contradictory results of studies are lost by averaging
4min 30sec - Reports of studies are too simplified and detail lost
5min - What does effect size mean?
6min 15 sec - Hattie's use of effect size
7min - Issues with effect size
8min 40sec - problems with spread of scores (standard deviation)
9min 30sec - need to check details of Hattie's studies
10min 30sec - problem with Hattie's hinge point d=0.40 (see A Year's Progress)
16min 50secs - Prof Dylan Wiliam's seminal work - 'Inside the Black Box', is an example of research that has been oversimplified by Educationalists - e.g., 'writing objectives on the board' but other more important findings have been lost.
18min - Context is king
Professor Robert Coe's detailed paper on Effect size Calculations here.
David Weston uses a great analogy of a chef with teaching (5min onwards).
John Oliver gives a funny overview of the problems with Scientific Studies:
Another overview of the issues with published studies-
A short video on the issues with Social Science Research.
2. Class size:
Hattie's claims were based on averaging the effect sizes (d) of the following three meta-analyses: Gene Glass & Mary Lee Smith (1979), McGiverin et al. (1989), and Goldstein, Yang, Omar, & Thompson (2000). Ironically, Gene Glass also invented the meta-analysis methodology, defined most of the protocols, and, in contrast to Hattie, stated,
"The result of a meta-analysis should never be an average; it should be a graph." (Robinson, 2004, p. 29).
Prof Peter Blatchford, the lead author of the most comprehensive review of class size, Class Size Eastern and Western perspectives (2016), challenges Hattie's claim that reducing class size is a "disaster",
"One reason for the prevalence of the unimportant view are several highly influential reports which have set in motion a set of messages that have generated a life of their own, separate from the research evidence, and have led to a set of taken for granted assumptions about class size effects.
Given the important influence these reports seem to be having in government and regional education policies, they need to be carefully scrutinised in order to be sure about the claims that are made" (p. 93).
Blatchford names Hattie's interpretation & summary of studies as the major source of the evidence provided by these reports.
Interestingly, later in the same book, the weight of evidence from the 16 other academics forces Hattie to concede,
"The evidence is reasonably convincing - reducing class size does enhance student achievement" (p. 113).
Yet, around the same time, in the TV series Revolution School (part 3, 1min 20sec) Hattie definitively states,
"Reducing class size... does not make a difference to the quality of education!"
Worse, in Hattie's many public presentations from 2005-2015, he promoted the view that 'reducing class size' is a "disaster" and a "distraction" - e.g. from Hattie's 2005 ACER lecture:
Worse still, Hattie and his commercial partner Corwin published an advertisement (Feb 2020) that seems to contradict Hattie's stated aim,
"One aim of this book is to develop an explanatory story about the key influences on student learning - it is certainly not to build another ‘what works’ recipe." (VL, p. 6).
Blatchford et al. (2016), also talk about the lack of quality evidence,
"there are in fact relatively few high-quality dedicated studies of class size and this is odd and unfortunate given the public profile of the class size debate and the need for firm evidence based on purposefully designed research fit for purpose" (p. 275).
Then, when pressed in an interview with Ollie Lovell, Hattie says (audio here),
"It's about the story, not the numbers!"
Most teachers would agree with Eddie Woo, the teacher who was named Australian of the Year, who said (video here - start at 41min):
"Don't tell me that class size does not make a difference!"
1. Gene Glass and Mary Lee Smith (1979) investigate a range of class size reductions and summarise their results on page 11.
Then on page 15-
Hattie is often ambiguous when he reports this study. In VL (p. 87) he says he got his average from the class size reduction of 25 down to 15. But the study does not report 0.09 for class size of 25 down to 15.
Then, in Blatchford (2016), Hattie states,
"Glass and Smith (1979) reported an average effect of 0.09 based on 77 studies..." (p. 106).
It appears Hattie got that result from the following statement by Glass & Smith (1979). However, in the same paragraph they dismiss exactly the kind of overall average that Hattie relies on,
"over all comparisons available-regardless of the class sizes compared- the results favored the smaller class by about a tenth of a standard deviation in achievement. This finding is not too interesting, however, since it disregards the sizes of the classes being compared." (p. 10).
I also contacted Prof Glass to ensure I had interpreted his study correctly. He kindly replied,
"Averaging class size reduction effects over a range of reductions makes no sense to me.
It's the curve that counts.
Reductions from 40 to 30 bring about negligible achievement effects. From 20 to 10 is a different story.
But Teacher Workload and its relationship to class size is what counts in my book."
Glass & Smith (1979) also conclude,
"The curve for the well-controlled studies then, is probably the best representation of the class-size and achievement relationship...
A clear and strong relationship between class size and achievement has emerged... There is little doubt, that other things being equal, more is learned in smaller classes." (p. 15).
They also detail,
"The class size and achievement relationship seems consistently stronger in the secondary grades than in the elementary grades." (p. 13).
Barwe and Dahlström (2013) also pick up on this issue and agree Hattie misrepresents this study by reporting one effect size (p. 16).
If you look at this meta-analysis in more detail a totally different STORY emerges, which is not represented by using this one average (Hattie only uses the one incorrect average).
Bergeron (2017) reiterates,
"Hattie computes averages that do not make any sense."
Thibault (2017), in 'Is John Hattie's Visible Learning so visible?', also questions Hattie's method of using one average to represent a range of studies (translation to English),
"We are entitled to wonder about the representativeness of such results: by wanting to measure an overall effect for subgroups with various characteristics, this effect does not faithfully represent any of the subgroups that it encompasses!
... by combining all the data as well as the particular context that is associated with each study, we eliminate the specificity of each context, which for many give meaning to the study itself!"
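Thibault's point about subgroups can be illustrated with the three subgroup effect sizes that this document quotes further below from Goldstein et al. (2003, p. 17) for a reduction from 30 to 20 pupils. The sketch assumes an unweighted mean, because the sizes of the three groups are not given in the quote; it is an illustration, not a reconstruction of how Hattie computed his figures.

```python
from statistics import mean

# Effect sizes (in standard deviations) quoted later in this document
# from Goldstein et al. (2003, p. 17) for a reduction from 30 to 20 pupils.
effects_by_attainment = {
    "low attainers": 0.35,
    "middle attainers": 0.20,
    "high attainers": 0.15,
}

# Assumption: an unweighted mean, since group sizes are not given in the quote.
values = list(effects_by_attainment.values())
single_average = mean(values)
spread = max(values) - min(values)

print(f"single averaged effect: {single_average:.2f}")
for group, effect in effects_by_attainment.items():
    print(f"{group}: {effect:.2f} ({effect - single_average:+.2f} vs the average)")
print(f"range hidden by the average: {spread:.2f}")
```

The single average sits between the groups and matches none of them, while the gap between low and high attainers is almost as large as the average itself.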
In his 2020 defense, Real Gold Vs Fool's Gold, Hattie responded to my claim that he reports a different conclusion to that of the actual authors of the studies (p. 12),
Yet, it is ironic that the author of the class size study, Professor Gene Glass, who also invented the meta-analysis methodology, wrote a book with 20 other distinguished academics contradicting Hattie, '50 Myths and Lies That Threaten America's Public Schools: The Real Crisis in Education'.
In Myth #17: Class size doesn't matter; reducing class sizes will not result in more learning, the 21 scholars collaboratively say,
"Fiscal conservatives contend, in the face of overwhelming evidence to the contrary, that students learn as well in large classes as in small... So for which students are large classes okay? Only the children of the poor?"
2. McGiverin et al. (1989).
Their study covered only 2nd-year students and used only properly controlled studies with experimental and control groups (although not randomly assigned). They settled on a more pragmatic definition: a large class is about 26 students and a small class is about 19 (p. 49).
So this study is specifically comparing a class size of about 26 students with 19 students.
So this is a totally different study than the Glass study above, which was looking at a range of different class size comparisons.
Barwe and Dahlström (2013) also point out the difference between this and the Glass & Smith study. Also, they point out Hattie claimed his focus was on reducing 25 students down to 15, which is different to reducing 26 students down to 19 (p. 19).
McGiverin et al. state that the lack of experimental control and the diverse definitions of large and small are among the reasons cited for inconsistent findings regarding class size (p. 49).
They introduce a caveat by quoting Berger (1981, p. 49),
"Focusing on class size alone is like trying to determine the optimal amount of butter in a recipe without knowing the nature of the other ingredients."
Whilst they get a reasonably high d = 0.34, they advise caution in the interpretation of this result (p. 54). Also, they make special mention of the confounding variables: the Hawthorne effect, novelty, and self-fulfilling prophecy.
Note: Hattie's best quality study on Feedback, Kluger & DeNisi (1996), gets an effect size of 0.38.
3. Goldstein et al. (2000) state their aim:
"The present paper focuses more on the methodology of meta-analyses than on the substantive issues of class size per se."
For a more detailed discussion on class size, they recommend looking at their previous papers (p. 400).
Summary of results from page 401:
Hattie once again reports in the Blatchford book (p. 106),
"Goldstein, Yang, Omar, and Thompson (2000) 0.20 based on nine studies..."
But Hattie does not detail how he got this value of 0.20 from the above table. It seems that Hattie just averaged the table.
The authors DO NOT report an average of 0.20.
A comparison of the studies shows different definitions for small and normal classes, e.g. study 2 defines 23 as a small class whereas in study 9 it is a normal class.
Nielsen & Klitmøller (2017) in Blind spots in Visible Learning - Critical comments on the "Hattie revolution", discuss the disparate definitions of large and small classes in different studies (p. 7).
So comparing the effect size from different studies is not comparing the same thing!
The authors comment on another problem we have seen throughout VL (p. 403),
"we have the additional problem that different achievement tests were used in each study and this will generally introduce further, unknown, variation."
From a later paper, Goldstein et al. (2003, p. 20),
The 4 scholars contradict Hattie's claim that age makes no difference,
"Our results show how vital it is to take account of the age of the child when considering class size effects."
They give more detail about the complexity of class size,
"A reduction in class size from 30 to 20 pupils resulted in an increase in attainment of approximately 0.35 standard deviations for the low attainers, 0.2 standard deviations for the middle attainers, and 0.15 standard deviations for the high attainers" (p. 17).
So once again, the detail of the study is lost when Hattie uses ONE averaged effect size d value to represent that study.
Barwe and Dahlström (2013) also highlight the issue of Hattie reporting effect sizes which contradict the original studies, and of Hattie claiming age makes no difference when the studies conclude it does!
"However, here we have three meta-studies that show a strong connection between class reduction and student achievement, but these conclusions do not emerge in Hattie's synthesis. That's remarkable" (translated, p. 22).
One of the best explanations of the need for smaller class sizes is by Steven Kolber - here.
Sara Lawrence-Lightfoot defined effective teaching as,
"ideas conveyed through relationships."
Dylan Wiliam, re-affirming Lightfoot's definition:
"probably the best 4-word definition I have seen."
Hattie claims he regularly teaches classes of 1000 students?
In this 2019 podcast (52mins)- "...in my work I deal with class sizes of 1000, I know how to do that, I’ve got quite good at it over the last 40 years, how to teach to class of 1000."
Hmm, that's strange: the Melbourne Graduate School, where Hattie has been for the last 9 years, does not have a room or theatre that would fit anywhere near 1000 students. Also, he mostly works 1-to-1 with PhD students on their theses.
This is another example of Hattie's propensity for exaggeration and misrepresentation.
Hattie's Representation Changes:
For 10 years in his presentations to administrators, politicians and principals, Hattie promoted the meme "class size does not work".
For example, with Pearson (2015) - What Doesn't Work in Education - the politics of distraction, he names class size as one of the major distractions. In previous presentations, he consistently labelled class size a "disaster" or as "going backwards" (Hattie's 2005 ACER presentation).
Yet Hattie (2015), responding to critiques of his work, concludes:
"The main message remains, be cautious, interpret in light of the evidence, search for moderators, take care in developing stories..."
Using polemic language like 'disasters' is not being very cautious!
Corwin, the commercial arm of Hattie's Visible Learning, continues to promote "reduction of class size does not work" (Feb 2020). At least, in the most comprehensive peer review of class size so far, Class Size Eastern and Western perspectives (2016), Hattie retreats from his polemic and concedes to 16 other academics,
"The evidence is reasonably convincing - reducing class size does enhance student achievement" (p. 113).
He should instruct Corwin to do the same!
But, Hattie then cleverly shifts the debate,
"Why is the (positive) effect so small?" (p. 105).
In the interview with Ollie Lovell, Hattie answers his own question when he admits standardised tests are too narrow and get low effect sizes. Strangely, he forgets his class size studies used standardised tests!
Interview segment - here.
Another answer to why class size effect sizes are low is pretty obvious when you look at the tables above. Hattie derives his lowest effect size of 0.09 incorrectly. And when you average very small effect sizes from reductions of 40 down to 30 with large effect sizes from reductions of 20 down to 15, you get a low average.
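The averaging problem can be sketched with a few lines of arithmetic. The two effect sizes below are invented, chosen only to mimic a curve that is flat for large classes and steep for small ones; they are not taken from Glass & Smith or any other study. The sketch shows that the "average effect of class size reduction" depends on the mix of comparisons that happen to be in the pool, not only on the underlying relationship.

```python
from statistics import mean

# Hypothetical effect sizes for each kind of reduction (invented numbers,
# chosen only to mimic a curve that is flat for large classes and steep for small ones).
effect_by_reduction = {"40 to 30": 0.02, "20 to 15": 0.40}


def average_effect(comparisons):
    """Unweighted mean over a list of reduction labels, one entry per comparison."""
    return mean(effect_by_reduction[label] for label in comparisons)


# Two hypothetical pools of ten comparisons drawn from the same underlying curve.
mostly_large_classes = ["40 to 30"] * 8 + ["20 to 15"] * 2
mostly_small_classes = ["40 to 30"] * 2 + ["20 to 15"] * 8

print(f"pool dominated by 40-to-30 comparisons: {average_effect(mostly_large_classes):.3f}")
print(f"pool dominated by 20-to-15 comparisons: {average_effect(mostly_small_classes):.3f}")
```

Both pools describe the same relationship between class size and achievement, yet one average looks negligible and the other looks substantial.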
Then, continuing in the Lovell interview (mostly to teachers), he changes tack and says (@24mins),
"it's about the story, not the numbers..."
Wow what a change in interpretation!
Hattie's Story???
Hattie now says the low effect size means teachers do not change their teaching from large classes to small.
Firstly, I do not accept Hattie's contention that the effect size is small, and he has little evidence that teaching does not change.
In fact, if you read some of the studies, e.g., Goldstein et al. (2003), you see that a whole host of factors change in smaller classes:
Class size effects are compounding.
Behaviour of students.
More individualised learning.
Teaching changes dramatically in small classes.
Students have more focused and sustained attention.
More immediate feedback and questions.
The age of students is important.
Smaller classes benefit children who are most in need academically, and who thus have most ground to make up.
Blatchford (2003), also showed significant changes,
"We found consistent relationships between class size and teaching... children in small classes were more likely to interact with their teachers, there was more teaching on a one-to-one basis, more times when children were the focus of a teacher’s attention, and more teaching overall. In short, there was more teacher task time with pupils... there was more teacher support for learning, as reflected in the amount of individual attention paid to students, and in terms of the immediate, responsive, sustained and purposeful nature of teacher interactions with children, the depth of a teacher’s knowledge about children, and sensitivity to individual children’s needs." (p. 149).
Prof David Zyngier provides some support that teachers don't change but also cites research showing other differences,
"In larger class sizes teachers used class groupings, and these classes had lower achievement, while in smaller classes it was more common to teach to the whole class. There were more student questions in larger classes (usually seeking help or clarification), but more teacher follow-up of questions in smaller classes. There was a greater use of homework, assignments and oral tests for assessment purposes in smaller classes. The amount of time teachers spent directly interacting with students, and monitoring students’ work, was differently related to class size; more direct interaction occurred in smaller classes, whereas teachers lectured or explained more in larger classes. Finally, larger classes had fewer interactions overall between teachers and students, had higher noise levels, required more management than smaller classes, and the time spent in this way did not assist student learning" (2014, p. 6).
STANDARDISED TESTS:
Simpson (2017) also raises the standardised test issue, showing standardised tests get low effect sizes. Simpson also goes further and shows that changing the test can yield effect sizes from 0 to infinity for the same intervention!
Simpson poses this as one of the reasons influences like 'feedback' have high effect sizes (they use specific tests) while influences like 'class size' have low effect sizes (narrow standardised tests are used).
This answers Hattie's frequent claim that structural changes make no difference to education. Structural changes are measured by standardised tests! Simpson also details that sampling from smaller populations is another major reason why effects of influences such as 'feedback', 'meta-cognition', etc. are high while the effects for whole-school influences such as 'class size', 'summer school', etc. are low (p. 463),
"One cannot compare standardised mean differences between sets of studies which tend to use restricted ranges of participants with researcher designed, tightly focused measures and sets of studies which tend to use a wide range of participants and use standardised tests as measures."
Bergeron (2017) and Slavin (2016) also confirm Simpson's analysis. Prof Slavin has a blog devoted to this question here.
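Simpson's argument about restricted samples comes down to the denominator of the effect size. The sketch below uses invented numbers and assumes, for simplicity, that the same raw improvement is measured on the same scale in both scenarios; only the spread of scores differs.

```python
def standardised_mean_difference(raw_gain, standard_deviation):
    """Effect size d = raw gain divided by the standard deviation of the outcome measure."""
    return raw_gain / standard_deviation


# All numbers are hypothetical and chosen only for illustration.
raw_gain = 5.0  # the same improvement, in test points, in both scenarios

# Restricted range of participants (e.g. one homogeneous group): small spread of scores.
d_restricted = standardised_mean_difference(raw_gain, standard_deviation=8.0)
# Wide range of participants on a broad standardised test: large spread of scores.
d_wide = standardised_mean_difference(raw_gain, standard_deviation=25.0)

print(f"restricted sample: d = {d_restricted:.2f}")
print(f"wide sample:       d = {d_wide:.2f}")
```

The intervention is equally effective in both cases by construction, so the difference in d reflects the sample and the measure rather than the intervention.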
Hattie's Interpretation Is Used by Politicians for Public Policy:
"Hattie’s work has provided school leaders with data that appeal to their administrative pursuits" Eacott (2017, p. 3).
The Australian Government used Hattie and the Grattan Institute to block the significant funding proposed in the Gonski Review to redress the socioeconomic imbalance in Australian schools.
In a report by Dean Ashenden, summarising and interviewing the then Education Minister, Christopher Pyne (2012), Ashenden states,
"The best single source of evidence about the relative effectiveness of class size reductions and many other educational strategies is John Hattie’s Visible Learning.."
Pyne was reported to say,
"the evidence overwhelmingly shows that investing in teacher effectiveness rather than the number of teachers is the most successful method of improving student learning and creating top performing education systems."
Professor Blatchford (2016c) comments about this,
"When Christopher Pyne [the then Australian Education Minister] talked about prioritising teacher quality, rather than reducing class sizes, he set up a false and simplistic dichotomy" (p. 16).
A similar example comes from New Zealand, where Professor John O'Neill (2012a) wrote a significant letter to the NZ Minister of Education on the problem of using Hattie's research for class size policy, then published these issues in O'Neill (2012b).
Further, O'Neill (2012b) states,
"...the Minister of Education declined to rule out increases in class size. In short, this was because the ‘independent observation’ of Treasury and the research findings of an influential government adviser, Professor John Hattie, were that schooling policy should instead focus on improving the quality of teaching." (p. 1).
Also, on Hattie's class size interpretation O'Neill (2012b) warns that,
"Much of the terminology is ambiguous and inconsistently used by politicians, officials and academic advisers. The propositions are not demonstrably true – indeed, there is evidence to suggest they are false in crucial respects. The conclusion is, at best, uncertain because it does not take into account confounding evidence that larger classes do adversely affect teaching, learning and student achievement" (p. 2).
I am concerned about the unwavering confidence that Hattie displays when he talks about class size, given the caution and reservations that the scholars of each of his 3 studies express, as do other reputable scholars around the world. Those reservations stem from the lack of quality studies, the inability to control variables, the major differences in how achievement is measured, major confounding variables, and benchmark effect sizes.
The Largest Analysis and Peer Review of the Class Size Research:
Class Size Eastern and Western perspectives (2016), edited by Prof Blatchford et al. Note: Prof Blatchford has a dedicated website to class size research - http://www.classsizeresearch.org.uk
The editors state,
"there are in fact relatively few high-quality dedicated studies of class size and this is odd and unfortunate given the public profile of the class size debate and the need for firm evidence based on purposefully designed research fit for purpose" (p. 275).
"What often gets overlooked in debates about class size is that CSR is not in itself an educational initiative like other interventions with which it is often (and in a sense unfairly) compared, for example, reciprocal teaching, teaching metacognitive strategies, direct instruction and repeated reading programmes; it is just a reduction of the number of pupils in a classroom" (p. 276).
Prof Blatchford warns again about correlation studies,
"Essentially the problem is the familiar one of mistaking correlation for causality. We cannot conclude that a relationship between class size and academic performance means that one is causally related to the other" (p. 94).
The editors conclude,
"the chapters in this book are only a start and much more research is needed on ways in which class size is related to other classroom processes. This has implications for research methods: we need more systematic studies, e.g. which use systematic classroom observations, but also high-quality multi-method studies, in order to capture these less easily measured factors.
There is some disagreement about which groups are involved but often studies find it is low attaining and disadvantaged students who benefit the most. Blatchford et al. (2011) found evidence that smaller classes helped low attaining students at secondary level in terms of classroom engagement. Hattie (Chapter 7) develops the view that we might expect low attaining students to benefit from small classes in terms of developing self regulation strategies" (p. 278).
Blatchford concludes,
"The aim is move beyond the rather tired debates about whether class size affects pupil performance and instead move things on by developing an integrative framework for better understanding the relationships between class size and teaching, with important practical benefits for education world wide" (p. 102).
Hattie's contribution to the book (Chapter 7):
Hattie appears to be an outlier in this book. Of the 17 scholars who have contributed to the book ONLY Hattie myopically uses the effect size statistic to fully interpret the research. All the others use contextual and detailed features of the research to reach the conclusion that class size is important and significant.
At least the weight of scholarship has caused Hattie to retreat from his polemic on reducing class size as 'a disaster' and 'going backwards' and he finally concedes,
"The evidence is reasonably convincing - reducing class size does enhance student achievement" (p. 113).
But, Hattie cleverly re-frames the issue to
"Why is the (positive) effect so small?" (p. 105).
Given the significant amount of critique of Hattie's methodology (the lack of quality studies, the use of disparate measures of student achievement, the use of university students or pre-school children, correlation, the inconsistent definition of small and large class sizes, indiscriminate averaging, benchmark effect sizes, etc.), I was disappointed that Hattie did not address any of these issues, but rather focused on attacking Zyngier's (2014b) meta-review,
"Zyngier's review misses the elephant in the room" (p. 106).
But if Zyngier misses the elephant in the room, then so do all the other 16 researchers contributing to the book. For example, in the following chapter (8), Finn & Shanahan display what they believe to be significant findings (p. 124).
Hattie once again sidesteps the SIGNIFICANT issues raised by Zyngier (+ many others): e.g., the control of variables - the differing definition of large and small classes. Studies also differ on how to measure class size, some studies use a student/teacher ratio (STR) which includes many non-teaching staff like the principal, welfare staff, library, etc.
"Past research has too often conflated STR with class size" (p. 4).
Blatchford et al. (2016), also comment on this STR problem,
"they are not a valid measure of the number of pupils in a class at a given moment" (p. 95).
Hattie just re-states that meta-analyses provide a reasonably robust estimate and myopically focuses on the effect size statistic, but he provides no defence for the validity issues. He concedes that STR and class size are different, but he does not resolve the validity issue of using these disparate measures; he just fobs off the argument with a red herring, that STR and class size are related (p. 112), and he provides no evidence for this claim.
Given the importance of class size research, STR and Class size need to be MORE than just related.
They need to be the SAME!!!!
Hattie includes a 4th study to his effect size average, Shin and Chung (2009) - effect size d = 0.20. But he conveniently does not inform the reader that this study re-analysed the same data (the Tennessee STAR study) as the previous meta-analyses that he used.
Ironically, Shin and Chung warn against creating an effect size from repeated use of the same data,
"If a study has multiple effect sizes, the same sample can be repeatedly used. Repeated use of the same sample is, however, a violation of the independent assumption" (p. 14).
They also warn,
"we found too many Tennessee STAR studies... We worry about the dependence issue" (p. 15).
It seems to me Hattie's strategy is to take the focus off the scrutiny of his evidence and re-direct our attention elsewhere - a strategy for politicians, NOT for researchers!
Teacher Morale:
Blatchford et al. (2016), comment on the associated issue of teacher morale and class size,
"Virtually all class size studies report that teacher morale is higher in small classes than in larger classes. The personal preference for small classes was demonstrated by STAR third-grade teachers interviewed at the end of the school year. Teachers were asked whether they would prefer a small class with 15 students or a $2,500 salary increase. Seventy percent of all teachers and 81 percent of those who had taught small classes chose the small class option over a salary increase" (p. 129).
Prof Gene Glass agrees,
"Teacher Workload and its relationship to class size is what counts in my book."
PISA
Blatchford et al., challenge the statements of the head of PISA, Andreas Schleicher,
"there was reference to ten myths of education, as expressed by Andreas Schleicher, one of which was the myth that smaller classes benefited academic performance. The editors of this book tend to side with Berliner and Glass (2014) who address what they see as the 50 myths and lies which threaten American public schools. Myth no. 17 in their list is the belief that reducing class size will not result in more learning" (p. 275).
The Literacy/Numeracy Initiative
In Hattie's jurisdiction, the state of Victoria Australia, the Education Dept is implementing the largest educational initiative in Australia, costing over $200 Million. The project aims to improve literacy and numeracy.
The program incorporates a wide range of research, e.g., The G.R.I.N Program from Monash University., which recommends,
"Prior to the normal daily mathematics lesson, trained tutors conduct GRIN sessions with small groups of students (ideally three)."
A number of teachers commented in my session that this is in conflict with Hattie's presentations that class size does not matter.
Other Commentary
The Australian Education Union (2014) has published a comprehensive analysis of the class size research. They summarise that reducing class size does seem to improve student outcomes. Also, they highlight the problems with Hattie's methodology:
"The critics have cited the methodological problem of synthesising a whole range of meta-studies each with their own series of primary studies. There is no quality control separating out the good research studies from the bad ones. The different assumptions, definitions, study conditions and methodologies used by these primary studies mean that Hattie’s meta-analysis of the meta-analyses is a homogenisation which may distort the evidence (comparing apples with oranges)" (p. 13).
"The 0.21 effect he claims for class size is an average so that some studies may have found a significantly higher effect than that. For example, ‘gold standard’ primary research studies (using randomised scientific methodology) such as the Tennessee STAR project recorded a range of effect sizes including some at 0.62, 0.64 and 0.66, clearly well above the ‘hinge-point’ and the same as most variables which Hattie regards as very important" (p. 14).
O'Neill (2012a) raises more complex issues using the detailed case/naturalistic study by Blatchford (2011),
"...Blatchford makes the point that class size effects are ‘multiple’. For children at the beginning of schooling, there are significant potential gains in reading and maths in smaller classes. Children from ethnic minorities and children who start behind their peers benefit most. There is also a positive effect on behaviour, engagement and achievement, particularly for low achievers, where classes are smaller in the lower secondary school" (p. 10).
Leading researcher, Professor Dylan Wiliam states that the evidence is pretty clear that if you teach smaller classes you get better results. The problem is smaller classes cost a lot more (7min into full lecture).
Also, many scholars point out the irony in Hattie's view, that class size is a distraction - because the number of students in a class limits the ability of teachers to implement the kinds of changes that Hattie shows have the biggest effect, e.g., formative evaluation, micro teaching, behavior, feedback, teacher-student relationships, etc.
For example, Zyngier (2014) in his meta-review -
"The strongest hypothesis about why small classes work concerns students’ classroom behaviour. Evidence is mounting that students in small classes are more engaged in learning activities, and exhibit less disruptive behaviour" (p. 17).
Each of these studies also discusses their limitations. In particular, Goldstein et al. (2000) emphasise the issue, that has emerged for all of Hattie's synthesis;
"...we have the additional problem that different achievement tests were used in each study, and this will generally introduce further, unknown, variation" (p. 403).
Goldstein et al. (2003) go into detail about the problems of comparing correlation studies with random controlled experiments;
"…correlational studies that ... examined relationships between class size and children’s achievements at one point in time, are difficult to interpret because of uncertainties over whether other factors (e.g., non-random allocation of pupils to classes) might confound the results" (p. 3).
Goldstein et al. (1998) point out another major confounding variable:
"There is a tendency for schools to allocate lower achieving children to be in smaller classes. This bias means a considerable number of large cross-sectional studies (correlational) need to be ignored due to validity requirements" (p. 256).
Zyngier (2014) in his meta-review on class size -
"Noticeably, of the papers included in this review, only three authors supported the notion that smaller class sizes did not produce better outcomes to justify the expenditure" (p. 3).
"The highly selective nature of the research supporting current policy advice to both state and federal ministers of education in Australia is based on flawed research. The class size debate should now be more about weighing up the cost-benefit of class size reductions, and how best to achieve the desired outcomes of improved academic achievement for all children, regardless of their background. Further analysis of the cost-benefit of targeted CSR is therefore essential" (p. 16).
"Recognised in the education research community as the most reliable and valid research on the impact of class size reductions at that time, the Tennessee STAR project was a large series of randomised studies, followed up in Wisconsin by the SAGE project. After four years, it was clear that smaller classes did produce substantial improvement in early learning and cognitive studies, and that the effect of small class size on the achievement of minority children was initially about double that observed for majority children" (p. 7).
Zyngier concludes:
"Findings suggest that smaller class sizes in the first four years of school can have an important and lasting impact on student achievement, especially for children from culturally, linguistically and economically disenfranchised communities" (p. 1).
Snook et al. (2009) in their peer review of Hattie, also comment in detail about class size. They also discuss the STAR study reporting effect sizes did reach 0.66. They conclude:
"The point of mentioning these studies is not to 'prove' that Hattie is 'wrong' but to indicate that drawing policy conclusions about the unimportance of class size would be premature and possibly very damaging to the education of children particularly, young children and lower ability children. A much wider and in depth debate is needed" (p. 10).
Dan Haesler has a detailed look at class size and other issues.
Shorter Version
Hattie's book Visible Learning 2009 contains the same argument as his 2005 paper, "The paradox of class size reduction". What he did in 2009 was remove all the individual papers that he used in his 2005 paper, keep just the 3 big meta-analyses, and average these 3 meta-analyses. I've looked in detail at these 3 meta-analyses and shown that Hattie misrepresents them. All 3 conclude that class size reductions make a significant improvement in student achievement - the opposite of Hattie's claim. There are many issues with Hattie's analysis. Firstly, on p. 397 he claims he compares reductions of 25 down to 15. This is NOT the case, e.g. the largest study, Glass & Smith (1978), graphs a range of DIFFERENT class size reductions (look at the graph in my blog below) and Hattie just averages all of these. Prof Glass responded to Hattie's interpretation of his study - "Averaging class size reduction effects over a range of reductions makes no sense to me. It's the curve that counts. Reductions from 40 to 30 bring about negligible achievement effects. From 20 to 10 is a different story." Details here - https://visablelearning.blogspot.com/p/class-size.html
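Glass's point that "it's the curve that counts" can be illustrated with a minimal sketch. The logarithmic curve below is a purely hypothetical stand-in, chosen only because it flattens out at large class sizes; its shape and units are assumptions and are not taken from Glass & Smith (1978) or from Hattie.

```python
import math

# HYPOTHETICAL curve: achievement falls with the log of class size.
# The functional form and scale are illustrative assumptions, not estimates
# taken from Glass & Smith (1978) or from Hattie.
def achievement(class_size: float) -> float:
    return -math.log(class_size)

def reduction_effect(old_size: float, new_size: float) -> float:
    """Gain in (arbitrary) achievement units from shrinking a class."""
    return achievement(new_size) - achievement(old_size)

effect_40_to_30 = reduction_effect(40, 30)
effect_20_to_10 = reduction_effect(20, 10)

# Pooling the two reductions into one number hides how different they are.
pooled_average = (effect_40_to_30 + effect_20_to_10) / 2

print(f"40 -> 30: {effect_40_to_30:.2f}")
print(f"20 -> 10: {effect_20_to_10:.2f}")
print(f"average:  {pooled_average:.2f}")
```

Under this assumed curve, the same 10-student reduction is worth more than twice as much when it starts from 20 as when it starts from 40, and the single averaged figure describes neither case.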

Of course, these studies are largely cross-sectional in nature and so don’t permit causal inference. Slavin et al. (1989) meta-analyzed 8 studies that either utilized random assignment or controlled for pre-existing differences in school achievement. The median effect in this data set was also .13, lending credence to the cross-sectional research.

Slavin conceded that Cooper’s (1989) conclusion that “low achievers in the early grades [are] the group most likely to benefit from smaller classes” (p. 109) is valid. He suggested that “reducing class size may be justified on morale and other quality-of-life grounds.”

Since then, several states have undergone programs to reduce class size, and they seem to have improved student performance. For instance, Krueger (1999) analyzed the effects of class characteristics on student achievement when students were randomly assigned to classes via Tennessee’s Project STAR. Class size had a significant effect such that being assigned to a class with fewer than 17 students predicted a .22 SD increase in scores. Nye (1999) found that gains from the STAR program were still evident in a five-year follow-up. Corcoran et al. (2003). Similarly, Molnar et al. (1999) report on Wisconsin’s SAGE program, a 5-year project which reduced class sizes in Wisconsin to 15 students per teacher. Molnar et al. estimate that the program improved test scores by around .20 SD. On the other hand, as reported in Ehrenberg et al. (2001), in 1996 California reduced its class sizes from an average of 29 to a maximum of 20. This reduction is estimated to have increased scores by .05 to .10 SD. This is a smaller effect, but that is to be expected given the non-linear relationship between class size and student performance. In any case, the data here is of weaker quality than in other studies because students were not properly tested prior to the intervention. So it seems like class size has an effect. Turning to race, we see that racial differences in class size were non-existent by the early 1970s. Corcoran et al. (2003).

This concludes nothing about class. It's a comparison of the expenditure on education in different income groups, which is a poor representation of actual per-pupil funding, particularly if corruption plagues the eventual allocation of those funds as well (Shcherbatyuk, 2007). And in its conclusion the study itself states that the inequality was reduced significantly through policy, not that the class divides didn't exist. In fact, it says:

“we find that inequality fell by 20 to 35 percent between 1972 and 1997.”

Those estimates use district-level measures of teachers and students and abstract from any within-district and within-school variation in class assignment. You could still have the same underfunded groups, and inequality can still persist. Take for example the enrollment disparities in private vs public schools between racial demographics.
If SL is talking about pupil-teacher ratios then they’d be referring to Boozer, Krueger, and Wolkon (1992), which actually says:

“black students currently have a higher pupil-teacher ratio than do white students in all regions of the country but the South. In the Northeast, for instance, there is an average of 0.6 more students per teacher in the average school attended by black students than there is in the average school attended by white students, and the difference is 1.7 students per teacher for high schools.”

Other studies such as Reber et al. (2013) have even said the samples were small and imprecise. Cascio et al. (2005) say it didn’t systematically assess the role of financial incentives and was based on small data sets. Johnson et al. (2011) cite the Boozer and Corcoran studies as saying that racial segregation in public schools remained constant throughout the 1970s but has increased slightly since then. Ferguson et al. (2005) said that the racial and socioeconomic parity in average class sizes has no implications for whether class size reductions might be important (or not) for some students, in some schools, at some times. Heckman et al. (2004) say racial differences in school quality cannot be reflected in traditional school indicators such as pupil-teacher ratios.

A study by one of the same authors even said black students still attended large classes. Measures of the school's average class size suggest that black students are in larger classes. Further, the two measures result in differing estimates of the importance of class size in an education production function. They also conclude that school level measures may obscure important within-school variation in class size due to the small class sizes for compensatory education. Since black students are more likely to be assigned to compensatory education classes, a kind of aggregation bias results. They find that not only are black people in schools with larger average class sizes, but they are also in larger classes within schools, conditional on class type. The intra school class size patterns suggest that using within-school variation in education production functions is not a perfect solution to aggregation problems because of non-random assignment of students to classes of differing sizes. However, once the selection problem has been addressed, it appears that smaller classes at the eighth grade lead to larger test score gains from eighth to tenth grade and that differences in class size can explain approximately 15 percent of the black-white difference in educational achievement Boozer & Rouse (2001).
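The aggregation problem described above can be made concrete with a minimal sketch. Every number below is hypothetical and chosen only for illustration; none comes from Boozer & Rouse (2001). It shows how one group can sit in larger classes within every class type and still appear to have smaller classes once class types are pooled, simply because more of that group is placed in small compensatory classes.

```python
# HYPOTHETICAL numbers for one school; none of these come from Boozer & Rouse (2001).
# Average class size experienced by each group, by class type.
class_size = {
    "compensatory": {"black": 12, "white": 10},
    "regular": {"black": 30, "white": 28},
}

# Assumed share of each group's students placed in compensatory classes.
share_compensatory = {"black": 0.40, "white": 0.10}

def pooled_class_size(group: str) -> float:
    """Average class size for a group when class types are pooled together."""
    p = share_compensatory[group]
    return (p * class_size["compensatory"][group]
            + (1 - p) * class_size["regular"][group])

for group in ("black", "white"):
    print(group, round(pooled_class_size(group), 1))
```

In this made-up school the pooled averages point the opposite way from the within-type comparison, which is why conditioning on class type matters when reading school-level averages.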

In fact, even in the segregated south racial class size differences were probably too small to matter by the late 1940’s. Card and Krueger (1992).

First of all, term lengths were shorter and average annual salaries were lower in black schools, which indicates it does matter.

“As recently as 1940 pupil-teacher ratios were 25 percent higher in black schools, the average term length was 10 percent shorter in the black schools, and average annual salaries were 45 percent lower for black teachers.”

Not to mention that, per the abstract, it only explains a small portion of the closing black-white earnings gap.

“Improvements in the relative quality of black schools explain 20 percent of the narrowing of the black-white earnings gap between 1960 and 1980…Changes in relative pupil-teacher ratios, term lengths, and teachers' salaries can explain at least one half of the intercohort growth in black-white relative returns to education, and 15-25 percent of the overall convergence in black relative returns to education between 1960 and 1980.”

Second of all, some states such as South Carolina didn’t even improve:

“At the beginning of our sample period, the quality of black schools in South Carolina ranked near the bottom of the entire country. On the other hand, schools for whites were actually better in South Carolina than in many Southern states, including North Carolina.”

Third of all, return to education decreased as pupil-teacher ratios increased:

“The coefficient of - 7.45 (t = 4.2) on the pupil-teacher ratio in column (2) indicates that a higher pupil-teacher ratio is associated with a lower return to education.”

Overall this study is nowhere near what SL concludes.

Thus, racial differences in class size are not a plausible cause of recent racial differences in academic performance.

Teacher Quality

Turning now to teacher quality, there is a great deal of controversy about the degree to which commonly measured teacher characteristics matter. In regressions with lots of controls, teacher characteristics don’t seem to predict future income much at all. Betts (1995)

The thing about Betts (1995) is that their data has important limitations. Specific aspects of the datasets, including the young age of the individuals and the relatively small number of observations, make it very difficult to obtain precise estimates of any school quality effects.

The standard errors of the estimates from the NLSY are large, making it difficult to rule out small positive effects with a reasonable degree of confidence. They don’t adjust the standard errors of their estimates for the fact that there are as many as 10 wage observations per individual in the NLSY sample. When using their sample to calculate standard errors that account for the correlation across earnings residuals for the same individual over time, the adjustment raises the estimated standard errors by up to 100%.
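To give a feel for the size of that adjustment, here is a minimal sketch using the textbook design-effect approximation, in which standard errors are inflated by sqrt(1 + (m - 1) * rho) for m observations per individual and within-individual residual correlation rho. This assumes equal numbers of observations per person and a regressor that does not vary within a person; it is an illustrative approximation, not the adjustment actually used in the critique of Betts (1995).

```python
import math

def se_inflation(obs_per_person: int, rho: float) -> float:
    """Approximate factor by which naive standard errors are too small.

    Assumes equal cluster sizes and a regressor that is constant within
    each person (design-effect approximation).
    """
    return math.sqrt(1.0 + (obs_per_person - 1) * rho)

def rho_for_inflation(obs_per_person: int, factor: float) -> float:
    """Within-person correlation implied by a given inflation factor."""
    return (factor ** 2 - 1.0) / (obs_per_person - 1)

m = 10  # "as many as 10 wage observations per individual"

# A 100% increase in the standard error is an inflation factor of 2.
implied_rho = rho_for_inflation(m, 2.0)
print(f"implied within-person correlation: {implied_rho:.3f}")
print(f"check, inflation factor: {se_inflation(m, implied_rho):.3f}")
```

Under these assumptions, a doubling of the standard errors with 10 observations per person corresponds to a within-person residual correlation of about one third, which is not an implausibly high figure for repeated wage observations.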

In addition, the sample has an average age of just 23, which means that many of the individuals have not yet finished school or settled into their careers, so wage effects for those with higher levels of schooling may be difficult to find. The youthfulness of the sample in studies of school quality is a potential problem for at least two reasons.

First, many determinants of labor market performance are only revealed with experience. For example, it is widely acknowledged that the return to the quantity of schooling is understated among very young workers (Mincer, 1974). One might expect a similar understatement of the effect of school quality in the first few years of the work career. Indeed, this assumption approximately holds in Wachtel's (1976) comparison of returns to school quantity and school quality for the Thorndike-Hagen sample of veterans in 1955 and 1969. Between 1955 and 1969, as the average age of the sample rose from 32 to 46, the rate of return to education rose from 0.030 to 0.079 (163%), while the "return" to school quality (measured as the coefficient of school expenditures per pupil in a Class I regression model) rose from 0.291 to 0.684 (135%).
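The two percentage increases quoted above can be checked directly from the coefficients given in the text. A minimal sketch (the helper name `pct_increase` is just an illustrative choice):

```python
def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100.0

# Coefficients for 1955 and 1969 as quoted in the text.
return_to_education = pct_increase(0.030, 0.079)
return_to_quality = pct_increase(0.291, 0.684)

print(round(return_to_education))  # 163
print(round(return_to_quality))    # 135
```

Both returns more than doubled as the sample aged, which is the basis for expecting school quality effects to be understated in a sample with an average age of 23.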

Second, samples of young workers tend to under-represent individuals of a given age with higher education. If higher school quality leads individuals to acquire more education, such samples will contain too few earnings observations for individuals from higher-quality schools, leading to an understatement of any school quality effects. Betts (1996) attempted to respond to this criticism; however, Sahadewo (2019) attempted a replication and actually found that the measures of school quality affect workers’ earnings when they are older. Specifically, the percentage of teachers with a graduate degree has a significant effect on workers’ prime-age earnings.

However, these sorts of regressions are controversial because they usually control for educational attainment, and school quality may impact wages indirectly by impacting college attendance. Strayer et al. (2002) looked at the relationship between school quality, earnings, and college choice in the NLSY. Concerning the proportion of teachers at a high school with graduate degrees, they found that a 5.75 point increase in the proportion of highly educated teachers predicted a 1 point increase in college attendance and a .25 point increase in an individual’s wages.

However, the correlations between these variables and schooling decisions are no longer significant when including additional covariates such as father’s education and AFQT test score into the model specification. This is due to a strong correlation between these measures of school quality with AFQT score and father’s education.

Without going further into this area, let’s assume, for the sake of argument, that commonly measured teacher characteristics impact future income. Well, on average, blacker schools have more experienced teachers with more formal education and more pay. Corcoran et al. (2003).

What is he talking about here? They literally describe table 12 saying:

“in 1993-94, new teachers in schools where 90 percent or more students were minority were less likely to be certified in their primary teaching field than new teachers in schools that had 10 percent or fewer minority students.”

And across the school districts, the base salaries weren’t even adjusted for cost of living.

Even in the segregated south, black and white teachers’ pay equalized in the 1950’s. Card and Krueger (1992).

This is because of the NAACP’s salary equalization campaign in South Carolina which resulted in African-American teachers obtaining salaries more comparable to that of whites.

However, at the same time, the state’s response to the salary equalization effort represents one of the ways in which segregationists circumvented equality in educational systems, while seeming to accommodate the anti-discrimination norms established by the NAACP’s legal campaign. In place of the kinds of overt racial restrictions that LDF attorneys attacked, the state wove a web of more subtle – and thus more legally justifiable – discriminations designed to ensnare African-American educators.

The effects of these traps, originally set during the 1940s and 1950s, persist today and perpetuate structural inequalities within educational systems, see https://www.tandfonline.com/doi/pdf/10.1080/09612029900200193

This paper is suspect because it notes the data they have may be inaccurate:

“There is some reason for concern about the accuracy of the data in the early part of our sample. We have tried to eliminate obvious errors in individual reports and have cross-checked the data whenever possible. We have also compared reported teacher salaries with mean annual earnings by state and race for teachers in the 1940, 1950, and 1960 Censuses, and found very high correlations between the two series (e.g., 0.95 in 1940).”

What were the correlations for 1950 and 1960 and so on? Never mind the fact that it notes,

“It should be stressed that other dimensions of relative school quality may have lagged behind the measures that we concentrate on. Bond [1934, pp. 151-71] notes that expenditures on schoolhouses, equipment, and school buses for white students rose very quickly in the early 1930s, while similar expenditures for black students lagged.”

And never mind the fact they didn’t include Missouri.

Thus, teacher quality, at least as measured by commonly reported characteristics, is unlikely to explain racial differences in school performance.

Class Offerings

Another dimension of school quality consists of the courses offered by the school. Sometimes, it is pointed out that African Americans have, on average, fewer classes offered to them than do white Americans. For instance, 71% of white Americans, 70% of Asian Americans, and 67% of Hispanic Americans attend high-schools that offer algebra one and two, geometry, calculus, biology, chemistry, and physics. By contrast, this set of courses is only available for 57% of African Americans. CRDC (2014). It is possible that these differences impact income. Empirical estimates suggest that taking a set of advanced classes in high-school might increase an individual's income in adulthood by as much as 7% (Rose et al., 2004). However, I’ve been unable to find studies that account for the possibility that the traits which allow someone to do well in advanced classes, intelligence and self discipline, lead to higher income, and thus create a misleading correlation between taking advanced classes and future income. In any case, for this to be a meaningful difference in educational opportunity we must assume that African Americans who would otherwise succeed in advanced classes are not being offered such classes. There is a 14 point gap between blacks and whites in the probability that they will attend a high-school that offers the full possible set of STEM courses but this does not mean that there is a 14 point gap among blacks and whites who are able to succeed in such classes. Looking at data on AP sheds some light on this question. There doesn’t seem to be much of a relationship between how black a school is and the probability of it offering AP courses.

Well when it comes to disparities in AP classes, there are two drivers of these inequities:

1. Schools that serve mostly Black and Latino students do not have as many seats in advanced classes as schools that serve fewer Black and Latino students; and

2. Schools, especially racially diverse schools, deny Black and Latino students access to those seats that they do have. Additionally, fair access doesn’t mean sufficient access: Too many students attend schools that do not offer these opportunities at all.

Since the 90s, the schools least likely to offer AP courses have actually been the whitest schools there are.

Corcoran et al. (2003).

This table isn’t looking at the fraction of schools offering AP between predominantly Black vs. White schools. It’s looking at predominantly Black vs Predominantly non-Black schools which is not the same as predominantly white because it could just be predominantly minority. So this table is being interpreted incorrectly. A contradictory quote of the study:

“schools with majority black and majority disadvantaged student populations were almost always much less likely to offer these courses than largely non-black or non-poor schools. For example, in 1972 students in 90 percent or higher black schools were 30 percent less likely to have the opportunity to take AP courses than students in schools where less than 10 percent of students were black.”

So this table is being interpreted the wrong way.

Another quote from the study:

“Our calculations from these surveys of the overall fraction of schools offering AP courses are much larger (in all years) than those reported by the College Board, the organization which administers AP exams.”

This means the reason why it seems schools with 90 to 100% black offer more AP courses is because their calculations caused them to overestimate.

Second, it says:

“While these surveys specifically asked whether the school offered ‘College Board Advanced Placement Courses,’ the responses may reflect some confusion among survey respondents as to what an ‘advanced placement’ course meant. To the extent that survey respondents’ definitions of advanced placement courses were consistent across schools and across time, our calculations should be representative of differences in AP offerings across schools. However, these numbers should be interpreted with appropriate caution.”

It should also be noted that it says

“NLS72 and HSB asked whether or not the school offered college board AP courses; NELS asked what fraction of the student body receives AP courses, and the number of 12th graders in AP courses. For the NELS, we assumed the school offered AP if either of these numbers was nonzero. The NLS72 sample consists of public high schools participating in base year (1972) administrator survey. The HSB sample consists of public high schools participating in first follow up (1982). The NELS sample consists of public high schools participating in first follow up (1990).”

Even then, table 13 literally says that for the class of 1992, over 70% of AP courses were offered to those in schools that were 0 to 10 percent black.

I’ve been unable to locate more recent national data, but several state-level analyses support this conclusion. For instance, in Florida, Black and Hispanic students are roughly as likely as White students to attend a high-school offering AP courses for Math, Science, English, or Social Studies. Conger et al. (2009).

So table 3 shows that black students are roughly as likely to attend schools with AP courses (even though we see black students are 3 percentage points less likely to attend a school with AP science). This paper doesn’t seem conclusive because it is only looking at 11% of all AP courses. Also, it is looking at students who attend a school with at least 1 AP course. If white students attend schools with more AP courses than black students, it won’t be reflected in the data, as it isn’t stratified appropriately. Also, this study did not control for selection bias and confounding. If schools with certain characteristics were more likely to enroll minority students who were already more academically motivated, then any observed relationship between school characteristics and course-taking could be due to these pre-existing differences rather than any causal effect of school characteristics. Similarly, if certain schools were more likely to participate in the study than others, then any observed effects may not be representative of all schools.
The multivariate probit model used in the study estimates the probability of taking each of the four AP/IB courses simultaneously, and it assumes that there are no unobserved factors that affect course-taking decisions. If there are unobserved factors that affect course-taking decisions, then this could lead to biased estimates of the effects of observed factors on course-taking decisions. For example, if black students who attend schools with high proportions of low-income students face additional barriers to taking AP/IB courses that are not captured by observable characteristics, then this could lead to an overestimate of the effect of race on course-taking decisions.
The sample size and power for this study are not explicitly stated.
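
To make the stratification point concrete, here is a minimal sketch in Python. The schools, enrollments, and group labels are invented for illustration and are not taken from Conger et al. (2009). It shows how two groups can be equally likely to attend a school with at least 1 AP course while having very different numbers of AP courses available to them.

```python
# Hypothetical illustration only: these schools and enrollments are invented,
# not taken from Conger et al. (2009).
schools = [
    # (AP courses offered, students in group A, students in group B)
    (12, 800, 100),
    (2, 150, 700),
    (1, 50, 200),
]


def share_with_any_ap(group_index):
    """Share of a group's students attending a school with at least one AP course."""
    total = sum(school[group_index] for school in schools)
    with_ap = sum(school[group_index] for school in schools if school[0] >= 1)
    return with_ap / total


def mean_ap_courses(group_index):
    """Enrollment-weighted mean number of AP courses available to a group's students."""
    total = sum(school[group_index] for school in schools)
    return sum(school[0] * school[group_index] for school in schools) / total


for name, idx in (("group A", 1), ("group B", 2)):
    print(name, share_with_any_ap(idx), round(mean_ap_courses(idx), 2))
```

In this made-up example both groups have a 100% chance of attending a school with at least 1 AP course, yet the average number of courses on offer differs a lot. A binary "any AP course" measure cannot detect that kind of gap.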

The same seems to be true of Texas: Klopfenstein (2004).

Let’s take note that this says

“the TSMP indicates whether or not a particular AP course was taught but does not indicate how many sections of the course were offered. Therefore, it is not possible to discern how many students were in one section of an AP class or to differentiate between a school that offered one section and a school that offered multiple sections of the same class.”

This means that on average, high schools with Black students offer a similar number of AP courses as high schools with White students. However, this does not necessarily mean that Black and White students have equal access to specific AP courses, such as Math, Science, English, or Social Studies. For example, a high school with both Black and White students may offer 10 AP courses in total, but only offer Math and Science AP courses to White students. In this case, there would be an unequal distribution of AP course offerings between Black and White students within the same school. I also don’t take these descriptive statistics at face value, given that they’re only means and standard deviations. There should be some justification for summarizing the distribution this way, since it isn’t normal.

In California, there seem to be only minuscule black-white differences in the AP course offerings, though there are more substantial differences for Hispanics. Klugman (2013). On the whole then, there doesn’t seem to be good reason to think that Black people have less opportunities to take AP classes.

Literally the next source he cites says

“Black and Latino students make up 37% of students in high schools, 27% of students enrolled in at least one Advanced Placement (AP) course.”

Now, Black people account for roughly 9% of those who take AP classes, but only 4% of those who pass an AP exam. CRDC (2014). In fact, among blacks who take an AP test, only 26% receive a qualifying score. To pass an AP test, you must score a 3 or higher. For whites who take the test, the mean score is 2.97. This suggests that roughly the right number of whites are taking AP courses because restricting the set of whites who take these tests to a more elite subset would raise the mean score above 3, meaning that qualified people wouldn’t be taking the tests. By contrast, the Black mean score is 1.91. This suggests that the set of black students who are capable of passing these tests is probably smaller than the set of black students currently attempting them. JBHE (2008). So, returning to non-AP class offerings, the following seems likely to be true: if every person who is able to complete all the STEM classes attends a high-school in which they are offered, black people will be less likely than white people to attend a high-school offering this set of classes since a smaller proportion of black people are able to pass them. I can’t say exactly how large this gap will be, but I suspect it would be large. Consider the AP Calculus AB class, a rough equivalent of calculus 1 in college. On average, white students pass this class, while only 28% of black students do. Given this, it would seem entirely appropriate if white students were far more likely than black students to attend schools that offer calculus. This is at least a partial explanation for racial gaps in class offerings. I don’t know that it is the only explanation that matters, but it is hard to think of what else could be going on. It doesn’t seem plausible that black schools have tons of students who are able and willing to take calculus classes and the schools simply refuse to offer such classes anyhow because the students are black.
And it isn’t as if black schools lack teachers with the formal education needed to teach such classes. Again, black schools have more educated teachers than white schools. At the very least, I think we can say that there isn’t good evidence showing unfair racial differences in class offerings. By contrast, there is good evidence suggesting that racial differences in class offerings are at least partially fair. Given this, and the fact that class offerings have not been shown to causally impact income, and the fact that racial differences in class offers are only moderate in size, the impact that unfair racial differences in class offers have on racial differences in income is probably very small.

Student Quality

One plausible determinant of an individual’s educational opportunity is the quality of their peers. In this respect, I do think schools with a greater proportion of black students are at a disadvantage, because, on average, black students are lower quality students than white students. This is not to say that there are not bad white students, or that most black students are not good students. Instead, it is merely to say that there is an average difference such that the proportion of bad students is higher among African Americans even if that is a minority of all groups. The evidence for this is fairly overwhelming. Most obviously, this is suggested by grades and SAT scores. It is also evidenced by the amount of time students spend on homework. Black and Hispanic students spend less time on homework than whites, and whites spend less time than Asians. This is true despite the fact that black and Hispanic students are more likely than whites and Asians to have parents who check to see that their homework is completed. NCES (2011). Given racial differences in GPA scores, it shouldn’t be surprising that races differ in other measures of scholastic aptitude. But there is also a strong relationship between how non-white a school is and how much violence goes on in the school. Cocoran et al. (2003).

Why is the author calling this a strong relationship when there’s no R2 or standard error to warrant this? This is a spurious correlation. There’s no control for grade level or socioeconomic status. Additionally, they use the Principal/School Disciplinarian Survey on School Violence, and the table reports data from the 1996/97 school year. However, the data reported in this survey describe the number of incidents of crime, not the number of individuals involved in such incidents. An incident could involve more than one perpetrator or victim. Similarly, an individual perpetrator or victim could be involved in multiple incidents. They didn’t control for this. There is also the obvious possibility of teacher bias in reporting.
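
The incidents-versus-individuals problem can be sketched in a few lines of Python. The two incident logs below are hypothetical, not data from the survey. They show that the same incident count is compatible with very different numbers of students actually involved.

```python
# Hypothetical incident logs: each incident is the set of student IDs
# recorded as perpetrators. None of this is data from the survey.
school_x = [{"s1"}, {"s1"}, {"s1"}, {"s1"}]  # one student, four incidents
school_y = [{"s1", "s2", "s3"}, {"s4", "s5"}, {"s6"}, {"s7", "s8"}]  # eight students


def summarize(incidents):
    """Return (number of incidents, number of distinct students involved)."""
    distinct_students = set().union(*incidents)
    return len(incidents), len(distinct_students)


for name, log in (("school X", school_x), ("school Y", school_y)):
    n_incidents, n_students = summarize(log)
    print(f"{name}: {n_incidents} incidents, {n_students} distinct students")
```

Both made-up schools report four incidents, so an incident-level table would treat them as identical even though one involves a single student and the other involves eight.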

At the individual level, this is reflected in racial differences with respect to school punishments. As the LA Times reports “The Civil Rights Data Collection, a national survey conducted by the U.S. Department of Education, gathered information on more than 50 million students at more than 95,000 schools… The survey included 1,439,188 preschool students in 28,783 schools… black preschool children overall were 3.6 times as likely to be suspended as white preschoolers.” Other research shows that racial gaps in suspension rates persist as kids grow older and remain even after controlling for socio-economic status (Skiba, 2002). However, these racial differences do not persist when comparing people with the same previous histories of behavioral problems, or when comparing people who were both sent to the principals office for the same offense (Wright et al., 2014; Macdonald, 2014).^[12] These findings suggest that racial differences in suspension and expulsion rates are due to differences in behavior rather than bias in rule enforcement. Given this data, it should be unsurprising that there are also racial differences in bullying behavior. Farris (2006) defines outdegree bullying as the number of other students each student bullied in the past three months and indegree bullying as the number of other students who picked on each student.

This paper is a dissertation, so it isn’t even a peer-reviewed source.

Farris (2006) is largely speculative and has major limitations not mentioned.

“There are limitations to this study, chief among them its geographic limitations. The sample comes from three largely rural counties in North Carolina, and so it may not be possible to generalize readily to other areas of the country. Additionally, factors such as neighborhood disorder may operate differently in urban environments. Also, while the study spans middle school and high school, it may not generalize to earlier ages…There are several limitations to this analysis. First, it is set in a rural setting. Findings may not generalize to other areas, particularly those with very different racial compositions. Second, this analysis only includes 14 schools, so school-level findings in particular are tentative. It is possible that the racial diversity of the school will have different consequences in other studies. Finally, this analysis is not longitudinal, and causal inferences should be made cautiously.”

Its data collection is also a voluntary response sample because it was based on counties “willing to participate”. This is not a valid method of sampling: because participation is voluntary, participants will likely have stronger opinions than the rest of the population, which makes the sample unrepresentative. So it suffers from undercoverage bias, self-selection bias, and nonresponse bias.

There was also a nonresponse rate of up to 20% across the waves of data collection. The sample is also readily available and non-random, meaning it is something of a convenience sample.

This study relies on non-probability sampling, which does not meet the criteria for valid sampling. In fact, this study presents no clear design, methodology, or descriptive statistics, so it’s hard to trust.

Farris finds that Black students are more likely to be bullies than are White students while White students are more likely than black students to be the victims of bullying.

Farris also finds that white students are more likely to engage in indirect bullying and even bully more frequently on a weekly basis.

Latino students are more likely than black students to be engaged in, and be a victim of, bullying. Farris also finds that racial differences in family SES, neighborhood SES, attachment to friends, parents, and school, and physical development, don’t explain racial differences in bullying.

First of all, they have a poor measure of socioeconomic status as it didn’t include things like occupation.

Second of all, it also says

“We are unable to explain higher levels of bullying perpetration by African-Americans and Latinos, despite inclusion of variables covering a wide range of theoretical domains, from SES to the influence of aggressive peers, and a variety of contexts, from the psychology of the individual to the characteristics of the neighborhood. There are undoubtedly a number of explanations that could not be tested, and we cannot conclude definitively that higher rates of bullying among minorities is not mediated by other factors.”

With respect to interracial bullying, Farris finds that Black on White bullying is 64% more common than is White on Black bullying.

However most bullying is interracial anyway.

In part, these differences probably arise because bullying is socially rewarded in non-white student subcultures.

It actually says

“the race effect also could not be explained away by low school attachment or conventional beliefs, mechanisms that are common to some cultural arguments. Because of this, these findings cannot be interpreted as confirming these cultural theories, as there are too many potentially explanatory variables that could not be included.”

After controlling for gender, age, academic performance, family structure, parental educational attainment, and extracurricular activities, Farris finds that the more non-white students bully others, the more popular they are among their peers. This effect does not exist among white students. So, with respect to in-class behavior, bullying behavior, scholastic achievement, and study habits, black and Hispanic students seem to under-perform white and Asian students. Given this, it should be unsurprising that attending a school with more minority students predicts various negative student outcomes. Perhaps most dramatically, Farris finds the following: “Regardless of race, attending a high-minority school increases risk of suicide significantly: for every one percentage point increase in the percent minority in the school, the likelihood of suicide increases by one percent.”

This has several problems:
First, the causal inference of the study is questionable, as the association between attending a high-minority school and the risk of suicide does not necessarily demonstrate a causal relationship. This study has not met various causal assumptions, including:
1. Mediator Monotonicity: It does not clearly establish that attending a high-minority school is the mediator in the causal pathway, and it does not provide evidence that the strength of the mediator's effect on the outcome increases as the strength of the exposure increases.
2. Treatment Ignorability: It does not clearly establish that attending a high-minority school is a treatment that is being randomly assigned to individuals, and it does not control for all other factors that may affect the risk of suicide.
3. Mediator Ignorability: It does not meet the causal assumption of mediator ignorability because it does not clearly establish that attending a high-minority school is a mediator in the causal pathway, and it does not control for all other factors that may affect the risk of suicide.
Other factors, such as socioeconomic status, access to mental health resources, or other variables not controlled for in the study, may be responsible for the observed association.
There’s no way to know the explanatory power of this relationship. Where is the pseudo R2 or the likelihood statistic needed to judge whether this is even a relationship? All we have to go on is the standard error. “Percent minority” wasn’t even statistically significant in the first 2 models, and it was probably only marginally significant in the third. It also wouldn’t be a surprise if it was just a type I error from p-hacking, given that it was a one-tailed test and there were many statistically insignificant variables included in the model. In fact, it would be better if we saw z values and the corresponding p values. Otherwise their model is extremely uninformative. How could you even make that prediction with an unclear measurement of the x variable, “percent minority”? This can’t be used to relate changes in “risk of suicide” to a unit increment in the poorly explained measurement of “percent minority”. What we do see is that attending a school with a higher share of black students has no statistically significant association with risk of suicide. The same is true for Latinos and for “other minorities”. Not to mention that the interaction effects of race, diversity, and bullying aren’t statistically significant. They also didn’t include socioeconomic factors in their regression model. There would also have to be replication to support such a bold claim.
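
As a rough illustration of why a one-tailed test combined with many tested variables raises the risk of a type I error, here is a short Python sketch. The significance level, the number of tests, and the z value are assumptions chosen for illustration, not figures from Farris (2006), and the family-wise formula assumes the tests are independent.

```python
import math


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


def p_values_from_z(z):
    """Return (one-tailed p, two-tailed p) for a standard normal test statistic z."""
    one_tailed = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return one_tailed, 2 * one_tailed


# Assumed values for illustration only: 20 tested coefficients at alpha = 0.05.
print(round(familywise_error_rate(0.05, 20), 3))

# A z value that just clears the one-tailed 0.05 cutoff fails the two-tailed one.
one_tailed_p, two_tailed_p = p_values_from_z(1.645)
print(round(one_tailed_p, 3), round(two_tailed_p, 3))
```

Under these assumptions, the chance of at least one spurious "significant" coefficient is well over half, and a result that only passes a one-tailed test sits at roughly p = .10 on the conventional two-tailed test.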

Similarly, Hanish and Guerra (2000) analyzed data on bullying among 1956 children employing a longitudinal design over a two year period. They found that “White children attending predominantly non-White schools were at greater risk of being victimized than those attending predominantly White schools (b = .44). In contrast, African-American children were slightly more likely to be victimized in predominantly African-American schools than in predominantly non African-American schools (b=.14, p .05). For Hispanic children, risk of being victimized was fairly constant across the range of school ethnic compositions (b= .11, p =.12).”

This is because of a lack of controls. In fact, the “African American” and “Hispanic” variables explained only ~2% of victimization in Step 1. The study even says this:

“This basic model, however, explained only minimal variance in victimization (R2 = .015).”

Only ~2% of victimization is explained by the previous variables and Ethnic Composition in Step 2 (R2 = .016). In Step 3, which the author referenced, these variables and others only explained ~4% of peer victimization (R2 = .043). In step 4, when they include more controls, we see that the beta coefficient is statistically insignificant and close to 0 for the “African American” variable and statistically insignificant for “Ethnic composition” variable.

This means that the results SL presented aren’t even statistically significant and have a smaller effect size when more variables are included.
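
The R2 figures quoted above can be turned into "variance added per step" with simple subtraction. The sketch below uses only the three R2 values cited from Table III. The function name is my own, and the code is just arithmetic, not a re-analysis of the study's data.

```python
def incremental_r2(r2_by_step):
    """Return (step, cumulative R2, R2 added by that step) for hierarchical regression steps."""
    rows = []
    previous = 0.0
    for step, r2 in r2_by_step:
        rows.append((step, r2, r2 - previous))
        previous = r2
    return rows


# R2 values as quoted above from Hanish and Guerra (2000), Table III.
steps = [("Step 1", 0.015), ("Step 2", 0.016), ("Step 3", 0.043)]

for step, r2, added in incremental_r2(steps):
    print(f"{step}: R2 = {r2:.3f}, added by this step = {added:.3f}, unexplained = {1 - r2:.3f}")
```

Adding ethnic composition in Step 2 raises R2 by only about .001, and even the Step 3 model leaves roughly 96% of the variance in victimization unexplained.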

Controlling poverty did not alter this finding.

Controlling for poverty and other variables did alter the finding, as indicated in Step 4 of Table III.
It is also recommended that the unstandardized coefficients be interpreted. We see that none of them are statistically significant.

We see a similar relationship with scholastic performance. For instance, Bohrnstedt (2015) finds that the both Black and White students score worse on standardized tests the greater the proportion of their school that is Black. This effect was more pronounced for Black males, with their scores dropping the most as the black-ness of schools increased. Even after controlling for socioeconomic status, going to a blacker school continued to negatively predict performance among black students although it did not predict worse performance among white students.

This study found components attributable to socioeconomic status and components not attributable to socioeconomic status, including differences between schools, differences within schools, and differences that could not be determined. The study made no attempt to establish a causal relationship between any of these factors. The study states socioeconomic factors explain a significant portion of the gap. I did not say that it explains all of the gap. From the executive summary:
“In addition, the size of the achievement gaps within each category of Black student density was smaller when the analysis accounted for student SES and other student, teacher, and school characteristics (except in the highest density category), suggesting that these factors explained a considerable portion of the observed achievement gap.”
It is therefore misleading to suggest that the achievement gap is not caused by poverty. The segment on page 7 seems to try to figure out why that’s the case (i.e., all environmental explanations). These sorts of numbers make me think there might be a skewed comparison going on: we’re comparing a small number of white students with a large number of black students. There are only a few groups, and a lot of this is specific to certain regions too. After skimming for what they controlled for, a lot of these explanations could still hold even after their analysis. Even 9% of white students would add up to a lot of people, considering that we’re working with national-scale data. The analysis also does not take into account that most of the high-black-density schools were located in the South and the urban Midwest, where students already score low. The North has been more educated since the founding of Harvard and Yale; it has about a 200-year head start. https://files.eric.ed.gov/fulltext/ED596492.pdf
They also don’t report a sample size.
This study is not representative, as it finds a gap in NAEP mathematics achievement at Grade 8 for Black males but no significant gap for Black females. This is likely due to stereotype threat and the differential selection among Black males in school. School characteristics are somewhat distinct from classroom characteristics, with the latter representing the immediate learning environment. Stratification in this work was also different, done according to school-level density of Black student enrollment (percentage), whereas we stratified models by teachers’ race and controlled for schools’ percentage minority enrollment as a variable in the model. Careful consideration of how classroom-level matching and school-level racial segregation are distinctly captured by model covariates, stratification factors, and selection effects is warranted. The potential for bias in the samples of students and teachers in various contexts to influence results must also be examined (e.g., endogeneity and selection effects). In fact, there’s no causal framework tested in this study (e.g., two-stage least squares regression or difference-in-differences).

Goldhaber et al. (1999) analyzed data on roughly 18,000 students, and found a more complex relationship between racial demographics and student performance. There was a negative relationship between a student’s math scores and the proportion of their school that was white. That is, students going to whiter schools did worse. However, there was an even stronger negative relationship between a student’s math scores and the proportion of their math class that was non-white. So, the ideal scenario according to this model would be to go to a school that wasn’t all that white, but then go to a class that was completely white. These effects remained after controlling for the student’s family income and parental education, their race and sex, the region of the country they lived in, the school size, their class size, and their teacher’s degree of experience and formal education. Given what was covered previously, this school-level effect may be due to whiter schools being slightly less well funded than blacker schools while the class-level effect may be due to the culture and peer level variables that I’ve just covered. School diversity even predicts negative outcomes with respect to student-rated school satisfaction. Consider the evidence from Rothman, Lipset, and Nevitte (2003). This paper analyzed the relationship between racial diversity and the experiences people had at school in a sample drawn from 140 American universities (N = 4,083 individuals, 1,643 students, 1632 faculty members, 808 administrators). They found the following: “As the proportion of black students rose, student satisfaction with their university experience dropped, as did their assessments of the quality of their education and the work ethic of their peers.
In addition, the higher the enrollment diversity, the more likely students were to say that they personally experienced discrimination… Faculty members also rated students as less hard-working as diversity increased…Enrollment diversity was positively related to students’ experience of unfair treatment, even after the effects of all other variables were controlled. (As the proportion of black students grew, the incidence of these personal grievances increased among whites. Among blacks, however, there was no significant correlation. Thus diversity appears to increase complaints of unfair treatment among white students without reducing them among black students.)”. These perceptions of discrimination were not shared by the non-student sample. The authors write: “Among faculty and administrators, higher minority enrollment was significantly associated with perceptions of less campus discrimination and, among administrators, more positive treatment of minority students. But these findings were offset by the absence of similar results among students, who reported more personal victimization as diversity increased.” It’s also worth noting that increases in the proportion of a school that was Asian American increased student satisfaction with the school and all these results continued to be true after controlling for various measures of socio-economic status.

This paper has been debunked; see: https://zero.sci-hub.ru/3767/ac6ab8ca6d713b13a9fc4b1d880019b2/barton2003.pdf

Thus, student quality seems to be a measure on which black schools are genuinely worse than white schools. While not the fault of white people, this situation is, in a sense, genuinely unfair for black students of high student quality.

Bias at Universities

We’ve seen that there is a slight pro-black bias in terms of overall spending, class size, and teacher quality. There may be a pro-white bias in terms of class offerings, but there is no evidence showing that this is so, and even if there is it cannot account for more than a tiny proportion of racial income differences. On the net, it seems that there is a pro-black bias in American high-schools. The same is true in our university system, and this is easier to show. Once qualifications are controlled for, black applicants are roughly 20 times more likely than white applicants to be admitted to a university, law school, or medical school. Hispanic Americans are three times as likely to gain admittance. Source (On Racial Discrimination in Hiring). With respect to Asian Americans, there is a 6% bias against them using the median result of the above analysis and a 59% bias in favor of them using the average result. Recent news about Harvard university has spread the idea that elite institutions discriminate heavily against Asian Americans, but this does not seem to generally be true. In selective colleges, it’s been estimated that the proportion of students who are white would increase from 66% to 75% if admissions were based solely on test scores. Carnevale et al. (2019).

Interestingly enough the study comes to a different conclusion:

“To be clear, we aren’t advocating admissions based on SAT or ACT score alone. If anything, these standardized test scores have been overused as colleges and universities attempt to gain prestige. The higher a college’s average SAT score among incoming students, the higher achieving its students are perceived to be. But the truth is that standardized tests are not a good enough predictor of college success to justify their use as a key determinant of admissions. Students with relatively high SAT scores do not perform at a much higher level than students with slightly lower SAT scores. For example, at selective colleges, we find that students with SAT scores between 1000 and 1099 have a 79 percent chance of graduating. That is similar to a student with an SAT score above 1200, who has an 85 percent chance of graduating when enrolled at a selective college. Requiring high SAT scores means rejecting a large number of students who have perfectly good chances of succeeding at a prestigious college.”

It’s true, but that’s only because black people are much more likely to come from lower socioeconomic backgrounds than white people.

It is noteworthy that the proportion of Asian students would actually decrease in such a scenario, suggesting once again that the Harvard case is not representative of American elite institutions. Once in college, minority students are more likely to receive aid in paying for their education. These differences are real, but not large. Minority students account for 38% of the student population and 40.4% of grant funding. White students account for 61.8% of all students and 59.3% of grant funding. Kantrowitz (2011).

This is not a bias. The Pell Grant system is based on the income and assets of the applicant, and minority students tend to have lower incomes than Caucasian students. For example, looking just at students with family incomes under $50,000, 48% of Caucasian students fall into that group, whereas 77% of African-American students do, and 71% of minority students overall.
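
A quick back-of-the-envelope check in Python puts the sizes side by side. It uses only the percentages quoted above (the Kantrowitz enrollment and grant shares, and the shares of students with family incomes under $50,000). The function name is my own, and this is simple ratio arithmetic, not a model of how Pell eligibility is actually determined.

```python
def ratio(numerator_pct, denominator_pct):
    """Ratio of two percentages, e.g. share of grant funding over share of enrollment."""
    return numerator_pct / denominator_pct


# Percentages as quoted above.
minority_grant_vs_enrollment = ratio(40.4, 38.0)
white_grant_vs_enrollment = ratio(59.3, 61.8)
low_income_black_vs_white = ratio(77.0, 48.0)

print(f"Minority grant share / enrollment share: {minority_grant_vs_enrollment:.2f}")
print(f"White grant share / enrollment share:    {white_grant_vs_enrollment:.2f}")
print(f"Share under $50,000, black / white:      {low_income_black_vs_white:.2f}")
```

The grant shares deviate from the enrollment shares by only a few percent in either direction, while the gap in the share of students from families under $50,000 is far larger. That is the pattern you would expect if income-based aid, rather than race, is what drives the funding difference.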

Again, it’s worth noting that there is a slight pro-Asian bias in the above data. There is also a slight trend such that white students are more likely than black students to be employed while in college. Asian students are less likely than white and black students to be employed. NCES (2019). Black, Hispanic, and white students have similar chances of their parents paying for a significant proportion of their college education. Asians are more likely than others to have parental aid. Rathmanner (2017). Thus, there is a strong anti-white bias in college admissions and a slight anti-white bias in college funding. The media has recently suggested that there are systematic anti-Asian biases in our university system, but that seems probably wrong. Using data from the NLSY97, Sweeten, Bushway, and Paternoster (2009) found that dropping out of school is not related to either an increase or decrease in delinquent behavior.

Had SL read the paper, they would have seen that it actually says their

“conjecture receives only partial empirical support.”

They only say

“dropout for economic reasons has a noncriminogenic within-individual effect on delinquency (b = –.329, p < .10).”

“we find that the crime-reducing effect of dropout for economic reasons holds only for males (b = –.472, p < .01).”

But then they say “this effect is short term”, and later

“only 119 males and 53 females reported any variation in economic dropout across the first six waves of the survey. Thus, the test of differences in these effects has little statistical power to uncover smaller differences.”

Also,

“Turning to dropout for unclassifiable reasons, for males only, the within-individual dropout estimate is marginally statistically different from zero (b = .147, p < .10), which suggests a modest increase in crime (about 16 percent).”

They also state

“Contrary to our theoretical expectations, little evidence indicates that dropout for personal reasons results in a decrease in crime variety…It is important to point out that although the magnitude of the noncriminogenic economic dropout effect is three times larger than the criminogenic unclassifiable dropout effect, the latter applies to over six times as many individuals. In fact, 18.5 percent of males report dropout for unclassifiable reasons at some point, whereas only 2.9 percent report dropout for economic reasons.”
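
The quoted coefficients can be converted into approximate percent changes to see how these effects compare. The sketch below assumes the coefficients come from a log-link count model, so that exp(b) − 1 gives the proportional change. That assumption is mine, though it does reproduce the “about 16 percent” figure the authors report for b = .147.

```python
import math


def percent_change_from_log_coef(b):
    """Approximate percent change implied by coefficient b, assuming a log-link model."""
    return (math.exp(b) - 1) * 100


# Coefficients as quoted above from Sweeten et al. (2009).
coefficients = {
    "economic dropout, all": -0.329,
    "economic dropout, males": -0.472,
    "unclassifiable dropout, males": 0.147,
}

for label, b in coefficients.items():
    print(f"{label}: b = {b:+.3f}, about {percent_change_from_log_coef(b):+.1f}%")

# How the two male effects compare in size and in how many people they apply to.
magnitude_ratio = abs(-0.472) / 0.147  # the authors' "three times larger"
prevalence_ratio = 18.5 / 2.9          # the authors' "over six times as many individuals"
print(round(magnitude_ratio, 2), round(prevalence_ratio, 2))
```

On these numbers the crime-reducing effect is about three times larger per person, but the crime-increasing effect applies to more than six times as many males, which is the point the authors themselves make in the passage quoted above.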

Later they do random effects which is better than fixed effects. They say

“Males who drop out for unclassifiable reasons, however, commit more crime in subsequent periods (b = .278, p < .05).”

But then they say

“those who drop out for school reasons (b = .441, p < .10) and unclassifiable reasons (b = .614, p < .01) are more crime prone throughout the survey”.

All of this is congruent with their conclusion. They reference another study contradicting their findings by saying

“Jarjoura (1993) found that dropout for reasons of marriage or pregnancy resulted in increased levels of violence.”

They try to refute this only with speculation and not evidence. But anyway, the only variable that is somewhat consistent with what the author is saying is males who drop out for economic reasons. But as stated before and in their conclusion,

“This kind of dropout is particularly rare. Only a small proportion of male dropouts enjoy these immediate crime-reducing dropout effects. Just 121 of the 1,278 boys (9.5 percent) who ever dropped out during the first six waves of the NLSY97 reported economic reasons for leaving school early.”

Additionally,

“The short-term effect that we saw for economic reasons may stem from the fact that although males dropped out with the expectation of getting a job (and securing greater independence), they found that as a high-school dropout, it was difficult for them to secure meaningful employment.”

Moreover, Sweeten et al. (2009) use an individual fixed-effects strategy, essentially looking at how criminal behavior differed for an individual in the time just after dropping out relative to when in school, not long term. They even point this out:

“The counterfactual in a fixed-effects estimate is offending during periods of nondropout among the population who experiences dropout, controlling for other observable time-varying covariates.”

It doesn’t make sense to use this to predict delinquency when the individuals will no longer be teens by the time they commit the crime.

Overall, it doesn’t seem plausible for the black crime rate to be a result of schooling, either by school funding standards or by dropping out of school.

Conclusion

At the beginning, I noted that black Americans have lower educational attainment than white Americans. This is true, but if you only look at Americans with IQ scores equal to or greater than the average IQ score of college graduates, we see that Black Americans have higher educational attainment than white Americans do. Herrnstein and Murray (1994). This is exactly what we would expect if the education system actually exhibited a pro black bias. So, if anything, racial gaps in earnings would be larger if educational opportunity were equalized across races.

Based on what Vaush has said in the past (e.g. this Youtube video), and even in the document I am responding to, it seems he takes race differences in crime to be due entirely to the environment. Race differences, possibly, do not play a role — and if they do, then they’re environmental and not genetic. Since Vaush’s explanations for black crime do not explain the black-white crime gap, I will argue that race differences in social and psychological traits explain the disparity instead. These traits, once taken together, can possibly explain why blacks commit more crime, and why they lag behind whites in money. What follows next is a discussion on race differences in traits, and how they can explain the black-white crime/wealth disparity.

Alternative Explanations for Black Crime

What follows next are alternative reasons for black crime. The standard Vaush/ left-wing (even sometimes a right-wing case) argues that black crime is a result of environmental variables. As has been shown above, things like income and education do not seem to explain the high rates of black crime. Other possible explanations could be inequality and lead poisoning. However, once publication bias is controlled for, neither inequality nor lead poisoning has an effect on crime rates (Corvalan and Pazzona 2019;

This paper hasn’t been peer reviewed, which is already obvious from its various grammatical errors.

The problem with this paper is that it’s using partial correlation analysis:

“We collect all the estimates on the inequality-crime relationship and compute partial correlation coefficients.”

However, the calculation of the partial correlation coefficient is based on the simple correlation coefficient, which assumes a linear relationship. Generally, this assumption is not valid, especially in the social sciences, as linear relationships rarely exist in such phenomena.
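To make the linearity point concrete, here is a minimal sketch using toy data (not data from the paper). The first-order partial correlation is built directly from the three simple Pearson correlations, so a relationship that is perfectly deterministic but non-linear (here y = x squared over a symmetric range) can still produce a coefficient of zero. The function names and the toy values are assumptions made only for illustration.

```python
import math

def pearson(xs, ys):
    """Simple (Pearson) correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: y is completely determined by x, but the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]
# A hypothetical control variable z.
zs = [1, -1, 0, 0, 0, -1, 1]

r_xy = pearson(xs, ys)
r_xz = pearson(xs, zs)
r_yz = pearson(ys, zs)
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)

print("simple r(x, y):      ", r_xy)
print("partial r(x, y | z): ", r_xy_given_z)
```

Both coefficients come out at zero here even though y depends entirely on x, which is the sense in which a partial correlation coefficient can miss a non-linear relationship.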

Also, they don’t clarify the order of the partial correlation coefficients they compute, which is critical because as the order of the partial correlation coefficient goes up, its reliability goes down.

Also their analysis relies on antithetical or unrealistic assumptions:

“We then make protection endogenous and consider the case where only the poor offend and the rich protect themselves.”

This, in and of itself, acknowledges that poverty/inequality is an incentive that causes crime. In fact, they say

“we conclude that the poorer are the individuals in the society the higher are the incentives for crime.”

They provide no empirical evidence that “the rich” protect themselves as a response to inequality. In fact, their “private security” theoretical framework has been criticised for its tendency to over-emphasise ‘supply’ factors - either in terms of the limitations of the “sovereign state”, and its agents such as the police, to supply security to its citizens (see Garland, 1996), or as a consequence of a growth in ‘mass private property’ and competition with the State from corporate actors in the supply of private security goods and services (Shearing and Stenning, 1983). Also see: https://www.researchgate.net/profile/Tim-Hope/publication/270581784_INEQUALITY_AND_THE_CLUBBING_OF_PRIVATE_SECURITY/links/54aeedf60cf21670b3589c2e/INEQUALITY-AND-THE-CLUBBING-OF-PRIVATE-SECURITY.pdf?origin=publication_detail

They also clarify that their logarithmic utility function of income, implying a constant relative risk aversion equal to one, is not a “completely innocuous assumption.”

Additionally, they assume that the decision to engage in crime is binary, that is, to be a criminal or not, when other frameworks assume that individuals may divide their time between legal and illegal activities. See Ehrlich (1973).

Also, they use Dalton’s (1920) Principle of Transfer as their measure of inequality; however, the transfer principle by itself is evidently not decisive in terms of inequality comparisons, as illustrated here:

In passing from Monday’s distribution to Tuesday’s we find that there is an equalizing change at the bottom of the distribution (the P-Q gap has shrunk), but that there has also been a disequalising change at the top of the distribution (the Q-R gap has increased). A “top-sensitive” observer of this situation (someone who attaches particular importance to what happens in the part of the distribution concerning higher incomes) will conclude that inequality has increased from Monday to Tuesday: a “bottom- sensitive” observer would come to the opposite conclusion.

By appealing to the transfer principle alone we cannot resolve all possible inequality comparisons and build them up into a complete ordering of distributions by inequality. For more see: https://www.researchgate.net/profile/Frank-Cowell/publication/30521899_Thinking_About_Inequality/links/09e41505b9fe6f038e000000/Thinking-About-Inequality.pdf?origin=publication_detail
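The Monday/Tuesday comparison can be sketched with hypothetical incomes for P, Q and R (the original figure's numbers are not reproduced here, so every value below is an assumption). The income share of the poorest person stands in for a “bottom-sensitive” observer and the income share of the richest person for a “top-sensitive” one; both are illustrative stand-ins, not measures used in the paper.

```python
# Hypothetical incomes for three people P < Q < R; total income is the same on both days.
monday = {"P": 2, "Q": 6, "R": 10}
tuesday = {"P": 3, "Q": 4, "R": 11}   # Q passes one unit down to P and one unit up to R

def gaps(incomes):
    """Return the (P-Q gap, Q-R gap) for a three-person distribution."""
    return incomes["Q"] - incomes["P"], incomes["R"] - incomes["Q"]

def bottom_share(incomes):
    """Share of total income held by the poorest person (bottom-sensitive view)."""
    return min(incomes.values()) / sum(incomes.values())

def top_share(incomes):
    """Share of total income held by the richest person (top-sensitive view)."""
    return max(incomes.values()) / sum(incomes.values())

for day, incomes in (("Monday", monday), ("Tuesday", tuesday)):
    print(day, "gaps:", gaps(incomes),
          "bottom share:", round(bottom_share(incomes), 3),
          "top share:", round(top_share(incomes), 3))
```

In this toy case the bottom share rises from Monday to Tuesday (less inequality for the bottom-sensitive observer) while the top share also rises (more inequality for the top-sensitive observer), so the two observers rank the same pair of distributions in opposite ways.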

Their analysis is completely dichotomous in victims and criminals, i.e., rich and poor respectively. They don’t take into account that

“evidence shows that crime victims are more in the middle class, rather than among the richest (Gaviria and Pagés, 2002). Using an economic model of crime with continuous location, CC shows that victims of burglary are more often from the middle class.”

When they test for publication bias, they use the funnel plot. However, when high precision studies differ from low precision studies with respect to effect size (e.g., due to different populations examined), a funnel plot gives a wrong impression of publication bias (Lau, Ioannidis, Terrin, Schmid & Olkin, 2006). The appearance of the funnel plot can also change quite dramatically depending on the scale on the y-axis, i.e., whether it is the inverse square error or the trial size (Tang & Liu, 2000), and researchers have a poor ability to visually discern publication bias from funnel plots (Terrin, Schmid & Lau, 2005).

Their funnel graph differs by study characteristics: the high precision studies tend to be positive and are also studies on the US, while the ones at the bottom are from other countries. This gives a false impression of publication bias.
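A minimal sketch of how this can happen, using made-up studies in which nothing is suppressed: every hypothetical study reports exactly its own population's effect, but the precise studies come from one population and the imprecise ones from another. The regression below is a simplified, unweighted version of the funnel-asymmetry idea (reported effect regressed on standard error), not the authors' actual specification, and all numbers and names are assumptions.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical studies as (standard error, reported effect). No study is missing.
us_studies = [(0.02, 0.30), (0.03, 0.30), (0.04, 0.30)]       # precise, larger effect
other_studies = [(0.10, 0.05), (0.15, 0.05), (0.20, 0.05)]    # imprecise, smaller effect

def asymmetry_slope(studies):
    """Slope of reported effect on standard error; a non-zero slope reads as asymmetry."""
    std_errors = [se for se, _ in studies]
    effects = [effect for _, effect in studies]
    slope, _ = ols_slope_intercept(std_errors, effects)
    return slope

pooled_slope = asymmetry_slope(us_studies + other_studies)
us_slope = asymmetry_slope(us_studies)
other_slope = asymmetry_slope(other_studies)

print("pooled slope:    ", round(pooled_slope, 3))
print("US-only slope:   ", round(us_slope, 3))
print("other-only slope:", round(other_slope, 3))
```

The pooled slope is clearly non-zero while the slope within each group is essentially zero, so the apparent asymmetry in this toy example comes entirely from mixing populations with different effects and different precision, not from unpublished results.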

Additionally they are giving a misleading illustration of their funnel graph because they exclude high precision studies with positive results.

“We exclude from the graph, but not from the analysis, the estimates belonging to Andrienko (2002), which are the most precise ones. The average measure of precision for Andrienko (2002) is 317, well above the highest point in the graph , which is 80. The other outliers are the points which have precision level around 40 and shape a curved line to the right. These represent the estimates by Costantini et al. (2016). As a robustness exercise, we will run the formal tests for publication bias excluding the two studies.”

The same applies to Figure 3, which is presented for how neat it looks rather than for how well it actually indicates publication bias:

“We also report the funnel graph considering only the median estimate for each study, in Figure 3, which also displays the name of each study. The presence of positive publication bias looks neater in this graph compared to the previous one.”

They also don’t specify the p-value for the publication bias test, and the degree of publication bias was small (1.18).

They even admit that when using other models, inequality does have a statistically significant relationship to property crime:

“However, using mixed models and fixed effects, we also a significant true effects, (λ0).”

They even reference Kelly (2000), which found that poverty (POV) and economic growth (EG) significantly affect property crime.

Higney, Hanely, Moro 2021).

The first thing I'd note is that I'd probably discard anything prior to 2005 or so, even though that would eliminate the original 2000 paper by Rick Nevin that started the whole thing. Aside from some very rough correlations, there just wasn't enough evidence before then to draw any serious conclusions.

I also note that they left out a good paper by Brian Boutwell, but that may just be one of the seven papers they discarded because the results couldn't be normalized in a way that allowed them to be used. This is a common issue with meta-analyses, and there's not much to be done about it. Still, it does mean that high-quality studies often get discarded for arbitrary reasons.

But none of this is really the core of the meta-analysis. In fact, you might notice that among post-2005 studies, there's only one that shows a negative effect (a smallish study that found a large effect on juvenile behavior but no effect on arrest rates) and one other that showed zero effect (which is odd, since I recall that it did show a positive effect, something confirmed by the abstract). This means that 22 out of 24 studies found positive associations. Not bad! So what's the problem?

The problem is precisely that so many of the studies showed positive results. The authors present a model that says there should be more papers showing negative effects just by chance. I wonder what evidence they give for the validity of this “model”? Some sort of publication bias discount factor, or threshold, such that any number of positive results below that threshold means no positive result overall. I wonder what their methodology for deriving THAT number was. This is the problem with many economic or statistical models. They start from an unproven and unprovable assumption and reason from there. Result is GIGO. They conclude that the reason they can't find any is due to the well-known (and very real) problem of publication bias: namely that papers with null results are boring and often never get published (or maybe never even get written up and submitted in the first place). After the authors use their model to estimate how many unpublished papers probably exist, they conclude that the average effect of lead on crime is likely zero.^[13]

With the caveat, yet again, that this is beyond my expertise, I don't get this. If, say, the actual effect of lead on crime is 0.33 on their scale (a "large" effect size) then you'd expect to find papers clustered around that value. Since zero is a long way away, you wouldn't expect very many studies that showed zero association or less.^[14]
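This intuition can be checked with a short calculation. As a sketch, assume each study's estimate is normally distributed around a true effect of 0.33 with its own standard error; the standard errors below are hypothetical and are not taken from the meta-analysis.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_estimate_at_or_below_zero(true_effect, std_error):
    """Chance that a single unbiased study reports an effect of zero or less."""
    return normal_cdf((0.0 - true_effect) / std_error)

TRUE_EFFECT = 0.33  # the "large" effect size used as the example in the text

# Hypothetical standard errors, from a precise study to a very noisy one.
for std_error in (0.05, 0.10, 0.20, 0.33):
    p = prob_estimate_at_or_below_zero(TRUE_EFFECT, std_error)
    print(f"standard error {std_error:.2f}: P(estimate <= 0) = {p:.5f}")
```

Under these assumptions, a null or negative result only becomes reasonably likely when a study's standard error is about as large as the effect itself, so how many such results "should" exist depends heavily on how precise the included studies are.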

So that's one question. The authors also state that homicide rates provide the best data for studies of lead and crime, which presumably gives homicide studies extra weight in their analysis of "high-quality" studies. In fact, because the sample size for homicides is so small, exactly the opposite is true. In general, studies that look at homicide rates in the '80s and '90s simply don't have the power to be meaningful. The unit of study should always be an index value for violent crime and it should always be over a significant period of time.

And there's another thing. These crime studies aren't like drug efficacy studies, where pharma companies have an incentive to bury negative results. Nor are they the kind of study where you round up a hundred undergraduates and perform an experiment on them. A simple thing like that is just not worth writing up if it shows no effect.

On the contrary, studies of lead and crime are typically very serious pieces of research that make use of large datasets and often take years to complete. It's possible that such a study couldn't find a home if it showed no effect, but I'm not sure I buy it. Lead and crime is a fairly hot topic these days with plenty of doubters still around. Well-done papers showing no effect would probably be welcomed.

For what it's worth, I'd also note that the lead-crime field is compact enough that practitioners are typically aware of research and working papers currently in progress. They'd notice if lots of studies they knew about just disappeared for some reason. And maybe they have! But if so, they're keeping mighty quiet about it. Someone should ask them.

I have a few other questions about this meta-analysis, but they aren't important. The primary question is pretty simple: Is the lack of negative results due to papers not getting published for one reason or another? Or is it the result of lead having a substantial effect, which makes it very unlikely for a study to show an association less than zero just by chance?

And this is where I have to stop. I can bring up questions about this meta-analysis, but the authors' model is too complex for me to assess. I also don't have enough background in the meta-analysis biz to judge the paper as a whole. Someone else will have to do it, and it will have to be someone who is (a) familiar with the lead-crime field and (b) has the mathematical chops to dive into this. Any volunteers?

One unrelated issue that the paper raises is that violent crime in Europe generally rose in the 1990s and aughts, even as gasoline lead decreased. I've read specific papers about Sweden, Australia, and New Zealand that support the lead-crime hypothesis, all of which showed crime declines as lead levels dropped. I also know that violent crime in the UK shows an almost perfect correlation with lead levels as violent crime rose and fell. And of course Rick Nevin has shown good correlations in a number of countries.

This is just off the top of my head, and I don't know the overall data well enough to comment any further—though keep in mind that Europe generally banned leaded gasoline well after the US, so their violent crime rate kept increasing long after the US too. In any case, maybe I'll look into it. If it turned out that most European countries showed little correlation between lead levels in children and violent crime rates later in life, it would be a serious setback to the lead-crime hypothesis.

Publication bias is a problem, yes, but it doesn't happen at random. Failed attempts at direct replication are reasonably likely to be submitted and published. On the other hand, researchers who hope to extend a published finding, or use it as a methodology in another problem area, are less likely to submit, assuming that they may have done something wrong. But the criteria for inclusion set by the authors of the meta-analysis would have excluded any study that wasn't at least a constructive replication. So my expectation is that publication bias would be a rather small factor.

Honestly, I didn't read far enough to find a description of their bias model, if any was given. Although I used inferential statistics on and off throughout my career, I started out somewhat skeptical of the idea of automating inference from data, and grew more skeptical over the years. I would be much more interested in knowing the particular ways in which the studies that showed little or no effect differed from those that showed meaningful effects. The goal should be understanding, not statistical 'significance'.

By narrowing their indicator for violent crime, the authors make a significant change to the research question at the heart of this issue, IMHO. How many of the papers under review focus only on homicides?

It would be a reasonable conclusion to say bias if the studies were conducted according to a common protocol, only drawing different samples from the same population, and applying the same analysis methodology. Then one would expect the results to be approximately Gaussian due to random error. On the other hand, there might be systematic differences between the studies that show little or no effect, and those that show real-world-meaningful effects. In that case, the distribution of PCCs would be difficult if not impossible to model.

Let’s do a few calculations of the data.

1. The Bayesian estimate of the effect is about 0.19-0.20 (the expected value of the posterior). The value varies a bit depending on what significance is assigned to the error bars in the figure. The paper itself starts off down the Bayes road, but veers into frequentist weeds pretty darn quickly.

2. The estimated errors in the original papers or as normalized are far too low. Some of the estimates are 0. That's simply not feasible.

3. The estimated probability of any given paper being correct depends highly on the estimated variance of the paper. This is a side effect of point 2.

There is a simple way to keep the prior distribution well bounded (It was underflowing to zero). The expected value is 0.25. The simple way is to sum the logs and then normalize the max sum of logs to zero. This paper doesn’t adjust for any of this.
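The “sum the logs and normalize the max to zero” step can be sketched as follows. This is only an illustration of the numerical trick on a grid of candidate effect sizes with a flat prior; the estimates and standard errors are toy values, not the figures from the paper, and the function names are assumptions. A standard error of zero is rejected outright, since it would make the likelihood undefined.

```python
import math

def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution, computed without exponentiating."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2

def posterior_on_grid(estimates, std_errors, grid):
    """Posterior weights over `grid` under a flat prior, computed in log space."""
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    log_post = []
    for theta in grid:
        # Sum the log-likelihoods instead of multiplying tiny probabilities.
        log_post.append(sum(log_normal_pdf(est, theta, se)
                            for est, se in zip(estimates, std_errors)))
    peak = max(log_post)
    # Shift so the largest log value is zero, then exponentiate safely.
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def posterior_mean(grid, posterior):
    """Expected value of the effect under the grid posterior."""
    return sum(theta * p for theta, p in zip(grid, posterior))

# Toy inputs (hypothetical): three study estimates with equal standard errors.
toy_estimates = [0.0, 0.3, 0.6]
toy_std_errors = [0.1, 0.1, 0.1]
grid = [i / 100 for i in range(-50, 101)]   # candidate effects from -0.50 to 1.00

posterior = posterior_on_grid(toy_estimates, toy_std_errors, grid)
print("posterior mean:", round(posterior_mean(grid, posterior), 4))
```

Because the largest log value is shifted to zero before exponentiating, at least one weight is exactly 1 and the normalization never divides by an underflowed zero, however many studies are multiplied together.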

https://jabberwocking.com/has-the-lead-crime-hypothesis-been-debunked/

Thus, alternative explanations for black crime will be offered down below. Readers not interested in this can skip the following paragraphs, but those interested in learning about it can continue on as normal. One trait that differs between blacks and whites, which influences many social and economic variables, is self-control. In psychology, self-control can be defined in two ways: (1) the ability to control oneself, whether it be emotions or desires, and; (2) The ability to delay self-gratification and resist unwanted behaviors and urges (in Cherry 2012). To measure self-control, researchers often use something called the marshmallow test or a variation of it. For this type of study, researchers offer a child the option between a small marshmallow now or a larger reward later on, or they leave the child in a room with a marshmallow in front of them and inform them that they have to leave the room but they can eat the marshmallow if they so desire. If the researcher comes back after a few minutes and the marshmallow is still there, the participant gets another marshmallow to eat or a larger one. There are a few other variations in this test, like how the study is conducted or what items are being offered, but they all measure the same thing. As has been shown in the literature, the ability to delay self-gratification predicts many outcomes within an individual’s life. It has been shown that the inability or unwillingness to delay self-gratification affects many life outcomes.

In 2014, a study was conducted showing that ability to delay gratification depends on social trust (Michaelson et al, 2013).

Using Amazon’s Mechanical Turk, participants (n = 78: 34 male, 39 female and 5 who preferred not to state their gender) completed online surveys and read three vignettes in order (trustworthy, untrustworthy and neutral), using a scale of 1-7 to rate how likeable and trustworthy each individual was and how likely they were to share. Michaelson et al (2013) write:

“Next, participants completed intertemporal choice questions (as in Kirby and Maraković, 1996), which varied in immediate reward values ($15–83), delayed reward values ($30–85), and length of delays (10–75 days). Each question was modified to mention an individual from one of the vignettes [e.g., “If (trustworthy individual) offered you $40 now or $65 in 70 days, which would you choose?”]. Participants completed 63 questions in total, with 21 different questions that occurred once with each vignette, interleaved in a single fixed but random order for all participants. The 21 choices were classified into 7 ranks (using the classification system from Kirby and Maraković, 1996), where higher ranks should yield higher likelihood of delaying, allowing a rough estimation of a subject’s willingness to delay using a small number of trials. Rewards were hypothetical, given that hypothetical and real rewards elicit equivalent behaviors (Madden et al., 2003) and brain activity (Bickel et al., 2009), and were preceded by instructions asking participants to consider each choice as if they would actually receive the option selected. Participants took as much time as they needed to complete the procedures.”

Manipulating trust within subjects, in the absence of a reward, influenced their ability to delay gratification, as did how trustworthy the individual in the vignette was perceived to be. So this suggests that, in the absence of rewards, when social trust is reduced, ability to delay gratification would be lessened. Because the order in which the vignettes were read could have affected the trust manipulation, they did a second experiment using the same model with 172 participants (65 males, 63 females, and 13 who chose not to state their gender). In this experiment, though, computer-generated trustworthy, untrustworthy and neutral faces were presented to the participants. They were only paid $0.25, though it has been shown that compensation only affects turnout, not data quality (Buhrmester, Kwang, and Gosling, 2011).

In this experiment, each participant read a vignette with a particular face attached to it (trustworthy, untrustworthy or neutral); the faces had been used in previous studies on this matter. They found that when trust was manipulated in the absence of a reward between subjects, this influenced the participants’ willingness to delay gratification, and perceived trustworthiness influenced it as well.

Michaelson et al (2013) conclude that the ability to delay gratification is predicated on social trust, and present an alternative hypothesis for all of these positive and negative life outcomes:

“Social factors suggest intriguing alternative interpretations of prior findings on delay of gratification, and suggest new directions for intervention. For example, the struggles of certain populations, such as addicts, criminals, and youth, might reflect their reduced ability to trust that rewards will be delivered as promised. Such variations in trust might reflect experience (e.g., children have little control over whether parents will provide a promised toy) and predisposition (e.g., with genetic variations predicting trust; Krueger et al., 2012). Children show little change in their ability to delay gratification across the 2–5 years age range (Beck et al., 2011), despite dramatic improvements in self-control, indicating that other factors must be at work. The fact that delay of gratification at 4-years predicts successful outcomes years or decades later (Casey et al., 2011; Shoda et al., 1990) might reflect the importance of delaying gratification in other processes, or the importance of individual differences in trust from an early age (e.g., Kidd et al., 2012).”

Another paper (small n, n = 28) showed that the children’s perception of the researchers’ reliability predicted delay of gratification (Kidd, Palmeri, and Aslin, 2012). They suggest that

“children’s wait-times reflected reasoned beliefs about whether waiting would ultimately pay off.” So these tasks “may not only reflect differences in self-control abilities, but also beliefs about the stability of the world.”

Children who had reliable interactions with the researcher waited about 4 times as long—12 minutes compared to 3 minutes—if they thought the researcher was trustworthy. Sean Last over at the Alternative Hypothesis uses these types of tasks (and other correlates) to show that blacks have lower self-control than whites, citing studies showing correlations with IQ and delay of gratification. Though, as can be seen, alternative explanations for these phenomena make just as much sense, and with the new experimental evidence on social trust and delaying gratification, this adds a new wrinkle to this debate. (He also shortly discusses ‘reasons’ why blacks have lower self-control, implicating the MAOA alleles. However, I have already discussed this and blaming ‘genes for’ violence/self-control doesn’t make sense.)

Michaelson and Munakata (2016) show more evidence for the relationship between social trust and delaying gratification. When children (age 4 years, 5 months, n = 34) observed an adult behaving in a trustworthy way, they were able to wait for the reward; when they observed the adult behaving in an untrustworthy way, they ate the treat, apparently judging that they were unlikely to get the second marshmallow even if they waited for the adult to return. Ma et al (2018) also replicated these findings in a sample of 150 Chinese children aged 3 to 5 years old. They conclude that “there is more to delay of gratification than cognitive capacity, and they suggest that there are individual differences in whether children consider sacrificing for a future outcome to be worth the risk.” Those who had higher levels of generalized trust waited longer, even when age and level of executive functioning were controlled for.

Romer et al (2010) show that people who are more willing to take risks may be more likely to engage in risky behavior that provides insights to that specific individual on why delaying gratification and having patience leads to longer-term rewards. This is a case of social learning. However, people who are more willing to take risks have higher IQs than people who do not. Though SES was not controlled for, it is possible that the ability to delay gratification in this study came down to SES, with lower class people taking the money, while higher class people deferred. Raine et al (2002) showed a relationship between sensation seeking in 3-year-old children from Mauritius, which then was related to their ‘cognitive scores’ at age 11. As usual, parental occupation was used as a measure of ‘social class’, and since SES does not capture all aspects of social class then controlling for the variable does not seem to be too useful. Because a confound here could be that children from higher classes have more of a chance to sensation seek which may cause higher IQ scores due to cognitive enrichment. Either way, you can’t say that IQ ’causes’ delayed gratification since there are more robust predictors such as social trust.

Though the relationship is there, what to make of it? Since exploring more leads to, theoretically, more chances to get things wrong and take risks by being impulsive, those who are more open to experience will have had more chances to learn from their impulsivity, and so learn to delay gratification through social learning and being more open. ‘IQ’ correlating with it, in my opinion, doesn’t matter too much; it just shows that there is a social learning component to delaying gratification.

In conclusion, there are alternative ways to look at the results from Marshmallow Experiments, such as social trust and social learning (being impulsive and seeing what occurs when an impulsive act is carried out may have one learn, in the future, to wait for something).

Though these experiments are new and the research is young, it’s very promising that there are other explanations for delayed gratification that don’t have to do with differences in ‘cognitive ability’, but depend on social trust—trust between the child and the researcher. If the child sees the researcher is trustworthy, then the child will wait for the reward, whereas if they see the researcher is not trustworthy, they will take the marshmallow or whatnot, since they believe the researcher is not trustworthy and therefore won’t stick to their word.

https://notpoliticallycorrect.me/2018/02/11/delaying-gratification-and-social-trust/

Using nationally representative samples, Moffit et al. (2012) looked at how well self-control measured in childhood predicted life outcomes at age 32 compared to IQ and parental socioeconomic status. Self-control was found to predict better health, more wealth, less criminality, and a lower chance of being a single parent. This held true even while IQ and parental socioeconomic status was held constant. Although, Moffit et al. found IQ to be a better predictor of wealth and adult socioeconomic status when self-control and parental socioeconomic status were held constant.

This uses informant-report measures of self control, not delay of gratification measures, which are literally the measure he uses to argue black people have lower self control. This is important to recognize because convergent validity correlations among informant-report questionnaires and delay tasks are small to moderate and unreliable given the confidence intervals (r = .21, CI [.09, .32], Z = 6.19, p < .0001). He even acknowledges that socioeconomic status predicts life outcomes better than self control, and those results are more statistically significant too.

Additionally self control had wider confidence intervals and larger standard errors. There should be a warrant for the R2.

Other studies have examined the associations between early delay of gratification and adolescent outcomes in a more diverse sample of children and with more sophisticated statistical models. They found smaller bivariate associations between early delay ability and later achievement (Watts, 2018). Others have found that the measures used to assess self-control, recognizing morally debatable behaviors, and the antisocial beliefs composite did not display significant group differences (Sorge et al., 2015).

Tangney, Baumeister, and Boone (2004) found that high self-control predicted a higher GPA, better social adjustment, less binge-eating and alcohol abuse, better relationships and interpersonal skills, secure attachment, and better emotional responses.

One of the authors, Baumeister, actually has a more recent paper in 2011 that stumbled on a paradox: The people who were the best at self-control — the ones who most readily agreed to survey questions like “I am good at resisting temptations” — reported fewer temptations throughout the study period. To put it more simply: The people who said they excel at self-control were hardly using it at all (Hofmann et al. 2012). Psychologists Marina Milyavskaya and Michael Inzlicht recently confirmed and expanded on this idea. In their study, they monitored 159 students at McGill University in Canada in a similar manner for a week. Additionally, Baumeister has an idea that dates back to the 90s that ego depletion confounds ideas of self control because some people can be more depleted in self control than others.

Daley et al. (2015) looked at 16,780 British individuals and examined how well IQ, childhood self-control, and class predicted adult unemployment. IQ and self-control both had a negative relationship with unemployment, and class failed to predict unemployment after IQ and self-control were controlled for. In figure 2, the authors showed that after controlling for gender, intelligence, and parental social class, a 1-SD increase in childhood self-control predicted a decrease in the probability of being unemployed at all ages looked at. On average, a 1-SD increase in childhood self-control was associated with a 1.4% decrease in the probability of being unemployed.

This doesn’t say anything about race and levels of self control. There should also be a subgroup analysis of whether these effects are on frictional, structural, or cyclical unemployment to see how significant these results truly are.
The study also has antithetical results showing that
“Self-control did not significantly predict unemployment at age 34 or age 38, when average unemployment rates were at their lowest.”
Additionally, those with lower self control have confidence bounds that overlap the bounds of those with other levels of self control:
“the predicted number of months of unemployment was 6.34 (95% confidence interval, CI = [5.46, 7.22]), for participants with low self-control (1 SD below the mean), 4.99 (95% CI = [4.49, 5.47])”
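The two intervals in this quote can be compared directly. A minimal sketch, with a hypothetical helper name; the group label for the second interval is cut off in the quote, so it is left unnamed. Overlap between two 95% intervals is only a rough check, not a formal test of the difference between groups.

```python
def interval_overlap(first, second):
    """Width of the overlap between two (low, high) intervals; 0.0 if they are disjoint."""
    low = max(first[0], second[0])
    high = min(first[1], second[1])
    return max(0.0, high - low)

# 95% confidence intervals for predicted months of unemployment, as quoted above.
low_self_control_ci = (5.46, 7.22)
second_group_ci = (4.49, 5.47)   # group label is truncated in the quote

overlap = interval_overlap(low_self_control_ci, second_group_ci)
print("intervals overlap:", overlap > 0)
print("overlap width (months):", round(overlap, 2))
```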
I’m also not sure why they’re using negative binomial regression for variables that aren’t overdispersed.
They also say
“the self-control measures used in Studies 1 and 2 were not originally designed for that purpose, which raises the possibility of measurement error that could have attenuated the relationship between self-control and unemployment.”
Additionally they say
“by failing to adjust for potentially important constructs, such as “grit” (i.e., perseverance and passion for long-term goals), that overlap with self-control (Duckworth & Gross, 2014), we may have overestimated the contribution of self-control.”
Also
“Although we observed an enhanced contribution of self-control during a recession, it is unclear whether these striking findings are generalizable to other time periods and countries.”
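To make the interval comparison above concrete, here is a minimal Python sketch that checks whether the two 95% confidence intervals quoted from the study overlap. The variable and function names are my own, and because the quoted sentence is cut off, the second interval is labeled only as the comparison group. Keep in mind that eyeballing interval overlap is a rough heuristic, not a formal test of the difference.

```python
# Intervals copied from the quoted passage (predicted months of unemployment).
low_self_control = (5.46, 7.22)   # point estimate 6.34, 1 SD below the mean
comparison_group = (4.49, 5.47)   # point estimate 4.99; group label is cut off in the quote

def overlap(a, b):
    """Return the width of the overlap between two closed intervals (0.0 if none)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

shared = overlap(low_self_control, comparison_group)
print(f"Intervals overlap: {shared > 0} (shared width = {shared:.2f} months)")
```

The shared region is narrow, so this point is best read alongside the other limitations the authors list rather than on its own.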

Casey et al. (2011) looked at 60 individuals and remarked that those who showed lower self-control in preschool also showed lower self-control in their 20s and 30s.

This study has several problems:
Despite the reported significant main effect of task, the study does not provide any information about the magnitude of this effect or the actual performance of the participants on each task, so the result is hard to interpret. Additionally, the study reports a significant interaction between group and task but does not provide any post-hoc analyses or contrast tests to investigate this interaction further. It also does not meet the causal assumption of mediator monotonicity: the authors haven’t ruled out that the relationship between delay of gratification in childhood and impulse control abilities in adulthood is not monotonically increasing but instead follows a more complex or non-linear pattern. For example, suppose individuals who had higher levels of delay of gratification in childhood tended to have better impulse control abilities as adults, but only up to a certain point, beyond which the relationship plateaus or even declines. This could be due to other factors that influence impulse control abilities, such as environment. In that case, the relationship would be curvilinear rather than monotonically increasing, which would violate mediator monotonicity and call for further analysis of the relationship between the two variables.
They also report a trend toward a difference in performance between the low and high delay groups on the "hot" task, but this difference is not statistically significant. While the study does report a significant difference between the two groups on the happy "nogo" trials within the "hot" task, this result should be interpreted cautiously given the lack of statistical significance on the overall "hot" task.
Furthermore, this study does not provide any information on the statistical power of the analyses. Without this information, it cannot be determined whether the study had sufficient power to detect any true differences between the groups. Relatedly, the study does not meet the causal assumption of treatment ignorability. It does not adequately control for other factors that may influence both the treatment (delay of gratification in childhood) and the outcome (impulse control abilities in adulthood). These factors include environmental influences, such as parenting styles or socioeconomic status, which influence both delay of gratification in childhood and impulse control abilities in adulthood. If the low and high delay groups differ on these environmental factors, then the true causal relationship between the two variables cannot be determined. Developmental factors, such as socialization experiences, are another example. The same factors apply to mediator ignorability: the study does not provide information about whether the go/nogo task may be influenced by unmeasured confounds.
The study also uses a non-random sample, so its estimates are likely biased.

In a meta-analysis by Ridder et al. (2011), self-control was related to a variety of human behaviors like love, happiness, getting good grades, speeding, commitment in a relationship and lifetime delinquency. There was a small-to-medium relationship between self-control and outcomes, showing that self-control may not explain all of these variables fully, but it is a factor, with the coefficients all being statistically significant.

This still doesn’t mention race, and the sample correlations leave a great deal of variance unexplained, as the authors themselves point out:
“all analyses with the Low Self-Control Scale produced exceptionally high levels of unexplained variance (which were much higher than those using the Self-Control Scale and the Barratt Impulsiveness Scale), indicating that other factors unaccounted for in the present meta-analysis influenced the effects of self-control obtained with this scale. Research has suggested that better specification of the conditions under which the Low Self-Control Scale is likely to have more or less effect on deviant behavior should be undertaken (Tittle, Ward, & Grasmick, 2003), and our research supports this recommendation.”
High Q values indicate substantial variability between studies, and with no random-effects model analysis or subgroup analyses, the statistical significance of the pooled estimates means very little.
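For readers unfamiliar with the Q statistic, the following is a minimal Python sketch of how heterogeneity is usually quantified and why it matters for the pooled estimate. The effect sizes and variances are hypothetical numbers I made up for illustration, not values from the meta-analysis, and the DerSimonian-Laird estimator is simply one common choice for the between-study variance.

```python
import math

def heterogeneity(effects, variances):
    """Cochran's Q, I^2 (percent), and the DerSimonian-Laird tau^2 estimate."""
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / w_sum
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = w_sum - sum(w * w for w in weights) / w_sum
    tau2 = max(0.0, (q - df) / c)
    return q, i2, tau2

def pooled_estimate(effects, variances, tau2=0.0):
    """Inverse-variance pooled effect and its standard error.

    tau2 = 0 gives the fixed-effect estimate; tau2 > 0 gives random effects.
    """
    weights = [1.0 / (v + tau2) for v in variances]
    w_sum = sum(weights)
    est = sum(w * y for w, y in zip(weights, effects)) / w_sum
    return est, math.sqrt(1.0 / w_sum)

# Hypothetical effect sizes and sampling variances (NOT from the meta-analysis).
effects = [0.10, 0.35, 0.60, 0.20]
variances = [0.01, 0.01, 0.01, 0.01]

q, i2, tau2 = heterogeneity(effects, variances)
fixed = pooled_estimate(effects, variances)
random_fx = pooled_estimate(effects, variances, tau2)

print(f"Q = {q:.2f}, I^2 = {i2:.1f}%, tau^2 = {tau2:.4f}")
print(f"fixed-effect SE = {fixed[1]:.3f}, random-effects SE = {random_fx[1]:.3f}")
```

When Q is large relative to its degrees of freedom, the random-effects standard error is wider than the fixed-effect one, which is why significance claims from a pooled estimate need the heterogeneity reported alongside them.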

As can be seen, self-control is correlated with many social and economic outcomes. Those with higher self-control levels have better outcomes, while those who have lower self-control are more likely to experience the negative effects.

Kentaro Fujita, a psychologist who studies self-control at the Ohio State University, says, “Effortful restraint, where you are fighting yourself, the benefits of that are overhyped.” He’s not the only one who thinks so. Several researchers I spoke to are making a strong case that we shouldn’t feel so bad when we fall for temptations. “There’s a strong assumption still that exerting self-control is beneficial … and we’re showing in the long term, it’s not.” Indeed, studies have found that trying to teach people to resist temptation either only has short-term gains or can be an outright failure (Allom et al. 2015; Miles et al. 2016). Brian Galla, a psychologist at the University of Pittsburgh, says, “We don’t seem to be all that good at [self-control].” If resisting temptation is a virtue, then more resistance should lead to greater achievement, right? That’s not what Milyavskaya and Inzlicht’s results, pending publication in the journal Social Psychological and Personality Science, found. The students who exerted more self-control were not more successful in accomplishing their goals. It was the students who experienced fewer temptations overall who were more successful when the researchers checked back in at the end of the semester. What’s more, the people who exercised more effortful self-control also reported feeling more depleted. So not only were they not meeting their goals, they were also exhausted from trying. So who are these people who are rarely tested by temptations? And what can we learn from them? There are a few overlapping lessons from this new science:
1. People who are better at self-control actually enjoy the activities some of us resist, like eating healthy, studying, or exercising. So engaging in these activities isn’t a chore for them. It’s fun. “‘Want-to’ goals are more likely to be obtained than ‘have-to’ goals,” Milyavskaya says. “Want-to goals lead to experiences of fewer temptations. It’s easier to pursue those goals. It feels more effortless.” If you’re running because you “have to” get in shape, but find running to be a miserable activity, you’re probably not going to keep it up. That means that an activity you like is more likely to be repeated than an activity you hate.
2. People who are good at self-control have learned better habits. In 2015, psychologists Brian Galla and Angela Duckworth published a paper in the Journal of Personality and Social Psychology, finding across six studies and more than 2,000 participants that people who are good at self-control also tend to have good habits — like exercising regularly, eating healthy, sleeping well, and studying. “People who are good at self-control … seem to be structuring their lives in a way to avoid having to make a self-control decision in the first place,” Galla says. And structuring your life is a skill. People who do the same activity — like running or meditating — at the same time each day have an easier time accomplishing their goals, he says. Not because of their willpower, but because the routine makes it easier. A trick to wake up more quickly in the morning is to set the alarm on the other side of the room. That’s not in-the-moment willpower at play. It’s planning. This theory harks back to one of the classic studies on self-control: Walter Mischel’s “marshmallow test,” conducted in the 1960s and ’70s. In these tests, kids were told they could either eat one marshmallow sitting in front of them immediately or eat two later. The ability to resist was found to correlate with all sorts of positive life outcomes, like SAT scores and BMIs. But the kids who were best at the test weren’t necessarily intrinsically better at resisting temptation. They might have been employing a critical strategy. “Mischel has consistently found that the crucial factor in delaying gratification is the ability to change your perception of the object or action you want to resist,” the New Yorker reported in 2014. That means kids who avoided eating the first marshmallow would find ways not to look at the candy, or imagine it as something else. “The really good dieter wouldn’t buy a cupcake,” Fujita explains. 
“They wouldn’t have passed in front of a bakery; when they saw the cupcake, they would have figured out a way to say yuck instead of yum; they might have an automatic reaction of moving away instead of moving close.”
3. It’s easier to have self-control when you’re wealthy. When Mischel’s marshmallow test is repeated on poorer kids, there’s a clear trend: They perform worse, and appear less able to resist the treat in front of them. But there’s a good reason for this. As University of Oregon neuroscientist Elliot Berkman argues, people who grow up in poverty are more likely to focus on immediate rewards than long-term rewards. Because when you’re poor, the future is less certain.
The new research on self-control demonstrates that eating an extra slice of cake isn’t a moral failing. It’s what we ought to expect when a hungry person is in front of a slice of cake. “Self-control isn’t a special moral muscle,” Galla says. It’s like any decision. And to improve the decision, we need to improve the environment, and give people the skills needed to avoid cake in the first place. “There are many ways of achieving successful self-control, and we’ve really only been looking at one of them,” which is effortful restraint, Berkman tells me. The previous leading theory on willpower, called ego depletion, has recently come under intense scrutiny for not replicating. (Berkman argues that the term “self-control” ought to be abolished altogether. “It’s no different than any other decision making,” he says.) The new research isn’t yet conclusive on whether it’s really possible to teach people the skills needed to make self-control feel effortless. More work needs to be done — designing interventions and evaluating their outcomes over time. But it at least gives researchers a fresh perspective to test out new solutions. In Berkman’s lab, he’s testing out an idea called “motivational boost.” Participants write essays explaining how their goals (like losing weight) fit into their core values. Berkman will periodically text study participants to remind them why their goals matter, which may increase motivation. “We are still gathering data, but I cannot say yet whether it works or not,” he says. Another intriguing idea is called “temptation bundling,” in which people make activities more enjoyable by adding a fun component to them. One paper showed that participants were more likely to work out when they could listen to an audio copy of The Hunger Games while at the gym. Researchers are excited about their new perspective on self-control. 
“It’s exciting because we’re maybe [about to] break through on a whole variety of new strategies and interventions that we would have never thought about,” Galla says. He and others are looking beyond the “just say no” approach of the past to boost motivation with the help of smartphone apps and other technology. This is not to say all effortful restraint is useless, but rather that it should be seen as a last-ditch effort to save ourselves from bad behavior. “Because even if the angel loses most of the time, there’s a chance every now and again the angel will win,” Fujita says. “It’s a defense of last resort.”

Given the fact that self-control predicts many social and economic outcomes, it’s worth asking if racial groups differ in self-control. If they do, then it can help explain some variation in black-white differences in outcomes. If races do not differ in this, then self-control is not an explanatory variable in black-white differences in social and economic outcomes. In a classic series of studies, Mischel (1958, 1961a, 1961b) found that black Trinidadian children given a choice between getting a smaller candy bar now or a larger one in a week tended, much more than matched white children, to choose the smaller, immediate candy bar. The difference between white and black children was “so great as to make tests for the significance of the difference superfluous” (Mischel 1961a). Mischel reported undertaking the study because informants had suggested that “Negroes are impulsive, indulge themselves, settle for next to nothing if they can get it right away, do not work or wait for bigger things in the future.”

The thing about these delayed gratification studies, especially the Mischel studies in the 50s and 60s, is that delaying gratification actually depends on social trust (Michaelson et al., 2013). Banks et al. (1983), among other academics, criticized Mischel’s work: “Consequently, much of what recent theorists have inferred from past research and conjectured regarding ‘a preference for smaller, immediate rewards’ among Blacks (Mischel, 1966, p. 125), stands in marked contrast to the actual data. The accumulated evidence largely refutes rather than supports the construct validity of immediate gratification preference among Blacks. Accordingly, with minor exception the evidence fails even to substantiate that such preference characterizes the behavior of that population.” “Systematic experimental research into delayed gratification began with Black subject populations in the Caribbean (Mischel 1958, 1961c). Mischel had observed informally in the context of a village in Trinidad what earlier social observers had remarked upon in different settings (e.g., Drake & Cayton, 1945). It appeared to Mischel and to his casual informants that a marked tendency obtained in Trinidadian Blacks toward immediate gratification (see Mischel, 1971). This seemed in striking contrast to the behavior of a native East Indian population that was characterized by self-deprivation and the postponement of gratification. In an attempt to verify this cultural observation, Mischel devised a simple paradigm in which subjects were asked to make a choice between two alternative rewards for their participation in an experiment.
Subjects could choose to receive a small reward to be presented immediately by the experimenter, or a larger, more valued reward to be presented somewhat later by the experimenter. In his initial investigation with 35 Black 7-to-9-year-olds (both sexes) and a comparative sample of 18 East Indians, Mischel (1958) offered the alternative of a one-cent candy immediately or a ten-cent candy after one week. Sixty-seven percent of the Indian children selected the larger, delayed reward alternative; 33% chose the smaller, immediate reward. This pattern did, indeed, differ from that of the Black sample, who chose the larger, delayed alternative less often (37%) than the smaller, immediate alternative (63%). Neither of these patterns of choice in itself, however, could be characterized as preferential. In his sample of 18, Mischel’s East Indian subjects would have needed to choose at a rate exceeding 70% to reject the null hypothesis of chance (p = .05) selection of the large, delayed alternative (see Table 1). More germane to the present discussion, Blacks were equally non-preferential in their selections, their 63% of choices for the small, immediate reward failing to exceed the rate required (66%) at the 95% level of confidence by simple z-test. In a somewhat later investigation Mischel (1961b) engaged 68 Black children from Trinidad and 69 Black children from Grenada, aged 8 to 9 years. These children were offered a choice similar to that offered the earlier sample, a two-cent candy immediately or a ten-cent candy a week later. Fifty-three percent of the Trinidadian Black sample chose the two-cent candy; 47% chose the ten-cent alternative. This pattern conformed to chance. Of the Grenadian Blacks, only 24% chose the small, immediate alternative, whereas 76% clearly preferred the larger delayed reward at a rate that rejects the null hypothesis of chance, but in the direction of delay-preference.
Another sample of Trinidadian Blacks, this time ranging in age from 12 to 14, was engaged by Mischel (1961c) in a study of the relationship between immediate preferences and delinquency. Within the overall sample of 206 children, 68% selected a twenty-five-cent candy to be received in one week over a five-cent candy to be received immediately (32%). This pattern of clearly significant (p = .05) preference for the larger delayed reward also obtained for that subsample of the overall group who were classified as nondelinquents. Nondelinquent children from a large elementary government school selected the twenty-five cent candy 74% of the time, clearly significant within that group of 136. The delinquent subsample from a boys’ industrial school selected the immediate reward 44% of the time, and the delayed reward 56%, a pattern that did not depart from binomial chance. This same pattern of inconsistent findings has been replicated in a range of variations on Mischel’s initial paradigm. For instance, a sample of 112 Black Trinidadian 11-to-14-year-olds was presented with one actual reward choice and two hypothetical queries (Mischel, 1961a). Subjects could select either a ten-cent candy to be received immediately or a twenty-five-cent candy to be received in one week. In addition, they indicated either their agreement or disagreement with each of the following self-descriptive statements: ‘I would rather get ten dollars right now than have to wait a whole month and get thirty dollars then; I would rather wait to get a much larger gift much later rather than get a smaller one now.’ On the basis of their three responses the children were assigned to consistent delay (3 delay responses), inconsistent delay (2 or 1 delay responses) and consistent immediate (0 delay responses) groups. Combining those who made two or three delay responses, 49% of the children can be said to have made delay choices.
Using those who made either one or no delay responses, the combined non delay (or immediate) groups totaled 51% of the overall sample. These rates of choice are obviously the same as chance. Even taking the consistent delay (n = 37) and the consistent immediate (n = 30) groups only, the dichotomy of choice would be 55% and 45%, respectively. Although these frequencies are in the direction of preference for the larger, delayed reward, they do not depart from statistical chance for a sample of 67. Strickland (1972) offered a different reward choice to 171 American Black children, aged 11 to 13 years. Following the completion of a questionnaire, the children were offered a choice between one 45-rpm record immediately or three 45-rpm records to be received in three weeks. Overall, 55% of the children selected the one record immediately, 45% selected to wait for the three records in three weeks. These frequencies again conform to chance, indicating nonpreference. A series of four choices between small, immediate, and large, delayed rewards was presented to fourth-graders in a midwestern city. Herzberger and Dweck (1978) asked 19 Black male and 15 Black female children to choose between one nickel immediately or three nickels in one week, two nickels immediately or five nickels in one week, two nickels immediately or three nickels in one week, a small candy bar immediately or a medium-sized candy bar in one week. Each child made all of these choices without knowing which of them would be the real one to determine the actual gift. Across the four choices, females selected the larger delayed alternative 53%, 47%, 53%, and 54% of the time, respectively; males chose that alternative at rates of 63%, 58%, 58%, and 58% respectively. None of these was different from chance.
Price-Williams and Ramirez (1974) presented their sample of 60 Black fourth graders with these questions: ‘Suppose you could get $10 now or wait a month and get $30, which would you take?’ ‘If you could get a small 5¢ candy bar now, or wait a month and get a bigger 25¢ candy bar, which would you take?’ ‘If you could get a small present now, or take a bigger one a month later, which would you take?’ The first question was entirely hypothetical, whereas the latter two were accompanied by a tangible display of the alternatives and resulted in the child’s actually receiving the chosen alternative. In response to the first query 18% of the children indicated a choice of the immediate $10, while 82% indicated a choice of the $30 to be obtained later. Thirty-three percent chose the smaller immediate candy bar, and 67% chose the larger, delayed one. Finally, 37% chose the smaller gift, and 67% chose to wait for a larger, more attractive alternative. All of these frequencies were, in fact, significantly different from chance and in the direction of delayed gratification preference. Extending the strategy of presenting purely hypothetical verbal queries, Lessing (1969) devised seven items that related to choice situations (e.g., If wearing ugly braces would make my teeth look prettier later on, I would put up with looking awful for a year or two). Eighty-eight Black eighth- and eleventh-graders responded with either agreement or disagreement to each of the seven items and were scored zero or one for each immediate versus delayed response, respectively. The possible range of scores was 0 to 7 indicating from low to high delayed gratification preference, and the overall mean for the sample was 5.16. Using an estimate of the standard deviation of the scores, calculated from the reported analysis of variance results, would yield a t-value greater than 7.00 when the obtained mean is compared with a 3.50 scale midpoint.
It can, therefore, be surmised that these Black subjects responded significantly in a delayed gratification direction.” “Taken together these reports offer little substantiation of past impressionistic conjectures about the inability to delay gratification among Blacks. Clearly the majority of evidence reveals a pattern of nonpreference for either smaller immediate or larger delayed rewards within that population. At the same time, a considerable accumulation of evidence suggests quite the opposite of immediate preferences. Lessing (1969), Mischel (1961b, 1961c), Price-Williams & Ramirez (1974), and Seagull (1966) reported choice patterns among American, Trinidadian, and Grenadian Blacks that reflect a preference for delayed gratification. Against these findings only three studies within the experimental research literature support the notion that Blacks exhibit immediate reward preferences. Mischel (1958) found that his entire (albeit small) subsample of Black Trinidadian children who reported their fathers as not living at home (n = 10) chose the immediate reward alternative. Similarly, he reported in a later study (Mischel, 1961b) that a subsample of father absent Trinidadian Blacks (n = 23) preferred the immediate, smaller reward at a rate (74%), which exceeded chance. Mischel interpreted both these observations in terms of trust, reasoning that the absence of a father may have undermined the confidence of these children in any promises made by authority figures. Strickland (1972) attempted to lend support to this thesis in a study of the effects of the promise-maker’s race, and found that her sample of Black children (n = 84) preferred the immediate reward (67%) in a white experimenter condition.” Banks uncovered theoretical and methodological flaws in the way the construct was developed and investigated (i.e., theoretical inconsistencies and limited research design, respectively).
Thus, the notion that Black people preferred immediate gratification, due to an overwhelming sense of powerlessness, victimhood, and low agency, was largely unsubstantiated (Banks et al., 1983). This critical review and rejection of Eurocentric psychological theory laid the foundation for a more accurate, strengths-based approach to people of African descent in the field of psychology. Studies like Michaelson et al (2013) conclude the ability to delay gratification is predicated on social trust, and present an alternative hypothesis for all of these positive and negative life outcomes: “Social factors suggest intriguing alternative interpretations of prior findings on delay of gratification, and suggest new directions for intervention. For example, the struggles of certain populations, such as addicts, criminals, and youth, might reflect their reduced ability to trust that rewards will be delivered as promised. Such variations in trust might reflect experience (e.g., children have little control over whether parents will provide a promised toy) and predisposition (e.g., with genetic variations predicting trust; Krueger et al., 2012). Children show little change in their ability to delay gratification across the 2–5 years age range (Beck et al., 2011), despite dramatic improvements in self-control, indicating that other factors must be at work. 
The fact that delay of gratification at 4-years predicts successful outcomes years or decades later (Casey et al., 2011; Shoda et al., 1990) might reflect the importance of delaying gratification in other processes, or the importance of individual differences in trust from an early age (e.g., Kidd et al., 2012).” Kidd, Palmeri, and Aslin (2012) suggest that “children’s wait-times reflected reasoned beliefs about whether waiting would ultimately pay off.” So these tasks “may not only reflect differences in self-control abilities, but also beliefs about the stability of the world.” Children who had reliable interactions with the researcher, and so had reason to think the researcher was trustworthy, waited about 4 times as long (12 minutes compared to 3 minutes). Michaelson and Munakata (2016) show more evidence for the relationship between social trust and delaying gratification. When children (age 4 years, 5 months, n = 34) observed an adult behaving in a trustworthy way, they were more willing to wait for the reward; when they observed the adult behaving in an untrustworthy way, they were more likely to eat the treat, apparently reasoning that they were unlikely to get the second marshmallow even if they waited for the adult to return. Ma et al. (2018) also replicated these findings in a sample of 150 Chinese children aged 3 to 5 years old. They conclude that “there is more to delay gratification than cognitive capacity, and they suggest that there are individual differences in whether children consider sacrificing for a future outcome to be worth the risk.” Those who had higher levels of generalized trust waited longer, even when age and level of executive functioning were controlled for.
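The chance comparisons that Banks et al. describe in the passage quoted above can be reproduced with a simple z-test against a 50/50 split. The Python sketch below is my own illustration: the function names are hypothetical, the sample sizes and percentages come from the quoted review, and I am assuming a normal approximation with a two-tailed 1.96 cutoff, which is the cutoff that reproduces the 66% threshold quoted for the sample of 35.

```python
import math

def chance_threshold(n, z_crit=1.96):
    """Proportion a sample of size n must exceed to differ from 50% (normal approximation)."""
    return 0.5 + z_crit * math.sqrt(0.25 / n)

def z_vs_chance(p_hat, n):
    """z statistic for an observed proportion tested against a 50/50 null."""
    return (p_hat - 0.5) / math.sqrt(0.25 / n)

# (label, sample size, proportion choosing the more popular option), from the quoted review.
samples = [
    ("Mischel 1958, Black sample, immediate reward", 35, 0.63),
    ("Mischel 1961b, Trinidad, immediate reward", 68, 0.53),
    ("Mischel 1961b, Grenada, delayed reward", 69, 0.76),
]

for label, n, p_hat in samples:
    z = z_vs_chance(p_hat, n)
    verdict = "differs from chance" if abs(z) > 1.96 else "consistent with chance"
    print(f"{label}: threshold = {chance_threshold(n):.3f}, z = {z:.2f} -> {verdict}")
```

Under these assumptions, only the Grenadian sample departs from chance, and it does so in the direction of delay, which matches the review’s reading.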

Seagull (1966) looked at black and white 9-year-olds who lived in New York City. Blacks and whites were offered the choice between being given a small candy bar now, or a larger one in a week’s time. Black children were more likely to ask for the smaller candy bar now than white children.

Banks et al. also addressed Seagull, saying “In a somewhat ambiguous report of delayed gratification choices of 68 Black third-graders in Syracuse, New York, Seagull (1966) described two subsamples, one which was unclassifiable as to socioeconomic status, and one which was classifiable. All children were offered a choice between one Hershey bar immediately and two Hershey bars to be obtained after waiting one week. Seagull reported that among the unclassifiable Blacks, 51% chose the immediate alternative, and 49% chose the delayed; these frequencies do not differ from chance. But among Seagull’s classifiable group, 24% chose the immediate and 76% chose the delayed candy bars, indicating significant preference for the larger, albeit postponed reward.”

Herzberger and Dweck (1978) looked at a sample of 100 4th graders and asked them to rate prizes. After rating the prizes, the researcher showed the immediate prizes and the delayed one. Per the study, “the choice pairs included: three nickels, two versus five nickels, two versus three nickels, a small candy bar versus a medium-sized candy bar, and a rubber ball versus an iron-on patch (the latter was inscribed with either ‘keep on truckin” or ‘try it—you’ll like it’).” Black children had lower self-control than white children even after controlling for socioeconomic status.

It actually says “There was also a tendency for lower SES blacks to delay less often than lower SES white subjects. This finding may be attributable, as Strickland (1972) suggests, to a lack of trust in the white experimenter.” Again, a common flaw in these delayed gratification studies.

Not all studies used candy, though. In the mid 1990s, the U.S. government offered military personnel two options for when they retired: A large lump sum of money now (immediate reward) or a yearly payment (delayed reward) which, over time, will be more than the immediate lump sum of money. Warner and Pleeter (2001) looked at the data for 66,483 individuals and found that blacks were 15% more likely than non-blacks to take the immediate reward. Whites were .4% less likely than non-whites to take the immediate reward.

The usefulness of this study for the hereditarian argument is unclear. How could hereditarians infer from it some inherent variable causing, or even related to, a weaker tendency to delay gratification? The authors themselves state, "for example, more-educated, higher-income individuals may be able to borrow at lower rates than less-educated, lower-income individuals. And to the extent that blacks and other minorities face discrimination in credit markets, they will face higher borrowing rates and therefore exhibit higher personal discount rates." Then later, regarding the lump sum, "again, blacks are estimated to be significantly more likely to take the lump sum than other nonwhites while whites are significantly less likely. Those with more education are again found to be less likely to take the lump sum and to have lower discount rates, as are older personnel." So throughout the study, the difference is an issue of education and, per the authors’ own explanation, borrowing costs.
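The borrowing-rate argument in the quote above is easier to see with numbers. Below is a minimal Python sketch that finds the break-even discount rate at which a stream of yearly payments is worth exactly as much as a lump sum today. The dollar amounts and payout length are hypothetical values chosen for illustration, not the actual terms of the military program, and the function names are my own.

```python
def present_value(payment, years, rate):
    """Present value of a level payment received at the end of each year."""
    return sum(payment / (1.0 + rate) ** t for t in range(1, years + 1))

def break_even_rate(lump_sum, payment, years, lo=0.0, hi=1.0, tol=1e-8):
    """Bisection search for the rate where the annuity's present value equals the lump sum."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if present_value(payment, years, mid) > lump_sum:
            lo = mid  # annuity is still worth more, so the break-even rate is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical offer (NOT the actual program terms).
lump_sum, payment, years = 25_000.0, 4_000.0, 15
rate = break_even_rate(lump_sum, payment, years)
print(f"Break-even discount rate: {rate:.1%}")

# Anyone who can only borrow above the break-even rate is better off with the lump sum.
for borrowing_rate in (0.08, 0.20):
    pv = present_value(payment, years, borrowing_rate)
    choice = "annuity" if pv > lump_sum else "lump sum"
    print(f"At {borrowing_rate:.0%}: annuity PV = {pv:,.0f} -> prefer {choice}")
```

In this illustration, two people facing the identical offer make opposite choices purely because of the interest rate available to them, which is the mechanism the authors describe for groups facing credit-market discrimination.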

Zytkoskee, Strickland, and Watson (1971)

The abstract of this study literally says “No relationship was found between internal-external control or the status conditions and delay behavior.” In fact, they found no correlation (r = .09) between the Bialer Locus of Control Scale and their five-item measure of immediate- versus delayed- reward preference in their small sample of 76 Black ninth-graders. So this is antithetical to his point that black people have lower self control because there’s no correlation between gratification and a sense of powerlessness/helplessness in Black people.

and Price-Williams and Ramirez (1974) featured Mexicans, whites and blacks. The choices varied slightly and consisted of the option of $10 now or $30 in a month’s time, a 5 cents candy bar now or a 25 cents candy bar in a month’s time. There was little difference between the Mexicans and blacks, both of whom preferred the immediate reward — white children preferred the delayed reward at a higher rate.

The scope of this study is severely limited. It consists only of 180 fourth graders of disadvantaged socioeconomic status attending Catholic schools. The study itself cautions against generalizing:
The results were interpreted within the context of the school situation as it is recognized that universal generalization can be hazardous.
These studies suggest that distrust of the experimenter can explain the results, a flaw common to all of these delayed gratification studies. Per the study:
At any rate, we divided those children who spontaneously brought up the trust reason, as their primary choice against those who invoked any other reason. One-tailed 2 X 2 chi-squared tests, using Yates’ correction for continuity when necessary, were calculated for comparing the Anglo children with each of the other two ethnic groups where data was available. For Condition I the comparison between Anglos and Mexican-American was not significant, but the comparison between Anglos and Blacks was significant at the .005 level. Comparing the latter two groups for Condition III was also significant at the .005 level. Unfortunately, there was an error in collecting responses in Condition III for the Mexican-American children, so comparison with the Anglo children here was not possible. Nevertheless, the element of lack of trust among the Black children is striking, despite the fact that it was a Black investigator who was testing them.

In the mid-1990s, the U.S. government offered military personnel two options for when they retired: A large lump sum of money now (immediate reward) or a yearly payment (delayed reward) which, over time, will be more than the immediate lump sum of money. Warner and Pleeter (2001) looked at the data for 66,483 individuals and found that blacks were 15% more likely than non-blacks to take the immediate reward. Whites were .4% less likely than nonwhites to take the immediate reward.

Debunked above.

Castillo et al. (2011) had 82% of the student population of 4 middle schools in a poor Georgia school district. Subjects were asked if they want $49 now, or $49 + $x seven months from now. The x variable was positive and increased over time, so it would’ve been a lot of money. Black children had significantly lower control than white children.

This study is not measuring self-control; it is measuring impatient decisions.
It also covers only children at a few middle schools in a single county, so it cannot be generalized to the Black population as a whole. Again, nothing rules out distrust of the experimenter as the actual factor.
Also, looking at their regression model, their results are not robust across exogenous variables. The Black coefficient is small (0.08) and its standard error (0.04) is half its size, so the estimate sits right at the edge of significance at the .05 level. A 95% confidence interval built from these rounded figures has a lower bound at roughly zero, which is hardly a robust result.
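A quick way to see how fragile this estimate is: the sketch below uses only the rounded coefficient (0.08) and standard error (0.04) quoted above and assumes a normal approximation. The function name is made up for illustration, and the paper's unrounded values may differ.

```python
from statistics import NormalDist


def coefficient_summary(estimate, std_error, level=0.95):
    """z-ratio, two-sided p-value, and confidence interval under a normal approximation."""
    z = estimate / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    interval = (estimate - crit * std_error, estimate + crit * std_error)
    return z, p_value, interval


# Rounded figures as quoted above (assumption: the unrounded values are not known here).
z, p_value, (low, high) = coefficient_summary(estimate=0.08, std_error=0.04)

# Values that would also round to 0.08 and 0.04, to show the sensitivity to rounding.
z_alt, p_alt, (low_alt, high_alt) = coefficient_summary(estimate=0.075, std_error=0.045)
```

With the rounded inputs the ratio is 2 and the lower bound of the interval sits barely above zero. With the second set of inputs, which round to the same published figures, the interval includes zero. A result that flips on rounding alone is not one to build a racial theory on.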

Andreoni et al. (2017) examined a total of 1,265 children who were asked if they wanted an immediate reward at the end of the day, or a larger reward the next day. The child’s race was significantly related to their level of patience and black children had lower levels of self-control than the white and Hispanic children. These differences weren’t explained by early assignment to school or parent preferences. Controlling for SES and IQ again made the coefficient smaller, but this doesn’t mean that low self-control is a product of SES and IQ. Rather, low self-control can impede rising up the socioeconomic ladder.

Andreoni et al. (2017) list limitations that FB does not mention:
1. “It is unclear whether the ability to wait is increasing with age because time perceptions change with age (i.e., 1 day to a 3-year old feels “longer” than 1 day to a 12-year old) or whether the underlying time preference construct is changing. To disentangle these differences, future research should explore how changing the time delay affects willingness to wait by age. Future research should also explore the test-retest reliability of this measure.”
2. “It is unclear whether parent preferences are uncorrelated with child preferences, whether the measures that we use are the most appropriate for observing this correlation, or whether the preferences of children are simply difficult to measure. Our results are in line with Bettinger and Slonim (2007) who also found no correlation between adolescent and parent time preferences, but are at odds with Kosse and Pfeiffer (2012, 2013). Notably, we found no association in parent and child time preference using two different measures of time preferences: the standard economic time preference elicitation task, and the delay of gratification paradigm. We also found no association when constraining our sample to mothers only, as Kosse and Pfeiffer (2012, 2013) do. An interesting extension would be to systematically use alternative tests of parent preferences, such as a qualitative question with parents, to see if differences in methodology can partly explain the mixed findings in this literature.”
3. “Because our experiment was not initially designed to disentangle the causal impact of schooling on child time preferences, we only see a sub-set of children in our data who were also part of the CHECC randomization. Hence, while we do not see statistically significant differences in time preferences by treatment assignment, this could be due to a small sample size or due to sample selection. For instance, suppose that random assignment to a CHECC treatment group does causally affect child time preferences, but there is differential attendance at the experimental sessions based on child level of impatience, such that parents of more impatient control group children are less likely to attend than parents of more impatient treatment group children. Such a story would undermine our ability to find treatment effects. To address this, we conducted a wave of data collection in 2017 that assessed children in school. This allowed us to reach all of the children within one participating district, independent of parental involvement. But this wave occurred several years after the intervention, when the potential effects of the intervention on time preferences could have faded out. We believe that future work should continue to use exogenous variation in early childhood environments to better understand the causal impact of such variation on time preference development.”
4. “Another possibility is that early childhood education treatments are causally related to making mistakes in the decision task, which could result in non-monotonic decisions.”
The results could be due to the vast differences between the tasks and outcomes administered to children and parents.

As for IQ, this wouldn’t mean much given race differences in IQ are primarily genetic (see Jensen 1998)

Bird (2020) literally debunks Jensen’s claims: “Results presented here indicate that known biases from population structure, assortative mating, indirect genetic effects, gene-environment interplay, and derived allele frequency differences between African and non-African populations bias polygenic score analysis.” Bird then shows “that these biases likely produced false signals of polygenic selection in recent analyses, and that there are no signals of divergent polygenic selection between African and European populations. … I further show the predicted genetic contribution to the Black-white gap in IQ score across a range of heritability estimates was substantially smaller than observed phenotypic gaps, suggesting at least 80% of the IQ variance between Africans and Europeans is environmental in nature under an idealized ‘best case scenario’ for the hereditarian hypothesis.”

and self-control and IQ share genes that overlap (Polderman et al. 2009).

Which genes?

So, we are essentially controlling for genetic differences. Race differences in self-control levels can also be moderately explained by genetics, especially since self-control is under some genetic influence. Self-control is moderately influenced by genetics, as found by heritability studies. Heritability is simply the variation of a trait due to genes within a population. Multiple studies have found that self-control is influenced by genes and the rest is due to non-shared and shared environments. Beaver et al. (2008) looked at 80 high school students and 52 middle school students who took part in Add Health, a nationally representative sample of adolescents. Low self-control was measured from questions on the Add Health survey that measured self-control, drug use among peers was also measured, and so was maternal disengagement, maternal attachment, maternal involvement, and parental permissiveness. Race, age, and gender were held constant in their analysis. The overall heritability of self-control was at h2=0.56.

The reliability of this paper’s scores is questionable. The Cronbach's alpha values for the self-control scales were .65 and .62, well below the recommended .80 threshold. The authors themselves admit there is no agreed valid or reliable way of quantifying self-control: “Even though Gottfredson and Hirschi’s (1990) general theory has been one of the most empirically scrutinized theories in recent years, there is still a considerable amount of disagreement concerning the most reliable and valid way to measure self-control (DeLisi, Hochstetler, & Murphy, 2003; Longshore, Stein, & Turner, 1998; Longshore, Turner, & Stein, 1996; Marcus, 2003, 2004; Piquero & Rosay, 1998). Although Gottfredson and Hirschi are quite clear that people with low levels of self-control are risk seekers, are impulsive, are self-centered, prefer physical activities to mental ones, prefer simple tasks, and have a temper, there is wide variability in the scales that have been used to tap self-control. Perhaps the most frequently used measurement strategy is the scale developed by Grasmick, Tittle, Bursik, and Arneklev (1993). Unfortunately, the Add Health surveys did not include items that could be used to construct the Grasmick et al. scale.” They then use proxies for self-control that haven’t been empirically validated. Regardless, it’s still a self-report measure: questions are less clear, there is no correct answer, participants are sometimes unaware of the examiner’s expectations, motivation may vary considerably, and the examiner is usually interested in typical performance. Additionally, we see that race is a bad predictor of self-control heritability, as its coefficient has a large standard error and is statistically insignificant.

Meaning that 56% of the variance in self-control was due to genes.

That’s not even what heritability means.

Anokhin et al. (2011, 2015) found that the heritability of self-control increases with age, meaning that as one gets older genes start playing a larger role in self-control. The heritability of self-control at age 12 was .30, .51 at age 14, and .55 at ages 16 and 18. Isen et al. (2014) looked at twins who participated in the Minnesota Twin Family Study and measured their self-control through a computerized task, their intelligence through a short form of the WAISC-R, their socioeconomic level, and psychopathology. Their h2 estimate was found to be at .47. Recently, Willems et al. (2019) conducted a meta-analysis and found the heritability of self control to be .6.

It's a little strange that the heritability doesn't increase with age (i.e. in contrast with IQ). I'm not sure whether I trust this claim, since they say that early/middle childhood assessments mostly consisted of parent-reports, which they adjusted for with "multiple-moderator models". I'm not sure what that means, but so long as it's adjusting for source type, it seems like they're indirectly adjusting for age, and "heritability doesn't change with age after we (indirectly) adjusted for age" doesn't strike me as very interesting.

In conclusion, half of self-control can be explained by genes and races differ in self-control for genetic reasons and environmental reasons.

These are twin studies (which produce heritability estimates), and twin studies have been criticized for the many assumptions they make (e.g., the equal environments assumption is false). Saying things like "half of self-control can be explained by genes" runs into two problems: 1. "Self-control" is an action and is not reducible, so twin studies are useless for it. 2. This person has a shit understanding of what heritability implies. Heritability estimates don’t help identify particular genes or ascertain their functions in development or physiology, and thus, by this way of thinking, they yield no causal info. Additionally, heritability estimates do not estimate the relative weight of genetic and environmental influences in a population, and are misleading and potentially harmful when presented this way.
• Although heritability estimates are based on the assumption that genetic and environmental factors do not interact, they clearly do (see the model-fitting section below).
• Heritability is the property of a population, not of the characteristic or disorder itself.
• Heritability refers to the genetic contribution to behavioral variation in a particular population; it does not describe the importance of genetic factors as they relate to an individual.
• Heritability estimates apply only to a specific population, at a specific time, and in a specific environment.
• Heritability estimates are based on research methods that are unable to disentangle the potential influences of genes and environment on behavior, such as family and twin studies.
• The finding of high heritability within populations says nothing about whether genetic differences exist between different populations.
• High heritability, or even 100 percent heritability, does not mean that even simple environmental changes or interventions cannot have an important impact.
The difference between black and white people is still totally environmental. Feldman & Lewontin (1975) show that “this partitioning of the causes of variation is really illusory, and the analysis of variance cannot really separate variation that is a result of environmental fluctuation from variation that is a result of genetic segregation. The genetic variance depends on the distribution of environments and the environmental variance depends on the distribution of genotypes.” They go on to say, “the very existence of genotype-environment correlation precludes the valid statistical estimation of the genotypic, environmental, and interaction contributions to the phenotypic variance. That is because correlation makes it impossible to know how much of the phenotype similarity arises from similarity of genotype and how much from similarity of the environment. Thus in human population studies, where experimental controls are either impossible or unethical, statistical inference about the heritability of traits that are phenotypically plastic is invalid. We discuss these difficulties later from another point of view.” Estimates of heritability that are less biased by environmental confounds than these twin studies are more valid. For example, Morris et al. (2020) found that “Using data on educational achievement and parental socioeconomic position as an exemplar, we demonstrate that both heritability and genetic correlation may be biased estimates of the causal contribution of genotype.” Also see: https://zero.sci-hub.ru/1870/a9a17b94b5177fa5de11aa3f863585a3/bailey1997.pdf
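The point that high heritability within populations says nothing about differences between populations can be illustrated with a toy simulation. This is only a sketch under loud assumptions: the numbers are invented, genes and environment are simply added together and drawn independently (exactly the idealization the passage above criticizes), and both hypothetical groups get the same genetic distribution by construction.

```python
import random
from statistics import mean, pvariance

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_group(n, env_shift):
    """Phenotype = genetic value + environmental value; genes are drawn identically in every group."""
    genes = [random.gauss(0, 1.0) for _ in range(n)]
    env = [random.gauss(env_shift, 0.5) for _ in range(n)]
    phenotypes = [g + e for g, e in zip(genes, env)]
    heritability = pvariance(genes) / pvariance(phenotypes)  # share of variance tied to genes
    return mean(genes), mean(phenotypes), heritability


# Hypothetical groups (assumption): same gene distribution, environments one unit apart on average.
genes_a, pheno_a, h2_a = simulate_group(20_000, env_shift=0.0)
genes_b, pheno_b, h2_b = simulate_group(20_000, env_shift=-1.0)

phenotype_gap = pheno_a - pheno_b   # gap between the group averages
genetic_gap = genes_a - genes_b     # near zero, since genes were drawn the same way
```

In this toy setup the within-group heritability is about 0.8 in both groups, yet the roughly one-unit gap between the group averages is entirely environmental by construction. A high within-group figure therefore cannot, on its own, tell you where a between-group gap comes from.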

This attitude towards rewards can be described in a variety of ways: more rapid decay of reinforcement, unwillingness or inability to defer gratification, “extreme present-orientation” (Banfield 1974), impulsiveness, lower superego-dominance. In more crude terms, blacks are more impulsive than whites.

This is not what the book argues. Crime is a persistent and pervasive problem in many urban areas around the world. It is often seen as a symptom of deeper social and economic problems, such as poverty, inequality, and a lack of opportunity. In his book "The Unheavenly City Revisited," Edward C. Banfield argues that crime is, in fact, the result of a failure of government policies to address the needs of urban residents. According to Banfield, crime is a direct outcome of the social and economic disadvantages faced by many urban residents, and it is the responsibility of the government to address these underlying issues in order to reduce crime and create safer, more livable cities. Banfield's argument is grounded in his analysis of the causes of crime in urban areas. He contends that crime is often the result of a lack of social cohesion and a failure of government policies to provide for the basic needs of urban residents. According to Banfield, many urban residents feel disconnected from the larger community and are unable to access the resources and opportunities that they need to succeed. This can lead to frustration, anger, and a sense of hopelessness that can, in turn, fuel criminal behavior. To support his argument, Banfield points to a number of government policies that have contributed to the persistence of crime in urban areas. For example, he cites the failure of government welfare programs to provide adequate support to families in need, as well as the lack of access to quality education and job training programs. Banfield also argues that the high cost of housing and other necessities in urban areas can make it difficult for residents to make ends meet, leading to a sense of desperation and a greater risk of criminal activity. Overall, Banfield's argument in "The Unheavenly City Revisited" is compelling and well-supported by evidence. 
His analysis of the causes of crime in urban areas highlights the important role that government policies play in shaping the social and economic conditions of cities. By focusing on the needs of urban residents and addressing the underlying issues that contribute to crime, governments can work to create safer, more livable cities for all.

We shouldn’t expect racial differences in self-control to narrow as people age, especially since low self-control in childhood stays consistent up until adulthood (Casey et al. 2011). Race differences in self-control matter since they could explain a variety of racial disparities. For example, Banfield argues that the primary cause of black poverty is because the lower class person lives from moment to moment– they are unable or unwilling to take account of the future or to control their impulses.

See above.

Herrnstein and Wilson (1985) reported that poor blacks wanted to make a lot of money, but they left jobs if they were low paying while, ironically, saying that the work game is strong.

"Crime & Human Nature" has received criticism from scholars for a number of reasons, including its claims about the relationship between poverty, race, and crime. In the book, Wilson and Herrnstein argue that poor black individuals are more likely to engage in criminal activity because they are more likely to have certain personality traits, such as impulsivity and low self-control, which are correlated with criminal behavior. They also suggest that poor blacks may be more likely to engage in criminal activity because they are more likely to have low-paying jobs that do not provide enough financial reward for the time and effort invested in them. However, this claim has been challenged by other scholars who argue that it is overly simplistic and does not adequately account for the complex social, economic, and political factors that contribute to criminal behavior. Some critics have pointed out that poverty and race are themselves the result of structural inequalities and systemic discrimination, rather than being inherent or individual characteristics. Others have argued that the authors' emphasis on individual-level explanations for crime ignores the role of larger social and economic factors, such as unemployment, discrimination, and inequality, in shaping criminal behavior. In addition, some scholars have criticized the authors' use of research and data to support their claims, arguing that they selectively interpret and present evidence to support their arguments while ignoring or downplaying evidence that contradicts their views. Overall, while "Crime & Human Nature" has had a significant influence on the field of criminology and on public policy debates about crime and criminal justice, it has also been the subject of significant scholarly criticism.

W.J. Wilson also reported how blacks told ethnographers that their black unemployed friends were lazy; one person said that “many black males don’t want to work, and when I say don’t want to work, I say don’t want to work hard. They want a real easy job, making big bucks” (Wilson 1997).

First of all, these are just anecdotes. Second, the book literally says this right after: “The deterioration of the socioeconomic status of black men may have led to the negative perceptions of both the employers and the inner-city residents. Are these perceptions merely stereotypical or do they have any basis in fact? Data from the UPFLS survey show that variables measuring differences in social context (neighborhoods, social networks, and households) accounted for substantially more of the gap in the employment rates of black and Mexican men than did variables measuring individual attitudes. Also, data from the survey reveal that jobless black men have a lower ‘reservation wage’ than the jobless men in the other ethnic groups. They were willing to work for less than $6.00 per hour, whereas Mexican and Puerto Rican jobless men expected $6.20 and $7.20, respectively, as a condition for working; white men, on the other hand, expected over $9.00 per hour. This would appear to cast some doubt on the characterization of black inner-city men as wanting ‘something for nothing,’ of holding out for high pay.”

Lower self-control among blacks could partially explain why blacks are poorer than whites. Race differences in self-control can also help explain why blacks commit more crimes in the U.S. and all over the world (see Beaver, Ellis, and Wright 2009). Although risk-taking is sometimes beneficial, races engage in different risks. Blacks are more likely to engage in risky behavior such as smoking, not wearing a seat belt, and not engaging in proper hygiene (Hersch 1996;

This only applies to raw average differences, which is misleading because the comparison is univariate. The study repeatedly concludes that, after controlling for other variables, the differences aren’t substantial. In fact the study says:
Overall, the results reveal that much of the seemingly riskier choices of black men relative to white men is due to differences in characteristics. The gap in safety choices narrows considerably or even reverses in estimates controlling for individual characteristics.
For “smoking” the study says:
The percent of black men who smoke is 8.2 percentage points greater than white men. However, controlling for characteristics, the probability that a black man will smoke is only 2.5% higher than a white man with the same characteristics.
And right after, for “not wearing a seatbelt” the study says:
the racial gap in seatbelt use become insignificant in the probit estimates
The same holds for the female comparisons. Finally, for hygiene, Table 2 of the study shows that the gaps in brushing and flossing are statistically insignificant. After adjusting for other factors, Table 6 shows that black men have a 3.5% and 2.3% higher probability of flossing and brushing, respectively. While it shows marginally higher probabilities for white women than for black women, Table 3 shows that black women have similar correlations between seat belt usage & flossing (0.14 vs 0.13), seat belt usage & brushing (0.12 vs 0.12), flossing & brushing (0.18 vs 0.19), flossing & exercise (0.07 vs 0.12), and flossing & checking blood pressure (0.06 vs 0.03). In short, the study contradicts FB several times, including in its abstract and conclusion.

CDC; Lynn 2019). Given that self-control correlates with lifetime delinquency and income, two variables in which blacks and whites differ, then this could be a factor that can help explain some of the reasons as to why blacks commit more crimes and have lower incomes. Adjusting for other variables does not close the gap, and still leaves it open — showing that environmental variables are not the reason as to why blacks have lower self-control. Along with self-control, or lack thereof, another mediating variable to help explain the black-white crime/wealth gap is IQ.

If FB wants to convince the skeptics that IQ plays an important role in social outcomes, then FB will have to do a few things:
1. Show that the metrics of IQ are associated with the outcomes in the first place, using proper statistical techniques.
2. Show that the metrics of IQ are substantially associated with the outcomes, without using an endless list of statistical massaging techniques to boost the correlation.
3. Show that basic confounds do not cause the relationship (fine indices of socioeconomic status, other psychological variables, etc.).
4. Demonstrate causality using robust techniques (Mendelian randomization, instrumental variables, regression discontinuity, difference-in-differences, etc.).
5. Identify the mechanism that connects IQ scores and the social outcome.
6. Demonstrate that the ergodicity assumption holds (Fisher et al. 2018).
Thus far, not a single one of the outcomes IQ is allegedly associated with has passed step 2 (very few have passed step 1), let alone all 6.

As has been noted below and in other places, races do differ in mean intelligence. Whites, on average, have an IQ of 100 and blacks have an IQ of 85. This view is not heretic, and has in fact been supported in the overall scientific literature. Shuey (1966) in The Testing of Negro Intelligence reported on 382 studies involving 80 different tests administered on hundreds and thousands of black and white children, high school and college students, military personnel, civilian adults, deviates, and criminals. The average black IQ score in these studies were a bit below 85, and the average for whites was also a bit above 100. The average black-white difference was always close to 1 SD.

FB does not seem to be caught up with newer developments. This is from 1966, so it’s not hard to debunk. First off, Shuey assumes race exists without ever defining it (something other hereditarians do as well). It is often said that hereditarians are not necessarily discussing the biology of race, but that is irrelevant because the evidence they cite is biological. Regarding the Black-White IQ gap, current evidence goes against the 15-point gap purported by Shuey (Smith 2018). The admixture studies provided may seem useful, but the actual claims conflict with a basic understanding of evolution.
https://notpoliticallycorrect.me/2017/04/09/the-evolution-of-human-skin-variation/
His discussion of Binet is interesting considering Binet's intentions: https://notpoliticallycorrect.me/2020/12/02/binet-and-simons-ideal-city/
and flaws:
https://notpoliticallycorrect.me/2020/01/11/the-frivolousness-of-the-hereditarian-environmentalist-iq-debate-gould-binet-and-the-utility-of-iq-testing/
The most crushing evidence against the material in those 600 pages of nonsense would be its violations of Berka-Nash and GxE, which obviously were not known then. The book is not treated as damning evidence today because it is very old and, even then, it tested a gap that has since decreased. Of course a study from 1966 will show you a 1 SD difference. Even so, the origin of this difference is in question. Arguing in favor of g opens another debate, because we then have to answer questions about the causal nature of the latent variable. Moreover, IQ tests don't measure intelligence, as there is no unified theory behind "measuring" intelligence. Besides, scholars like Graham Richards have criticized Shuey: “1. Twenty-two percent of her references (122 of 555 by my count) are to unpublished material, mostly masters’ and doctoral theses. Virtually all of these are among the 380 studies providing the data, thus accounting for almost a third of them, 70 date from the 1941–60 period (16 from 1961–5). Many emanate from Southern colleges and universities, including black universities such as Fisk and Howard. Insofar as they are Southern in origin a question mark must hang over their acceptability. Work emanating from Deep South universities and authored by white post-graduates in their early twenties in, for example, the 1940s was itself a product of that region’s racist segregationist culture. In the absence of evidence to the contrary we must assume the young authors’ orientation (and mode of relating to African American subjects) still to be deeply pervaded by the prevailing attitudes, values and assumptions of the cultural context in which they lived. Black-authored work from this region (generally intra-racial in character and included insofar as it provided black performance data) will also reflect the realities of this culture (and usually did not claim otherwise).
Garrett, in his introduction, praises the recent research which succeeds in ‘equating background variables’ (p.vii). But how could this be done in a Deep South culture which systematically ensured, as a central matter of cultural principle, precisely that these were kept unequatable? Given the decline in published research on the topic during the 1930s, Shuey had, however, few alternative sources of recent research to these unpublished theses. Of 76 studies cited for 1941–50 45 (59%) were of this nature. While citation of unpublished theses is not in itself an academic sin—on the contrary, it can boost an author’s cloister-credibility—the excessive scale on which Shuey does so and the uncritical use she makes of them as reliable primary data sources is one of the work’s major shortcomings. 2. Pursuing this line of bibliographic critique we note that 191 (34%) of the references are to material from 1940 or earlier (but not all to race differences studies as such). While she is neither uncritical nor entirely undiscriminating regarding the quality of these studies, they nevertheless provide Shuey with a substantial proportion of her data and are incorporated without much ado into her various summarising ‘meta-analyses’. Yet, as we have already seen, this earlier work had failed to convince the discipline at large by the 1930s and was vitiated by numerous methodological and conceptual flaws. These two points, taken together, indicate that the calibre of much of the data Shuey is drawing on is very poor or must be assumed to be so. If unpublished and pre-1941 data were excluded from consideration her case would be considerably weakened. Adding bad data to good, even if it is consistent with it, does not strengthen the latter. Nor does the fact that the same findings emerge from repeated use of flawed methods render them less flawed. 3. Shuey overstates the ‘remarkable consistency’ of the findings. 
They are consistent to the extent that they invariably show African American underperformance, but certainly not consistent regarding the extent of this. 4. She fails to notice that successive restandardisations of several intelligence tests required setting the mean ‘100’ score at relatively higher levels. She therefore claims that ‘Negro’ intelligence has remained static throughout the period covered. If, in fact, the gap in scores has been static it must however imply that black performance has improved in parallel with white performance. 5. Finally it must be stressed that Shuey’s resolutely empirical approach enabled her to dodge any in-depth consideration of underlying conceptual and theoretical problems. Since these are considered at length in the next chapter it is only necessary here to note that Shuey’s position is fully vulnerable to the points which will be raised there. The solidity of Shuey’s compendious text is thus none the less illusory. It relies on the cumulative impact of presenting masses of data and findings while largely disregarding the problematic nature of much of this and ignoring conceptual or theoretical questions which cannot be answered by empirical data alone. Historically it was the most comprehensive and forcefully articulated statement of the traditional ‘race differences in intelligence’ position prior to the Jensen controversy, and served as a primary reference point for succeeding pro-differences researchers. Its wider impact appears to have been relatively limited even so, there was little essentially new in it and many in the anti-differences camp felt disinclined to plough yet again through evidence they believed had long been proved valueless.”

Lynn (2011) reviewed hundreds of studies looking at race differences in IQ, and the black-white IQ gap was always 1 SD.

Lynn is actually from 2006. This book is a frightening example of how a European author with skills of academic presentation can argue any case by selectively ignoring vast areas of research on the roles of individual biological variation, cultural traditions and biases in psychological testing, and by creating conceptual entities from unreliable observational phenomena. This is dangerous because, in the past, similar arguments have confirmed racist political and layperson attitudes, and at their extremes resulted in the holocaust and apartheid. Critics were also quick to point out that Lynn's analyses had a deeply flawed sampling from Africa, resulting at least in a bias against African countries and it showed signs of measurement bias from the IQ tests themselves (Wicherts et al., 2010). The evolutionary reasoning has also been critiqued by research that casts doubt on the validity of the “Cold Winters theory” (MacEachern, 2006; Pesta & Poznanski, 2014; Wicherts et al., 2010). Continuing this legacy of evolutionary explanations for racial theories, recent genomic analyses (Piffer, 2015, 2019) claim to provide strong genetic evidence in support of natural selection using polygenic scores derived from GWAS in European populations. There are independent lines of evidence that genetic differences at variants associated with EA and CP are consistent with neutral evolution instead of divergent positive selection. First, the fact that education-and-cognitive-performance-associated alleles do not show more genetic differentiation than control SNPs that are not associated with these traits is demonstrated. Second, I test for polygenic selection using polygenic scores computed from within-family effect sizes that minimize the confounding biases mentioned above (Berg & Harpak, 2019; Sohail & Maier, 2019) and did not find a signal of divergent positive selection. Although there is more noise in within-family effect size estimates, Cox et al. 
(2019) were able to detect signals of polygenic selection for height in a sample of ancient genomes using within-family effect sizes and between-family effect sizes, which suggests that despite the greater noise in within-family estimates, they are still capable of detecting polygenic selection. Additionally, the results presented here build upon the failure of Guo et al. (2018) to find significant genetic differentiation of a different set of education-associated SNPs compared with control SNPs, and the failure of Racimo et al. (2018) to find evidence of divergent selection for educational-attainment-associated SNPs between African and European populations.

Roth et al. (2001), which was a large meta-analysis which included more than 6,000,000 individuals, found that blacks score 1 SD lower than whites.

First of all, different populations have different variances, even different skewness, and these comparisons require richer models. These are severe, severe mathematical flaws (a billion papers in psychometrics wouldn’t count if you have such a flaw). The formal treatment is here. The Roth et al. study also had several limitations that FB does not mention. 1. “We found relatively few industrial samples, although the values of N were often large. We speculate that this is partially due to the diffuse nature of the literature on ethnic group differences. We found studies in a variety of fields and journals. The relatively small number of industrial studies led to somewhat large confidence intervals. We note that the confidence intervals in many of our moderator analyses did overlap. For example, the confidence intervals associated with low and medium complexity jobs overlapped considerably. Although our focus was on obtaining the best mean estimates in many cases, we do note this limitation.” 2. There was an “influence of studies with a large sample size, in that they had a substantial effect on the results of the meta-analysis. For example, the GRE is associated with a large Black-White d score in the overall and educational samples. In addition, studies using the Wonderlic contributed a large portion of data and may have a large influence on our results. When appropriate, we analyzed data with and without such large samples.” 3. “There may be a number of latent variables associated with our moderators. For example, there may be some socioeconomic variables that correlate with job complexity which partially obscure the interpretation or causality of the exact effect of job complexity on standardized group differences. We encourage basic research into this issue below.” 4. “We were unable to assess the influence of time on standardized ethnic group differences. 
A significant body of research has suggested that average scores on mental abilities are rising and this trend may narrow the Black-White group difference (e.g., Flynn, 1999). This research is not without its methodological problems (e.g., Jensen, 1998) or data contradicting it (Nyborg & Jensen, 2000). Although we had originally coded the date of publication in our meta-analysis, we found that there was such a large influence of extraneous factors such as varying sample sizes by time, various tests across time, and so on, that we simply did not put much faith in this analysis. Instead, we tried to control for the influence of time by choosing the most recent studies when there was an option. For example, we chose to include only the last few years of tests such as the SAT, GRE, and ACT because they have been revised to reduce ethnic group differences and they provide the most recent data available. Within our analyses we did find three longitudinal studies that addressed this trend using the same test(s) across time. Without devoting a great deal of time to this debate, we refer the interested reader to the following sources (Lynn, 1998; Nyborg & Jensen, 2000; Wonderlic & Wonderlic, 1972). As a whole, these studies suggest that there are observed gains for both groups, but the reduction in the between-group difference is either small, potentially a function of sampling error (Lynn, 1998), or nonexistent for highly g loaded instruments (Nyborg & Jensen, 2000).”
They cite Dickens & Flynn 2006 (not the original study, but a response they published in 2006)
http://www.iapsych.com/iqmr/fe/LinkedDocuments/dickens2006b.pdf
where they review Rushton and Jensen's (2006) claims. On Roth et al., they say: "If GRE (Graduate Record Exam) results are treated as a single source, almost 60% of the studies Roth et al. analyzed refer to pre-1980 data. As for the gap of 1.1 standard deviations, the median age in the meta-analysis of Roth et al. would not be under 24. Our Figure 3 projected to age 24.7 gives a current IQ for Blacks of 83.5, or exactly 1.1 standard deviations below Whites." They continue: "Rushton and Jensen quote Roth et al. (2001) as concluding that there has been no Black gain. However, Roth et al. explicitly stated that their own data left them ‘‘unable to assess the influence of time on standardized ethnic group differences’’ (p. 323). Instead, they directed the reader to three sources that they thought might be illuminating: Lynn (1998), which is a study of vocabulary scores and not IQ; Wonderlic data, which we have already analyzed in Appendix B; and Nyborg and Jensen (2000), in which there is no attempt to measure trends over time and which Jensen himself has not cited against us. If Rushton and Jensen wish to make a case based on analysis of these three sources, they should do so. Citing the conclusion of Roth et al. is simply an appeal to authority, and to imply that the conclusion of Roth et al. is based on the data that they analyzed is unhelpful." And what about admixture? Jensen's hypothesis is that differences in the frequency distributions of genetic variants are causally related to the Black-white test score gap, but anything offered to prove it still assumes that the gap can be divided into a genetic component (G) and an environmental component (E), which is false. Admixture studies, as an indirect way of proving the hereditarian hypothesis, only lead back to this division.
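To make the Dickens & Flynn figure concrete, here is a minimal sketch of the arithmetic behind "83.5, or exactly 1.1 standard deviations below Whites." It assumes the conventional IQ scaling (reference mean of 100, standard deviation of 15), which the quote implies but does not spell out; the function name `gap_in_sd` is made up for illustration.

```python
def gap_in_sd(group_mean, reference_mean=100.0, sd=15.0):
    """Return how many standard deviations group_mean lies below reference_mean.

    Assumes the conventional IQ scale (mean 100, SD 15); both defaults are
    assumptions, not values taken from Roth et al.
    """
    return (reference_mean - group_mean) / sd


# Dickens & Flynn's projected score at age 24.7
print(round(gap_in_sd(83.5), 2))
```

The point is only that the 1.1 SD figure is tied to a specific age (24.7) on their projection, so it says nothing by itself about whether the gap has changed over time.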

Chuck (2013) looked at 100 years of testing done on black intelligence, and every study looked at found lower intelligence among blacks.

Dickens and Flynn's response to Roth et al. debunks this. Shuey's 1966 work has already been addressed. The crux of their argument is Spearman's hypothesis and the Black-white IQ gap, both of which have several problems.

Even The National Academy of Science reported that, “Many studies have shown that members of some minority groups tend to score lower on a variety of commonly used ability tests than do members of the white majority in this country. The much publicized Coleman study provided comparisons of several racial and ethnic groups for a national sample of 3rd, 6th, 9th and 12th grade students on tests of verbal and nonverbal ability, reading comprehension, mathematics achievement, and general information. The largest difference in group averages usually existed between blacks and whites on all tests and at all grade levels. In terms of the distribution of scores for whites, the average score for blacks was roughly one standard deviation below the average for whites. Differences of approximately this magnitude were found for all given tests at 6th, 9th and 12th grades… The roughly one-standard deviation difference in average test scores between blacks and white students in this country found by Coleman et al. is typical of results of other studies” (Garner and Wigdor, 1982). Similar comments were made by the American Psychological Association, which noted a 1 standard deviation gap in IQ after the release of the 1994 book The Bell Curve (in Neisser et al. 1996).

Again, another outdated source with the standard “we did a g measurement and also muh neurology” crap, which runs into the same old problems with both philosophy of mind and circularity. Additionally, most of the issues stem from the fact that this was written in 1996. For example, they mention the dominance of factor models, but Savi and van der Maas (2021) propose, in introducing their theory, that network models will become dominant. Further on, they state that critics of IQ don't dispute the fact that IQ scores "predict certain forms of achievement--especially school achievement--rather effectively." But this is obviously false today. Their claims on predictive validity (ignoring violations of Berka-Nash, aka the whole no-theory-behind-IQ problem, etc.) are inaccurate. The first issue with these correlations is that most are upward adjustments from uncorrected correlations ranging from .10 to .20. There are issues with the corrections for restriction of range and for measurement error. Richardson & Norgate present a value far more damning of the predictive validity of IQ here, for example. Whatever correlations remain are explained away by institutional practices https://www.sciencedirect.com/science/article/abs/pii/S0191308510000043
and education
https://journals.sagepub.com/doi/abs/10.1177/0956797618774253
Next, their invocation of Haier, and anything related to brain correlations, we have already shown to be bunk. The remaining purported correlations (whether with academic or economic success) can be explained by the fact that the tests are constructed that way (similar to the argument that IQ is NOT distributed by a bell curve). For the final point read
https://notpoliticallycorrect.me/2017/11/23/iq-test-construction-iq-test-validity-and-ravens-progressive-matrices-biases/
https://notpoliticallycorrect.me/2019/11/15/the-history-and-construction-of-iq-tests/
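To show how much the "upward adjustments" mentioned above can move a correlation, here is a minimal sketch of the two standard textbook corrections: Spearman's correction for attenuation (measurement error) and Thorndike's Case II correction for direct range restriction. The input values (a raw correlation of .20, reliabilities of .80 and .60, an SD ratio of 1.5) are invented for illustration and are not taken from Richardson & Norgate or any other study cited here.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation due to measurement error."""
    return r_observed / math.sqrt(reliability_x * reliability_y)


def correct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    sd_ratio is the unrestricted SD divided by the restricted SD.
    """
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1.0 + r_restricted**2 * (u**2 - 1.0))


# Hypothetical inputs, chosen only to illustrate the size of the adjustment.
raw_r = 0.20
after_reliability = disattenuate(raw_r, reliability_x=0.80, reliability_y=0.60)
after_both = correct_range_restriction(after_reliability, sd_ratio=1.5)

print(round(raw_r, 2), round(after_reliability, 2), round(after_both, 2))
```

With these made-up inputs, the raw .20 roughly doubles once both corrections are stacked. The corrected figure is only as good as the assumed reliabilities and SD ratio, which is why the choice of those inputs matters so much.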

Since races differ in IQ, this could also help explain the high rates of black crime since IQ correlates with criminality.

Actually, IQ differences between whites and black people are within 5-10 points of one another and within one standard deviation of the national mean, or otherwise what is considered "average." Smith 2018 even found it to be decreasing. There's also a lot of overlap between the two groupings.
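The overlap point can be put in numbers. This is a rough sketch that assumes both groups' scores are normally distributed with the same standard deviation of 15, which is a simplification (as noted earlier, real populations can differ in variance and skewness). Under that assumption, the overlapping coefficient of two normal curves whose means differ by d standard deviations is 2Φ(-d/2). The function names are made up for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_overlap(mean_gap, sd=15.0):
    """Overlapping coefficient of two equal-variance normal distributions.

    mean_gap is the difference in means in IQ points; sd=15 is an assumption.
    """
    d = abs(mean_gap) / sd
    return 2.0 * normal_cdf(-d / 2.0)


for gap in (5, 10, 15):
    print(gap, round(normal_overlap(gap), 3))
```

Even at a full 15-point gap, well over half of the area under the two curves is shared, and the shared area grows as the gap shrinks.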

Sources for the correlation between IQ and crime can be found here.

By Zeke Groth? First, the problems of IQ cannot be sidestepped. Now those are some weak correlations. Anyway, there's some report of the causality. “Lynam, Moffit, and Stouthamer-Loeber (1993) argue low IQ is a solid, causal predictor of delinquency. This is done through specific procedural measures such as using younger boys, so as to avoid the effect of prison lifestyle on intelligence, as well as controlling for test motivation. The latter procedure is done to combat the hypothesis that the delinquency-IQ correlation can be mediated by the fact that delinquents do not seek to do well in life and will not care about their results on a test. Additionally, multiple studies have shown that the IQ of delinquents was low before said individuals became delinquent (Denno, 1990; Moffit et al., 1981; West and Farrington, 1973). The present study used self reports to measure delinquency in the boys and controlled for social status, race, and test motivation to ensure the correlation remained regardless of these variables. A correlation of r=-0.22 is found for FSIQ and delinquency. Impulsivity mediated relatively little of the relationship; school achievement did not have any effect on the relationship for whites whereas it mediated the association for blacks.” But this isn't enough. To prove causality, we need an idea of what IQ precisely is (we don't), and a theory of how this contributes to crime rates.
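To see just how weak a correlation of r = -0.22 is, square it: the result is the share of variance in one variable that is linearly associated with the other. A minimal sketch (the function name is made up for illustration):

```python
def variance_explained(r):
    """Return the proportion of variance shared by two variables with correlation r."""
    return r ** 2


share = variance_explained(-0.22)
print(f"{share:.1%} of the variance in delinquency is associated with FSIQ")
```

That is under 5%, before asking anything about causality or about what else is correlated with both variables.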

The reason group differences in intelligence matter is because IQ is negatively correlated with criminality. In 1914, the role of intelligence in crime was brought to attention by H.H. Goddard (1914) who found that the majority of people in prison were mentally deficient.

That book was actually an analysis of investigations by field-workers into the history of three hundred and twenty-seven families represented in the Vineland training-school for defectives. It’s hard to take that book seriously given that it was written by a eugenicist and segregationist. In any case, in 1931 E. H. Sutherland challenged this prevailing view. He compared the IQ scores of adult offenders to those of army draftees—representative of the general population—and the two groups had nearly identical IQ levels. He concluded that intelligence was not a "generally important cause of delinquency" (p. 362).

After this point though, the relationship between intelligence and criminality was not only ignored, but unfairly rejected (see Hirschi and Hindelang, 1977). This was because of a paper by Edwin Sutherland (1931) titled “Mental Deficiency and Crime” where Sutherland argued that the cause of the association was because of poor testing conditions. He showed that as testing procedures became better, the correlation began to diminish itself. But, he wrongly assumed that over time, it would completely disappear. Hirschi and Hindelang (1977) re-initiated the crime-IQ debate with a paper titled, “Intelligence and Delinquency: A Revisionist Review”. This paper fought against the sociological bias against the role of IQ in criminal behavior and adult delinquency and pulled from multiple studies available at the time to prove it is unjustified. Some of the most compelling data is explained in the following paragraph (all studies cited in the following paragraph can be found in Hirschi and Hindelang [1977]).

Lynam et al. (1993) list three possible explanations for the relationship that have been put forth by theorists throughout the literature. (1) A spurious relationship caused by a third variable correlated with both IQ and delinquency/crime, such as: differential detection by level of IQ, where low-IQ youth have higher reported rates of delinquency because they are more likely to be caught by the authorities (Feldman 1977; Stark 1975; Sutherland 1931); confounding by socioeconomic status and other structural variables (Pfohl 1985); or confounding by other test-related variables like test motivation (Tarnopol 1970). (2) A causal relationship from IQ to delinquency/crime, such as: IQ causing academic performance, which is then related to delinquency through its associations with social bonds (Maguin & Loeber 1996; McGloin et al. 2004; Ward & Tittle 1994); some sort of biological relationship between intelligence and crime, wherein smaller brains are less intelligent and also more prone to commit crime (Ellis & Walsh 2002), or a hormonal mediation (Ellis 2005), or r/K life history theory (Rushton & Whitney 2002); or some sort of evolutionary relationship (Kanazawa 2010). (3) A causal relationship from delinquency/crime to IQ, such as the dangerous lifestyle delinquents engage in causing them to have lower IQs (Hare 1984; Shanok & Lewis 1981). Moffitt & Silva (1988) claim that the differential detection hypothesis is not supported by the data, but more recent and representative data seem to show that it is supported to a certain extent (e.g., that it can account for only part of the relationship) (Yun & Lee 2013), which has been replicated with more controls (Boccio et al. 2018; Yun et al. 2013). Moreover, in a reanalysis of the National Longitudinal Study of Youth in a review of Murray and Herrnstein’s The Bell Curve, Cullen et al. (1997) report that differential detection does occur in this cohort. 
The relationships between lead and intelligence (Reyes 2012) and between lead and crime (Aizer & Currie 2017; Bellinger 2008; Marcus et al. 2010; Nevin 2007; Olympio et al. 2009; Reyes 2014; Reyes 2015; Stretesky & Lynch 2004) have both been well established (Brady 1993). Other environmental exposures like pollution have also been posited to contribute to both IQ (Zhang et al. 2018) and crime (Herrnstadt & Muehlegger 2015), and there have been demonstrated moderators of the relationship (Bellair & McNulty 2009). Other research has demonstrated that following the inclusion of a more robust set of structural controls, IQ contributes to less than 5% of the variation in delinquency (Menard & Morse 1984). Moreover, IQ is usually one of the smallest contributors to overall variance, explaining less than 1% in many meta-analyses (Cullen et al. 1997). Even more, other research has also shown a lack of a longitudinal correlation following the adjustment of the relationship for confounding factors (Fergusson, Horwood & Ridder 2005) [4], indicating that more research should be done to more robustly test hypotheses of confounding.

First, they re-analyze data from Hirschi (1969) which had a sample of 3,600 boys from Contra Costa County, California. They find a gamma correlation of -0.31.

There are various problems with Hirschi 1969. The conclusions have slender empirical bases at best. An example will illustrate some of the weaknesses which are serious enough to damage his findings. He concludes that the control theory of delinquency he advocates overestimates the significance of involvement in conventional activities (p. 230). This was evidently influenced by his finding of a weak relationship between delinquent acts and employment while attending school (pp. 188-99). In Table 70 he lists the percentage of respondents committing none, one, or more self-reported delinquent acts, depending on whether they report currently working for pay. There is no attempt to examine this relationship by age or grade in school despite the well-known relationship between labor force participation and age. Failure to separate junior high students from senior high school students and to differentiate students on the basis of intentions to go on to college or to commence work could mask the strength of the relationship between employment experience and delinquency. This oversight is surprising since elsewhere in the book the author stresses the importance of the transition from adolescent to adult status and of educational expectations. Table 70 is the only analysis of the relationship between delinquency and the respondent's employment status, although the questionnaire asked 31 questions about the individual's work, income, and attitudes toward work. In addition, while the author found no relationship between delinquency and socioeconomic status, as measured by the father's occupation, he did find a relationship between delinquency and the unemployment and welfare history of the father. It is not clear why the employment history of the father would be more important than the employment experience of the adolescent. What about the impact of race on the results reported in Table 70? Is the sample based on white boys only? 
In some tables throughout the book, the results are reported by race, but in many of the tables in Chapters 6-11 it is not clear what the nature of the sample is. In Chapter 4 Hirschi states the composition of the original sample by race and by sex. One is left with the general impression that he has not satisfactorily reconciled this finding and the suggestion that there is little evidence of social class bias in official response to delinquent conduct (p. 68, footnote 7) with the social and economic position of children processed by the system of justice. The importance of social class is further clouded by the exclusion of black people from the analysis. Despite evidence in his own data suggesting that blacks, in general, would be found at the lower end of a distribution by income (or some substitute for a measure of income), Hirschi removes them from the analysis and finds after thus truncating the range of variables used to measure economic position that there is no relation between delinquency and social class. This restriction of the analysis primarily to white boys is apparently a reflection of one weakness in this type of self-report procedure: the problems some children have in completing the forms. This general inadequacy of the self-report procedure, combined with problems it raises for the definition of delinquency, may provide part of the explanation for discrepant findings on the importance of social class for delinquent conduct. Measures of delinquency which, in effect, equate an admission of one petty larceny with one arrest for strong-arm robbery would appear to conceal more than they reveal. Throughout the book the results are discussed in terms of boys. It is never mentioned what happened to the girls, how the results vary by sex, or what theoretical implications such results may have. There is also no attempt to assess the statistical significance of the relationship reported in Table 70. This is characteristic of the book. 
Tabular analysis is the principal analytical technique. There are 100 tables in the book and only 7 multiple regressions. No F-distribution statistics are reported for the regressions. More distressing is the failure to report t-distribution statistics or standard deviations of the estimates of the partial regression coefficients of the individual variables in the multiple regressions. Consequently, we are at sea regarding the significance of the results. Hirschi goes too far when he rejects notions of culturally based rule violation and structural pressures toward deviance on the basis of his analysis of the answers of white male high school students in one California county.

Wolfgang et al. (1972) uses criminal data from 8,700 boys and splits them into groups by IQ. There was a clear association with a gamma correlation of -0.31 for whites and -0.16 for blacks.

Wolfgang isn’t generalizable because it’s from a single Philadelphia birth cohort from 1945. It’s retrospective in nature, so there is frequently an absence of data on potential confounding factors, since the data were recorded in the past. It’s difficult to identify an appropriate exposed cohort and an appropriate comparison group. Differential losses to follow-up also bias this study. They do not present measures of association for these IQ scores and delinquency, nor do they show tabular material in which IQ is treated as an independent variable. There’s no direct comparison with social class. Regardless, the analysis included highest grade completed and number of school moves, variables which account for the bulk of the explained variance in the measure of delinquency. Thus, race places third behind these school variables, and IQ accounts for virtually nothing. So this study suffers from several problems. First, the use of the gamma correlation coefficient to evaluate the association between IQ and race is inappropriate. The gamma correlation coefficient is a measure of association that is typically used when both variables are ordinal (i.e., ranked in some way, such as "low," "medium," and "high"). It is not appropriate to use the gamma correlation coefficient when one or both variables are continuous, such as IQ scores. Instead, a Pearson's correlation coefficient or a Spearman's rank-order correlation coefficient would be more appropriate. Another issue is the potential for confounding variables in the study. Confounding variables are variables that are related to both the independent variable (in this case, race) and the dependent variable (IQ). For example, if the study did not control for socio-economic status, it is possible that the observed association between race and IQ could be due to the fact that some racial groups are more likely to have lower socio-economic status, which could in turn be related to lower IQ scores. 
It is also important to consider the possibility of bias in the study. Bias can occur at various stages of the research process, from the way the sample is recruited and selected, to the way the data is collected and analyzed. For example, if the study relied on self-reported IQ scores, there could be a bias if some participants were more or less inclined to accurately report their IQ scores.
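For readers unfamiliar with the statistic, here is a minimal sketch of how Goodman and Kruskal's gamma is computed for two ordinal variables. It counts concordant and discordant pairs and ignores ties, which is one reason it tends to come out larger than other rank correlations on coarsely binned data. The function name and the toy rankings in the usage lines are invented for illustration and have nothing to do with Wolfgang's actual data.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Goodman and Kruskal's gamma for two equal-length lists of ordinal codes.

    gamma = (C - D) / (C + D), where C and D are the numbers of concordant
    and discordant pairs. Tied pairs are dropped entirely.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Toy ordinal codes (1 = low, 2 = medium, 3 = high); purely hypothetical.
print(goodman_kruskal_gamma([1, 2, 3], [1, 3, 2]))
```

Because continuous scores have to be cut into bands before gamma can be used, the result also depends on where the cut points are placed.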

West (1973) uses data from over 400 boys from London. They found that one quarter of people with an IQ of 110 or higher had a police record whereas one half of people with an IQ of 90 or less had a police record. He concludes that IQ was a significant predictor of delinquency and that it survived as a predictor after controlling for variables such as family income and family culture.

West’s research design is patently flawed. See, for example, the authors' own comments about the subjective biases of the psychiatric social workers who gathered data on family backgrounds (pp. 52-53) and their exclusion of preferred analysis techniques owing to insufficient numbers in certain categories (p. 150). Likewise, the authors exercised prudence in discussing the issue of prediction. Although their data suggest that certain factors may be associated with vulnerability to delinquency, they are careful to point out that "this degree of predictability did not make it possible to forecast with certainty that any individual boy would become delinquent" (p. 190). Moreover, the authors provide an extended discussion of those cases, the majority, which either became delinquent against "prediction" or resisted delinquency against a forecast of vulnerability (pp. 136-150). This is an important contribution, often overlooked in research of this type. Several of the researchers' methodological constructions cast some doubt on certain substantive interpretations. It is unclear, for instance, why in constructing chi-square matrices the “worst quarter” (in terms of measured adversity) was consistently compared with the mean of the rest of the sample. This technique, while dramatizing differences, excludes the possible recognition of similarities between "the worst" and the middle range of the population, or of curvilinear patterns for certain variables. Additional problems are discussed in Pfohl’s 1975 review.

Toby and Toby (1961) showed “intellectual status” was a significant predictor of delinquency/non-delinquency, regardless of socioeconomic status.

From this site, we see that this isn’t even peer reviewed. The measures were flawed. A boy was considered delinquent if his friends had a record of delinquency. A boy with no record was assumed to be a delinquent who had not been caught. So they attributed a characteristic to their sample without any sufficient proof.

Finally, Hirschi and Hindelang report on data which shows that even self-reported criminal behavior correlates with criminality. Spellacy (1977) looks at 40 violent and non-violent adolescent males and tests them on neuropsychological tests and on the MMPI scale. They tested the group on verbal IQ, performance IQ, and full-scale IQ (FSIQ). On FSIQ, there was a 12.4 point difference between violent and non-violent adolescents. The results were consistent across other tests of mental ability.

These results aren’t conclusive at all. They’re on white people and they don’t account for socioeconomic status. In fact there’s no sort of regression to see if this is even a relationship.

Similar differences are analyzed by Holland, Beckett and Levi (1981)

The validity of this early study was compromised by their use of self-report measures to assess psychopathy.

and Holland and Holt (1975). Mears and Cochran (2013) used the NLSY data of white men and their AFQT scores and created an index of different forms of delinquency and how much the participants committed those forms of delinquency. They controlled for additional measures, like Cullen et al. in order to refine the correlation to AFQT scores as much as possible. But, instead of purely relying on regression analysis (as they explain it has issues for curvilinear data), they used GPS analysis. First, they provide the bivariate analysis results which show that lower IQ people tend to commit more crimes, but once you are looking at people in the IQ range of 77-88, the propensity for crime drops off. Essentially, this implies an inverse U-shaped model. Then, when testing the relationship through GPS analysis, they do find an association where people in the 90’s IQ range commit the most crimes, but people below that and above that commit less crimes.

So this paper literally says “the distribution of confounders, especially SES, may limit the ability of statistical approaches to arrive at unbiased estimates of IQ effects.” So this paper actually shows the effect of IQ is heavily dependent on socioeconomic status and that it cannot be easily controlled away, with many methodological considerations being at play. Indeed, there is evidence that the small relationship is mediated by well-being, substance abuse, and other confounding factors that prohibit simple causal interpretation (Freeman 2012). “In trying to understand how to deal with the crime problem, much of the attention now given to problems of poverty and unemployment should be shifted to another question altogether: coping with cognitive disadvantage” - Murray & Herrnstein, the authors of The Bell Curve, which has been heavily criticized and discredited. They also cite The Bell Curve to justify their use of the AFQT; however, again, it has been criticized to no end. For example, using the AFQT as a proxy for human capital and cognitive skills is a problem because measurement error exists for any latent variable. Anyway, the AFQT they use as their measure of IQ is assembled from a simple unweighted average of four of the achievement tests. The AFQT is the most commonly used combination of the tests used by the military to predict performance. This test is not the same as the g that can be extracted from the full battery of 10 tests available on the survey. While proclaiming the virtues of g, Herrnstein and Murray do not actually use it in their empirical analyses. A variety of interesting empirical associations are established in this portion of the book. Although the authors issue the standard warning that correlation does not imply causation, throughout the book these correlations are given an implicit causal interpretation, as is common in much empirical research in social science. 
This problem is especially serious in their analysis of the relationship between AFQT and education, as I extrapolate further. Herrnstein and Murray's measure of IQ is not the same as the g that can be extracted from test scores available in their data set. They do not emphasize how little of the variation in social outcomes is explained by AFQT or g. There is considerable room for factors other than their measure of ability to explain wages and other social outcomes. In their empirical work, the authors assume that AFQT is a measure of immutable native intelligence. In fact, AFQT is an achievement test that can be manipulated by educational interventions. Achievement tests embody environmental influences: AFQT scores rise with age and parental socioeconomic status. A person's AFQT score is not an immutable characteristic beyond environmental manipulation. The weighting placed on various "abilities" depends critically on the composition of the tests used to measure ability. If the achievement tests used to define AFQT are deleted from the ASVAB scores, the g loaded scores become the speeded tests. The g factor extracted for all demographic groups is remarkably similar. There is less similarity across demographic groups in the other factors. This evidence is supportive of a single dominant factor as a determinant of test scores (see the evidence in Cawley et al. [1995]). The g measure is not the same as the AFQT measure used by Herrnstein and Murray. Arguably by not using g in their analyses of social performance, they bias downward their estimate of g's effect on social outcomes, but the difference between AFQT and g as predictors is slight. It is more interesting to consider how well g predicts outcomes in society. Wages are a variable of great interest to economists as a measure of performance in society. 
For each major demographic and ethnic group, table 2 presents simple Pearson correlations for log hourly wages in 1991 from the NLSY survey used by Herrnstein and Murray, with g obtained from group-specific factor analyses.
The table also presents correlations of log wages with the second factor (g2), with AFQT, and with numerical operations-one of two speeded tests in the ASVAB battery. This table exhibits the same g dominance that is found in a variety of studies cited by Herrnstein and Murray. There are several other noteworthy features of this table. First, the test scores predict female wages better than male wages. Second, g and AFQT perform about equally well in predicting wages for most demographic groups. Neither g nor AFQT explains all that much of the variance in log wages. The highest R2 is less than 22%. A lot of variability in log wages remains unexplained. Even if measurement error is as high as 30% of wages (a very generous estimate; see, e.g., Bound et al. [1990]), more than half of the variability across persons is explained by factors other than g or "ability." The second factor, g2, is never very strong and does not predict as well as speeded numerical operations, even though it is heavily "loaded" on it. Finally, a numerical operations test, a single 5-minute test, predicts log wages better than AFQT in 1991 for white males. The score on the numerical operations test is not included in AFQT. In results not reported here, the additional gain in R2 from using all 10 ASVAB scores to predict log wages over using g by itself is never more than 2% except for Hispanic females, for whom the gain is 3% (Cawley et al. 1995). When the conventional statistical testing procedures advocated by Herrnstein and Murray (p. 549) are used for most groups, additional scores of tests beyond g-or AFQT-are justified for inclusion in the prediction equations for log wages. For white males, as many as four ability measures are statistically significant in log wage equations. Nonetheless, it is striking that the same g that predicts test scores does such a good job of summarizing how a variety of test scores predict log wages. 
This evidence confirms the dominance of a single factor in explaining wages that is similar to the dominance that occurs in predicting military performance. The fact that g, AFQT, or even the entire battery of ASVAB tests explains only a fraction of the variance in measured wages (24% at the most, 30% when one adjusts for measurement error in wages), pass rates in military training schools (17-18%), and supervisor ratings (40-45%) means that there is a lot of room for factors not measured by psychometric tests to account for the variation in performance in a number of settings. The seminal work of Mincer (1972) suggests two important factors: education and job experience. Cawley et al. (1995) present evidence that education and tenure on the job account for a substantial component of the variance in log wages even after measures of ability are introduced into wage equations. Table 3 presents evidence on the improvement in fit of log wage equations that arises from adding education and work experience after controlling for measures of ability.
This stepwise procedure obviously exaggerates the explanatory power of the AFQT test. Adding schooling, tenure on the most recent job, and Mincer's measure of work experience raises R2 substantially above the level obtained from a pure psychometric specification. Education and tenure on the most recent job are always statistically significant at conventional levels. For Hispanics and whites, these variables have a substantially greater proportionate effect for males than for females, doubling R2. Psychometric variables are strongly predictive of wages, but so are schooling and work experience. Moreover, the important factors for explaining tests are not always the important factors for predicting wages. Cawley et al. (1995) note that a one-standard-deviation increase in scores on the speeded numerical operations test raises wages substantially more than a one-standard-deviation increase in g or AFQT. Numerical operations scores often drive AFQT or g scores into statistical insignificance in regressions of log wages on experience, schooling, and ability. For whites, the coefficient on numerical operations is larger than the coefficient on AFQT or g. The g that emerges from the test score matrix is dominated by a test score that is loaded on the second component. This evidence emphasizes the discrepancy between the factors that predict test scores and the factors that predict social performance. The hypothesis of a universal g that underlies the analysis of Herrnstein and Murray does not receive much support. Analyses by researchers in the military (see, e.g., Harris, McCloy, and Statman 1995) demonstrate that work experience in a job partially compensates for initial cognitive deficits, especially for tasks that are not complex and in which there is little technical change. However, in times of rapid change, the reward to ability appears to increase. 
In more complex work environments and environments undergoing technical change, experience is a less perfect substitute for innate ability. Moreover, even in simple tasks, unaffected by technical change, experience never eliminates ability differentials. There is a remarkable parallelism in performance profiles in terms of experience among different cognitive classes that is strongly suggestive of Mincer's evidence on the parallelism of log wage-experience profiles across different education groups (Harris et al. 1995). At all levels of experience, more able workers retain their initial advantage over less able workers. Military studies of motivation and attitude show only weak effects of those traits on performance. At least in the military, motivation and drive are negligible contributors to productivity. This evidence and the studies of Sternberg (1985) support the notion that g is important, but g alone does not explain social outcomes. "Crystallized intelligence" or experience also contributes to social performance. See the discussion in Hunt (1994). Cameron and Heckman (1993b) demonstrate that although holders of General Educational Development certificates have higher AFQT scores, they earn no higher wages than other high school dropouts with the same years of education. The Flynn effect, the fact that AFQT scores rise with age, studies of transracial adoption, and the research of Herrnstein et al. (1986) all suggest that g is affected by the environment. Since g is an achievement test loaded and schooling is likely to raise performance on achievement tests, there is likely to be a strong relationship between g and education. In the limit, if a brilliant person had no schooling, he or she would be unlikely to score well on exams. The correlation between AFQT and years of schooling is high (r = .6 for white males), but AFQT may affect schooling. 
A direct test of this proposition is provided by Neal and Johnson (1994), who adopt an instrumental variable strategy similar to that used by Angrist and Krueger (1991). Restrictions on the age of students entering schools cause many children born in the last quarter of a year to start school one year later than students born earlier in the same calendar year. Neal and Johnson examine grades completed and AFQT scores for students who were 16-18 when they took the test. They find within each birth year that years of schooling completed are roughly constant over the first three quarters and then drop substantially (1/3 to 1/2 year) in the fourth quarter. By using quarter of birth as an instrument for years of schooling completed, they show that an additional year of schooling raises AFQT scores for men and women by 0.22 and 0.25 standard deviations. The black-white AFQT gap could be closed by four additional years of school. The evidence that exogenous schooling substantially raises AFQT calls into question the evidence offered by Herrnstein and Murray. It also suggests that a substantial component of their education-AFQT correlation is due to reverse causation. It is the education that "causes" some portion of the AFQT score and not the reverse. In their empirical analysis, the operational definition of IQ is the AFQT score. Their treatment of family background is cavalier, to put it mildly. However, it is not much more cavalier than many papers on status attainment in sociology. The index is based on parental education and occupational status, and on family income measured at one point in the life cycle of the child. For many persons, the family income measure is entirely missing and is omitted from the construction of the index. No sensitivity analysis is presented to allay concerns about the sensitivity of their estimates to the application of their unusual imputation procedure. Fischer et al.
(1996) note that the Bell Curve analysis is based on the Armed Forces Qualifying Test (AFQT), which is not an IQ test but is designed to predict performance on certain criterion variables. The math section requires high school algebra. Furthermore, they note that the original plot of the AFQT data is not in the shape of the required bell curve. Since Herrnstein and Murray require a bell curve for their theory, they reshaped the original data to fit their theory. Here we have an example of theory driving the data. Even with these problems, Fischer et al. (1996), for the sake of argument, accept Herrnstein and Murray’s evidence, measure of intelligence, and basic methodology and then reexamine the results. However, they correct for factors (ignored by Herrnstein and Murray) known to have significant effects on a person’s life outcome (e.g., parental income, number of siblings, local unemployment rate, geographic region). In their reanalysis, Fischer et al. conclude that a person’s life chances depend on their social surroundings at least as much as their own intelligence. They conclude that the key finding of the Bell Curve (i.e., IQ as a predictor of SES) is an artifact of its own method. Herrnstein and Murray even delete from their composite AFQT score a timed test of numerical operations because it is not highly correlated with the other tests. Yet it is well known that in the data they use, this subtest is the single best predictor of earnings of all the AFQT test components. The fact that many of the subtests are only weakly correlated with each other, and that the best predictor of earnings is only weakly correlated with their "g-loaded" score, only heightens doubts that a single-ability model is a satisfactory description of human intelligence. It also drives home the point that the "g-loading" so strongly emphasized by Murray and Herrnstein measures only agreement among tests, not predictive power for socioeconomic outcomes.
By the same token, one could also argue that the authors have biased their empirical analysis against the conclusions they obtain by disregarding the test with the greatest predictive power. Janet Currie and Duncan Thomas said “Herrnstein and Murray report that conditional on maternal ‘intelligence’ (AFQT scores), child test scores are little affected by variations in socio-economic status. Using the same data, we demonstrate their finding is very fragile.” Anyway, the study isn’t convincing for the hereditarian case, considering that it only looked at White males (so it supports no conclusions about other populations) and that it is subject to confounding. They talk at length about confounding but in the end say "It seems unlikely, given the results here and elsewhere, that a null relationship would be identified if confounding were better addressed." after explaining: "In the present study, it is apparent that SES, an important potential confounder, is not evenly distributed across levels of IQ." This ignores the fact that controlling for potential confounders only works if the confounder is sufficiently distributed across levels of the variable of interest (IQ). They also confuse some relationships: "For example, a lower IQ may lead to poor academic performance and school experiences. More generally, individuals with low IQs may be less able to successfully negotiate social relationships and situations, including relationships in school and with family and friends. These effects in turn may lead to weak social bonds, greater strain, greater association with delinquent peers, and negative labeling." Is it that lower IQ LEADS to poor academic performance, or is it the other way around? (That is, because the constructors of IQ tests already have preconceived biases, they structure the results to fit those biases.) Also, I don't like how they only mentioned that the IQ-crime relationship controversy centered around eugenics and other policy applications.
Sure, that was a major factor, but the main reason Jensen and others were shit on is that they were WRONG. The section sort of implies that people opposed to hereditarianism are only focused on the moral implications of the IQ debate, which is historically and currently incorrect. Also, the theories Mears & Cochran cite are insane. "One centers on the idea that people who are lower in intelligence may lack sufficient moral awareness of how to behave and so are more likely to offend (Langdon, Clare, & Murphy, 2011). A related argument is that individuals with lower IQs may be less able “to foresee the consequences of their offending and to appreciate the feelings of victims” (Farrington & Welsh, 2008, p. 41; see also Lynam, Moffitt, & Stouthamer-Loeber, 1993; Moffitt, 1993)." One thing I didn't mention before (and you can see in the quote above) is that IQ and intelligence are equated. Testing "moral awareness" from IQ sounds like some 1920s testing objective. I also don't know at what point some sample can "lack sufficient" moral awareness. Again, they had a lot of limitations, including:
1. The publications employed a sample and a total birth cohort (respectively) that cannot generalize to female and non-Caucasian populations.
2. The publications employed ordinal measures of IQ as opposed to continuous measures of IQ.
3. Mears and Cochran (2013) only examined the functional form of the association between IQ and antisocial behavior at a single life period.
This leads to questions that include the following:
1. Is the direct association between IQ and antisocial behavior curvilinear?
2. Does the operationalization of antisocial behavior moderate the curvilinear association?
3. What is the degree of the curvilinear association?
4. Is the curvilinear association moderated by age?
5. Is the curvilinear association moderated by race and sex?
These questions also stem from the fact that:
1. They primarily relied upon data collected several decades ago, most notably the National Longitudinal Survey of Youth (NLSY), which was initially collected in 1979.
2. They rely exclusively on self-reported data. While the limitations of both self-report and official records measures of offending have been documented, the strengths of each measurement strategy seem to complement the other's limitations.
3. When they examine the IQ-offending association, they rely either on a single, comprehensive measure of intelligence or a single subscale, which is not ideal.

June Andrew (1982) uses digit span tests and verbal IQ scores for a young sample of delinquents. She finds substantial differences in both across non-violent delinquents and violent delinquents.

There are major concerns with June Andrew (1982), which include small sample sizes, subject selection, adequacy of controls, collection of neuropsychological data, and data analysis. The subjects were a highly selected group of incarcerated volunteers, hospitalized violent adolescents, adjudicated recidivistic delinquents, or offenders referred for psychiatric evaluation. The mean age was 15.33. Samples of this type introduce several sources of bias. The subjects may have been involved in drug or alcohol abuse, fighting, or motor vehicle accidents (in which they may have sustained repetitive concussions). A history of truancy and/or institutionalization is also likely. Such a history is expected to compromise performance on cognitive tests, suggesting that a delinquent lifestyle may produce neuropsychological deficit. It is also possible that the “cognitively impaired” delinquent is more easily apprehended or more likely to be referred for testing; in fact, Robbins et al. (1983) found clinic-referred delinquents to be more neuropsychologically impaired than adjudicated delinquents who had not been referred. Another issue is the test motivation of the older incarcerated delinquent. A further problem is that racial minorities are often disproportionately represented among incarcerated delinquents, and minority adolescents often perform poorly on mental tests. Poor black youths are often less than comfortable with mainstream English, which poses a special problem for interpreting the verbal deficits most often found to characterize delinquents. Controls were also not screened for non-adjudicated delinquency. There could also be a problem with the choice of neuropsychological measures, which was apparently determined post hoc, based on scores collected during routine evaluation or from clinic records. As a result, the tests employed often tapped a restricted range of primary cognitive functions, making the interpretation of patterns of relative strengths and weaknesses difficult. They didn’t even cite the reliability and validity of the tests they used. It is far from clear that the WISC-R is an exception that provides an adequate benchmark for comparison; in fact, the critiques of it concern its test construction. Nisbett already outlined the laughable examples of test questions in his 2012 work. The SB and WISC are also used as the "standard" for testing, which purports to address construct validity, but this is circular, since it only shows that performance on new tests correlates with performance on past tests. The statistical treatment of the data was also less than rigorous. The individual tests were reported in one table without regard for the likelihood of type I error. Little attention was paid to possible confounding variables. For example, gender and social class are known to covary with delinquency and with many neuropsychological test scores, yet the study didn’t control statistically for social class.

Crocker and Hodgins (1997) follow a Swedish cohort of over 15,000 participants to age 30. The mentally retarded male participants were significantly more likely to have committed at least one violent offense, theft, traffic offense, or “other” offense. Similar results are found for women.

I don’t understand how this is relevant to anything.

A study by Oleson and Chappell (2012) actually looked at a sample of people of very high intelligence, or geniuses (mean IQ of the sample was 154.6). They found that even among geniuses, a statistically significant, negative correlation exists between IQ and use of violence, having killed another human (excluding warfare), and having kidnapped someone.

This study used self-reports with no tests of their validity or reliability. There was no randomization in the sampling procedure, and the authors didn’t control for basic third-variable confounding or for differential detection.

Diamond, Morris, and Barnes (2012) look at both individual IQ and prison-unit-level (different units of the prisons; groups) IQ and see if it relates to the amount of individual inmate violence. First of all, they find that the average IQ of prisoners is about 2/3rds of a standard deviation below that of the American population. This is in line with other research that the IQ difference between criminals and non-criminals is about 8-10 points (Hirschi and Hindelang, 1977).

Again, this is due to differential detection and third-variable confounding.

Second of all, they find individual IQ negatively correlates with individual inmate violence and that differences in prison-unit-level IQ negatively predicts the amount of individual-level violence. This may be somewhat in line with the theory I presented earlier that group IQ differences matter more to variation in given outcomes than individual level IQ differences.

This relied on a single comprehensive measure of crime or delinquency without considering more specific types of offending. This approach may limit the conclusions that can be drawn from the results. For example, it is possible that the IQ-crime association varies depending on the seriousness and the type of crime (e.g., property vs. violent crime). This is all the more plausible given the cross-sectional, retrospective nature of this study, which doesn’t permit any kind of causal inference.

Lynam, Moffit, and Stouthamer-Loeber (1993) argue low IQ is a solid, causal predictor of delinquency.

Yet they don’t use any causal inference procedure, e.g., difference-in-differences, Mendelian randomization, regression discontinuity, or two-stage least squares. They don’t even present a simple counterfactual.

This is done through specific procedural measures such as using younger boys, so as to avoid the effect of prison lifestyle on intelligence, as well as controlling for test motivation. The latter procedure is done to combat the hypothesis that the delinquency-IQ correlation can be mediated by the fact that delinquents do not seek to do well in life and will not care about their results on a test. Additionally, multiple studies have shown that the IQ of delinquents was low before said individuals became delinquent (Denno, 1990; Moffit et al., 1981; West and Farrington, 1973). The present study used self reports to measure delinquency in the boys and controlled for social status, race, and test motivation to ensure the correlation remained regardless of these variables. A correlation of r=-0.22 is found for FSIQ and delinquency.

This just means FSIQ explains only ~5% of the variance in delinquency (r^2 = 0.22^2 ≈ 0.048).
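As a quick check of that figure, the share of variance accounted for by a Pearson correlation is its square. Here is a minimal Python sketch, assuming the reported r = -0.22 is a simple bivariate correlation (the function name is only for illustration and does not come from the study):

```python
def variance_explained(r: float) -> float:
    """Share of variance accounted for by a bivariate Pearson correlation r."""
    return r ** 2

# FSIQ-delinquency correlation quoted above (assumed to be a simple Pearson r)
r_fsiq = -0.22

print(f"{variance_explained(r_fsiq):.1%} of the variance")  # prints "4.8% of the variance"
```

Roughly 95% of the variation in self-reported delinquency is therefore left to everything other than FSIQ.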

Impulsivity mediated relatively little of the relationship; school achievement did not have any effect on the relationship for whites whereas it mediated the association for blacks. Hodges and Plow (1990)

This old paper was based on a small sample (n=76), limiting its statistical power to detect effects. The authors themselves say their results may not be generalizable:
“These findings should not be interpreted as useful for the purpose of making diagnostic decisions for individual cases. For example, the magnitude of the discrepancy between VIQ and PIQ for the conduct-disordered group was approximately the same as that observed in the standardization sample for the WISC-R (Kaufman & Reynolds, 1983). However, the skewed distribution in favor of higher PIQ was not observed in the standardization sample, and long-term stability of the discrepancy (and its direction) would not be assumed for the individuals in the normative sample. In any case, further study is needed to determine the generalizability of our findings, given the characteristics of our sample and the number of analyses undertaken.”
Their socioeconomic control variable was also the Hollingshead Index, an old and poor measure that introduces more error into their results; see
https://moscow.sci-hub.se/4224/f878a3ed6c16bbac886dbc3275c39f77/haug1971.pdf?download=true
In table 3 they literally show that VIQ actually had a statistically insignificant relationship to conduct disorder (which the study says is predictive of delinquency with no sort of correlative test to show that?)

and Ward and Tittle (1994) also control for both SES and race and find that low-IQ remains a significant predictor for delinquency.

This study didn't even have a clear index of socioeconomic status or where the index came from.
Ward and Tittle’s results rest on estimates that may be distorted by measurement error:
“We assume that measurement errors are uncorrelated. If the data had contained multiple indicators for all variables, and if there were a theoretical basis for believing that measurement errors may be correlated, then their estimation might have altered our conclusions in some unknown way”
A barely acceptable Cronbach's alpha for school attitudes:
“Cronbach's α as a lower-bound estimate of the reliability of the index of all 15 items is 0.74.”
Their retest correlations didn’t even explain most of the variance for juvenile delinquency.
Additionally, their results are noisy and their models don’t fit well, as their R^2 = .4 - .5. So a lot of variance is left unexplained, meaning they aren’t controlling for relevant variables. They even say this:
“If the objective is to explain as much variance in delinquency as possible, evaluating the effects of IQ will not be particularly fruitful since it will, at best, contribute only a relatively small additional increment from both its direct and its indirect effects.”
Even then, they literally say that the IQ-delinquency relationship is actually positive and statistically insignificant:
“consistent with the argument, the direct effect of IQ on delinquency (b = 0.06) is nonsignificant (t value =-1.82).”

Wolfgang, Figlio, and Sellin (1972) compare one-time offenders and chronic offenders in intelligence. They control for SES and race and still find an 8.1 IQ point difference for whites and a 10.6 point difference for blacks. The latter difference is particularly interesting because most studies find a smaller association between IQ and crime for blacks (Hirschi and Hindelang, 1977; Lynam, Moffit, and Stouthamer-Loeber, 1993). McDaniel (2006) used NAEP data to estimate the average IQ of different states. He finds a correlation of r=-0.58 for state IQ and violent crime rate. Bartels, Ryan, Urban, and Glass (2010) use state IQ estimates to create estimates on the relationship of IQ to criminal behavior. They sought to replicate and extend upon McDaniel (2006) by looking at various types of crimes at the state level. Bartels et al. find, like McDaniel, a correlation of r=-0.58 for state IQ and violent crime, despite the years tested being different. They extend with the following, significant correlations of -0.57 (murder), -0.29 (robbery), -0.41 (assault), -0.45 (property), -0.57 (burglary), and -0.29 (theft).

Again, if he had read the study, he would have seen that it says “For the crime of aggravated assault results revealed that while the overall model was significant, F (2, 47) = 7.12, p < .01, R2 = .23, neither state IQ nor percent Black in the state were significant independent predictors.” Percent black wasn’t a significant predictor of burglary either. These models aren’t reliable because, statistically, they don’t explain much. Other studies also report results that differ from those of Bartels et al. (2010): Bartels et al. found a significant negative association between average IQ and assault, burglary, and property crime rates within the US, which other studies did not find at a cross-country level. For example, when controls were not included, increasing IQ appeared to raise the rates of property crime and burglary. However, after the inclusion of controls, IQ had non-significant negative effects on these types of crime. Moreover, the findings on the IQ–homicide relationship differ only after the control variables are included in the regression models. In comparing the effect of different IQ classes, the authors found that although raising the intelligence of IQ50th has the highest impact on reducing the homicide rate (β = -.460), the beta coefficient does not substantially differ from that of IQ95th (β = -.397) and IQ5th (β = -.414) (Burhan et al. 2014).
As for the study itself, it says (again) that third-variable confounding wasn’t adjusted for and that this is not a causal relationship:
“the present study is correlational and thus precludes any interpretation of causation. Furthermore, there are a number of additional factors including SES (Guay et al., 2005)”
Additionally, a problem shared by both McDaniel and Bartels et al. wasn’t adjusted for statistically:
“It should also be noted, with respect to the correlational analyses, that the populations of states differ substantially and any assumption of identically distributed data points was likely violated.”
Also
“The source of offending data also contains well known weaknesses. In addition to the problems of underreporting (specifically for rape for the current study), hierarchical reporting, and the myriad of special circumstances and reporting issues per year (details from the author), there are fundamental differences between the measures used for this study.”

McDaniel (2006) was also replicated by Pesta, McDaniel, and Bertsch (2010) who found states with lower average IQs had higher aggregate crime rates. They found a correlation of r=-0.76 between overall crime rate and state IQ estimates. Templer and Rushton (2011) analyze data from the fifty United States on IQ and criminal behavior. IQ was correlated with murder at r=-0.64, robbery at r=-0.46, and assault at r=-0.47.

This just means IQ does not explain 59% of the variance in murder, 79% of the variance in robbery, or 78% of the variance in assault. The residuals also warrant scrutiny. See the previous rebuttal on Templer and Rushton. Again, a laughably stupid paper.
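Those percentages come from 1 - r^2 for each correlation. A small Python sketch of the arithmetic, assuming the figures quoted from Templer and Rushton (2011) are simple bivariate correlations (the function name is my own, for illustration):

```python
# State-level correlations quoted from Templer and Rushton (2011) above
correlations = {"murder": -0.64, "robbery": -0.46, "assault": -0.47}

def unexplained_share(r: float) -> float:
    """Share of variance NOT accounted for by a bivariate correlation r (1 - r^2)."""
    return 1 - r ** 2

for crime, r in correlations.items():
    print(f"{crime}: {unexplained_share(r):.0%} of the variance unexplained")
```

Even the strongest of the three correlations leaves well over half of the state-to-state variation in crime unaccounted for by IQ.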

Beaver and Wright (2011) look at over 200 countries and their IQ estimates. The correlation matrix can seen below in Table 1: primarily the violent crime rate was correlated with IQ at r=-0.58, the property crime rates was correlated at r=-0.40, the aggravated assault rate was correlated at r=-0.52, and the composite crime rate was correlated at r=-0.53. At the national level, Rushton and Templer (2009) conduct principal components analysis on the effect national IQ has on national crime rates. National IQ correlated with homicide at r=-0.25, rape at r=-0.29, and serious assault at r=-0.21. Gendreau, Little, and Goggin conducted a meta analysis of 131 studies (1,141 correlations) on the relationship between specific factors and adult recidivism. “Intellectual functioning” had a mean correlation of r=0.07. This relationship was stronger than the SES-Crime relationship, but worse than most others.

This is a severe misunderstanding of the study. The data explicitly shows that intellectual functioning had a correlation of r = 0.07 (SD = 0.14) with recidivism, based on 32 effect sizes across 21,369 subjects. This correlation is not only weak in absolute terms, but the high standard deviation of 0.14 despite the large sample size indicates substantial inconsistency in this relationship. This variability is particularly telling - with a sample size of over 21,000, we would expect more consistent effects if intelligence were truly a robust predictor.
The environmental factors, particularly dynamic criminogenic needs, showed much stronger and more consistent correlations. Criminogenic need factors demonstrated a correlation of r = 0.17 (SD = 0.11) across 246 effect sizes, and the Common Language Effect Size Indicators in Table 3 quantify this superiority precisely: criminogenic needs outperformed intellectual functioning 71% of the time. Social achievement factors (r = 0.15, SD = 0.14) also demonstrated stronger predictive power. Even family factors outperformed intellectual functioning 61% of the time according to the Common Language Effect Size Indicators.
The most damning statistical evidence against the hereditarian interpretation comes from Table 4's analysis of comprehensive risk assessment tools. The LSI-R, which incorporates multiple environmental and social factors, achieved a correlation of r = 0.35 with a much smaller standard deviation (SD = 0.08) despite its complexity. This higher correlation coupled with lower variability demonstrates that environmental factors working in concert are far more reliable predictors than intelligence alone. The authors' own statistical analysis using Student-Newman-Keuls post hoc comparisons formally established that criminogenic needs and social factors were significantly stronger predictors than intellectual functioning (p < .05).
Furthermore, the study's finding that dynamic (changeable) factors outperformed static (unchangeable) factors (r = 0.15 vs r = 0.12) directly contradicts hereditarian interpretations. The statistical evidence shows that factors amenable to environmental intervention have more predictive power than fixed traits. The high standard deviation in intellectual functioning's correlation, despite having one of the larger sample sizes, suggests this relationship is likely confounded by unmeasured environmental variables. In conclusion, attempting to use this meta-analysis to support hereditarian claims requires ignoring the superior predictive power of environmental factors demonstrated across multiple statistical measures, the greater consistency of environmental predictors, and the clear superiority of assessment tools that emphasize dynamic environmental factors over static traits.

Kandel et al. (1988) finds that IQ is a protective factor against criminogenic, environmental differences and, in effect, reduces risk of criminality. Many others, such as Herrnstein and Murray (1994), Magdol, Moffitt, Caspi, and Silva (1998), and Ward and Tittle (1994), have agreed with the hypothesis that IQ indirectly affects criminality by causing criminogenic factors. The 2009 edition of the Handbook of Crime Correlates finds a large number of studies on the correlation between IQ and criminality (Beaver, Ellis and Wright, 2009). They find that the supermajority find a statistically significant, negative association.

This book oversimplifies the relationship between intelligence and crime by focusing solely on the negative correlation and ignoring other factors that may be relevant. For example, social and economic factors, such as poverty and lack of access to education, may also play a role in both intelligence and criminal behavior. Additionally, there may be other individual-level factors, such as personality traits or environmental influences, that contribute to both intelligence and criminal behavior. By focusing solely on the negative correlation between intelligence and crime, the book may fail to consider the full range of factors that contribute to criminal behavior.

Some fail to find a statistically significant association, and very few find a positive association. These results were the same for official offending, self-reported offending, and various forms of psychopathy related to criminal behavior.

Individuals may under-report or over-report their criminal behavior due to social desirability biases or memory errors. Additionally, self-report measures may be influenced by the specific wording and framing of the questions, which can affect the responses given. Similarly, official records of criminal behavior may not provide a complete or accurate representation of an individual's criminal activity. For example, official records may only capture offenses that are detected and reported to the authorities, and may not include offenses that go undetected. Additionally, the definitions and categories of criminal behavior used by different jurisdictions may vary, which can affect the comparability of official records across different studies or populations. Regarding psychopathy, it is a complex and multifaceted personality trait that is often associated with criminal behavior, but it is not the only factor that contributes to criminal behavior. Psychopathy is typically measured using standardized assessment tools, such as the Psychopathy Checklist-Revised (PCL-R), which assesses a range of personality traits and behaviors that are thought to be indicative of psychopathy. However, it is important to note that psychopathy is not a diagnostic category in the DSM-5 (the diagnostic manual used by mental health professionals) and that the concept of psychopathy is still the subject of ongoing research and debate in the field of psychology.

Additionally, the type of IQ matters; performance IQ has a stronger association than does verbal IQ, but both negatively predict criminality.

The book's focus on performance IQ and verbal IQ may not capture the full range of intelligence, and the relative importance of these different measures may vary depending on the specific context and goals of the research.

Ellis and Walsh (2003) review the international data on IQ and crime; of 68 studies on IQ and delinquency, 60 found statistically significant, negative relationships. The other eight only reported statistically insignificant relationships. Of 19 studies on adult offending and IQ, 15 found statistically significant, negative relationships. Of the 17 studies on self-reported offending and IQ, 14 found a statistically significant, negative relationship. Of the 19 studies on the effect of IQ on antisocial personality disorder, all found a statistically significant, negative relationship. Additionally, the international meta-analysis provided by the newer edition of the Handbook of Crime Correlates (Ellis, Farrington, and Hoskin, 2019) finds that far and wide, the majority of the studies show a statistically significant, negative relationship between official offending and IQ. Some studies were statistically insignificant and very few showed a positive relationship. One criticism may be that low-IQ offenders are more likely to be caught, and thus the relationship between low IQ and criminality is a result of this issue. One study found that this was not the case, and that criminals that aren’t caught still have low IQs (Moffitt and Silva 1988).

More recent and representative data show that the differential detection hypothesis is supported to a certain extent (i.e., it can account for only part of the relationship) (Yun & Lee 2013), which has been replicated with more controls (Boccio et al. 2018; Yun et al. 2013). Moreover, in a reanalysis of the National Longitudinal Study of Youth in a review of Murray and Herrnstein’s The Bell Curve, Cullen et al. (1997) report that differential detection does occur in this cohort.

Furthermore, controlling for SES still shows criminals to be low IQ (Jensen 1998).

So aside from the fact that (Jensen 1998) violates Berka Nash, first of all there is a great deal of evidence of various kinds that the general factor does not do what Jensen claims. This factor only applies when one assumes that intelligence is a fairly narrow construct, that narrow measurements suffice for measuring intelligence, and that work that goes outside this tradition is not worth taking seriously. The problem is that g does not extend much beyond academic, analytical tasks. The fact that even Herrnstein and Murray (1994), staunch supporters of the g construct, found that conventional psychometric tests of abilities account for only about 10% of the variation in various measures of life success ought to give us pause. Others have come up with similar numbers (e.g., Wigdor & Garner, 1982). There is a varied body of empirical evidence calling the existence of g into question. One program of research investigates practical mathematics versus school mathematics, investigating populations such as Brazilian street children doing mathematics problems in their street businesses (Nunes, Schliemann, & Carraher, 1993) or Berkeley housewives doing mathematics problems in the supermarket (Lave, 1988). The general finding is that individuals who are able to use mathematics effectively in their lives are often unable to solve the same problems when they are presented in paper and pencil format and in abstract form. Moreover, the correlation between performance on the two kinds of tasks is meager. A second related program of research has shown that individuals who may be considered to be of quite average levels of intelligence, such as men working in a milk-processing plant, may devise quite complex ways of doing their jobs, ways that are substantially more clever than the ways they are told to do their jobs by management (Scribner, 1984). In a related research program, Ceci (1996) has investigated adults setting odds at racetracks. 
Men who are able to formulate and use extremely complex mental mathematical formulas were found to be of roughly average intelligence, and again, cross-domain correlations appear to be meager. In further work under this research program, it has been shown that children who solve a cognitive task in one way in the laboratory will often solve it in a different way when they are presented with the same task in the home. Furthermore, a task presented with engaging content will generate a much higher level of performance than the same task presented with content that fails to interest the children performing it. Research suggests that what constitutes intelligent, adaptive behavior varies from one culture to the next, and that the mental operations used to produce such behavior may vary somewhat as well. For example, Greenfield (1997) has found that the assumptions underlying the display of intelligence (such as those involving the role of other people) vary from one culture to the next, and Cole (1996) and his colleagues have shown that classificatory behavior that is considered intelligent in one culture may be considered stupid in another. Rogoff (1990) has shown that the socialization of intelligence differs widely around the world, and research shows that it even differs across ethnic groups in the United States (Heath, 1983). In other research (summarized in Sternberg, 1997; Sternberg et al., in press), we have found that scores on tests of practical intelligence for individuals in a variety of occupations (management, sales, academic psychology, and several levels of the military) do not correlate with scores on conventional tests of intelligence, despite the fact that scores on these tests predict job performance about as well as or better than do scores on conventional psychometric measures. 
In work in Kenya, we have found negative correlations between tests of practical intelligence and tests of academic (crystallized and fluid) abilities as well as school achievement. We have also found in Tanzania that, through dynamic testing, children’s scores on conventional cognitive-ability tests can be increased substantially, and that scores after training are only modestly correlated with scores before training. In the United States, we have found that high school students’ scores on tests of analytical, creative, and practical abilities are not correlated with each other (after correcting for response format) and that children who are taught to their pattern of analytical, creative, and practical abilities outperform children who are not so taught (Sternberg, Grigorenko, Ferrari, & Clinkenbeard, in press). Indeed, students taught analytically, creatively, and practically outperform students taught in conventional ways, even on conventional assessments (Sternberg, Torff, & Grigorenko, 1998).

Along with crime,

Other research has demonstrated that, following the inclusion of a more robust set of structural controls, IQ contributes to less than 5% of the variation in delinquency (Menard & Morse 1984). Moreover, IQ is usually one of the smallest contributors to overall variance, explaining less than 1% in many meta-analyses (Cullen et al. 1997). Other research has also shown a lack of a longitudinal correlation once the relationship is adjusted for confounding factors (Fergusson, Horwood & Ridder 2005) [4], indicating that more research should be done to more robustly test hypotheses of confounding.

IQ can also help explain some of the reasons as to why blacks earn less than whites and why blacks are in poverty. First of all, income is heritable, as found by twin studies. Hyytinen et al. (2013) looked at 19 previous samples in which the heritability of income was estimated. 42% of the income variation could be attributed to genes, while 9% was due to non-shared environments. It’s possible that blacks and whites could differ in genes associated with income, with whites having genes associated with higher income and blacks with genes associated with lower income.

There are classic problems with both Hyytinen et al. (2013) and race realists' interpretations. Firstly, the ACE and ADE (accounting for dominance effects) models represent a flawed understanding of genes and environment. The study tries to account for GxE, but see:
https://developmentalsystem.wordpress.com/2019/10/07/gene-environment-interactions-and-the-statistical-fallacy/
Their coefficient of relatedness is put into question, as stated before. "Genes and environments combine in statistically heterogeneous and stochastic ways that evade detection without actual developmental research." Modeling such as in Purcell 2002 ONLY works when SPECIFICALLY modeling for the environment. Second of all, heritability estimates do not help identify particular genes or ascertain their functions in development or physiology, and thus, by this way of thinking, they yield no causal information (Panofsky). No genes have been "found" for income, and claims of differences by race fail to account for the assumptions of the twin studies (the assumptions the study conveniently "addresses" in a couple of statements and then proceeds past, such as assortative mating and the EEA). Even then, they must account for more than just a few biases (when dealing with such a topic, the bar ought to be set high for hereditarians to prove their arguments). Third, h2 estimates themselves wouldn’t tell us which traits are “genetic” or not (Bailey et al. 1997).

Regardless, Strenze (2007) looked at over 100,000 individuals and found that IQ correlates with income at .22. This is important because IQ is a better predictor of someone’s future socioeconomic status than their parents’ socioeconomic status (Strenze 2007).

So Strenze actually says “The correlation with income is considerably lower, perhaps even disappointingly low, being about the average of the previous meta-analytic estimates (.15 by Bowles et al., 2001; and .27 by Ng et al., 2005). But...other predictors, studied in this paper, are not doing any better in predicting income, which demonstrates that financial success is difficult to predict by any variable. This assertion is further corroborated by the meta-analysis of Ng et al. (2005) where the best predictor of salary was educational level with a correlation of only .29. It should also be noted that the correlation of .23 is about the size of the average meta-analytic result in psychology (Hemphill, 2003) and cannot, therefore, be treated as insignificant.” When one looks at the cloud of data points, we’d all be right to be skeptical that any information can be extracted. Some research has shown associations between IQ and income and wealth that are approximately zero. For instance, Heineck & Anger (2010) use German panel data and estimate regression coefficients of cognitive ability on wages of about 0.02 for males, and figures not different from 0 for females (Table 1). Hauser (2010) reports there is no effect of intelligence on income net of education, so any purported effects have to be mediated through education, a highly bureaucratic and credential-laden institution. Borghans et al. (2016) reports IQ only explains ~2.5% of the variance in income, indicating a correlation of about .15 (fig 4). Finally, Taleb (2019) reports the relationship between IQ and wealth and income with R^2 values of about 0.01 to 0.02.
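The figures above mix correlations and shares of variance explained, so it helps to put them on one scale. This is a minimal sketch, assuming a simple two-variable linear relationship, where variance explained is the square of the correlation; the helper names are illustrative.

```python
from math import sqrt


def variance_explained(r):
    """Share of variance explained by a bivariate linear fit with correlation r."""
    return r ** 2


def correlation_from_variance_explained(r_squared):
    """Magnitude of the correlation implied by a given share of variance explained."""
    return sqrt(r_squared)


# Strenze's meta-analytic correlation of .23 between IQ and income
print(round(variance_explained(0.23), 3))  # 0.053

# About 2.5% of income variance explained by IQ
print(round(correlation_from_variance_explained(0.025), 2))  # 0.16

# R^2 values of about 0.01 to 0.02
print(round(correlation_from_variance_explained(0.01), 2))  # 0.1
print(round(correlation_from_variance_explained(0.02), 2))  # 0.14
```

On this reading, even the .23 correlation leaves roughly 95% of the variation in income unaccounted for by IQ.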

Now, a possible criticism might be that income influences IQ rather than the reverse. However, adoption studies and guaranteed income studies have not shown income to impact intelligence. Blau (1992) found family income to not predict differences in intelligence between siblings, except in reading ability. In North Carolina, a guaranteed income experiment found no effect on GPA in high schoolers but did find one in young children (Maynard 1977). Duncan et al. (2011) found no effect of income on test scores, and Scarr et al. (1976) found that family income did not predict IQ in adoptive homes.

Adoption studies (specifically those comparing the correlation of adopted children with their biological parents and with their adoptive parents) suffer from range restriction, selective placement, and family unit effects. Moreover, both forms of adoptee studies suffer from confounding by epigenetic inheritance and by effects of the prenatal and pre-adoptive environment.

Regardless, Palmer (2018) also found that IQ was a better predictor of someone’s SES and poverty than their parental SES. Since IQ is associated with income, it’s no surprise to see that controlling for IQ cuts the black-white difference in the probability of being in poverty and wages in half (Murray and Herrnstein 1997). A large portion of the black-white differences in wages and being in poverty can be explained by IQ. The rest of the remaining disparity could be attributed to the race differences described above and below.

A reanalysis of this discredited book shows that measured cognitive ability, education, experience, and job tenure together account for, at most, only 29% of the variance in wages. The contribution of measured ability to the overall fit of the model is dwarfed by that of other observed characteristics. If ability is the only regressor included, it accounts for between 12% and 17% of the variance in wages. When they control for human capital measures (education, job tenure, job tenure squared, work experience, and work experience squared), the marginal contribution of ability falls to between 1% and 4%. So H&M dramatically overstate the degree to which differences in wages among individuals can be attributed to differences in their cognitive ability (Devlin, Fienberg, and Resnick 1997).

Jensen and Nyborg (2001) also found that adjusting for intelligence closed the black-white income gap, and even showed blacks to out-earn whites at the higher levels of IQ.

This study fails for several reasons. The first is its use of the method of correlated vectors (MCV) on item-level data. The MCV approach used in "Occupation and income related to g" relies on correlating a vector of g-loadings for each item with a vector of group differences in performance on each item. However, as Wicherts explains, using item-total correlations from classical test theory as estimates of g-loadings is highly problematic. Item-total correlations are dependent on item difficulty and the ability distribution in each group. An item may have a high correlation in one group but low in another group with different ability, even if the item measures the same latent trait. This violates a key requirement of measurement invariance that item parameters should not depend on group membership. Consequently, the vector of item-total correlations will differ systematically across groups that vary in overall ability levels. This means the "g-loadings" being correlated are not comparable across groups. The MCV result can change drastically depending on which group's item-total vector is used. Item-total correlations do not isolate g, but reflect variance from all sources. High inter-item correlations can occur even in scales with multidimensional structure. MCV cannot determine if item variances reflect g versus other abilities. The nonlinear relationship between item difficulty, item-total correlation, and phi coefficients (group mean differences) further obscures the interpretation of MCV results. Low MCV correlations can occur even if all items are g-loaded and measurement invariant. In addition to these inherent limitations of using item-total correlations as g-loadings, the MCV approach does not properly account for sampling variability and multiple comparisons when testing correlations across a large set of items. So the MCV method lacks both sensitivity and specificity when applied to item-level data. 
It cannot provide compelling evidence either for or against Spearman's hypothesis and the role of g in group differences. The extensive critiques of MCV strongly undermine its evidentiary value in making causal claims about the effect of intelligence on income differences. More rigorous analysis using modern psychometric models like IRT and tests of measurement invariance are needed.
It relies heavily on the assumption that performance on Raven's provides an undiluted measure of general intelligence. However, the evidence that Raven's is a pure indicator of g is highly questionable. Factor analytic studies show substantial variability in the g-loadings of Raven's, with many analyses finding it does not have the highest g-loading among tests in a battery. Raven's often shares variance with group factors like fluid reasoning or visual processing, beyond g. This violates the expectation that a pure g measure should have zero loadings on all group factors. The large Flynn effect for Raven's suggests it is influenced by causal factors like education that do not affect g. A pure g measure should not exhibit test-specific secular score increases. All tests likely have some degree of test-specificity according to classical test theory. The assumption that Raven's has zero test-specific variance and error is implausible. The g-loading of Raven's partly depends on the nature of the other tests in a battery through correlated variance. Its g-saturation is not an intrinsic property. Additionally, Raven's consists of only one type of matrix reasoning item, and does not sample the broad domain of cognitive abilities that comprise g according to theory. A short, narrow test cannot provide comprehensive measurement of a latent variable like g. So, the substantive critiques of the "purity" of Raven's indicate that differences in Raven's scores cannot be unambiguously attributed solely to individual differences in general intelligence. The use of Raven's as a single proxy for g in investigating group income differences is highly questionable on psychometric grounds.
There are key mathematical traps in factor analysis that call into question the results. The study extracts a general factor g based on factor analysis. But, factor loadings and g scores are indeterminate in factor analysis - there are an infinite number of different g score estimates that can reproduce the same covariance matrix. This indeterminacy means any resulting g factor scores are on shaky ground. The common factor model assumes g is common to all mental tests. But the authors acknowledge they used different collections of tests across groups, which would alter the nature of the resulting g factor. This violates the common factor assumption. Monotonicity checks show the study's correlation matrices violate expectations of a dominant g factor (e.g. the highest correlations are not all in the upper left corner). This suggests the data does not fit the g model. The study claims g is the largest source of variance across mental tests. But common factor analysis only models covariances, not explained variances, so this claim about g is unfounded. The study depends on a single g factor applying equally across groups. But it’s been shown how group differences can distort factor loadings, undermining any universal g. Problematically, the authors state "Different collections of tests will result in somewhat different first principle factors" (p.200) yet continue to interpret g as if it is the same trait across groups. This reveals a concerning inconsistency in their conceptualization and application of factor analysis. So, the mathematical and conceptual traps reveal major gaps in the factor analytic foundations used to estimate g and link it to income differences in this study. The results cannot be considered valid or meaningful given the violations of core assumptions.
The linear models used in this study are highly problematic given the data involved. The study uses linear regression models to relate g factor scores and income. But g factor scores are estimated from dichotomous item-level data, which involves complex non-linear relationships. Dichotomous item scores follow Bernoulli distributions, not normal distributions. Linear models like OLS regression rely on normality assumptions that are violated by binary item data. Statistics like item-total correlations have restricted ranges and will relate non-linearly to effects like group differences. The lasso-shaped bivariate relationships described by Wicherts indicate severe problems for linearity. The linear regressions used do not properly model measurement error in the g factor scores, which attenuates associations. Structural equation models are preferred for modeling latent variables with error. Income is measured categorically, but linear regression treats it as continuous. This distorts coefficients and standard errors. Ordered logistic or probit models would be more appropriate. Complex survey data requires weighting, clustering, and stratification to be accounted for to obtain unbiased estimates. It is unclear if the regressions properly modeled the sampling design. Axis scaling and restriction of range issues typical of achievement data can create illusory linearity. Non-normal residuals and homoscedasticity should be checked. So, the application of linear regression and correlation to analyze these data violates critical assumptions and does not adequately capture the relationships between g, test performance, and income. The inferences about g eliminating group income differences are highly suspect given the statistical model misspecification. The authors' conclusions outstrip what the linear methods applied can support.
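The point about unmodeled measurement error can be made concrete with Spearman's classical attenuation formula, under which an observed correlation equals the true correlation scaled by the square root of the product of the two reliabilities. This is a minimal sketch; the reliabilities and the true correlation below are hypothetical values chosen for illustration, not figures from the study.

```python
from math import sqrt


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Observed correlation implied by classical test theory (Spearman's formula)."""
    return true_r * sqrt(reliability_x * reliability_y)


# Hypothetical inputs: true correlation 0.50, reliabilities 0.80 and 0.70
observed = attenuated_correlation(0.50, 0.80, 0.70)
print(round(observed, 3))  # 0.374
```

A regression that treats estimated factor scores as error-free therefore reports coefficients that depend on the reliability of the scores as well as on the underlying relationship, which is the reason structural equation models are preferred here.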
The lack of measurement invariance testing is a major omission in this study that undermines the conclusions drawn about group differences and g. Establishing measurement invariance is fundamental for comparing latent variables like g across groups. Without invariance, group differences in test scores cannot be attributed to the latent trait. However, the authors do not conduct any test of measurement invariance of their g factor across ethnic groups. This leaves open the possibility that method effects or differential item functioning account for score differences. At minimum, weak factorial invariance testing configural, metric, and scalar invariance of the factor model across groups should be conducted. But no invariance testing is reported. The factor loadings used to derive g are based on the combined sample including both blacks and whites. If measurement non-invariance exists, these combined loadings are uninterpretable. The authors acknowledge that different collections of tests were given to the two groups. This means the nature of the latent g derived could differ across groups, violating assumptions. The income models control for g as if it is the same latent trait across groups. But without establishing measurement invariance of g, this causal interpretation is unsupported. Differences in latent means cannot be compared without scalar invariance. But no attempt was made to test scalar invariance of factor loadings or intercepts. So, the failure to conduct measurement invariance testing undercuts any claim about group differences in g or its effects on income. Without invariance, what appears as an ethnic difference in g could simply be measurement artifact or method effects. The causal inferences drawn about g and income are inappropriate given the lack of rigorous psychometric testing across groups.
The limitations of IQ tests as a measure of intelligence. IQ tests rely heavily on psychometrics, the quantitative measurement of psychological constructs. However, intelligence is an extremely multidimensional and nebulous concept that evades quantification. As discussed by Taleb, intelligence encompasses real-world skills like creativity, wisdom, critical thinking, and practical problem solving. These faculties involve complex neurological processes and dynamical interactions that cannot be reduced to a single score. IQ tests like the WAIS and Stanford-Binet are based on classical test theory, which assumes each item measures the same underlying latent trait. However, the positive manifold and factor analytic methods used to derive the "g factor" are prone to computational instability and rely on linear assumptions that do not hold for a nonlinear system like intelligence. The positive manifold may simply reflect cognitive complexity rather than a unitary general ability. IQ tests exhibit poor construct validity and lack predictive power for real-world criteria. The tests are not causally informed by cognitive neuroscience and at best assess skills like working memory and abstract reasoning. Performance on laboratory-based psychometric tests often fails to transfer to fluid intelligence tasks. IQ only explains at most 25% of the variance in occupational or academic performance. Most IQ tests have ceiling effects and are unable to differentiate ability at the high end of the scale. They may only measure deficits or learning disabilities rather than enhanced cognitive capacity. Range restriction is thus a major issue when studying high ability groups. IQ tests also have limited cross-cultural validity and language tests like the WAIS are biased for native English speakers. Group-level IQ differences likely reflect environmental factors, test bias, and stereotype threat rather than innate ability. 
Heritability studies attributing IQ to genetics are fraught with statistical flaws. So, the theoretical constructs and statistical models underlying IQ testing are oversimplified and lack sufficient validity to support their use as a proxy for intelligence. The tests measure a very narrow band of cognitive abilities and should not be reified as representing general intelligence.
The non-normality of the income distributions. First, the study reports descriptive statistics like means and standard deviations for income, but does not provide information on distributional shape like skewness or kurtosis. Omitting common skewness and kurtosis metrics is a red flag that the data may not meet normality assumptions. Second, the income variable is aggregated into seven discrete brackets rather than being continuous. Categorizing a continuous variable into bins destroys information and introduces discretization error. The coarse categories obscure the true distributional shape. Third, the study relies solely on linear regression techniques like OLS that assume normal residuals. However, income is well known to follow a log-normal or Pareto distribution, exhibiting a heavy right tail. Linear models are not robust to violations of normality and can yield highly misleading results under fat tails. Fourth, the reported R-squared values are very low, on the order of 1-2%. This indicates the models explain almost none of the variation in income, likely due to omitted non-linearities and interactions when imposing a parametric linear model on non-normal data. Fifth, the income ranges reported in Table 2 show right-skew, with the mean exceeding the median. The Black average income of $23,215 is closer to the lowest White bracket of $25,000 than the White mean of $30,052. This skew is indicative of a long right tail. So, the omission of distributional statistics, use of binned categories, linear-only techniques, low R-squared values, and descriptive skew all suggest the income data may not be normally distributed. This could render the regression models and IQ-income inferences invalid. The authors should have tested for normality and used robust non-linear methods.
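A mean above the median is what a log-normal income distribution predicts. This is a minimal sketch using the closed-form mean and median of a log-normal distribution; the parameters mu and sigma are hypothetical and are not estimated from the study's data.

```python
from math import exp


def lognormal_median(mu, sigma):
    """Median of a log-normal distribution with log-scale parameters mu and sigma."""
    return exp(mu)


def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution with log-scale parameters mu and sigma."""
    return exp(mu + sigma ** 2 / 2)


# Hypothetical log-scale parameters, for illustration only
mu, sigma = 10.0, 0.8
median = lognormal_median(mu, sigma)
mean = lognormal_mean(mu, sigma)

# The mean exceeds the median by a factor of exp(sigma^2 / 2), whatever mu is
print(mean > median)
print(round(mean / median, 3))
```

The larger sigma is, the further the mean drifts above the median, so comparing group means alone says little about the typical earner in each group.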
The limitations of correlational analysis, especially with regards to the reported link between IQ and income. The correlations are spurious and break down under fat-tailed distributions. First, the study only reports a simple Pearson correlation coefficient of .36-.39 between IQ and income. However, Pearson's r assumes a linear relationship and is not robust to non-linearities. Income more likely follows an exponential or power-law relationship with IQ where returns accelerate at the upper tail, invalidating a linear correlation. Second, the authors do not report any scatterplots between the variables that could reveal the true functional form. Omitting the visual plots obfuscates the possibility of a non-linear relationship. Third, the low R-squared of 1-2% indicates that nearly all variation in income is unexplained. This suggests at most a very weak link between IQ and income, with income more likely dependent on socioeconomic status, social capital, racial discrimination and other omitted variables. Fourth, the study dichotomizes race, which could induce a Simpson's paradox where the correlation reverses or disappears when disaggregating the groups. This points to issues with aggregation bias and causal direction. Fifth, the theory that IQ increases earnings has the causal direction backwards. Income and occupation likely enhance cognitive performance and test-taking skills through nutrition, education, family environment and other confounders. So, the correlation presented is likely an artifact of the linear methodology. Visual inspection, the low R-squared values, aggregation issues and reverse causation all indicate the IQ-income relationship breaks down under real-world conditions. The study provides no evidence of a robust, generalizable correlation between the variables.
The use of self-reported income data is problematic and fails to isolate individual earnings. There are a few major issues with relying on self-reported income that undermine the validity of the results. Self-reported data is subject to various biases that distort responses. These include social desirability bias, prestige bias, recall bias, and reference dependence effects. Participants may inflate or deflate incomes due to stigma, poor memory, or shifting comparisons. The study relies on total household income rather than individual earnings. Household income conflates individual occupational outcomes with assortative mating, household structure, and spousal characteristics. Two individuals with equal earnings could have drastically different household incomes. Household income has higher variance and skew than individual earnings because it compounds multiple income streams. This can artificially inflate correlations by mixing variability sources. The models likely violate homoscedasticity assumptions. There are different incentives around reporting household versus individual income that could induce systematic measurement error varying by race. Factors like welfare stigma or marital patterns may distort self-reports. Without tax records, pay stubs, or other verification, there is no way to validate the accuracy of the self-reported income brackets. Misreporting could be sizeable in either direction. So, relying entirely on an unverified, coarse measure of household income severely compromises the validity of the study's models and claims around individual occupational outcomes. To make claims about individual earnings, the models should be re-run on verified wage data, accounting for household composition. Self-reports are simply too unreliable.
The generalizability of the results given the use of a veteran sample. There are a few concerns with extrapolating these findings to the broader population. Military samples go through extensive screening and filtering on attributes like health, education, and cognitive ability. This range restriction truncates the distribution and may distort correlations. Veterans differ from the general population on qualities like motivation, risk tolerance, and access to benefits like the GI Bill. These factors likely moderate IQ-income links but cannot be explored. The veteran population is heavily male, while income dynamics may differ for women. Gender differences in occupational choice, discrimination, and household tradeoffs make generalization dubious. Selection into military service is non-random and depends on socioeconomic status, patriotism, and other attributes that could confound the relationships studied. Unobserved confounding is likely. The nature of military occupations differs from civilian work in terms of skill demands, compensation structure, and promotion patterns. This reduces external validity. The data truncates both tails of the IQ distribution through minimum AFQT requirements. This distorts correlations and distributions, especially when comparing racial groups. So, the patterns observed in this select veteran sample cannot be assumed to represent broader society. The range restrictions, confounding, and sample peculiarities all indicate the need for considerable caution before generalizing these correlations beyond veterans. The study findings have limited external validity.
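As a rough illustration of how range restriction distorts a correlation, here is a small simulation sketch. Everything in it is an assumption: the data are simulated standard-normal scores with a true correlation of 0.5, and the "screening" rule (keep only cases with x above 0) is a hypothetical stand-in for a minimum-score cutoff, not the actual AFQT rule.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)
RHO = 0.5  # assumed true correlation in the unrestricted population

pairs = []
for _ in range(20_000):
    x = random.gauss(0.0, 1.0)
    y = RHO * x + math.sqrt(1.0 - RHO ** 2) * random.gauss(0.0, 1.0)
    pairs.append((x, y))

# Full population versus a sample screened on x (hypothetical cutoff at 0).
r_full = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
screened = [p for p in pairs if p[0] > 0.0]
r_restricted = pearson_r([p[0] for p in screened], [p[1] for p in screened])

print(f"full r={r_full:.2f} restricted r={r_restricted:.2f}")
```

In this setup the screened sample shows a noticeably weaker correlation than the full population, which is why correlations from a pre-selected sample should not be read as population values.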
The sensitivity of the regression cross-over effect reported in the study. There are a few reasons to doubt the claim that blacks out-earn whites at higher IQ levels. Crossing regression lines are highly unstable and sensitive to model specification. Slight changes in functional form, transformations, or inclusion of interactions can dramatically alter which lines cross. The reported crossing effect is entirely driven by extrapolation, as very few blacks actually fall in the higher IQ ranges where their regression line supposedly dominates. Predictions in sparse areas of data are unreliable. At a fixed IQ level, there is still substantial income variance, as evidenced by the low R-squared. Cross-over effects could easily shift under replication. IQ only explains a small portion of income differences. Unobserved variables like family background, social capital, and discrimination likely swamp any IQ effect. The models assume homoscedasticity and the IQ-income relation likely exhibits heteroscedasticity, with income variability expanding at higher IQ levels. This distorts cross-over points. As visualized in the scatterplot, there is enormous variance and overlap in the IQ-income planes between races. Any distinctions or cross-over points are dwarfed by the variance. So, the claim that blacks out-earn whites at high IQ levels rests entirely on an unstable artifact of model extrapolation in a low-validity model. This effect would not replicate in out-of-sample data. The huge variance observed makes any cross-over points statistically negligible. The conclusions extend the limited data too far.
The dangers of imposing elegant mathematical models on complex real-world data. There are a few signs that the neat linear regression results likely fail to capture the true complexity. The study relies entirely on linear regression techniques like OLS. But as established earlier, the income variable clearly deviates from normality and linearity based on the descriptive statistics. Imposing linearity on non-linear data leads to fallacious inferences. The models assume homoscedasticity, but the income variable likely demonstrates heteroscedasticity, with increasing variability at higher IQ levels. Violating homoscedasticity distorts model fit and significance testing. The study excludes any non-linear terms, transformations, or interactions from the models. Yet theory suggests income depends non-linearly on factors like family background, networks, and other advantages that likely interact with IQ. Excluding non-linearities omits essential dynamics. The models only explain 1-2% of income variation, indicating massive model misspecification and omitted variable bias. But rather than improving the models, the study draws causal conclusions from admittedly invalid models. The authors note that "correlations tend to remain fairly constant" between groups. This consistent linearity despite different environments for blacks and whites defies logic and suggests forcing the data into parallel linear models for mathematical elegance rather than model fit. So, the study demonstrates many warning signs that the linear modeling results inadequately capture the intricate dynamics of income determination. The conclusions extend the mathematically convenient but grossly oversimplified models far beyond what is warranted. The analysis subsumes scientific rigor to mathematical aesthetic.

Going back to morality, we should not assume blacks and whites are equal in their ability to morally reason. One way to measure someone’s ability in moral reasoning is to give them a test where the lead actor is confronted with a moral dilemma. The respondent is supposed to choose the proper action consistent with it. When it comes to these tests, delinquents perform poorly on them (Raine 1993). Not surprisingly, these tests have also found race differences. More controversially, morality can also impact criminality. There’s no reason to assume that blacks and whites would share the same morality. Indeed, when giving moral understanding tests which put individuals into a scenario and then ask them what they’ll do, blacks score lower on these tests than whites do. In a sample of 1,322 junior high school students, their mean score was 21.90 with a standard deviation of 8.5 (Rest 1979).

This old paper wasn’t even on black people.

In a study by Preston (1979, cited in Rest 1979), blacks got a mean score of 18.45, showing a weaker moral understanding with a Cohen’s d of 0.41. In a sample of 8,782 people, blacks were more likely to endorse statements such as “laws are made to be broken”, “there is no right or wrong ways to make money”, and “it is okay for a teenager to have fist fights.”

Preston isn’t even cited in Rest.

The same has been found in blacks from Trinidad (d=-0.45 [Rest 1986]) and in Jamaica (d=-0.51 [Gielen et al. 1989, in Adler 1989]) when compared to whites. Since blacks score lower in tests that measure one’s moral understanding, it’s no surprise that they are more likely to commit crimes, especially since lower scores on tests of moral understanding are associated with delinquency. According to Beaver, Ellis, and Wright (2019), they find a negative relationship between lower scores on tests of moral understanding and official and unofficial offending behavior. Now, either blacks simply lack the morality that whites tend to have, or blacks simply have a different view of morality. If blacks have a different view of morality, then they’ll tend to treat others and do actions that may harm people at a higher rate than whites. As evident by the higher rates of black crime, it does seem they have a different view of morality than whites or simply lack the morality that whites have. Another possible cause for black crime is race differences in MMPI scores. The Minnesota Multiphasic Personality Inventory, or MMPI, is a test that assesses an individual’s personality and psychopathology.

The MMPI was developed in the spirit of empiricism, with little regard for item content, and so conclusions based on the interpretation of MMPI content are highly suspect. It’s a test that’s been revised many times due to its invalidity and unreliability. The MMPI is a widely known test that is primarily reliable with the white middle class and with those who are severely disturbed; see: https://www.statisticssolutions.com/free-resources/directory-of-survey-instruments/minnesota-multiphasic-personality-inventory-mmpi/
Additionally, studies show predictive inaccuracy and bias when it comes to racial differences in the MMPI (Arbisi 2002, Monnot 2009). Finally, the MMPI doesn't warrant any sort of diagnosis. The MMPI-2 is one of the most famous personality questionnaires, but it does not have predictive value by itself. Although some scales and profiles, such as the O-H scale and the 43/34 and 49/94 profiles, are linked to impulsive behavior and sometimes violent conduct, they do not provide predictive certainty; see: https://us.sagepub.com/en-us/nam/forensic-applications-of-the-mmpi-2/book5097
Measures such as the MMPI are usable in the general population, and may give some information on a person's personality, which you could then use as the basis of a further inquiry, for example by selecting people with worrisome profiles and doing in-depth follow-up interviews, conducting more regular checks, or mandating some type of treatment or guidance. However, using only test scores to make decisions that could have huge negative consequences on someone's life is never an option. Cultural differences may not be at all related to clinical issues, at least not in any direct or simple manner. Relations between cultural and clinical factors are quite unclear. One might hypothesize that different cultural backgrounds and environments make it possible that blacks interpreted the items differently, or described themselves from a different cultural orientation, without demonstrating the behavioral or clinical symptomatology that the criterion groups exhibited.
https://sci-hub.ru/downloads/2020-01-03/0d/dietrich1978.pdf

Hathaway and McKinley (1989) describe people who score high scores on the MMPI as being irresponsible, psychopathic, aggressive, having marital and work problems, and being underachieving.

Hathaway, the author of the MMPI, also had another paper which stated “They [the results] must, in the present instance, be interpreted cautiously because of the obvious invalidities involved in the direct application of clinical scales derived on adults to a high school population.” So Hathaway is quite cautious in interpreting the results of his own work. The author's certainty about the validity of the MMPI (that the test is measuring what it says it is) cannot legitimately be inferred from the references he cites to support his position.

Gynther (1968) remarks how these findings are interpreted to show “estrangement and impulse-ridden fantasies … unusual thought patterns and aspiration-reality conflict,”

Gynther also has a more recent paper from 1974 that supports the hypothesis described above about cultural differences. Gynther found that correlates of MMPI results descriptive of white patients were not descriptive of black patients even though the MMPI results were very similar; see: https://pubmed.ncbi.nlm.nih.gov/4149988/

Below is a table measuring the differences in psychopathic personality between blacks and whites using Cohen’s d. Since whites are a reference group, negative scores indicate scores lower than whites, and positive ones indicate scores higher than whites.

The table is just a series of small effect sizes for the difference between black and white people. Additionally, there’s no t test, F test, or other hypothesis test for the significance of the differences. There’s no regard for type I error, and no accounting for outliers or skewness. Bottom line: the table tells us little.
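
To show why an effect size alone tells us little without a significance test, here is a minimal sketch. The means and SD (21.90, 18.45, 8.5) are the ones quoted earlier in this document; the group sizes (20 and 200 per group) are hypothetical, since the table gives none, and the p-values use a normal approximation rather than an exact t distribution.

```python
import math
from statistics import NormalDist


def cohens_d(mean_a, mean_b, sd):
    """Standardized mean difference using a single reference SD."""
    return (mean_a - mean_b) / sd


def t_from_d(d, n1, n2):
    """Two-sample t statistic implied by Cohen's d and the group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


def two_sided_p_normal(t):
    """Two-sided p-value, normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Means and SD quoted earlier in this document.
d_reported = cohens_d(21.90, 18.45, 8.5)

# Hypothetical group sizes: the same d, very different evidence.
t_small = t_from_d(0.41, 20, 20)
t_large = t_from_d(0.41, 200, 200)
p_small = two_sided_p_normal(t_small)
p_large = two_sided_p_normal(t_large)

print(f"d={d_reported:.2f}")
print(f"n=20 per group:  t={t_small:.2f}, p={p_small:.3f}")
print(f"n=200 per group: t={t_large:.2f}, p={p_large:.5f}")
```

The same d can be nowhere near significant or highly significant depending on sample size, which is exactly the information the table leaves out.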

(-) indicates missing data, adopted from Lynn (2019)
In the book MMPI Patterns of American Minorities, it’s discussed how blacks score higher on psychopathy, schizophrenia, and on hyperactivity scales (Dahlstrom, Lachar, and Dahlstrom 1986).

Only about 30% of all racial comparisons involving these three scales have resulted in statistically significant differences (Pritchard & Rosenblatt, 1980a). Moreover, even when these comparisons are statistically significant the actual difference does not typically exceed five T-score points (Greene, 1980; Pritchard & Rosenblatt, 1980a).

and Dahlstrom, Lacher, and Dahlstrom note that the high score of blacks on hypomania indicate “outgoing, sociable, and overly energetic patterns; tendencies to act impulsively and with poor judgment.”

Dahlstrom, Lachar, and Dahlstrom also say “several methodological problems inherent in the process of measurement are made more difficult by this limitation in current psychometric procedures, prominent among them being the identification and reduction of error or systematic bias in these measurements. Errors of measurement are never totally irradicable, of course, but requirements of scientific method demand that unremitting effort be devoted to the appraisal of possible sources, directions, and magnitudes of constant errors in our test data and to the discovery of ways to eliminate, control, or make allowance for them in all of our measurements.” The funny thing is that Lachar had a book in 1974 which stated that MMPI profiles obtained from patients who deviate significantly from these reference groups (adult, Caucasian, lower- to middle-class patients seen in VA and teaching hospital settings), or more specifically, the reference group noted by a specific interpretation, should be interpreted with caution using all other available data. Examples of such patients include those with organic brain syndromes referred from neurology or neurosurgery, patients from atypical cultural backgrounds such as blacks and foreign nationals, and adolescents.

Jones (1978) looked at 226 black and white junior college students and administered the MMPI and California Psychological inventory. As Jones remarks, “Blacks reported themselves as more dominant and poised socially, fundamentalist in their religious beliefs, concerned with impulse management, self-critical, psychologically tough, cynical and power oriented, conventional in moral attitudes, and conformist than Whites. Blacks also reported themselves as less adventuresome and likely to take risks, and less vulnerable and tender psychologically (an interaction effect suggests this is particularly true of Black males than whites).” White male scored higher on unconventional morality, meaning that their behavior is considered beyond conventional sexually and ethically, but this doesn’t align with reality given race differences in moral reasoning tests, something to be discussed below. When it comes to women, black women scored higher on social dominance, compulsive-orderliness, self-criticism, psychological toughness, risk-taking, cynicism and power orientation, and conformity. Both black and white women tended to report themselves as more religious, and conventional in moral attitudes. These race differences held true even after holding socioeconomic status and years of education constant. Similar findings for women have been noted in another study.

Jones (1978) also notes that generalizing to black people as a whole is not advised. “These traits can be generalized to Blacks only with caution, since correlations of EFT scores and personality clusters, though modest in size, were usually in opposite directions for Black and White subjects. It is possible the personality implications of field dependence may vary for Blacks and Whites.” Many critics have also attacked the test with respect to cultural influence (Garcia 1981).

Harrison and Kass (1967) looked at pregnant black and white females. It was reported that “Negroes reported themselves as more religious, intellectual, romantic, cynical, impulsive in fantasy, fearful, estranged, sociable, concerned with dreams, orderly, and somatically tense than whites and less masochistic, free of aberrant behavior, indulgent in minor crimes, self-conscious, and antagonistic toward school than whites.”

Harrison and Kass (1967) has been criticized heavily. Harrison and Kass (1967) partially summarize findings in tabular form. Methodological sophistication, populations, and sample sizes have varied broadly. As a result, the findings have been inconsistent and, at times, even contradictory. Studies have attempted to probe three of the outstanding methodological issues that emerged. Harrison limited the acceptability of profiles for inclusion by imposing more stringent criteria for protocol validity than were employed by other researchers. Dahlstrom (1960) indicates that only four studies conducted during the period 1939-1960 compared the performance of blacks and whites on the MMPI. Three of the studies were of prisoners in state institutions and the other was of patients in a VA hospital. These studies, as well as more recent ones, report that the answers of blacks and whites on the MMPI statements are significantly different. Harrison and Kass state that “. . . race differences in both MMPI items and derived factor scales are of a magnitude seldom found in personality research. . . . Race differences on the items were huge. . . . The scales are not very sensitive to race differences, whereas the items are remarkably sensitive.” These results illustrate some of the problems one may encounter when attempting to apply a statistical rationale based on data acquired from one cultural group to another cultural group. What does it mean, for example, that there were "huge" differences on items, yet, at times at least, no significant differences on scales? Or that there were differences on some scales, but not on others? Do we ignore the differences in item responses, and conclude that similar scale scores imply similar groups, or do we ignore similar scale scores and draw inferences from the significant differences in item responses between two groups?
Clearly, there are signs that one cannot simply apply MMPI interpretations across cultural groups without a good deal of further study and analysis (as noted above).

MacDonald and Gynther (1963) also found race differences in MMPI scales: Race differences were smaller when looking at the high social class groups (1-2), but the differences in MMPI scale scores get more pronounced depending on the scale being measured and the social class. An issue with this is that blacks scored higher on L, which is the scale that checks if you’re lying when taking the MMPI test. Based on Cohen’s d, the differences are large (black-white male difference d=0.9; black-white female difference: d=0.7) and the smaller race gaps could be an artifact of blacks lying on the test. Regardless, the conclusion of racial differences in MMPI scores show that blacks are more cynical, have greater mistrust, conflict with authority, and “externalization of blame for one’s problems.”

The paper literally clarifies that it “is tempting to conclude, as others have (e.g., Ball, 1960), that Negro high school students are more maladjusted than white students. It is not clear, however, that this conclusion necessarily follows from the data. A basic limitation is that relations between scale scores or configurations and behavior have not been explicated with that high a degree of precision.” In fact, their measure of socioeconomic status was not rigorous at all. The more rigorously that moderator variables and profile validity issues are controlled by an investigator, the less likely it becomes that Black-White differences will be found. For example, Costello et al. (1972) reported that no Black-White differences were found if invalid profiles were excluded (profiles were excluded in MacDonald and Gynther). The scope of this paper is extremely limited because of its sample. It has low statistical power, so these effects are likely being incorrectly detected.

After controlling for social status, IQ, and levels of education, the differences in MMPI scores go away. This does not mean that MMPI differences are a result of these variables, rather traits scores may also affect status and levels of education. For example, conflict with authority can stop advancements in social status and levels of education. Controlling for these variables may make the differences go away, but this does not mean the differences are simply artifacts.

The problem with these MMPI papers is that they’re pre-1989, meaning they weren’t using the MMPI-2, which was released in 1989 and revised to be more valid than the original.

Some commentators have also made the claim that the MMPI test is biased against nonwhites, but the evidence does not support this (Prichard and Rosenblatt 1980).

Investigation of the Ferguson Police Department
Between 2012 and 2014, black people in Ferguson accounted for 85 percent of vehicle stops, 90 percent of citations and 93 percent of arrests, despite comprising 67 percent of the population.
Blacks were more than twice as likely as whites to be searched after traffic stops even after controlling for related variables, though they proved to be 26 percent less likely to be in possession of illegal drugs or weapons.
Between 2011 and 2013, blacks also received 95 percent of jaywalking tickets and 94 percent of tickets for “failure to comply.” The Justice Department also found that the racial discrepancy for speeding tickets increased dramatically when researchers looked at tickets based on only an officer’s word vs. tickets based on objective evidence, such as a radar.
Black people facing similar low-level charges as white people were 68 percent less likely to see those charges dismissed in court. More than 90 percent of the arrest warrants stemming from failure to pay/failure to appear were issued for black people.

The issue of vehicle stops, speeding tickets, and citations will be dealt with down below and not be taken up here. With respect to the 93% number, it’s unknown how the FPD got to this number given that their report doesn’t offer much to do calculations. I have also been unable to find any data on racial differences in crimes in Ferguson, but since Ferguson is 67.6% black, and blacks commit the most crime in The United States, we should expect them to make up almost a majority of the arrests. More information is needed. The issues in the Ferguson Report revolve around driving violations mostly, with the inclusion of small stuff like jaywalking, things like charge dismissals, and warrants. The issue of driving violations will be dealt with down below when talking about driving offenses, but the issue of charge dismissals is not racially biased. According to one study, “White males were no more likely than non-White males to have the charges dismissed” and “White females, in other words, were less likely than non-White females to have all charges dismissed” (Guevara, Herz, and Spohn 2006).

This is in the juvenile justice system, not the overall justice system. It’s also just federal courts. Only ~10% of all cases heard in the American court system happen in federal court. State courts handle by far the larger number of cases, and have more contact with the public than federal courts do; see: https://www.findlaw.com/litigation/legal-system/federal-vs-state-courts-key-differences.html
Judges are overwhelmingly white, but federal judges are more likely to be black or Hispanic. Considering that the vast majority of criminal cases dealing with violent crime are tried at the state level, I can see there being less of a disparity at the federal level, because it's probably true that disparities are much higher when someone is convicted of a violent crime.

(Guevara, Herz, and Spohn 2006) also said

“non-white males were treated more harshly than White males”.

Not to mention how the study has limitations:

“This study has two limitations that must be taken into account. The first limitation concerns the data. Because of the small numbers of Latino and Native American youth in the sample, these cases were combined with those involving African American youth into a non-White category. As a result, this study compares outcomes for White versus non-White youth only, rather than for each individual racial category. The second limitation of this study relates to the generalizability of our findings. The data used for this study came from two juvenile courts in the Midwest. Therefore, the results are applicable to these two courts only and cannot be said to reflect juvenile justice processing in other jurisdictions.”

It does not provide detailed information on the specific measures used in the study. The data were also very old, from 1990 to 1994. Without more detailed information on the specific measures used, it is difficult to evaluate their reliability and validity. (For example, race and gender could have been measured using self-report or official records, such as birth certificates or government identification documents. These measures should be reliable and consistent across all cases to minimize misclassification and bias. It is also important to ensure that all measures are culturally sensitive and appropriate for the population being studied. For example, if the study includes youth from diverse racial or ethnic backgrounds, it may be necessary to use measures that are translated into multiple languages or adapted to reflect cultural differences in attitudes towards the justice system.)

Additionally, the study does not provide detailed information on how missing data were handled or what steps were taken to ensure that data were accurately measured and reported.

They also could’ve just analyzed all the referrals instead of excluding thousands in County A and County B. Even in their samples, they excluded hundreds of delinquency referrals in County A and over a hundred in County B.

They also included Asian Americans in the analysis without controlling for socioeconomic status. Asian Americans have higher socioeconomic status, which throws off the distribution for non-whites. This would of course make it seem like whites are less likely to have charges dismissed.

The study does not provide information on the statistical power of the analysis to detect meaningful effects. It is possible that the sample size was not large enough to detect significant differences between racial and gender groups, particularly if the effect sizes were small. They should’ve considered conducting power analyses to determine appropriate sample sizes for detecting meaningful effects. This would help ensure that studies are adequately powered to detect differences between groups and minimize the risk of false negatives.
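
For a sense of what such a power analysis looks like, here is a minimal sketch using a normal approximation for comparing two proportions. All of the inputs are hypothetical: the dismissal rates (0.30 versus 0.35) and the per-group sample sizes (200 and 2,000) are invented for illustration and are not taken from Guevara, Herz, and Spohn.

```python
import math
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in proportions.

    Uses the unpooled standard error and ignores the negligible far tail.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    se = math.sqrt(p1 * (1.0 - p1) / n_per_group
                   + p2 * (1.0 - p2) / n_per_group)
    return NormalDist().cdf(abs(p1 - p2) / se - z_crit)


# Hypothetical dismissal rates for two groups and two sample sizes.
power_small = power_two_proportions(0.30, 0.35, 200)
power_large = power_two_proportions(0.30, 0.35, 2_000)

print(f"n=200 per group:  power={power_small:.2f}")
print(f"n=2000 per group: power={power_large:.2f}")
```

Under these assumed rates, a five-point gap would usually be missed with a couple hundred cases per group, so a non-significant result from a small sample is weak evidence of no disparity.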

They also should’ve addressed concerns about outliers and influential observations via alternative statistical methods that are more robust to these issues. For example, robust regression methods such as trimmed mean regression or M-estimation can help minimize the impact of outliers and influential observations on the results of the analysis.

The study also does not account for temporal trends in the data, which could lead to biased or misleading results. For example, if there were changes in juvenile justice policies or practices during the study period that affected outcomes for different racial or gender groups, this could drive the overall results towards null effects.

To minimize the impact of unaccounted temporal trends, researchers should carefully consider potential confounding variables that may be related to both time and outcomes. For example, they may include a variable for year of processing in their analysis to account for any changes over time.

Additionally, they should’ve conducted sensitivity analyses to examine how their results change when different time periods are included or excluded from the analysis. This can help ensure that any observed effects are not simply due to chance or unaccounted temporal trends.

It should also be noted that it says

“All of the other variables affected the decision to dismiss the charges”.

So, to the degree there is a racial bias, it seems to favor blacks when looking at females but nobody of any race when looking at males.

Stops, Searches, and Arrests

The Concentrated Racial Impact of Drug Imprisonment and the Characteristics of Punitive Counties
While White & Black Americans admit to using and selling illicit drugs at similar rates, Black Americans are VASTLY more likely to go to prison for a drug offense.
In 2002, Black Americans were incarcerated for drug offenses at TEN TIMES the rate of White Americans.
Today, Blacks are 3.7x as likely to be arrested for a marijuana offense as Whites, despite similar usage.
97% of “large-population counties” have racial biases in their drug offense incarceration.

To know how these studies are done, all one has to know is that these studies get a large sample, and then ask the respondent if they have ever used drugs recently. From there, they usually compare drug use by race to arrest rates for drug offenses by race. So the first part (asking about drug use) is based on self-reported data. In credit to Vaush, it is true that studies have found that blacks and whites use drugs at similar rates, or that blacks have higher or slightly lower drug use than whites, but blacks are arrested more often for drug offenses. For example, Johnston et al. (2002) looked at 43,700 students and gave them a questionnaire that asks about their drug use. According to Johnston et al., “Use also tends to be much higher among White students than among African American or Hispanic students.” Schanzenbach et al. (2016) reported that whites have a higher rate of drug use, as can be seen in the chart below, but black are arrested at higher rates. Using data from SAMHSA, the ACLU reported that blacks report slightly higher cannabis use in the past month and past year, but whites report higher lifetime usage (50.7% for whites compared to 42.4% for blacks). Even though whites seem to use cannabis at higher rates overall, blacks are still arrested at higher rates for drug use (Edwards et al. 2020). Human Rights Watch (2009) remarked that blacks are more likely to be arrested for drug offenses, but this can not be pinned onto higher drug use among blacks since blacks and whites use drugs at similar rates (Gorvin 2008). Edwards et al. (2003) found similar results as the above 2020 revised report. Utilizing a probability-based sample of 4,580 college students who completed an online questionnaire, McCabe et al. (2007) found that Hispanic and white students reported higher drug use in college and before entering college when compared to blacks and Asians. 
The evidence seems quite clear, and to some only a fool would deny this as many analysis have found blacks to be more likely to be arrested for drug use even though nationwide data shows similar rates of drug use (see Owusu-Bempah and Luscombe 2020; Hughes 2020; SPLC 2018; see the report by the Justice Policy Institute). Despite the overwhelming evidence showing this to be the case, these disparities are not a result of systemic racism. The null-hypothesis should not be that racism is to blame for these racial disparities, but rather it should be racial differences in how different racial groups use drugs and how often they do it. In reality, there is no reason to assume that the above findings are accurate for two strong reasons: [1] blacks are more likely to lie on self-report surveys, especially those that deal with crimes, and blacks are more likely to lie about their drug use when compared to whites, thus artificially decreasing their actual drug use rates

There are national longitudinal analyses that actually find whites report more drug usage (NLSY). Lying is extremely rare to begin with: most studies still find African Americans around 85-90% truthful in comparison to whites. So even if the underreporting findings were proven to generalize to a nationally representative sample (probably not, because they all have very poor sample sizes or geographic location issues) and were of higher quality than the studies which find that truthfulness/reliability is consistent across races, they would only shift the results from whites reporting more usage back to similar rates. One study not only failed to replicate findings of underreporting by blacks, but also found that “black males generally had the highest validity in these analyses” (Jolliffe et al. 2003, 194, emphasis added). Similarly, Alex Piquero, Carol Schubert, and Robert Brame find that the “correspondence between the prevalence estimates for the two arrest measures appears to be consistently higher for Blacks” than for whites and Hispanics (2014, 541).

So 1. lying is extremely rare.

2. There are severe sample size issues.

3. There are other studies that find black people over-reporting.

4. They may not be lying, because drugs stay in darker hair longer (up to 90 days, in fact), and black people have darker hair.

5. It’s just not representative.

; [2] racial differences in drug consumption can explain why blacks are more likely to be arrested for drug offenses, even if drug use by race is similar or lower for blacks.

What does this even mean? That seems tautological. Usage and consumption mean the same thing.

Dealing with the first line of counter-evidence, criminologists have found that blacks are more likely to underreport their actual crime rates when asked. According to Cernkovich, Giordano, and Rudolph (2000: 143), there is “evidence that black males’ self-reports of delinquency are less valid than the reports of other groups: Black males underreport involvement at every level of delinquency, especially at the high end of the continuum.”

Cernkovich, Giordano, and Rudolph (2000) actually treat under-reporting by Black males as insignificant, if not irrelevant, to research on systemic racism, because that is etiological research. They write, in response to Hindelang, Hirschi, and Weis (1981):

“African American males may provide inaccurate estimates on a variety of other measures as well. If this is the case and if misreporting is more common among serious offenders, our parameter estimates could be affected, especially if our indicators are better predictors of serious as opposed to minor delinquency (or vice versa). While this has important implications for our analysis, it would be a mistake to conclude that such measurement error invalidates the data provided by the Black males in our sample. There are several good reasons to believe that it does not. First, Hindelang et al. (1981) conclude that while differential validity by race means that self-reports are poor social indicators of the absolute volume of crime and delinquency among Black males, such data can still be quite useful in etiological research.”

(which is precisely the gist of what systemic racism research is).

“Etiological research is less interested in the absolute frequency of delinquency than with how individual or group rankings on delinquency are associated with individual or group rankings on various independent variables of interest (Hindelang et al. 1981:215-16). The latter is clearly the focus of our research. Second, Hindelang et al. note that while the differential validity problem makes comparison across groups potentially misleading, analysis within groups is not compromised. This means that we can have confidence in the relative explanatory power of our independent variables within race subgroups. A third mitigating factor is our reliance on face-to-face interviews in the collection of these data-the method Hindelang et al. found to produce the least-biased self-reports among Black males (Hindelang et al. 1981:178). Finally, our research has incorporated the most basic implication of the Hindelang et al. findings: stratification by race in both sampling and data analysis.”

Hindelang, Hirschi, and Weis (1981) report that self-reports are less valid for groups like blacks, with similar findings being remarked by Huizinga and Elliott (1986).

Hindelang has already been addressed. Again, it was based on a concordance strategy using an unrepresentative sample of local arrest records that assumed no differential validity by race in official arrest records.

The thing about Huizinga and Elliott (1986) is that the black sample is small (n=76) and it was also based on a concordance strategy using an unrepresentative sample of local arrest records that assumed no differential validity by race in official arrest records. The author even contends,

“assumption [was] seriously challenged by Geerken (1994) who concluded that there were serious racial biases in local arrest records which overstate the arrests of blacks relative to whites” (1995, 7).

and

“there is no evidence here that blacks have systematically lower reliabilities than whites for any of the measures of reliability.” (see also Huizinga and Elliott 1983). Farrington et al. (1996) go against Huizinga and Elliott and find no consistent ethnic differences in predictive validity.

Due to this, there is no reason to assume that blacks are being honest about their drug use. Although this is for crime in general, the same is true for drug use specifically. Page et al. (2009) did a urinalysis test and asked the people in their study if they have used drugs recently. After running a linear model, it was found that non-whites were more likely to say they have not used drugs recently when they in fact did.

Page et al. (2009) isn’t really from 2009, it’s from 1977.

It was only from one county (Dade County) in one state (Florida). McElrath et al. (1995) reported no race differences in the validity of self-reported drug use in Manhattan, Ft. Lauderdale, Los Angeles, and Phoenix (see also Nelson et al. 1998).

Studies like Page et al. (1977) and McNagny and Parker (1992) found that race had no effect on conditional probabilities of underreporting marijuana use, and Lu et al. (2001) found that conditional probabilities of underreporting crack use were significantly higher for Whites.

Similarly, Magura et al. (1987), Gray and Wish (1999), Hser et al. (1999), and Golub et al. (2002) reported no race differences in the conditional probabilities of underreporting drug use. Not to mention how significant interactions are likely to exist.

Page et al. (1977) noted that complex interactions in the covariates of prevarication rates should be examined.

Falck et al. (1992) looked at 95 drug users and had them do a urine test after asking them if they had used drugs recently. In Table 3, they found that blacks were more likely to falsify their self-reports on opiate and cocaine use.

Another small sample, drawn from only one location (Dayton-Columbus, Ohio), which the study itself acknowledges as a limitation:

“Another limitation of the study is the small number of subjects. These findings should be cautiously interpreted.”

It’s also subject to overreporting, given that they omitted from the multivariate analysis the cases in which individuals admitted to drug use.

Additionally, there’s the possibility that more white people could have misreported abstinence, since their system could only detect extremely recent drug use despite a 6-month interval between the testing and the questionnaire. In fact, the paper literally acknowledged this:

“The number of inaccurate reports found here could conceivably be higher since the ONTRAK® system is capable of detecting only recent (within the last week, at best) drug use. This limitation of the assay procedure tends to work in favor of participants' claims of abstinence.”

The ONTRAK is also a system that doesn’t really declare positive and negative results on its own, because reading them is left to the analyst’s discretion. So there’s the subjective nature of identifying the results. Definite positives and negatives pose no difficulty, but urine samples are encountered that contain enough drug to noticeably lessen the amount of agglutination, but not enough to inhibit it totally, hence, the microflocculation indicative of a cutoff reaction.

Using the "spectrophotometer of the eye," some analysts will inevitably read the reaction more closely than others and report a negative rather than a positive, or vice-versa. While the method is inherently more subjective than one using a mechanical device to measure a "signal" (absorbance, fluorescence, gamma radiation) to distinguish positive from negative, the coefficients of variation of the ONTRAK assays may compare favorably to those of some automated immunoassays.

Differences in the source and purity of the drugs or metabolites used to make the calibrators, and other factors, probably also have an effect.

They also didn’t take into account that different immunoassays can produce disparate quantitative and semiquantitative results from the same sample, depending on the cross-reactivity profile of the assays and the drug/drug metabolite composition of the sample. So they should’ve confirmed with other immunoassays.

There's also the problem of false negatives. False negatives may be attributed either to the lack of sensitivity (amount of change in "signal" to change in analyte concentration) of the qualitative assay in comparison to the quantitative/semi quantitative assay or to degradation of the drug in the samples with storage. From these observations, for the two most commonly detected drugs of abuse, the ONTRAK system is more likely to produce a false negative result than a false positive result.

Also, they used an incorrect cut off for benzoylecgonine (300 ng/mL) when the actual cutoff for drug testing is 150 ng/mL. For morphine it is way higher at 4000 ng/mL.

The results aren’t even good enough to conclude anything about the black population on top of that. For example, in Table 2, we see that for the “black” variable, the confidence interval was very wide (1.03 - 4.53), with a lower confidence limit of 1.03, or only 3% above 1. The overall p value for Sociodemographic and Drug Use Variables Associated with Accuracy of Self-Reports is .65, which is above the typical .05 threshold we deem statistically significant. So Table 2 isn’t statistically significant.

The same can be said about Table 3 (CI = 1.04 - 6.30, LCL = 1.04, or only 4%; overall p value = 0.31).

These are only marginal results too. The p values for the black variables are very close to 0.05 (0.04), so these results could go either way. But it's more likely to go the way of statistical insignificance given the lack of randomization for this sample. There's also the possibility of p-hacking given the small sample size. There’s also no calculation of how likely it is that black people actually give false negative results. Only 20 people actually gave false negative results. There should be regard for the likelihood of type I error. It should also be noted that Langenbucher & Merrill (2001) note that the evidence Falck et al. present isn’t conclusive.
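As a rough cross-check on how marginal these results are, the p value implied by each interval can be backed out. This is only a sketch: it assumes the reported intervals are 95% Wald intervals for ratio measures such as odds ratios (null value 1), which the passage above does not state, and the function name is made up for illustration.

```python
import math

def implied_p_value(lower, upper, z_crit=1.959964):
    """Two-sided p value implied by a 95% Wald CI for a ratio (assumed null = 1)."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    se = (log_hi - log_lo) / (2 * z_crit)   # standard error on the log scale
    log_est = (log_hi + log_lo) / 2         # point estimate on the log scale
    z = log_est / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# Intervals quoted above for the "black" variable (assumed to be 95% CIs)
p_table2 = implied_p_value(1.03, 4.53)
p_table3 = implied_p_value(1.04, 6.30)
print(f"Table 2: p = {p_table2:.3f}")
print(f"Table 3: p = {p_table3:.3f}")
```

Under those assumptions both intervals imply p values of roughly 0.04, which matches the figure quoted above and shows how close both results sit to the 0.05 threshold.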

Feucht, Stephens, and Walker (1994) looked at 88 juvenile arrestees and had them do a hair test and urine analysis. In their urine analysis, blacks were more likely to lie about not using cocaine when they in fact did, as argued by Feucht and his colleagues when they said that “However, the higher rate of urinalysis cocaine-positive results for black arrestees suggests that the higher hair assay levels may actually indicate greater use of cocaine among the black arrestees in the sample.”

A lot of problems arise with this study. These include the use of a small, non-representative sample, the lack of control for other factors that may contribute to differences in cocaine use, and the reliance on self-reported measures that are prone to bias and inaccuracies.
First, the study relies on a small, non-representative sample of arrestees. Specifically, the sample in the study consists of juvenile arrestees and detainees brought to the county juvenile detention facility who were invited to participate in the voluntary interview and urinalysis screening for recent drug use. There is a risk of self-selection bias: the juveniles who chose to participate in the study may differ from those who did not in terms of their drug use or other characteristics.
Another flaw is the presence of detention facility staff during the interviews and drug screenings, which may have influenced the responses of the participants. For example, some juveniles may have been reluctant to report their drug use due to fear of retribution or other consequences. Alternatively, some juveniles may have felt pressure to report drug use in order to conform to the expectations of the detention facility staff.
The sample is limited to juveniles who have been arrested and detained in a county juvenile detention facility, which is not representative of the larger population of juveniles or of people in general. Everyone in this sample has been involved in the criminal justice system, which is not true of the general population of juveniles.
They also use the Enzyme Multiplied Immunoassay Technique (EMIT), which has several problems that this study did not adjust for:
Lack of sensitivity: EMIT is not as sensitive as some other immunoassay methods, such as radioimmunoassay (RIA) or enzyme-linked immunosorbent assay (ELISA). This means it may not be able to detect very low levels of a substance in a sample.
Interference from structurally similar substances: EMIT is prone to interference from substances that are structurally similar to the substance being measured. This can affect the accuracy of the test and lead to false positive or false negative results. A false positive means that the EMIT detects a substance even though the individual has not actually used it. False positives can occur due to a variety of factors, such as cross-reactivity with other substances, contaminants in the sample, or interference from medications. They can lead to incorrect conclusions about drug use and may have negative consequences for individuals who are falsely accused of drug use. A false negative means that the EMIT fails to detect a substance even though the individual has actually used it. False negatives can occur due to a variety of factors, such as the timing of the drug use in relation to the test, the sensitivity of the test, or the presence of drugs that the EMIT does not detect. They can lead to incorrect conclusions about drug use and may have negative consequences for individuals who are not properly identified as drug users and do not receive the appropriate treatment or support.
Limited specificity: EMIT tests are not always specific to the substance they are designed to measure. They can cross-react with other substances, producing false positive results.
Limited dynamic range: EMIT tests have a limited dynamic range, which means they can only accurately measure a substance over a narrow range of concentrations. If the concentration of a substance is outside of this range, the test can produce inaccurate results.
Limited stability: Some substances degrade or bind to other substances in the sample, which can affect the accuracy of the EMIT test. This is particularly a problem for substances with a short half-life or that are sensitive to temperature or pH changes.
They didn’t even include certain hair specimens in the study due to shortcomings of aligning and cutting hair and the results were inflated.

Studies like this one have acknowledged Feucht, Stephens, and Walker (1994) by saying

“While prior research questions the validity of self-reported substance use (Dembo, Williams, Wish, & Schmeidler, 1990; Ehrman, Robbins, & Cornish, 1997; Fendrich & Xu, 1994; Feucht, Stephens, & Walker, 1994; Mieczkowski, Newel, & Wraight, 1998). It is important to note that these studies rely on samples of arrestees and persons in substance use treatment programs. Thus, findings that question the validity of self-reported substance use among these at-risk populations may not be generalizable to the general population. That being said, the NSDUH was designed, specifically the module of questions related to substance use, in an effort to maximize the accuracy of survey respondents.”

This one says

“It is important to note, however, that these studies examine the validity of self-reported substance use in high-risk populations, arrestees, and persons in treatment programs, not persons in the general population.”

This is because cocaine-positive urine results were obtained for only 13 of 110 black subjects.

Looking at marijuana, Fedrich and Johnson (2005) found a lower concordance rate within blacks, with the same being true for cocaine use.

This study asked about cocaine and marijuana use in specific neighborhoods, with largely low-income respondents. In fact, the study admits that Black people were overwhelmingly low SES in the data. Black people admitted to using drugs less often than White people (Black rates were roughly 90% concordant with urinalysis/blood test; White rates were near 100%).

“With the addition of SES, the odds ratio contrasting Whites with African Americans is reduced to nonsignificant point estimate value…The model suggested that those in the lower SES category had significantly reduced odds of cocaine concordance compared with those in the higher SES category.”

In other words, poor people lied about cocaine use.

Results suggest Black people may lie about marijuana use in as much as 10% of interviews.

“Discordant responses for marijuana and cocaine were less than 10% of the total respondents in the subgroup analyzed…Note that those who potentially overreported drug use were eliminated in the analysis.”

However, the authors note that overreporting was rare, particularly for cocaine. The numerical breakdown was: of 191 African-Americans in the study, 14 had a drug test come back positive when they said they hadn’t used in the last 30 days. It should be noted that the latest research says marijuana can show up in hair drug tests up to 90 days after use, and dark hair is more sensitive to such testing.
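For a sense of scale, the two counts quoted above can be turned into a share. This is just arithmetic on the numbers in the paragraph; the variable names are illustrative.

```python
# Counts quoted in the paragraph above
total_black_respondents = 191
discordant_reports = 14  # test positive, self-report negative

share = discordant_reports / total_black_respondents
print(f"Discordant share: {share:.1%}")
```

That works out to about 7.3% of the 191 respondents, consistent with the "less than 10%" figure in the quoted passage.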

The authors of the study summarized their findings with more measured phrasing than this author’s.

“The concordance rates were nominally lower for African-Americans.”

It also noted that,

“Those reporting their interview was more private had significantly elevated odds for providing concordant marijuana reports.”

Other studies actually find the opposite: not that Black people underreport, but that they overreport. A report titled “Differences in the Validity of Self-Reported Drug Use Across Five Factors: Gender, Race, Age, Type of Drug, and Offense Seriousness” analyzed Drug Use Forecasting data collected in 1994 and found:

“Black offenders are also more likely to overreport both marijuana and crack/cocaine use relative to White offenders,”

researchers found.

A more recent (but by no means current) study by NIH from 2008 again found the disparity reversed:

“Apparent overestimate of marijuana by self-report (i.e., self-report was positive and hair test was negative) was associated with being African-American.”

Other studies have also found that blacks and non-whites are more likely to report lower drug use, even though testing them shows that they’re lying (see Miyong, Hill, and Martha 2003;

The sample size is small and isn’t even community based from what I can gauge. It’s limited to only African American males and only 89 underreported. This doesn't justify a nearly 4x difference in arrest rates.

Also, this study was done in 2003, when attitudes towards drugs were more negative than they are now, so it's more likely those people would lie.

Also, the 4x difference in Vaush’s studies was specifically for marijuana, which this study doesn't control for.

I just want to clarify a few things. This study ONLY consists of African American males. If the author wants to say African Americans lie MORE than white Americans, he’d need a sample of both demographics. Looking at a study of one racial demographic is NOT informative if you want to understand differences between races. FB is conflating the differential validity of arrest reports with the differential validity of self-reported offending (i.e., drug usage).

Ledgerwood et al. 2008

The study literally says

“Apparent underreporting of marijuana (i.e., self-report was negative and hair test was positive; Table 3, middle column) was significantly related only to frequent marijuana use since 1972 (χ2 (1, n = 513) = 4.83, p < .05). Apparent overestimate of marijuana by self-report (i.e., self-report was positive and hair test was negative; Table 3, right column) was associated with being African American (χ2 (1, n = 519) = 17.15, p < .001), with not being Caucasian (χ2 (1, n = 519) = 14.85, p < .001), and being unmarried (χ2 (1, n = 518) = 11.38, p < .001). It is also associated with frequent marijuana use since 1972 (χ2 (1, n = 518) = 27.04, p < .001), having diagnoses of antisocial personality disorder (χ2 (1, n = 519) = 7.10, p < .01) or PTSD (χ2 (1, n = 519) = 8.37, p < .01) since 1972, and providing a short hair sample (< 3cm) (χ2 (1, n = 519) = 17.34, p < .001). All other comparisons were non-significant.”

Several limitations were in this study.

“Because only middle-age men were included, these findings may not generalize to women or different age groups.”

“Validity of self-report measures could be questioned. Third, the hair testing was conducted in 1996 and 1997, and some changes have occurred in testing techniques and technology since that time (e.g., marijuana; Uhl & Sachs, 2004).”

“Because confirmation tests were carried out only for positive screening tests, any false negative from screening would have remained a false negative. Thus any specificity based on screening would be as good or better than the specificity based on confirmation and any sensitivity based on confirmation would be only as good as the screening sensitivity.”

“Hairs with insufficient quantity may have occurred disproportionately in polydrug user samples because more hair was required to complete confirmation testing for other drugs.”

They also don’t conclude what he’s saying as it literally says

“overestimate of marijuana by self-report (i.e., self-report was positive and hair test was negative; Table 3, right column) was associated with being African American (χ2 (1, n = 519) = 17.15, p < .001), with not being Caucasian (χ2 (1, n = 519) = 14.85, p < .001),”

; Fendrich and Xu 1994).

Again, this study relies on samples of arrestees and persons in substance use treatment programs. Thus, findings that question the validity of self-reported substance use among these at-risk populations may not be generalizable to the general population.
Also, the study shows that African Americans do not have statistically significant odds of underreporting marijuana:
The confidence interval is wide and includes 0.

One study also found that blacks admit they'd lie about drug use when asked, especially when compared to whites; 14% versus 6% for marijuana, and 19% versus 8% for heroin (Johnston, Bachman, and O’Malley 1984).

It literally does not talk about concordance. Or race. The only time it even mentions race is when it says

“we performed a number of additional series of regression analyses to see whether the basic relations just described were altered when additional variables were included in the equations. One series involved the inclusion of a number of background and lifestyle predictors, all measured in the senior-year data collection: race, high school grades, truancy, religious commitment, political beliefs, evenings out for recreation, and frequency of dating. In no case did these measures add as much as 1% of explained variance beyond that accounted for by senior year drug use. more importantly, when these measures were combined with post-high-school role status and social environment measures as predictors of drug use, there was no appreciable change in the patterns just described.”

However, FB claims it shows that black people are lying. That doesn’t appear to be mentioned here at all. Hence this paper isn’t relevant to that discussion.

Furthermore, Ramchand, Pacula, and Iguchi (2006) noted that “African Americans are nearly twice as likely to buy outdoors (0.31 versus 0.14), three times more likely to buy from a stranger (0.30 versus 0.09), and significantly more likely to buy away from their homes (0.61 versus 0.48).” This shows that blacks are more reckless when buying drugs since it seems they’re more likely to buy it from someone they don’t know and use it outdoors where they can be caught, especially since these outdoor areas have higher crime rates where there is more police presence and blacks use drugs in areas with higher crime rates (Lagan 1995).

1. This is among drug users, not all people. 2. That data is crazy old.
The study is also antithetical to FB’s point because it still says there’s a disparity:
“The aim of the analysis was to investigate a 23-percentage-point racial disparity in connection with drug possession arrests (blacks are 36% of drug possession arrests but 13% of drug users, a disparity of 23 points). The analysis revealed that 10 of the 23 points were attributable to race-neutral factors. The analysis leaves unexplained 13 percentage points (the difference between 36% and the explained 23%). Perhaps the 13 percentage points or some portion of them reflect a practice of police unjustifiably overarresting blacks…”
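The percentage-point arithmetic in that quote is easy to lose track of, so here it is spelled out. This is only a restatement of the quoted figures; the variable names are illustrative and nothing here comes from the study beyond the three numbers.

```python
# Figures taken from the quoted passage (all in percentage points)
black_share_of_arrests = 36  # share of drug possession arrests
black_share_of_users = 13    # share of drug users
explained_points = 10        # attributed to race-neutral factors

disparity = black_share_of_arrests - black_share_of_users        # 36 - 13
explained_level = black_share_of_users + explained_points        # 13 + 10
unexplained_points = black_share_of_arrests - explained_level    # 36 - 23
unexplained_fraction = unexplained_points / disparity

print(disparity, explained_level, unexplained_points)
print(f"Unexplained fraction of the disparity: {unexplained_fraction:.0%}")
```

On these figures, 13 of the 23 points (a bit more than half of the disparity) remain unexplained after the race-neutral factors are accounted for.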

Some critics have pointed to this paragraph from Ramchand et al.: “What these numbers show is that risky purchasing patterns among African Americans and their more frequent participation in transactions can account for only a relatively small amount of the observed differential in arrest rates. According to these calculations, Whites should still be arrested at a rate at least twice that of African Americans if the only thing driving these arrests were differential purchasing patterns. Instead, we observe in the real world that it is African Americans who are arrested at a rate that is twice that of Whites.” This is supposed to show that differences in use and purchasing do not explain why blacks are arrested more often for cannabis consumption than whites. The issue is, it’s not supposed to. This line of evidence is supposed to be viewed with other lines of evidence, specifically lying about not using drugs. So, the reason blacks are arrested for drug use at higher rates is because they’re more reckless when buying and using drugs and they do them more often than whites. It just seems like they don’t do it more often because they lie on surveys. Not sure how this paragraph changes anything when you’re supposed to view it with other lines of evidence.

Yeah, FB tries to use that dataset to show at face value that behaviour differences matter, but the actual study disagrees with their conclusion. They shouldn't be citing the study to support anything "in line with other evidence" unless they can somehow show that the methodology of the study they cite is flawed in its analysis.

"So, the reason blacks are arrested for drug use at higher rates is because they’re more reckless when buying and using drugs ."

FB can't go and claim this with a study that straight up states that differences in behaviour do not influence arrest rates significantly. So FB is just bullshitting lol, you can’t cherry-pick that single aspect of the study and ignore the actual conclusion.

All these pieces of evidence, while viewed alone do not offer satisfactory explanations, do offer an alternative explanation when viewed together. In conclusion, racism can not explain why blacks are more likely to be arrested for drug offenses. As has been noted above, these studies rely on self-reported data in which blacks are more likely to lie on than whites. The fact that drug testing shows opposite results from what places like the ACLU argue should cast strong doubt on the racism hypothesis. Racial differences in drug use and consumption also show that race differences in these areas can explain why blacks are more likely to go to jail for drug offenses than whites, even if drug use by race was similar or slightly higher for blacks. Purveyors of the racism hypothesis have yet to dispute these findings, instead relying on flawed methods of proving racism for racial differences in drug arrests.

https://www.acludc.org/sites/default/files/2020_06_15_aclu_stops_report_final.pdf
This ACLU report reviews 5 months’ of data from DC police stops & searches by race and outcome.
The black population of DC is 25% greater than the white population, but black people were 410% more likely to be stopped by the police than white people
This disparity increases to 1465% for stops which led to no warning, ticket or arrest and 3695% for searches which led to no warning, ticket or arrest.
This data indicates the disproportionate stopping and searching of blacks in the DC area extended massively beyond any disproportionate rate of criminality.
The Problem of Infra-marginality in Outcome Tests for Discrimination
Analysis of 4.5 million traffic stops in North Carolina shows blacks and latinos were more likely to be searched than whites (5.4 percent, 4.1 percent and 3.1 percent, respectively).
Despite this, searches of white motorists were the most likely to reveal contraband (32% of whites, 29% of blacks, 19% of latinos).
https://drivingwhileblacknashville.files.wordpress.com/2016/10/driving-while-black-gideons-army.pdf
Between 2011 and 2015, black drivers in Nashville’s Davidson County were pulled over at a rate of 1,122 stops per 1,000 drivers — so on average, more than once per black driver.
Black drivers were also searched at twice the rate of white drivers, though — as in other jurisdictions — searches of white drivers were more likely to turn up contraband.
A large-scale analysis of racial disparities in police stops across the United States
Enormous study of nearly 100,000,000 traffic stops conducted across America.
Analysis finds the bar for searching black and hispanic drivers’ cars is significantly lower than the bar for white drivers.
Additionally, black drivers are less likely to be pulled over after sunset, when “a ‘veil of darkness’ masks ones’ race”.
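The North Carolina figures above can be put side by side as a simple outcome-test comparison: how much more often each group is searched relative to whites, against how often those searches find contraband. This is a minimal sketch using only the percentages quoted in the list; the dictionary names are illustrative, and a raw hit-rate comparison like this is subject to the infra-marginality caveat that the cited paper's title refers to.

```python
# Percentages quoted above for North Carolina traffic stops
search_rate = {"white": 3.1, "black": 5.4, "latino": 4.1}  # % of stops searched
hit_rate = {"white": 32.0, "black": 29.0, "latino": 19.0}  # % of searches finding contraband

relative = {}
for group in search_rate:
    relative[group] = {
        # Search rate relative to white drivers
        "search_ratio": search_rate[group] / search_rate["white"],
        # Contraband hit rate relative to white drivers
        "hit_ratio": hit_rate[group] / hit_rate["white"],
    }

for group, r in relative.items():
    print(f"{group}: searched {r['search_ratio']:.2f}x, hit rate {r['hit_ratio']:.2f}x")
```

On these numbers, black drivers are searched about 1.7 times as often as white drivers while their searches turn up contraband about 0.9 times as often, which is the pattern the list describes.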

Before we continue, Vaush is correct to argue that blacks are stopped and searched more often than whites. This finding has been replicated in many locations across America, but there are few problems with this. To determine if racial bias may be present, researchers use a benchmark based on population demographics. For this benchmark, researchers compare the racial distribution for x to their groups total population. For example, say that in California 45% of people arrested for drugs are black but blacks are only 12% of the total California population. Since the racial distribution for drug arrests among blacks is higher than their total population, then this is evidence of racial bias since the two distributions do not align. Readers with an IQ of 90 can see how wrong this. First of all, we should not expect racial distributions to align for most stuff; second of all, blaming this disparity on racism doesn’t work unless it can be explained by racism with evidence — not just with the existence of a disparity. The null-hypothesis should not be that racism is the cause of the disparity, because if this is so then any racial gap is a product of sin rather than of differences.

The point itself is obviously valid but a lot of these studies do in fact control for these things and still find sometimes vast unexplained differences which they say can be attributed to racism among other things, but FB here consistently finds some excuse to dismiss these things out of hand, such as the study being only carried out in Hawaii (never mind that one of their counterexamples to a different point was only carried out in New Jersey) or that it only had a sample size of 66. For FB, there’s always something, so long as it’s not accepting that the study can point to racism. It’s as desperately biased as it is blindingly obvious. No study can be perfect, and the existence of these imperfections, no matter how slight, is this author’s excuse for not accepting that racism is a causal factor here.

Let’s take another example that relates to the topic at hand. Say that on the 105 freeway in Los Angeles 54% of drivers pulled over and searched are black, but blacks are only 23% of the Los Angeles population. Using the benchmark method used by many social scientists, they find that racism is responsible for this disparity since the two racial distributions (% black drivers stopped and their population % in L.A.) do not align and thus they must reflect racial bias. Instead of the social scientists seeing if racial differences in driving behavior can explain this disparity, they just argue that it’s because of racism because of their benchmark used.

This is just the “disparity ≠ discrimination” argument. The Stanford study controls for race (when it’s dark, cops can’t see skin color). I would also argue that the alternative explanations still tie into systemic racism.

Because of this issue with this benchmark, it’s not exactly a good benchmark (see Ridgeway and MacDonald 2010 for more).

The Rand document identifies a range of benchmarks and finds problems with all of them. However, for example, PPIC data in “Racial Disparities in Law Enforcement Stops” employs census data, but also differences in stop experiences, stop contexts, stop location, stop outcomes, and enforcement agency while controlling for community demographics. Interestingly, while arguing for subtlety in method and carefulness in assessment, Rand took census data at face value. Indeed, the report assumes throughout that race is a thing in an entirely unproblematic way and that once reported it just is what it is. But this reflects neither the shifts in census options over time, the shift in individual responses to those options over time, nor the fact that third party determinations of “race” may not conform to personal views of race as part of a wider identity matrix. This is not to “rubbish” the Rand report - it was an interesting analysis for its time and on its own level; I just find it interesting that the report is seeking more scientific rigor in employing a method of classification for a thing that doesn’t even scientifically exist.

This is just saying that there's no way to quantitatively determine racism and to flag problem officers. It doesn't fundamentally challenge the findings of any of these sociological studies; it just goes "this isn't that useful for weeding out specific racist officers."

Also not gonna lie, this review study is also pretty milquetoast with its critique. It is just skeptical of social science results because they can't quantitatively determine the racism of officers, and it says "they should remain measured," but it doesn't contest that policing has a systemic problem in any meaningful way.

Another benchmark used is the hit rate benchmark. According to this benchmark, blacks being stopped and searched more is not a result of racial bias if their stop rates reflect their successful hit rates. If their stop rates and hit rates do not align (or rather, their stop rate is higher than their hit rate), then racial bias does seem to play a role. To repeat what I’ll say in the following sections, hit rates do not matter. Say you’re a campus officer and there are people who wear blue backpacks and black backpacks. While patrolling the campus, you notice that those who wear black backpacks are more likely to commit campus violations/show suspicious behavior. Due to this, you stop them more often and search them, but it turns out that those who wear blue backpacks are more likely to have contraband. Simply knowing their backpack color doesn’t help you see who to stop and search, but behavior and violations do.

Yes, this is what racial profiling is. If you’re guessing something based on someone’s race, that’s racial profiling.

The final sentence will make sense in the following sections, but this will be repeated again near the end. Overall, the evidence does suggest that non-white drivers are more likely to be stopped than white drivers. This is not an area of dispute, but the reasons for this disparity are. Regardless, let’s continue onto what the evidence for traffic stops by race tell us. Explanations for this disparity won’t be given in this section since this is more of a literature review, but it will be given in the next section. Reviewing 5 months of data, the ACLU (2020) looked at D.C. police stops and searches by race. Black people made up 46.5% of the D.C. population but made up 72.0% of stops overall, 86.1% of stops that led to no warning or ticket, and 91.1% of searches that led to no warning or ticket. Although the ACLU does say that they really can’t say that this disparity is because of racism, they do say that it can be because of racism because (1) most black stops are unjustified; (2) blacks are more likely to be stopped in white areas and; (3) blacks are more likely to be searched than whites despite whites being found with contraband more. Using a population benchmark, a California study by Durali et al. (2020) found that despite being 6.3% of the population according to ACS data, 13.4% of blacks were stopped by police. According to their findings, “a higher percentage of Black individuals were stopped for reasonable suspicion than any other racial identity group”: Furthermore, blacks were searched more often despite whites yielding contraband at a higher rate than blacks. Blacks were also more likely to be stopped and arrested in the morning than at night, something called the “veil of darkness (VOD)” — when one’s race is masked at night. According to the VOD logic, this supports the hypothesis of racial bias since black drivers are less likely to be stopped at night when one’s race is harder to make out. 
Looking at the San Diego Police Department (SDPD), Berjarano (2001) remarked that blacks only make up 8% of the San Diego population but 12% of all those stopped, and 14% of those stopped for equipment violations. Similarly, Zingraff et al. (2008) found that although blacks make up only 19.6% of licensed drivers in North Carolina, 22.9% of traffic tickets were issued to blacks. In Florida, blacks made up 22% of all seat belt citations but only 13.5% of Florida’s drivers (ACLU 2016). These citation differences could not be explained by seat belt compliance since the difference in seat belt compliance between blacks and whites was not large enough to begin with (91.5% v. 85.8%). Similar results have been found in Maryland and Illinois (Harris 1999; ACLU 2014), with whites being more likely to be found with contraband in Maryland. One of the more popular studies comes from Lamberth (2010) in New Jersey. Despite blacks making up only 13.5% of drivers on the road, they made up 42% of those stopped by police. In a more recent study looking at over 100 million traffic stops across America, Pierson et al. (2020) found that blacks and Hispanics were stopped at higher rates than whites, were more likely to be searched despite whites having more contraband, and that blacks were less likely to be stopped at night than in the morning. This study, though, did not use a population benchmark and instead used the threshold test (Simoiu, Corbett-Davies, and Goel 2017). This test uses the rate at which searches occur and their success rate (i.e. they find contraband), with them finding that the bar to search non-whites is lower. Looking at national data, Persico and Todd (2006) looked at 15 studies and looked at their hit rates, with Last (2019) making their table clearer and adding a difference(s) section: As can be seen, there are regional differences in hit rates by race. Overall, whites have a higher hit rate (i.e. being found to have more contraband) than blacks, with the difference being 2.4. Based on all this evidence, some would conclude that this disparity is due to racial bias in the criminal justice system. Indeed, this has been the position taken in some of the studies cited above, and in the media with them calling blacks being stopped and searched more than whites “driving while black” (e.g. LaFraniere and Lehren 2015; Brown 2019; Lartey 2018; Gold 2016). Good research does not just leave it here, it attempts to explain the findings instead of blaming it on a ghost (racism).

Good research also does not casually dismiss the consensus opinion of sociology (that racism does in fact exist) as a “ghost.” This is a classic case of begging the question—assuming that the racism explanation cannot be the case before rationalizing and manufacturing an excuse for why that might be.

Although blacks are more likely to be stopped and searched than white drivers, the racism hypothesis could be argued with racial differences in driving behavior.

Note the flailing for some alternative hypothesis, despite the fact that there is no study provided that states differences in traffic stops are completely explained by differences in driver behavior. FB cites separate studies that differences in driver behavior exist, but not ones that find that these effects match up.

First and foremost, the effects of race prior to being stopped are weak and statistically insignificant. As was found in Geoffrey et al. (2012), the effects of race prior to being stopped were -0.063 and not statistically significant. Thus, simply being black has almost no effect on being stopped. Since race seems to have no effect on being stopped while driving, it’s hard to think that police are racially biased in who they stop.

Other than the immensely small sample, which doesn’t allow us to draw conclusions, I would recommend FB read the methodology: it is based on structured questionnaires completed by an observer in conversation with the officer. Standard self-reporting. This report is a case study of a single region and also relies on subjective opinions for its data. The entire report is based on the expressed opinions of officers in Savannah to individuals they knew to be researchers producing a report, so the incentive to have the Savannah PD seen in a positive light is clear. As all the data is drawn from officer self-reporting of their opinion as to why they made a stop, it is not surprising that the highest correlation is with the observed behavior of the motorist and the lowest is with the race of the motorist. If you are reporting on your behavior to a researcher producing a report on your department’s activity during traffic stops, are you going with “behavior of the motorist” or “race of the motorist,” no matter what the reason?
Using the Pearson correlation coefficient to measure this is also flawed. The Pearson correlation coefficient describes only the direction and strength of the linear association between two variables; it does not tell us the size of the effect, that is, how much the likelihood of being stopped changes with race. This means that it is not possible to determine from the Pearson correlation coefficient alone the extent to which race may be a factor in the likelihood of being stopped by a police officer.
The relationship may also be nonlinear, in the sense that it may not be uniform across all racial groups or across all contexts. For example, research has shown that certain neighborhoods and communities may be disproportionately targeted for policing, which can result in higher rates of stops and arrests for residents of these areas, regardless of their race or ethnicity. In this case, the relationship between race and the likelihood of being stopped by a police officer may be stronger in certain neighborhoods or communities than in others.
It is also possible that other factors, such as the time of day or the type of vehicle being driven, may interact with race in complex ways and influence the likelihood of being stopped by a police officer. It is important to consider these and other potential factors when examining the relationship between race and policing, and to use appropriate statistical tools and methods to analyze the data.
Additionally, FB cherry picks a correlation that doesn’t account for other variables for this source. Table 21 shows a logistic regression model that controls for other factors and the opposite is shown:
“Officers are significantly more likely to form a non-behavioral suspicion when the suspect is Black (b=1.49; p<.05). The odds of a non-behavioral suspicion being formed were 4.4 times greater if the suspect were Black. There was no relationship between the race of the officer and the likelihood of forming a non-behavioral suspicion.”
Their logistic regression model for getting stopped shows a HCL odds ratio of 2.226.
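As a quick check on how the quoted coefficient and odds ratio relate: in a logistic regression the odds ratio is the exponential of the coefficient. The sketch below only uses the two figures quoted above (b = 1.49 and the odds ratio of 2.226); the variable names are mine.

```python
import math

# Coefficient quoted from the study's logistic regression (Table 21).
b_black = 1.49

# In logistic regression, odds ratio = exp(coefficient).
odds_ratio_suspicion = math.exp(b_black)
print(round(odds_ratio_suspicion, 2))  # 4.44, matching the "4.4 times greater" odds

# Going the other way: the quoted odds ratio of 2.226 implies a coefficient of about 0.80.
implied_coefficient = math.log(2.226)
print(round(implied_coefficient, 2))
```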
So essentially FB is completely misrepresenting the findings by mixing up two different but related stats. The study does not at all say what FB says. It is not saying there is no racial effect in traffic stops. It's pinpointing the racial effect to the very first moments the officer notices a person.
For this study they had observers ride around with officers on patrol and take notes on what they did with their discretionary time when they weren't responding to calls from the radio. They recorded quantitative data about the officers' activities and also took qualitative notes on the officers' reasoning and mindset behind their decisions as well as recording their own subjective impressions of officers' conduct.
The officers would basically self report to the observer when they had "formed a suspicion" and then the percent of those "suspicions" that turned into an actual stop of the suspicious person is what we see in table 12. Once an officer suspects someone, they are just as likely to act on that suspicion regardless of the suspicious person's identity, and the decision to stop someone or pass them by once a suspicion is formed is more related to factors like workload and time of day.
The second number I showed is the relative odds that officers formed suspicions about black people in the first place. Put the 4.4x odds of being suspected together with the statistically insignificant difference in chance of being stopped if suspected and you get black people stopped roughly 4.4x more often than the general population during the sampled shifts (an odds ratio is close to a rate ratio when suspicions are rare), even while the officers have an observer present with them.
So this doesn't say black people don't get stopped more. It says that the reason black people get stopped more is that police suspect them more frequently, not that they act on their suspicions more frequently.
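The reasoning above can be sketched as simple arithmetic: P(stopped) = P(suspected) × P(stopped | suspected). If the second factor is the same for everyone, the whole disparity in stops comes from the first. The 4.4 odds ratio and the 59 percent stop rate are from the study as quoted in this document; the baseline suspicion rates below are hypothetical, since the study's base rates are not quoted here, and the helper function name is mine. The sketch also shows that an odds ratio of 4.4 translates into a somewhat smaller rate ratio when suspicions are common.

```python
def risk_ratio_from_odds_ratio(odds_ratio: float, baseline_prob: float) -> float:
    """Convert an odds ratio into a ratio of probabilities for a given baseline."""
    baseline_odds = baseline_prob / (1 - baseline_prob)
    exposed_odds = odds_ratio * baseline_odds
    exposed_prob = exposed_odds / (1 + exposed_odds)
    return exposed_prob / baseline_prob

OR_SUSPICION = 4.4             # odds ratio quoted from the study
P_STOP_GIVEN_SUSPICION = 0.59  # quoted stop rate once suspected, same for all races

# Hypothetical baseline chances of arousing suspicion (NOT from the study).
for baseline in (0.01, 0.05, 0.20):
    rr = risk_ratio_from_odds_ratio(OR_SUSPICION, baseline)
    stop_rate_baseline = baseline * P_STOP_GIVEN_SUSPICION
    stop_rate_black = baseline * rr * P_STOP_GIVEN_SUSPICION
    # The stop-rate ratio equals the suspicion ratio because the second factor cancels.
    print(baseline, round(rr, 2), round(stop_rate_black / stop_rate_baseline, 2))
```

Under any of these assumed baselines the disparity in stops is entirely inherited from the disparity in who gets suspected, which is the point of the study.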

The null hypothesis should not be that race is the reason, but rather racial differences in how blacks and whites drive. Looking at the National Survey of the Use of Booster seats, we find racial differences in seat belt usage (Pickreall and Jianqiang 2009). These differences are larger between the ages of 13-15, and smaller in older age groups. Although we have these differences, they are not large enough to explain racial differences in driving stops — except at a younger age. If non-whites live in areas where police aggressively enforce seat belt violations, then blacks will be more likely to be stopped.

I would argue that this data doesn’t contradict the finding that black people are less likely to be pulled over at night, because a seatbelt would be a lot harder to see in the dark.

While there seems to be very little data looking at racial differences in driving violations, the best evidence comes from criminologist Heather MacDonald. In the context of New Jersey and North Carolina, MacDonald (2016) notes: “Though most criminologists are terrified of studying the matter, the research that has been done, in New Jersey and North Carolina, found that black drivers speed disproportionately. On the New Jersey turnpike, for example, black drivers studied in 2001 sped at twice the rate of white drivers (with speeding defined as traveling at 15 mph or more above the posted limit) and traveled at the most reckless levels of speed even more disproportionately.”

To begin with, this book has several troubling technical aspects, not to even get into the content. 1) No bibliography to tell me what she read or where she got her information from. 2) No end notes. None. Not. A. One. You're writing a NONFICTION book. You need to back up your claims. Otherwise, it's hyperbole and opinion. 3) She had 8 footnotes, which I didn't find particularly useful. 4) Really short chapters. I think she had two chapters that were over five pages. I burned through this book. 5) (And I admit this is getting into content a bit) She uses "black suspect," "thug," and "savage" somewhat interchangeably. Lastly, 6), She cherry picks her data and anecdotes. For instance, she focuses on the years 2014 - 2016 to look at the rise of a so-called attack on law enforcement. She makes claims like "From 2005 - 2014, 40 percent of cop-killers were black," yet she cites no data or source to show where and how she got that claim from. She does not acknowledge that over a 10 year period, according to the FBI
(https://www.boston.com/news/local-news/2016/01/12/are-police-officers-being-assaulted-and-killed-more/),
the trends in felonious murders of police officers were either flat or down. There are many anecdotes sprinkled through this book, such as the homeless, convicted felon in California who said, and I'm paraphrasing, theft should be taken seriously and if I get a ticket rather than go to jail for theft, us thieves won't take the law seriously. Mac Donald used this tiny anecdote to bolster her argument that the overcrowding of California prisons which caused several constitutional violations and led to prisoner releases is a failed policy and must stop. When you're writing a polemic, however, apparently facts don't matter and any individual story that supports your distorted narrative will do. Going into some of the underlying assumptions that Mac Donald builds on is the presumption that the cause of crime in Black communities is some kind of inherent flaw in Black people and won't be resolved until "the black family is reconstituted." (p. 30) This line of reasoning comes straight out of the Moynihan report. For those not familiar, you can read it here:
http://web.stanford.edu/~mrosenfe/Moynihan%27s%20The%20Negro%20Family.pdf
For a wonderful scholarly and historical treatment of the Moynihan report, check out Daniel Geary's "Beyond Civil Rights." On page 44, Mac Donald writes, "Perhaps if the media had not shrunk from reporting on the flash-mob phenomenon and the related 'knockout game'–in which black teenagers try to knock out unsuspecting bystanders with a single sucker punch–we might have made a modicum of progress in addressing, or at least acknowledging, the real cause of black violence: the breakdown of the family. A widely circulated video from the mayhem shows a furious mother whacking her hoodie-encased son to prevent him from joining the mob. This tiger mom may well have the capacity to rein in her would-be vandal son. But the odds are against her. Try as they might, single mothers are generally overmatched in raising males. Boys need their fathers. But over 72 percent of black children are born to single-mother households today, three times the black illegitimacy rate when Daniel Patrick Moynihan wrote his prescient analysis of the black family breakdown in 1965." A few corrections. First, the so-called knockout game is a myth. Many news outlets have reported on the myth. Here's a good place to start: http://www.thedailybeast.com/guess-wh.... Second, the author states that the "real cause of black violence" is the breakdown of the black family. We're expected to accept this assumption without question. Third, she talks about the birthrate of black children to single-mother households and compares that to Moynihan's report from 1965. Not only does she not cite where she gets the 72 percent number, but Moynihan's data was flawed, among many other flawed aspects of his report. He acknowledged, on page 4 of his report, that he did not have data pertaining explicitly to "Negroes" but only nonwhite people. Moynihan then used Negro and Nonwhite interchangeably throughout his report. It's bad sociology.
Mac Donald's contention, much like Moynihan's (who began his arc toward the neoconservative right in the late 1960s), was that the problem was Black folks and the breakdown of their families, not poverty, structural and institutional racism, patriarchy, or police violence. Whereas in the early 1960s up to the publishing of his report, Moynihan advocated full employment (not opportunity) through government programs, by the end of the decade he was dismissive of those programs and seemed to indicate that the problem could only be solved through racial self-help. This played well with the right and the neoconservatives of the time who started putting out reports and research, citing Moynihan, claiming racial inferiority. The scientific racism of the 19th century was alive and well in the 1970s, thanks to Moynihan. This is all to say, that when you read Mac Donald–I hope you never do–the foundational assumption of this terrible book she wrote is based on an old, outdated, problematic report that both the right and the left initially found value in. The left, for its seeming suggestion of full employment, and the right, for Moynihan's nod that the only way for there to be Black uplift was through racial self-help–which means any government program is bound to fail because of the inherent problems with Black people themselves. I think if you are going to read this book, read Daniel Geary's book first, Beyond Civil Rights, and then read the Moynihan report. Or vice versa. Then read the Mac Donald book. I propose this progression only because then you will have the historical background to review Mac Donald's book and you will have also read the source material that continually pops up in her book as a foundational assumption: the Moynihan report. The rest of the book is just hyperbole. There is not one citation backing up her statistics or claims. This is deplorable in a book that tackles a controversial subject.
This book badly needs footnotes listing specific sources for its data and its claims. This is an opinion piece that will satisfy all who blindly support the police and/or who lack critical reading and comprehension skills. So far I haven't seen a single citation of source for the statistics provided, which we all learned in 8th grade is an academic no-no. It's interesting to see just how people can be swayed by rhetoric that they believe to be factual without seeing any proof. This book is a gross manipulation of statistics to prove a predisposition held by the author. A polemic that focuses almost exclusively on the jaded, anti-black politics around policing rather than tackling the legitimate grievances of police reformers. An argument could be made that there are productive means of bridging police community relationships that are underutilized, and there are legitimate defenses of certain police practices, but this fails to give either any significant attention. Frankly, police officers should be insulted that something so inflammatory has been written and is promoted in their defense.

Unfortunately, I’m not aware of any more studies like this and it seems that racial differences in speeding aren’t generalizable besides in North Carolina and New Jersey. The BJS notes that whites are more likely to speed (Smith and Durose 2006), so it’s possible that races get stopped for different reasons. Indeed, races give different reasons as to why they’re stopped (Ingraham 2014). Another reason why blacks are more likely to be stopped is that they’re more likely to have a warrant out for their arrest and have unpaid tickets (Dolan 2016).

It is great that the study is from California, because the evidence there is antithetical. To take California as an example, outstanding warrants accounted for about 0.6% of stops of white motorists and about 1.2% of stops of black motorists in 2020. By far the most common reasons for stops were simple traffic stops (unrelated to warrants or unpaid tickets) and suspicion. I’m not sure how FB concludes that around 1% of all stops accounts for, for example, black motorists being searched in 20% of all stops compared to white motorists in 8%. You’d be better off making an argument about the demographics of inner city areas (where most traffic stops happen), but even that won’t iron out the disparities in racial profile vs outcomes.

This would be an argument for arrests, not traffic stops. People don’t just get pulled over because they have an unpaid fine. The warrant would be seen after the person is pulled over, and then they would be arrested. I don’t think cops just drive around and pull over people that they think have warrants.
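The arithmetic behind that objection can be sketched as follows. The four percentages are the ones quoted above; the sketch assumes (my assumption, not stated in the data) that the search rates and warrant shares refer to the same set of stops, and it makes the assumption most favorable to FB, namely that every single warrant stop ended in a search. The function name is hypothetical.

```python
def search_rate_excluding_warrant_stops(search_rate: float, warrant_share: float) -> float:
    """Search rate among non-warrant stops, assuming every warrant stop was searched."""
    return (search_rate - warrant_share) / (1 - warrant_share)

# Figures quoted above for California, 2020.
black_adjusted = search_rate_excluding_warrant_stops(0.20, 0.012)
white_adjusted = search_rate_excluding_warrant_stops(0.08, 0.006)

raw_ratio = 0.20 / 0.08
adjusted_ratio = black_adjusted / white_adjusted

# Removing warrant stops barely moves either rate, and the gap does not shrink.
print(round(black_adjusted, 3), round(white_adjusted, 3))
print(round(raw_ratio, 2), round(adjusted_ratio, 2))
```

Even under that generous assumption, black motorists are still searched at roughly two and a half times the rate of white motorists once warrant stops are taken out.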

A strong reason to assume that racial bias does not play a role in driving stops and searches is racial differences in suspicious behavior. If certain groups are more likely to display suspicious behavior, then cops will be more likely to stop and search them. According to the National Institute of Justice: In Savannah, Ga., trained observers accompanied police officers on 132 tours and focused on officers’ decision-making and discretion prior to a traffic stop. Officers were questioned every time a person aroused their suspicions. Of those who evoked suspicion, 74 percent were male and 71 percent were minorities. Suspicious behavior, a traffic offense, “looking nervous” or similar behavior accounted for 66 percent of the officers’ reactions; 18 percent were the result of information they had received to be on the lookout for a suspect; 10 percent because someone was where he or she would not be expected to be; and 6 percent because of the person’s appearance. Officers stopped individuals under suspicion 59 percent of the time, but the suspect’s race did not affect the outcome of the stop. The authors concluded that the results did not support the perception that a high level of discrimination occurs prior to a traffic stop. CITING GEOFFREY ET AL. (2009) Another interesting piece of data comes from Schell et al. (2007). Looking at Cincinnati, blacks had longer stops and higher search rates than white drivers. After controlling for time, place, and context of the stops, there were no differences on stop and search rates.

Did FB even read this one?

“In all three of its Annual Reports, RAND has called for a greater dialogue about how black neighborhoods are policed. [I]t may be possible to make improvements in relations between CPD and the black community by rethinking how black neighborhoods are policed. The proactive policing of motor vehicles that occur in these communities (longer stops, more searches) is likely to put a high burden on law-abiding members of these communities, and it may not match these communities policing priorities [p. 61]...Rand’s analysis again confirms that African Americans are more likely than white drivers to experience searches, passenger ID checks and other proactive actions during traffic stops. The frustration of African American drivers is evident in the videotape analysis of various stops as they are frequently impatient and suspicious toward the police officer. The parties should study the causes of this disparity and take steps to end it. Enforcement priorities and methods should be applied consistently to all drivers and by all officers. Actively engaging the community in training and policy explanation may help improve the traffic stop experience. Plaintiffs will seek to secure actual traffic stop videos to assist in community dialogue with the police on this issue.”

If these same variables were taken into account for the previous studies which argue for racial bias, then it’s highly likely that they wouldn’t find racial differences in stops and searches. Ridgeway and MacDonald (2010) note that “A comparison of the racial distribution of observed traffic violators to actual police traffic stops in the same areas suggested little evidence of racial bias in stop decisions.” What about the ACLU’s findings that blacks were more likely to be ticketed and whites were more likely to be given a warning? After controlling for different variables, Smith and Petrocelli (2001) note that “minority drivers were more likely to be warned, whereas Whites were more likely to be ticketed or arrested.”

The study’s data has several problems:

Selection bias: The sample of traffic stops in the study may not be representative of all traffic stops in the area, as it is limited to those that were conducted by officers equipped with mobile data computers (MDCs).

Limited data: The study may not have access to complete or accurate data on all relevant factors that could influence the likelihood of being stopped by a police officer, such as the behavior of the driver or the specific reasons for the stop.

Time frame: The study is limited to traffic stops that occurred between January 17, 2000 and the date of data collection, which does not provide a complete or representative picture of the relationship between race and policing over a longer time period.

Geographical limitations: The study is limited to traffic stops in Richmond, which is not representative of other areas or jurisdictions.

MDCs do not accurately capture all of the interactions between officers and the public, particularly if the devices are prone to technical issues or if officers are not using them consistently. This could result in incomplete data sets that do not accurately reflect the full scope of policing activities.

The use of MDCs may be associated with certain biases or inconsistencies in data collection. For example, if officers are more likely to use the devices in certain types of situations or with certain groups of people, this could introduce biases into the data that are not representative of the overall population.

Furthermore, the data collected through MDCs may not always be complete or accurate. For example, officers may not consistently enter all relevant information into the devices, or they may enter information in a way that does not accurately reflect the circumstances of the interaction. This could lead to inaccuracies in the data that are used to analyze patterns of racial bias in policing.

They also measure the age and years of service of officers on an ordinal scale, meaning that the data are arranged in order of magnitude, but the intervals between the categories are not necessarily equal. For example, on such a scale the difference between an officer with 1 year of service and an officer with 2 years of service cannot be assumed to be the same as the difference between an officer with 10 years of service and an officer with 11 years of service.

The limited number of traffic stops in the study that were conducted by officers who were not equipped with mobile data computers (MDCs) may disadvantage the study in several ways:

Selection bias: The sample of traffic stops in the study may not be representative of all traffic stops in the area, as it is limited to those that were conducted by officers equipped with MDCs.

Limited data: The study may not have access to complete or accurate data on all relevant factors that could influence the likelihood of being stopped by a police officer, as it is limited to a small subset of stops.

Limited statistical power: With a small sample size, it may be difficult to detect statistically significant differences or relationships in the data. This could lead to false negative results (i.e., finding no difference or relationship when one actually exists) or to results that are not statistically robust (i.e., results that may not hold up under further scrutiny).
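To make the statistical power point concrete, here is a rough sketch using the standard normal approximation for a two-sided two-proportion z-test. The proportions (10% vs 15%) and the sample sizes are hypothetical and are not taken from Smith and Petrocelli; the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    pooled = (p1 + p2) / 2
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_group)
    # Probability that the observed gap clears the critical value; the
    # negligible chance of rejecting in the wrong direction is ignored.
    return NormalDist().cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Hypothetical true rates of some outcome in two groups (NOT from the study).
for n in (50, 500, 2000):
    print(n, round(approx_power(0.10, 0.15, n), 2))
```

With only a few dozen stops per group, a real five-point gap would usually go undetected, which is exactly the false negative risk described above.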

The problem with Smith and Petrocelli is that there was only a 64% completeness rate, and the missing data could be quite different from the data collected. This is called non-response bias: the officers who did not enter data for the remaining 36% of traffic stops may have different characteristics or motivations than those who did enter data. This could introduce bias into the study and affect the validity of the findings. The study even says this:

“the missing data represent a potentially different pool of traffic stops, and so our findings should be interpreted with caution.”
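To show how much room a 64% completeness rate leaves, here is a small worst-case sketch. The observed share of 50% is a hypothetical number, not a figure from the study, and the function name is my own; the only input taken from the text is the 64% completeness rate. The idea is simply to ask what the true share could be if the unrecorded 36% of stops looked nothing like the recorded ones.

```python
def worst_case_bounds(observed_share, completeness):
    """Bounds on the true share of stops involving a group when some
    stops were never recorded and could have gone either way."""
    missing = 1 - completeness
    # Lower bound: none of the unrecorded stops involved the group.
    low = completeness * observed_share
    # Upper bound: every unrecorded stop involved the group.
    high = completeness * observed_share + missing
    return low, high


if __name__ == "__main__":
    # 0.50 is a hypothetical observed share; 0.64 is the completeness rate.
    low, high = worst_case_bounds(0.50, 0.64)
    print(round(low, 2), round(high, 2))
```

Under these assumptions the true share could sit anywhere from roughly 32% to 68%, which is why the authors' own caution about the missing data matters.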

The study also used U.S. census data on the city's population of persons 16 years of age and older as a proxy for the city's driving-eligible population. This is flawed because:

Inaccurate assumptions: The U.S. census data may not accurately reflect the characteristics of the city's driving-eligible population, as it includes all persons 16 years of age and older, regardless of whether they hold a driver's license or are actively driving.

Limited data: The U.S. census data may not provide complete or accurate information on all relevant factors that could influence the likelihood of being stopped by a police officer, such as driving behavior or the characteristics of the vehicle being driven.

Limited comparability: The U.S. census data may not be directly comparable to the data on traffic stops collected by the officers, as it is based on a different population and may not account for differences in the characteristics of those who are stopped versus those who are not.

So the study hasn’t shown that this is a valid proxy: they haven’t shown that it is related to the driving-eligible population in a meaningful way, or that it is uncorrelated with the disturbance term in the model (e.g., with unobserved factors such as driving behavior, vehicle, and time of day).

This was also just in Richmond, so it’s not representative. It is not longitudinal either; in fact, it is old data from 1990.

They also admit their self-reported data is flawed:

“Although self-reports are a common source of information in criminal justice and police research (Garner, Buchanan, Schade, & Hepburn, 1996; Garner & Maxwell, 1999; Snyder & Sickmund, 1999), the sensitive nature of these data heightens concern over their validity.”

I also recommend that FB read what they say about their results:

“The finding that minorities in Richmond were more likely than Whites to be warned rather than legally sanctioned is capable of several interpretations. For example, this finding may indicate that Richmond officers altered their behavior because of the research study. Such subject reactivity (Neuman, 1997) might also explain why White officers were no more likely than Black officers to stop Black motorists. In other words, it is possible that officers "cleaned up their act" while the research was under way. An alternative explanation for the finding that minorities were more likely than Whites to be warned by the police is that minorities may have been stopped more frequently than Whites based on weak (or nonexistent) evidence. Consistent with racial profiling practices, Richmond officers may use minor traffic infractions as a pretext to stop minority motorists; once their suspicions are dispelled, they may send those minority drivers on their way with only warnings to show for the experience. This explanation is consistent with Hepburn's (1978) conclusion that minorities were more likely than Whites to be arrested under conditions that would not support the bringing of formal charges. Similarly, Richmond officers may be more likely to stop minorities than Whites for reasons that will not support the issuance of a traffic summons.”

One of the more important things that should be responded to is the VOD findings. To give a quick refresher, the VOD refers to the fact that blacks are more likely to be stopped in the morning than at night when the driver’s race is harder to see. The original “sunset” and “veil of darkness” study referenced in the Stanford paper DOES mention work variances (i.e. differences by race that could lead to more blacks stopped by police in the day), which is a detail completely omitted in the Stanford study. As the original veil of darkness study says, “For a number of reasons, the assumption of constant relative risk is restrictive. One reason for this is that temporal travel patterns may vary by race due to differences in hours of work. If so, then the race distribution of the at-risk population may vary by time of day. Racial differences in police exposure or driving behavior could also cause the relative risks to vary.”
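For readers unfamiliar with how a veil of darkness comparison is usually summarized, the sketch below computes an odds ratio and an approximate 95% confidence interval from a 2x2 table of stops. The counts are hypothetical, not from Grogger and Ridgeway or the Stanford paper, and the interval uses the usual log odds ratio approximation. An odds ratio above 1 only points to profiling if the constant relative risk assumption quoted above holds.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a: stops of black drivers in daylight
    b: stops of other drivers in daylight
    c: stops of black drivers in darkness
    d: stops of other drivers in darkness
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of the log odds ratio (Woolf approximation).
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se)
    upper = exp(log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts for stops made inside the same clock-time window.
    estimate, lower, upper = odds_ratio_ci(300, 700, 250, 750)
    print(round(estimate, 2), round(lower, 2), round(upper, 2))
```

If travel patterns or driving behavior differ by race across daylight and darkness, the same table can produce an odds ratio away from 1 without any change in officer behavior, which is the caveat the original authors raise.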

The assumption about behavior being the same is a big assumption.

They also say that “In the case of the Oakland data, our approach yields little evidence of racial profiling, and our sensitivity analysis suggests that the departures from our maintained assumptions would have to be substantial to overturn our conclusions” (Grogger and Ridgeway, 2006).

We understand those variables can factor into the results but FB would have to prove that they actually matter. Every study has variables it doesn’t control for.

Also, the two studies are different. The Stanford study measured black people who got stopped in a short period before it was dark against the morning. It doesn’t make sense that black arrests would decrease that much in less than an hour.

Additionally this report has two problems:

For one, this report analyzed data that fail to capture the unseen selective process through which police come to engage civilians, a process that prior work strongly suggests may be a function of citizen race (Gelman, Fagan and Kiss, 2007). In this way, this report fails to account for the impact of race on the composition of the sample under study. Failing to account for this undocumented first stage of the police-citizen interaction will lead to statistical bias, even if the goal is to estimate the effect of suspect race within the sample of individuals who appear in police data and, in many cases, even with a “complete” set of control variables that render civilian race as-if randomly assigned to police encounters.
Second, despite making at least implicitly causal claims, the report leaves ambiguous the precise quantity of interest: whether it be the total effect (TE) of race in all encounters; the total effect among the subset of encounters appearing in police data because a stop was made (TE𝑆), which differs tremendously from the TE; or the markedly more restrictive and difficult-to-interpret controlled direct effect among the same subset (CDE𝑆). While studies commonly discuss omitted variable bias and attendant assumptions, they rarely discuss the additional assumptions necessary to identify specific causal quantities of interest. As a result, readers are unable to assess the adequacy of research designs and estimators, rendering the interpretation and policy relevance of much prior work ambiguous.

It’s highly possible that the authors of the Stanford paper just made an assumption on what their data could mean rather than testing this hypothesis. Even the original VOD study found no racial bias and said that blacks being more likely to be stopped in the morning than at night can reflect relative risks.

But does this paper address the data for the test in Texas? There is only the Oakland related one. If that’s the case, then factors in Texas could just be different. As for the differences in work factors, the Stanford study compares black people being stopped within a short time window of less than an hour, and work differences would be irrelevant in a 30 minute time period. The Stanford study seemed to be nationwide, whilst the RAND study only focuses on Oakland, which gives it a much smaller sample pool in comparison, meaning that what was found in Oakland may not hold in every other city.

It should also be noted that another factor that could be causing blacks to think that them being stopped by a cop is due to racial bias is the fact that when stopped by an officer of a different race, they think that the stop was not legitimate when compared to when stopped by an officer of the same race (Langton and Durose 2013).

I’m not sure how the person feels getting stopped has much to do with the point that they are getting stopped and how frequently their demographic gets stopped.

FB is also pretty wrong on this. What the BJS found is that there was a 29% difference in perceptions of proper behavior between black and white officers in perceived illegitimate stops of black motorists. The percentage of black motorists who felt the stop was illegitimate when stopped by a white officer was essentially identical to the perception when stopped by a black officer (29.8% vs 29.3%). In fact, the ratio gap is even wider with white drivers (16% vs 17.7%). The only notable difference was where black motorists were stopped by Hispanic/Latino officers, and given the percentage of H/L officers involved in these incidents anyway, that is a very small sample for FB to draw such a sweeping conclusion.

All these variables can lead to blacks being stopped more and being searched more, especially when there are racial differences in suspicious behavior when driving. If one exhibits odd behavior and gets stopped by an officer, it’ll increase their chances of also being searched. Bringing up hit rate by race does not matter. Although whites are more likely to have contraband, police just can’t stop every white person. They have to stop someone who looks suspicious or who is committing traffic violations. Going back to Geoffrey et al., being suspicious was correlated with the chances of being stopped by an officer. The reasons for suspicions were coded into (1) appearance, (2) behavior, (3) time and place, and (4) information. According to the researchers, “Appearance” refers to the appearance of an individual and/or vehicle, and can refer to things such as distinctive dress, indicators of class, vehicle type, color, condition, and the like. “Behavior” refers to any overt action taken by an individual or vehicle that seemed inappropriate, illegal, or bizarre. “Time and place” refers to an officer’s knowledge of a particular location (e.g., park, warehouse district) and what activities should or should not be expected there after a particular time (e.g., after hours). Finally, “Information” refers to information provided by either a dispatcher or fellow officer (e.g., BOLO). Blacks had a higher rate of suspicion formed than whites but were a lower % of stops made when compared to whites. Thus, it does not seem race is responsible for blacks being stopped at a higher rate when driving than compared to whites.

Can FB explain how suspicions being overwhelmingly formed of black motorists and stops being overwhelmingly of black motorists acts to disprove “driving while black”? Even the authors don’t try to draw conclusions from the fact that the lower stop numbers from suspicion numbers is slightly greater for black motorists than white motorists. All they can responsibly do is speculate that it might be that because where non-behavioural suspicions are formed they are overwhelmingly formed of black motorists, such suspicions may tend to be “unfounded or inefficient”. As the overwhelming majority of suspicions and stops are of black motorists (the statistical definition of driving while black) you simply can’t conclude that what this actually does is disprove driving while black. They don’t try to draw a conclusion on that tiny data point, no. All they do explicitly is mention that it exists and later speculate on possible explanations.

They also never try to link it to any notion of bias refutation (which is perhaps your basic problem in trying to squeeze a refutation of driving while black from a tiny data point in a very large report: this is simply not what the report was designed to do; rather, it seeks to suggest efficiency and training improvements for small city, relatively low intensity policing).

Thus, it does not seem race is responsible for blacks being stopped at a higher rate when driving than compared to whites. However, a response to this may be that racial bias does play a role since blacks are less likely to be stopped at night than in the morning. The most widely cited study on this issue, as already noted above, is Pierson et al. As can be seen from their table below, blacks are less likely to be stopped at night than in the morning. However, there are significant limitations. The first limitation is the lack of effect size given. An effect size would allow us to see how big the difference between blacks stopped at night and in the morning is, to begin with, but instead, we only have percentages to go off of. This could lead to overestimates on the size of the difference since a percentage could lead to either small, medium, or large difference.

The claim there is no effect size is false:
There are effect size indicators in there, just not the ones we're both used to seeing. Given the keywords, it seems pretty clear to me that effect sizes like Cohen’s d and Pearson’s r have been included, though.
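Even where a paper only reports percentages, a standardized effect size can be recovered from them. As a minimal sketch (my own illustration, not the method used by Pierson et al.), Cohen's h converts two proportions into a single effect size. The 30% and 25% below are hypothetical values, not figures from the paper.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions,
    based on the arcsine square-root transformation."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


if __name__ == "__main__":
    # Hypothetical shares of stops involving black drivers,
    # in daylight versus darkness.
    print(round(cohens_h(0.30, 0.25), 3))
```

By Cohen's conventional benchmarks, values near 0.2 are small, 0.5 medium, and 0.8 large, so percentages alone are enough for a reader to judge the rough size of a gap.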

The second limitation is their lack of adjustment for driving violations. This is important since if races do differ in driving behavior, then not controlling for driving violations can lead to misleading results under the VOD model.

On what basis can we conclude or even assume races have different driving behavior?

Looking at San Diego, Chanin et al. (2016) found that once driving violations were adjusted for, the black-white difference in stops under a daylight model was statistically insignificant and the OR was almost at 1.00, showing no racial differences in being stopped in the morning and at night. The same was even true for Hispanic-white differences.

This is because of the unreliable dataset being analyzed which the study pointed out and FB omitted:

“Records of traffic stops conducted in 2014 and 2015 were often incomplete, raising questions as to whether data generated by the SDPD’s traffic stop data card system are a reliable measure of actual traffic stops conducted”.

In fact, the same author found in another study that there’s significant underreporting:

“Findings indicate a 19 percent error rate in stop data submitted between 2014 and 2015, amidst evidence of substantial underreporting.”

Besides, the report literally confirms the veil of darkness study. Not necessarily as clear cut as other studies but:

“disparities between Black and White drivers were evident in vehicle stop data from 2014…Data from both 2014 and 2015 revealed distinct and divergent stop patterns…Narrowing the focus to the division level revealed strong and consistent disparities in the day-night stop rates among Black and Hispanic drivers stopped in the Northeastern division.”

Schell et al. (2017) looked at stop differences, stop length, search rates, and hits, and stops through daylight differences in Cincinnati. After adjusting for time, the context of stop, and place, there was no racial bias in driving stops in the daylight and dark. After adjusting for the prior variables, there was no racial bias in stop duration and even searches. The same was also found for hit rates once confounders are adjusted for.

The inherent limitations and flaws in this report’s measurements make the results inconclusive:

“In approximately one-quarter of the recordings, either the video or the audio was of poor quality (e.g., camera was not aimed so that driver and officer were in the field of view, or the audio quality would not allow coders to understand the driver). The number of cases in which the video record was not complete (omitting either the beginning or end of an incident) dropped to 3 percent…The fact that an effect is not significant within every year’s data should not be interpreted as a change in police or driver behavior across years but as an inherent limitation of working with a random sample of 300 incidents. Analyses of the communication variables have somewhat less power, due to the incomplete data caused by inaudible audio…The actual content and quality of the recordings presented real limitations on what measures could be reliably extracted from these interactions. Specifically, the single camera position (almost always 30–50 feet behind the stopped driver); low video resolution; single, lapel-style microphone on the officer; and high ambient noise limited the measurements that could be taken from analysis of the recordings.”

Even in England, there seem to be no consistent differences in the racial proportions of those stopped in the morning and at night. While not adjusting for moving violations, there seems to be no consistency overall (Waddington et al. 2004).

Hallsworth (2006) builds upon the earlier work conducted by Waddington et al. (2004) which had questioned the use of residential figures as an appropriate baseline for assessing whether the exercise of police stop and search powers in an area was proportionate and which, in opposition to this, advocated as an alternative profiling the ethnic profile of the available street population. The street surveys were able to accomplish this. While their findings confirm Waddington et al.’s (2004) claim that using residential populations figures as a baseline to assess disproportionality in stop encounters is problematic, our findings did not confirm their contention that the exercise of stop powers were, as they argued, ‘proportionate’.

What about the findings that blacks were more likely to be ticketed and whites were more likely to be given a warning? After controlling for different variables, Smith and Petrocelli (2001) note that “minority drivers were more likely to be warned, whereas Whites were more likely to be ticketed or arrested.” Thus, neither driving while black nor even the veil of darkness argument can be supported by the idea of racism. Rather, these issues are due to racial differences in driving behavior, and adjusting for these confounders makes the gap go away, showing racial differences to mediate these differences. To repeat myself, say you’re a campus officer and there are people who wear blue backpacks and black backpacks. While patrolling the campus, you notice that those who wear black backpacks are more likely to commit campus violations or show suspicious behavior. Due to this, you stop them more often and search them, but it turns out that those who wear blue backpacks are more likely to have contraband. Simply knowing their backpack color doesn’t help you see who to stop and search, but behavior and violations do.

Note the complete failure to engage with the notion that law enforcement’s concept of what constitutes “suspicious behavior” might be informed by bias against said population.

In conclusion, racial differences in driving violations and behavior explains why blacks are more likely to be stopped and searched, even though whites are more likely to have contraband on them. Contrary to media and political narratives, “driving while black” is a result of racial differences rather than of racial bias. This line of argument, while popular, does not make a good argument as to why blacks are stopped and searched more often than whites.

Judges, Juries & Prosecutors

Demographic Differences in Sentencing: An Update to the 2012 Booker Report
Extensive multivariate regression analysis indicates black male offenders receive 19.1% longer federal sentences than similarly-situated white male offenders (white male offenders with similar past offenses, socioeconomic background, etc.)
This disparity seems to stem mostly from black males being 21.2% less likely to receive non-government-sponsored downward departures or variances. Non-government-sponsored departures and variances refer to deviations from standard sentencing guidelines due to judicial discretion.
Black males who do receive non-government-sponsored departures and variances still serve 16.8% longer sentences than white males on average.
In contrast, when sentencing length follows standard guidelines, that disparity is only 7.9%, and a substantial assistance departure for both groups nullifies that disparity.
IN SUMMARY – much of the sentencing disparity between similarly situated black males and white males comes down to judicial discretion to deviate from standard sentencing guidelines.
BONUS – regression analysis suggests violence in a criminal’s history does NOT explain sentencing disparities between black males and similarly situated white males – the effect of that factor seems to be statistically insignificant.
https://sci-hub.tw/https://onlinelibrary.wiley.com/doi/abs/10.1111/jels.12077
A study of first-time felons in Georgia found black men received sentences of on average 270 days longer than similarly-situated white males.
However, when black males were differentiated by skin tone, it was found light-skinned black men saw virtually no disparity in their sentencing while dark-skinned black men actually saw a disparity of around 400 days in prison.

The issue of racial bias in sentencing has been a long standing issue in criminology. In fact, there have been about 5 waves so far, according to Alexander (2014). The first wave had poorly done studies that found large amounts of racial bias (readers should keep in mind that this hypothesis is not tested, it’s based on the interpretation of what the remaining disparity could mean), the 2nd wave in the 1980s controlled for more things and found no racial bias,

With the possible exception of the implementation of death sentences in the South. Also the 2nd wave re-analyses also indicated that race may have an indirect discriminatory effect operating through other variables or race interacted with other factors to influence decision making.

and the 3rd wave went to look at more things besides sentencing only.

Research in this wave indicated that racial discrimination occurred in both overt and more subtle forms in at least some social contexts.

The 4th wave was just like the 2nd wave, but it found that the best done studies found little evidence of discrimination: “Langan’s interpretation matches those of other scholars such as Petersilia (1985) and Wilbanks (1987) in suggesting that systemic discrimination does not exist. Zatz (1987) is more sympathetic to the thesis of discrimination in the form of indirect effects and subtle racism. But the proponents of this line of reasoning face a considerable burden. If the effects of race are so contingent, interactive, and indirect in a way that to date has not proved replicable, how can one allege that the “system” is discriminatory?” The fifth wave found a decrease in racial bias, but still found racial bias. It’s unknown why there’s such a discrepancy in the literature, and why controlling for legal variables does not seem to close the gap.

Ok, so FB is acknowledging that multiple studies have found racial disparities in sentencing? And the ones that didn’t were from the 80s? Yes, everyone is aware of the weaknesses in the early literature, which is why Mitchell 2005 addresses all of these but still finds sentencing disparity.

Regardless, other studies have also found support for the racial bias hypothesis and against it (e.g. Sweeney and Haney 1992; Everett and Wojtkiewicz 2002; for a discussion on how the race variable is not statistically significant and is manipulated by how researchers conduct their study, see Pratt 1998;

So Pratt says

“see Zatz, 1987). According to the above analysis, the estimated effect size of race on sentencing decisions does not approach the magnitude of those associated with the legally relevant variables (primarily the seriousness of the offense). At this level of aggregation, however, the race effect is diluted by differences in the operationalization techniques by various researchers. Isolating the methodological differences in racial classification approaches illustrates how the effect of race on sentencing outcomes may be concealed. Given this condition, it then becomes necessary.”

So apparently the whole point of Pratt is that legal considerations are relevant to sentencing, but the 7-9% disparity that still remains represents thousands of people treated unfairly by the system.
Also Pratt notes that the race variable isn’t necessarily statistically insignificant because of type II error:
“A note of caution may be warranted here because it is not only counterintuitive, but (arguably) methodologically incorrect to take seriously a measure indicating the strength of a statistically insignificant relationship. Hunter, Schmidt, and Jackson (1982) and Schmidt, Gast-Rosenberg, and Hunter (1980), however, argue that tests of statistical significance may be irrelevant in these cases, since the procedure of hypothesis testing is contingent primarily upon sample size.”
They also say that statistical insignificance in their results is due to lack of race operationalization. When they test this, race is a statistically significant variable in sentencing.
Pratt 1998 seems ineligible because of no empirical analyses. Pratt says the race variable is statistically insignificant when controlling for offense severity however he only analyzed data from the 60s, 70s, and 80s, but crime decreased in the 90s. So this paper is irrelevant.

FB mentions Mitchell 2005, however Mitchell addresses Pratt 1998. Mitchell’s meta analysis significantly expands and extends Pratt’s earlier work in several important ways.

First, unlike Pratt (1998), Mitchell reviews published and unpublished studies.

Second, Pratt’s research excluded discrete sentencing outcomes (e.g., incarceration decisions); in contrast, Mitchell synthesizes both continuous and discrete sentencing outcomes.

Third, and most importantly, the focus of Mitchell is on explaining why the findings of sentencing research vary so dramatically; an issue neglected in Pratt (1998).

Mitchell 2005 found the race variable to be statistically significant but small and highly variable).

High variability is just explained by differences in methodology between studies.

Even in the Mitchell study, though, racial bias was found. So, is there a variable that closes the gap to the point where it’s statistically insignificant? Yes. As has been well supported by scientific institutions and the overall literature, races differ in average IQs (see Shuey 1966; Coleman et al. 1966; Garner and Wigdor 1982; Lynn 2011; Roth et al. 2001; Chuck 2013; Neisser et al. 1996).

Right, so the argument would be against the validity of IQ and the theory behind what IQ purports to measure (e.g. we don't know what intelligence means, there is no unit for intelligence, etc.). Furthermore, even if we accept IQ, the Black-White IQ gap has been decreasing; see Smith 2018.

IQ also correlates with many social variables like educational attainment, income, job performance

1. Correlation ≠ Causation.

2. You can’t use measures which are roughly normally distributed (like IQ) to predict other things that are not normally distributed. Most of the relationship that social scientists see is “noise” or unexplainable, random variation in results.

(Strenze 2006, 2015)

Many issues can be found (besides Berka-Nash, the main critique of these studies):

“Some readers may be tempted to say that success is a purely subjective phenomenon, which each individual defines for oneself. That is certainly true, but it seems that there is usually a high degree of consensus in society as to what is desirable and what is not.”

This doesn't really deal with the argument. Notice that despite the point made above, we would still criticize the lack of a unified definition. This is because even if people supposedly have a consensus on one definition, researchers have yet to define the terms they derive statistical value from.

Notice as well that the study will ascribe some "genetic basis" (component) to intelligence from Jensen's values for the heritability of intelligence. This runs back into the various arguments on a misunderstanding of heritability we have discussed before.

He then lists the correlations with IQ. It is unclear what we are supposed to do with the smaller correlations with academic performance, such as .23 for high school students. This is problematic, as teacher assessments correlate better with academic performance than IQ tests do. IQ tests also circularly confirm academic tests.

His sections on job performance and IQ repeat misinformation from two notorious studies. I have far too much research on why supervisor ratings will not determine who is "more intelligent", so I could send the studies to you directly instead.

Also unclear is why we should focus on Education attainment considering Lee et al.'s results and critiques in 2018....3 years after Strenze's paper here.

He then cites Kanazawa to explain the findings above with an evolutionary theory of intelligence, but such theories have been critiqued in depth in the past. See:

https://notpoliticallycorrect.me/2019/06/13/how-things-change-perspectives-on-intelligence-in-antiquity/
It is also unclear why Strenze cites Herrnstein and Murray's conclusion in The Bell Curve that "western society evolves toward genetic hierarchy where people with 'good genes' live in luxury and people with 'bad genes' struggle to survive," considering it has already been refuted. See:
https://www.sociologicalscience.com/download/vol-3/july/SocSci_v3_520to539.pdf

and a bunch of other variables

If IQ is proven to be invalid these correlations are just correlations that mean nothing. Even then there are issues see:
https://developmentalsystem.wordpress.com/2019/11/05/the-predictive-invalidity-of-iq/

[readers who doubt the validity of IQ should know that it has a higher mean statistical power than other areas of science, like neuroscience and psychology, for example: in Last 2019].

I assume this was the classic "drawing correlations from images" joke? This was addressed when Last 2019 was posted.

Since IQ is a valid construct with predictive validity and is taken seriously, despite what some may lead you to believe, it should be asked if this variable can close the black-white gap in sentencing. Beaver et al. (2013) found that after controlling for past criminal record and IQ, the sentencing gap between blacks and whites became statistically insignificant, and this time it found the race variable to already be statistically insignificant like past studies.

This isn’t even the sentencing graph
The fundamental problem with Beaver 2013 is that the analysis was underpowered. The paper says, based on NHST results, that it finds no evidence of racial discrimination, but this is a type II error. Beaver et al. (2013) only used a subsample of African American and White males and did not include any measures of disadvantage status. Beaver et al. (2013) limited their study to comparisons between White and Black males. Race categories that do not include Hispanics can mask disparities between groups by inflating arrest and incarceration proportions among Whites and deflating the proportions among Black people. Expanding the race category to include Hispanics will avoid this. Prior research has clearly demonstrated the connection between criminal offending and socioeconomics, and these variables may also be connected to criminal justice outcomes. Additionally, their results aren’t as reliable because they use negative binomial regression. By using an extra parameter, their precision is decreased. Here is an example we can use to represent how we may use the original argument. Disregarding that IQ itself was not statistically significant in the study, see the first argument made in this post. It all comes back to one argument! Besides, Beaver et al. (2013) asserts that differences in arrests between black and white Americans can be explained when Verbal IQ and Lifetime Self Reported Violence is accounted for, yet Schleiden et al. 2020 finds that “Although the Differential Involvement Hypothesis believes that minorities do in fact commit more crimes, studies have found that there is more to the overrepresentation of minorities in juvenile and justice systems than criminal behavioral differences. Longitudinal studies found after controlling for differences in offending, racial differences in police contact remained significant (Gase et al., 2016).
Additionally, studies controlling for factors such as criminal behavior, substance use, and mental health issues, found minority youth continued to be more likely involved in the justice system (Gase et al., 2016)... Neither contextual nor behavioral differences account for the arrest disparity between those who are Black and those who are White.” This means that verbal IQ and lifetime self-reported violence are not the be-all and end-all for explaining racial disparities in prison sentencing and the justice system. Additionally, the Beaver et al. study used an outdated IQ test, did not fully account for violent crime, and does not close the disparity all the way.
A few sources also directly challenge the evidence. See this passage from Burt and Simons (2014); the connection to Beaver et al. (2013) comes at the end: “Beaver (2011a: 282) investigated ‘genetic influences on being processed through the criminal justice system’ using the subsample of adoptees included in the Add Health Study. Although Beaver (2011a) found that adoptees whose biological parents ‘had ever spent time in jail or prison’ were significantly more likely to have contact with the criminal justice system, this finding is vitiated by several serious limitations, including those mentioned. Perhaps the most significant of these is the study sample. The only requirement for inclusion in the study sample was that the respondent indicated that he or she was adopted sometime before the survey (which took place when youth were in grades 7 through 12) and did not currently live with a biological parent. Because of data limitations, Beaver (2011a) could not ascertain the age at which the children were adopted and did not control for contact with biological parents. Additionally, information about the biological parents’ incarceration or lack thereof came from the adoptees themselves, and only respondents who were aware of their biological parents’ jail or prison experiences were included in the analyses. (Adoptees who answered ‘I don’t know’ to the question of biological parents’ prison or jail experience were excluded.) As such, those respondents who had no knowledge about their biological parents’ jail or prison status, almost certainly those who had the least contact with their biological parents (and could not be influenced by potential labeling processes involved in having a criminal parent), were not included in the analyses. This same Add Health adoption subsample and model also was used to ‘estimate genetic influences on victimization’ (Beaver et al., 2013: 149).”
Felson & Kreager (2014):
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1020.5126&rep=rep1&type=pdf
Apparently this might respond to the claims made in Beaver et al. 13 (???)
https://en.wikipedia.org/wiki/Race_in_the_United_States_criminal_justice_system
A 2015 study found that African American males only had a higher likelihood to commit violent crime than White males, with a similar likelihood to commit property crime and a lower likelihood to consume drugs. This study used a notably larger sample size than Beaver et al. (n=18060 compared to n=1197 since Beaver et al. limited themselves to looking at those with complete data on race, age, IQ, and self-reported lifetime violence), and failed to observe consistent effects of race for different offenses. They went so far as to posit "The inconsistent pattern challenges the stereotypical image of the criminality of Black communities. It is also a challenge to the idea that crime theories can explain race differences." Also no incriminating evidence of Beaver exists.
https://link.springer.com/article/10.1007/s10560-019-00618-7
Although not a direct refutation, it controls for the actual number of arrests instead of the mere probability of arrest. It also uses the Add Health sample and doesn’t have the limited generalizability that Beaver admits his sample may suffer from.
https://sci-hub.do/downloads/2020-06-15/83/10.3102@0013189X20932474.pdf
This refutes it.
So, Beaver et al. (2013) find that IQ was a non-significant predictor of sentence length, and they did not control for nonviolent criminal involvement. They controlled only for violence, which renders their measure of criminal history insufficient relative to measures that include both nonviolent and violent offenses.

Some commentators in the past have said that the 0.05 difference shows that there still is in fact a gap, but the gap is not statistically significant in the multivariate model after controlling for lifetime violence and IQ. Although it has been noted that effect sizes are better than statistical significance, that difference of 0.05 is very weak, and it’s doubtful race of the offender can explain these results. The remaining disparity can most likely be explained by legal variables. Starr and Rehavi (2012) 58,000 federal cases and found that 83% of the black-white sentencing gap can be explained by differences in criminal record, arrest offense, gender, age and location. The remaining disparity was a result of charging differences. (Starr and Rehavi 2014 found a 10% disparity, but that was due to charging differences.)

Yeah, so the remaining 9-13% is unaccounted for. And you can say it’s charging differences, but I don’t see how that solves the problem. Charges can be applied differently based on racial prejudices, so “charging differences” could represent the prejudiced attitudes themselves. Literally the next source by the same authors says this: “Using quantile regressions, we estimate the size of racial disparity across the conditional sentencing distribution. We find that the majority of the disparity between black and white sentences can be explained by differences in legally permitted characteristics, in particular, the arrest offense and the defendant’s criminal history. Black arrestees are also disproportionately concentrated in federal districts that have higher sentences in general. Yet even after we control for these and other prior characteristics, an unexplained black-white sentence disparity of approximately 9 percent remains in our main sample. The disparity is nearly 13 percent in a broader sample that includes drug cases. Estimates of the conditional effect of being black on sentences are robust, fairly stable across the deciles, and economically significant. There are approximately 95,000 black men in federal prisons. Eliminating the “black premium” that we identify would reduce the steady-state level of black men in federal prison by 8,000–11,000 men and save $230–$320 million per year in direct costs.”

Franklin and Henry (2019) found that blacks got sentenced longer than whites, but the difference was only 1.6% rather 19.1% as in the Booker study.

That’s a pretty interesting finding for a paper published in a journal with such a low impact factor (2.827). Such a finding could easily appear in a very high-profile journal, so this is a strong indicator that something is not right about the story. The study might be underestimating the difference in sentencing once its standard error is taken into account: constructing a 99% confidence interval, the gap could actually be as much as 3.4%. They also controlled for mandatory minimums, which drive a lot of the sentencing gap. This is relevant since mandatory minimums and race are correlated. For this reason, the model should be checked for multicollinearity with variance inflation factors (VIF), since multicollinearity is common in these observational studies. The US Sentencing Commission’s Demographic Differences in Sentencing: An Update to the 2012 Booker Report ran a similar regression on the same dataset over a longer timeframe and arrived at a slightly higher estimate. Large cross-sectional regressions are difficult to compare without deep domain knowledge, and preferably robustness tests that make many such comparisons preemptively. Unfortunately, though the report was published two years earlier, it’s not commented on in the Franklin & Henry paper FB linked, and neither contains robustness analyses. (However, footnote 8, for example, comments on some differences in control variables.)
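As a rough check on the confidence-interval point, the sketch below builds a normal-approximation interval around a point estimate. The paper's standard error is not quoted here, so `implied_std_error` back-calculates it from the 1.6% estimate and the 3.4% upper bound stated above. That value and all three function names are illustrative assumptions, not figures or methods from Franklin and Henry.

```python
from statistics import NormalDist


def z_critical(level: float) -> float:
    """Two-sided critical value for a normal-approximation interval."""
    return NormalDist().inv_cdf(0.5 + level / 2)


def normal_ci(estimate: float, std_error: float, level: float = 0.99):
    """Return (lower, upper) bounds of a symmetric normal-approximation CI."""
    margin = z_critical(level) * std_error
    return estimate - margin, estimate + margin


def implied_std_error(estimate: float, upper: float, level: float = 0.99) -> float:
    """Back out the standard error that would put `upper` at the top of the CI."""
    return (upper - estimate) / z_critical(level)


# Figures from the text: 1.6% point estimate, 3.4% upper bound at 99%.
# The standard error is NOT taken from the paper; it is implied by those two numbers.
se = implied_std_error(1.6, 3.4)
lower, upper = normal_ci(1.6, se)
print(f"implied SE: {se:.3f} percentage points")
print(f"99% CI: {lower:.2f}% to {upper:.2f}%")
```

Under this approximation the interval is symmetric, so it is as wide below the estimate as above it; the point estimate on its own says little about the size of the true gap.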
The paper also shows the true disparity is likely masked. The “1.6%” figure is not zero, and does not fully represent the authors’ modeling results:
“In the present analysis, for example, our baseline models indicated that African American offenders were sentenced fairly similarly to White offenders (1.6% longer). Our interactive models, however, demonstrated that this was not the case—African Americans received notably longer sentences at the low end of the criminal history scale and notably shorter sentences at the high end of the scale. […] At a criminal history level of 1, Black offenders received sentences that were approximately 7.4% longer than White offenders. […] By a criminal history level of 6, the pattern of disparity reversed, such that Black offenders received sentences that were approximately 7.4% shorter than White offenders.”
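The masking described in that passage can be sketched numerically. The block below assumes, purely for illustration, that the Black-White gap falls linearly from +7.4% at criminal history level 1 to -7.4% at level 6 (the quote gives only the two endpoints), and the shares of offenders at each level are hypothetical. The pooled gap is then a weighted average, which can sit near zero even when the level-specific gaps are large.

```python
def gap_at_level(level: int, top: float = 7.4, levels: int = 6) -> float:
    """Assumed linear gap (in %): +top at level 1, -top at the highest level."""
    step = 2 * top / (levels - 1)
    return top - step * (level - 1)


def pooled_gap(shares: list[float]) -> float:
    """Weighted average gap given the share of offenders at each history level."""
    total = sum(shares)
    weighted = sum(
        share * gap_at_level(index + 1, levels=len(shares))
        for index, share in enumerate(shares)
    )
    return weighted / total


# Hypothetical distributions of offenders across criminal history levels 1-6.
uniform = [1, 1, 1, 1, 1, 1]
bottom_heavy = [30, 25, 15, 12, 10, 8]

print(f"uniform shares:      {pooled_gap(uniform):+.2f}%")
print(f"bottom-heavy shares: {pooled_gap(bottom_heavy):+.2f}%")
```

The design point is that a single baseline coefficient reflects the mix of offenders across history levels as much as it reflects the disparity at any one level.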
Studies of smaller phenomena can have advantages. Tuttle’s 2019 job market paper, Racial Disparities in Federal Sentencing: Evidence from Drug Mandatory Minimums (pop-sci coverage in The Economist), focused on a specific drug law change that created a quasi-experimental situation. It has better grounds to argue causality and goes into more detail than basic curve-fitting. It argues that, in this case, skin color affected sentence lengths by a considerable amount through the channel of prosecutorial discretion in pushing charges above a legal threshold.
Depending on the counterfactual sentence imputed for the affected offenders, bunching at 280g can account for 2-7 percent of the racial disparity in crack-cocaine sentences. A highly conservative estimate suggests that being bunched at 280g adds 1-2 years to an offender’s sentence.
More debunking here: https://www.reddit.com/r/AskSocialScience/comments/vd97r8/is_there_a_substantial_racial_gap_in_sentencing/?utm_source=share&utm_medium=ios_app&utm_name=iossmf

Their R2 was 0.645, meaning that 64% of the variance could be explained by legal and case-processing factors and extralegal factors. This means that 36% of the variance is being left unexplained.

FB is satisfied with almost 40% being left “unexplained”?

The issue of legal variables not being able to close the black-white sentencing gap, even if the effects of race are small, is interesting. This could reflect either other underlying variables that are being left uncontrolled, or measurement and statistical issues in the way criminologists do these studies.

It could, but that is nothing more than a guess.

Studies should look at legal variables, criminal record and IQ all together to see if the gap still persists. The fact that no study has done this is confusing, but it’s obvious the gap is most likely not due to racial bias. When it comes to the Georgia study, the document failed to note that light-skinned blacks had lower sentences than whites. Why this was omitted is unknown, but it’s unknown as to why light-skinned blacks have a smaller effect than whites. Regardless, this model can still be interpreted via a hereditarian hypothesis. Since light-skinned blacks are smarter than darker skinned blacks (Chuck 2008; Shuey 1966; Rowe et al. 2002; Last 2019), it would make sense as to why they get less harsher sentences than their darker counterparts. So even taking the effects of light-skinned blacks into consideration, it still falls inline with a hereditarian hypothesis and the findings by Beaver et al. So, it seems that much of black-white sentencing gap is a result of IQ differences, not racism in the criminal justice system.
Racial Disparity in Federal Criminal Sentences
Examination of federal data indicates Black Americans spend about 10% more time in prison when compared to comparable Whites who commit the same crimes.
Additionally, Black arrestees are 75% more likely to be charged with a crime carrying a mandatory minimum sentence.
Prosecutors contribute massively to this undeniable racial bias.???
https://www.yalelawjournal.org/article/mandatory-sentencing-and-racial-disparity-assessing-the-role-of-prosecutors-and-the-effects-of-booker
Black men are twice as likely to have charges which carry mandatory minimum sentences filed against them than similarly-situated white men.
This article recommends against the tightening of judicial discretion, arguing that the process has historically led to greater racial sentencing disparities.
The issue of blacks spending more time in prison has already been discussed above, but what about mandatory minimums? First, it’s unknown exactly how similar these criminals are due to the way crime type is being measured. According to the 2nd study, “we are estimating sentencing gaps between black and white defendants who look similar…” – meaning that the severity of the crime is not similar, but rather they’re just similar in terms of the type of crime committed. The first study also did the same thing where they looked at criminal offenses rather than the severity of the crime itself. Because of this, it’s unknown if these are accurate comparisons. More information is needed before we can definitively say that race is associated with mandatory minimums. When mandatory minimum laws are in place, the black-white gap in sentencing is smaller (although, the gap itself is a result of IQ, as discussed above). The black-white gap may widen or remain persistent due to the elimination of rigid sentencing guidelines (mandatory minimums), thus sentences are lower for BOTH blacks and whites compared to decades past. It seems mandatory minimums kept sentencing racially “fair” in some respects (Pryor et al. 2002). It’s highly possible that blacks commit more crimes that carry mandatory minimums, especially given the fact that blacks commit more crimes than whites (Beaver, Ellis, and Wright 2013). So, blacks could be committing crimes that have a mandatory minimum sentencing attached to it, and this explains it, not prosecutors. (Personally, I was unsure why the last claim for the first study ended with question marks. Did the study make this argument, or is this a personal interpretation of the data?)
Report on Jury Selection Study
Between 1990 and 2010, state prosecutors struck about 53% of black people eligible for juries in criminal cases, as opposed to 26% of white people. The study’s authors testified the odds of this taking place in a race-neutral context were around 1 in 10 trillion.
After accounting for factors prosecutors select for which tend to correlate with race, black people were still struck twice as often.
North Carolina’s state legislator had previously passed a law stating death penalty defendants who could demonstrate racial bias in their jury selection could have their sentences changed to life without parole. The legislature later repealed that law.
When looking at Table 13 of the study, the race variable had a positive coefficient of .906, and it was statistically significant at <.001. When controlling for all the variables that tend to correlate with race, and race itself, the entire R2 was 0.32, meaning it explained 32% of the variance into as why black jurors were struck down more often. Even when controlling for other variables that may be race neutral, there was still a disparity. Interestingly enough, another North Carolina study on jury selection found that defense attorneys struck down potential white jurors far more often than they did potential black jurors (22% vs. 10%).

In its exposition of the study, the article explicitly says: "starting with the defense attorneys, who used their removal powers at the highest rate, perhaps the simplest explanation is best: they used all the available voir dire clues (including the race of the prospective jurors) to seat juries who were more sympathetic to human frailty, or those who were more skeptical of local police. Perhaps the use of the jurors’ race was the explicit basis for the defense attorney’s choice, or maybe the race correlated with other clues, such as expressions of general respect for authority. Put another way, defense attorneys may have used race as one factor to pick a jury to win a trial." The other numbers are explained by systemic racism: "As for the judges, it is more difficult to reconstruct the reasons why they removed a higher percentage of black jurors from the venire. The 30% increase in the rate of removal among black jurors, when compared to white jurors, might reflect greater economic stresses among black jurors, such as transportation difficulties or pronounced hardship from missing days away from a job." and "it is also possible that prosecutors removed jurors based on a factor correlated with race – most prominently, jurors with a felony conviction, a prior arrest, or close family members who had negative experiences in the criminal justice system. Prosecutors might have been fully aware of the disparate racial impact of these choices and regretted that unintentional side effect of their removal strategy." or by explicit strategic decisions: "One potential explanation for the race removal ratios higher than 1.0 (by prosecutors) would be intentional strategic decisions that incorporate race. Perhaps line prosecutors relied on race as a clue about the general receptiveness of jurors to a law enforcement perspective. Like the defense attorneys, the prosecutors may have relied in part on race to pick a winning jury."
No implicit racial bias was seen as a direct explanation of the disparities.

So, it seems that both races experience some form of “racial bias” in jury selection, and it seems to depend on whether a state prosecutor or defense attorney is the one selecting the jurors. If both groups seem to be affected, is it really due to racial bias? Makes no sense for race to be inconsistent in the face of systemic racism.

Holy shit, the lack of imagination here is staggering. Does this person actually believe that systemic racism means that racism cannot affect all races to some degree, that it must be solely localized on black people? The racism found in this study vis-à-vis jury selection is obvious: prosecutors and defense attorneys are assuming, based on race, that a particular juror will be favorable or unfavorable to their case, and behaving in inversely related ways on that basis. There’s a single, uncomplicated explanation. This same author even goes into detail about juror bias based on race, which shows it is absolutely a thing, but somehow cannot conceptually tie that back to this, or see that it is racism. To clarify: just because black jurors as a population tend to rule more often in certain predictable ways does not mean it is not still the textbook definition of racism to either strike or reject an individual black juror on that basis, rather than, you know, their individual characteristics as a person. Even if the prosecution or defense can be said to reliably tell a juror’s ruling ahead of time, and on that basis disproportionately struck or accepted black jurors, there’s no guarantee that rate would be commensurate with the rate at which black jurors are generally inclined to rule as a population.

Different Shades of Bias: Skin Tone, Implicit Racial Bias, and Judgments of Ambiguous Evidence
In this study, two groups of mock jurors were given a collection of race-neutral evidence from an armed robbery, with one group’s alleged perpetrator being shown to be light-skinned and the other dark-skinned.
Jurors were significantly more likely to evaluate ambiguous, race-neutral evidence against the dark-skinned suspect as incriminating and more likely to find the dark-skinned suspect guilty.
On a personal note, this study should not have been included. The sample used in this study was not even representative of actual jurors, and it had a small sample size. The sample for the study was 66 students from the University of Hawaii — not at all indicative of jury members serving under legal obligation. Regardless, the authors included a previous 2003 review that found no consensus in the literature for this issue (Sommers and Ellsworth 2003), casting doubt on if this study changes anything in that. Furthermore, the 2003 study also remarked that “Black mock jurors seem to be influenced by a defendant’s race regardless of the salience of racial issues at trial.” White jurors were less influenced by race, so racial bias is coming from blacks and not whites.

Look at that bolded sentence. Reeeeally look at it. It’s illustrative of this whole thing. What the study actually said was that white jurors were less influenced by race, not that they weren’t influenced by race at all, yet the author hares off with the wildly inaccurate and binaristic conclusion that “racial bias is coming from blacks and not whites” in that same fucking sentence. The mental block, zero-sum thinking, and lack of nuanced understanding here could not be more obvious. Furthermore, the study Vaush cites notes the methodological flaws of Sommers and Ellsworth: “first, focusing on guilt and punishment judgments may overlook the way implicit racial bias truly functions. Measuring verdicts and punishment judgments without also measuring cognitive processes might cover up the most meaningful part of the jury decision-making story. And second, testing verdicts and punishment judgments in a mock trial setting may actually heighten differences between decision making in real trials (with real consequences) and mock trials (with no consequences).” In fact, it discusses the problems with studies attempting to disprove jury bias, pointing out that they don't test social cognition and that little research has looked at race effects in the way jurors evaluate evidence. In the study's own words: “Therefore, the hypothesis that racial cues lead to biased evaluations of trial evidence has yet to be fully examined. The next section thus sets the stage for our empirical test of Biased Evidence Hypothesis by explaining the ways that simple racial cues can activate powerful racial stereotypes.”

Furthermore, other studies have found no anti-black bias from whites who acted as mock jurors, but did find a racial-bias from blacks.

These studies have tended not to focus on jurors' cognitive processes, such as memory and evidence evaluation, instead focusing on outcome measures such as guilt and punishment.

Mitchell et al. (2005) analyzed data from 34 studies in which people acted as jurors and voted on whether a given defendant was guilty. It was found that whites have nearly no bias in such decisions while the black people exhibit an in-group bias that is 15 times larger than the minuscule bias seen among whites.

This is a prime example of a paper not focusing on jurors' cognitive processes, such as memory and evidence evaluation, instead focusing on outcome measures such as guilt and punishment. First of all, there is no clarification of exactly how many studies were excluded, which could potentially impact the results. Second of all, the overall effect size for white people might not be all that “minuscule”, since the effect sizes they couldn’t calculate were coded as .00. Even they say this artificially weakens the overall effect size (Pigott, 1994). They also didn’t take into account variables like race of the victim, socioeconomic status, or, more importantly, the type of crime that was committed. This is reflected in their tests of homogeneity. For context, if the homogeneity test is significant (p < .001), it means that the variance of the effect sizes in the sample is larger than what would be expected from sampling error alone, which suggests that factors other than sampling error are contributing to the differences in the effect sizes. In this meta-analysis, the significant homogeneity test (Q = 279.28, p < .001) indicates significant heterogeneity among the included studies: their effect sizes differ significantly from one another. This affects the reliability of the results of the meta-analysis and the conclusions drawn from it. Despite the significant homogeneity test, the meta-analysis used a fixed-effects regression model to analyze the data. This is not the most appropriate statistical model in the presence of significant heterogeneity, as the fixed-effects model assumes that the studies being analyzed are similar and that any differences in the effect sizes are due to sampling error. 
If there are other factors that are contributing to the differences in the effect sizes, which there are as the test shows, the results of the fixed-effects model may be biased.
More importantly, the effect seen appears only because outlying studies on black students, which are not representative, bias the effect size. This must be recognized because fixed-effects models are sensitive to the inclusion of outlying studies, which can produce biased estimates of the overall effect size.
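To make the heterogeneity argument concrete, here is a minimal sketch of Cochran's Q, I², and the contrast between a fixed-effect pooled estimate and a random-effects one. The effect sizes and variances are hypothetical, not values from Mitchell et al., and the DerSimonian-Laird estimator is used only as one common random-effects approach, not as the method the meta-analysis should or did use.

```python
def fixed_effect(effects, variances):
    """Inverse-variance pooled estimate and its variance (fixed-effect model)."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, 1 / sum(weights)


def cochran_q(effects, variances):
    """Cochran's Q: weighted squared deviations from the fixed-effect mean."""
    pooled, _ = fixed_effect(effects, variances)
    return sum((y - pooled) ** 2 / v for y, v in zip(effects, variances))


def i_squared(q, k):
    """Share of total variation attributed to heterogeneity (0 to 1)."""
    return max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0


def dl_tau_squared(effects, variances):
    """DerSimonian-Laird estimate of the between-study variance."""
    weights = [1 / v for v in variances]
    q = cochran_q(effects, variances)
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    return max(0.0, (q - (len(effects) - 1)) / c)


def random_effects(effects, variances):
    """Pooled estimate and variance with tau^2 added to each study's variance."""
    tau2 = dl_tau_squared(effects, variances)
    return fixed_effect(effects, [v + tau2 for v in variances])


# Hypothetical effect sizes (d) and sampling variances; the last study is an outlier.
d = [0.02, 0.05, -0.01, 0.60]
var = [0.01, 0.01, 0.01, 0.04]
q = cochran_q(d, var)
print("Q =", round(q, 2), "I^2 =", round(i_squared(q, len(d)), 2))
print("fixed  (estimate, variance):", fixed_effect(d, var))
print("random (estimate, variance):", random_effects(d, var))
```

When Q exceeds its degrees of freedom, the random-effects variance is larger than the fixed-effect variance, so a fixed-effect model reports more precision than the data support.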
It is also possible that these effects were largely observable only in the 70s, because the “1970s” moderator was one of the strongest variables influencing racial bias in jury verdict decisions (d = 0.404). So they’re not even applicable now (d for the 1980s, 1990s, and 2000s = only 0.031, −0.032, and 0.029, respectively). They also didn’t do a test for homogeneity or a fixed-effects regression like they did for sentencing. It should also be noted that these effects for black jurors would only be observed because they used a continuous measure of guilt and not a dichotomous one. The meta-analysis even notes this: “The race of participant moderator effect that deserves careful consideration. On the one hand, only nine of the samples involved Black participants and seven of those nine studies failed to provide instructions and involved continuous guilt measures (conditions that appear to promote the racial bias effect).” So the significant effects were small, and the statistical significance disappeared if the experimenters eliminated certain types of studies. (E.g., the researchers found that studies using dichotomous (guilty/not guilty) variables did not have the same race effects as studies using continuous scale variables, such as “on a scale of 1-10, how guilty is the defendant?” The researchers also found that community samples displayed greater race-based sentencing bias than college student samples.) This meta-analysis also doesn’t take into account what kind of instructions were actually given to juries (i.e., just “jury instructions”). It is highly unlikely that each study included all of the language provided in standard case law instructions (standard pattern, modified standard pattern, etc.), so this inflates results. This also wasn’t a group-level analysis, so there’s no conclusion about whether racial bias is produced when jurors make decisions as a group.
There is some indication that the degree of racial bias may be influenced by the race of the foreperson, with a Black foreperson resulting in less guilt being attributed to Black defendants (see Foley & Pigott, 2002).

Devine and Caughlin (2014) conducted a meta-analysis and found that white jurors had no bias against black defendants, but did have a moderate bias against Hispanic defendants. Black jurors, though, showed a pro-black or anti-white bias.

The meta-analysis says that white jurors may actually possess a bias that is masked: “racism in its modern form tends to be less overt and more likely to manifest itself when race is not a salient factor in the decision context (Sommers, 2007; Sommers & Ellsworth, 2001). It therefore could be that a real outgroup severity bias on the part of White jurors is being masked by the conspicuous nature of the defendant’s race in many experimental studies of juror decision making.” The study Vaush cites even says this on pg. 325. Even then, the author would be referencing the bivariate relationships from Table 1. They're just that: bivariate relationships with no regard for omitted variables. In fact, the relationship between “B Jurors with W/B Defendant” and guilt decisions is barely statistically significant, with larger confidence intervals. They also say “I² values were substantially larger than zero for all characteristics (ranging from 39% for juror education to 81% for defendant race).” When they do the moderator analysis, it turns out that the effect may not depend on race, but on case type: “There is also some support for the notion that the weak general tendency for jurors to be harsher toward defendants of a different race varies somewhat according to the type of case. Specifically, there was little if any indication of outgroup severity bias for violent cases (r = .02, k = 17) or homicide cases (r = .03, k = 14), but noticeably more bias when trials involved property crimes (r = .12, k = 5) or adult sexual assault (r = .13, k = 5).” Sexual assault is more common among white people than black people, which suggests the bias is conflated with case type.

Another issue is the use of weighted least-squares regression to analyze the effects of the moderators. This method can lead to overestimation of results, especially in cases with small sample sizes or when the data is not normally distributed. For example, if there are only a few cases of black jurors, the weighted least-squares regression can overestimate the size of the effect of these black jurors on guilt judgments.

Another statistical issue is the failure to account for potential confounds. Confounds are variables that can influence the relationship between two variables and can lead to inaccurate results. For example, if the study did not control for the type of crime (e.g. violent vs. property) or the defendant's race, this can lead to overestimation of the effect of black jurors on guilt judgments.

Finally, the study does not account for potential selection bias. Selection bias occurs when participants are not randomly assigned and can lead to inaccurate results. For example, if the study only included black jurors who were more likely to be convicted, this could lead to an overestimation of the effect of black jurors on guilt judgments.

Zigerell (2018) meta-analyzed 17 studies and found that white people exhibited a statistically insignificant tendency to favor black people while black people exhibited a pro black bias that was larger and statistically significant.

Zigerell is the study analyzing 10+ unpublished studies (I forgot the exact number).
https://journals.sagepub.com/doi/pdf/10.1177/0146167218757454
"Considering changes in implicit attitudes by participant race, Whites became less implicitly pro-White during BLM, whereas Blacks showed little change. Regarding explicit attitudes, Whites became less pro-White and Blacks became less pro-Black during BLM, each moving toward an egalitarian “no preference” position." This comes from a study of attitudes during the Black Lives Matter movement (although even they use the IAT), covering 2009-2016 with a final sample of 1.3 million participants. One benefit in comparison to Zigerell is the range of participation dates, which lets us make the case that the picture captured in Zigerell's data has changed as time progressed, although this could simply be (as Zigerell stresses in the limitations section) a product of other factors, like participants answering over the phone, which could result in more or less discrimination. Zigerell also has a limited scope. The study only looked at data from 17 survey experiment studies, and did not consider data from other sources or research designs. As a result, the findings of the study may not be representative of the full range of research on in-group bias, and may not capture the full range of experiences and outcomes for Black and white participants. Zigerell also used a pooled analysis of existing studies, which can be subject to limitations and biases, such as the diversity and quality of the studies included in the analysis, and the potential for combining heterogeneous data in an inconsistent or inappropriate manner. As a result, the findings of the study may not be as reliable or valid as those of other methods, such as original empirical research or more detailed meta-analytic techniques (they also don’t provide I² or Q values). Now, they do say:

“net discrimination favoring Black targets appeared across three studies in which the target was a political candidate and across the four studies in which the target was a worker or job applicant.”

However, this does not necessarily mean that the study did not combine heterogeneous data in an inconsistent or inappropriate manner. Heterogeneity in research data refers to the diversity and variability of the data, and can arise from a variety of factors, such as the sample size, the sampling method, the research design, the measurement instruments, or the analysis techniques. Heterogeneity can be a source of bias and error in research, as it can affect the reliability and validity of the findings, and can make it difficult to compare and combine different studies or data sources. In the case of Zigerell, while it is true that the study found similar patterns of small-to-moderate net discrimination in favor of Black targets across different studies and contexts, this does not necessarily mean that the data were not heterogeneous in other aspects. For example, the study may have combined data from studies that used different research designs, measurement instruments, or analysis techniques, which could introduce bias and error into the results of the pooled analysis. Additionally, even if the study did not combine heterogeneous data in an inconsistent or inappropriate manner, this does not necessarily mean that the findings of the study are reliable or valid. As previously mentioned, the use of a pooled analysis of existing studies can be subject to a number of limitations and biases, such as the diversity and quality of the studies included in the analysis, and the potential for overfitting or underfitting the data. As a result, the findings of the study may not be as reliable or valid as those of other methods, such as original empirical research or more detailed meta-analytic techniques. The study did not consider the potential moderating effects of other factors, such as the context, the nature of the group, or the specific situation, on the relationship between race and in-group bias. This is a given since there’s no moderator analysis. 
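For reference, the heterogeneity statistics mentioned above (Cochran's Q and I²) are simple to compute once per-study effect sizes and standard errors are in hand. Here is a minimal sketch in Python. The effect sizes and standard errors are made up for illustration and are NOT Zigerell's numbers, and the function name is just a placeholder.

```python
def cochran_q_i2(effects, std_errors):
    """Return (fixed-effect pooled estimate, Cochran's Q, I^2) for a set of studies."""
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q sums the weighted squared deviations of each study from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I^2 is the share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


# Hypothetical inputs for illustration only.
effects = [0.1, 0.5, -0.2, 0.4]
std_errors = [0.1, 0.1, 0.1, 0.1]
pooled, q, i2 = cochran_q_i2(effects, std_errors)
print(f"pooled={pooled:.2f}, Q={q:.1f}, I2={i2:.0%}")
```

A high I² on the real per-study estimates would back up the worry that the pooled number is averaging over studies that don't measure the same thing; a low one would weaken it.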
The fact that the sample is conducted over the phone is an example of undercoverage bias, in which Zigerell is systematically excluding members of the population from being in the sample. People who use mobile phones, have unlisted numbers, or don’t have a phone at all couldn’t be in the sample. It also suffers from voluntary response bias since it was done online. Whether the date is the primary reason this study reports LESS pro-white bias cannot be confirmed, as there is no study comparing the two individual studies. The study I linked, mind you, claims there was a shift, implying different levels of pro-white and pro-black bias beforehand, so it can be interpreted that way. The pooled estimate for the black participant group could also be lower, since a single study, Cottrell and Neuberg 2004, is biasing the distribution and, therefore, the pooled estimate. It should also be noted that the confidence interval for the pooled estimate among black participants was about 3 times wider than the one among white participants (0.117 to 0.508 vs -0.030 to 0.115), so it isn't as precise. It seems that all the studies involving black participant groups had confidence intervals far wider than those for the white participant groups. 10 out of 17 are statistically insignificant since their confidence intervals include 0. It should also be noted that the black dots indicating publication do not necessarily indicate a full reporting of all outcome variables and do not include dissertations, conference papers, submitted-but-unpublished manuscripts, or reports on TESS studies that appeared in secondary sources that used the studies for illustrative purposes. It would also be interesting to see the regression models and their estimates so we can see if there is a relationship.
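The "about 3 times" figure can be checked directly from the two pooled intervals quoted above. A quick sketch (the variable names are mine; the interval endpoints are the ones given in the text):

```python
# Pooled confidence intervals as quoted in the text above.
wider_interval = (0.117, 0.508)
narrower_interval = (-0.030, 0.115)


def width(interval):
    """Width of a confidence interval given as (lower, upper)."""
    return interval[1] - interval[0]


ratio = width(wider_interval) / width(narrower_interval)
print(f"widths: {width(wider_interval):.3f} vs {width(narrower_interval):.3f}, ratio {ratio:.2f}")
```

The ratio comes out a bit under 3, so "about 3 times" is a fair rounding.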
Statistical problems include:
Limited model fit: Linear and logit regressions are parametric models, which assume a specific functional form for the relationship between the outcome and predictor variables. However, the data may not always follow this functional form, and the model may not fit the data well, which can affect the reliability and validity of the findings.
Limited model flexibility: Linear and logit regressions are linear in their predictors (on the outcome scale for linear regression, on the log-odds scale for logit), so they assume that the relationship between the outcome and predictor variables can be approximated by a linear function on that scale. However, the data may not always follow this assumption, and the model may not be flexible enough to capture the full range of relationships between the variables, which can also affect the reliability and validity of the findings.
Limited model interpretability: Linear and logit regressions provide estimates of the coefficients of the predictor variables, which can be interpreted as the effect of the predictor variables on the outcome variable. However, these coefficients are often difficult to interpret in a practical or intuitive manner, and may not provide a clear or comprehensive understanding of the underlying mechanisms or processes that govern the relationship between the variables.

So in conclusion, whites seem to show no anti-black bias in mock jury studies, but blacks show an in-group bias or an anti-white bias. If systemic racism in jurors is argued to stem from whites against blacks, the data does not support this.
https://bja.ojp.gov/sites/g/files/xyckuh186/files/media/document/PleaBargainingResearchSummary.pdf
Government aggregate of data on plea and charge bargaining.
“Studies that assess the effects of race find that blacks are less likely to receive a reduced charge compared with whites.”
“Studies have generally found a relationship between race and whether or not a defendant receives a reduced charge.”
“The majority of research on race and sentencing outcomes shows that blacks are less likely than whites to receive reduced pleas.”
In short, collected data strongly indicates a racial bias against blacks with regards to sentencing and plea bargains.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.821.8079&rep=rep1&type=pdf
Black defendants with multiple prior convictions are 28% more likely to be charged as “habitual offenders” than similarly-situated white defendants.
“Assessments of dangerousness and culpability are linked to race and ethnicity, even after offense seriousness and prior record are controlled.”
It’s hard to know the effect race has on plea bargaining without knowing the effect size. If there is an effect to be measured, an effect size should be given to see how strong or weak it is. The issue is far more complex, though, and not as simple as the document makes it out to be. With respect to the first study, Shermer and Johnson (2010) found race to be unrelated to the probability of someone getting their charges reduced.

So let’s note the methodological drawbacks of this study, then we’ll look at how their results aren’t conclusive. They exclude something like 16,000 cases which lowers the statistical power to correctly detect an effect, which probably explains the type II errors in their results. But there is a significant result in table 3 showing there’s a negative relationship between black people and charge reductions when it comes to weapons cases (b = -0.35). But moving on they say this

> “this study only examines reduction in charges. It is therefore unable to capture potentially important differences in initial charge severity, or in other prosecutorial decisions of consequence such as the imposition (or avoidance) of mandatory minimums (Ulmer, Kurlychek, & Kramer, 2007) or the use of substantial assistance departures (Hartley et al., 2007). Second, our measure of charge reduction provides a conservative estimate of prosecutorial charge bargaining in federal courts. Data constraints required that we restrict our analyses to charging decisions that resulted in the lowering of statutory maximum penalties. While this type of charge reduction is of great consequence in that it lowers the ceiling for federal punishments, it fails to capture more subtle types of prosecutorial bargaining that may also affect final punishment dispositions. Charge reductions that do not alter statutory maxima are unobserved in our analysis as are other types of plea negotiation such as fact bargaining and guidelines stipulations. As Tonry (1996, p. 78) observed, in the federal system, “prosecutorial discretion is all but immune from judicial review and many tools for fine-tuning sentences besides charge bargaining are available”. To better understand the role of the prosecutor in the federal punishment process, then, additional mechanisms for “fine-tuning” federal punishments must be incorporated into future work. Ideally, measures of actual time served would also be incorporated in addition to nominal sentence lengths. Finally, data constraints precluded examination of some potentially important omitted variables. These included measures of evidentiary strength as well as inter-organizational relationships among the different court actors, both of which are likely related to federal charging decisions.
They also include additional offender and victim characteristics, such as detailed measures of victim injury, socioeconomic and family status, and prior histories of victimization and substance abuse. Unfortunately, these measures are not collected by either the AOUSC or USSC, so the addition of such measures would assist invaluably in future investigations of federal charging decisions.” With these methodological caveats in mind, let’s look at how their results aren’t conclusive. Looking at table 2, the standard error is higher than their coefficient estimate for the black variable.

This is the same for table 3 when they examine different types of crimes (some are lower but still close to the coefficient estimates). This is important to recognize because it shows how imprecise their estimates are, especially for logistic regression. When the standard error is considerably lower, meaning higher precision, that’s when we see significant results for the black variable (see weapons cases in table 3, as stated before). They say “race and ethnicity emerged as strong predictors primarily for weapons offenses, where black and Hispanic offenders were about .70 times as likely to have their initial charges reduced.”
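The ".70 times as likely" figure and the b = -0.35 coefficient mentioned earlier are the same result on two scales: exponentiating a logit coefficient gives an odds ratio. A quick check (the variable names are mine; the coefficient is the one quoted from table 3):

```python
import math

# Logit coefficient for black defendants in weapons cases, as quoted from table 3.
b_black_weapons = -0.35

# Exponentiating a logistic-regression coefficient converts log-odds to an odds ratio.
odds_ratio = math.exp(b_black_weapons)
print(f"odds ratio = {odds_ratio:.2f}")
```

Strictly speaking this is a ratio of odds, not of probabilities, so "times as likely" is a loose reading, but it confirms the two numbers describe the same estimate.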

Metcalfe and Chiricos (2017) analyzed the effects of race and remarked that “Pleading guilty increases the probability of a charge reduction by 50.1% for blacks, as opposed to 55.8% for whites”, and that “blacks generally have slightly lower offense seriousness scores, more extensive prior records, and are detained at higher rates—all factors that decrease the likelihood of a charge reduction. This may partially explain the lower value blacks are getting for their plea.” “Aha”, some Vaush readers might say. “Your own quote from the study shows a bias against blacks, so Vaush was right!” Not so fast, reader. After looking at the effects of both gender and race, the authors noted that “Pleading guilty increases the probability of a charge reduction by 46.1% for black males, compared to 58.1% for black females, 53.9% for white males, and 55.9% for white females.” It seems that the effects of gender are stronger than that of race, and black females benefit more than white males and females when pleading guilty. If the criminal justice system was racist with respect to plea bargaining, the effects should not differ by sex.

1. If it were given that this was accurate, do I even have to explain the fallacy here? The existence of sexism does not somehow abrogate the existence of racism. The numbers are right there: even if you assume both sexes are equally represented (which one should not do when discussing the legal system unless that is actually the case) and average them out, there’s still a negative disparity between races. Also, it should be news to exactly no one that racism and stereotypes differ based on sex. Asian women are treated as attractive, submissive sex objects while Asian men are treated as undesirable rejects. 2. This is from a single county in Florida. The black female/white male difference and the black female/white female difference also weren’t statistically significant. The sample size seems kind of small but not necessarily debilitatingly so. Also, interaction effects (say stereotype threat or the combination of black and male) would theoretically be consistent with a systemic racism hypothesis. But either way, using data from a single county in Florida to discount systemic racism is just weird.
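To spell out the "average them out" point in 1, here is the arithmetic on the four figures FB quoted, under the (loudly flagged) assumption of equal weighting between sexes. Real weights would depend on how many men and women are in the sample, which isn't given here.

```python
# Percentage-point effects of pleading guilty, as quoted from Metcalfe and Chiricos (2017).
effect = {
    ("black", "male"): 46.1,
    ("black", "female"): 58.1,
    ("white", "male"): 53.9,
    ("white", "female"): 55.9,
}

# Assumption: equal weight for each sex within a race (not the actual sample shares).
black_avg = (effect[("black", "male")] + effect[("black", "female")]) / 2
white_avg = (effect[("white", "male")] + effect[("white", "female")]) / 2
gap = white_avg - black_avg

print(f"black avg = {black_avg:.1f}, white avg = {white_avg:.1f}, gap = {gap:.1f} points")
```

Even with equal weighting the black average sits below the white average. Since male defendants typically outnumber female defendants, an equal-weight average likely understates the gap, which is consistent with the larger 50.1% vs 55.8% difference the authors report for race alone.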

The 2nd study did find the results the document is claiming, as seen in Table 6 of the study. Controlling for gender might show results that contrast with the racism hypothesis. As Steffensmeier and Demuth (2020) noted, female defendants who aren’t white are treated better than their male counterparts in criminal sanctioning. So, the 2nd study should control for gender and see if the effect still persists.

Bruh, isn’t gender in the table they cited? Male is right there. Gender, through male status, appears to be included as a control variable among the level 1 variables.

Furthermore, if it seems to only affect males, it doesn’t make sense for racism to be the explanatory variable if it differs by gender. Past critics have noted that the system could be sexist and racist, but we are focussing on race for this. If race seems to have little to no effect, pairing it up with gender doesn’t change that.
https://www.urban.org/sites/default/files/publication/22746/413174-Examining-Racial-and-Ethnic-Disparities-in-Probation-Revocation.PDF
The Urban Institute analyzed the histories of four probation offices and found black people were 18-39% more likely than similarly-situated white people to have their probation revoked.
This study has significant limitations not mentioned by Vaush. As the study says, key variables were missing from the data set, so the analysis is constrained by the data that were available: “Data on some key factors likely related to revocations were not available for analysis. In no site was the data sufficiently populated regarding violation type, including whether violations were related to new crimes or technical violations of probation conditions. This is a very substantial limitation, as the type of violation is strongly related to the likelihood of revocation. We also did not have the data necessary to parse out the contributions of different decision points and actors to the disparity. Probation revocations are a product of probationer conduct, probation officer discretion, judicial discretion, and supervision conditions, making these four factors important determinants of which probationers experience a revocation. Other processes, such as law enforcement practices (which could detect more or less probationer misconduct) or policies and statutes (which could limit discretion in responding to probation violations), also play a role in many jurisdictions. Given these limitations, conclusions regarding the drivers of observed disparities in probation revocations are provisional and constrained by the data available for this study.” Why this was omitted is unknown, and it’s clear the limitations do not allow us to see if racial bias even plays a role.

So this study uses the Blinder-Oaxaca decomposition method, which is actually mathematically robust even with incomplete variable sets. It decomposes observed differences into "explained" and "unexplained" portions, so it’s explicitly accounting for unmeasured factors in the unexplained portion. It doesn't assume we have all variables, it just quantifies what we can explain with available variables and acknowledges what we can’t. The statistical significance and directionality of the racial disparity persist across four distinct jurisdictions with different population demographics, legal frameworks, time periods, available variables, and base revocation rates. This consistent pattern across heterogeneous settings strongly suggests the observed disparity is real rather than artifactual. The probability of finding effects in the same pre-specified direction by chance across four independent samples is low (0.5^4 = 0.0625, treating each site as an independent coin flip and assuming just directional consistency). The missing violation type data also wouldn't overturn the results. If violation types were randomly distributed across races, their omission wouldn't bias the results. If violation types were systematically different by race, including them would just explain more of the disparity. Either scenario doesn't invalidate the finding of disparity, it just affects how much of it we can explain. Also, each jurisdiction had different available control variables, yet produced similar findings. This natural experiment shows the core finding is robust to different control variable specifications. If the result were an artifact of omitted variables, we’d expect greater variation across sites with different control sets. The study explicitly partitions the disparity into explained and unexplained portions through the B-O decomposition. In statistical terms, this means we've quantified our uncertainty rather than ignored it.
The unexplained portion (20-49% for black-white comparisons) represents an upper bound on how much the missing variables could potentially explain. The findings also persist across different time periods for different jurisdictions. This temporal robustness suggests the observed patterns are structural rather than temporal artifacts. Additionally, the study uses both logistic regression and B-O decomposition, finding consistent results across methodologies. This methodological triangulation strengthens confidence in the findings. While more complete data would certainly be valuable, the mathematical and statistical architecture of this study is designed to be robust to missing variables. The limitations cited don't invalidate the core findings - they simply constrain our ability to fully explain the mechanisms driving the observed disparities. The consistency of findings across different jurisdictions, time periods, and methodological approaches provides strong evidence that the observed disparities are real, even if we cannot fully decompose their causes with available data. This is fundamentally different from saying the results are wrong. Rather, we have high confidence in the existence of disparities (the "what") but more limited ability to definitively establish their complete causal mechanisms (the "why"). This is a common situation in observational studies, and the authors have appropriately acknowledged these limitations while employing robust statistical methods that remain valid in the presence of unmeasured variables.
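To make the "explained" vs "unexplained" split concrete, here is a minimal sketch of a two-fold Blinder-Oaxaca decomposition in Python. Everything in it is an assumption for illustration: the data are invented, there is a single hypothetical covariate (think of it as a prior-record score), the outcome is modeled with a plain linear fit, and group b's coefficients are used as the reference. It is not the Urban Institute's specification or data.

```python
from statistics import mean


def ols_simple(x, y):
    """Intercept and slope of a one-predictor least-squares fit."""
    mx, my = mean(x), mean(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - slope * mx, slope


def oaxaca_twofold(x_a, y_a, x_b, y_b):
    """Split mean(y_a) - mean(y_b) into explained and unexplained parts,
    using group b's coefficients as the reference."""
    int_a, slope_a = ols_simple(x_a, y_a)
    int_b, slope_b = ols_simple(x_b, y_b)
    gap = mean(y_a) - mean(y_b)
    # Explained: the part due to the groups differing in the covariate.
    explained = (mean(x_a) - mean(x_b)) * slope_b
    # Unexplained: the part due to the groups having different coefficients.
    unexplained = (int_a - int_b) + mean(x_a) * (slope_a - slope_b)
    return gap, explained, unexplained


# Invented data: x is a hypothetical risk score, y is 1 if probation was revoked.
x_a = [0, 1, 2, 3, 4, 5, 2, 3]
y_a = [0, 0, 1, 1, 1, 1, 0, 1]
x_b = [0, 1, 2, 3, 4, 1, 2, 0]
y_b = [0, 0, 0, 1, 1, 0, 1, 0]

gap, explained, unexplained = oaxaca_twofold(x_a, y_a, x_b, y_b)
print(f"gap={gap:.3f}, explained={explained:.3f}, unexplained={unexplained:.3f}")
```

The two parts always sum to the raw gap, which is the sense in which the method "quantifies what we can explain" with whatever covariates are available. Adding a covariate that was previously missing moves some amount between the two buckets; it doesn't change the size of the gap itself.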

Given probation officers do not treat blacks differently than whites, it’s doubtful that the higher probation revocation among blacks is due to racial bias (see Bechtold et al. 2015).

Bechtold et al. (2015) is only from one state (Washington) and it had several limitations not mentioned. 1. “We do not have a measure of the offenders’ actual behavior while on probation.” 2. “our examination of probation officer monitoring relies on probation officer notes. We do not know the extent to which probation officers reliably and accurately kept these records (or whether probation officers’ attitudes biased their record keeping). As such, it is possible that some youth received warnings that were not recorded in the probation officer logs. Furthermore, the large difference in the distribution of warnings between sites could be because of differences in documentation practices (not differences in probation officer behavior).” 3. “in order to achieve adequate sample sizes of each race and ethnicity, our sample included individuals who had probation violations within a prespecified date range. It is also possible that there may be time effects, with racial or ethnic bias changing over time as different laws are enacted, different judges take the bench, or other shifts occur. We cannot eliminate such factors in the analysis presented here.” “Moreover, the study is limited because we obtained data from juvenile offenders who had received a probation violation in one of two jurisdictions; therefore, future studies should replicate our findings in additional, diverse counties with a diverse population of juvenile offenders, preferably with data on youth who do and youth who do not receive probation violations. 
Furthermore, we obtained our race and ethnicity information directly from court and probation databases; however, it is possible that the courts misclassified some Hispanic and White youth.” “We did not have access to demographic data regarding probation officers or judges.” “Finally, a sample of juvenile offenders as a whole, with some individuals who violate their probation terms and others who do not, would eliminate issues related to sample selectivity. Indeed, the most significant limitation in the present study is that our analysis was limited to those who had received at least one probation violation. One major consequence of this limitation is that we do not know whether there are racial differences in those who received several warnings but never received an official probation violation.” The sample size is too small, so the statistical power is too low to detect the effects of bias. A lot of these results are the product of low statistical power, given they’re all below the .80 threshold. A lot of their effect sizes were around ~+1.00 with very large standard errors, so their results are not precise either.
The interaction analysis also shows some bias as they say:
“there was a trend (z = 1.81, p = .07) among the boys for Black youth to receive more probation violation consequences than White offenders”
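As a sanity check on the quoted trend, the p-value follows directly from the z statistic under a two-sided normal test. A small sketch (the helper name is mine):

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


p = two_sided_p(1.81)
print(f"p = {p:.3f}")
```

That lands at about .07, matching the quote: short of the conventional .05 cutoff, but in a study already flagged for low power, a near-miss in the direction of bias is not much evidence of no bias.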

https://poseidon01.ssrn.com/delivery.php?ID=111089002009100067015104070125088087034086041036045026092114023030105096098083115031022029052037057008050110111109102109109082122004033060060127095069095118065079047041043119111022084005069011027093102024111094019030009109028001123116092074102031092&EXT=pdf&INDEX=TRUE
A study of bail in 5 large counties found blacks received significantly higher bail than whites who had committed similar crimes.
The bail was $7,000 higher for violent crimes, $13,000 higher for drug crimes and $10,000 higher for crimes related to public order.
Once again, this study is being misrepresented. As the study says with respect to violent crime, “For violent crimes, this coefficient is roughly -$140, suggesting that blacks’ bail is actually slightly lower than whites after regression adjustment, which obviously does not support the hypothesis of discrimination against blacks. For the other offense-type categories, regression-adjusted differences are much smaller than raw black-white differences.”

Yet it also clarifies “there is no support for the hypothesis that blacks are favored, either, as the estimate’s p-value against the two-sided null of no effect is 0.95.”

Why this was omitted from the document is unknown, and the coefficient for violent crime seems to show an anti-white disparity, not an anti-black one.

This isn’t an “anti-white disparity”. In fact, the study actually says “these findings suggest the possibility of substantial bias against blacks in bail settings. This result is important for multiple reasons. First, it is evidence of substantial judicial bias against blacks, which is of per se concern. Second, bail affects defendants’ utility directly by affecting the probability that a defendant will lose his freedom for the potentially lengthy pre-trial period. Discriminatorily higher bails thus will make black defendants worse off in this sense. Finally, being held over might also affect the probability of conviction. Our results are therefore of substantial legal policy interest.”

For drug crimes and crimes related to public order, the coefficients were much smaller than when just looking at raw data, and they were not statistically significant, except for drug crimes, where the reported p-value was 0.00.

Death Penalty Sentencing

https://files.deathpenaltyinfo.org/legacy/documents/WashRaceStudy2014.pdf
Analysis of 33 years of data from Washington State to determine which characteristics best predict the decision to implement a death sentence.
Black defendants are 4.5 times as likely to receive a death sentence as similarly-situated whites.
Other factors (presence of aggravating circumstances, involvement of sex crimes, hostage-taking, etc.) explain only a small fraction of the disparity in prosecutors’ and juries’ decision to invoke the death penalty against black defendants.
Race was by far the most influential statistical factor.
Contrary to what was claimed, race was not even “by far the most influential statistical factor.” If we look at Table D3 from the study, we see that the beta coefficient was strong, but it was beaten out by extensive publicity, white victim, and police officer victim. Although the beta coefficient was strong, it came in 4th place when we look at the data ourselves. The variable also wasn’t statistically significant, but we will ignore that given that effect sizes are more informative than statistical significance.

Table 5 is death notices. The decision to impose a death sentence is Table 7, I think. The sample size is quite tiny though.

This study was misinterpreted, and actually looking at the data gives a different picture, contra Vaush. Percent black was a statistically significant variable, but this can be explained by the high rates of crime among blacks. It’s well known that blacks commit more crime than whites, as pointed out in a giant literature review by Beaver, Ellis, and Wright (2009): I wouldn’t put much hope into that variable and thinking “ha! Gotcha!” It most likely does not mean much, especially since the U.S. doesn’t sentence people to the death penalty on a population basis. What was said above doesn’t dispute that blacks are more likely to get the death penalty, but this is not due to race once other factors are controlled for. Focusing specifically on race and implementing the death penalty, Klein and Rolph (1991) note that “After accounting for some of the many factors that may influence penalty decisions, neither race of the defendant nor race of the victim appreciably improved prediction of who was sentenced to death.”

This study did not examine prosecutorial decisions. Instead, it examined 496 cases in which the prosecutors had charged special circumstances and the defendants had been convicted of first-degree murder. Because prosecutors make a range of discretionary decisions before conviction, the Klein and Rolph study is vulnerable to criticism of sample selection bias. For example, their methodology is unable to detect any racial or ethnic disparities that may result when prosecutors seek the death penalty for those accused of the murders of African American victims less frequently than for those accused of the murders of whites. Such disparities also go undetected when, having charged one or more special circumstances that make the defendant eligible for the death penalty, prosecutors later negotiate a plea agreement and thereby remove the death penalty as a possible sentence. Thus, Klein and Rolph's research focused only on penalty trial sentencing decisions, almost all of which are made by juries. The study began with homicides committed on August 10, 1977 (the date that California's death penalty statute took effect). Only defendants under a sentence of death or life without parole on March 1, 1984, were included in the sample. In the end, 352 inmates (71%) were sentenced to life without parole, and 144 (29%) were sent to death row. Klein and Rolph's analysis divided the cases into white and non-white victims and defendants, omitting further racial/ethnic distinctions. Initially they found a small race-of-victim difference. Thirty-two percent of defendants with white victims were sentenced to death, compared to 23% of those with non-white victims. The authors then constructed a statistical model that utilized several factors to predict whether the defendants would be sentenced to life without parole or to death. The model correctly predicted the sentence in 81% of the cases in the sample.
Because 71% of defendants in the sample were sentenced to life without parole, however, the model increased predictability only slightly. Of the 144 defendants sentenced to death, the authors' model predicted a death sentence in less than half (70) of the cases. Upon statistically controlling for legally relevant variables (for example, Klein and Rolph included measures of the offender's prior criminal record, the offender-victim relationship, and whether or not the murder involved torture), the authors concluded that neither the victim's nor the defendant's race had any impact on death sentencing. This conclusion has been criticized. David Baldus and his colleagues argued that Klein and Rolph may have overlooked a statistically significant race-of-victim disparity because they used a statistical method ("CART") that could not capture the full effects of race. The original authors somewhat admit this too: “We did not examine possible bias at earlier stages such as police investigation and arrest practices, prosecutor charging decisions, case preparation, jury verdicts regarding guilt or innocence, and prosecutor requests for the death penalty. Bias at any of these stages could affect which cases reach the point at which a death/LWOPP decision is made (reference Berk (1983), An Introduction to Sample Selection Bias in Sociological data).” “we do not draw any general conclusions about arbitrariness or bias in capital cases.” Moving on, the object of the study is to evaluate prosecutorial decisions to seek a death sentence or a sentence of life without possibility of parole (death was sought in 41% of the cases). The unadjusted data reveal a 9 point disparity in the rates at which death is sought in white- versus nonwhite-victim cases (.39 - .30). A similar analysis shows no race-of-defendant effects.
The authors report partial results from a logistic regression analysis that controls for six legitimate factors related to the circumstances of the crime and the victim that were screened in a stepwise regression from a list of 35 such variables. The table reports no regression coefficients but does indicate the level of statistical significance of the six variables. The variable for the race of the defendant did not enter the analysis, but the variable for the race of the victim did enter at the .01 level of significance. Because of the minimal controls for legitimate case characteristics and the weak fit of the reported regression results (R = 12.7), the results of this analysis are merely suggestive. They hardly support the conclusion of the authors that "the available data suggest prosecutor requests for the death penalty in Los Angeles County were not influenced by racial considerations." For a similarly skeptical observation about this conclusion by one of the GAO researchers who prepared the GAO report noted above, see Conference, supra note 70 (remarks of Ganson). The data also contain no basis for assessing prosecutorial decisions in death-eligible cases that were not tried because they terminated in a negotiated plea bargain. The one that focuses on 496 California jury penalty trials conducted between 1977 and sometime before 1984 also has problems. Juries returned death verdicts in 29% of these cases. Because the study is limited to penalty trials, the authors make clear that their findings cannot be generalized to prosecutorial decision making. (“This study addresses possible racial bias only in the [death-sentencing] step and does not speak to possible racial biases at earlier stages.”) The unadjusted data show no race-of-defendant effects, but they reveal a 10 percentage point race-of-victim disparity (.33 - .23) in the rates at which a death penalty is imposed, significant at the .024 level (by calculations).
The authors do not present regression results, relying instead on (1) two different clusters of cases defined as similar because they share similar death-sentencing rates and (2) a multivariable case classification system known as Classification and Regression Trees ("CART"). The results of the cluster analysis show race-of-victim effects in one subgroup of cases that are not trivial (a 13 percentage point difference in death-sentencing rates), but the disparity is not significant because of small sample sizes (only 15 nonwhite-victim cases). The CART analysis, which controls simultaneously for 15 variables related to the defendant, the victim, and the circumstances of the offense, measures the impact of race in terms of the extent to which the inclusion of the race variables in the analysis increases the accuracy of the model in predicting correct sentencing outcomes. The race of the defendant had no effect and the inclusion of the race of victim increases the number of correctly predicted death sentences by only 10% and the number of correct predictions overall by only 1%. (“The full unpruned tree achieved a 91% accuracy rate with victim race included and a 90% rate without it.”). On the basis of this evidence, the authors conclude that penalty-trial outcomes in California are “not systematically related to victim or defendant race.” It is regrettable that the authors did not use logistic regression, which would have provided a basis for comparing their results with the results reported in the broader literature. This concern is particularly true because there is a fallacy in using the increase in correct predictions as a measure of the impact of adding a factor such as race as a predictor. The reason is that the CART measure of the impact of race based on the extent to which race improves predictions has the potential to mask significant race effects that are detectable in a multiple regression analysis. 
Specifically, under the CART analysis, cases falling in a category for which the death rate is less than .50 are predicted to be life cases and cases falling in a category for which the death rate is greater than .50 are predicted to be death cases. Adding, say, race of defendant as a predictor to a classification system based on nonracial factors involves splitting each (nonracial) category into subcategories with black and nonblack defendants. This split will increase the number of correct predictions only if a category (e.g., murder-rapes) splits into one racial subcategory (viz., murder-rapes with nonblack defendants) with an under .50 death-sentencing rate and the second racial subcategory (viz., murder-rapes with black defendants) has a death-sentencing rate of more than .50. In other words, for the split, which occurs as a result of adding race to the model, to improve predictive power the death-sentencing rates for the two racial sub-groups must straddle .50. Thus, there can be a substantial increase of the risk of death (say from .05 to .15 or from .80 to .95) for black defendants compared to nonblack defendants in a particular category, without any improvement in the prediction rate; in other words, as a metric, change in the correct prediction rate ignores increased risk of death associated with race, unless one race has a death-sentencing rate under .50 and the other has a death-sentencing rate over .50 in one or more (nonracial) categories. In Klein and Rolph's analysis, this scenario apparently happened in five of the over 140 death-sentenced cases in the analysis, but any other race effects that did not meet this test were ignored. (“Including victim race therefore generated five more correct classifications.”)
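
The straddle-.50 point above can be made concrete with a small sketch. The case counts below are invented purely for illustration and are not taken from Klein and Rolph; the helper `correct_predictions` is a hypothetical stand-in for the majority-rule prediction count described in the passage, not the authors' actual CART procedure.

```python
# Hypothetical illustration: all counts are invented, not Klein and Rolph's data.

def correct_predictions(groups):
    """Count correct predictions when each group is classified by its majority outcome.

    `groups` is a list of (n_cases, n_death_sentences) pairs. A group is
    predicted "death" if its death-sentencing rate exceeds .50, else "life".
    """
    total = 0
    for n_cases, n_death in groups:
        rate = n_death / n_cases
        # Predicting "death" gets the death cases right;
        # predicting "life" gets the life cases right.
        total += n_death if rate > 0.5 else n_cases - n_death
    return total

# One nonracial category of 200 cases with a 10% death-sentencing rate.
pooled = [(200, 20)]
# The same category split by race: rates of 5% and 15% (a threefold difference).
split_same_side = [(100, 5), (100, 15)]

# A category whose racial subgroups straddle .50: 40% versus 60%.
pooled_straddle = [(200, 100)]
split_straddle = [(100, 40), (100, 60)]

print(correct_predictions(pooled), correct_predictions(split_same_side))
print(correct_predictions(pooled_straddle), correct_predictions(split_straddle))
```

In the first pair the count of correct predictions is identical with and without the race split, even though one subgroup faces three times the risk of death. Only in the second pair, where the subgroup rates fall on opposite sides of .50, does adding race change the count.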

Baime (in Systemic Proportionality Review Project: 2001-2002 Term) found that neither race of the defendant nor race of the victim predicted who would get the death penalty: “[W]e state our conclusions: (1) there is no sustained, statistically significant evidence that the race of the defendant affects which cases advance to penalty trial; (2) there is no sustained, statistically significant evidence that the race of the defendant affects which cases result in imposition of the death penalty.” Although bivariate analysis reveals that a greater proportion of death-eligible white defendants than African-American defendants advance to the penalty phase, that finding is not supported by regression studies and application of case-sorting techniques. Finally, Corzine, Codey, and Roberts (2007) report that “The available data do not support a finding of invidious racial bias in the application of the death penalty in New Jersey.”

Aside from the fact that this is from 14 years ago, the evidence that Corzine, Codey, and Roberts cite to claim there is no racial bias in the death penalty in New Jersey comes from David Baime’s study, which he presented in testimony in 2006. The study faced a lot of scrutiny from people like Professor Jeffrey Fagan.

https://www.uky.edu/AS/PoliSci/Peffley/pdf/Eberhardt.2006.Psych%20Sci.Looking%20Deathworthy.pdf
Analysis of the relationship between racial stereotyping and death sentence convictions.
Black defendants who possessed darker skin and more “stereotypically black” features were twice as likely to be given the death penalty when accused of murdering a white person, as compared to lighter-skinned blacks with less “stereotypically black” features.
This disparity disappears completely when the murder victim is black.
See above for a response on the race of the defendant; it applies to within-race differences too. Since claim #1 has been responded to above, we will move on to the final claim. Turning to the race-of-victim effect, there also seems to be no racial bias once confounding variables are adjusted for. Walsh and Hatch (2017) found no evidence of race-of-offender or race-of-victim bias: “[We] fail to find any race-of-victim bias”.

This is not a study. It contains no statistical analysis that would let the authors say there is no evidence of a race-of-offender or race-of-victim effect. It was just published in the Journal of Ideology, which isn’t a peer-reviewed journal. It’s just an outlet that says bad things about whatever is considered conventional. So there’s no reason to take this seriously.

Klein and Hickman (2006) say: “When we look at the raw data and make no adjustment for case characteristics, we find the large race effects noted previously—namely, a decision to seek the death penalty is more likely to occur when the defendants are White and when the victims are White. However, these disparities disappear when the data coded from the AG’s [Attorney General’s] case files are used to adjust for the heinousness of the crime. For instance, [one of the studies] concluded, “On balance, there seems to be no evidence in these data of systematic racial effects that apply on the average to the full set of cases we studied.” The other two teams reached the same conclusion. [One team] found that, with their models, after controlling for the tally of aggravating and mitigating factors, and district, there was no evidence of a race effect. This was true whether we examined race of victim alone . . . or race of defendant and the interaction between victim and defendant race.” [the third study’s author] reported that his “analysis found no evidence of racial bias in either USAO [U.S. Attorney’s Office] recommendations or the AG decisions to seek the death penalty”

Other large scale analyses of the patterns of capital sentencing reveal that Black defendants are more likely than White defendants to be convicted of capital murder and more likely to be sentenced to death (Baldus et al., 1998; Paternoster & Kazyaka, 1988). In addition to bias against Black defendants, research also offers evidence of bias against any defendants whose victims were White (Amnesty International, 2003; Indiana Criminal Law Study Commission, 2002; Paternoster et al., 2003; Unah & Boger, 2001). Experimental research has found a similar bias against Blacks in capital trials (Lynch & Haney, 2000; Sommers & Ellsworth, 2001). For example, Dovidio, Smith, Donnella, and Gaertner (1997) examined sentence recommendations toward a Black and a White defendant. After reading a trial summary, participants provided their sentence recommendation. Results showed that the Black defendant received significantly more death sentences than the White defendant. Similarly, Lynch and Haney (2000) analyzed the responses of 402 participants who viewed a videotape of a simulated capital penalty phase. They found that participants were significantly more likely to recommend the death penalty for the Black than for the White defendant. Furthermore, mock jurors judged the trial evidence less mitigating for the Black defendant when compared with the White defendant.

After adjusting for a variety of aggravating and mitigating factors, as well as demographic, and evidentiary variables, Sharma et al. (2013) found neither race of the defendant nor race of the victim predicted who would get the death penalty.

Firstly, this isn’t a representative basis for a general conclusion: it’s only from Tennessee, they excluded cases from their sample (which lowers statistical power), and they didn’t further break down the races they were analyzing to compare them. Second, the study itself says “prosecutors were more likely to seek the death penalty in cases where the victims were White.” What biases the overall finding is their jury-stage results. However, those results are flawed and subject to omitted factors. For example, the study says “Over the last three decades, Black defendants in Tennessee have actually been significantly less likely than White defendants to be sentenced to death. This finding may stem from the fact that Whites are more likely to kill other Whites (see Table 1) and the fact that prosecutors are more likely to charge a capital offense when a victim is White (see Table 2).” I wouldn’t doubt this given the standard errors, which yield imprecise results (speaking of which, where are the confidence intervals?).

Citing 10 other studies, Katz (2005) found that once aggravating and mitigating factors are adjusted for, there is no race-of-victim effect. Furthermore, white victims show a “greater percentage of mutilations, execution-style murders, tortures, and beaten victims, features which generally aggravate homicide and increase the likelihood of a death sentence.” Paternoster and Brame (2003) found the race of the defendant to have no impact on getting the death penalty but did find a race-of-victim effect. A re-analysis of this data found no race-of-victim effect (Berk, Li, and Hickman 2005).

This is just because their measurements are weak. Per the study: “with better covariates it is possible that stronger racial effects could be found, not just weaker effects.” They also treat defendant/victim racial combinations as a four-category nominal variable (black defendant/white victim; black defendant/black victim; white defendant/white victim; and white defendant/black victim). This is a drawback of the study because conducting each of these tests separately means that each test is less powerful than a comparison of black defendant–white victim cases with all other cases combined. Substantive merit is found in comparing black defendant–white victim cases with all others combined because such a comparison represents a more powerful inquiry into the hypothesis that, because of the racial threat they possibly present, such cases are treated differently than others. Previous theory and research would support such a prediction a priori (Blalock, 1967; Jacobs and Carmichael, 2002; Kent and Jacobs, 2005; Stults and Baumer, 2007). Regardless, a reanalysis by the original authors already mentioned used the conflicting study’s propensity score methodology and found different results:
https://paperhub.ir/dl4.php?doi=10.1111/j.1745-9125.2008.00132.x&key=phYudIL0iAQDc

Jennings et al. (2014) found that using this technique made the OR between death sentence and white victim show no effect.

Debunked further below

Katz (1989): noted that the discrepancy vanishes altogether when further controls are imposed;

Katz (1989) cites Barnett (1985) as evidence that the discrepancy disappears; however, Baldus (1985) responded to Barnett, saying this: “Although the study by Barnett (1985) used a more intuitive method for case classification, and the study by Baldus et al employed a computerized method of determining case culpability, the substantive results of the two studies are comparable. Both conclude that over half of death sentence cases do not appear to be excessive or disproportionate in a comparative sense. Each also concludes that a good proportion of the remaining cases in the data set may be disproportionate or excessive in a comparative sense. Perhaps more significantly, both show a comparable race-of-victim effect (which disadvantages defendants whose victims were white) among the midrange of cases where the facts do not clearly dictate either a life or death sentence. Finally, neither study shows a statewide race-of-defendant effect. However, when urban and rural cases are analyzed separately, the Baldus study shows an effect in rural areas that puts black defendants with white victims at a slight disadvantage.” Baldus (1990) also finds that murderers of white victims are more likely than murderers of black victims to get the death penalty. The same was found in Baldus (1994) and Baldus (1998), which were examined in the original study.

Bacon et al. (2003) echoed this finding multiple times: “The race of the victim effect does not hold up, however, at the decision of the state’s attorney to advance a case to penalty trial and at the decision of the judge or jury to impose a death sentence given that a penalty trial has occurred” (p. 27); “The race of the victim does not appear to matter when the decision is to advance a case to the penalty phase or to sentence a defendant to death after a penalty phase hearing” (page 29); “Among the subset of cases where the case actually does reach a penalty trial, the victim’s race does not have a significant impact on the imposition of a death sentence” (page 35); “There is no race of the offender / victim effect at either the decision to advance a case to penalty hearing or the decision to sentence a defendant to death given a penalty hearing” (page 30).

But then the author omits the part where the study says “In Table 12E we report the results of a logistic regression model for defendants who are sentenced to death within the pool of all death eligible cases. This table shows that even taking into account jurisdiction and relevant case characteristics offenders who slay white victims are significantly more likely to be sentenced to death than those who slay all non-white victims…In order to better capture the magnitude of the race of victim effect, in Table 12G we have calculated the predicted probability of each outcome in the death sentencing process for white and non-white victim cases both before and after adjusting for case characteristics. The adjusted probability that a state’s attorney will seek a death notification when a white is killed is .266 and .169 when a black is killed. This means that the probability of a death notification in a white victim cases is 1.6 times higher than that for a black victim homicide, even after considering relevant case characteristics and the jurisdiction where the homicide occurred. The probability of a death notification “sticking” is 1.5 times higher in white victim than black victim cases again after taking into account case factors and jurisdiction. At both these early decision making points, then, the race of the victim killed in a homicide is an important factor in determining which death eligible defendants are notified that the state will seek the death penalty against them, and for whom that notification will “stick”. The last entry in Table 12F shows that for all death eligible homicides the probability of a death sentence in a white victim case is three times higher than in a non-white victim homicide. The estimated probability for a death sentence among death eligible homicides in the stepwise model is .022 for white victim cases and .011 for non-white victim cases. 
The probability that a white victim death eligible homicide will result in a death sentence is now only two times higher than in a non-white victim homicide.” So when they improve the quality of their measurements, lo and behold, the race-of-victim effect still holds. In fact, they speak directly to why there seems to be no effect at later stages: “while these effects do not appear at other, later decision making points in the capital sentencing process they are generally not corrected.”
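
As a quick check on the ratios quoted above, the adjusted probabilities reported in the study can simply be divided. This is only a sketch of the arithmetic; the function name `risk_ratio` is an illustrative label of mine, not something from the study.

```python
def risk_ratio(p_white_victim, p_nonwhite_victim):
    """Ratio of two predicted probabilities (white-victim over non-white-victim)."""
    return p_white_victim / p_nonwhite_victim

# Adjusted probabilities as quoted from the study.
notification = risk_ratio(0.266, 0.169)    # death notification filed
death_sentence = risk_ratio(0.022, 0.011)  # death sentence, stepwise model

print(f"death notification: {notification:.1f} times higher")
print(f"death sentence: {death_sentence:.1f} times higher")
```

The outputs round to 1.6 and 2.0, matching the "1.6 times" and "two times" figures in the quoted passage.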

Jennings (2014) found no evidence that cases with white victims were more likely to result in the death penalty compared with similar cases involving non-white victims, even when examining the most disadvantaged situations for black defendants.

Jennings (2014) is only from one state, and it even acknowledges that as a limitation: “Although these data represent a population of jury decisions in capital murder trials from 1977–2009 where the jury carried out their specific instructions regarding aggravation and mitigation, and where at least one aggravator was found so that the case remained death penalty eligible, the data is from only one state. There is certainly likely to be variability across jurisdictions and states in how the implementation and application of the death penalty and how the complex and nuanced process of death penalty decision-making plays out in their respective settings that has implications for generalizability. In fact, this variability has been previously documented in Illinois (Pierce & Radelet, 2002), Nebraska (Baldus et al., 2002), California (Pierce & Radelet, 2005), Maryland (Paternoster et al., 2004), Colorado (Hindson et al., 2006), and North Carolina (Radelet & Pierce, 2011b; Unah, 2011) as well as in the United States military (Baldus et al., 2012).” They also acknowledge that their propensity score methods have limitations such as model misspecification, since PSM only accounts for observed and observable covariates, the potential for hidden bias, etc. (Pearl, 2000). They also say “furthermore, as reported previously, the North Carolina data suggest that more than 20 legal and extralegal case characteristics significantly vary when comparing cases involving White victims to cases involving Non-White victims. Thus, the findings from this particular study may not necessarily apply to jurisdictions where a decision to send a death penalty eligible case to trial and to a jury for recommendation and/or where a jury’s recommendation for the death penalty are statistically rarer events. 
Similarly, these results may not extend to jurisdictions where there is significantly and/or substantively less variability in the legal and extralegal case characteristics observed in death penalty cases that go to trial and when a jury is commissioned to provide a recommendation as to punishment.” It also says “another interesting point for discussion centers on the negative effect of time (year that sentence was imposed) on death penalty decision-making that was observed in the traditional logistic regression models. This is suggestive of a trend that has received a growing amount of attention in the popular press and media where the application of the death penalty, in general, is becoming less common in more recent times and public perceptions have varied to a degree in their support of the death penalty in the U.S. and abroad (Applegate, Cullen, & Fisher, 2000; Behnken, Caudill, Berg, Trulson, & DeLisi, 2011; Bohm & Vogel, 2004; Bohm, Vogel, & Maisto, 1993; Jiang, Lambert, & Nathan, 2009; Liang, 2005; Lu & Zhang, 2005; Mancini & Mears, 2010; Wozniak & Lewis, 2010). Thus, it is an empirical question as to whether or not the application of this sentence (perhaps now) becoming reserved for only the most aggravated and least mitigated cases will have an influence on racial disparity going forward.”

In conclusion, the death penalty is not biased against blacks either through race-of-offender bias or race-of-victim bias.

Implicit Bias

DOES APPEARANCE MATTER?: THE EFFECT OF SKIN TONES ON TRUSTWORTHY AND INNOCENT APPEARANCES
Photos of capital inmates shown to entry-level criminal justice students for them to evaluate the trustworthiness of the faces.
Students rated pictures of light-skinned inmates as more trustworthy when they preceded pictures of dark-skinned inmates.
Most study participants (79.9%) were white, but the study reported that this wasn’t a major factor – “When controlling for race, no statistically significant result was found. This suggests that each race, White and non-White, were consistent in their rating outcomes. Prior research has found similar results, where Whites and light-skinned Blacks are likely to share similar attitudes towards darker-skinned Blacks.”
The sample for this study was not representative as the sample came from undergraduates from a single university. Per the study, “Undergraduate students at the University of Alabama in the Criminal Justice Department were used to analyze the photographs of capital case defendants.” The only time dark-skinned defendants were rated as less trustworthy was when they came after a light-skinned defendant, and that’s the only time. It’s unknown why this was the case, but Cohen’s d was large (d=0.8).

I’m not sure how this is an argument. This would mean that dark-skinned defendants who come after light-skinned defendants get rated as being less trustworthy than those preceding light-skinned defendants 71.4% of the time.
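
The 71.4% figure appears to match the standard "probability of superiority" (common-language effect size) conversion of Cohen's d, which is Φ(d/√2) under the assumption of two normal distributions with equal variances. That assumption is mine; the document does not say how the number was derived. A minimal sketch of the conversion:

```python
import math

def probability_of_superiority(d):
    """Convert Cohen's d to the probability that a random draw from the
    higher-scoring group exceeds a random draw from the lower-scoring group.

    Assumes both groups are normally distributed with equal variances,
    in which case the probability is Phi(d / sqrt(2)).
    """
    z = d / math.sqrt(2)
    # Standard normal CDF written in terms of the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(probability_of_superiority(0.8) * 100, 1))  # 71.4
```

With d = 0.8 this gives roughly 71.4%, while d = 0 gives exactly 50%, i.e. no difference between the two orderings.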

Regardless, the limited sample restricted to a single university does not allow us to make generalizations about anything.
Black Boys Viewed as Older, Less Innocent Than Whites, Research Finds
Students and police officers participated in tests to determine levels of racial bias and perception of innocence.
Black boys as young as 10 are more likely to be considered criminal or untrustworthy, and more likely to face police violence.
Police officers were tested on dehumanization of blacks by comparing people of different races to animal groups. Police who engaged in higher levels of dehumanization were more likely to use violence against black children.
On the topic of perceiving blacks as older, this can be attributed to racial differences in physical maturation (blacks mature faster than whites). According to Winegerd et al. (1973), “The white-black differences were great enough to provide the basis for an effective discriminant function. The total variation in maturity within the hand (the “disharmony” or “imbalance”) differs in blacks from such variation in the other races.”

They use the G&P method to get their results; however, the G&P method isn’t an accurate way to capture differences in maturity between black and white people. There is another method, the Tanner-Whitehouse (TW) method, but people use the G&P method because it’s less time consuming. Previous comparative studies in normal populations have shown that bone age is estimated as younger with the G&P method than with the TW method. This finding has mainly been attributed to racial and socioeconomic differences between the reference populations used for the two methods. The G&P method was based on a study of American children of high socioeconomic status in the 1940s, whereas the TW method was based on British children of low socioeconomic status in the 1950s. Previous comparative studies have compared the G&P method with either the TW1 or the TW2 methods. In the 2001 third edition of the TW method (now termed TW3) there are considerable changes in the reference population, which now includes population data from North America and Europe. Thus, bone ages estimated with the TW3 method are 1 year younger than those estimated with the TW2 method for children aged from 10 years upwards; however, they show smaller differences at younger ages (which goes against Winegard). Regardless, do we still need such standards? Testing whether the G&P standard can still be applied to American children of diverse ethnicity, Ontell et al. (1996) found that bone maturity in Asian and Caucasian American girls approximated chronologic age throughout childhood. The only significant discrepancy that they found was in Caucasian adolescent girls, whose bone maturity exceeded their chronologic age by about 4 months. This 4-month difference between the chronological age and the bone age is less than the normal distribution of bone age. Almost 20 years later, Cole et al. 
(2014) examined ethnic differences in the pattern of skeletal maturation in South African adolescents using a novel longitudinal analysis technique, Superimposition by Translation and Rotation (SITAR). No ethnic differences were found in the pattern or timing of skeletal maturity in the girls, while skeletal maturity in white boys was reached 7 months earlier than in black boys. They concluded that the delayed maturity of black boys, but not black girls, implied that black boys are more sensitive to environmental factors than black girls. Because these sex and race differences imply the existence of environmental constraints, we should be using unified standards of bone maturation in the clinical analysis of a given patient. When it comes to black people in particular, though, they are a special case study with respect to body growth and composition. On average, puberty and skeletal maturation of African American children occur earlier and their BMI is higher than those of Caucasian children. However, the variation within each of these groups is greater than the racial difference. Russel et al. (2001) confirmed that skeletal age in African Americans was more advanced than that of Caucasian Americans and that the advancement in skeletal maturation was due to the greater BMI of African Americans. After correction for lean body mass and either BMI, BMI SDS, or dual X-ray absorptiometry (DXA) fat mass, the difference between bone age and chronological age (BA-CA) and the ratio of bone age to chronological age (BA/CA) of African American and Caucasian children were no longer significantly different. The skeletal microarchitecture of African Americans and their children is denser than that of Caucasian Americans and their children (Putman et al. 2006, Hui et al. 2010), and this increase in density is also correlated with their greater BMI. 
A unified standard for skeletal maturity allows for that conclusion, which would have been missed if a race specific standard of bone maturity were used. Based on such arguments, the Center for Disease Control and Prevention (CDC) promotes one set of growth charts for all US racial and ethnic groups. Racial and ethnic specific standards of bone maturity are not recommended because the results of studies support the premise that differences in growth among various racial and ethnic groups are due to environmental factors and genetic differences between children Ogden et al (2002). Many countries have developed their own growth charts to describe the national, racial, and ethnic distribution of its child population. For example, the racial and ethnic distribution of the reference population in the US CDC growth charts for the USA is representative of the US population at the time when the National Household Education Survey (NHES) and National Health and Nutrition Examination Survey (NHANES) were conducted. However, the US CDC’s growth charts and the growth charts of most other countries rely on data that were collected from children who live in urban zones, and the validity of these data for children who live in rural zones has not been addressed. The growth and maturation of children who live in rural zones is different than those of children who live in urban zones, and this difference is very exaggerated in children who live in developing countries Mpora et al. (2014). That does not mean that urban or rural, privileged or underprivileged, this or another ethnic group require their own reference. Many of them will change their living places and conditions. Rather, I claim that children grow very differently due to their living conditions; the same growth standards are recommended for them precisely because the difference is down to plasticity in relation to the environment, not genetics. 
In developing its international growth charts, the WHO working group has used a similar rationale de Onis et al (2006). It recommended an approach that described how children should grow when they are healthy and well provided (the standard) rather than describing how children grow in their current milieu (the reference) Garza et al (2004). Their main finding was that “child populations grow similarly across the world’s major regions when their needs for health and care are met.” Greulich and Pyle designed their Atlas project in a similar way by selecting white children from the upper socioeconomic classes. So basically, environments change continuously, and we adapt our phenotype to the prevailing environment, even when the environmental changes are disruptive or even catastrophic. Adaptive plasticity has enabled individuals and societies around the globe to respond to environmental changes to survive and reproduce and may manifest itself as a continuous variation in traits Hochberg et al (2010). Adaptive responses override the “canalization of development” Waddington et al (1942) and the inheritance of acquired characteristics. The notion that genes are the primary determinants of physiognomy, which also includes growth-related traits, has been repeatedly disproved. Based on a fundamental understanding of phenotypic plasticity and an individual’s ability to respond to environmental cues, we do not need ethnic-specific standards for bone maturity. Clinicians are aware of the multitude controllers of bone maturation Hochberg et al. (2002). Maturation is delayed in children with constitutional delay of growth, malnutrition, chronic illness, high altitudes, and hormonal deficiencies. Often, several of these occur in the same child, and the clinician contemplates the combination. It is therefore still important to gather data on how and why the maturation rate varies and how this varies between populations. 
The potential implication for the diagnosis of using a single reference dataset, in which the whole population differs significantly from that reference, is part of that clinical contemplation. What we need is an international standard for assessing bone maturity. The current gold standards for assessing bone maturity (the G&P Atlas and the TW3 tables) are globally used to assess the bone maturity of children of different nationalities, races, and ethnicities. The appropriateness of these two methods explicitly needs testing as a priority, and new standards need to be developed if these data are found to be inadequate. Winegard itself even says “The race differences in particular warrant further investigation.” Again, multiple studies have concluded that the G&P method is not applicable to black people and physical maturity. Mansourvar et al. (2014) concluded that the G&P atlas is not applicable to other ethnic groups across different age ranges, especially in the sample of the male African/American group from 8 years to 15 years and the Asian group during childhood. Dembetembe et al. (2011) showed that the current skeletal age estimation standards, formulated by Greulich and Pyle, are not directly applicable to male South Africans of African biological origin. Alshamrani et al. (2019) concluded that the G&P standard is imprecise and should be used with caution when applied to Asian male and African female populations, particularly when aiming to determine chronological age for forensic/legal purposes.
Having a single reader read all roentgenograms using a specific atlas can introduce observer bias and limit the reliability and validity of the study's findings. The use of a single reader may not account for individual differences in interpretation and classification of bone maturity, which can introduce systematic error and reduce the generalizability of the findings.
Interpolating age to the nearest three-month interval can lead to imprecision because it involves rounding the exact age of each bone center to the nearest quarter of a year. For example, if a bone center is measured to be 5.7 years old, the interpolation would round the age to 5.75 years old. Similarly, if another bone center is measured to be 5.9 years old, the interpolation would round the age to 6.0 years old. This rounding introduces measurement error and variability into the bone age estimates. In the above example, the interpolated age for the first bone center is off by 0.05 years, while the interpolated age for the second bone center is off by 0.1 years, and in general a single reading can be off by as much as 0.125 years (about a month and a half). These small differences can accumulate over multiple bone centers, leading to imprecision in the overall bone age estimate for each child. Moreover, interpolating age to the nearest three-month interval can lead to misclassification of children into different bone age groups, particularly if there are many children whose bone age falls near the cutoff point between two age groups. This can lead to misinterpretation of the results and inaccurate estimation of the prevalence of bone age delay or advancement in the study sample.
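To make the rounding arithmetic concrete, here is a minimal sketch. It assumes that a "three-month interval" means a multiple of 0.25 years; the helper names and the example ages are illustrative and are not taken from the study.

```python
def round_to_quarter_year(age_years: float) -> float:
    """Round an age in years to the nearest 0.25 (three months)."""
    return round(age_years * 4) / 4


def rounding_error(age_years: float) -> float:
    """Absolute difference between the rounded and the exact age."""
    return abs(round_to_quarter_year(age_years) - age_years)


# Illustrative ages only (assumed values, not study data).
for age in [5.7, 5.9, 5.125]:
    print(age, round_to_quarter_year(age), round(rounding_error(age), 3))
```

Under this assumption 5.7 maps to 5.75 and 5.9 maps to 6.0, and no single reading can be off by more than 0.125 years.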
Varying the order in which individual centers were read from film to film introduces a potential source of systematic error known as "order effect" or "sequence bias." This occurs if the order in which the bone centers are read systematically affects the reader's interpretation of the images. For example, if the reader consistently reads the bone centers in a particular order, they may be more likely to interpret the later bone centers as being more mature or advanced than the earlier bone centers. Alternatively, if the reader is fatigued or rushed towards the end of the reading session, they are more likely to make errors or miss subtle differences in the later bone centers. Varying the order in which bone centers are read can also introduce variability and reduce the reliability of the bone age estimates. This is because different readers may have different preferences for the order in which they read the bone centers, which can lead to differences in bone age estimates even if they are reading the same films. Overall, varying the order in which individual centers were read from film to film can introduce systematic error and reduce the reliability and validity of the bone age estimates.
Limiting the study to only 22 specific centers that can be expected to be present at the ages considered can potentially disadvantage the study by introducing a selection bias. This is because the bone age estimates obtained from these specific centers may not be representative of the overall skeletal maturity of the hand and wrist, which is a complex and multifaceted process involving many bones and structures. Furthermore, limiting the analysis to only a subset of the bones in the hand and wrist can lead to overestimates of skeletal maturity in some racial groups, such as Blacks, who have been shown to have relatively advanced skeletal maturation compared to Whites and Asians. This is because the 22 centers that were studied may be more mature in Blacks than other bones in the hand and wrist that were not included in the analysis. Additionally, the exclusion of certain bones from the analysis may also lead to underestimates of skeletal maturity in some children, particularly those with developmental delays or abnormalities affecting the excluded bones. This can lead to inaccurate diagnosis and management of growth and developmental disorders in these children.
The variability of the lunate and triquetral bones can potentially be accounted for through the use of more advanced imaging techniques, such as magnetic resonance imaging (MRI) or computed tomography (CT) scans, which can provide a more comprehensive assessment of skeletal maturity.
The use of the contrast as the unit of measurement may disadvantage the study by introducing a bias that can result in overestimation of skeletal maturity in Blacks and underestimation in Whites and Asians. This is because the contrast is defined as the difference between the age of each bone centre and the mean of all 22 centres considered. The use of this approach assumes that each bone centre matures at the same rate, and any differences observed are due to variability in the timing of maturation. However, it is well established that there is significant variability in the timing of bone maturation between different bones, with some bones maturing earlier or later than others. By using the contrast as the unit of measurement, this variability in the timing of maturation between different bones is not accounted for, which can lead to inaccurate estimates of skeletal maturity.
The study notes that 21 of the 22 contrasts are independent, and that for a hand that is developing homogeneously the contrasts are small, while for one in which some bones are advanced or lag behind others the contrasts are large. This may limit the amount of information available for analysis and the accuracy of the results.
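The contrast described above can be sketched in a few lines. This is an illustration only: the bone ages are made-up values for five centers rather than the study's 22, and the function name `contrasts` is hypothetical. It shows why one contrast is always determined by the others: by construction the contrasts sum to zero.

```python
from statistics import mean


def contrasts(center_ages):
    """Return each center's age minus the mean age of all centers."""
    overall = mean(center_ages)
    return [age - overall for age in center_ages]


# Hypothetical bone ages (years) for five centers of one hand.
homogeneous = [8.0, 8.0, 8.25, 8.0, 7.75]
uneven = [8.0, 9.5, 7.0, 8.75, 6.75]

for label, ages in [("homogeneous", homogeneous), ("uneven", uneven)]:
    c = contrasts(ages)
    print(label, [round(x, 2) for x in c], "sum =", round(sum(c), 10))
```

With 22 centers the same constraint leaves 21 contrasts free to vary, and a hand with uneven development produces larger contrasts than a homogeneous one.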

Moreover, according to Rushton (2000), blacks reach sexual maturity sooner than whites, who in turn mature sooner than Asians. This is true for things like age at first menstruation, first sexual experience, and first pregnancy. One study of over 17,000 American girls in the 1997 issue of Pediatrics found that puberty begins a year earlier for Black girls than for White girls. By age eight, 48% of the Black girls (but only 15% of the White girls) had some breast development, pubic hair, or both. For Whites this did not happen until ten years. The age when girls began to menstruate was between 11 and 12 for Black girls. White girls began a year later. Sexual maturity in boys also differs by race. By age 11, 60% of Black boys have reached the stage of puberty marked by fast penis growth. Two percent have already had sex. White boys tend not to reach this stage for another 1.5 years. Orientals lag one to two years behind Whites in both sexual development and the start of sexual Interest. Young blacks are also more likely to be criminal than whites.

Rushton's work is very repetitive. He never actually published any original information, which is why he is not taken seriously in any field of study. Before moving on to the maturity claims, I will highlight a few things. His INTERPOL data is meaningless if he wants to claim a "genetic component" to crime, because he has been dead WRONG in the past; see: https://www.researchgate.net/publication/332449021_Interpol_Crime_Statistics_and_Rushton's_Racial_Dogma
His claims on personality were put into question here and his studies were irrelevant in that section (who the hell cares what teachers' opinions of immigrant students are?):
https://www.sciencedirect.com/science/article/abs/pii/S0092656613000640
His work on testosterone (and penis size) was wrong in every which way (see the SL doc). His sexual activity section is laughable, and I should not have to tell you why it is both irrelevant and stupid ("A 1951 survey asked people how often they had sex. Pacific Islanders and Native Americans said from 1 to 4 times per week, U.S. Whites answered 2 to 4 times per week, while Africans said they had sex 3 to 10 times per week.") 💀 come on....really? So while he obviously goes over more info, it is mostly repetitive (brain size, IQ).
Please read BOTH of these for a good laugh
https://notpoliticallycorrect.me/2019/09/03/jp-rushton-serious-scholar/
https://notpoliticallycorrect.me/2019/05/05/what-rushton-got-wrong/
Herman-Giddens et al. (1997) is the study Rushton is referencing for black girls, and that study had limitations. One limitation was that participants were not chosen at random but were selected from girls who came to pediatricians’ offices for a regular checkup or a problem that required a complete examination. In addition, the study failed to determine which girls were experiencing early puberty because of an underlying medical problem, such as a brain tumor. The author acknowledged her findings may have been skewed if a significant number of the girls were brought to their doctors because of concerns they were developing too early sexually. The author also acknowledged other methodological errors: “The subjects were being seen for visits requiring complete physical examinations in largely suburban practices in a practice-based research network. Neither the practices nor the girls in the study were selected randomly to represent a statistical sample. If these girls differ systematically from girls in the general population, the results could be questioned.” “Alternately, it is conceivable that a selection bias was operating such that younger girls with evidence of development were more likely to be brought in for physical examinations because their parents were concerned, and that this could account for the earlier onset of pubertal changes in our sample. If so, this bias might also be expected to be operating among parents of 12-year-old girls with no development, leading to a decrease in the prevalence of secondary sexual characteristics in that age group.” “Another consideration concerns the etiology of pubertal changes occurring in girls in the study. 
No data were collected on endocrine evaluations that the early developers in the study may have received; therefore, we do not know if some of these girls had pathologic conditions affecting their development.” Much of the higher age averages reflect nutritional limitations more than genetic differences and can change within a few generations with a substantial change in diet. The median age of menarche for a population may be an index of the proportion of undernourished girls in the population, and the width of the spread may reflect unevenness of wealth and food distribution in a population. Researchers have identified an earlier age of the onset of puberty. However, they have based their conclusions on a comparison of data from 1999 with data from 1969.

For instance, one report from the US Department of Education found that Black preschoolers are 3.6 times more likely to be suspended than white preschoolers, and black students are 2.3 times more likely than white students to be referred to law enforcement or arrested as a result of a school incident. Another report by the Civil Rights Data Collection remarked that black girls account for 20% of all female preschoolers and 54% of female preschoolers who are suspended more than once. Black preschool children are 3.6x more likely to be suspended than whites. This can not be pinned onto racial bias. Wright et al. (2014) remarked that the black-white suspension gap was completely accounted for by controlling for past behavioral problems, suggesting that the gap is not due to racial bias.

Wright et al.’s result was driven entirely by selection bias and an improper measure of prior problem behavior (PPB). Primarily, Wright et al. suffers from severe sample attrition: numerous students were lost from the study when the authors created their PPB metric. Their PPB metric was also too imprecise. These limitations, along with others, are literally in the study: “Readers may question whether our measure of prior problem behavior, which was assessed through teacher reports, is endogenous with the outcome variable. From a labeling perspective, a theoretical possibility arises that teachers may label students as ‘problem children’ in the early years and that the label may then ‘stick’ to youth as they progress through elementary and middle school grades” or “our measures of school suspensions and delinquency were relatively limited. It is indeed odd that a large educational dataset such as the ECLS-K would contain so few questions relevant to school conduct, problem behavior, and school discipline” or “sample attrition was substantial.” Studies like Huang et al. (2017) have criticized Wright et al. and others, saying “although studies have investigated similar research questions using large datasets which contain data on suspensions, attitudes, and behaviors, the datasets are often at least over a decade old and may not reflect current conditions.” Expanding on PPB, studies like Huang et al. (2020) have criticized Wright et al. because the measure of PPB, measured in fall kindergarten, spring first, and spring third grades, may not actually be a measure of prior problems. The measure was based on the Social Skills Rating Scale (Gresham & Elliot, 1990) that measured approaches to learning (ATL), self-control, interpersonal skills, and externalizing problem behaviors. Huang indicated that only the last subscale was an actual manifestation of problem behaviors and that the other subscales are related (e.g., self regulation) but different from behavior problems. 
For example, ATL used in kindergarten to first grade consisted entirely of items related to eagerness to learn, interest in things, and task persistence, which are not indicators of problem behaviors. A measure excluding ATL may be more suitable. If the results are driven by ATL rather than behavior problems, then results may be weaker if ATL is removed. However, having a “purer” construct of problem behaviors could result in stronger findings as well. In addition to the possible issues mentioned by Huang (2018), other concerns deserve attention as well. One basic issue is that the outcome (i.e., suspension that included both in- and out-of-school suspensions) was an eighth-grade parent-reported measure of whether a student had ever been suspended, meaning that the student could have been suspended even once at any grade up to the eighth grade. Although the use of suspensions may rise as a child progresses through school (i.e., used more frequently in middle vs. elementary school), the outcome is quite imprecise. Even as early as preschool, Black children were 3.6 times more likely to receive one or more suspensions compared to White students (U.S. Department of Education, 2016). The Wright et al. (2014) analyses implicitly assumed by the use of certain predictors that the suspensions occurred in the eighth grade (e.g., using eighth-grade predictor variables when the actual suspensions could have occurred even prior to the eighth grade). However, this is a basic limitation of the data set (i.e., ECLS-K) used, and others have used the suspension variable in a similar manner (e.g., Morgan et al., 2019). Another potential issue is that the main predictor variable of interest, PPB, is a teacher-reported variable, and teachers may be biased reporters of student behavior (Gilliam et al., 2016). 
Evidence from a majority of studies has shown that problem behaviors or attitudes have a large association with the receipt of an out-of-school suspension (OSS), but most studies have indicated that the differences in behavior are not enough to account for the large disparities in the issuance of suspensions for Black and White students (Gregory et al., 2010; Wu et al., 1982). For Black students, the risk of receiving a suspension is much higher, even as early as prekindergarten, not necessarily because of problem behaviors but because teachers may expect Black boys to misbehave more and thus watch them more closely (Gilliam et al., 2016). Gilliam et al.’s (2016) experimental findings are particularly important and challenge the notion of teachers as completely unbiased reporters. In addition to teachers being potentially biased reporters, another experimental study showed that even when Black and White students commit the same infraction, teachers often issued harsher sanctions (i.e., differential treatment) for Black students (Okonofua & Eberhardt, 2015). Owens and McLanahan (2019) indicated that almost half of the racial suspension gap can be attributed to the differential treatment of Black and White children who enter school with the same behaviors. If there is bias against Black students in the assessment of behaviors and the administration of suspensions, controlling for PPB could fully explain the racial disparities in disciplinary sanctions, because the PPB measure itself would already reflect that bias. Given that the ECLS-K is publicly available and that Wright et al. (2014) indicated that “our results await replication” (p. 263), studies reanalyzed Wright et al.’s original findings and tested alternative model specifications where (a) models used the same samples instead of shifting samples and (b) a measure of PPB was constructed excluding ATL. 
Additional models were tested that (a) used multiple imputation to account for missing data; (b) used the more proximal measure of fifth-grade problem behaviors, which is still a measure of PPB but should be stronger; (c) used parent-reported PPB because the use of parent-reported PPB may address some of the issues with using a teacher-reported PPB; and (d) used externalizing behavior only as a measure of PPB. Basically, studies concluded that although Wright et al.’s (2014) analyses of the ECLS-K suggested that the disparities in suspension rates could be attributed to the differential behavior between Black and White students, studies’ reanalysis, using the public-use version of the ECLS-K, shows otherwise. Once the same sample was used in comparing models—one model without PPB and one model with PPB— findings showed that race was already not a meaningful predictor of suspensions prior to the inclusion of PPB in the model. Further analyses indicated that Wright et al.’s findings were driven primarily by sample selection bias. Additional investigation—using multiple imputation, a modified PPB measure using fifth-grade teacher reports, externalizing behavior as PPB, as well as a PPB based on parent reports—showed that disparities in suspension rates based on race could not be fully explained by PPB. The shifting regression coefficients for the baseline models suggest differences with the sample selected resulting from survivorship bias. Because the baseline models used exactly the same predictors, point estimates should not change substantially (using a linear probability model) based on subsample investigated if there were no differential attrition. However, using the same model and merely excluding students from the sample, the Black coefficient was reduced from 10 percentage points to 5 percentage points, a difference of 5 points. 
In contrast, including PPB and using a consistent sample between models reduced the Black coefficient from 5 to 4 percentage points: a difference of 1 point. This highlights the importance of using the same analytic sample when comparing model results. Although the disparities in suspension rates are large based on race alone (i.e., 12% of White students were suspended vs. 33% of Black students, a difference of 21 percentage points), once additional covariates were included in the model, the difference in rates was effectively halved. When all covariates were included (without PPB), Black students were 10 percentage points more likely (down from 21 percentage points) to be suspended compared to White students. The reduction in the disparities has been shown as well in other studies that have included relevant variables (e.g., gender, SES) known to be related with suspension (Huang, 2018; Huang & Cornell, 2018). Implicit bias (Ispa-Landa, 2018) and differential treatment (Okonofua & Eberhardt, 2015; Owens & McLanahan, 2019) may contribute to the racial disparities in school discipline. Although problem behaviors are important to account for, PPB does not fully explain race-based disparities in suspensions.
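The arithmetic behind this comparison can be laid out explicitly. The sketch below only restates the percentage-point figures quoted above (33% vs. 12% raw, then coefficients of 10, 5, and 4 points); the variable names are illustrative, and it is a bookkeeping aid under those quoted numbers, not a reanalysis.

```python
# Suspension rates quoted in the text (percent of students ever suspended).
black_rate, white_rate = 33, 12
raw_gap = black_rate - white_rate  # percentage points

# Black coefficient (percentage points) at each modelling step, as quoted.
gap_with_covariates = 10    # all covariates, no PPB
gap_restricted_sample = 5   # same model, after students are excluded from the sample
gap_with_ppb = 4            # restricted sample, PPB added

steps = {
    "adding covariates": raw_gap - gap_with_covariates,
    "changing the sample": gap_with_covariates - gap_restricted_sample,
    "adding PPB": gap_restricted_sample - gap_with_ppb,
}
for step, drop in steps.items():
    print(f"{step}: gap falls by {drop} percentage points")
```

Laid out this way, the step that changes the sample moves the coefficient five times as much as the step that adds PPB, which is the point about sample selection made above.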

Skiba et al. (2002) found the racial gap in suspension rates persisted even after SES was controlled for, and found that whites and blacks had the same chance of being suspended once they were sent to the office. This too suggests that these disparities are not due to bias. Skiba et al. note that “African-American students are referred to the office for infractions that are more subjective in interpretation”, but black behavior in the classroom is not the same as white students.

Skiba et al. (2002) does not find “no evidence of bias.” I can’t believe FB came to that conclusion. That’s so memeable. That study found African American students were more likely to be referred to the office for more subjective infractions. Skiba et al. (2014), I believe, finds that delinquency can’t explain the gap, while better studies such as Huang (2018) and Owens and McLanahan (2019) find the gap to be smaller than previously thought.

Kochman (1983) vividly describes race differences in attitude toward various rule governed social interactions. In formal negotiations, he finds, whites are more interested in following “the rules of negotiating” and “the negotiating procedure,” whereas blacks are more driven by their emotions and see conformity to these rules as defeat (37–42). In turn-taking situations such as the classroom, “the white classroom rule is to raise your hand, be recognized by the instructor, and take a turn in the order in which you are recognized.… The black rule, on the other hand, is to come in when you can.… Within the black conception, the decision to enter the debate and assert oneself is self-determined, regulated entirely by individuals’ own assessment of what they have to say ” (24–28). Marcus (2007) found that “Blacks showed from 13% to 78% greater involvement than Whites for all forms of aggressive and violent behavior, whereas for feeling unsafe at, or to or from, school showed 123% more Blacks felt unsafe than Whites. Racial-ethnic differences of this magnitude have been reported in other national surveys that were roughly similar.” Johnston et al. (2008) noted that based on various questionnaires, blacks self-reported being about 10 percent more violent than whites. Hartup (1974) had a group of observers rate children on their aggression levels. Particularly in instrumental aggression, older black children were more aggressive than older white children. There was likely a difference in hostile aggression as well, but it was not detailed. There was a Race x Age interaction – the differences between whites and blacks were small at a younger age and grew as the groups were older. Mayberry and Espelage (2005) found that the aggression differences between blacks and whites are larger in reactive aggression (the hostile component). If we average the means and SDs for black females and males and white females and males, and we use the white avg. 
SD (which is very similar to the black avg. SD), then we find black people are 0.5833 white SDs higher in reactive aggression than whites are. This is certainly large enough to be consequential.

This study is from only one middle school, with a small sample of 443 students (147 of whom were black), so it’s not representative. Plus, the scale they use to measure aggression (which they say was developed by Little et al. (2003)) has several problems. 1. The reliance on self-report of the functional purpose of aggressive acts is not without potential bias. The degree to which other factors such as self-presentation bias, acquiescent-response bias, and the like are involved does weaken the validity of the classifications. 2. There is a lack of absolute cutoffs for classifying youth. 3. It only hints at potential antecedents and consequences of the different subgroups of aggressive youth. This strictly cross-sectional approach raises a number of questions that only longitudinal work could address. For example, are characteristics such as shyness, frustration intolerance, and hostility antecedents to aggressive behavior or are they consequences of it? Are the subgroups of aggressive youth stable over time or are they dependent upon age and context? 4. Work that would provide a critical piece of this puzzle would include a detailed examination of who the recipients of the different aggressive acts are. For example, instrumentally aggressive youth might not choose “easy targets” like the neither group might, but instead choose targets that hold a desired resource or who are at about the same level of social dominance. A well-placed and successful aggressive act toward a near challenger would have a double effect of thwarting the challenger and sending a message to all other challengers at or below the thwarted challenger’s rank: “Don’t even think about it.” Similarly, the source of the elevated reports by parents would be better understood knowing who the recipients were. Other studies contend that the reactive aggression subscale of the Little et al. measure was not related to social problems (Fite et al., 2009). 
Mayberry and Espelage even said themselves that the results shouldn’t be used to draw conclusions: “Only a few studies have examined race differences in aggression, but none have used the Little et al. (2003) scale that more comprehensively examines proactive and reactive aggression. Thus, until the psychometric properties of this measure are closely examined these results should be interpreted with caution. In addition, this study did not assess larger school climate factors that might better explain the race differences in aggression.”
The claim that these differences are large doesn’t seem true either. Means and standard deviations are sensitive to skew and outliers, so a standardized difference calculated from them can overstate the typical gap between groups.
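A small illustration of the point about skew and outliers, using invented scores (not data from any study cited here): a single extreme value in one group inflates a mean-based standardized difference, while the difference in medians moves far less. The function name `standardized_gap` is hypothetical; like the calculation being criticized, it divides by the reference group's SD.

```python
from statistics import mean, median, stdev


def standardized_gap(group, reference):
    """(mean(group) - mean(reference)) / sample SD of the reference group."""
    return (mean(group) - mean(reference)) / stdev(reference)


# Invented scores for illustration only.
reference = [1, 2, 3, 4, 5]
similar = [1, 2, 3, 4, 5, 3]         # no outlier
with_outlier = [1, 2, 3, 4, 5, 30]   # one extreme score

for label, group in [("similar", similar), ("with outlier", with_outlier)]:
    gap = standardized_gap(group, reference)
    median_diff = median(group) - median(reference)
    print(label, round(gap, 2), median_diff)
```

This does not show that the reported difference is an artifact; it only shows why a standardized mean difference should be checked against the shape of the underlying distributions before it is called large.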

These results can also not be pinned onto racial bias by teachers via their student assessments (Chang and Sue 2003).

This was also based on a small sample in California and involved randomized vignettes about students for whom educators had little context. Bias may not have shown up because of California’s diversity, which may reduce bias. The study also has several problems, including the use of only three vignettes and three undergraduates to represent the constructs of race and problem type, the restriction of the study to a single gender, and the large number of related variables assessed. This means that subjects could have been responding to specific details in the vignettes or in the photographs that may not be generalizable to other overcontrolled or undercontrolled behavior presentations or to other racial phenotypes. Other studies have even found an influence of cultural miscommunication. Tyler, Boykin, and Walton (2006) found that teachers rated students presenting mainstream cultural values as having significantly higher classroom motivation and academic achievement than students exhibiting African American values. This suggests that although teachers may not exhibit outright racial bias, they may show a preference for traditional European American-valued behaviors over conduct that may be more typical or valued by African Americans.

As crazy as it may sound, officers are correct in assuming that blacks, even at a young age, will be more criminal than white — especially given racial differences in behavior. When it comes to IAT tests in general, there are a lot of problems with them. One doc, like Vaush’s, argues that IAT tests are valid because they have predictive validity — also known as the ability to predict real world behavior. To support this, Greenwald et al. (2009) are cited as support. Unfortunately for them, a re-look at Greenwald et al. found them to be only weak predictors (Oswald et al. 2013).

Greenwald, Banaji, and Nosek (2015) responded to Oswald et al. with two points. First, there were differences in the two meta-analyses’ published conclusions due to differences in the methods they used. The main method difference between the two studies was in their respective policies for including effect sizes. GPUB (Greenwald et al., 2009) limited their meta-analysis to effect sizes for which there was reason to expect nontrivial predictive validity correlations of IAT measures with criterion measures. OMBJT (Oswald et al., 2013) included numerous additional effect sizes that lacked a basis either in existing theory or in author-provided rationale for expecting positive correlations. GPUB explicitly described their article as a “meta-analysis of predictive validity,” whereas OMBJT did not describe a goal of assessing predictive validity; they instead described their study as a “meta-analysis of IAT criterion studies.” This important strategy difference, with its concomitant difference in policies for including effect sizes, explains most of the difference between the average effect sizes that the two meta-analyses estimated. Second, both meta-analyses estimated aggregate correlational effect sizes that are large enough to justify concluding that IAT measures predict societally important discrimination. GPUB did not comment on societal significance in their article, whereas OMBJT concluded that IAT measures show “poor prediction of racial and ethnic discrimination” and provide “little insight into who will discriminate against whom”. OMBJT’s conclusion did not take into account that small effect sizes affecting many people or affecting individual people repeatedly can have great societal significance. Although Oswald et al. (2015) responded to Greenwald, Banaji, and Nosek saying that small effect sizes don’t play a role in society, Jost (2019) cited Oswald and said the reasons why past summaries have turned up fairly low correlations between implicit racial attitudes and behavioral outcomes include: 1. 
Measures of implicit attitudes and behaviors were low in methodological correspondence. 2. Researchers have seldom adjusted properly for measurement error (Greenwald et al., 2015; Greenwald et al., 2009; Kurdi et al., 2018). Kurdi et al. (2018) even did a meta-analysis, based on a total sample size (N = 36,071) that is 6 to 10 times larger than those used in previous meta-analyses by Greenwald et al. (2009) and Oswald et al. (2013), and it found that standard IAT scores are indeed robust predictors of behavioral outcomes (with correlations as high, in some cases, as .37) and that they exhibit incremental validity (after adjusting for explicit attitudes in structural equation models), especially when one focuses on high-quality studies using standard (as opposed to modified) IATs with large sample sizes. Another subsequent meta-analysis came from Carlsson and Agerstrom (2016), who “turned things around and instead focused on the validity and reliability of the discrimination outcomes” IAT scores have been measured against. Carlsson and Agerstrom (2016) pay particularly close attention to the myriad ways in which ‘discrimination’ can be operationally defined, and the potential effects that a wide range of criterion measures can have on a meta-analytic result. Their findings raise questions regarding the very foundations of the Greenwald et al. (2009) and Oswald et al. (2013) meta-analyses. 
In regards to measures of discrimination, they find that “the level of heterogeneity among the studies is striking and is not driven by any specific outlier study,” evidence that they conclude “casts a doubt on the appropriateness to draw strong conclusions regarding the average level of discrimination, or, the average level of predicted discrimination from the IAT.” In their meta-analysis of discrimination outcomes, because “the results are widely inconsistent between different studies,” Carlsson and Agerstrom (2016) conclude that “attempting to meta-analytically test the correlation between IAT and discrimination thus appears futile. We are, essentially, chasing noise, and simply cannot expect any strong, or even moderate, correlations, based on the current literature.” They find that “it is doubtful whether the amalgamation of these outcomes is relevant criteria for assessing the IAT’s predictive validity of discrimination.” Carlsson and Agerstrom (2016) write that “in a sense, evaluating the IAT’s ability to predict discrimination based on the current literature is akin to testing out raincoats on sunny days. Unsurprisingly, the raincoats will receive a bad score, since they are particularly unsuitable on sunny days. However, this does not invalidate their usefulness in rainy weather.” Ultimately, Carlsson and Agerstrom (2016) “caution from drawing any strong conclusions regarding the IAT’s predictive validity until there is sufficient accumulated evidence based on high quality discrimination outcomes.” Thankfully, Carlsson and Agerstrom’s call for higher quality discrimination outcomes has, at least somewhat, been answered. In the most recent meta-analysis relating IAT scores and behavior, Kurdi et al. (2018) confront the criticisms levied by Carlsson and Agerstrom (2016) against the Greenwald et al. (2009) and Oswald et al. (2013) meta-analyses. 
Following the concern from Carlsson and Agerström (2016) that IAT literature is plagued by heterogeneity, Kurdi et al. (2018) suggest that “any single point estimate of the implicit–criterion relationship would be misleading.” Conceptually, they argue, “instead of asking whether implicit measures of intergroup cognition are related to measures of intergroup behavior, it may be more appropriate to ask under what conditions the two are more or less highly correlated.” In answering the “under what conditions” question, Kurdi et al. (2018) rely on their study’s numerous advantages over the meta-analyses from both Greenwald et al. (2009) and Oswald et al. (2013), including: its vast increase in size (N = 36,071, compared to 3,471 and 5,433), its greater diversity in “quality and quantity of variables represented across investigations, including the target groups and types of behaviors examined,” its inclusion of “a substantial number of studies conducted in real-world and online settings, using a considerably more diverse pool of participants and ecologically meaningful measures of behavior,” and its ability to draw upon recent “advances in statistical methodology” that now “allow for explicit modeling of dependencies among effect sizes extracted from the same study … as well as for the appropriate treatment of measurement error.” Taking advantage of these improvements, Kurdi et al. (2018) were able to calculate “a prediction interval, which is a measure of the expected range of effect sizes in a given domain.” In doing so, they set their result apart from those reported by Greenwald et al. (2009) and Oswald et al. (2013), who both chose to only calculate single-point estimates of ICCs. In the domain of intergroup discrimination in particular, Kurdi et al. 
(2018) “obtained a 90% prediction interval of rmin = -.14 to rmax = .32, indicating that ICCs in [this domain] should be expected to range from small negative to medium-sized positive relationships,” indicating that IAT scores may still be relatively useful predictors of behavior. One of the most important findings from Kurdi et al. (2018), though, concerns the relationship between ICCs and a particular study’s methodological choices. After only including studies that “(a) had the relationship between implicit cognition and behavior as their primary focus, (b) used relative or difference score measures of behavior, (c) used an IAT or IRAP, (d) used attributes that were polar opposites of each other, and (e) used highly correspondent implicit and criterion measures,” Kurdi et al. (2018) found an effect size of r = .37, suggesting that the highest validity studies also produced the highest ICCs. They further note that “as criteria were systematically relaxed to include a wider range of studies, the estimate of the effect size decreased and so did the power of the average study to give rise to meaningful inferences about the underlying population effect.” After only including studies that scored low on the variables in question, Kurdi et al. (2018) found an effect size of r = .02, suggesting that the weakest studies also produced the weakest ICCs. Taken together, the findings from Kurdi et al. (2018) strongly suggest that IAT scores can indeed be used as predictive measures of behavior. The large size and diversity of their sample, as well as their use of prediction intervals rather than single-point estimates, likely give their results more practical relevance than the results reported by Greenwald et al. (2009) and Oswald et al. (2013). But more importantly, their findings regarding the relationship between methodological choices and ICCs call into question the effect sizes of both of the previous studies: it is wholly possible that both Greenwald et al. 
(2009) and Oswald et al. (2013) underestimate ICCs by including studies in their sample that have relatively low validity and, if they had instead focused their analyses on only the highest quality studies, would have reported significantly higher ICCs. Although it seems that evidence from Kurdi et al. (2018) somewhat concludes this discussion and establishes an at least moderately strong link between IAT scores and behavior, the warning that they and Carlsson and Agerström (2016) issue regarding the heterogeneity of effect sizes should still be considered. As a result, the following sections will follow the “under what conditions” line of reasoning by referencing important IAT literature when applicable.

The test also has a low test-retest reliability (Nosek, Greenwald, and Banaji 2005),

Observing implicit bias affects that implicit bias. That’s how implicit bias works: any interaction a person has will affect it, because implicit bias is the conglomeration of everything experienced, seen, and idealized. So when a measurement deals with a specific situation, that situation can never be replicated exactly. On a retest, participants already know what’s coming and have thought about it, so their implicit biases about that situation have changed. Large-sample IAT results can therefore be reliable even when individual test-retest replicability isn’t there. Besides, the abstract even says “together, these analyses provide additional construct validation for the IAT”. It even references a previous study by the authors saying “the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998) has been used to study implicit social cognition in part because of its ease of implementation, large effect sizes, and relatively good reliability (Greenwald & Nosek, 2001).”

further casting strong doubt on its validity (see Blanton et al. 2009 for more).

This study compared behavior toward a single black confederate and contrasted it with behavior toward a single white confederate, effectively rendering the discrimination due to race nested within individual confederates. In other words, the (reversed) discrimination effect may have been spuriously produced because of individual differences between the two confederates that have nothing to do with their respective race. This problem with a limited number of randomly selected, or matched, stimuli has received attention both in economics (Heckman, 1998) and psychology (Judd, Westfall & Kenny, 2012). In short, this type of design makes it impossible to know whether (and how much of) the variance in differential treatment (the outcome) is due to discrimination or due to stimuli-specific effects. Hence, this type of outcome is both invalid and unreliable to use as an outcome variable when the goal is to predict discrimination. Carlsson & Agerström 2016 address and expand on Blanton’s works as previously noted.

So even though the IAT is a flawed test, police officers are still right in their assumptions about black criminality. Even if the IAT was a sound test, though, police officers would still be right.
Racial Bias in Judgments of Physical Size and Formidability
Results from three separate studies on perception and racial bias show people have a tendency to perceive black men as larger and more threatening than similarly sized white men.
Participants also believed the black men were more capable of causing harm in a hypothetical altercation and police would be more justified in using force to subdue them, even if the men were unarmed.
According to Johnson and Wilson (2019), stereotypes based on physical attributes are accurate.

This study is only from Michigan State University. The sample of this study wasn’t even random. It suffered from voluntary response bias since they were just participating in the study for course credit. Height was also statistically insignificant in the 2nd study. They even made height judgements from facial photographs and did not track targets' actual height. It also has limitations such as one they literally mentioned themselves, “One limitation of this study was that it was exploratory; stimuli were collected for unrelated purposes. Additionally, the majority of raters and targets were White women, limiting generalizability.” It also doesn’t come to the conclusion this person is spewing. “Stereotypes exaggerated the relationship between Black men and size or strength.” and, “for Black men, stereotypes caused people to overestimate the relationship between race and strength and size. The reason group stereotypes improved accuracy (other than for Black men) is because raters’ judgments tracked targets’ actual strength and size only moderately.” This is consistent with the Ultimate Research Document of perceiving Black men as larger, despite low accuracy. In fact, from table 4 of the study the author cited, we see that the targets’ physical features only explained at most 17% of strength judgements.
Additionally, their confirmatory analysis was underpowered: “All analyses had over 90% power to detect these effects, except for our analyses of perceived strength for men.”

Even then, blacks are more threatening, are more likely to cause harm, and are more violent — as noted up above. Not much to be said here.

Policing and Racial Profiling

For this specific section, I want to focus on the police killings, something Rose is very fond of. As before, this follows the same format.
Menifield et al. 18
Bias in policing isn’t just a “few bad apples,” nor is it a problem among white police officers specifically; policing practices inherently operate in a discriminatory manner.
The disproportionate killing of African Americans by police officers “is likely driven by a combination of macro‐level public policies that target minority populations and meso‐level policies and practices of police forces.”
“Much research in organizational theory suggests that the problem of disproportionate killing may be fundamentally institutional.”
Also outlines past studies on policing that recognize the disproportionate impacts of institutional policies on minorities
Edwards et al. 19
Black, American Indian, and Alaska Native people are significantly more likely to be killed by the police than white people
“For young men of color, police use of force is among the leading causes of death.”
The Guardian 15 (Cited)
POC are killed at a disproportionate rate, even more so when unarmed.
In contemporary politics, it’s common to hear the argument that black individuals are at risk of being shot and killed by the police. In this post, I’ll show that the evidence used to support this assertion is based on a flawed benchmark and proper benchmarks show no racial bias against blacks in police shootings and killings. Buehler (2016) used national vital statistics and census data from the Centers for Disease Control and Prevention’s Wide-ranging Online Data for Epidemiologic Research (WONDER) to estimate deaths from legal intervention. After adjusting per capita, black deaths from legal intervention were 2.8 times higher than for whites. Looking at 1,217 police shootings from 2010 to 2012, Gabrielson, Sagara, and Jones (2014) found that young black males were 21 times more likely to be killed by police than white males. Balko (2020), who is also critical of whites who bring up cases of white individuals being killed by the police to show that the police force is not racist to blacks specifically, remarks that “Cops may shoot and kill twice as many white people as black, but there are about six times as many white people as black people in the United States. Proportionally, black people are much more likely to be shot and killed by cops.” DeGue, Fowler, and Calkins (2016) looked at all fatalities (n=812) from the National Violent Death Reporting System, 2009-2012. Although the victims were majority white, they were disproportionately black. Blacks were 2.8 times more likely to have had fatal force used against them than whites. Making use of data from The Washington Post, Beer (2020) and Lowery (2016) remarked that although more whites are killed by police, this is because there are more white people in the United States. Once you adjust for population differences, blacks are overrepresented among police killings, with blacks being 2.5x more likely to be killed by police. 
To show that blacks are overrepresented in police shootings, researchers control for population differences since there are more whites in America than blacks. We should not expect racial distributions to align for most stuff. People of different races will engage in different risks that could increase their chances of coming into contact with the police. For example, it’s well known that blacks commit more crimes than whites, according to a systematic review from Beaver, Ellis, and Wright (2009). Since blacks commit more crimes than whites, blacks would be more likely to come into contact with the police. As Johnson et al. (2019) remarked, However, using population as a benchmark makes the strong assumption that White and Black civilians have equal exposure to situations that result in FOIS. If there are racial differences in exposure to these situations, calculations of racial disparity based on population benchmarks will be misleading. In essence, benchmarking approaches test whether members from certain racial groups are shot more than we would expect relative to some benchmark. The issue is that conclusions regarding racial disparities depend more on the benchmark used (population or violent crime) than the data (the number of people fatally shot). Because of this issue, using population as a benchmark to measure supposed racial bias is not the right way to go about this. A better benchmark to use would be criminality since people who commit crimes are more likely to come in contact with the police.

This argument doesn’t work because using an encounter denominator has been proven to be invalid due to collider bias, see: https://fivethirtyeight.com/features/why-statistics-dont-capture-the-full-extent-of-the-systemic-bias-in-policing/
One important caveat is that when conditioning on race-specific crime rates based on historical police data, researchers risk inadvertently introducing past police bias into their analyses.

Looking at data from the Federal Bureau of Investigation’s (FBI’s) Summary Report System (SRS), the FBI’s National Incident-Based Reporting System (NIBRS), the Bureau of Justice Statistics National Crime Victimization Survey (NCVS), and the Centers for Disease Control’s (CDC) WONDER database, Cesario et al. (2018) found that once the crime was adjusted for, blacks were not more likely to be killed by police than whites were. Whether it be by fatal shootings, killed while unarmed and not aggressing (i.e. not being violent towards the officer), and killed while holding/ reaching for an object, there was no significant anti-black bias in police deaths. There was a significant anti-white bias.

Beyond the troubling methods of this paper (no account for regional differences in police-caused deaths, cherry-picked types of crime, no account for differential contact, use of death figures known to be undercounted, conflation of Latinx and Anglo, etc.), Cesario et al. failed to explicitly describe the underlying causal model and to produce estimates of disparities based on such a causal model, and simple comparisons against violent crime rates can, under reasonable conditions, mask anti-Black disparity. Their method doesn’t yield an unbiased estimate of either the ratio of the probabilities of police killing Black vs. White armed criminals or the ratio of the probabilities of police killing Black vs. White unarmed noncriminals. Cesario didn’t reduce the ratio of the probabilities of being killed by police for Black relative to White individuals over both causal paths. The ratio of people that aren’t individuals in each racial subpopulation acquiring weapons and engaging in violent criminal behavior over the black and white ones that are, respectively, are also no longer convex combinations of the killing probability parameters, making interpretation difficult. Hence, it can yield an unbiased estimate of the ratio of the probability of killing Black armed criminals to the probability of killing White armed criminals only in unrealistic edge cases in which police never kill unarmed individuals of either race/ethnic group (that is, the probability of police killing a Black unarmed noncriminal and the probability of police killing a White unarmed noncriminal both equal 0) and/or when the population is composed purely of criminals (i.e., the probability of Black individuals acquiring weapons and engaging in violent criminal behavior and the corresponding probability for White individuals both equal 1). Ross et al. (2020) said, “The validity of the Cesario et al. 
(2019) benchmarking methodology depends on the strong assumption that police never kill innocent, unarmed people of either race/ethnic group. While it is true that deadly force is primarily used against armed criminals who pose a threat to police and innocent bystanders (e.g., Binder & Fridell, 1984; Binder & Scharf, 1980; Nix et al., 2017; Ross, 2015; Selby et al., 2016; White, 2006), it is also the case that unarmed individuals are killed by police at rates that reflect racial disparities. Ross (2015) and Charbonneau et al. (2017), for example, show that conditional on being shot by police, a White suspect is more likely to be armed than is a Black suspect. Even unarmed noncriminals face the risk of being killed by police, and so, the relative population sizes of noncriminals cannot simply be ignored when assessing racial disparities in killings by police.” When you actually multiply the ratio of noncriminality to the white vs black ratio of people that aren’t individuals in each racial subpopulation acquiring weapons and engaging in violent behavior and then use violent crime as a benchmark, the natural logs of the posterior distributions of the relative probability (for Black individuals relative to White individuals) of being killed by police are all greater than 0 (i.e., anti-Black disparity). Cesario et al., however, don’t do this and just apply the same benchmark of white individuals acquiring weapons and engaging in violent behavior vs black individuals who do the same to all of their data sets, even those consisting of police shootings of unarmed, nonaggressing civilians. This leads to incorrect estimates of the quantities they claim to identify. Using the National Crime Victimization Survey (NCVS) violent crime data from 2016 to define the armed violent criminal parameters in Ross et al.’s equation 14c, the bias introduced by the Cesario et al. 
(2019) methodology would result in multiplying the true anti-Black racial disparity by a scalar of approximately 0.38. This means that if crime rate differences in the theoretical model were as we find empirically, then even if we set the killing probability parameters in the causal model such that unarmed, noncriminal, Black individuals were 2.6 times more likely to be killed by police than unarmed, noncriminal, White individuals, the Cesario et al. methodology would suggest no racial disparities! Another problem with Cesario et al. is that the study’s conclusion rests entirely on the assumption that violent crime statistics are a reasonable estimate for the frequency of encounters with police that may result in the fatal use of force. What Cesario and Johnson are not telling is that there are much better statistics to estimate how frequently civilians encounter police. I don’t know why Cesario and Johnson did not use this information or share it with their readers. I only know that they are aware that this information exists because they cite an article that made use of this information in their PNAS article (Tregle, Nix, Alpert, 2019). Although Tregle et al. (2019) use exactly the same benchmarking approach as Cesario and Johnson, the results are not mentioned in the SPPS article. The Bureau of Justice Statistics has collected data from over 100,000 US citizens about encounters with police. The Police-Public Contact Survey has been conducted in 2002, 2005, 2008, 2011, and 2015. Tregle et al. (2019) used the freely available data to create three benchmarks for fatal police shootings. First, they estimated that there are 2.5 million police-initiated contacts a year with Black civilians and 16.6 million police initiated contacts a year with White civilians. This is a ratio of 1:6.5, which is slightly bigger than the ratio for Black and White citizens (39.9 million vs. 232.9 million), 1:5.8. 
Thus, there is no evidence that Black civilians have disproportionately more encounters with police than White civilians. Using either one of these benchmarks, still suggests that Black civilians are more likely to be shot than White civilians by a ratio of 3:1. One reason for the proportionally higher rate of police encounters for White civilians is that they drive more than Blacks, which leads to more traffic stops for Whites. Here the ratio is 2.0 million to 14.0 million or 1:7. The picture changes for street stops, with a ratio of 0.5 million to 2.6 million, 1:4.9. But even this ratio still implies that Black civilians are at a greater risk to be fatally shot during a street stop with an odds-ratio of 2.55:1. It is telling that Cesario and Johnson are aware of an article that came to opposite conclusions based on a different approach to estimate police encounters and do not mention this finding in their article. Apparently it was more convenient to ignore this inconsistent evidence to tell their readers that data consistently show no anti-Black bias. Cesario and Johnson are likely to argue that it is wrong to use police encounters as a benchmark and that violent crime statistics are more appropriate because police officers mostly use force in encounters with violent criminals. However, this is simply an assumption that is not supported by evidence. For example, it is questionable to use homicide statistics because homicide arrests account for a small portion of incidences of fatal use of force. A more reasonable benchmark are incidences of non-fatal use of force. The PPCS data make it possible to do so because respondents also report about the nature of the contact with police, including the use of force. It is not even necessary to download and analyze the data because Hyland et al. (2015) already reported on racial disparities in incidences that involved threats or non-fatal use of force (see Table 2, Table 1 in Hyland et al. (2015). 
The crucial statistic is that there are 159,100 encounters with Black civilians and 445,500 encounters with White civilians that involve threat or use of force; a ratio of 1: 2.8. Using non-fatal encounters as a benchmark for fatal encounters still results in a greater probability of a Black civilian to be killed than a White civilian, although the ratio is now down to a more reasonable ratio of 1.4:1. It is not clear why Cesario and Johnson did not make use of a survey that was designed to measure police encounters when they are trying to estimate racial disparities in police encounters. What is clear is that these data exist and that they lead to a dramatically different conclusion than the surprising results in Cesario and Johnson’s analyses that rely on violent crime statistics to estimate police encounters. Black civilians are not considerably more likely to have contact with police than White civilians. Thus, it is simply wrong to claim that different rates of contact with police explain racial disparities in fatal use of force. There is also no evidence that Black civilians are disproportionately more likely to be stopped by police by driving. Although the caveat here is that Whites might drive more and that there could be a racial bias in traffic stops after taking the amount of driving into account, the data do show that the racial disparity in fatal use of force cannot be attributed to more traffic stops of Black drivers. Even the ratio of street stops is not notably different from the population ratios. The picture changes when threats and use of force are added to the picture. Black civilians are 2.5 times more likely to have an encounter that involves threats and use of force than White civilians (3.5% vs. 1.4%, in Table 2; Table 1 from Hyland et al., 2015). Also, Cesario’s results aren’t even reliable. 
If you look at the confidence intervals in figures 2 and 3, you’ll see that they are huge and that they overlap, meaning his results are neither conclusive nor statistically significant. I think his paper is subject to type I errors too. Figure 4 covers 144 different tests, and the number of statistically significant results increases as the number of observables increases. So it’s possible that those results are just due to p-hacking.
Cesario’s paper faced lots of scrutiny.
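The benchmark arithmetic in the response above is easy to check. What follows is a minimal Python sketch that recomputes the White-to-Black ratios from the figures quoted in this section (population, police-initiated contacts, traffic stops, and encounters involving threat or use of force). The dictionary and function names are my own, purely for illustration. Because the quoted inputs are already rounded, a recomputed ratio can differ slightly from the one quoted (the contact figures, for instance, give roughly 6.6 rather than 6.5). The last lines check the claim that a true 2.6-fold disparity multiplied by a scalar of about 0.38 comes out close to 1, i.e. apparent parity.

```python
# Figures as quoted in the text above, in millions per year.
# Each entry is (Black, White). Names here are illustrative only.
BENCHMARKS = {
    "population": (39.9, 232.9),
    "police-initiated contacts": (2.5, 16.6),
    "traffic stops": (2.0, 14.0),
    "threat or use of force": (0.1591, 0.4455),
}


def white_to_black_ratio(black, white):
    """How many White units there are per Black unit for a benchmark."""
    return white / black


for name, (black, white) in BENCHMARKS.items():
    print(f"{name}: 1:{white_to_black_ratio(black, white):.1f}")

# The scalar of about 0.38 and the 2.6-fold disparity are the values
# quoted in the text; their product is the disparity the benchmarking
# method would report.
BIAS_SCALAR = 0.38
TRUE_DISPARITY = 2.6
apparent_disparity = BIAS_SCALAR * TRUE_DISPARITY
print(f"apparent disparity: {apparent_disparity:.2f}")
```

The point of the comparison is that the contact and traffic-stop ratios sit close to the population ratio, while only the use-of-force ratio is markedly smaller.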

Johnson et al. (2019), which looked at fatal shootings but not deaths, looked at data from The Washington Post and The Guardian and then controlled for racial differences in criminality. Once crime, civilian, officer, and county characteristics were held constant, blacks were less likely to be fatally shot than whites were (OR=0.15).

This paper had to be retracted: https://www.pnas.org/doi/pdf/10.1073/pnas.2014148117
and corrected: https://www.pnas.org/doi/10.1073/pnas.2004734117
The study claims its approach “sidesteps the benchmark debate”---the problem of picking a baseline to use to evaluate shooting rates across racial groups. We show this is not true. The study implicitly and wrongly assumes black/white civilians encounter police in equal numbers. 2/N
Without this unjustifiable assumption, the results in this study are entirely inconclusive. The data does not rule out severe anti-black bias, severe anti-white bias, or no bias at all. Analyses of the role of officer race suffer from the same problem. 3/N
The study claims “racial disparities” in its analysis are “a necessary but not sufficient requirement for the existence of racial biases”—that if no anti-black disparity is found, no anti-black bias exists. 4/N
But to demonstrate racial bias, analysts must show that Pr(shot|civilian race, X) differs by race. The study analyzes Pr(civilian race|shot, X), but makes strong claims about Pr(shot|civilian race, X). 6/N
To see why Pr(civilian race|shot, X) is the wrong quantity, imagine police encounter 100 civilians---10 black and 90 white---in identical circumstances. Due to anti-black bias, they shoot five black civilians (50%), and nine white civilians (10%)... 7/N
Under this hypothetical, the study’s approach would show a much higher chance the victim is white conditional on being shot (9/14 = .64) than black (5/14 = .36), and erroneously conclude no anti-black bias. 8/N
The study invokes the same fallacy when analyzing officer characteristics. Table 2 shows the relationship between Pr(civilian black|shot, officer race, X) and Pr(civilian white|shot, officer race, X) is not significantly different between white and black officers.... 9/N
From this, the study concludes: “white officers are not more likely to shoot minority civilians than non-white officers.” Again, this inference only follows under the strong, unstated assumption that black and white officers encounter black civilians in equal numbers. 10/N
But consider another hypothetical. Suppose black officers encounter 90 black civilians and 10 white, while white officers encounter the reverse. Among these, black and white officers both shoot five black civilians and nine white… 11/N
Clearly, black and white officers in this hypothetical example exhibit very different biases. Examining fatal shootings alone, these biases are entirely concealed. 12/N
Some have told us our argument comes down to a matter of preference. A simple application of Bayes' Rule shows it is a matter of logic. We cannot infer anything about Pr(shot|civilian race, X) by estimating Pr(civilian race|shot, X) without unjustifiable assumptions. 13/N
To be clear, even if the goal is merely to describe the rates of shootings by white/black officers of white/black civilians (i.e., make no claims about racial bias as the cause for observed disparities), the approach in this study is uninformative. For more see: https://www.pnas.org/doi/10.1073/pnas.1919418117
Since the study only analyzed fatal encounters, it couldn't possibly recover shooting rates. *Every observation* in the data was a shooting. No variation in outcomes. Instead, it estimated the probability that a fatally shot person was black, white, etc. Using this approach the study concluded no racial bias simply because more fatally shot civilians were white (which we already knew from @washingtonpost). This is not a test of racial bias. For ex., there could be more fatally shot whites just bc they are the majority group. By analyzing fatal shootings only, and ignoring the vast majority of other encounters which did not escalate to that level, officers who pulled the trigger in 5 out of 5 encounters, or in 5 out of 1000 encounters, would appear identical. More formally, the study made claims about P(shot|race), but estimated P(race|shot). Those are *not* the same. That’s not a statement of preference. It’s a mathematical fact, Bayes’ rule, that we’ve known since the 1700s. Bayes rules shows w/out knowing # of encounters w/ white/nonwhite officers, can't know which group more likely to shoot. If we *could* compute shooting rates we'd then ask whether encounters being compared were otherwise similar on relevant traits (i.e. no omitted variables). In this case, no need to debate omitted variables. Original study estimated fundamentally incorrect quantity. And to maintain officer anonymity, study's posted data don't contain info on officer race, and there is no posted code. So the analyses cannot be verified/replicated.
Johnson and Cesario have stated their analysis is still informative because it controls for county crime rates by race and other shooting attributes. But as Bayes’ rule shows, the addition of these control variables, X, does not solve the fundamental conceptual problem.
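The two hypotheticals from the thread above can be checked directly. Here is a minimal Python sketch; the function name and data layout are my own, and the counts are the made-up numbers from the hypotheticals, not real data. It shows that Pr(civilian race | shot) can be identical across scenarios in which Pr(shot | civilian race) differs wildly.

```python
from fractions import Fraction


def shooting_probabilities(encounters, shootings):
    """Return (Pr(shot | race), Pr(race | shot)) as dicts keyed by race."""
    total_shot = sum(shootings.values())
    p_shot_given_race = {
        race: Fraction(shootings[race], encounters[race]) for race in encounters
    }
    p_race_given_shot = {
        race: Fraction(shootings[race], total_shot) for race in encounters
    }
    return p_shot_given_race, p_race_given_shot


# Both hypotheticals use the same shooting counts.
shootings = {"black": 5, "white": 9}

# First hypothetical: police encounter 10 black and 90 white civilians.
rate, share = shooting_probabilities({"black": 10, "white": 90}, shootings)
print("Pr(shot | race):", {r: float(p) for r, p in rate.items()})
print("Pr(race | shot):", {r: round(float(p), 2) for r, p in share.items()})

# Second hypothetical: black officers encounter 90 black and 10 white
# civilians, while white officers encounter the reverse.
black_officers = shooting_probabilities({"black": 90, "white": 10}, shootings)
white_officers = shooting_probabilities({"black": 10, "white": 90}, shootings)

# The composition of those shot is the same for both groups of officers...
print(black_officers[1] == white_officers[1])
# ...while the shooting rates by civilian race are very different.
print({r: float(p) for r, p in black_officers[0].items()})
print({r: float(p) for r, p in white_officers[0].items()})
```

Without the encounter counts in the denominator, the two groups of officers in the second hypothetical are indistinguishable, which is exactly the objection raised in the thread.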
In an interview, one author claimed the results show "no support for the idea that white officers are biased in shooting black citizens.” They check for collinearity of their predictors (responsible) and find that demographics correlate with race specific crime rates (no surprise). So they pick just race specific crime rates for their models, and don't use demographics since they are correlated. Then, they conclude the best predictor of the race of a victim is race-specific crime rates. They then *actually* recommend reducing crime as the best way to not get shot by cops. TOTALLY forgetting that their crime rate variable is interchangeable with racial demographics.
Also see: https://www.pnas.org/doi/10.1073/pnas.1917915117

Similarly, Johnson et al. (2020) responded to criticisms by others over the issue of population size, and found that controlling for crime and population still showed no anti-black bias in police shootings.

That defense concerns their central arguments about how to benchmark, which is an important methodological problem to solve. In their opinion, researchers should benchmark according to crime rates. Furthermore, they argue that "[r]esearch on real-world policing behavior indicates fatal shootings are strongly tied to situations where violent crime is committed." They then refer to another paper they published:
We have tackled this issue in the past. Rather than try to identify one single benchmark for exposure to police in violent crime situations, we came up with 14 different proxies for exposure, some of which were generated from police data and some independently (Cesario et al., 2019).
At this point, I believe it is important to expand the context and provide more information.
Johnson, Cesario and colleagues published two similar papers in 2019:
• ⁠Is There Evidence of Racial Disparity in Police Use of Deadly Force? Analyses of Officer-Involved Fatal Shootings in 2015–2016 published in June 2019 by SPPS
• ⁠Officer characteristics and racial disparities in fatal officer-involved shootings published in August 2019 by PNAS
The latter is the more well-known paper. It has received two formal critiques published by PNAS, followed by a formal reply and a correction which did not satisfy critics:
• Making inferences about racial disparities in police violence by political scientists Knox and Mummolo (the latter summarized their points for a general audience on Twitter)
• ⁠Young unarmed nonsuicidal male victims of fatal use of force are 13 times more likely to be Black than White by psychologists and methodologists Schimmack and Carlsson
• ⁠Reply to Knox and Mummolo and Schimmack and Carlsson: Controlling for crime and population rates by Johnson and Cesario
• ⁠A study finding no evidence of racial bias in police shootings earns a correction that critics call an “opaque half measure”, a Retraction Watch news feature with comments by Knox and Mummolo.
The earlier paper has recently received a critique and reassessment by quantitative anthropologist Ross and colleagues, published by SPPS:
• ⁠Racial Disparities in Police Use of Deadly Force Against Unarmed Individuals Persist After Appropriately Benchmarking Shooting Data on Violent Crime Rates
As Knox and Mummolo point out in their formal letter (they make reference in this part to the fact that Johnson et al. control for homicide rates):
Johnson et al.’s (1) analysis cannot recover these shooting rates because all observations in the data involve shootings. Instead, it estimates “whether a person fatally shot was more likely to be Black (or Hispanic) than White” (ref. 1, p. 15880), which does not correspond to the stated assertions. In a preprint response to our concerns, Johnson and Cesario (2) acknowledge the gap between the claim and the quantity estimated. Yet despite this, Johnson et al.’s (1) original paper infers no “evidence of anti-Black or anti-Hispanic disparity…and, if anything, found anti-White disparities” (ref. 1, p. 15880) simply because more fatally shot civilians are White.
As far as Knox and Mummolo are concerned, Johnson and Cesario have failed to properly address this issue. Per their statement to Retraction Watch:
But when properly understood, the test that was conducted in the original article sheds no light on racial bias or the efficacy of diversity initiatives in policing, and a meaningful correction would acknowledge this. Because every observation in the study’s data involved the use of lethal force, the study cannot possibly reveal whether white and nonwhite officers are differentially likely to shoot minority civilians. And as we show formally in our published comment, what the study can show—the number of racial minorities killed by white and nonwhite officers—is simply not sufficient to support claims about differential officer behavior without knowing how many times officers encountered racial minorities to begin with.
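Knox and Mummolo's point can be illustrated with a toy calculation. The sketch below uses entirely made-up encounter and shooting counts (none of them come from the cited studies) to show that two hypothetical worlds can produce the same racial share among people shot while having very different per-encounter shooting rates.

```python
# Hypothetical counts only; none of these numbers come from the cited papers.

def share_black_among_shot(shot_black, shot_white):
    """What a shootings-only dataset can estimate: Pr(Black | shot)."""
    return shot_black / (shot_black + shot_white)

def rate_ratio(shot_black, enc_black, shot_white, enc_white):
    """What the bias question needs: Pr(shot | Black) / Pr(shot | White),
    which requires the number of encounters for each group."""
    return (shot_black / enc_black) / (shot_white / enc_white)

# Both worlds record the same shootings: 30 Black and 70 White civilians.
shot_black, shot_white = 30, 70

# World A: encounters are proportional to shootings (no disparity in rates).
world_a = {"enc_black": 3_000, "enc_white": 7_000}
# World B: far fewer encounters with Black civilians (large disparity in rates).
world_b = {"enc_black": 1_000, "enc_white": 7_000}

share = share_black_among_shot(shot_black, shot_white)
ratio_a = rate_ratio(shot_black, world_a["enc_black"], shot_white, world_a["enc_white"])
ratio_b = rate_ratio(shot_black, world_b["enc_black"], shot_white, world_b["enc_white"])

print(f"Share Black among shot (identical in both worlds): {share:.2f}")
print(f"Per-encounter rate ratio, world A: {ratio_a:.2f}")
print(f"Per-encounter rate ratio, world B: {ratio_b:.2f}")
```

Because the shootings-only share is the same in both worlds, it cannot distinguish between them; only the encounter counts, which such a dataset lacks, separate the two.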
Ross et al.'s critique tackles the previous paper by Cesario et al. on top of which rests Johnson et al.'s analysis. Their problematization provides further insight for your query:
Formal theoretical analysis of the benchmarking methodology advanced by Cesario et al. (2019), however, has yet to be done. Cesario et al. argue that “benchmarking” the race-specific counts of killings by police on relative crime counts, rather than relative population sizes, generates a measure of racial disparity in the use of lethal force by police that is not statistically biased by differential crime rates. In their words, “if different groups are more or less likely to occupy those situations in which police might use deadly force, then a more appropriate benchmark as a means of testing for bias in officer decision making is the number of citizens within each race who occupy those situations during which police are likely to use deadly force” (p. 587). In other words, they aim to produce estimates of killing rates by police unique to the interaction of suspect race/ethnicity and criminal status and test for evidence of racial disparity holding constant the relative sizes of the criminal populations. Their publication, however, lacks any formal derivation showing that their benchmarking methodology has statistical properties consistent with their conceptual objectives.
There are important issues with the assumptions made by researchers such as Johnson, Cesario and Fryer, Jr. (another researcher who published a paper failing to find racial biases in lethal use of force, which has been strongly critiqued on methodological grounds). See Knox, Lowe and Mummolo's recent publication explaining how "Administrative Records Mask Racially Biased Policing." The problems they raise apply broadly. Ross et al.'s critique also sets out to demonstrate how Cesario et al.'s methodology masks biased outcomes:
The validity of the Cesario et al. (2019) benchmarking methodology depends on the strong assumption that police never kill innocent, unarmed people of either race/ethnic group. While it is true that deadly force is primarily used against armed criminals who pose a threat to police and innocent bystanders (e.g., Binder & Fridell, 1984; Binder & Scharf, 1980; Nix et al., 2017; Ross, 2015; Selby et al., 2016; White, 2006), it is also the case that unarmed individuals are killed by police at rates that reflect racial disparities.
According to their assessment, "their benchmarking methodology does not remove the bias introduced by crime rate differences but rather creates potentially stronger statistical biases that mask true racial disparities, especially in the killing of unarmed noncriminals by police."

Shjarback and Nix (2019) made use of data pertaining to violence against police officers as a benchmark and looked at police shootings by race on the national level. At the national level, blacks were less likely to be shot by police than whites were (OR<1.0).

So, let’s ignore the fact that it says “African-Americans were more likely than whites to be shot at by California police”. In fact, the reason they supposedly find little evidence at the national level and in Texas is that the databases they used weren’t as inclusive as the one they used for California. They conclude:
“A different picture, however, emerges when examining all officer firearm discharges in California. Both our numerators and denominators in Study 3 were much more inclusive of the true nature of transactional gun violence between the police and the public. With the broader universe of deadly force incidents taken into consideration, black citizens were more likely than white citizens to be shot at when comparing the known race/ethnicity of citizens who discharged firearms against officers in California.”
Let’s also ignore that the same author, Justin Nix, had another paper that said “minority groups were significantly more likely than Whites to have not been attacking the officer(s) or other civilians and that Black civilians were more than twice as likely as White civilians to have been unarmed” and focus on the paper’s methodology. They used LEOKA to benchmark homicides and violence against police officers. However, using LEOKA has several problems, as the FBI points out:
1. “The data in the tables and charts reflect the number of victim officers, not the number of incidents or weapons used.”
2. “The UCR Program considers any parts of the body that can be used as weapons (such as hands, fists, or feet) to be personal weapons and designates them as such in its data.”
3. “Law enforcement agencies use a different methodology for collecting and reporting data about officers who were killed than the methodology used for those who were assaulted. As a result, information about officers killed and information about officers assaulted reside in two separate databases, and the data are not comparable.”
4. “Because the information in the tables of this publication is updated each year, the FBI cautions readers against making comparisons between the data in this publication and those in prior editions.”
https://ucr.fbi.gov/leoka/2019/resource-pages/about-leoka
I’d love to see some sort of concordance rate between the “attacking” of officers and getting fatally shot, because they could be using a totally unrelated benchmark. In fact, they concede that their methodology may not be accurate: “A focus on aggregate totals of transactional violence between police and different racial/ethnic groups fails to capture fully the totality of the circumstances involved in these transactions.”
FB then uses a tangential yardstick from the Shjarback & Nix study, that of people firing on police officers. In addition to being merely a comparative metric (since those events would qualify as justifiable self-defense), it still shows bias, because that study found that blacks were less likely to be armed. See: https://spssi.onlinelibrary.wiley.com/doi/abs/10.1111/josi.12246

Tregle, Nix, and Alpert (2018) made use of data from The Washington Post and measured population, police-citizen interaction, and arrests. When looking at violent crime and weapon offenses, blacks were less likely to be killed by police (OR<1.0).

So the study’s results showed an anti-black bias when looking at population and police interactions but, as the author points out, black people had lower odds of getting shot when looking at violent crime and weapon offenses. However, this paper benchmarks on arrests, which is a problem: the counterfactuals coded from arrest data may themselves contain bias. It is unclear how to estimate the extent of such bias or how to address it statistically. Specifically, they use race-specific arrest counts instead of racial census counts (both common approaches), and if officers have historically over-arrested minorities due to racial bias, then the use of this skewed benchmark will paint a misleading portrait by artificially inflating the “typical” level of criminal activity in this group. This method is very likely prone to underestimate racial disparities because African Americans are overrepresented in violent crime arrests but Part I violent crimes constitute only 1/24th of all arrests nationally (BJS, 2012), and previous research has found arrests for violent crimes to involve police use of force only 1.3 times as often as arrests for all other crimes (Worden, 1995). Arrest data, which provide the closest estimate of criminal activity within a population (short of direct observation), are compromised by the very nature of who makes arrests. That is, because police arrest people and our concern is with the possibility that police behave in a biased manner when applying force, there is the strong likelihood that arrest data would be biased in the same manner as use of force data. Benchmarking use of force data to arrest data likely underestimates the level of bias that may exist in police use of force. This discourages scientists from benchmarking police outcomes by arrest rates. It’s prone to false negatives and there are problems of endogeneity.
Regardless, even when controlling for the very rare occurrence of arrest for Part I violent crime within a demographic (an event that previous research suggests is only modestly more likely in and of itself to result in a use of force incident), 25%-55% of participating departments still revealed robust racial disparities that disadvantaged Black people. See: https://policingequity.org/images/pdfs-doc/CPE_SoJ_Race-Arrests-UoF_2016-07-08-1130.pdf
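The underestimation argument can be sketched numerically. Assume, purely for illustration, that two groups offend at the same true rate but that biased enforcement inflates one group's arrest count by a multiplier. All names and numbers below are hypothetical and are not drawn from any cited study.

```python
# Hypothetical illustration; the counts and the bias multiplier are assumptions.

def benchmarked_disparity(force_a, bench_a, force_b, bench_b):
    """Ratio of use-of-force rates (group A vs group B) under a given benchmark."""
    return (force_a / bench_a) / (force_b / bench_b)

# Assume both groups have the same true number of offenders...
true_offenders_a = 1_000
true_offenders_b = 1_000
# ...but force is used twice as often against group A.
force_a, force_b = 40, 20

# If biased enforcement over-arrests group A by 50%, recorded arrests diverge
# from true offending even though offending is equal.
arrest_bias = 1.5
arrests_a = true_offenders_a * arrest_bias
arrests_b = true_offenders_b

true_ratio = benchmarked_disparity(force_a, true_offenders_a, force_b, true_offenders_b)
arrest_ratio = benchmarked_disparity(force_a, arrests_a, force_b, arrests_b)

print(f"Disparity benchmarked on true offending: {true_ratio:.2f}")
print(f"Disparity benchmarked on biased arrests: {arrest_ratio:.2f}")
```

Under these assumptions the arrest benchmark shrinks the measured disparity by exactly the bias multiplier, which is the false-negative risk described above.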

Worrall et al. (2018) looked at the effect of a suspect’s race on an officers’ decision to shoot. Blacks were less likely to be shot than white suspects even after officer and incident characteristics were held constant.

The problem with this study is that it’s limited by the fact that officers may not always properly document decisions to draw their guns without shooting, as well as the possibility of racial bias in officers’ decision to draw and point their guns, which could distort the comparison group. It’s also measuring an officer’s decision to shoot which is ambiguous because they could be measuring incidents where they don’t even shoot at the suspect. They’re also comparing shootings of black suspects to shootings of “others”? This is vague and unidentifiable. It also relies on data from only one department. It’s an unnamed department which casts further ambiguity. There’s also no benchmark/denominator taken into account so essentially this study is useless if there’s no sort of per-capita result.

As can be seen, there seems to be no racial bias in police shootings and fatal police shootings that harms blacks. There are also other pieces of evidence supporting the idea that police are not racially biased against blacks, specifically studies using simulations. In simulation studies, officers are put into a situation where the race of the offender is manipulated, and the officer’s reaction time is then measured.

Why does much of the experimental research on “police decisions to shoot” sample undergraduate students, not police, and use space-bar tasks, not shooting simulators? How did any of that work get published as legitimate examination of police decisions? Why as soon as we make a study in a realistic way using real police officers and a shooting simulator, the anti black bias disappears? This is something we find quite a bit in the scientific literature on racism. Poor studies done under unrealistic conditions find an anti-black bias. But as soon as we make realistic studies of better quality, the anti black bias disappears

James, Vila, and Daratha (2013) found that cops were more likely to shoot white and Hispanic suspects than they were to shoot black suspects.

Use of their simulator in research is relatively new, the relatively small sample was limited in geographic scope, and the paradigm gave participants a fairly long time to formulate (and potentially edit) their responses to each trial. Replications are needed to gain more confidence in these results.

James, James, and Vila (2016) found that when it comes to how fast officers are at shooting at a suspect, officers were faster at shooting white suspects than they were black suspects. When looking at aggressing individuals, 14% of whites were shot while only 1% of blacks were shot. So, even when the offender is aggressing, whites are more likely to be shot than blacks are.

But the county in which the study’s officers work also disproportionately kills African Americans: they are 2% of the population but ~15% of all police shooting deaths. Some methodological problems arise from this paper as well. First, the sample size was small (n=80, nowhere near representative) yet there were a lot (1,517) of observations, so the results they found were probably only due to type I errors. Second, the study was a simulation and didn’t take actual data from police. It lacks ecological validity. That is, when you observe people, their behavior tends to change (this is called the Hawthorne effect: because the police are being watched and know they are being researched, they may alter their behavior to be consistent with how they want to be viewed). Considering that this paper's findings are inconsistent with real-world statistics, I would say that it is not representative of real cop behavior. A more accurate Harvard study found that black people are actually more than 3 times more likely to be killed during a police encounter. Other studies have shown that black people disproportionately experience police use of force. Third, the study had flaws in its controls. When performing experiments, in order to control for all factors, everything else must be equal. However, when looking at figure 1, the durations of the video scenarios differed by race (black duration = 43 s, white duration = 36 s; for figure 2, it was 42 s vs 31 s, respectively). The suspects don’t even have the same clothing on either. So we see differences in these variables, yet the authors don’t even try to control for them statistically. When it comes to the results, we see that in figure 1 the confidence intervals were larger for black suspects, which means the results are more unreliable and inconclusive. The LCL was lower than the white suspect’s HCL, indicating a lower reaction time to pull out a weapon for black suspects.
Also, we even see a data point with a negative reaction time for black suspects (-0.01), so officers pulled a gun on a black suspect before the suspect could even draw his weapon. Additionally, they’re comparing means, which doesn’t work because the distributions are uneven and different. In fact, we see that the maximum reaction time for black suspects was 9.44 s, and the standard deviation was larger for black suspects than for white suspects (1.66 vs 0.98, respectively). This means the distribution for black suspects is more unevenly spread out, therefore the results are more biased against black suspects. Honestly, I'm not sure what this study is attempting to prove. In an argument on police racism I would never make a claim that is so hard to prove with so little evidence. This study also has many limitations in how it was carried out; it exists solely in the confines of research labs. It is an empirical fact that unarmed black men are killed at a higher rate than white men per capita and by proportion. The speed at which they are killed is an irrelevant variable to me. Rousell et al. (2017) also refutes the paper, and James (2016) even corrects for it.
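One way to make the repeated-measures concern concrete is the standard design-effect approximation for clustered data, DEFF = 1 + (m - 1) × ICC, where m is the average number of observations per participant and ICC is the intraclass correlation. The sketch below plugs in the 80 participants and 1,517 observations mentioned above. The ICC values are assumptions chosen for illustration, since the within-officer correlation is not reported here.

```python
def effective_sample_size(n_obs, n_clusters, icc):
    """Kish-style approximation: observations from the same participant are
    correlated, so n_obs observations carry less information than n_obs
    independent ones would."""
    avg_cluster_size = n_obs / n_clusters
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n_obs / design_effect

n_obs, n_officers = 1517, 80  # figures quoted in the text above

for icc in (0.0, 0.1, 0.3):  # assumed values, not taken from the study
    n_eff = effective_sample_size(n_obs, n_officers, icc)
    print(f"ICC={icc:.1f}: effective sample size ~ {n_eff:.0f}")
```

With ICC = 0 the effective sample size equals the full 1,517 observations, and with ICC = 1 it collapses to the 80 participants. Whether this actually inflates false positives depends on whether the original analysis modeled the clustering, which this sketch does not assess.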

Correll et al. (2014) gives a longer discussion on experimental studies, with them finding police not to be biased against blacks. One criticism of these studies might be that we might not find these results in the real world, but evidence suggest we do. For example, Worall et al. (2020) found that black suspects are 33% less likely than white suspects to have a gun drawn on them by the Dallas police force. Thus, we should expect simulation studies to reflect the real world, especially when looking at the evidence cited prior to the simulation data.

That’s not what the table shows or what the study says. It actually says “the key takeaway from Table 2 is that black suspects were less likely to have weapons drawn against them relative to those in other racial/ethnic groups.” The appendix actually shows black suspects have a 14% increased odds of getting a gun drawn on them compared to white suspects.

Fagan et al. 20
A data analysis on 3933 killings to examine the intersection of race and reasonableness in police killings
They find that, across several circumstances of police killings and their objective reasonableness, Black suspects are more than twice as likely to be killed by police than are persons of other racial or ethnic groups; even when there are no other obvious circumstances during the encounter that would make the use of deadly force reasonable
They suggest that the addition of training components that specifically address the role of race in officers’ perceptions of risk and their decision-making in potentially dangerous interactions with citizens may remediate both the incidence of police shootings and their apparent racial and ethnic disparity.
Fagan and Campbell (2020), who used data from The Washington Post’s police killings database, found that once the population was adjusted for, blacks were more likely to be killed by police than whites were. There are strong reasons to doubt the validity of this study. First, only population and not crime was adjusted for.

This is a complete lie on the part of FB.
As can be seen by Fagan’s regression model they include “Si” which they say is “the violent-crime victimization rate”. They then confirm that “neither the violent-crime rate in a county nor the population distribution for any racial or ethnic group is a significant predictor of police killings.”

As has been shown above, population is not a proper benchmark because it assumes people of all races have a similar risk rate of coming into contact with police. Johnson et al. and Cesario et al. also found that once crime and population were adjusted for, blacks were less likely to be shot and killed by police. Furthermore, The Washington Post’s database has many flaws. For example, Washington Post’s “unarmed” category is misleading because people included in this database often lost their weapons while in the altercation, and thus were counted as “unarmed” even though they originally were armed (Mac Donald 2016).

I’ve read and reread this article about FB saying “the Washington Post counts those initially armed but later unarmed” but I can’t find anything in the article that even mentions that.

These were also the findings in other studies looking at the unarmed category. In an investigation into The Washington Post’s database from over a year ( January 2, 2015, to December 29, 2016 ), of the 153 cases they looked into, 112 of them actually had a weapon but lost it during an altercation (Shane, Lawton, and Swenson 2017).

This paper also doesn’t mention what FB specifically said. It does criticize the Washington Post; however, its critiques are purely suppositional and not supported by evidence.

When analyzing the Washington Post fatal police shooting dataset, Wang and Fan (2021) found no significant evidence to conclude that racial discrimination occurred during fatal police shootings.

Wang and Fan used “hotspots”, so how you define “hotspot” matters. Important factors determine who is oppressed by the state, including: who’s in the hotspots, the size of the police force, average detainment rates in the hotspots by race, repeat offenders, average police misconduct rate, etc. The methodology is mid at best. Figure 3 looks at a 3-year period vs 5 years (which is more historical, more data) without any regard for measurement error/convergence errors. In fact, the BJS report they cite to claim that there’s no disparity in police shootings relative to violent incidents even cautions against comparing groups like that: “caution must be used when comparing one estimate to another or when comparing estimates over time. Although one estimate may be larger than another, estimates based on a sample have some degree of sampling error.” Even then, the violent incident estimates suffer from errors not adjusted for, as the BJS points out: “In an effort to improve the quality and accuracy of NCVS estimates, BJS used direct-variance estimation instead of GVFs for tables 1 and 2” (but not for table 13, which covers violent incidents). Looking at the figure itself, fatal police shootings are still higher than violent incidents for the black category (26.4% vs 24.1%).
This is also based on police shootings (just like the rest of the studies the author referenced), and judging it just based on police shootings seems stupid given that many high-profile cases haven’t involved a shooting, e.g. George Floyd. And this isn’t even going into stuff like arrests or stops etc.

Furthermore, Maguire (2020) reports that among those who had force used against them, African American suspects were significantly less likely than white suspects to be injured. The risk of injury for other racial and ethnic groups is about the same as the risk for white suspects.

If FB had actually read the study, he would have seen that it says “This study cannot be used as evidence in favour of or against the idea of racial disparity in other police-related outcomes, including whether minorities are stopped, searched, arrested or subjected to force at greater rates than non-minorities.” So the evidence FB is using literally cautions against using it to substantiate his argument. The study also noted other limitations (other than the fact it’s only from one department. In fact they say “this is a single-site study using data from a police department in one US city. The findings cannot be generalised to other cities”): “The data used in this study are based on administrative data collected and made available to the public by a police department. It is difficult to gauge the quality and integrity of the data set without knowing more details about how it was produced…the data used here are missing key information that would be useful to have. Additional information about the suspects, the officers and the encounters would allow for more nuanced analyses…It would also be useful to know more about the nature and severity of the suspects’ injuries rather than simply whether the suspect was injured or not.” There are more limitations in their methodology, such as the fact that they excluded 75 use of force cases. Their results aren’t conclusive because they haven’t provided confidence intervals. Again, they haven’t benchmarked on population, so these aren’t per-capita estimates, which makes this study useless.
There may have been certain suspects who were more prone to injury than others due to factors such as age or physical condition, and if these suspects were overrepresented in one racial group, it affects the results of the study.
Additionally, they use logistic regression. Logistic regression assumes that the relationship between the independent variables (such as race) and the log-odds of the dependent variable (such as injury status) is linear and additive. If there are non-linear or interactive effects at play, then logistic regression may not be the most appropriate method for analyzing the data. Logistic regression also assumes that the observations are independent of each other, which may not be true if there are multiple encounters involving the same suspects or officers. In these cases, more complex statistical methods may be needed to account for the clustering of observations within individuals or groups.

*One study worth mentioning is Ross (2020) which Rose cited which has been discussed in Last (2020) and Cesario (2021).

Cesario 2021 doesn’t really make an argument. In fact, Ross and Cesario more or less agree. Cesario just says that encounter rates can differ by race among unarmed noncriminals. However, using this encounter denominator invites collider bias. If there’s bias in who gets stopped in the first place, then looking at discrepancies in the resulting interactions won’t give you the full picture.

In conclusion, there doesn’t seem to be strong evidence suggesting the existence of racial bias in police shootings against blacks, but the evidence is stronger for an anti-white bias. Whether it be looking at real data or simulation studies, there is no racial bias against blacks in police shootings and police shooting deaths.

Employment

This section will solely focus on employment bias through resume data.
Bertrand 04
“To manipulate perceived race, resumes are randomly assigned African-American- or White-sounding names. White names receive 50 percent more callbacks for interviews. Callbacks are also more responsive to resume quality for White names than for African-American ones”
“The racial gap is uniform across occupation, industry, and employer size”
“We also find little evidence that employers are inferring social class from the names”
Pager et al. 09
“Applicants were given equivalent résumés and sent to apply in tandem for hundreds of entry-level jobs”
“Our results show that black applicants were half as likely as equally qualified whites to receive a callback or job offer”
“In fact, black and Latino applicants with clean backgrounds fared no better than white applicants just released from prison”
Quillian et al. 17
Meta-analysis of “every available field experiment of hiring discrimination against African Americans or Latinos” – adding up to 55,842 applications submitted for 26,326 positions
Found that since 1989, there has been no change in hiring discrimination against blacks, though hiring discrimination against Latinos has decreased over that time
The quick kill to this, which makes this section the shortest, is that there is publication bias in the field.
Zigerell (2017)
There is a lot of publication bias in the field, making the issue of resume bias hard to take seriously.

Yet FB uses a blog post, and we should take that seriously? Anyway, this blog uses a funnel plot to argue there is publication bias, but high-precision studies can differ from low-precision studies with respect to effect size (e.g., due to different populations examined), so a funnel plot can give a wrong impression of publication bias (Joseph Lau, John P. A. Ioannidis, Norma Terrin, Christopher H. Schmid & Ingram Olkin, 2006). The appearance of the funnel plot can change quite dramatically depending on the scale on the y-axis, whether it is the inverse square error or the trial size (Jin-Ling Tang; Joseph LY Liu, 2000). Researchers also have a poor ability to visually discern publication bias from funnel plots (Terrin, N.; Schmid, C. H.; Lau, J., 2005). Additionally, Zigerell conceded that correspondence studies show no publication bias but in-person audits do. Zigerell tried to prove this yet again with another funnel plot; however, it was only marginally significant (p = .04), meaning it’s probably a type I error.

[1] It doesn't help that FB loves to cite books, and when it does cite studies it vaguely refers to them as "O'Brien, 2001" without any additional information, which leads me down a rabbit hole where I maybe think this might be the paper they're referring to (which makes no mentions of black, African American, or any relevant mentions of race or racial differences from the search terms I used). From what I can tell, O’Brien is sourced internally which is quite annoying.

Alternatively, it will provide a link, only to link to an irrelevant study (as was the case with the Becky Tatum quote as far as I can tell). I think this is because the author of the article is basing some part of their knowledge on a 2004 book titled "Race and Crime: A Biosocial Analysis". I have found that FB tends to cite books and studies which use arguments similar to those found in "Race and Crime: A Biosocial Analysis" (with quotations for exactness): "The overrepresentation of African Americans in crime and delinquency is reaffirmed by victimization". One of those is this book. Random side note time: for some reason, when they quoted Becky Tatum, they removed the part where Tatum allegedly said "and self-report" and replaced it with ellipses. Assuming they got that quote from this book, maybe they thought that it sounded less convincing if they mentioned "self-report" data in their quote? I have no idea to be honest. It just seemed like a really odd omission.

[2] https://www.jstor.org/stable/23366871 Maybe someone will find this useful as well. I mainly read pages 32-34 (they also say that there is some bulk of disproportionate black arrests that is left to differential enforcement; no specific numbers are cited, but I guess the studies they cite would provide those numbers ... hopefully).

[3] Id

[4] Other models have presented an incident-level logistic regression analysis of co-offending incidents. The results are similar, indicating that black co-offenders are significantly less likely than white co-offenders to be arrested (OR = .578), such that the odds of arrest for black co-offenders is roughly 42% less than the odds for white co-offenders. However, there is significant potential for omitted variable bias in the traditional regression approach. Using the same sample of co-offenders, other models have presented the results from the quasi-experimental analysis of within-partnership differences in arrest.
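The conversion from an odds ratio to a percentage change in odds used in this footnote is simple arithmetic, sketched here. The function name is ours, not from the cited work.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in odds implied by an odds ratio (OR = 1 means no change)."""
    return (odds_ratio - 1.0) * 100.0

# OR = .578 from the footnote above: roughly a 42% reduction in the odds.
print(f"{percent_change_in_odds(0.578):.1f}%")
```

Note that this is a change in odds, not in probability; the two are only approximately equal when the outcome is rare.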

[5] While a relatively small percentage of NIBRS incidents have multiple victims and/or offenders, the impact of excluding those cases is unknown.

[6] I couldn’t find a full version of this book. It was also kind of vague. "For more"? Like, what exactly does that mean, and where exactly do I find it in the book, since the book seems to cover unrelated topics?

[7] Familial risk factors can themselves reflect socioeconomic factors so it would potentially be spurious to disregard the effect of poverty/neighborhood disadvantage on the basis of factors that could themselves be a result of the latter. It’s multicollinear and there should be a warrant for the variance-covariance matrix of the estimated coefficients. The setup of these studies does not seem to justify causal conclusions.

[8] For some reason FB is using Swedish/European crime trends to draw conclusions about American crime. Sweden is dissimilar to the US because of the strength of its welfare state. In fact, Sariaslan et al. (2013) literally states this:

as previously argued by Brännström,10 the relatively modest neighbourhood differences found in Sweden could be due to the country’s comprehensive welfare-state programs, which aim to actively diminish social inequalities

Sariaslan et al. (2018) noted, among other limitations:

it could be that Sweden’s comprehensive welfare state actually mitigates the possible adverse effects of growing up with limited material resources. Aaltonen M, Kivivuori J, Martikainen P. (2011)

Studies found that not only did a very small portion of the population commit the majority of crime regardless of race/ethnicity, but the leading factors in offending were “poor school attendance, prior violent conviction, theft/drug/traffic conviction, mental disorders, and drug use”. FB attempts to debunk Vaush's claims about why black crime rates are higher by first providing a study that says “just being poor isn't the only thing that causes crime, it's also home life”, which FB reads as “being poor doesn't contribute to crime.” In addition, Nordic countries are much more equal societies than the US. By the Gini coefficient, Nordic countries have consistently ranked among the most equal societies in the world (World Bank, 2021). The US, on the other hand, is among the more unequal societies. Thus, the range of differences between neighborhoods is far narrower, and neighborhood disadvantage less extreme, in the Nordic countries than in the US. As a consequence, the possible effects of neighborhood disadvantage might not be as overt.

[9] https://www.ojp.gov/pdffiles1/nij/grants/232084.pdf This goes more in depth.

[10] The Rushton study has numerous flaws. Even before looking at this study, we should realize that Rushton and Templer are pseudoscientists (see the Rushton rebuttal in the SL). The first issue is their citation of Lynn & Vanhanen (2006). In a meta-analysis of studies of IQ estimates in sub-Saharan Africa, Wicherts, Dolan & van der Maas (2009) concluded that Lynn and Vanhanen had relied on unsystematic methodology by failing to publish their criteria for including or excluding studies. They found that Lynn and Vanhanen’s exclusion of studies had depressed their IQ estimate for sub-Saharan Africa, and that including studies excluded in "IQ and Global Inequality" resulted in an average IQ of 82 for sub-Saharan Africa, lower than the average in Western countries, but higher than Lynn and Vanhanen’s estimate of 67. Wicherts et al. conclude that this difference is likely due to sub-Saharan Africa having limited access to modern advances in education, nutrition and health care. A 2010 systematic review by the same research team, along with Jerry S. Carlson, found that compared to American norms, the average IQ of sub-Saharan Africans was about 80. The same review concluded that the Flynn effect had not yet taken hold in sub-Saharan Africa. Lynn & Vanhanen (2006) has been criticized to no end by people like Nisbett and Wicherts. Furthermore, the European Human Behavior and Evolution Association issued a statement opposing the book on grounds of faulty methodology. Next he mentions r/K, which has been thoroughly debunked.

See the Rushton rebuttal section and:

https://notpoliticallycorrect.me/2017/09/28/rk-selection-theory-rebuttals/

https://notpoliticallycorrect.me/2017/12/09/more-r-k-selection-theory-rebuttals/

Next he cites Templer and Arikawa to basically claim skin color is correlated with IQ. This is false: https://notpoliticallycorrect.me/2017/04/09/the-evolution-of-human-skin-variation/

Even HBDers admit this (see the first comment on this post)

http://inductivist.blogspot.com/2011/12/skin-color-and-desirable-traits.html?showComment=1323237742022#c4456747326487257693

Then he cites Ducrest, which they use to posit that darker-skinned individuals are more aggressive, sexually active, and resistant to stress. But here's one quote they conveniently miss:

In this respect, it is important to note that variation in melanin-based coloration between human populations is primarily due to mutations at, for example, MC1R, TYR, MATP and SLC24A5 [29,30] and that human populations are therefore not expected to consistently exhibit the associations between melanin based coloration and the physiological and behavioural traits reported in our study.

He then cites Lynn on "a north–south gradient in IQ predicted differences in income, education, infant mortality, stature, and literacy." This is a laughable finding; see:

https://www.gwern.net/docs/iq/2010-cornoldi.pdf

To add to that, Rushton has a section at the end that goes on about brain size and IQ, but like everything else, this is all false:

https://pubmed.ncbi.nlm.nih.gov/26449760/

.14 corr regarding education

https://journals.sagepub.com/doi/full/10.1177/0956797618808470

.19 corr.

EVEN BEFORE THEY CORRECTED FOR ANYTHING THE CORRELATION WAS .21

https://www.pnas.org/content/97/9/4932

This shows that brain size does not predict general cognitive ability within families, so Rushton's 2009 study is wrong. Even worse for Rushton, even THESE studies have been criticized:

see:

https://developmentalsystem.wordpress.com/2019/11/09/critical-commentary-on-nave-et-al-2018/

Gignac in 2015 might have found a correlation of .4, but the Pietschnig evidence above is a meta-analysis. The final link also exposes flaws in general methods, so it can be seen as an indirect response. They make extremely dubious claims while trying to sweep them under the rug by mentioning how one should be "cautious" in interpreting their results. For example, they state:

Placing darker versus lighter pigmented individuals with adoptive parents of the opposite pigmentation did not modify offspring behavior....The genes that control that balance occupy a high level in the hierarchical system of the genome.

This is incorrect, as genes are not some blueprint; see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2223161/
Furthermore, skin color doesn't "cause" IQ: https://notpoliticallycorrect.me/2017/04/09/the-evolution-of-human-skin-variation/
They also cite studies without realizing that those very same studies contradict their own claims. From Ducrest 2008:

Some other stupid quotes include:

A first examination of whether melanin based pigmentation plays a role in human aggression and sexuality (as seen in non-human animals), is to compare people of African descent with those of European descent and observe whether darker skinned individuals average higher levels of aggression and sexuality (with violent crime the main indicator of aggression).

This is ridiculous: it may be the case that African countries have more crime than European nations, but skin color is not the cause of this. ANOTHER stupid point is them asking black and white people how many times they have sex:

asked married couples how often they had sex each week. Pacific Islanders and Native Americans said from 1 to 4 times, US Whites answered 2–4 times, while Africans said 3 to over 10 times.

They further cite outdated, non-representative, non-random samples from Lynn 1989 to support using data from Alfred Kinsey.
See: https://www.sciencedirect.com/science/article/pii/0191886988901365
They then entertain the penis size debate. I should not have to explain why none of that means anything. Furthermore, their invocation of r/K is fraught with issues.
See: https://notpoliticallycorrect.me/2017/06/24/rk-selection-theory-a-response-to-rushton/
It has been refuted and replaced with age-specific mortality. Next, his testosterone arguments are also fraught with issues. I have added an entire testosterone section to the SL; please look over that. In sum, testosterone doesn’t do what he believes it does, and the differences in testosterone levels between the races are not as large as believed, or are non-existent. His citations of Nyborg and himself ultimately lead back to Ross et al. 1986, which RR has critiqued in depth (and which I have added to the SL). So ultimately his claims on IQ and skin color are taken from animal studies, stretched assertions, and bad data that mean nothing.
Templer and Rushton 2011 is laughably stupid; not even hereditarians cite this paper. So it's interesting that the author of these blogs would cite it. Either they haven't read the study or they are purposefully being stupid.

[11] What follows is a response to Sean Last’s blog post Slavery and Modern Black Poverty. I’ll refer to the author, Sean Last, as “SL”.

[12] Macdonald just talks about Wright et al. which is debunked way further below.

[13] The paper shows a spike of results right around zero that's not symmetrical. That is, there's a clump of results just above zero and not very many just below. They conclude that this suggests publication bias, which is possible. But it might also just be an example of p-hacking, where researchers putter around in their data until they can nudge it just above zero. This is also a known problem, but it would have only a small effect on the reported effect of lead on crime.

[14] As it happens, the paper doesn't show this. There's a kinda sorta spike around 0.33, but only if you squint. For the most part, there's just a smooth array of results from 0 to 1. If this is accurate, it is a bit odd, but hardly inexplicable since every study of lead and crime is working with entirely different datasets.

[a]This is probably somewhat obvious, but this paragraph is very long, and there are plenty of points in it, some of which are stronger than others. Consider splitting these paragraphs to highlight the stronger points.

[b]Is this Francis Black saying this, or you? So far most first-indentation points have been Francis Black. It looks like you just didn't respond to this point maybe?

[c]I didn't. I couldn't find a pdf/full version of the book.

[d]It was also kind of vague. "For more"? What exactly does that mean, and where exactly do I find it in the book, since the book seems to cover unrelated topics?

[e]Move this to the bibliography section, and include the name of the paper. This just looks like a random suspicious link, and it's not even clear you're using it in your other arguments.

[f]I also verified what you meant by "vaguely refers to them as O'Brien, 2001". In the case of this very specific citation, it's not just vaguely referred. The text is literally hyperlinked to the source, which is this:

Criminology: An Interdisciplinary Approach

https://www.amazon.com/Criminology-Interdisciplinary-Approach-Anthony-Walsh/dp/1412938406

So, uh, gotta be careful with that I guess.

EDIT: Oh shit, this is not a source by O' Brien. Meaning... this book must cite some O' Brien source internally. So it's a second-degree citation. What the hell?!?! This issue was not clearly explained in your paragraph here. That's actually quite annoying, and really obfuscates the sources.

[g]_Marked as resolved_

[h]_Re-opened_
