by Rachel Thomas
The term “Artificial Intelligence” is a broad umbrella, referring to a variety of techniques applied to a range of tasks. This breadth can breed confusion. Success in using AI to identify tumors on lung x-rays, for instance, may offer no indication of whether AI can be used to accurately predict who will commit another crime or which employees will succeed, or whether these latter tasks are even appropriate candidates for the use of AI. Misleading marketing hype often clouds distinctions between different types of tasks and suggests that breakthroughs on narrow research problems are more broadly applicable than is the case. Furthermore, the nature of the risks posed by different categories of AI tasks varies, and it is crucial that we understand the distinctions.
One source of confusion is that in fiction and the popular imagination, AI has often referred to computers achieving human consciousness: a broad, general intelligence. People may picture a super-smart robot, knowledgeable on a range of topics, able to perform many tasks. In reality, the advances happening in AI right now are narrow: a computer program that can do one task, or class of tasks, well. For example, one software program analyzes mammograms to identify likely breast cancer, while a completely different program scores essays written by students (although it is fooled by gibberish that uses sophisticated words). These are separate programs, fundamentally different from the depictions of human-like AI in science fiction movies and books.
It is understandable that the public may often assume that since companies and governments are implementing AI for high-stakes tasks like predictive policing, determining healthcare benefits, screening resumes, and analyzing video job interviews, it must be because of AI’s superior performance. However, the sad reality is that AI is often implemented as a cost-cutting measure: computers are cheaper than employing humans, and this can lead decision-makers to overlook the harms caused by the switch, including bias and error, and to neglect vetting accuracy claims.
In a talk entitled “How to recognize AI snake oil”, Professor Arvind Narayanan created a useful taxonomy of three types of tasks AI is commonly being applied to right now:
- Perception: facial recognition, reverse image search, speech to text, medical diagnosis from x-rays or CT scans
- Automating judgement: spam detection, automated essay grading, hate speech detection, content recommendation
- Predicting social outcomes: predicting job success, predicting criminal recidivism, predicting at-risk kids
These three categories do not cover all uses of AI, and there are certainly innovations that span more than one of them. Still, the taxonomy is a useful heuristic for considering differences in accuracy and in the nature of the risks we face. For perception tasks, some of the biggest ethical concerns stem from how accurate AI can be (e.g. a state that can accurately surveil protesters poses a chilling threat to civil rights); in contrast, for predicting social outcomes, many of the products are total junk, which is harmful in a different way.
The first area, perception, which includes speech to text and image recognition, is where researchers are making truly impressive, rapid progress. However, even here, that doesn’t mean the technology is always ready to use, or that there aren’t ethical concerns. For example, facial recognition often has much higher error rates for dark-skinned women, due to unrepresentative training sets. And even when accuracy is improved enough to remove this bias, the use of facial recognition by police to identify protesters (which has happened numerous times in the USA) is a grave threat to civil rights. Furthermore, how an algorithm performs in a controlled, academic setting can be very different from how it performs when deployed in the real world. For example, Google Health developed a computer program that identifies diabetic retinopathy with 90% accuracy when used on high-quality eye scans. However, when it was deployed in clinics in Thailand, many of the scans were taken in poor lighting conditions, and over 20% of all scans were rejected by the algorithm as low quality, creating great inconvenience for the many patients who had to take another day off work to travel to a different clinic to be retested.
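To make that deployment gap concrete, here is a minimal sketch of the kind of quality gate described above: a model validated on clean scans is fronted by a check that rejects images it deems too low quality. This is not Google Health’s actual system; the variance-based quality metric, the threshold, and the simulated “scans” are all invented for illustration.

```python
# Illustrative sketch only: a hypothetical quality gate that rejects scans
# judged too dim or low-contrast before the diagnostic model ever sees them.
import numpy as np

def image_quality(scan):
    """Crude proxy for scan quality: variance of pixel intensities.
    Dim, low-contrast images tend to have low variance."""
    return float(scan.var())

def triage(scans, quality_threshold=0.02):
    """Split scans into those the model will grade and those sent back for retakes."""
    accepted, rejected = [], []
    for scan in scans:
        (accepted if image_quality(scan) >= quality_threshold else rejected).append(scan)
    return accepted, rejected

# Simulated clinic batch: half the scans taken in poor lighting (low contrast).
rng = np.random.default_rng(0)
good = [rng.normal(0.5, 0.2, (64, 64)) for _ in range(50)]   # high-contrast scans
dim = [rng.normal(0.2, 0.05, (64, 64)) for _ in range(50)]   # dim, low-contrast scans
accepted, rejected = triage(good + dim)
print(f"rejected {len(rejected)} of {len(good) + len(dim)} scans")  # a large share bounced
```

A model can be highly accurate on the scans it accepts and still fail the clinic if its quality gate turns away a large share of real-world patients.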
While improvements are being made in the second category, automating judgement, the technology is still faulty, and there are limits to what is possible because culture and language use are always evolving. Widely used essay grading software rewards “nonsense essays with sophisticated vocabulary” and is biased against African-American students, giving their essays lower grades than expert human graders do. The software can measure sentence length, vocabulary, and spelling, but it cannot recognize creativity or nuance. Content from LGBTQ YouTube creators was mislabeled as “sexually explicit” and demonetized, harming their livelihoods. As Ali Alkhatib wrote, “The algorithm is always behind the curve, executing today based on yesterday’s data… This case [of YouTube demonetizing LGBTQ creators] highlights a shortcoming with a commonly offered solution to these kinds of problems, that more training data would eliminate errors of this nature: culture always shifts.” This is a fundamental limitation of the category: language keeps evolving, and new slurs and forms of hate speech develop, just as new forms of creative expression do.
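A toy illustration of why such systems are easy to game: the sketch below is purely hypothetical (not the actual grading software) and scores essays using only surface features like average word length and sentence length. It duly rewards sophisticated-sounding gibberish over a plain but coherent passage.

```python
# Illustrative sketch of a surface-feature essay scorer; the features and
# weights are arbitrary, chosen only to show how gibberish can outscore sense.
import re

def surface_score(essay):
    words = re.findall(r"[A-Za-z']+", essay)
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)     # proxy for "vocabulary"
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    avg_sent_len = len(words) / max(len(sentences), 1)          # proxy for "complexity"
    return 0.6 * avg_word_len + 0.4 * avg_sent_len              # arbitrary weights

plain = "The dog ran to the park. It was happy. The sun was out."
gibberish = ("Notwithstanding perspicacious paradigms, ontological exigencies "
             "promulgate heuristic epistemologies despite recalcitrant axioms.")

print(surface_score(plain))      # lower score, despite being coherent
print(surface_score(gibberish))  # higher score, despite meaning nothing
```

Nothing in such a scorer can tell whether an essay says anything true, original, or even grammatical; it only measures what is easy to count.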
Narayanan labels the third category, trying to predict social outcomes, as “fundamentally dubious.” AI can’t predict the future, and attempting to label a person’s future potential is deeply concerning. Often, these approaches are no more accurate than simple linear regression. Social scientists spent 15 years painstakingly gathering a rich longitudinal dataset on families, containing 12,942 variables. When 160 teams created machine learning models to predict which children in the dataset would have adverse outcomes, the most accurate submission was only slightly better than a simple benchmark model using just 4 variables, and many of the submissions did worse than that benchmark. In the USA, a black box software program with 137 inputs is used in the criminal justice system to predict who is likely to be re-arrested, yet it is no more accurate than a linear classifier on just 2 variables. Not only is it unclear that there have been meaningful AI advances in this category, but, more importantly, the underlying premise of such efforts raises crucial questions about whether we should be attempting to use algorithms to predict someone’s future potential at all. Together with Matt Salganik, Narayanan has further developed these ideas in a course on the Limits to Prediction (check out the course pre-read, which is fantastic).
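The “no better than a simple baseline” finding is easy to reproduce in miniature. The sketch below uses synthetic data (not the actual family survey or criminal justice data): when the predictable signal really lives in a couple of variables and the rest is noise, a two-feature logistic regression performs about as well as a complex model trained on all 137 features.

```python
# Illustrative sketch with synthetic data: a 2-feature linear baseline
# versus a complex model with 137 inputs.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n, n_features = 5000, 137
X = rng.normal(size=(n, n_features))
# The outcome depends weakly on just two variables; everything else is noise.
logits = 0.8 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(scale=1.0, size=n)
y = (logits > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

complex_model = GradientBoostingClassifier().fit(X_tr, y_tr)
simple_model = LogisticRegression().fit(X_tr[:, :2], y_tr)

print("137-feature model:", accuracy_score(y_te, complex_model.predict(X_te)))
print("2-feature baseline:", accuracy_score(y_te, simple_model.predict(X_te[:, :2])))
```

The point is not that complex models never help, but that an impressive-sounding input count is no evidence of predictive power, especially for inherently noisy social outcomes.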
Narayanan’s taxonomy is a helpful reminder that advances in one category don’t necessarily mean much for a different category, and he offers the crucial insight that different applications of AI create different fundamental risks. The overly general term artificial intelligence, misleading hype from companies pushing their products, and confusing media coverage often cloud distinctions between different types of tasks and suggest that breakthroughs on narrow problems are more broadly applicable than they are. Understanding the types of technology available, as well as the distinct risks they raise, is crucial to addressing and preventing harmful misuses.
Read Narayanan’s How to recognize AI snake oil slides and notes for more detail.