
Artificial Intelligence and Machine Learning + Accessibility
Introduction
Technologists have envisioned machines as intelligent as humans for centuries. Stories describe computers that listen to you speak, reply in natural language, perceive the world around them, fluidly switch modalities to whatever the situation calls for, and are able to complete or guide their human companions through myriad tasks of everyday life. For most people, such computing systems have been confined to science fiction. But people with disabilities have been using systems with many of these features for decades, placing people with disabilities at the forefront of artificial intelligence.
The field of accessibility is thus deeply connected to artificial intelligence and, through this connection, is often a window into the future of computing. Screen reading technologies have long converted visual information to voice or refreshable Braille, automatic speech recognition (ASR) has turned aural speech to visual text, and computer vision perceives visual information in the world and reads it aloud. People with disabilities are the early adopters of AI technology, and we can learn a lot about the future of AI by studying accessibility. Similarly, many leaders in artificial intelligence got their start working on accessibility applications.
Because people with disabilities have been using these technologies, accessibility keeps the promises of futuristic AI technology grounded. The practical utility demanded for adoption of technology in people’s everyday lives means that the cost of errors (however rare) cannot be ignored, and solutions need to work in the real world and not only in the lab. Adoption of intelligent technologies by people with disabilities has been notoriously low (although this is changing), not because the technology wasn’t impressive but because the value it provided was often misaligned with the real needs of people. Just as accessibility can provide a glimpse into technical futures, it also can provide a perspective on the design of AI products – just because something is impressive does not mean it will be adopted. This is a lesson proven over and over again in accessibility; the lesson applies broadly and continues to be realized even today.
As mainstream AI technologies have started to become commonplace, from voice assistants to the beginnings of driverless cars, in accessibility we have found that these technologies often leave out or fail to adequately consider people with disabilities. Voice, quickly becoming a default mode of interaction in everything from our cars to hotel rooms, fails to recognize the speech of people with even mild speech impairments, e.g., people who stutter6. Self-driving cars have been found to not recognize pedestrians in wheelchairs as pedestrians10, potentially leading to dangerous consequences. Mainstream technology often doesn’t consider people with disabilities, and this can be especially concerning and frustrating with technologies that “just work” for many people but don’t work for people who move or look or interact differently. This means we need to consider accessibility not only in features designed specifically to support accessibility but also in mainstream features intended to support everyone.
This chapter explores accessibility from the interconnected technical, design, and product perspectives. It is organized into four sections. First, we will briefly discuss what artificial intelligence and machine learning are, and how their history is intertwined with accessibility. Second, we will discuss sub-areas of AI and how they connect to specific application areas in accessibility. Third, we will discuss design considerations for AI/ML technologies for accessibility. And, finally, we will discuss machine learning in particular and consideration for accessibility, broadly.
Artificial Intelligence and Machine Learning
Despite its ubiquity, there is no agreed-upon definition of what “artificial intelligence” means. Informally, I have found it useful to think of “artificial intelligence” as anything built by humans that seems intelligent. In the early days of the field (e.g., the 1960s), artificial intelligence was a combination of deterministic rules and algorithms applied to problems that people thought required intelligence to solve. For instance, the ELIZA computer therapist was simply a clever set of if/then rules defined by its creator8, and yet at the time people thought it was highly intelligent.
Another useful distinction to make is between intelligence and perception. Many subfields of artificial intelligence relevant to accessibility are defined on the surface by notions of perception, e.g., computer vision or speech recognition. While the details are a bit more complicated than this, intelligence is difficult to separate from perception, as what one perceives with their senses is inevitably influenced by their interpretation of it. This, in turn, causes some challenges in accessibility, where artificial intelligence researchers sometimes make the inverse mistake of assuming that because a person cannot perceive in a particular way, their intelligence must be diminished. A recent example is an article that argued through thought experiment that large language models (LLMs) could be intelligent even if they do not have senses by relating them to people who are deaf-blind2. Setting aside the whole host of experiential intelligence that artificial intelligence cannot yet match, this sort of argument is misguided: human beings are remarkably adaptable, and people with a wide variety of perceptual and other (non-cognitive) disabilities are as intelligent as everyone else.
In the last few decades, artificial intelligence has been dominated by “machine learning”. In this approach, systems are made to act intelligent not by humans directly coding a decision for every possible input, but rather by providing training data that the system then “learns” from. The advantage of this approach is that the knowledge captured in this data would be too difficult to manually program, and so such approaches to machine learning now perform better than alternative approaches on nearly every problem. Recent advances in neural network architectures and increasingly powerful computer systems have allowed models to be trained with and effectively learn from massive amounts of data. Large language models (LLMs) trained on massive amounts of text have shown surprising adaptability to a variety of problems of human concern, with a number of potential opportunities and concerns for accessibility14 .
Applying machine learning to accessibility is not so different than applying machine learning to other use cases, although it is worth noting some common challenges that are perhaps magnified when working in accessibility. When designing features that use machine learning, it is especially useful to think about not only when the system works but also when it fails. A common problem is that a system is designed to work in a particular narrow situation and fails in the real world. In the machine learning community this is sometimes referred to as “distribution shift”, which is meant to indicate that the distribution of data that the machine learning model operates over is different during real use than during training. Oftentimes, the data from people with disabilities are the statistical outliers and thus represent a “shift” in the distribution, because they weren’t included in the original training data or the designers of an ML feature didn’t think to include data from them10 .
Within accessibility, there has been a resulting push to make sure that people with disabilities are appropriately represented in datasets. People with disabilities can be under-represented in datasets for a variety of reasons. Sometimes people collecting the datasets specifically exclude people with disabilities, e.g., they might recruit only people who use ten fingers to type on the keyboard in a text entry data collection. Other times, people with disabilities are under-represented because insufficient care was taken to make sure they were represented. For instance, datasets collected in the context of something else that is inaccessible will under-represent people with disabilities – imagine a voice assistant that does not recognize the speech of people with speech impairments; data collected from the natural use of this assistant will be very unlikely to include data from people with speech impairments. Subtle issues can also cause disparities in representation, e.g., if a data collection requires people to travel to a particular lab location, a general call might be less likely to reach people with disabilities, who could find it more difficult to travel and be less likely to do so.
Even representative data can be insufficient. People with disabilities are both less numerous than people without disabilities and vary much more in how they look, move, etc. Thus, building machine learning models that work well across a variety of abilities can require oversampling: collecting proportionally more data from people with disabilities than from other users, which can mean recruiting many more people with disabilities than others, sometimes substantially more. One method for this is stratified sampling, where participant data is divided into groups (strata) based on characteristics, such as their disability, and each group is then sampled to a target size, as sketched below.
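To make the idea concrete, here is a minimal sketch of stratified oversampling using pandas. The DataFrame, its `stratum` column, and the target count are all hypothetical; resampling existing rows only re-weights the data you already have, so in practice the more important step is recruiting enough participants from each group in the first place.

```python
import pandas as pd

# Toy dataset: each row is one participant's data, tagged with a stratum label.
data = pd.DataFrame({
    "participant": range(10),
    "stratum": ["no_disability"] * 8 + ["low_vision", "tremor"],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})

target_per_stratum = 8  # how many examples we want from every group

# Sample each stratum (with replacement) up to the target size so that
# smaller groups are not swamped by the majority group during training.
balanced = (
    data.groupby("stratum", group_keys=False)
        .apply(lambda g: g.sample(n=target_per_stratum, replace=True, random_state=0))
        .reset_index(drop=True)
)

print(balanced["stratum"].value_counts())
```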

So far we have largely ignored performance, as if choosing appropriately represented data will always result in a machine learning system that works well. This is not the case! While a focus on data and representation is paramount, we also need to measure the performance of our resulting systems. Within machine learning, performance is usually measured according to metrics that can be automatically calculated, although best practice is to validate these automatic measures with more costly human evaluation. The simplest metric is how many of the predictions made by the model are correct versus incorrect. Datasets collected and annotated for training a machine learning system often contain a “ground truth” label, which is the correct label for a particular instance.
Imagine a computer vision system designed to detect bus stops (which could be useful to a person who is blind)—a dataset might consist of images that contain bus stops and images that don’t contain bus stops. Each image is an instance, and the ground truth label is a binary value indicating whether the image contains a bus stop. Often, this ground truth is determined by asking people to label (or annotate) the image. A machine learning approach might learn a model from this data that is then able to predict for a new image whether or not it contains a bus stop. This type of machine learning is called supervised machine learning, whereas machine learning done without the aid of a ground truth label is called unsupervised.
A simple metric might be how many of the images were correctly labeled as containing or not containing a bus stop, e.g., an approach might score 78%. In general, real systems are not correct 100% of the time (you might even be reasonably skeptical of one that claims to be), and so we need to think about what kind of performance is reasonable and useful for our intended use. One way to better understand this in practice is to differentiate between different kinds of errors—maybe the designer of the system, having worked with potential users, has learned that it’s better for the system to predict a bus stop when there isn’t one than to miss the bus stop altogether (this is just an example, maybe it’s the opposite!). We might then talk about false positives (examples that don’t contain the target but are falsely predicted as containing the target, e.g., images that don’t contain a bus stop that are predicted to contain one) and false negatives (examples that contain the target but are falsely predicted not to contain the target, e.g., images that contain a bus stop that are predicted not to contain one). Other common ways to refer to error rates include precision, recall and sensitivity, and specificity, defined in Table 1.
| Term | Definition |
|---|---|
| precision | true positives / (true positives + false positives) |
| recall (sensitivity) | true positives / (true positives + false negatives) |
| specificity | true negatives / (true negatives + false positives) |
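As a concrete illustration of the definitions in Table 1, the short sketch below computes these metrics for the hypothetical bus stop detector; the labels and predictions are invented for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives/negatives for a binary task (1 = bus stop present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),    # of predicted bus stops, how many were real
        "recall": tp / (tp + fn),       # of real bus stops, how many were found
        "specificity": tn / (tn + fp),  # of images without bus stops, how many were correctly rejected
    }

# Ground truth vs. model predictions for nine images (1 = contains a bus stop).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0]
print(metrics(y_true, y_pred))  # accuracy is about 0.78, i.e., the 78% example above
```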
In some cases the ground truth answer is not easily determined. This can be the case for systems that generate text, for example. Text generation is used in a number of accessibility tools, such as in automatically producing textual descriptions of images for people who are blind, or automatically suggesting text to say for someone using an Augmentative and Alternative Communication (AAC) device to vocalize speech. A common metric to compare two text strings is Word Error Rate (WER), which is the number of fixes (insertions, deletions, and substitutions) required to turn one string into the other, divided by the number of words in the reference string. This metric is intended to capture how different two strings are, and it seems reasonable. It has also proven useful in practice on a wide variety of tasks involving text.
While a reasonable starting point, this metric has at least two problems:
- It assumes there is a single ground truth answer, which is usually not the case for language, and
- It can score text with very different meanings as nearly identical.
As an illustration of the first problem, consider the two sentences, “They drove away in their car” and “The people jumped into a Chrysler and sped off” – these sentences have zero words in common and yet mean roughly the same thing. As an illustration of the second problem, consider the sentences, “I am happy” and “I am not happy” – these mean opposite things, but according to WER are quite close (it requires just one insertion to turn the first string into the second one).
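The sketch below is a minimal implementation of WER as described above (a standard dynamic-programming edit distance over words, with no text normalization), and it shows both failure modes on the example sentences.

```python
def word_error_rate(reference, hypothesis):
    """Minimum word insertions, deletions, and substitutions needed to turn the
    hypothesis into the reference, divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("I am happy", "I am not happy"))   # low WER, opposite meaning
print(word_error_rate("They drove away in their car",
                      "The people jumped into a Chrysler and sped off"))  # high WER, similar meaning
```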
WER is one of the simplest measures of text similarity, and measuring the quality of text or other generated output is an active field of research. Although not perfect, one simple way to improve evaluation is to explicitly include multiple correct answers in the ground truth16. Other methods include comparing text not word by word, but rather according to scores intended to better capture semantic meaning (generally using another model). Large language models have also been used to assess the output of other models. All of these are attractive options and can be useful, yet care must be taken in their application, as none of these methods can fully guard against the semantic errors we observe in basic WER, and their benefits come at the cost of making it harder to understand how the metric was calculated.
Oftentimes, human evaluation (asking humans to rate output) is the best that can be done. This is great, but is difficult to scale. A question to always ask, especially in the context of accessibility, is who the people doing the ratings are and whether that group is inclusive of the target user group. In the case of accessibility, we likely want to ensure that people with disabilities or at least their perspectives are included in the data annotation and evaluation.
A final question we might ask is “what is a fair metric?” This seems straightforward, but defining what fair means for a particular context is at the crux of many design decisions that go into AI/ML products. It also evolves along with technological progress. If a speech recognition system does not work at all for people with a certain kind of speech, are metrics that are reported on data that does not include them fair? We likely would agree that they are not, but what about for people for whom nobody has yet produced working speech recognition systems or who do not produce speech that other people can understand? Ultimately, we should be deliberate and transparent about the tradeoffs made during design and development of systems, ensure we are not leaving out people for whom there is a known (or achievable) solution, report who the system does not currently work for and provide alternative solutions in those cases, and strive to push toward mainstream ML systems that everyone can use.
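One practical step toward answering the fairness question is to always report metrics disaggregated by group rather than only in aggregate. The sketch below does this for a hypothetical speech recognizer; the group labels and numbers are invented to show how a healthy-looking average can hide a group for whom the system barely works.

```python
from collections import defaultdict

def disaggregated_accuracy(examples):
    """Report accuracy overall and per group, so a strong average cannot hide a
    system that fails for a particular group (group labels are illustrative)."""
    per_group = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, correct in examples:
        per_group[group][0] += int(correct)
        per_group[group][1] += 1
    overall = sum(c for c, _ in per_group.values()) / sum(t for _, t in per_group.values())
    by_group = {g: c / t for g, (c, t) in per_group.items()}
    return overall, by_group

examples = (
    [("typical speech", True)] * 90
    + [("typical speech", False)] * 10
    + [("stuttered speech", True)] * 5
    + [("stuttered speech", False)] * 15
)

overall, by_group = disaggregated_accuracy(examples)
print(overall)   # ~0.79 overall looks acceptable...
print(by_group)  # ...but accuracy for stuttered speech is only 0.25
```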
How AI/ML Subfields Map to Accessibility Problems
Opportunities in accessibility map onto subfields of AI/ML because of their close connection. For each subfield, this section lists example applications that use related technologies and highlights some ways that the particular subfield illustrates concepts introduced in this chapter. Many of these subfields are covered in detail in other chapters of this book (e.g., Computer Vision + Accessibility). One exciting development over the past handful of years is the arrival of powerful language and multimodal transformer models that have quickly changed the landscape of many of these subfields of artificial intelligence1.
Despite quickly progressing technological advances, describing these areas remains a useful way to think about the different application areas, how they relate to accessibility, and what is important for making them work well. As it often does, accessibility helps us to unpack what it means to have truly “solved” a problem for humans. As one example, even a computer vision system able to recognize and describe everything in a space would not solve visual accessibility for a person who is blind—there is too much stuff to speak! The interaction problem posed by accessibility is thus how to get efficient access to visual information and enable interactive exploration.
Computer vision
Computer vision is the field in AI/ML broadly tasked with computationally understanding visual information. Like many subfields of artificial intelligence, early computer vision work involved substantial manual work by programmers, first matching inputs of pixels to previously defined templates, and later manually creating feature descriptors that could more robustly match previously seen types of inputs. Most computer vision systems these days use machine learning, and thus concerns about the data used to train the model (what it may leave out, what biases might exist in that data, etc.) are paramount.
At a high level, computer vision maps directly onto technologies that might be useful for people who are blind or have low vision, although it can also support other accessibility use cases as well.
Text identification and recognition
As you go about your day, think about all of the text you see—visual text is everywhere! It is on our computer screens and in our books. It is on signs, labels, and buttons. Menus in restaurants are text, and so are the departure times on train station screens. Visual text embodies language in a particular modality of access, but once converted into digital form, text can be conveyed in a number of ways, e.g., speech or refreshable Braille, or just made bigger so it’s easier to see. These transformations require text to be in a machine readable form and not raw pixels.

The sub-area of computer vision focused on recognizing text is called “Optical Character Recognition,” or OCR for short, and the goal is to turn visual representations of text into machine readable ones. This is one of the oldest areas of computer vision applied to accessibility (see Figure 2 of futurist Ray Kurzweil and one of his early “reading machines”). Early approaches assumed access to the particular font being recognized so the letters of that font could be directly matched as templates. A big advance in the field came with approaches that were robust to multiple fonts. These days, OCR works like other machine learning approaches: text recognition is powered by large datasets of text that the system is trained to recognize.
While recognizing text written in a standard font on a plain background has gotten to be nearly perfect, OCR is far from a solved problem in practice. It remains challenging to recognize text in the world that might appear at extreme angles in natural scenes or when it is obscured or blurred. Related problems important to accessibility include grouping recognized text together into semantically relevant units and recognizing structural elements, e.g., headings, titles, footnotes, columns, etc.11 . While OCR is available and high quality for certain languages, e.g., English, current technology performs much worse on other languages (sometimes called “low resource languages”), generally because of less investment in those languages.
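As a small, hedged example of what using OCR looks like in practice, the sketch below uses the open-source Tesseract engine via the pytesseract wrapper (the image path is hypothetical, and Tesseract plus the pytesseract and Pillow packages are assumed to be installed). The word-level boxes and confidences it returns are the kind of raw material an access technology would still need to group into headings, columns, and labels.

```python
from PIL import Image
import pytesseract  # assumes the Tesseract OCR engine is installed locally

image = Image.open("bus_schedule.jpg")  # hypothetical photo of a printed schedule

# Plain recognition: returns whatever text Tesseract finds in the image.
text = pytesseract.image_to_string(image, lang="eng")
print(text)

# Word-level results with confidences, useful for grouping recognized text into
# semantically relevant units before handing it to a screen reader.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if str(word).strip():
        print(word, conf)
```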
Automatically describing images
With the rise of the Internet, visual content began to be shared at an unprecedented rate. A number of approaches have been developed to make those images accessible non-visually. The most basic is the introduction of the “alt attribute” in HTML 1.222, which allows authors to provide a text alternative describing an image. Unfortunately, despite thirty years of advocacy for people to provide such descriptions, the reality is that a very small fraction of images on the web (and other platforms) are assigned alternative descriptions5,12.
Given that very few images are assigned descriptions by people, computer vision might be a natural solution for providing descriptions automatically. Only a few years ago, such systems could only provide a few keywords20 , but these days systems powered by multimodal language models can provide rich descriptions15 . Nevertheless, open problems remain about what is important to describe in a particular photo and how to describe it. For instance, describing people can be complex given that many of the ways that people are often naturally described assume identity information that cannot be determined from pixels alone3 .
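For illustration, a few lines with an off-the-shelf captioning model show how a description can be produced automatically today. This is a sketch using the Hugging Face transformers library and a public BLIP model; the model choice and image path are assumptions for the example, not the systems cited above. Its one-sentence output is exactly the kind of description that raises the open questions about what should be described and how.

```python
from transformers import pipeline

# A small public image-captioning model; assumes transformers and its image
# dependencies (e.g., Pillow) are installed. The model choice is illustrative.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("photo_from_timeline.jpg")  # hypothetical local image path
print(result[0]["generated_text"])             # e.g., a one-sentence description of the photo
```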
Enabling interaction with graphical user interfaces
Computer vision can also be used to make graphical user interfaces (GUIs) accessible. The main idea here is that developers often build interfaces that assume particular modalities (e.g., high visual acuity and high motor dexterity), and so if we can create technologies that are able to perceive and interact through these modalities, then they can often make interfaces accessible that otherwise would not be. While it is paramount that developers provide appropriate metadata to enable access technologies to work with these interfaces (as has been covered elsewhere in this book), and this remains the gold standard for accessibility, it unfortunately remains the case that many (if not most) developers fail to make accessible interfaces, generally because they do not know to or have not bothered to code their interfaces properly. This problem gets worse every year as more and more inaccessible interfaces are produced and become legacy.
Using computer vision to make GUIs accessible maps onto existing tasks that are popular within the computer vision community. The first major task is to identify the user interface components (e.g., textboxes, buttons, text areas, radio buttons, etc.) – this is essentially “object detection” where the objects are those user interface components. To make high-quality accessible experiences, access technologies also need to know about the hierarchy of visual elements and make visual text accessible (OCR), which are also problems addressable by computer vision. Screen Recognition, released in Apple iOS in 2021, is one example of using computer vision to make the user interface elements and UI hierarchy of a graphical user interface available to screen reader users24.
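To make this pipeline concrete, here is a simplified, hypothetical sketch of the last step: taking UI elements detected from pixels (their types, OCR’d text, and bounding boxes) and turning them into accessibility elements in a plausible reading order. This is only an illustration of the idea behind systems like Screen Recognition, not their actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    kind: str   # e.g., "button", "textbox", "label" from a UI object detector
    text: str   # OCR'd text inside or near the element
    x: float
    y: float
    w: float
    h: float

def to_accessibility_elements(detections):
    """Turn raw detections into screen-reader elements in a plausible reading order
    (top-to-bottom, then left-to-right); the 40px row banding is an assumption."""
    ordered = sorted(detections, key=lambda d: (round(d.y / 40), d.x))
    return [{"role": d.kind, "label": d.text or d.kind,
             "frame": (d.x, d.y, d.w, d.h)} for d in ordered]

detections = [
    Detection("button", "Submit", x=250, y=600, w=120, h=44),
    Detection("textbox", "Email", x=40, y=200, w=300, h=44),
    Detection("label", "Sign in", x=40, y=120, w=200, h=32),
]
for element in to_accessibility_elements(detections):
    print(element)
```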
Computer vision and computer graphics (the field within computer science that focuses on generating visual information rather than understanding it) can be combined to reflow and personalize existing graphical user interfaces to make them less cluttered, easier to use, or work better for people using various access technologies. For instance, someone who needs to see screen content larger (using, for instance, zoom) might benefit from a more intelligent zoom feature that recognizes which of the visual information can be compressed without losing necessary information9. This can allow zoom features that make content larger without making the overall interface proportionally larger, which can reduce negative side effects of zoom like the introduction of horizontal scrolling. More radically, researchers have explored methods for recognizing screen content and graphically reshaping the contents to match the current user’s abilities and context, directly from pixels13. Despite this work, the imagined fully fluid user interface that can be developed once and then personalized to each person remains elusive.
Speech recognition
Speech recognition (or Automatic Speech Recognition, or Speech to Text) converts aural speech to text. This functionality is useful in a variety of use cases, from directly providing access to speech to enabling control of computing systems using speech.
Automated captioning
Most of us now have familiarity with automated captioning, which is a primary use case for speech recognition technology. Oftentimes, automated captioning is used in virtual meetings and displays the recognized words on screen so that someone who cannot hear the words or who would benefit from both seeing and hearing the words can read them there. The quality of automated captioning has gotten remarkably better over the past few years, and it has become a vital and important accessibility tool.
Given this, it might surprise some to learn that even today the gold standard in captioning is human-provided captioning (real-time stenography). A popular method is CART (Communication Access Realtime Translation), which involves a highly trained stenographer typing on a special stenotype keyboard. Stenographers train to map a large number of character sequences to single chorded keystrokes, enabling them to type at speeds exceeding 200 words per minute.
The main reason human captions remain better, even as automated captioning has gotten much better, is the human’s ability to adapt to context. A human can more easily understand what they are captioning, know which words are likely or unlikely, and be taught new words (new vocabulary) on-the-fly. For a long time, a major weakness of automated approaches was the “out of vocabulary” problem, which refers to the inability of automated captioning systems to produce words that were not in their training data. This was a big problem for many important use cases. For instance, automated captioning of a computer science class might not be able to recognize the technical terms introduced in the class and thus make frequent errors. Current approaches operate at a “token” level17, where tokens are sub-word pieces, and can handle this problem better. Unfortunately, even though token-based approaches can represent previously unseen words, automated captioning still tends to assign lower probabilities to token sequences that weren’t observed in its training data and thus can still struggle to accurately produce rare words. A promising approach to handle this is systems that learn on-the-fly from presentation material7, which is now in popular widely-used presentation programs such as Microsoft PowerPoint.
The most common metric for measuring the performance of speech recognition is Word Error Rate (WER), which is the edit distance between the recognized speech and the ground truth. It is calculated as the minimum number of word deletions, insertions, and substitutions necessary to turn the recognized speech into the ground truth, divided by the number of words in the ground truth. Metrics like this can be somewhat deceiving. Strong-sounding numbers can make it seem like the system is working well, but it might still be insufficient if it is your only way of accessing the content. Imagine a system with a 5% WER (95% of words correct) — that means that roughly 1 out of every 20 words is wrong. Also, given the above explanation, the words that are most likely not to be recognized correctly are the words that might be the most important or new to someone learning about a new area. Meanwhile, the measured quality of human-created captions might be artificially low (their WER artificially high) due to reinterpretations of the content. For instance, we once did a study where a human substituted “someone” for “somebody” — while this difference probably isn’t meaningful for the user, it’s counted as an error the same as if the captioning had read “potato” instead of “somebody”. Many attempts have been made to try to improve upon WER23, but most require human interpretation and thus haven’t been adopted at scale.
Dictation
Another use for speech recognition is text entry. This is a vital accessibility feature for people who cannot type or who find it difficult to type. Unpacking the utility of transcribing dictated speech as a text entry method is similar to doing so for captioning, and many of the same challenges (errors, out-of-vocabulary words, etc.) exist in this application as well. An added challenge in dictation is that users will want to fix the errors made by dictation tools. Thus, speech is often also used to fix the recognized text. One way to think about whether dictation is useful for a person is thus not only the performance of the speech recognition system, but also how long it takes to correct errors. Sometimes slower but more reliable methods are preferred.
Speech for control
Speech and non-speech vocalizations can be used to control devices, including graphical user interfaces. These applications also depend on the accuracy of speech recognition, but oftentimes performance is quite high because the range of what someone will want to say is much smaller. In speech-for-control applications, there is often a small fixed vocabulary. For instance, the command “click the Submit button” only needs to differentiate between the buttons that are currently on screen, and presumably only one button will say “Submit.” An alternative way of interacting with screens via voice is to specify locations on screens and the action that the user wants to perform. Oftentimes, this is done via grid refinement (Figure 3), where the user recursively selects a location by naming a grid square and then, upon reaching a fine enough location, issues a command, e.g., “click.”
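A minimal sketch of the grid-refinement idea follows: each spoken grid-cell number narrows the active rectangle, and a final command acts at the refined location. The 3x3 grid, cell numbering, and screen size are illustrative assumptions, not a description of any particular product.

```python
from dataclasses import dataclass

@dataclass
class Rect:
    x: float
    y: float
    w: float
    h: float

def refine(rect: Rect, cell: int, rows: int = 3, cols: int = 3) -> Rect:
    """Narrow `rect` to one cell of a rows x cols grid; cells are numbered
    1..rows*cols, left to right, top to bottom (an illustrative scheme)."""
    r, c = divmod(cell - 1, cols)
    return Rect(rect.x + c * rect.w / cols,
                rect.y + r * rect.h / rows,
                rect.w / cols,
                rect.h / rows)

# Saying "five", then "one", then "click" narrows a 1920x1080 screen and clicks
# the center of the resulting cell.
target = Rect(0, 0, 1920, 1080)
for spoken_cell in [5, 1]:
    target = refine(target, spoken_cell)
click_x, click_y = target.x + target.w / 2, target.y + target.h / 2
print(click_x, click_y)
```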

Non-speech vocalizations can also be used to control user interfaces. The Vocal Joystick project enabled flexible analog control by exposing a continuous 360-degree vocal range so that users could freely control the virtual cursor19. For those unable to voice words, sounds can be mapped onto actions. This is exposed in products like Apple’s “Sound Actions” feature, which allows users to choose among 13 non-speech sounds to control their devices.
Recognizing everyone’s speech
Not everyone’s speech sounds the same. This can be due to a variety of reasons, such as accent, dialect, or disability. In accessibility, we recognize that speech recognition systems often don’t work nearly as well for people with various kinds of speech impairments, including stuttering and dysarthria. One answer, of course, is to include more people with disabilities in data collections. Interestingly, the technical design of speech recognition systems sometimes hardcodes assumptions about abilities. For instance, most speech command systems include what is called an “end pointer”, which is the part of the system that determines when the system thinks that the user has stopped speaking. In one study6, it was demonstrated that simply increasing the end pointer to allow more time for a user to finish speaking made the system more accessible to a wide variety of people, including people who stutter. Efforts are also underway in the research community to develop speech recognition systems that can better accommodate a wider variety of kinds of speech, e.g., dysarthric speech18. Challenges in recognizing atypical speech include a lack of sufficient data and wide variation in how speech is produced, both across people and even by individual speakers.
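To show how a single hardcoded ability assumption can live inside a recognizer, here is a toy endpointer sketch: it groups audio frames into utterances and only ends one after a fixed duration of silence. The frame representation, timings, and voice-activity check are all made up for illustration; the point is simply that a longer endpoint keeps a command intact for someone whose speech includes longer pauses or blocks.

```python
def segment_utterances(frames, is_speech, endpoint_ms=900, frame_ms=30):
    """Group audio frames into utterances, ending one only after `endpoint_ms`
    of silence (values are illustrative, not from any particular product)."""
    utterances, current, silence = [], [], 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silence = 0
        elif current:
            silence += frame_ms
            if silence >= endpoint_ms:
                utterances.append(current)   # endpoint reached: close the utterance
                current, silence = [], 0
            else:
                current.append(frame)        # keep short pauses inside the utterance
    if current:
        utterances.append(current)
    return utterances

# Toy audio: speech, a 600 ms pause mid-command, more speech, a short pause, more speech.
frames = [1] * 10 + [0] * 20 + [1] * 10 + [0] * 5 + [1] * 10
voiced = lambda f: f == 1

print(len(segment_utterances(frames, voiced, endpoint_ms=300)))  # 2: the pause splits the command
print(len(segment_utterances(frames, voiced, endpoint_ms=900)))  # 1: a longer endpoint keeps it whole
```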
Text to speech (speech generation)
Converting text to speech is a vital function for enabling a wide variety of accessibility technologies, from screen readers for non-visual accessibility to reading tools for people with a variety of learning or cognitive disabilities. Text-to-speech (TTS) systems have evolved significantly over the years, progressing from early concatenative speech synthesis to more recent machine learning-based approaches. In the early days of TTS, systems relied on concatenative speech synthesis, where pre-recorded snippets of human speech were stitched together to form words and sentences. While this approach provided somewhat intelligible speech, it often lacked naturalness and flexibility, as the system’s output was constrained by the available recorded segments and the listener inevitably hears where the sounds were connected together. Improvements in speech synthesis algorithms and computer processing power led to the development of rule-based and parametric TTS systems, which allowed for more dynamic and expressive speech generation by modeling the various aspects of speech production.
In recent years, the field of TTS has witnessed a transformative shift towards machine learning-based approaches, specifically deep learning techniques like deep neural networks (DNNs) and recurrent neural networks (RNNs). Modern TTS systems are often referred to as neural TTS. Neural TTS models learn to generate speech directly from text inputs by training on large datasets of text and corresponding speech recordings. This approach has several advantages, including the ability to generate more natural and expressive speech, adapt to various speaking styles, and even mimic specific voices. Moreover, neural TTS systems can be fine-tuned to accommodate different languages and dialects, making them highly versatile. As technology continues to advance, we can expect further improvements in TTS systems, enabling more lifelike and human-like speech synthesis for a wide range of applications, from accessibility services to voice assistants and entertainment media.
It is worth noting that even as TTS has improved, those improvements have not always been in line with what accessibility use cases require. For instance, naturalness is not necessarily what a screen reader user is most concerned about in their TTS voices – instead, they often prioritize the ability for the voice to be sped up to facilitate faster information consumption and user interface navigation, which hasn’t always been possible with neural TTS voices. While this too is improving, it’s an interesting reminder of how accessibility and real use may focus attention on different metrics.
Personalized text-to-speech
A powerful application of TTS technology is replicating the sound of a person’s voice. This is often done for someone who is losing their voice, which can be caused by degenerative diseases, surgeries, or other conditions. Personalized TTS is designed to emulate the specific timbre, intonation, and idiosyncrasies of an individual’s voice, usually based on recordings of the person’s voice while it was still intact. The process of collecting and storing voice recordings for use later is called “voice banking.”
One of the most noteworthy trends in this area is that the amount of audio data required to create a realistic TTS voice has been going down dramatically. Initially, voice banking required hours of recorded speech to generate a convincing synthetic voice, which could be difficult and onerous for people to provide (especially, before they needed it). However, advances in machine learning have substantially optimized the process. These days, thanks to neural network-based models and techniques like transfer learning, it is possible to generate lifelike TTS voices with just a few minutes of recorded speech. This has not only made voice banking more accessible and convenient but has also opened the door for spontaneous personalization, where users can quickly generate customized TTS voices on-the-fly.
Natural language processing
Natural language processing (NLP) is the subfield of artificial intelligence that considers language. A number of accessibility applications involve language directly – for instance, a variety of tools have been developed to support people with dyslexia, to simplify language for people with reading or cognitive disabilities, and to map language to visual or other modalities to support people with aphasia. As with computer vision, oftentimes NLP is useful in accessibility applications that aren’t as directly connected with language. For instance, it can be useful to summarize content on a complicated screen for a person using a screen reader to provide a quick overview. Common challenges when working with text and language include assessing whether the text that is produced is high quality. For instance, does a text simplification system change the meaning of text while simplifying it?
Predictive text
A common use of NLP across many accessibility applications is predictive text, which is used in a wide range of features that suggest what a user might want to type next. This can be especially useful in accessibility use cases because many people with disabilities find it slower to type. As examples, consider people who use a small number of switches to operate a software keyboard or people who use their eyes to type. In both of these cases text entry is slow, and so predictive text became useful to these users earlier than it did for mainstream users. These days many of us use predictive text across a variety of applications, from email to texting. One issue of concern with predictive text is whether users are truly typing what they want to say or whether they might accept the predicted text because it is easier to choose21.
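As a minimal illustration of predictive text, the sketch below builds a tiny bigram model from prior text and suggests likely next words; real predictive keyboards use far larger neural language models, and the training text here is invented.

```python
from collections import Counter, defaultdict

# A tiny bigram "language model" built from whatever text the user has produced
# before; this only illustrates the idea behind next-word suggestion.
history = "i want to go to the store i want to eat lunch i want to go home"

bigrams = defaultdict(Counter)
words = history.split()
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1

def suggest(prev_word, k=3):
    """Return up to k of the most frequent next words given the previous word."""
    return [w for w, _ in bigrams[prev_word].most_common(k)]

print(suggest("want"))  # ['to']
print(suggest("to"))    # e.g., ['go', 'the', 'eat']
```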
Accessibility Considerations in Mainstream AI/ML Applications
The previous section described how different areas of AI/ML map onto accessibility. In this section, we consider the accessibility of AI/ML applications not specifically designed for accessibility. At a high level, ensuring that released applications are accessible to everyone is the same task regardless of whether one is developing a product with AI/ML or not. Substantial work has been devoted to thinking about how one might make software of various kinds accessible, which generally boils down to ensuring that software has the necessary metadata and API hooks to enable it to be described and controlled in a variety of ways. For applications involving AI/ML, we have further considerations about models that may not work for people with disabilities, or may not include information relevant to people with disabilities.
One way of looking at digital accessibility is through the lens of “ability assumptions”, which are the assumptions built into a software program about what abilities someone needs to have in order to use that software program. In traditional software, this can be as simple as assuming someone will be using a mouse. A famous court case involving Target was, at its core, largely about a website where developers chose to use the `onmousedown` JavaScript event handler instead of the `onclick` event handler for a button that was important for being able to use the site. A seemingly small difference ends up having a big consequence in terms of how one can use the site – with the mouse-based event handler, only people using a mouse can use it, whereas with the click handler (a higher-level concept), people using a variety of input devices, including a keyboard or screen reader, could use the site.
In AI/ML the issues are related to this, although the solutions are not always as simple as changing the method used to attach an input event. Consider speech recognition (introduced in the accessibility context in the previous section). Many mainstream applications can be highly useful to people with disabilities – for instance, voice agents (e.g., Amazon Alexa, Apple Siri, etc.) can be connected to smart home devices to enable voice control of such technologies, which can be highly useful to someone for whom physically interacting with those things is difficult. Yet, as the modality moves from tangible to voice, the accessibility concerns change to who can and cannot use speech and whose speech is or isn’t recognized by speech assistants.
What does it mean to make speech recognition accessible? Current voice agents do not recognize everyone’s speech equally – this has been noted for accents and dialects, and also for different kinds of speech impairments related to disability, including stuttered speech and dysarthria. This difference in performance can be traced back to a number of technical decisions and limitations, and performance can often be made better through explicit consideration of different kinds of speech. As one example, speech command recognition can be improved by extending the “endpointer” (time the system allows after it last detects speech before deciding the command is complete)6 . Some speech is much more difficult for these systems to understand because it is highly varied, such as more severe dysarthric speech, and on-going data collection and research needs to be done to improve speech recognition for such speech. Ultimately, as one participant in a study once told us – if other people can understand me, then I think voice assistants should also understand me. That is a high bar in many cases, but also speaks to the standards people increasingly have for AI.
Design of AI/ML Technology for Accessibility
As covered already in this chapter, accessibility can be a fruitful window into the future of technologies in AI and machine learning; yet, it is also fertile ground for technologies that seem like solutions but are misaligned with real user needs or insufficient for truly addressing problems. Examples of these have included stair-climbing wheelchairs (robotics), sign language recognition gloves (multimodal machine learning), and electronic canes (computer vision). All of these can seem on the surface to be good ideas, and recognizing their limitations and the current misalignment between performance and user needs can be subtle.

It can be useful to think through why the reaction among people with disabilities to these technologies is often skepticism and why such technologies are often abandoned even when produced as products. As an example, a stair-climbing wheelchair might seem like a great idea, since many places are still only accessible via stairs, which are quite difficult to use for people who use wheelchairs. Yet, stair-climbing wheelchairs tend not to be the solution that they seem to be. First, creating technology to deal with what is an infrastructure problem strikes many as backwards – why should someone need to purchase an incredibly expensive and complicated stair-climbing wheelchair, when we could instead change the environment to add ramps and other accessible alternatives? The expense is a big issue, since fancy technologies like stair-climbing wheelchairs can cost many multiples of a regular wheelchair. Stair-climbing wheelchairs have other drawbacks as compared to other wheelchairs—they are likely much heavier than a regular wheelchair, which means they might not be practical to transport from place to place, and the complicated mechanisms that look so impressive climbing up the stairs are likely prone to breaking more often.
Reactions to sign language translation gloves and smart canes are similar, for related if slightly different reasons. Sign language translation gloves have rarely interpreted more than a small handful of signs, whereas sign languages are rich combinations of myriad signs, able to be modified, placed, and combined spatially, with the signer’s face playing a vital role. Smart canes often recognize content in the world that is irrelevant or already knowable in a simpler way. Such projects have sometimes discounted the tremendous utility of the cane itself for gathering and conveying information—a skilled cane user can understand a lot about the upcoming terrain, the surface of the ground, and the relative locations of all of this.
Perhaps most confusingly, it’s not that these technologies couldn’t be useful if they truly worked well enough and were made practical in terms of cost and maintenance. Recent progress in sign language translation, riding on the coattails of large multimodal modeling and new large-scale datasets, seems, for arguably the first time, quite promising. Some sort of sensing solution for people who are blind could help with important information that canes do not provide access to—e.g., overhanging obstacles above ground level. Ultimately, building accessible technologies with AI/ML requires understanding the potential users of the technology, appreciating the utility of existing practices and technology, and not overpromising regarding what the AI/ML project may be able to do relative to these identified user needs and existing solutions.
Pedagogical Content Knowledge
Artificial intelligence and machine learning make for a difficult material to build useful things with. By definition, machine learning systems extend beyond our ability to fully understand or control. If we could write a rule for everything we wanted the system to do, we would not need or benefit from using machine learning to do it. Thus, teaching about designing useful tools that utilize machine learning requires conveying this obvious but also subtle point.
Accessibility adds further complexity. How does one convey the complexity of AI/ML and also the nuance and depth of disability in a single course or (gasp!) class? For most students, the idea that computers can be used in a way other than the assumed graphical user interface will be new.
When I am asked to teach a single course introduction to accessibility, I do roughly the following:
- I introduce accessibility through the concept of “ability assumptions” – it is remarkable how we take certain abilities, such as visual acuity and high motor dexterity, for granted, and how those assumptions are built into the user interfaces that have become dominant.
- I then pick 3-5 videos from YouTube of people with disabilities using a variety of accessibility features to use computers and phones in different ways. To orient the introduction toward AI/ML technologies, one can highlight the various machine learning technologies used in these demonstrations – for instance, demos of screen readers will use text to speech, demos of captioning will use speech recognition, etc. One or two of the later videos could show using machine learning not to access a computing device but to access content in the world – an easy example to use here is image description applied to the camera view.
- Finally, with that background, I discuss one or more research challenges in accessibility to show how interesting, exciting, and technical the field can be. I try to discuss both projects that are obviously “cool” – e.g., personal TTS voices for someone who is losing their voice, and, also something that is a bit more subtle, e.g., using computer vision to recognize the UI elements in an inaccessible interface. I also introduce technologies that have become tropes of tech misaligned with user needs, e.g., smart canes, sign language gloves, stair-climbing robots. The goal here is to make tangible how user needs must drive technological advancement. The difficult concept here is to convey the nuance of why there would be negative reaction to these technologies, which can seem to be at worst neutral and could even be eventually useful. In HCI and Design courses, it is common to say “you are not the user” but when designing for accessibility students truly need to be aware of their assumptions, biases and lack of knowledge.
- In a longer course, I will introduce disability studies topics more deeply than is conveyed via the initial “ability assumptions” discussion in (1). Even in a technical course, it is important to convey the history of disability and disability rights. The history and on-going discrimination that people with disabilities face is part of understanding the deeper reaction and relationship to accessible technology that people with disabilities have. While the field is much deeper than this, I find introducing the social and medical models of disability to be a useful framing that directly translates to technology design. Many students who are new to accessibility will have a medical model framing (roughly, the goal being to “fix” a person). The social model, in contrast, holds roughly that it is the world and the way we have designed infrastructure that causes inaccessibility. I have also found it useful to discuss the Interdependence framing for assistive technology4, generally following an introduction to the notion of Independence.
Conclusion
The practice and history of accessible technology is deeply entwined with that of artificial intelligence. People with disabilities have often been the earliest adopters of AI technology, and thus it is interesting to learn from their experiences as a way to understand the future. Every subfield of artificial intelligence has applications in accessibility, and some are core to enabling access technologies to work at all. At the same time, AI technology is taking off in the mainstream, and we already have examples of how people with disabilities are not being included in these new technologies. Given that the dominant method of creating AI technologies is now to use machine learning, we must take special care to ensure that systems are trained and evaluated on the diversity of abilities and experiences that people with disabilities have. Machine learning, with its potential to adapt to myriad new situations, offers one of our best chances yet to develop technologies that truly work for everyone; yet, even as we embrace this potential we should not lose sight of the real people and the real uses they have for this technology. Ultimately, AI/ML technologies grounded in the experiences and needs of real people have enormous potential to make our interactions with computers, the world around us, and other people more accessible to everyone — let’s work together to build that future!
References
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17). Curran Associates Inc., Red Hook, NY, USA, 6000–6010.
Blaise Aguera y Arcas (2021). “Do large language models understand us?”. Medium.
Cynthia L. Bennett, Cole Gleason, Morgan Klaus Scheuerman, Jeffrey P. Bigham, Anhong Guo, and Alexandra To (2021). “It’s Complicated”: Negotiating Accessibility and (Mis)Representation in Image Descriptions of Race, Gender, and Disability. https://doi.org/10.1145/3411764.3445498.
Cynthia L. Bennett, Erin Brady, and Stacy M. Branham (2018). Interdependence as a Frame for Assistive Technology Research and Design. In Proceedings of the 20th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '18).
Cole Gleason, Patrick Carrington, Cameron Cassidy, Meredith Ringel Morris, Kris M. Kitani, and Jeffrey P. Bigham (2019). “It's almost like they're trying to hide it”: How User-Provided Image Descriptions Have Failed to Make Twitter Accessible. In The World Wide Web Conference (WWW '19). Association for Computing Machinery, New York, NY, USA, 549–559.
Colin Lea, Zifang Huang, Jaya Narain, Lauren Tooley, Dianna Yee, Dung Tien Tran, Panayiotis Georgiou, Jeffrey P Bigham, and Leah Findlater (2023). From User Perceptions to Technical Improvement: Enabling People Who Stutter to Better Use Speech Recognition. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 361, 1–16.
Hiroki Yamazaki, Koji Iwano, Koichi Shinoda, Sadaoki Furui, and Haruo Yokota (2007). “Dynamic Language Model Adaptation Using Presentation Slides for Lecture Speech Recognition.” INTERSPEECH.
Joseph Weizenbaum (1966). "ELIZA—a computer program for the study of natural language communication between man and machine." Communications of the ACM 9.
Jeffrey P. Bigham (2014). Making the web easier to see with opportunistic accessibility improvement. In Proceedings of the 27th annual ACM symposium on User interface software and technology (UIST '14). Association for Computing Machinery, New York, NY, USA, 117–122.
Jutta Treviranus (2018). “Sidewalk Toronto and Why Smarter is Not Better.” October 30, 2018.
Jeremy T. Brudvik, Jeffrey P. Bigham, Anna C. Cavender, and Richard E. Ladner (2008). Hunting for headings: sighted labeling vs. automatic classification of headings. In Proceedings of the 10th international ACM SIGACCESS conference on Computers and accessibility (ASSETS '08).
Jeffrey P. Bigham, Ryan S. Kaminsky, Richard E. Ladner, Oscar M. Danielsson, and Gordon L. Hempton (2006). WebInSight: making web images accessible. In Proceedings of the 8th international ACM SIGACCESS conference on Computers and accessibility (Assets '06).
Jason Wu, Titus Barik, Xiaoyi Zhang, Colin Lea, Jeffrey Nichols, and Jeffrey P. Bigham (2022). “Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements.” https://doi.org/10.48550/arXiv.2207.07712.
Kate S Glazko, Momona Yamagami, Aashaka Desai, Kelly Avery Mack, Venkatesh Potluri, Xuhai Xu, and Jennifer Mankoff (2023). An Autoethnographic Case Study of Generative Artificial Intelligence's Utility for Accessibility. In Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '23). Association for Computing Machinery, New York, NY, USA, Article 99, 1–8.
OpenAI (2023). Retrieved 11/9/2023. https://openai.com/customer-stories/be-my-eyes.
Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, and Jeffrey P. Bigham (2019). “Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References.” SIGDIAL.
Rico Sennrich, Barry Haddow, and Alexandra Birch (2015). “Neural Machine Translation of Rare Words with Subword Units.” August 31, 2015.
Joel Shor, Dotan Emanuel, Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, Avinatan Hassidim, Yossi Matias (2019). Personalizing ASR for Dysarthric and Accented Speech with Limited Data. INTERSPEECH 2019.
Susumu Harada, James A. Landay, Jonathan Malkin, Xiao Li, and Jeff A. Bilmes (2006). The vocal joystick: evaluation of voice-based cursor control techniques. In Proceedings of the 8th international ACM SIGACCESS conference on Computers and accessibility (Assets '06). Association for Computing Machinery, New York, NY, USA, 197–204.
Shaomei Wu, Jeffrey Wieland, Omid Farivar, and Julie Schiller (2017). Automatic Alt-text: Computer-generated Image Descriptions for Blind Users on a Social Network Service. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW '17). Association for Computing Machinery, New York, NY, USA, 1180–1192.
Stephanie Valencia, Richard Cave, Krystal Kallarackal, Katie Seaver, Michael Terry, and Shaun K. Kane (2023). “The less I type, the better”: How AI Language Models can Enhance or Impede Communication for AAC Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 830, 1–14.
Tim Berners-Lee and Daniel Connolly (June 1993). "Hypertext Markup Language (HTML) Internet Draft version 1.2".
Tom Apone, Brad Botkin, Marcia Brooks, and Larry Goldberg (2011). “Research into Automated Error Ranking of Real-time Captions in Live Television News Programs.” September 2011.
Xiaoyi Zhang, Lilian de Greef, Amanda Swearngin, Samuel White, Kyle Murray, Lisa Yu, Qi Shan, Jeffrey Nichols, Jason Wu, Chris Fleizach, Aaron Everitt, and Jeffrey P Bigham (2021). Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (CHI '21). Association for Computing Machinery, New York, NY, USA, Article 275, 1–15.