Inspiration
Digital applications are a game changer for people with disabilities, when they are built with accessibility in mind. A meal ordering app; accessible banking that avoids the hassle of travel; audio in video games that allows blind people to play -- and win! These are all real scenarios that enrich everyone's lives, but especially the lives of people with disabilities.
None of this would be possible without well-crafted UI. But even when product teams take accessibility into account, gaps and roadblocks remain. To date, blind people have relied on sighted assistance and support, whether from a colleague, a customer support desk, or a volunteer service like Be My Eyes. These are effective, but they are also burdensome and can compromise a person's privacy.
AI can bridge these gaps in accessible content by providing an "assist" for blind people using the web, while maintaining privacy, convenience, and ease of use. I set out to help solve this by integrating Gemini Vision with Chrome. Google Chrome is ubiquitous in the blind community, with over 50% using it as their primary browser according to the most recent WebAIM screen reader survey.
What it does
Image Describer is a Chrome Extension that uses AI to describe content in the browser for blind and low vision people. A person can request a description either through the context menu or with a keyboard shortcut. The description is announced back to the user, and is also available as text for review.
The extension also returns followup questions with pre-loaded responses, which support a call-and-response style interface without long delays between requests.
How I built it
I created a Chrome Extension using CRXJS that sends base64-encoded image data to a service running on Google Cloud Platform. The service proxies images to the Gemini Vision API, along with a prompt that asks for the description to be split into bite-sized chunks. The background script in the Chrome extension then parses the response and announces it to the user via Chrome's TTS engine. The response is also presented to the user in the extension popup, where they can choose from a list of followup questions, also generated by Gemini.
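Roughly, the background script's flow looks like the sketch below. This is not the actual implementation; the endpoint URL, menu item ID, and helper names are placeholders I'm using for illustration:

// Placeholder URL for the GCP service that proxies requests to Gemini Vision.
const DESCRIBE_ENDPOINT = "https://describer-service.example.run.app/describe";

interface DescribeResponse {
  description: string;
  followups: { question: string; answer: string }[];
}

// POST the base64-encoded image to the proxy and parse the JSON response.
async function describeImage(imageBase64: string): Promise<DescribeResponse> {
  const res = await fetch(DESCRIBE_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ image: imageBase64 }),
  });
  return (await res.json()) as DescribeResponse;
}

// Convert a fetched image blob to base64 (works inside an MV3 service worker).
async function blobToBase64(blob: Blob): Promise<string> {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let binary = "";
  bytes.forEach((b) => (binary += String.fromCharCode(b)));
  return btoa(binary);
}

chrome.contextMenus.onClicked.addListener(async (info) => {
  if (info.menuItemId !== "describe-image" || !info.srcUrl) return;
  const blob = await (await fetch(info.srcUrl)).blob();
  const { description } = await describeImage(await blobToBase64(blob));
  // Announce the short description through Chrome's built-in TTS engine.
  chrome.tts.speak(description, { enqueue: false });
});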
I use a simple single-shot prompt like so:
const prompt = `This image was submitted by a blind person to get an image description, with the question "What's in this image?".
Provide a succinct response in under 10 words that will be announced back to the user.
Also provide a list of 5 questions that the user could ask about the photo based on the provided description,
along with the answers to those questions. Return your response as valid machine-readable JSON with 2 keys, 'description' and 'followups',
where 'followups' is an array of objects with keys 'question' and 'answer'. For example:
Prompt: [image]
Response: {"description":"add a short description here","followups":[{"question": "Add a question here", "answer": "Add an answer here"}, ...]}
`;
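On the server side, the proxy forwards the image and prompt to Gemini. This is not the actual service code, just a minimal sketch assuming an Express app and the @google/generative-ai Node SDK; the route, model name, and environment variable are placeholders:

import express from "express";
import { GoogleGenerativeAI } from "@google/generative-ai";

const app = express();
app.use(express.json({ limit: "10mb" })); // base64 images can be large

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY ?? "");
const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

app.post("/describe", async (req, res) => {
  const { image } = req.body; // base64-encoded image from the extension
  const result = await model.generateContent([
    prompt, // the single-shot prompt shown above
    { inlineData: { data: image, mimeType: "image/jpeg" } },
  ]);
  // The model is asked for machine-readable JSON; strip stray code fences
  // before parsing, since responses occasionally arrive wrapped in them.
  const text = result.response.text().replace(/```(json)?/g, "").trim();
  res.json(JSON.parse(text));
});

app.listen(8080);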
Challenges we ran into
At first the descriptions were very long and detailed, which was useful, but I heard feedback that it could be too tedious for blind users. To address this, I added a "quick description" option, and eventually moved to a progressive disclosure pattern, where the initial description is brief and there is an option to present followup questions.
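As a sketch of what that progressive disclosure can look like in the popup (the element ID is hypothetical): the brief description is spoken right away, and a pre-loaded answer is only spoken when the user picks a followup.

// Announce the short description first, then read a pre-loaded answer on demand.
function renderDescription(payload: {
  description: string;
  followups: { question: string; answer: string }[];
}) {
  chrome.tts.speak(payload.description);

  const list = document.getElementById("followups")!; // illustrative element ID
  payload.followups.forEach(({ question, answer }) => {
    const button = document.createElement("button");
    button.textContent = question;
    // The answer is already in the payload, so no extra API round trip.
    button.addEventListener("click", () => chrome.tts.speak(answer));
    list.appendChild(button);
  });
}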
Also, initially the extension could only be triggered with the context menu, which made it hard for screen reader users to invoke the description. In response, I added a keyboard shortcut to describe the full tab instead of just a single image.
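The shortcut is registered under the manifest's "commands" key and handled in the background script. A minimal sketch, assuming chrome.tabs.captureVisibleTab is used to grab the visible tab; the command name, suggested shortcut, and describeImage helper are illustrative:

// manifest.json (excerpt, shown here as a comment):
// "commands": {
//   "describe-tab": {
//     "suggested_key": { "default": "Alt+Shift+D" },
//     "description": "Describe the current tab"
//   }
// }

chrome.commands.onCommand.addListener(async (command) => {
  if (command !== "describe-tab") return;
  // Capture the visible tab as a JPEG data URL (requires the "activeTab"
  // permission), then strip the prefix to get the base64 payload.
  const dataUrl = await chrome.tabs.captureVisibleTab({ format: "jpeg" });
  const base64 = dataUrl.split(",")[1];
  const { description } = await describeImage(base64); // helper from the sketch above
  chrome.tts.speak(description, { enqueue: false });
});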
Accomplishments that we're proud of
I'm especially proud of the insight to use a progressive disclosure approach to reduce the verbosity of the messages without losing detail.
What we learned
I started by using OpenAI's Vision API, but switched to Gemini because the responses were more succinct and to the point, which was important for this application. It was enlightening to see the differences between the two models.
I was surprised that many people who are not blind have been using the extension to help support their teams in creating initial alt text for images. This was an unexpected but welcome use case.
What's next for Image Describer
I'm looking forward to continuing to respond to feedback from the blind community through Discord and Reddit groups to refine features. I'm also eager to make the followup questions more relevant and useful through more refined prompt engineering, and to generate more detailed followup responses and even conversation trees, likely across multiple Gemini API calls.