I’ve done more testing of multimodal GPT-4 (the model powering Be My AI) over the last few days, most of it with sighted friends, and my impressions are as follows.

The thing is pretty accurate when describing memes, but the descriptions are often far too long and verbose, and the facts are presented in an order that makes the meme less funny than it should be. There’s a fair bit of Christian-fundamentalist prudishness being applied: people who aren’t wearing any clothes are described as “cropped.” The same goes for faces, which are "blurred out for privacy", even when they actually aren't. This would be a sensible privacy precaution, if not for the fact that the blurring also applies to the faces of famous people, which renders many images meaningless.

The algorithm can sometimes notice details that sighted people miss until they're actually pointed out to them. However, it's pretty bad when it comes to actually useful material such as diagrams, figures etc. To give just one example, we gave it a run-of-the-mill diagram of a chessboard, and it described the chess positions in vivid detail while being absolutely wrong about what those positions actually were, as these AI models tend to be.

It's even worse with text, especially foreign-language text. Unlike many OCR algorithms, which produce text full of typos, Be My AI's output is almost always free of those; it makes grammatical sense and is contextually related to what the image actually contains, but the text it claims is in the image often isn't there at all. For example, when we gave it a page of a coffee machine manual in Polish, with pictures and descriptions of the various kinds of coffee the machine can make, it got the coffee names right and the coffee descriptions were plausible, but they were completely different from the descriptions on the actual page!

It's also pretty clear that the tokenizer Be My AI uses was primarily trained on English text. This causes foreign-language output to need more tokens for the same number of characters, which lengthens generation time and, more crucially, often causes the text to be cut off prematurely.

In conclusion, I stand by my opinion that this tool holds great promise for the future, but in its current incarnation it has very limited use for a blind person and is barely more than an occasionally useful but fun toy.
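A rough, stdlib-only sketch of why non-English text tends to cost more tokens. This is a general illustration of how byte-level BPE tokenizers behave, not an inspection of Be My AI's actual tokenizer: words the vocabulary hasn't seen often get split into many small pieces, and non-ASCII characters already cost multiple bytes before any merging starts. UTF-8 byte length is used here as a crude proxy for that effect; the example sentences are my own.

```python
# Crude proxy for token cost: byte-level BPE tokenizers fall back toward
# byte-sized pieces for unfamiliar text, and Polish diacritics take two
# UTF-8 bytes each, so the same sentence starts out "longer" in Polish.

english = "The machine can prepare espresso, cappuccino and latte."
polish = "Urządzenie może przygotować espresso, cappuccino i latte."  # same meaning

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character; 1.0 means plain ASCII."""
    return len(text.encode("utf-8")) / len(text)

print(bytes_per_char(english))  # 1.0 (plain ASCII)
print(bytes_per_char(polish))   # above 1.0: diacritics need multibyte encoding
```

A real measurement would use the model's actual tokenizer (e.g. OpenAI's tiktoken library) to count tokens directly, but the direction of the effect is the same: more bytes per character generally means more tokens for the same text, so a fixed token limit cuts foreign-language output off sooner.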
And to be abundantly clear here, I’m not personally anti-AI by any means. My job heavily involves AI, I use Copilot and ChatGPT heavily in my daily life, I'm pretty convinced by the “AI training should be treated as fair use” arguments, and I support unaligned / open source AI efforts. I think AI is the future, and models like these will probably get good enough for most uses eventually, but that time has not yet come.
And re: AI and its inability to deal with material that contains nudity: there are actually good arguments to be made on both sides here. While I agree that not letting us recognize such pictures is a form of discrimination that treats disabled people as “second class”, not training AI on such images does make some sense. AI makes mistakes, and there are large swaths of society that would be very unhappy if their children had access to a tool that occasionally produced very erotic responses, particularly when fed completely “innocent” images or prompts.
@miki Agreed 100 percent! If, for instance, I take a photo of my digital piano's control panel and buttons, it talks about the piano's screen as if the screen were on when, in fact, the piano is off. So both the placement of the buttons and the description of what the piano displays are wrong.
@miki You'll find this just-published piece helpful regarding other languages as well. It reinforces my findings as well as yours.
Lack of training data and optimization for other languages to blame?
To see how #ChatGPT does with languages that are spoken by millions of people but aren’t common online, we tested it in Bengali, Tamil, Tigrinya, and Kurdish. It failed.
https://restofworld.org/2023/chatgpt-problems-global-language-testing/
@miki This is a very hard problem to solve. There's the task of just describing the image accurately, which is Herculean in itself, but also: which details are the relevant ones the AI should emphasize? I think they will need to train a specific model for each category of image they want described in order to focus on the right things.
@pitermach @miki Agree with all of this. Also wanted to add that it even blurs faces of animals... like why? Took a picture of my guide dog and it blurred her face.
@miki Agreed.
@miki @CoffeeHolic88 This is very interesting. I’m excited about where the technology can take us, but this is definitely the very beginning. What surprises me a little bit is that the AI always seems so confident. Why can’t it just admit when its certainty about something is low? Also, I hope we get to a point where it stops blurring out faces; I'm not looking for it to identify people as such, but they should at least be describable.
@miki @CoffeeHolic88 Also, I recently figured out that even if it blurs the faces, it’s possible to query for other descriptors of the person: what color shirt they're wearing, anything that doesn’t require the face. Super annoying, but it helps make images of people a little less useless.