For those of you not keeping up: at the end of 2022, ChatGPT gave us a computer that can understand and generate English. Not just English, but human language in general. On the face of it, that can seem like a small step. It’s not.
Yes, it struggles with facts and up-to-date data, but that part is solvable and often not as critical as you’d think.
If that wasn’t revolutionary enough for you, the next shift is generative artificial intelligence (GenAI) going multi-modal. ChatGPT started out as text-to-text. Multi-modal is the other stuff:
- Text-to-Audio. Generating voice from text. Companies like ElevenLabs, Google and Amazon can now create both pre-trained voices and custom cloned voices that are in many cases indistinguishable from a human voice. Everybody has access to these tools - it’s really easy.
- Audio-to-Text. Apple, Amazon and Google have been doing this for years. Good luck if you have an accent. Today, via GenAI, it’s vastly improved - even with that accent. Otter.ai and OpenAI’s Whisper are two of my favorites (see the short sketch after this list). ChatGPT now offers audio in, audio out as an option, so you can bypass the text altogether.
- Text-to-Image. This area has improved rapidly in the last year. OpenAI’s DALL-E was fun to play with a year ago. Today, the free Bing.com/create comes really close to producing usable, photo-realistic images. Midjourney is also right there. SHOULD you be generating images for your travel marketing campaigns? I’m staying out of that argument.
- Image-to-Text. This just happened in the last couple of weeks as part of ChatGPT. Take a photo, upload a sketch, whatever you like, and ask the GenAI to describe, deduce, tell a story … anything you want.
- Text-to-Video and Video-to-Text. These aren’t really any different from the technology for images, except there’s a lot more data (more compute, more $) to process. They’re not ready for prime time yet, but they’re coming, and they’re fun to play with - just like images were a year ago.
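If you want a sense of just how accessible this has become, here’s a minimal sketch of the audio-to-text piece using OpenAI’s open-source Whisper package. The model size and file name are placeholders, and it assumes you’ve installed openai-whisper (plus ffmpeg) yourself - a sketch, not a recipe.

```python
# Minimal audio-to-text sketch using the open-source Whisper package.
# Assumes: pip install openai-whisper, with ffmpeg available on the system.
# The audio file name is illustrative.
import whisper

model = whisper.load_model("base")            # small model; larger ones are more accurate
result = model.transcribe("hotel_review.mp3") # copes with accents far better than older dictation tools
print(result["text"])
```

A few lines, and you get a usable transcript - which is the point.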
Take any of these modes in isolation and each is significant on its own. But connect them all together seamlessly and it starts to get really interesting.
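To make “connected together” concrete, here’s a rough sketch of one such chain using the OpenAI Python SDK: a photo goes in, a text answer comes back, and the answer is read aloud. The model names, image URL and file names are illustrative assumptions, not a prescription.

```python
# Hedged sketch: image-to-text chained into text-to-audio with the OpenAI Python SDK.
# Assumes an OPENAI_API_KEY in the environment; model names, URL and file names are illustrative.
from openai import OpenAI

client = OpenAI()

# 1. Image-to-text: ask a vision-capable model about a photo.
vision = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model would do
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What landmark is this, and how would I visit it?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/hotel-window-view.jpg"}},
        ],
    }],
)
answer = vision.choices[0].message.content

# 2. Text-to-audio: read the answer back as speech.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.stream_to_file("answer.mp3")
```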
Consider the new Ray-Ban Meta smart glasses. They’re basically a multi-modal large language model (LLM - think ChatGPT) that you wear on your face. You can do the basics like listen to music or take photos and videos. But really, they’re a frictionless way to connect what you see, hear and say to the world’s most powerful (in a practical sense) computer: an LLM.
Across all the human senses, it’s a fair argument that vision and hearing are how we consume most data, and voice and text are how we communicate most of it. All of these are now functioning and connected.
Today, we’re already using many of these tools, but they often don’t function very cohesively. It’s still a bit patched together. This will change.
The end of the phone?
To be clear, I’m not really saying it’s the end of the phone. Yet. We still need somewhere to view cat photos or play Candy Crush. But this user experience (UX), the screen we all stare at for most of our working day, will become obsolete for many tasks. The screen connected to a keyboard feels very natural to us, but it was only the best way to interact with the technology we had at the time it was invented. Before GenAI, the only practical way to communicate was by typing words into a box or by clicking buttons and drop-down filters.
For many tasks, it’s just not a very efficient way of communicating. It’s a mistake to argue that because this was promised to us 10 years ago with the smart speaker, it has been tried and is doomed to fail forever. Smart speakers weren’t very smart back then. They had no GenAI, and that makes this an entirely different proposition. Today, it’s really easy to search an entire database with a natural-language query and return accurate, relevant results. A year ago, that was almost impossible.
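For the curious, the mechanism behind that kind of natural-language search is usually text embeddings plus a similarity ranking. Here’s a hedged sketch against a tiny, made-up tour catalog using the OpenAI Python SDK - the model name and catalog entries are illustrative, and any embedding model would do the same job.

```python
# Hedged sketch: natural-language search over a tiny, made-up tour catalog
# using text embeddings and cosine similarity. Assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment; the model name is illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

catalog = [
    "Alcatraz Island day tour with audio guide, departs Pier 33",
    "Napa Valley wine tasting, full-day coach trip",
    "Golden Gate Bridge sunset bike ride",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(catalog)
query_vec = embed(["How do I get out to that old prison island in the bay?"])[0]

# Cosine similarity ranks results by meaning, not by keyword overlap.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(catalog[int(np.argmax(scores))])  # -> the Alcatraz tour
```

No keyword in the query matches the catalog text, and it still finds the right product - that’s the difference GenAI-era search makes.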
So now, with my sunglasses on, I can look at Alcatraz out of my hotel window and just say, “Hey Meta, what’s that, and how do I visit?” Without ever looking at a screen, I can be given a set of options, prices and availability - and book it using only my voice.
The sunglasses? I’m not sure if that’s the future. But again, save the Google Glass failure comments. It’s not much different from just pointing your phone at something. The seamlessness is what matters. One fewer click, or two fewer actions like grabbing your phone and navigating to the correct app - these things make a difference.
Fifteen years ago, we heard stories of folks in Asia booking hotels, or even flights, on a smartphone. None of that seemed plausible at the time. Surely that’s a step too far? It’ll never catch on in the West. These things take time. Humans are great at stalling progress in the short term. But eventually, better is just better, and humans find the easiest way to get a task done with the tools they have access to.
Many legacy companies (my new definition: any company over two years old) face an innovator’s dilemma. Their UX is hard-coded into everything they do. It’s not just loyal customers who are invested in the current UX; it’s also the company’s designers, data analysts, marketers, engineers and more, whose roles stem from the current flow and current UX. That’s quite difficult to move away from.
This just might open the door for some new players. There are a lot of startups coming after that market share. Most are trip planners. Most are hard to differentiate from each other right now. But one of them may create the next killer app and turn everything upside down. Exciting times to come.
About the author …
Christian Watts is CEO of Magpie.