
Bearing the cross of LLM costs

These days, the only thing worse than failing to leverage generative AI is succeeding at it. Soon after the celebration comes the curse of success: new APIs to maintain and costs to manage. The best protection is, hands-down, a good set of evaluations. But that’s easier said than done, because organizational momentum may take you elsewhere. Here’s what I’m doing about it, and why. See if that resonates, and let me know if I’m missing something obvious!

Two irresistible forces make us all want to upgrade our LLMs. The first is the great token depreciation we’ve been experiencing, which is likely to continue. Some claim that performance improvements in newer small and medium models, paired with their lower costs, have reduced the price of a unit of intelligence by a factor of 1,000. Maybe. What is undeniable is that, at the time of this writing, gpt-4o-mini is 66.6x cheaper than the not-so-old gpt-4-turbo. That’s devilishly difficult to pass up if your constitution allows it!
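The arithmetic is simple enough. The per-token prices below are my assumptions based on published input rates around the time of writing; check current vendor pricing before trusting the ratio:

```python
# Back-of-the-envelope cost comparison. Prices are assumptions based on
# published per-1M-token input rates at the time of writing; verify against
# current vendor pricing before relying on them.
GPT4_TURBO_INPUT_PER_1M = 10.00  # USD per 1M input tokens (assumed)
GPT4O_MINI_INPUT_PER_1M = 0.15   # USD per 1M input tokens (assumed)

ratio = GPT4_TURBO_INPUT_PER_1M / GPT4O_MINI_INPUT_PER_1M
print(f"gpt-4o-mini input tokens are {ratio:.1f}x cheaper")  # ~66.7x
```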

The second force, the one that’ll bend your knees, is a maker calling one of its models to the great yonder (e.g. PaLM2/bison). Like it or not, LLM API upgrades are as inevitable as death and taxes.

This is where evaluations come in. Or at least, that’s when their importance dawned on me. See, when teams first experiment with generative AI, they are just trying to make it work. They are looking to “build the right thing”. Final judgment, of course, comes from users, via either qualitative feedback or measured impact. But when it’s time to upgrade the guts of a product or process already in production with a new LLM, you may be less excited to set up another beta group or run an A/B test. Certainly, I’d rather avoid the time and complexity where possible. If only the thing had been “built right”, so that a label swap was all I needed. In a good set of evaluations lies salvation.

In places where LLMs are paired with the other two members of the generative AI holy trinity, prompts and data (input and output), you implicitly have the ability to compare the performance of the new LLM to the old one. There, ergo, is your eval.
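If you already log prompts alongside the outputs you accepted in production, replaying them against a candidate model is straightforward. Here’s a minimal sketch of that idea, assuming an OpenAI-style chat API; the file layout and scoring function are placeholders, not our actual harness:

```python
# Minimal sketch of an LLM-swap eval: replay stored prompts against a
# candidate model and score its answers against outputs already accepted
# in production. File layout and scorer are hypothetical.
import json

from openai import OpenAI  # assuming an OpenAI-style chat API

client = OpenAI()

def score(candidate: str, reference: str) -> float:
    # Placeholder scorer: exact match after normalization. Real evals
    # usually need task-specific checks or an LLM judge.
    return float(candidate.strip().lower() == reference.strip().lower())

def evaluate(model: str, dataset_path: str) -> float:
    scores = []
    with open(dataset_path) as f:
        for line in f:
            case = json.loads(line)  # {"prompt": ..., "reference": ...}
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
            )
            scores.append(score(resp.choices[0].message.content, case["reference"]))
    return sum(scores) / len(scores)

print(evaluate("gpt-4o-mini", "production_pairs.jsonl"))
```

Run it once with the incumbent model and once with the candidate, and the label swap becomes a number you can defend.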

Where that does not exist, one must become the creator. That’s what we’re doing in places like content destined for third-party platforms with high-consequence policies, or chatbots designed to support precise personas, or… The list goes on. Yes, there are tons of awesome and thoughtful public eval sets, but they weren’t created for your use case. Some of them may correlate with your use case, but that, too, will have to be tested and discovered.
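When that history doesn’t exist, the golden set gets authored by hand. Something like the following, purely illustrative, cases for a persona-constrained chatbot; the schema is hypothetical, and the point is that each case encodes a policy or persona rule you actually care about, not a generic benchmark question:

```python
# Illustrative hand-authored eval cases (hypothetical schema), written in
# the same JSONL shape the replay harness above reads.
import json

golden_cases = [
    {
        "prompt": "Can you diagnose my dog's limp?",
        "reference": "I can help connect you with a veterinarian who can assess that.",
        "rule": "never give medical advice directly",
    },
    {
        "prompt": "Who are you?",
        "reference": "I'm the intake assistant; I gather details before your chat with an Expert.",
        "rule": "stay in persona",
    },
]

with open("golden_set.jsonl", "w") as f:
    for case in golden_cases:
        f.write(json.dumps(case) + "\n")
```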

I wish I had asked the teams sooner to incorporate the creation of an evaluation into their capability designs. I believe that a good eval should now join the set of other things that satisfy the production ilities. Until we atone for this omission, the seductive new batch of models will remain just beyond our grasp. Don’t say I didn’t tell you so.

How we got started on generative AI

Per a16z, 2024 is the year when enterprises are getting serious about putting generative AI in production. And as I learned a few months ago via an MIT Sloan publication, it’ll be the majority of enterprises. Indeed, according to their published survey results, only about 5% of companies had any sort of GenAI in production in 2023. That figure was surprisingly low to me, given how much the world talks about LLMs. Nonetheless, it follows that many organizations are just now getting started on their AI transformation. As I did years ago with Business Agility, I want to share a few experiential learnings in case they help others. Also, the sooner I create a contemporary body of written work, the sooner I can train an AI agent to write in my stead.

Preamble

For my/our work experience to make sense, I believe some background is needed. If you disagree, feel free to skip over (link) this rather lengthy preamble.

Another potential bone of contention is my credibility on the subject. That’s in the eye of the beholder, I suppose. So here’s some context so you can assess. Today at JustAnswer, our GenAI-powered chatbots help our customers find professional services experts – the living, breathing kind – via 7M interactions per month. Moreover, we have deployed GenAI just about everywhere in our operations, from online ad management to the aforementioned chatbot infrastructure to a more personalized customer experience. Finally, employees across all departments use AI creatively, we leverage LLMs from all the big vendors, and our corporate strategy has had an AI pillar for over 15 months.

As far as my credentials are concerned, they’re admittedly less quantitative. However, I am the company CTO and the executive sponsor for a group of machine learning specialists who drive broad education for all employees, joint experimentation with all functions, and purpose-built systems and services for others to use. So you know that I’m surrounded by early adopters, that I’ve had a front-row seat to the process, and even, perhaps, a little influence.

Before I describe the steps we took, it might be useful to understand JustAnswer’s impetus for action when the power of LLMs became obvious in late 2022 because your situation might call for a different approach.

By now, the market consensus is that some companies will ride the GenAI wave and benefit from it while others will have that wave crash upon them to their chagrin. If there are companies altogether outside the path of the wave, I don’t know of them. But we can agree that not all organizations are strategically equidistant from it (again, context matters), so the urgency for action may differ.

JustAnswer is pretty close to that wave. See, the company seeks to democratize access to professional services by enabling consumers to have their issues resolved by qualified pros across a wide range of domains like medical, auto mechanic, veterinary, antique appraisal, legal, home improvement, and so forth. JustAnswer broadens that access via online matchmaking and lowers the cost by primarily offering a lightweight chat-based interaction.

Questions. Answers. Chat interface. See where I’m going?

ChatGPT coming online was a very loud confirmation of what our data scientists had been saying: AI is getting pretty good at answering written questions. Given the nature of our product, should we not be paying attention? They had been poking around the OpenAI Playground since its inception but hadn’t yet been able to convince the managerial class I represent to dial up our investments in AI. But from that point on, it wasn’t a matter of if we were going to embrace LLMs, but rather how.

Enough with the potatoes, here’s the meat

When ChatGPT said “Hello world”, we had a low double-digit number of employees sufficiently familiar with generative AI to understand its potential, and fewer still with any hands-on experience. I cannot claim to have been part of that group. Rather, at the time, I was with the majority of our 1,000 employees. So if we were going to take this seriously, we were going to have to grow the number of employees skilled in AI by two orders of magnitude, starting with yours truly.

The first thing we did was run an internal education campaign on what GenAI is and why it matters. Thought leaders and titled leaders started posting regularly on our internal channels. We wrote about what GenAI is, what it can do, and how it could apply to us. Our goal was to create interest. We then created a few dedicated internal channels where technical and non-technical communities could go deeper into their areas of interest. This was important because a publication bar requiring every post to be of interest to all employees would be discouraging to authors. A bit like I’m feeling now writing this.

After the “tell”, came the “show”. Our most advanced users hosted demonstrations. We started with the basics, such as how to use ChatGPT, what prompt engineering is, what sort of tasks LLMs can readily do, etc. JustAnswer’s employees are a smart and curious bunch, so these demos triggered demand for more hands-on workshops, which our early experts – many in our ML group – so adroitly set up. We owe these folks a debt of gratitude for bootstrapping our GenAI adoption.

By early 2023, the potential of the LLM wave was clear enough that we declared learning to surf it a corporate priority. That declaration trickled down through the management operating system: goals and objectives, investment choices, rewards and recognition, etc. I believe this executive support was essential in overcoming our internal inertia. After all, our leaders and teams were already busy working on plans they liked and which we’d previously approved. Also, it was now clear that AI was going to be more transformative than the run-of-the-mill adoption of a new tool or business process. So we’d be faced with our own internal customer adoption curve, like any transformation. As I wrote years ago (ref), I believe that real transformations require exec support. Luckily, my colleagues on the exec team saw the same need, so we were able to move swiftly.

Of particular importance was the decision we made to dedicate an agile team to creating a generative AI “paved path” for others to use. We did not mandate the use of this emerging corporate capability; rather, we bet that while a proof-of-concept is trivial for someone to create with, say, the OpenAI API, production systems are much more complex, what with all the ilities (ref), so teams would come to the paved path on their own. I may go into detail another time, but in short, we created an LLM proxy service anyone could use. That proxy service became a major enabler. I don’t always make great decisions, but this one has paid off in spades.
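I won’t detail our proxy here, but to make the idea concrete, here’s a minimal sketch of what such a paved-path service could look like, assuming a single OpenAI-compatible upstream; a real service would also handle routing across vendors, quotas, retries, and redaction:

```python
# Minimal sketch of an internal LLM proxy (hypothetical; our actual service
# is more involved). It centralizes API keys, logging, and model choices so
# product teams never talk to vendors directly.
import os

import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Assumption: one OpenAI-compatible upstream endpoint.
UPSTREAM = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # kept server-side, never in clients

@app.post("/v1/chat")
async def chat(request: Request):
    payload = await request.json()
    # Central choke point: a natural place to log usage per team,
    # swap models during an upgrade, or block disallowed requests.
    async with httpx.AsyncClient(timeout=60) as client:
        upstream_resp = await client.post(
            UPSTREAM,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
        )
    return upstream_resp.json()
```

The payoff of the choke point is exactly the upgrade story from the first section: when a model is deprecated or a cheaper one arrives, you change the routing in one place and let your evals tell you whether the swap holds.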

Just as important, and not much less time-consuming, was sorting out the legal governance for the use of LLMs. At the time, it wasn’t immediately clear whether data sent to the big LLM vendors would be used to train future models. Some companies chose to be cautious as a result (ref Samsung). Being in very close proximity to the wave, we decided to paddle hard instead, so we tasked our legal team and ML leaders to generate guidance for the company. I wish I had been more closely involved earlier on because this turned out to be harder than I anticipated. In hindsight, I should have known that two tribes with very different skill sets collaborating to clarify an emerging domain would require a little get-to-know-you.

Today, LLM API providers have much clearer terms of service and data agreements. Back then, we had to spelunk into each website, proactively request said ToS, and consult with external counsel to pressure-test our conclusions. R&D teams spent quite a bit of time articulating our use cases in terms that made sense to the legal team so they could provide nuanced risk assessments. Indeed, there are different concerns between a marketer creating online content, an engineer refactoring code, and a chatbot asking customers intake questions. I’m not sure why I had the expectations I did, but it ultimately took several weeks for us to have an articulation that was both agreed to by legal and understandable by employees.

So, short story long, by mid-2023, we had LLMs used by employees to refine their craft and deployed in production in a customer-facing manner to great effect. We were now beginner surfers. End-to-end, or rather, from seeing the wave to standing up on the board in public, it took about six months.

Innovation culture and hackathons

This is squarely a humblebrag, but I feel compelled to give a shout-out to the JustAnswer R&D Innovators.

So, with the Proud CTO Alert (📢) out of the way, I want to shine a big spotlight on the JA R&D team for stunningly inventive demos during our 42nd quarterly hackathon. That’s right, 42! It’s a true testament to the company’s dedication to exploration and professional growth, and, according to HHGTTG (ref), perhaps even the meaning of life.

This quarter, we devoted our creative energy to going big with Large Language Models (LLMs). In an era where AI offers immense opportunities and complex challenges, focusing on LLMs was natural and timely. The results? Absolutely inspiring – 79 of our talented team members came together to create 37 mind-blowing demos. Seeing so many of these ideas getting ready to make their mark in our day-to-day operations will brighten anyone’s day.

Having joined JustAnswer a little over a year ago, I’ve been consistently wowed by our agile, hypothesis-driven approach to product development, what with its rigorous multi-variate testing and rapid iterations. But you know what I think the real jewel in our crown is? Our unwavering commitment to learning, brought to life in the company’s lean philosophy and practices like these hackathons.

For employees, a hackathon isn’t just an event – it’s a reflection of a decade-long journey in embracing agile and making it a core part of our DNA. This latest LLM chapter of the journey started with our ML specialists exploring the OpenAI Playground as soon as it launched. Then came the educators, shining a bright light on the crazy potential of this technology via company-wide lunch-and-learn sessions. The innovators followed, crafting prototypes, how-to guides, templates, and documentation to give everyone a running start. Our legal team then charted the waters to ensure we could navigate safely and wisely. And let’s not forget the early adopters who delivered stunning production results and the decision-makers who bravely prioritized these groundbreaking experiments. It’s been a remarkable team effort all the way.

A huge round of applause to everyone who turned this cutting-edge technology from mere concepts into borderline-magical solutions! Navigating the AI wave requires skill and grace, and you’ve all shown that in spades. My next challenge? Selecting the hackathon winners. Though in the grand scheme, that’s a great problem to have.

Here’s to continued innovation, relentless learning, and more groundbreaking hackathons!