These days, the only thing worse than failing to leverage generative AI is succeeding at it. Soon after the celebration comes the curse of success: new APIs to maintain and costs to manage. The best protection is, hands down, a good set of evaluations. But that can be easier said than done, because organizational momentum may take you elsewhere. Here’s what I’m doing about it, and why. See if that resonates and let me know if I’m missing something obvious!
Two irresistible forces make us all want to upgrade our LLMs. The first is the great token depreciation we’ve been experiencing, which is likely to continue. Some claim that the improved performance of newer small and medium models, paired with their lower costs, has reduced the price of a unit of intelligence by a factor of 1,000. Maybe. What is undeniable is that, at the time of this writing, gpt-4o-mini is 66.6x cheaper than the not-so-old gpt-4-turbo. That’s devilishly difficult to pass up if your constitution allows it!
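For the arithmetic-minded, here is a quick sketch of where a figure like 66.6x can come from. The per-million-token prices below are assumptions, taken from the list prices as I understand them around the time of writing, so check the vendor's current pricing page before relying on them. `PRICES`, `cost_usd`, and the 3,000/500 token workload are illustrative names and numbers, not anything official.

```python
# Assumed list prices in USD per 1M tokens (illustrative; verify before use).
PRICES = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of a single call under the assumed prices above."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Price ratios for input tokens and output tokens taken separately.
input_ratio = PRICES["gpt-4-turbo"]["input"] / PRICES["gpt-4o-mini"]["input"]
output_ratio = PRICES["gpt-4-turbo"]["output"] / PRICES["gpt-4o-mini"]["output"]

# Hypothetical workload: 3,000 input tokens and 500 output tokens per call.
blended_ratio = cost_usd("gpt-4-turbo", 3000, 500) / cost_usd("gpt-4o-mini", 3000, 500)

print(f"input: {input_ratio:.1f}x, output: {output_ratio:.1f}x, blended: {blended_ratio:.1f}x")
```

Under these assumed prices the headline figure corresponds to input tokens (about 66.7x), while output tokens come in at 50x and this particular mix lands at 60x. The ratio you actually see depends on how your traffic splits between input and output.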
The second force that’ll bend your knee arrives when a vendor calls one of its models to the great beyond (e.g., PaLM2/bison). Like it or not, LLM API upgrades are as inevitable as death and taxes.
This is where evaluations come in. Or at least, that’s when their importance dawned on me. See, when teams first experiment with generative AI, they are just trying to make it work. They are looking to “build the right thing”. Final judgment, of course, comes from users, via either qualitative feedback or measured impact. But when it’s time to upgrade the guts of a product or process already in production with a new LLM, you may be less excited to set up another beta group or run an A/B test. Certainly, I’d rather avoid the time and complexity where possible. If only the thing were “built right” and a label swap were all I needed. In a good set of evaluations lies salvation.
In places where LLMs are paired with the other two members of the generative AI holy trinity, prompts and data (input and output), you implicitly have what you need: replay the same prompts and inputs through the new LLM and compare its outputs with the old one’s. Ergo, you have your eval.
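The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production harness: `LoggedCall`, `replay_eval`, the 0.8 threshold, and the stub model are all hypothetical, and the character-level similarity from Python's standard `difflib` is only a stand-in for whatever metric actually fits your task.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


@dataclass
class LoggedCall:
    prompt: str    # prompt used in production
    data_in: str   # input data sent with the prompt
    data_out: str  # output the old LLM produced, treated as the reference


# A "model" here is any callable taking (prompt, input) and returning text.
Model = Callable[[str, str], str]


def similarity(a: str, b: str) -> float:
    """Crude character-level similarity in [0, 1]; a stand-in for a real metric."""
    return SequenceMatcher(None, a, b).ratio()


def replay_eval(
    logs: List[LoggedCall], new_model: Model, threshold: float = 0.8
) -> Tuple[float, List[LoggedCall]]:
    """Replay logged calls through new_model; return (pass rate, failing calls)."""
    failures = [
        call
        for call in logs
        if similarity(call.data_out, new_model(call.prompt, call.data_in)) < threshold
    ]
    pass_rate = 1.0 - len(failures) / len(logs) if logs else 0.0
    return pass_rate, failures


if __name__ == "__main__":
    logs = [
        LoggedCall("Translate to French:", "cat", "chat"),
        LoggedCall("Translate to French:", "dog", "chien"),
    ]
    # Stub standing in for the new LLM: it gets one of the two answers wrong.
    answers = {"cat": "chat", "dog": "zzz"}
    rate, failed = replay_eval(logs, lambda prompt, data_in: answers[data_in])
    print(rate, [call.data_in for call in failed])
```

The old outputs only work as a reference if you were happy with them in the first place, and string similarity will punish a new model that is right in different words. In practice you would likely swap `similarity` for a task-specific check or a human or model-based grader.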
Where that does not exist, one must become the creator. That’s what we’re doing in places like content destined for third-party platforms with high-consequence policies, or chatbots designed to support precise personas, or… The list goes on. Yes, there are tons of awesome and thoughtful public eval sets, but they weren’t created for your use case. Some of them may correlate with your use case, but that, too, will have to be tested and discovered.
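To make "becoming the creator" a little more concrete, here is a rough sketch of a home-grown eval set: a handful of prompts, each paired with a programmatic check. Everything in it is hypothetical, including the `EvalCase` structure, the 60-character limit, the banned terms, and the "Captain Ada" persona. Real platform policies and personas will need far richer checks than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def max_chars(limit: int) -> Callable[[str], bool]:
    """Check that the output is no longer than `limit` characters."""
    return lambda out: len(out) <= limit


def avoids(*terms: str) -> Callable[[str], bool]:
    """Check that none of `terms` appears (naive, case-insensitive substring match)."""
    return lambda out: not any(term.lower() in out.lower() for term in terms)


# Hypothetical cases: a made-up platform policy and a made-up persona rule.
CASES: List[EvalCase] = [
    EvalCase("headline_length", "Write a headline for a running shoe ad.", max_chars(60)),
    EvalCase("no_banned_claims", "Write a headline for a running shoe ad.", avoids("guaranteed", "cure")),
    EvalCase("persona_stays_in_character", "Introduce yourself as Captain Ada.", avoids("as an ai")),
]


def run_evals(model: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, bool]:
    """Run every case through `model` and report pass/fail by case name."""
    return {case.name: case.check(model(case.prompt)) for case in cases}


if __name__ == "__main__":
    # Stub standing in for a candidate LLM.
    print(run_evals(lambda prompt: "Run farther, feel lighter.", CASES))
```

Because `run_evals` only needs a callable, the same cases can be pointed at the old model and the new one, which is the whole point: the set outlives any single LLM.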
I wish I had asked the teams sooner to build the creation of an evaluation into their capability designs. I believe that a good eval should now join the other things that satisfy the production “-ilities” (reliability, scalability, maintainability and friends). Until we atone for this omission, the seductive new batch of models will remain just out of grasp. Don’t say I didn’t warn you.