In a recent Forbes article, our CTO discussed how to use LLMs to tackle a specific problem: labeling in-domain messages quickly to support detection of nuanced incidents such as sexual harassment. As machine learning is core to the GGWP platform, our team frequently evaluates how to leverage the latest ML developments, such as foundational language models, for product features that solve challenging problems while conforming to cost, latency, and privacy boundaries. Choosing an appropriate model and/or partner is fundamental to the labeling use case and many others. Below we discuss our evaluation process during product development.
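For concreteness, here is a minimal sketch of what LLM-assisted labeling can look like. The prompt, label set, model name, and use of an OpenAI-style chat completions client are illustrative assumptions, not our production pipeline.

```python
# Hypothetical sketch: labeling a chat message with an LLM via an
# OpenAI-style chat completions API. The label set, prompt, and model
# name are illustrative only, not GGWP's production taxonomy.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["harassment", "sexual_harassment", "hate_speech", "none"]

def label_message(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model choice
        temperature=0,        # deterministic labels are easier to audit
        messages=[
            {"role": "system",
             "content": "You are a content moderation labeler. "
                        f"Reply with exactly one label from: {', '.join(LABELS)}."},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip()

print(label_message("example in-game chat message"))
```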
Before getting into the details, a common question we are asked is: if LLMs are so smart, why not just deploy them everywhere in content moderation? As the adage goes, we want to use the right tool for the job, and for moderation specifically there are a few important tradeoffs to consider.
In many components of our platform, LLMs are too heavy, too slow, or too costly for practical use. However, for certain high-value, high-complexity tasks, they are irreplaceable tools that enable previously unattainable outcomes.
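In practice this often translates into a tiered pipeline: a fast, inexpensive classifier handles the bulk of traffic, and only ambiguous, high-value messages are escalated to an LLM. The sketch below is a simplified illustration of that pattern; the thresholds and stub functions are assumptions, not a description of our production system.

```python
# Hypothetical tiered-moderation sketch: a cheap classifier handles most
# traffic, and only ambiguous messages are escalated to an LLM.
# Thresholds and stub implementations are illustrative assumptions.

def fast_classifier_score(message: str) -> float:
    # Stand-in for a small, fast in-house model returning P(violation).
    toxic_terms = {"idiot", "trash"}  # toy word list for the sketch
    hits = sum(term in message.lower() for term in toxic_terms)
    return min(1.0, 0.4 * hits)

def llm_review(message: str) -> str:
    # Stand-in for a call to a hosted or self-hosted LLM labeler.
    return "needs_human_review"

def moderate(message: str) -> str:
    score = fast_classifier_score(message)
    if score < 0.10:    # clearly benign: no LLM cost or latency incurred
        return "allow"
    if score > 0.90:    # clearly violating: act immediately
        return "flag"
    return llm_review(message)  # ambiguous, high-value case: escalate

print(moderate("you played like trash"))
```

The key point is that the expensive model only sees the small slice of traffic where its nuance actually pays for its cost and latency.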
There is no shortage of model options between newer entrants into the space such as OpenAI and Anthropic, and tech incumbents like Microsoft, Meta, and Google / DeepMind. While all are focused on long-term visions of foundational general intelligence that can be deployed across a broad range of tasks, each company must also prioritize and differentiate in the short term, leading to important decision points when choosing how to build AI-based products.
Broadly speaking, a core product tradeoff is between functionality and safety / alignment, as is evident in the lively debate between those advocating fast AI development for humanity’s sake and those advocating measured AI development, also for humanity’s sake.
For illustrative purposes we may view OpenAI as leaning more into the former and Anthropic as leaning more into the latter (though both companies are obviously multifaceted). OpenAI ignited interest in the LLM space and has been developing functionality at a breakneck pace ever since, introducing different modalities (text / code, image, audio, and now video), general-purpose developer tools (API, function calling, data plugins, custom GPTs), and ecosystem support (GPT store, enterprise partnerships). While Anthropic has caught up quickly in text through its Claude family of models, its focus has been on safety, steerability, and alignment rather than expanding its breadth of capabilities. Technical developments such as embedding and reinforcing pre-defined guiding principles (Constitutional AI), research into decomposing model outputs to the feature level (patterns of neuron activations), and red-teaming its own models to uncover alignment gaps all feature prominently in Anthropic’s research goal of building AI capable of sensitive and nuanced tasks.
Though it is clear OpenAI still highly values safety & alignment and Anthropic is actively working on capabilities such as multimodality, as users we must consider how each AI package fits with our own product vision and concerns. Building an AI-based travel agent may depend less on binding principles than on the ability to process images, access external information, and leverage a wider developer ecosystem. Conversely, an intelligent caselaw assistant is already text-focused and becomes much more reliable if it can adhere closely to the law firm’s guidelines (not to mention benefit from longer context windows, which we will return to below). However, as with any company, vision and results may not always align, so it is best to benchmark these models directly in your work.
Another important strategic divide is closed versus open source, with companies like Meta and Mistral seeing an opening and releasing some of their foundational models directly to the community. There are technical and business tradeoffs on both sides, with important implications for the end users.
On the open source front, experimentation is happening rapidly, with strong focus on efficacy (crowdsourced RLHF), efficiency (a necessity given compute constraints), and portability (GPT in C, GPT on a Raspberry Pi). For those working with sensitive proprietary data or requiring custom deployments (e.g. in-house or on-device), using an open source model base like Llama or Mixtral affords much greater control over privacy and resource tradeoffs.
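The privacy appeal is that both the weights and the chat data stay on infrastructure you control. As a minimal sketch (assuming the Hugging Face transformers library, a machine with enough GPU memory, and an example open-weight checkpoint; the prompt and generation settings are illustrative):

```python
# Hypothetical sketch of running an open-weight instruction model in-house
# with Hugging Face transformers, so chat data never leaves your servers.
# The checkpoint name, prompt, and generation settings are illustrative.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example open-weight model
    device_map="auto",      # spread across available GPUs / CPU
    torch_dtype="auto",     # use the checkpoint's native precision
)

prompt = (
    "[INST] Label the following chat message as harassment or benign, "
    "and answer with one word.\nMessage: example in-game chat message [/INST]"
)
output = generator(prompt, max_new_tokens=10, do_sample=False)
print(output[0]["generated_text"])
```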
On the flip side, model size matters, as certain capabilities such as instruction-following emerge and improve through scaling up the number of model parameters. In a recent study comparing BloombergGPT, a smaller but purposefully trained financial model, to GPT-3.5 and GPT-4, larger generically trained OpenAI models, researchers found that the OpenAI models outperformed BloombergGPT on most financial tasks with simple few-shot learning, suggesting that even pricey proprietary models trained on the best in-domain data can fall short of their larger generically trained peers. And as the size suggests, these hundreds-of-billions to trillion-parameter models are only viable for the largest platforms and are unlikely to be open sourced (nor can anyone in the community afford thousands of Nvidia GPUs). Along these lines, state-of-the-art performance will likely remain with the closed platforms, as will costly cutting-edge capabilities such as video generation (OpenAI Sora). These centralized services will likely achieve better economics too, as inference batching strongly impacts unit economics, and higher user volume begets more efficient batch sizes.
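For readers unfamiliar with the term, few-shot learning here simply means including a handful of labeled examples in the prompt instead of fine-tuning the model. A schematic sketch of that prompt construction (the sentiment task and examples are illustrative, not taken from the BloombergGPT study):

```python
# Hypothetical few-shot prompt construction: a handful of labeled examples
# are prepended to the query so a general model can adapt without fine-tuning.
# The sentiment task and examples are illustrative, not from the cited study.
FEW_SHOT_EXAMPLES = [
    ("Shares surged after the earnings beat.", "positive"),
    ("The company warned of weaker guidance.", "negative"),
    ("The board meets on Tuesday.", "neutral"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Classify the sentiment of each headline."]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Headline: {text}\nSentiment: {label}")
    lines.append(f"Headline: {query}\nSentiment:")
    return "\n\n".join(lines)

# The resulting string is sent to the LLM, which completes the final label.
print(build_few_shot_prompt("Regulators approved the merger."))
```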
Most users with a limited technical background who want the best (and possibly cheapest) outputs should consider the closed providers. Those requiring strong controls, customization, and privacy constraints should consider the leading open source models, which fortunately are maturing rapidly. And in some cases, such as when robust high-value decisions are needed, it may be best to use an ensemble of closed services alongside open source deployments.
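As a toy illustration of that last point, an ensemble can be as simple as a majority vote over labels from independent models, falling back to a human when they disagree; the labeler callables below are hypothetical stand-ins for real closed-API and self-hosted models.

```python
# Hypothetical ensemble sketch: combine labels from several providers with a
# majority vote for high-value decisions. The labelers are placeholders for
# real closed-API and self-hosted open-source models.
from collections import Counter
from typing import Callable, List

def majority_label(message: str, labelers: List[Callable[[str], str]]) -> str:
    votes = Counter(labeler(message) for labeler in labelers)
    label, count = votes.most_common(1)[0]
    if count <= len(labelers) // 2:
        return "escalate_to_human"  # no clear majority: route to a moderator
    return label

# Example with stub labelers standing in for e.g. a closed API model,
# a second closed provider, and a local open-weight model.
stubs = [lambda m: "harassment", lambda m: "harassment", lambda m: "none"]
print(majority_label("example message", stubs))  # -> "harassment"
```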
Having considered the broad strokes, we can dig into other key technical details that differentiate these models, such as context window length, latency, and cost.
Building great AI products has evolved from requiring pure technical expertise to requiring a healthy mix of technical, product, and business reasoning. Presently there are myriad incredible companies building model platforms, and though they all plan to get to powerful, cheap, safe, and easy-to-use AI, it is important to compare what they prioritize today and how that aligns with your own product requirements.
Brian is the VP of AI/ML at GGWP.