Why I invested in Arcee AI & Mark McQuade
How Arcee’s Tech is Differentiated and Why Arcee Will Eat Large Model Providers for Lunch
Friends - I will resume my narrative about my travels and work, but I thought some of you might find it cool to see what my internal investment memo looks like for a highly technical AI/ML company; this one is from my pre-seed/seed investment in Arcee 11 months ago! See many of you in NY this fall!
Fundamentally, Large Language Models (LLMs) are in a bad spot, completely screwed if you will; they are locked in a race to the bottom on API pricing while carrying tremendous compute and training costs. Large parameter language models hallucinate too frequently to be helpful in precision industries such as manufacturing, healthcare, insurance, legal, and accounting. While impressive in their breadth of knowledge, their depth does not scale proportionally; LLMs are equally shallow across the board. They were never built for commercial applications; a group of AGI fanatics wanted to create superhuman intelligence at all possible costs, including billions in R&D.
The cycle of AGI-driven LLM development created two significant externalities. The first is an intense focus on the cost of developing AI/ML systems, from both investors and the companies deploying the models. The second is that demand for accurate AI/ML services has exploded, with companies looking to own highly secure, personalized, accurate, and domain-specific models. A company that makes auto parts does not need its language model to discuss the history of the Arctic Circle; a basic command of English and a specialty in auto manufacturing are all it needs.
As I have met with Arcee's technical team, and specifically their CTO, Jacob Solawetz, I've come to understand something that most VCs do not (except for Lee & Joe <3): Arcee is the leading model developer in the world for downsizing and managing the internals of complex language models; notice that the emphasis is not just on the size of the models. This is not a simplistic thesis that Large Language Models are expensive, so we need to switch to Small Language Models. It's about how the internals of models are being modified in ways previously unimaginable, allowing better, smaller models to exist without paying exorbitant development and inference costs. And the models are better: they are commercially viable and cost-effective for precision industries such as finance, manufacturing, healthcare, insurance, and legal, breaking out of the OpenAI conundrum of being too expensive and generally aware but not smart enough to do precision tasks.
Arcee's strength lies in its ability to develop original IP and combine multiple SOTA (state-of-the-art) techniques into a single commercially viable model. Charles Goddard, a member of the Arcee technical staff, created MergeKit, the leading AI/ML tool for merging multiple open-source and open-source-adapted language models into a single model. MergeKit allows for the pruning of redundancies across models, the deprioritization of vectors (i.e., elements of the knowledge base) that are not useful for the stated task, and the prioritization of highly applicable and relevant information. The benefit is that exceptionally large language models, 500B+ parameters, do contain huge amounts of valuable knowledge, but it is buried across those 500B parameters and needs to be transferred to a smaller, more accurate model built for a known task, not artificial general intelligence. The core transformations happen in two stages: the first is the merging of models, which takes the base Large Language Model out of a generalized state and transforms it into a domain-specific model; the second is the distillation of that model into a smaller parameter model. Both are novel transformations that Arcee has pioneered while kneecapping OpenAI in a very overt manner.
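To make the merging idea concrete, here is a minimal sketch of the simplest possible merge: a weighted average of two same-architecture checkpoints in parameter space. This is only an illustration of the concept, not MergeKit's implementation (MergeKit supports far more sophisticated methods such as SLERP and TIES), and the model IDs and mixing weight below are placeholders I made up.

```python
# Minimal illustration of weight-space model merging (linear interpolation).
# NOT MergeKit's implementation -- just the core idea that same-architecture
# models can be combined parameter by parameter. Model IDs are placeholders.
import torch
from transformers import AutoModelForCausalLM

GENERAL_MODEL = "org/general-7b"       # hypothetical general-purpose model
DOMAIN_MODEL = "org/aero-7b-finetune"  # hypothetical domain fine-tune of the same base
ALPHA = 0.6                            # weight given to the domain-specific model

general = AutoModelForCausalLM.from_pretrained(GENERAL_MODEL, torch_dtype=torch.float16)
domain = AutoModelForCausalLM.from_pretrained(DOMAIN_MODEL, torch_dtype=torch.float16)

domain_state = domain.state_dict()
merged_state = {}
for name, g_param in general.state_dict().items():
    # Interpolate each tensor; shapes must match, which is why weight-space
    # merging requires a shared architecture.
    merged_state[name] = (1 - ALPHA) * g_param + ALPHA * domain_state[name]

general.load_state_dict(merged_state)
general.save_pretrained("./merged-aero-model")
```

The real tool drives this kind of operation from declarative configs and, crucially, goes far beyond naive averaging, which is where the "figuring out the correct vectors" work below comes in.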
Fundamentally, we want some of the power of the large model but not its lethargy and bloat, from both an accuracy and a cost perspective. To put a model in production for an aerospace manufacturer, Arcee will take a general, high-parameter 70B+ model, deprioritize values not related to aerospace manufacturing, and merge in a cheap 3B model specialized in aerospace manufacturing. While it sounds simple, figuring out the correct vectors to deprioritize and then reconciling model internals from different encoders, corpora of data, and internal transformations is incredibly complex. Arcee owns MergeKit, released under an LGPL license, meaning the core technology for merging is theirs. This is quite important given that the vast majority of high-performing language models are merged models, including Google's latest Gemma-2 model family, whose paper cites Arcee and MergeKit; needless to say, Arcee has quietly become ubiquitous. Check the HuggingFace leaderboard; it's MergeKit over and over again.
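One way to picture the "deprioritization" step is the trimming idea used in TIES-style merging (one of the methods MergeKit supports): take the delta between a fine-tuned model and its base (the "task vector"), zero out the low-magnitude entries, and add only the surviving high-signal directions back to the base. The sketch below is my own toy illustration of that idea under those assumptions, with placeholder tensors standing in for real weight matrices; it is not Arcee's code.

```python
# Sketch of "deprioritization" via task-vector trimming (the idea behind
# TIES-style merging). Toy tensors stand in for real model weights.
import torch

def trim_task_vector(base: torch.Tensor, finetuned: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    """Keep only the top `keep_ratio` fraction of the delta by magnitude."""
    delta = finetuned - base                        # the "task vector"
    k = max(1, int(delta.numel() * keep_ratio))
    threshold = delta.abs().flatten().topk(k).values.min()
    mask = delta.abs() >= threshold                 # high-signal directions survive
    return base + delta * mask                      # low-signal directions are dropped

# Toy example: one weight matrix from a base model and a domain fine-tune.
base_weight = torch.randn(4096, 4096)
aero_weight = base_weight + 0.01 * torch.randn(4096, 4096)
merged_weight = trim_task_vector(base_weight, aero_weight, keep_ratio=0.2)
```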
This was V1 of Arcee; MergeKit alone is a billion-dollar breakthrough to me, and the value is exceptionally clear. That said, Arcee has been working on a framework called DistillKit, which is both a complement to MergeKit and a step forward, depending on how you plan to use it. The problem with AI today is that it is fundamentally too big and inaccurate. While MergeKit lets us create domain-specific, highly accurate large models for the cost of a small one, those models are still the size of the largest model in the merge. For example, if the inputs are a 3B model specialized in aerospace manufacturing and a general 70B model, we get an accurate 70B aerospace manufacturing model for the cost of only building a 3B one, but we still have to RUN the model, and its size is still 70B; when you merge, you end up with the size of the largest model in the merge. You avoid the massive development costs of training large parameter models, but you are still hit hard by inference costs (i.e., running the model on a GPU/CPU/TPU, etc.).
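To put that inference bill in perspective, here is a back-of-the-envelope weight-memory calculation, assuming fp16 weights only and ignoring KV cache, activations, and serving overhead:

```python
# Rough weight-only memory footprint at fp16 (2 bytes per parameter).
# Ignores KV cache, activations, and serving overhead.
BYTES_PER_PARAM_FP16 = 2

def weight_memory_gb(params_billions: float) -> float:
    return params_billions * 1e9 * BYTES_PER_PARAM_FP16 / 1e9

print(weight_memory_gb(70))   # ~140 GB -> multi-GPU serving
print(weight_memory_gb(1.5))  # ~3 GB   -> fits on a laptop-class device
```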
There are techniques like quantization-aware training that are used in some models, but they fundamentally do not address size and accuracy together; they mainly downsize the models without making them more accurate, which is a huge issue. DistillKit does not pare down a model the way quantization does; it uses the domain-specific, very-large-parameter merged model to train a smaller version of itself with only the very best attributes. This can be conceptualized as a teacher-student setup where the student learns only the most relevant information from the teacher. It works across model architectures, too.
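For readers who want the mechanics, here is a minimal sketch of the teacher-student idea in its generic, logit-based form: the small student is trained to match the large teacher's softened output distribution alongside the usual next-token loss. This is the textbook technique, not DistillKit's exact training loop, and the hyperparameters are illustrative.

```python
# Generic logit-based knowledge distillation (teacher-student), the concept
# behind frameworks like DistillKit. Not Arcee's exact training loop.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: student matches the teacher's softened distribution (KL divergence).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard next-token cross-entropy against the data.
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft_loss + (1 - alpha) * hard_loss

# In the training loop: the teacher (the large merged model) runs with no
# gradients; only the small student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# loss = distillation_loss(student(input_ids).logits, teacher_logits, labels)
```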
To review in simpler terms: a general Llama 70B model can be merged with a 3B domain-specific aerospace manufacturing model, creating a highly accurate merged model, call it 70B Llama-Aero. Then, using DistillKit, the 70B Llama-Aero starts a new training run in which it creates a smaller version of itself, either in the Llama format or with the new student model shaped to resemble another provider's architecture, for example Stable LM 2 1.6B. Now you have a domain-specific, powerful model that has many of the best parts of the expensive, very large parameter models but took on none of the training costs of the large model's original development; Meta paid for that. The model you end up with is small enough to run on your laptop, meaning you no longer have massive compute bills. This is not hypothetical: Arcee is currently #1 on the relevant leaderboard with Arcee-Lite, which outperforms Qwen2 1.5B and is currently the best 1.5B model; Lite is beating some models more than 10X its size on relevant benchmarks, and it also soundly beats Apple's new 3B parameter model coming to the iPhone 16. I believe Arcee will soon own the mobile, local LLM/SLM space, and given their recent results, it's hard to see it any other way.
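The practical payoff is that the end product loads like any other small Hugging Face checkpoint on commodity hardware. A minimal sketch of running a ~1.5B model locally with transformers; note that the repo ID and the prompt below are my assumptions, so swap in whatever small model you actually deploy.

```python
# Running a small distilled model locally with transformers -- no GPU cluster,
# no per-token API bill. The repo ID below is an assumption; substitute the
# small model you actually deploy.
from transformers import pipeline

MODEL_ID = "arcee-ai/arcee-lite"  # assumed repo ID for Arcee's ~1.5B model

generator = pipeline("text-generation", model=MODEL_ID)
output = generator(
    "Summarize the torque spec change for the wing-spar fastener:",
    max_new_tokens=128,
)
print(output[0]["generated_text"])
```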
While DistillKit and MergeKit are highly effective, Arcee recently launched a re-training protocol called Spectrum, which allows their internal R&D costs to drop dramatically. Within transformer-based models, the many "layers" each have a signal-to-noise ratio, and that ratio indicates how useful the layer is. When training a model, Arcee uses Spectrum to detect layers with poor signal-to-noise ratios and freeze them during training or retraining. Doing this across multiple training runs, "Spectrum consistently reduced the training time by an average of 35%, with some training pipelines seeing a 42% increase, allowing... faster iterations and quicker deployment of models. Spectrum's selective layer training reduced memory usage by up to 36%, enabling the training of larger models or increased batch sizes without additional hardware resources. Despite the reduced training time and memory usage, models trained with Spectrum showed no significant degradation in performance. Some models demonstrated improved performance due to the targeted training of high-impact layers." (Arcee) This means that Arcee can build their already small models faster, better, and cheaper using Spectrum, MergeKit, and DistillKit.
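A sketch of the freezing mechanic: score each transformer layer with some signal-to-noise measure, then disable gradients for the low-scoring layers so compute and optimizer state are only spent on high-impact layers. The scoring function below (mean absolute weight over weight standard deviation) is a crude stand-in of my own, not Spectrum's actual SNR analysis, and the module path assumes a Llama-style model.

```python
# Spectrum-style selective layer training, sketched: score each decoder layer
# with a signal-to-noise proxy and freeze the weakest ones. The scoring here
# is a crude stand-in for Spectrum's actual SNR analysis.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("org/some-base-model")  # placeholder ID

def layer_score(layer: torch.nn.Module) -> float:
    # Crude "signal-to-noise" proxy: mean absolute weight relative to weight std.
    stats = [(p.abs().mean() / (p.std() + 1e-8)).item() for p in layer.parameters()]
    return sum(stats) / len(stats)

layers = model.model.layers                      # Llama-style module path (assumption)
scores = sorted(((i, layer_score(l)) for i, l in enumerate(layers)), key=lambda x: x[1])

freeze_fraction = 0.65                           # train only the top ~35% of layers
num_frozen = int(len(layers) * freeze_fraction)
for i, _ in scores[:num_frozen]:
    for p in layers[i].parameters():
        p.requires_grad_(False)                  # frozen layers take no gradient updates

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters after freezing: {trainable / 1e9:.2f}B")
```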
At this point, Arcee is merging large parameter models to make them domain-specific, then cutting down their size with DistillKit's teacher-student method without the major accuracy loss seen in quantization, all while saving on development costs via Spectrum. You can now run a state-of-the-art language model on your laptop, and likely soon locally on your phone, meaning Arcee has solved the security question around re-training models, since the models can run locally. It's exceptionally clear that Arcee pioneered the SLM; what you just read is precisely how they did it. If you don't believe they will eat the larger players' lunch, just remember that they already ate their models, chewed them up, and sold them.
—John Villa, September 10th, 2024