In today’s column, I explore the exciting and rapidly advancing realm of faster and slimmer generative AI that is being devised via the latest advances in so-called 1-bit large language models (LLMs). No worries if you don’t know what those are. I’ll be walking you step by step through what these emerging 1-bit LLMs are all about.
The topline aspect is that by devising generative AI based on this relatively new technological capability, you can astutely craft AI that works well in low-resource situations.
What is a low-resource situation?
Imagine wanting to run a full-scale generative AI app entirely on your smartphone, doing so without having to engage any online or Internet connections. Envision a specialized edge device running standalone in a factory, fully loaded with a full-sized generative AI app tailored to doing factory work. And so on.
Exhilarating times, for sure.
Let’s talk about it.
This analysis of an innovative proposition is part of my ongoing Forbes.com column coverage on the latest in AI including identifying and explaining various impactful AI complexities (see the link here).
Where We Are On Generative AI Sizing
Before we get to the bits and bytes, I’d like to provide some fundamentals about modern-day AI.
The rush to develop better and more robust generative AI has tended to balloon the size of the rapidly advancing AI. Everyone talks about scale these days. Make the AI larger so that hopefully it will do increasingly amazing feats of pattern-matching. Use tons more data during data training so that perhaps the AI will answer more questions and be further rounded out.
Bigger seems so far to be better.
Some are aiming to see if we can have our cake and eat it too.
Here’s what that means.
First, be aware that today’s large-scale LLMs require superfast computer servers that reside in data centers and the pricey hardware must be coddled to keep it functioning properly. The amount of computer storage or memory is equally humongous. The cost of these computing resources is enormous. Plus, the electricity consumed, along with sometimes water consumed if using water cooling, makes your head spin when you see the volumes involved.
Second, the only realistic means you have to use those large-scale LLMs is by connecting online to them. Thus, when armed with your smartphone, you will need to find a reliable Internet connection to successfully use the AI. Any glitches in your connection are bound to foul up whatever usage of the AI you are trying to undertake. Plus, your smartphone ends up being little more than a simple communication device, not especially leveraging the computer processing power that it has.
Okay, the conundrum is this:
- How can we somehow get those LLMs to work on smaller devices, take less energy, avoid the need for an Internet connection, and yet at the same time achieve a modicum of similar results?
The beauty of that question is that it is principally a technological consideration. We just have to find clever technological approaches or solutions that can bring about that hoped-for dream. Get the best tech wizards on top of this and keep their noses to the grindstone. This brings up the proverb of wanting to have your cake and eat it too. We desperately want smaller LLMs that function essentially the same as their full-sized counterparts running on large-scale resources.
The answer so far is the advent of small language models (SLMs).
Yes, SLMs are being devised and gradually being adopted for use on a wide array of handheld devices. I recently conducted an overview of the emergence of SLMs as an alternative to conventional LLMs, see my analysis at the link here.
Small Language Models Entail Clever Solutions
The idea underlying SLMs is that you can potentially compress or compact full-size LLMs into a smaller overall package. It is the proverbial dilemma of trying to cram ten pounds of potatoes into a five-pound sack.
Unfortunately, many of the techniques and procedures for compaction tend to lose something along the way, namely that the SLM is often less capable than its bigger LLM brethren. An agonizing choice must then be made. You have AI that runs with less demand for resources, but will the AI still do the things you want done?
There is a huge appetite to make generative AI a lot less resource-hungry while also pushing to the limits of keeping the desired capabilities.
A plethora of approaches are vying right now to see what works best. No one can say for sure which technique or method is the winner-winner chicken dinner. One method suggests stripping out portions of the AI’s internal structure and leaving just bare bones. Generally, that tends to also undercut the capabilities of the AI. It is smaller but with a lot less proficiency.
I’ll go ahead and explore here a method that shows great promise and uses a quite clever premise. The belief is that maybe we can use fewer bits and bytes, doing so by switching from full-sized numbers to something more in keeping with the world of binary notation, befitting everyday computing hardware.
The Low-Bit Or 1-Bit Solutions
For those of you unfamiliar with the binary world of computing, I shall provide a handy rundown. Let’s start with the macroscopic 30,000-foot level view and make our way to the ground-level binary bits viewpoint.
Regular numbers such as 1,840.56 or 0.5423 are stored inside a computer in what is known as a floating-point format. To do this, the number of bits used is usually either 16 bits or 32 bits. The computer keeps track of where the decimal point goes. Then, when adding, subtracting, multiplying, and dividing, the numbers are typically kept in the floating-point format. Each number consumes those 16 bits or 32 bits. Ergo, if you were storing a thousand numbers, you would need to use 16,000 bits (that’s 1,000 numbers each consuming 16 bits) or 32,000 bits (that’s 1,000 numbers each consuming 32 bits).
The more bits you use, the more memory or storage you need.
Various efforts have been going on since the start of the computer field to compress things so that maybe in some circumstances 8 bits could be used instead of 16 bits or 32 bits. For example, if you restrict the numbers, say by disallowing decimal points or by stipulating that the numbers can’t be larger than a particular maximum, you can squeeze things down. Depending upon the situation, you might even be able to reduce this down to 4 bits.
The advantage is that the fewer bits you use, by and large, the less computer memory or storage you need. There is also a solid chance that the number of computing cycles needed to crunch the numbers and perform arithmetic operations on them will be faster.
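To make that concrete, here is a quick back-of-the-envelope tally in Python, using a made-up store of one million numbers purely for illustration, showing how the total storage shrinks as the bit width drops.

```python
# Storage needed for one million numbers at various bit widths (illustrative only).
num_values = 1_000_000
for bits in (32, 16, 8, 4, 1):
    total_bits = num_values * bits
    print(f"{bits:>2}-bit: {total_bits / 8 / 1_000_000:.3f} MB")
# 32-bit: 4.000 MB ... down to 1-bit: 0.125 MB
```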
Slimmer and faster, that’s the goal.
Modern digital computers end up at the keystone machine level using individual bits that consist of only two possible states. Conventionally, a single bit would either be a value of 0 or a value of 1. There’s not much you can practically do with just one bit. Sure, you could keep track of whether something is on or off by assigning a single bit, but the possibilities are limited to just keeping track of two states of something.
That leads us to the advent of SLMs. With SLMs, we want to compact or compress LLMs. If we could achieve LLMs by using fewer bits, that would certainly be a form of compaction. The size would be lessened, and the speed of processing would tend to hasten. There have been approaches that went the low-bit route by leaning into 8 bits or 4 bits.
The loftiest desire would be to get things down to 1-bit. You can’t get much better than that (other than essentially tossing stuff out or finding new forms of representation). Getting to 1-bit is the dream goal.
Let’s see how that could be accomplished for generative AI and LLMs.
Example Of Getting To 1-Bit Solutions
Assume that you are using a generative AI app and opt to enter a prompt that says just one word, let’s go with the word “Dream.” You have already told the AI that it should respond with just one word in return, preferably a word that would normally follow the word that you’ve entered. For example, we’d be anticipating that the AI might say “big” as in “Dream big” or maybe emit “well” as in “Dream well.”
The text that you entered is first converted into a numeric format. This is known as tokenization, see my detailed explanation at the link here. I will depart from the customary complex form of tokenization just to help illustrate things.
Suppose the AI has a dictionary of words and looks up the word in the dictionary to see what position it is. Let’s pretend that the word “Dream” is at position 450 in the internal dictionary. Okay, so the processing of your entered word is going to be done with the number 450 throughout the AI internal number crunching.
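As a toy illustration, a lookup along those lines might resemble the snippet below; the dictionary is entirely made up, and real tokenizers typically work on subword pieces rather than whole words.

```python
# A made-up mini "dictionary" mapping words to positions (real tokenization is richer).
vocab = {"the": 1, "dream": 450, "big": 871, "team": 902}  # hypothetical positions
token_id = vocab["dream"]
print(token_id)  # 450, matching the running example
```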
At some juncture, the word, or rather now the number 450, is going to be multiplied by other numbers that reflect various aspects associated with the word “Dream”. You see, during the initial data training, the pattern-matching had seen the word “dream” many times and statistically kept track of what words normally follow that particular word. These statistical relationships are numerical.
Suppose the pattern-matching indicates that we should multiply the 450 by the value 0.8592, which represents a statistical relationship based on the established pattern-matching.
That’s an easy calculation, but it does require that we make use of something akin to a floating-point representation. Rather than using the number 0.8592, suppose we decided to round the number to either 0 or 1. If the number is closer to 0, we will round down to 0. If the number is closer to 1, we will round up to 1. It is apparent that the 0.8592 would be rounded up to the value of 1.
To recap:
- We had this: 450 x 0.8592
- Now we have this: 450 x 1.
You can directly see that the multiplication by 1 is going to be a lot less time-consuming. Everybody would certainly rejoice at doing that kind of multiplication. Easy-peasy. The same is true if we had been faced with, say, the number 0.0123, which would have been rounded down to 0. We would have 450 x 0. That’s super easy to calculate.
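Here is a tiny Python sketch of that rounding step, using the same hypothetical numbers from the running example.

```python
# Round each weight to the nearest of 0 or 1 before multiplying by the token value.
token_id = 450
for weight in (0.8592, 0.0123):
    rounded = 1 if weight >= 0.5 else 0
    print(token_id * weight, "->", token_id * rounded)
# 386.64 -> 450  (0.8592 rounds up to 1)
# ~5.54  -> 0    (0.0123 rounds down to 0)
```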
Here’s the deal.
Maybe we can take a lot of the statistical relationships and, instead of keeping them in their original floating-point values, convert them into a series of 0’s and 1’s. After we’ve done that conversion, which is a process we need to do only once, the rest of the time we would be multiplying with a 0 or 1. Happy face.
Heavens, we just did something incredible. If we have many millions or possibly billions of those floating-point values, all of which were each consuming 16 bits or 32 bits, the whole kit and caboodle has been dramatically reduced to just 1 bit for each number. On top of this, the multiplications are going to be incredibly easy to do since we are always multiplying only by either 0 or 1.
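To see the space savings in action, here is a minimal sketch that uses a randomly generated stand-in for a weight matrix (not an actual LLM’s weights), thresholds each 32-bit value to 0 or 1, and packs eight of the resulting bits into each byte.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.random((1024, 1024)).astype(np.float32)  # made-up "weight matrix"
print(weights.nbytes)        # 4,194,304 bytes at 32 bits per weight

bits = (weights >= 0.5).astype(np.uint8)   # threshold each value to 0 or 1
packed = np.packbits(bits)                 # pack eight 1-bit values per byte
print(packed.nbytes)         # 131,072 bytes, roughly a 32x reduction
```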
Nice.
Based on that use of those binary values, it turns out that the word ultimately chosen by the AI is “team” as in “Dream team.” I thought you’d want to know.
The Here And There Usage Of 1-Bit
I’ll do some more unpacking just to give you an additional flavor of how this works.
Those statistical relationships that I mentioned are typically kept internally via something referred to as a weight matrix or sometimes as an activation matrix. As noted, there are millions or billions of these values. The reduction in space and the speed-up in processing time can be enormous when those values are converted into single-bit form.
I’m sure you are wondering whether converting a value such as 0.8592 to a value of 1 is going to be reasonable. The same can be asked about converting 0.0123 to a value of 0. We seem to be losing a lot of crucial information. Basically, any value in the matrix that is below 0.5 is going to become a 0 and any value above that threshold is going to become a 1 (side note: the exact value of 0.5 is a special case and the AI developers would provide a rule for whether to round up or down).
Yes, you are losing information by lumping things into one of two buckets. No doubt about that. The question is whether this makes a material difference or not. It all depends. Other factors come into play. Sometimes you can do this without much material loss, while sometimes it is so bad that you have to regrettably abandon the 1-bit approach.
There is also the idea of using the single-bit method in some places of the AI and not in others. Just because an LLM is referred to as 1-bit doesn’t mean that everything is entirely captured in elements of a single bit. Major parts might be, while other, presumably minor, parts might not be. You can even decide to split major parts into portions that do use 1-bit while other portions do not.
It is a complicated stew and requires mindful planning and analysis.
The Values Of 0 And 1 Are Reconstituted
Here’s something else you might find of keen interest.
Often, even though the actual binary values are 0 and 1, we pretend that they are construed as -1 and +1. In other words, assume that if the matrix holds a 0 it really means we intend to have -1 there. The value of +1 is still just a positive one.
The reason for this pretense is that -1 and +1 are generally better at representing the values we are replacing. You see, using -1 and +1 tends to center the weight values around zero, which helps with the data training and tends to keep the training signals (referred to as the gradients or gradient flow) from becoming skewed. Overall, using -1 and +1 allows each bit to essentially represent a greater range of values (positive and negative), preserving more information than 0 and 1.
Returning to the example of the word “Dream,” which we said is the value of 450, in a -1 or +1 scheme, the result would be that we either compute 450 x (-1) or compute 450 x (+1). That’s still quick and easy, and we are still using just 1-bit. There is a heated debate about whether to represent 0 and 1 as their true selves of 0 and 1, or instead go the route of -1 and +1. Worthy arguments exist on both sides of the debate.
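Here is a small sketch of the signed flavor, mapping each weight to -1 or +1 based on its sign; the weight values are made up, and real schemes of this kind typically also keep a scaling factor that is omitted here for simplicity.

```python
import numpy as np

weights = np.array([0.8592, -0.3301, 0.0123, -0.9144], dtype=np.float32)  # made-up weights
signed = np.where(weights >= 0, 1, -1)   # each weight becomes -1 or +1 (1 bit apiece)

token_id = 450
print(token_id * signed)   # [ 450 -450  450 -450]
```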
Squeaking Beyond 1-Bit To Nearly 2-Bit
Another twist will perhaps catch your fancy.
Since we are giving up some semblance of accuracy by doing the rounding to just a single binary value, an alternative is to go with 2-bits rather than 1-bit. An emerging approach uses a ternary value system of -1, 0, +1. Those three values won’t fit into just one bit, so you are forced toward two bits.
But you can potentially arrange things so that you sometimes use 1 bit and sometimes use 2 bits, which might average over thousands or millions of values to end up using approximately 1.5 bits. You can’t really have half a bit per se, and this is just saying that since you have a mixture of 1-bits and 2-bits, the average of how many bits are used altogether comes out in a fractional calculated way.
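A quick bit of arithmetic shows how that fractional average arises, assuming a hypothetical even split between 1-bit and 2-bit values.

```python
# Hypothetical mix: half of the values stored in 1 bit, half in 2 bits.
one_bit_values = 500_000
two_bit_values = 500_000
total_bits = one_bit_values * 1 + two_bit_values * 2
print(total_bits / (one_bit_values + two_bit_values))   # 1.5 bits per value on average
```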
An interesting research paper closely examined the ternary approach, in a piece entitled “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits” by Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei, arXiv, February 27, 2024, which made these salient points (excerpts):
- “Vanilla LLMs are in 16-bit floating values (i.e., FP16 or BF16), and the bulk of any LLMs is matrix multiplication. Therefore, the major computation cost comes from the floating-point addition and multiplication operations.”
- “Compared to full-precision models, 1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint. This can significantly reduce the cost and time of loading weights from DRAM, leading to faster and more efficient inference.”
- “In this work, we introduce a significant 1-bit LLM variant called BitNet b1.58, where every parameter is ternary, taking on values of {-1, 0, 1}. We have added an additional value of 0 to the original 1-bit BitNet, resulting in 1.58 bits in the binary system.”
- “Firstly, its modeling capability is stronger due to its explicit support for feature filtering, made possible by the inclusion of 0 in the model weights, which can significantly improve the performance of 1-bit LLMs.”
- “Secondly, our experiments show that BitNet b1.58 can match the full precision (i.e., FP16) baselines in terms of both perplexity and end-task performance, starting from a 3B size, when using the same configuration (e.g., model size, training tokens, etc.).”
If the topic of 1-bit LLMs interests you, I’d recommend reading the above-noted study as a handy means of venturing into this exciting realm.
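For a feel of where the 1.58 figure comes from and what ternary rounding could look like, here is a brief sketch; the rounding recipe is an illustration in the spirit of the ternary approach rather than a faithful reproduction of the paper’s exact method, and the weight values are made up.

```python
import math
import numpy as np

# Three distinct values {-1, 0, +1} carry log2(3) bits of information apiece,
# which is where the 1.58-bit figure in the paper's title comes from.
print(math.log2(3))   # ~1.585

# Illustrative ternary rounding: scale by the mean absolute value, then round
# and clip each weight to -1, 0, or +1.
weights = np.array([0.8592, -0.3301, 0.0123, -0.9144], dtype=np.float32)
scale = np.abs(weights).mean()
ternary = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
print(ternary)   # [ 1 -1  0 -1]
```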
Kicking The Tires On 1-Bit LLMs
I’ll cover a few more essentials and then offer some closing remarks.
Most LLMs and generative AI apps make use of an artificial neural network (ANN) as the crux of their internal structure, see my detailed explanation about ANNs at the link here. When you seek to convert portions of an ANN to a 1-bit approach or any low-bit method, this is commonly known as quantization. Thus, we might have an artificial neural network for a generative AI that has been “quantized,” meaning that some or all of it has been low-bit or 1-bit converted.
Get yourself mentally ready for a challenging question:
- Should the artificial neural network be converted to 1-bit at the get-go when initially data training the generative AI, or should we wait until after the data training is completed and then do the conversion?
The approach of doing so at the get-go is typically referred to as quantization-aware training (QAT), while the alternative approach of doing so afterward is known as post-training quantization (PTQ). There are notable tradeoffs involved between which to use. If you want to get an AI researcher or AI developer engaged in a lively debate about 1-bit or low-bit LLMs, go ahead and bring that question to their attention.
If you dare do so, please prepare yourself for a lengthy discourse and possibly some curse words. As a bonus to juice the conversation, yet another perspective entails combining the two approaches, whereby you do a mixture of QAT and PTQ. It is an intriguing consideration that is still being figured out.
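For the hands-on reader, here is a bare-bones sketch of the difference between the two approaches, reusing the signed 1-bit mapping from earlier; the weights are made up and this is not a faithful training loop, just the conceptual contrast.

```python
import numpy as np

def binarize(w):
    # Map weights to -1/+1 by sign (the signed 1-bit scheme discussed earlier).
    return np.where(w >= 0, 1.0, -1.0)

# Post-training quantization (PTQ): train in full precision, convert once at the end.
trained_weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for trained weights
ptq_weights = binarize(trained_weights)

# Quantization-aware training (QAT): keep full-precision "shadow" weights, but push each
# forward pass through the binarized copy during training so the model learns to cope
# with the 1-bit constraint (gradients typically pass straight through the rounding step,
# the so-called straight-through estimator).
shadow_weights = np.random.randn(4, 4).astype(np.float32)
x = np.random.randn(4).astype(np.float32)
output = x @ binarize(shadow_weights)   # binarized weights used in the forward pass
```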
Now, my final comments.
The use of 1-bit or low-bit is not only valuable for SLMs but can benefit LLMs as well. Imagine how much bigger we can go with LLMs if we can reduce their footprint. You can stretch the underlying vast resources even further. Maybe this would help us toward reaching the vaunted goal of attaining artificial general intelligence (AGI).
Go, 1-bit, go.
Lastly, you might remember the TV show where you had to try and name a tune by hearing a few of the starting notes (many variations of that show still exist today). One contestant would say they could do it in five notes. A challenger would say they could do it in three notes. A brave soul would speak up and say they could do it in one note. Nerve-racking and gutsy.
What do you think, can we achieve full-on LLMs in 1-bit modes or are we leaning too far over our skis and being a bit overconfident?
Time will tell.