From the abstract: “Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.”
This would allow larger models on limited resources. However, it isn’t a quantization method you can convert existing models to after the fact. It seems models need to be trained from scratch this way, and so far they have only gone as far as 3B parameters. The paper isn’t that long, and it seems they didn’t release the models. It builds on the BitNet paper from October 2023.
“the matrix multiplication of BitNet only involves integer addition, which saves orders of energy cost for LLMs.” (no floating point matrix multiplication necessary)
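To make the "only integer addition" point concrete, here is a minimal sketch of my own (not code from the paper). The function name `ternary_matvec` and the integer activation range are assumptions for illustration. With weights restricted to {-1, 0, 1}, each output element is just a sum of some activations minus a sum of others:

```python
import numpy as np

def ternary_matvec(W, x):
    """Compute W @ x for W with entries in {-1, 0, 1} without any multiplication."""
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        # Add activations where the weight is +1, subtract where it is -1.
        # Weights equal to 0 are skipped entirely.
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))   # random ternary weights
x = rng.integers(-128, 128, size=8)    # integer activations (range is an assumption)

# The addition-only result matches the ordinary matrix product.
assert np.array_equal(ternary_matvec(W, x), W @ x)
```

A real kernel would pack the weights and vectorize this; the loop is only there to show that no multiply is needed.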
“1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint”
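For a rough feel of the capacity side, here is a back-of-the-envelope sketch. It assumes an ideal encoding of log2(3) bits per ternary weight and counts weights only, so it is an upper bound on the savings rather than a figure from the paper. The helper `weight_memory_gb` is a name I made up:

```python
import math

def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# A ternary value carries log2(3) bits of information, which is where "1.58" comes from.
TERNARY_BITS = math.log2(3)

n_params = 3e9  # the largest size mentioned in this thread
fp16_gb = weight_memory_gb(n_params, 16)
ternary_gb = weight_memory_gb(n_params, TERNARY_BITS)

print(f"FP16 weights:    {fp16_gb:.2f} GB")
print(f"Ternary weights: {ternary_gb:.2f} GB")
print(f"Ideal ratio:     {fp16_gb / ternary_gb:.1f}x")
```

Actual savings will be smaller than this ideal ratio, since activations, any layers kept at higher precision, and practical packing schemes (for example 2 bits per weight) are ignored here.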
Edit: an additional FAQ has been published.
So are more bits less important than more parameters? Would a higher parameter count or a higher bit count matter more if the models ended up the same size?
They claim it performs at 1.58 bits about as well as something with 16 bits. I don’t quite get your question. It seems we can make do with less precision / different maths and arrive at the same quality. The total count of parameters isn’t affected. But the numbers now don’t take 16 bits each, but fewer.
They said theirs is “comparable with the 8-bit models”. It’s all tradeoffs. It isn’t clear to me where you should allocate your compute/memory budget. I’ve noticed that full 7B 16-bit models often produce better results for me than some much larger quantized models. It will be interesting to find the sweet spot.
I can’t find that mention of “8-bit models” anywhere in the paper. Skimming it again, I only see references and comparisons to FP16.
I know these discussions from llama.cpp and ggml quantization. With that you can quantize a model more and more and it becomes worse the lower the precision gets. You can counter that by using a larger model that was more “intelligent” in the first place… With that you can calculate the sweet spot and what gives you the best quality at a certain compute cost or size… A more degraded bigger model, or a less degraded smaller model…
But we don’t have different quantization levels here, just one. And it’s also difficult to compare, as with ggml you take the same model and quantize it to different levels… We also don’t have that here, you can’t take an existing model with this approach and quantize it and compare it to another… You have to train a new model from scratch. And then it’s a different model.
I can’t find a good analogy here… Maybe it’s a bit like asking if the file size of a JPEG image is more important than the resolution… It’s kind of the wrong question. You can compare different compression levels of the JPEG image, or compare the size of the JPEG to a BMP file… It’s really not a good analogy, but a BMP file with 20 times the size looks exactly like a smaller JPEG file on the screen. And you can also have a 7B parameter LLM give better answers than a poor (or older) 13B model. It’s neither just parameter count nor precision alone.
So if they say they can do with less than a third of the RAM and compute time and simultaneously score a tiny bit higher in the benchmarks, I don’t see a tradeoff here.
Generally speaking you can ask the question: What delivers the best results at a given compute cost? Or the other way around: What has the lowest cost to arrive at a certain point? But this is kind of a different technique: same parameter count, same results, but significantly lower computing cost on inference.
(And reading all the speculation elsewhere: There might be a different tradeoff. The authors didn’t talk about training and just made very small models. A more complex and expensive training process could be a tradeoff.)
Apparently I am an idiot and read the wrong paper. It is the previous paper that says “comparable with the 8-bit models”:
https://huggingface.co/papers/2310.11453