With Llama kicking things off, development has been ridiculously fast in the self-hosted text model space. The hardware requirements are coming down, but they're still fairly steep. You can either put up with painfully slow CPU generation, or, if you have 24GB+ of VRAM, you really open up the GPU options.
7B models can sorta run in 12GB, but they’re not great. You really want at least 13B, which needs 24GB of VRAM… or you run it on the CPU. Some of them are getting close to ChatGPT quality, so this is definitely not a space to sleep on, and I feel like the fediverse would appreciate the idea of self-hosting its own chat bots. Some of these models have a ridiculously large context window, so they actually remember what you’re talking about really well.
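If you want a rough feel for where VRAM numbers like these come from, here's a back-of-the-envelope sketch. This is my own assumption, not something from any guide: the weights alone take roughly (parameter count) × (bytes per weight). It ignores the context cache and runtime overhead, so real usage will be higher. `weight_memory_gb` is just a made-up helper name for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes).

    Assumption: every weight is stored at the same precision, and nothing
    else (context cache, activations, framework overhead) is counted.
    """
    bytes_per_weight = bits_per_weight / 8
    # billions of params * bytes per param = billions of bytes = GB
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for size in (7, 13):
        for bits in (16, 8, 4):
            gb = weight_memory_gb(size, bits)
            print(f"{size}B @ {bits}-bit: ~{gb:.1f} GB for weights")
```

By this estimate a 7B model is about 14 GB at 16-bit and 7 GB at 8-bit, while a 13B model is about 26 GB at 16-bit, 13 GB at 8-bit, and 6.5 GB at 4-bit. That's before any overhead, which is why a 12GB card gets tight fast.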
A good starting point is this rentry: https://rentry.org/local_LLM_guide
I’m admittedly not great with these yet (and my GPU only has 12GB), but the tech is really fascinating and I hope we can get some good discussions going around it.