This is my step-by-step guide to replicating the fine-tuning of the example datasets using axolotl.

Last I checked, the bitsandbytes library copy was still needed and open-llama-3b was still problematic to quantize, but hopefully those issues will be solved at some point.

What I didn’t know when I first wrote this post was that it is possible to load the fine-tuned LoRA adapter file directly in a frontend like text-generation-webui. I have since updated the text to account for that. Loading just the QLoRA adapter in the webui has performance side effects beyond the longer load time. This post should show how fast text inference was, in tokens/s with little context, when using the transformers library with the source model in f16 or quantized to 8-bit and 4-bit, and how fast I can run a merged q4_0 quantization.
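
For reference, a tokens/s figure can be computed by timing a generation call and dividing the number of new tokens by the elapsed time. The sketch below is a minimal illustration, not the method used by text-generation-webui or any other tool: `tokens_per_second` and `fake_generate` are hypothetical names of my own, and the helper assumes the generation function returns the number of tokens it produced.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int]) -> float:
    """Time one call to `generate` and return new tokens per second.

    `generate` is a hypothetical zero-argument callable that runs inference
    and returns how many new tokens it produced.
    """
    start = time.perf_counter()
    new_tokens = generate()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return new_tokens / elapsed


def fake_generate() -> int:
    # Stand-in for a real model call: pretend to emit 50 tokens in about 0.1 s.
    time.sleep(0.1)
    return 50


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

With a real model, `generate` would wrap the actual generation call and return only the count of newly generated tokens, excluding the prompt.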