Please wait before making the next version based on my lossless dataset
If you planned on using my version 2 dataset (losslessmegacodetraining) for your next model, I would highly recommend waiting a little while, because a version 3 is coming out very soon. I already have all the data necessary to make it; I just have to do a little bit of editing to compile it when I have time. It will be closer to 2m lines of data in a 50%-50% coding/non-coding split, as opposed to the current losslessmegacodetraining dataset, which is at 1m lines with only a 25%-75% coding/non-coding split.
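For a rough sense of what those splits mean in line counts, here is a quick sketch. It assumes the totals are exactly 1m and 2m lines (version 3 is only "closer to 2m", so treat its numbers as estimates), and `split_counts` is just an illustrative helper, not part of any dataset tooling.

```python
def split_counts(total_lines, coding_fraction):
    """Return approximate (coding, non_coding) line counts for a dataset."""
    coding = round(total_lines * coding_fraction)
    return coding, total_lines - coding

# Assumed round totals taken from the post; the real files may differ.
v2 = split_counts(1_000_000, 0.25)  # version 2: 25% coding, 75% non-coding
v3 = split_counts(2_000_000, 0.50)  # version 3: 50% coding, 50% non-coding

print("v2 (coding, non-coding):", v2)
print("v3 (coding, non-coding):", v3)
```

Under those assumptions, version 3 would have roughly four times as many coding lines as version 2.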
@juyongjiang I don't know if you are still active, but I made a few new datasets. I would recommend fine-tuning either the codellama-python-13b or the wizardcoder-python-13b model on them to create your next model. Personally, I would start with the wizardcoder model, since it already has great coding performance, and build up from there with my dataset.
For code only:
For code + non-code instructions in an 80%/20% split: