
model_shp4_dpo1

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. It achieves the following results on the evaluation set:

  • Loss: 1.7562
  • Rewards/chosen: -8.5308
  • Rewards/rejected: -8.4695
  • Rewards/accuracies: 0.5200
  • Rewards/margins: -0.0613
  • Logps/rejected: -331.4397
  • Logps/chosen: -314.9189
  • Logits/rejected: -1.1613
  • Logits/chosen: -1.1692
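The card ships without usage code. Below is a minimal loading sketch, assuming that this repo (guoyu-zhang/model_shp4_dpo1) hosts a PEFT adapter for meta-llama/Llama-2-7b-chat-hf, consistent with the PEFT version listed under Framework versions; the prompt and generation settings are illustrative only.

```python
# Minimal loading sketch (not from the card): assumes this repo is a
# PEFT adapter on top of meta-llama/Llama-2-7b-chat-hf.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-chat-hf"
adapter_id = "guoyu-zhang/model_shp4_dpo1"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the fine-tuned adapter weights to the frozen base model.
model = PeftModel.from_pretrained(base, adapter_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```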

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0005
  • train_batch_size: 4
  • eval_batch_size: 1
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 1000
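As a hedged illustration of how these hyperparameters might map onto a training script: the metric names (Rewards/chosen, Rewards/rejected, Rewards/margins) match trl's DPOTrainer, so the sketch below uses it, but the card does not name the framework. Only the bulleted hyperparameters above come from the card; the dataset, LoRA settings, and beta are placeholder assumptions.

```python
# Hypothetical reproduction sketch; dataset, LoRA config, and beta are assumed.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Stand-in preference pairs; the card lists the training data as unknown.
pairs = Dataset.from_dict({
    "prompt": ["Example prompt"],
    "chosen": ["A preferred response."],
    "rejected": ["A dispreferred response."],
})

args = TrainingArguments(
    output_dir="model_shp4_dpo1",
    learning_rate=5e-4,               # learning_rate: 0.0005
    per_device_train_batch_size=4,    # train_batch_size: 4
    per_device_eval_batch_size=1,     # eval_batch_size: 1
    seed=42,
    gradient_accumulation_steps=4,    # 4 * 4 = total_train_batch_size 16
    lr_scheduler_type="cosine",
    warmup_steps=100,                 # lr_scheduler_warmup_steps: 100
    max_steps=1000,                   # training_steps: 1000
    optim="adamw_torch",              # Adam, betas=(0.9, 0.999), eps=1e-8 (defaults)
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                   # with peft_config set, trl uses the frozen base as reference
    args=args,
    beta=0.1,                         # assumed; not stated in this card
    train_dataset=pairs,
    eval_dataset=pairs,
    tokenizer=tokenizer,
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32),  # assumed adapter settings
)
trainer.train()
```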

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0922 | 2.67 | 100 | 1.1410 | -4.1724 | -4.0602 | 0.5600 | -0.1122 | -287.3470 | -271.3348 | -0.9400 | -0.9462 |
| 0.0014 | 5.33 | 200 | 1.6279 | -8.0256 | -7.9377 | 0.5400 | -0.0879 | -326.1222 | -309.8669 | -1.2061 | -1.2156 |
| 0.0001 | 8.0 | 300 | 1.6781 | -7.8271 | -7.7492 | 0.4900 | -0.0780 | -324.2366 | -307.8824 | -1.1931 | -1.2019 |
| 0.0001 | 10.67 | 400 | 1.7244 | -8.2046 | -8.1268 | 0.5100 | -0.0778 | -328.0134 | -311.6574 | -1.1773 | -1.1864 |
| 0.0001 | 13.33 | 500 | 1.7449 | -8.3826 | -8.3126 | 0.5100 | -0.0701 | -329.8707 | -313.4376 | -1.1689 | -1.1774 |
| 0.0001 | 16.0 | 600 | 1.7522 | -8.4707 | -8.4001 | 0.5100 | -0.0706 | -330.7461 | -314.3180 | -1.1649 | -1.1729 |
| 0.0001 | 18.67 | 700 | 1.7553 | -8.5177 | -8.4517 | 0.5200 | -0.0659 | -331.2625 | -314.7882 | -1.1626 | -1.1704 |
| 0.0001 | 21.33 | 800 | 1.7608 | -8.5360 | -8.4723 | 0.5200 | -0.0637 | -331.4679 | -314.9713 | -1.1608 | -1.1692 |
| 0.0001 | 24.0 | 900 | 1.7653 | -8.5361 | -8.4664 | 0.5200 | -0.0697 | -331.4087 | -314.9720 | -1.1617 | -1.1693 |
| 0.0001 | 26.67 | 1000 | 1.7562 | -8.5308 | -8.4695 | 0.5200 | -0.0613 | -331.4397 | -314.9189 | -1.1613 | -1.1692 |
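For reference, the Rewards/* columns follow the convention of trl's DPOTrainer (an assumption, since the card does not name the framework): each reward is the beta-scaled log-probability ratio between the policy and the reference model. Under that reading, the persistently negative Rewards/margins indicate the tuned policy assigns slightly higher implicit reward to rejected responses than to chosen ones on this evaluation set.

```latex
% Implicit DPO reward (assuming trl's DPOTrainer conventions):
r(x, y) = \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]
% Reported margin:
\text{Rewards/margins} = r(x, y_{\mathrm{chosen}}) - r(x, y_{\mathrm{rejected}})
```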

Framework versions

  • PEFT 0.10.0
  • Transformers 4.39.1
  • PyTorch 2.2.1+cu121
  • Datasets 2.18.0
  • Tokenizers 0.15.2