LiyuanLucasLiu committed f8b8aeb (parent: 70f36dd)
"uploaded tech report and revised readme"

Files changed:
- .gitattributes +1 -0
- GRIN_MoE.pdf +3 -0
- README.md +4 -6
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
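The added rule routes PDFs through the Git LFS filter, which is what lets the tech report below be stored as a pointer. A minimal Python sketch (not part of the commit) of applying the same rule programmatically, assuming you are scripting repo setup rather than running `git lfs track "*.pdf"`, which writes an equivalent line:

```python
from pathlib import Path

# The rule added in this commit; `git lfs track "*.pdf"` writes the same line.
RULE = "*.pdf filter=lfs diff=lfs merge=lfs -text\n"

attributes = Path(".gitattributes")
existing = attributes.read_text() if attributes.exists() else ""
if RULE.strip() not in existing:
    # Append rather than rewrite, so the existing LFS patterns are kept.
    attributes.write_text(existing + RULE)
```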
GRIN_MoE.pdf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39e878f28a2bdd362f0bbe0bc0fa2ef9b827551d74e9a617a18e2b3923abb322
+size 1971199
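What the repo actually stores is a Git LFS pointer: the spec version, the SHA-256 of the real PDF, and its size in bytes (1971199, i.e. a ~2 MB report). A minimal sketch of checking a downloaded file against such a pointer; the paths in the example are hypothetical, and only the version/oid/size fields come from the format above:

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer: one 'key value' pair per line."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "version": fields["version"],
        "oid": fields["oid"].removeprefix("sha256:"),  # Python 3.9+
        "size": int(fields["size"]),
    }

def verify(pointer_path: str, blob_path: str) -> bool:
    """Check the real file's size and SHA-256 digest against the pointer."""
    meta = parse_lfs_pointer(Path(pointer_path).read_text())
    blob = Path(blob_path).read_bytes()
    return len(blob) == meta["size"] and hashlib.sha256(blob).hexdigest() == meta["oid"]

# e.g. verify("GRIN_MoE.pdf.pointer", "GRIN_MoE.pdf")  # hypothetical paths
```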
README.md
CHANGED
@@ -17,16 +17,14 @@ library_name: transformers
 <h1 align="center"> 😁 MoE</h1>
 <h4 align="center">GRIN: <em>GR</em>adient-<em>IN</em>formed MoE</h4>
 <p align="center">
-<a href="https://huggingface.co/microsoft/GRIN-MoE">Hugging Face</a>  |   <a href="https://
+<a href="https://huggingface.co/microsoft/GRIN-MoE">Hugging Face</a>  |   <a href="https://huggingface.co/microsoft/GRIN-MoE/blob/main/GRIN_MoE.pdf"> Tech Report</a>  |   <a href="https://huggingface.co/microsoft/GRIN-MoE/blob/main/LICENSE">License</a>  |   <a href="https://github.com/microsoft/GRIN-MoE">Github</a>   |   <a href="https://huggingface.co/microsoft/GRIN-MoE#usage">Get Started</a> 
 <br>

-GRIN MoE
-It achieves exceptionally good performance across a diverse set of tasks, particularly in coding and mathematics tasks.
-Comparing to conventional MoE training, GRIN MoE differs in mostly two ways:
+- With **only 6.6B** active parameters, GRIN MoE achieves **exceptionally good** performance across a diverse set of tasks, particularly in coding and mathematics tasks.

-- GRIN uses SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
+- GRIN uses **SparseMixer-v2** to estimate the gradient related to expert routing, while conventional MoE training treats expert gating as a proxy for the gradient estimation.

-- GRIN scales MoE training with neither expert parallelism nor token dropping
+- GRIN scales MoE training with **neither expert parallelism nor token dropping**, while conventional MoE training employs expert parallelism and token dropping.

 ## Intended Uses

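For context on the SparseMixer-v2 bullet: in conventional MoE training the top-k routing choice itself is non-differentiable, so the router is trained only through the gate probability that scales each expert's output. A minimal PyTorch sketch of that conventional baseline (the scheme the README contrasts against, not SparseMixer-v2; all sizes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, experts, gate_w, k=2):
    """Conventional top-k gating: x is [tokens, dim], gate_w is [dim, n_experts]."""
    probs = F.softmax(x @ gate_w, dim=-1)          # router probabilities
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # hard, non-differentiable choice
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                # Scaling by the gate probability is the only path through
                # which the router receives gradients: the gate value stands
                # in as a proxy for the gradient of the discrete routing step.
                out[mask] += topk_probs[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

experts = [torch.nn.Linear(8, 8) for _ in range(4)]
gate_w = torch.randn(8, 4, requires_grad=True)
y = moe_forward(torch.randn(5, 8), experts, gate_w, k=2)
y.sum().backward()  # gradients reach gate_w only via the gate-prob scaling
```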