danielschnell committed
Commit 9b39732
1 parent: d2e439b

Copied from Clarin: http://hdl.handle.net/20.500.12537/227

Use original Readme.txt => README.md

Signed-off-by: Daniel Schnell <[email protected]>

Files changed (3)
  1. .gitattributes +1 -0
  2. 10_trials_optim_kenlm.scorer +3 -0
  3. README.md +89 -3
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ 10_trials_optim_kenlm.scorer filter=lfs diff=lfs merge=lfs -text
10_trials_optim_kenlm.scorer ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11
+ size 1043308192
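The three added lines are a Git LFS pointer: the repository stores only the SHA-256 digest and byte size of the scorer, and the roughly 1 GB file itself is fetched separately. A minimal sketch of how a downloaded copy could be checked against this pointer follows. The function names `parse_lfs_pointer` and `matches_pointer` are illustrative and not part of any Git LFS tooling.

```python
import hashlib
import os

# Pointer contents as recorded in this commit.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11
size 1043308192
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError("unsupported oid algorithm: " + algorithm)
    return {
        "version": fields["version"],
        "sha256": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Hash in chunks so a ~1 GB scorer is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["sha256"]
```

Assuming the scorer was downloaded to the current directory, `matches_pointer("10_trials_optim_kenlm.scorer", parse_lfs_pointer(POINTER_TEXT))` should return `True` for an intact copy.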
README.md CHANGED
@@ -1,3 +1,89 @@
- ---
- license: cc-by-4.0
- ---
+ -------------------------------------------------------------------------------
+ DeepSpeech Scorer for Icelandic 22.06
+ -------------------------------------------------------------------------------
+
+ Authors : Carlos Daniel Hernández Mena ([email protected]).
+
+ Language : Icelandic.
+
+ Recommended use : speech recognition.
+
+ -------------------------------------------------------------------------------
+ Description
+ -------------------------------------------------------------------------------
+
+ "DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers
+ based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file
+ used to perform language modeling. It is composed of two sub-components: a
+ KenLM language model and a trie data structure containing all words in the
+ vocabulary [2].
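As a rough illustration of how a scorer file is attached to a recognizer, the sketch below uses the `deepspeech` Python package (0.9 series), where `Model.enableExternalScorer` loads the scorer before decoding. This is an assumption about a typical setup, not the recipe's own code: the acoustic model path and audio path are hypothetical, and an Icelandic acoustic model must be obtained separately because this commit ships only the scorer.

```python
import wave


def read_pcm16_mono(wav_path):
    """Return (sample_rate, raw_bytes) for a 16-bit mono PCM WAV file."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM audio")
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe(model_path, scorer_path, wav_path):
    """Decode one WAV file with an acoustic model plus an external scorer."""
    # Imported lazily so read_pcm16_mono works without these packages.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    # Attach the KenLM language model and vocabulary trie bundled in the scorer.
    model.enableExternalScorer(scorer_path)

    rate, frames = read_pcm16_mono(wav_path)
    if rate != model.sampleRate():
        raise ValueError("audio sample rate does not match the model")
    return model.stt(np.frombuffer(frames, dtype=np.int16))
```

A call might look like `transcribe("acoustic_model.pbmm", "10_trials_optim_kenlm.scorer", "utterance.wav")`, where the first and last file names are placeholders.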
+
+ This scorer was originally created to be used with the following DeepSpeech
+ recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University
+ in 2022:
+
+ https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur
+
+ Nevertheless, because resources of this kind are flexible and can be applied
+ to other tasks, systems or code recipes, it was decided to publish this
+ scorer as an independent item.
+
+ -------------------------------------------------------------------------------
+ The Language Model
+ -------------------------------------------------------------------------------
+
+ The language model was created using the Icelandic Gigaword Corpus [3]. The
+ Gigaword corpus contains text from newspaper articles, parliamentary speeches,
+ adjudications, books, transcribed radio/television news and more. The
+ sentences used to generate the language model were normalized by allowing
+ only characters belonging to the Icelandic alphabet, expanding numbers and
+ abbreviations, and removing punctuation marks [4]. The resulting text is more
+ than 44 million lines long (approximately 5.3 GB), and it was used to create
+ the scorer.
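The normalization described above can be sketched roughly as follows. This is not the pipeline from [4]: it assumes that a line containing any character outside the Icelandic alphabet is rejected outright, and it does not attempt to expand numbers or abbreviations, so lines with digits are dropped as well. The name `normalize_line` is illustrative.

```python
import unicodedata

# The 32 letters of the modern Icelandic alphabet (c, q, w and z are excluded).
ICELANDIC_ALPHABET = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def normalize_line(line):
    """Lowercase a line, strip punctuation, and return None if it is rejected."""
    text = unicodedata.normalize("NFC", line).lower()
    cleaned = []
    for char in text:
        category = unicodedata.category(char)
        # Unicode categories starting with P or S are punctuation and symbols.
        if category.startswith("P") or category.startswith("S"):
            cleaned.append(" ")
        else:
            cleaned.append(char)
    words = "".join(cleaned).split()
    if not words:
        return None
    for word in words:
        # Assumption: any foreign letter or digit disqualifies the whole line.
        if any(char not in ICELANDIC_ALPHABET for char in word):
            return None
    return " ".join(words)
```

Under these assumptions, `normalize_line("Halló, heimur!")` yields `"halló heimur"`, while a line such as `"Hello world"` is rejected because `w` is not in the alphabet.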
+
+ -------------------------------------------------------------------------------
+ Citation
+ -------------------------------------------------------------------------------
+
+ When publishing results based on this scorer, please cite:
+
+ Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download.
+ Reykjavik University: Language and Voice Lab, 2022.
+
+ Contact: Carlos Mena ([email protected])
+
+ License: CC BY 4.0
+
+ -------------------------------------------------------------------------------
+ Acknowledgements
+ -------------------------------------------------------------------------------
+
+ This initiative was funded by the Language Technology Programme for Icelandic
+ 2019-2023. The programme, which is managed and coordinated by Almannarómur,
+ is funded by the Icelandic Ministry of Education, Science and Culture.
+
+ -------------------------------------------------------------------------------
+ References
+ -------------------------------------------------------------------------------
+
+ [1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
+ E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end
+ speech recognition in English and Mandarin. In International Conference
+ on Machine Learning (pp. 173-182). PMLR.
+
+ [2] Mozilla's DeepSpeech online documentation:
+ https://deepspeech.readthedocs.io/en/r0.9/Scorer.html
+
+ [3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
+ & Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
+ corpus. In Proceedings of the Eleventh International Conference on
+ Language Resources and Evaluation (LREC 2018).
+
+ [4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
+ J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
+ In Proceedings of the Eleventh International Conference on Language
+ Resources and Evaluation (LREC 2018).
+
+ -------------------------------------------------------------------------------
+ -------------------------------------------------------------------------------
+