---
license: mit
---

# IEPile: A Large-Scale Information Extraction Corpus

This is the official repository for [IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus](https://arxiv.org/abs/2402.14710).
## News

* [2024/02] We released a large-scale (0.32B tokens) high-quality bilingual (Chinese and English) Information Extraction (IE) instruction dataset named [IEPile](https://huggingface.co/datasets/zjunlp/iepie), along with two models trained on `IEPile`: [baichuan2-13b-iepile-lora](https://huggingface.co/zjunlp/baichuan2-13b-iepile-lora) and [llama2-13b-iepile-lora](https://huggingface.co/zjunlp/llama2-13b-iepile-lora).
* [2023/10] We released a new bilingual (Chinese and English) theme-based Information Extraction (IE) instruction dataset named [InstructIE](https://huggingface.co/datasets/zjunlp/InstructIE) with [paper](https://arxiv.org/abs/2305.11527).
* [2023/08] We introduced a dedicated 13B model for Information Extraction (IE), named [knowlm-13b-ie](https://huggingface.co/zjunlp/knowlm-13b-ie/tree/main).
* [2023/05] We initiated an instruction-based Information Extraction project.

## Data Format

In `IEPile`, each **instruction** adopts a JSON-like string format: essentially a dictionary-type string composed of the following three main components:

(1) **`'instruction'`**: The task description, which specifies the task to be performed (one of `NER`, `RE`, `EE`, `EET`, `EEA`).

(2) **`'schema'`**: A list of schemas to be extracted (`entity types`, `relation types`, `event types`).

(3) **`'input'`**: The text from which information is to be extracted.
We recommend that you keep the number of schemas in each instruction fixed: 6 for NER, and 4 for RE, EE, EET, and EEA, as these are the quantities we used in training.
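For example, a complete NER instruction with the schema `["person", "organization", "else", "location"]` serializes to the following JSON-like string:

```json
{"instruction": "You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.", "schema": ["person", "organization", "else", "location"], "input": "284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )"}
```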
```python
import json

instruction_mapper = {
    'NERzh': "你是专门进行实体抽取的专家。请从input中抽取出符合schema定义的实体,不存在的实体类型返回空列表。请按照JSON字符串的格式回答。",
    'REzh': "你是专门进行关系抽取的专家。请从input中抽取出符合schema定义的关系三元组,不存在的关系返回空列表。请按照JSON字符串的格式回答。",
    'EEzh': "你是专门进行事件提取的专家。请从input中抽取出符合schema定义的事件,不存在的事件返回空列表,不存在的论元返回NAN,如果论元存在多值请返回列表。请按照JSON字符串的格式回答。",
    'EETzh': "你是专门进行事件提取的专家。请从input中抽取出符合schema定义的事件类型及事件触发词,不存在的事件返回空列表。请按照JSON字符串的格式回答。",
    'EEAzh': "你是专门进行事件论元提取的专家。请从input中抽取出符合schema定义的事件论元及论元角色,不存在的论元返回NAN或空字典,如果论元存在多值请返回列表。请按照JSON字符串的格式回答。",

    'NERen': "You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.",
    'REen': "You are an expert in relationship extraction. Please extract relationship triples that match the schema definition from the input. Return an empty list for relationships that do not exist. Please respond in the format of a JSON string.",
    'EEen': "You are an expert in event extraction. Please extract events from the input that conform to the schema definition. Return an empty list for events that do not exist, and return NAN for arguments that do not exist. If an argument has multiple values, please return a list. Respond in the format of a JSON string.",
    'EETen': "You are an expert in event extraction. Please extract event types and event trigger words from the input that conform to the schema definition. Return an empty list for non-existent events. Please respond in the format of a JSON string.",
    'EEAen': "You are an expert in event argument extraction. Please extract event arguments and their roles from the input that conform to the schema definition, which already includes event trigger words. If an argument does not exist, return NAN or an empty dictionary. Please respond in the format of a JSON string.",
}
split_num_mapper = {'NER': 6, 'RE': 4, 'EE': 4, 'EET': 4, 'EEA': 4}

task = 'NER'
language = 'en'
schema = ['person', 'organization', 'else', 'location']
split_num = split_num_mapper[task]
# Split the schema list into chunks of the recommended size
split_schemas = [schema[i:i+split_num] for i in range(0, len(schema), split_num)]
input_text = '284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )'
sintructs = []
for split_schema in split_schemas:
    sintruct = json.dumps({'instruction': instruction_mapper[task+language], 'schema': split_schema, 'input': input_text}, ensure_ascii=False)
    sintructs.append(sintruct)
```
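The splitting logic above can be wrapped in a small helper. The sketch below is illustrative (the helper name `build_instructions` and the extra relation types `spouse`, `founder`, and `author` are ours, not part of the repository): a 7-type RE schema with `split_num` 4 yields two instructions covering 4 and 3 types.

```python
import json

split_num_mapper = {'NER': 6, 'RE': 4, 'EE': 4, 'EET': 4, 'EEA': 4}

def build_instructions(task, schema, text, task_prompt):
    # Chunk the schema list into the recommended size for this task,
    # then serialize one instruction dict per chunk.
    n = split_num_mapper[task]
    chunks = [schema[i:i + n] for i in range(0, len(schema), n)]
    return [json.dumps({'instruction': task_prompt, 'schema': chunk, 'input': text},
                       ensure_ascii=False)
            for chunk in chunks]

# 7 relation types with split_num 4 -> two instructions (4 + 3 types)
relations = ['neighborhood of', 'nationality', 'children', 'place of death',
             'spouse', 'founder', 'author']
sintructs = build_instructions('RE', relations, 'some input text', 'RE task prompt')
```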
<details>
<summary><b>More Task Schemas</b></summary>

RE schema: ["neighborhood of", "nationality", "children", "place of death"]

EE schema: [{"event_type": "potential therapeutic event", "trigger": True, "arguments": ["Treatment.Time_elapsed", "Treatment.Route", "Treatment.Freq", "Treatment", "Subject.Race", "Treatment.Disorder", "Effect", "Subject.Age", "Combination.Drug", "Treatment.Duration", "Subject.Population", "Subject.Disorder", "Treatment.Dosage", "Treatment.Drug"]}, {"event_type": "adverse event", "trigger": True, "arguments": ["Subject.Population", "Subject.Age", "Effect", "Treatment.Drug", "Treatment.Dosage", "Treatment.Freq", "Subject.Gender", "Treatment.Disorder", "Subject", "Treatment", "Treatment.Time_elapsed", "Treatment.Duration", "Subject.Disorder", "Subject.Race", "Combination.Drug"]}]

EET schema: ["potential therapeutic event", "adverse event"]

EEA schema: [{"event_type": "potential therapeutic event", "arguments": ["Treatment.Time_elapsed", "Treatment.Route", "Treatment.Freq", "Treatment", "Subject.Race", "Treatment.Disorder", "Effect", "Subject.Age", "Combination.Drug", "Treatment.Duration", "Subject.Population", "Subject.Disorder", "Treatment.Dosage", "Treatment.Drug"]}, {"event_type": "adverse event", "arguments": ["Subject.Population", "Subject.Age", "Effect", "Treatment.Drug", "Treatment.Dosage", "Treatment.Freq", "Subject.Gender", "Treatment.Disorder", "Subject", "Treatment", "Treatment.Time_elapsed", "Treatment.Duration", "Subject.Disorder", "Subject.Race", "Combination.Drug"]}]

</details>

## Using baichuan2-13b-iepile-lora
```python
import torch
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig
)
from peft import PeftModel

model_path = 'models/Baichuan2-13B-Chat'
lora_path = 'lora/baichuan2-13b-iepile-lora'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(
    model,
    lora_path,
)
model.eval()

sintruct = "{\"instruction\": \"You are an expert in named entity recognition. Please extract entities that match the schema definition from the input. Return an empty list if the entity type does not exist. Please respond in the format of a JSON string.\", \"schema\": [\"person\", \"organization\", \"else\", \"location\"], \"input\": \"284 Robert Allenby ( Australia ) 69 71 71 73 , Miguel Angel Martin ( Spain ) 75 70 71 68 ( Allenby won at first play-off hole )\"}"
# Wrap the instruction in Baichuan2's user/assistant special tokens
sintruct = '<reserved_106>' + sintruct + '<reserved_107>'

# Move the input to the model's device before generating
input_ids = tokenizer.encode(sintruct, return_tensors="pt").to(model.device)
input_length = input_ids.size(1)
generation_output = model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_length=512, max_new_tokens=256, return_dict_in_generate=True))
generation_output = generation_output.sequences[0]
# Strip the prompt tokens, keeping only the newly generated answer
generation_output = generation_output[input_length:]
output = tokenizer.decode(generation_output, skip_special_tokens=True)

print(output)
```
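The model is prompted to answer with a JSON string, so the raw `output` can be parsed back into a dictionary. A minimal sketch, assuming the reply is valid JSON mapping each schema type to a list of mentions (the helper `parse_output` and the demo string are illustrative, not part of the repository):

```python
import json

def parse_output(output, schema):
    # Parse the model's JSON-string answer; fall back to empty lists
    # if the reply is not valid JSON, and fill in any missing types.
    try:
        result = json.loads(output)
    except json.JSONDecodeError:
        return {t: [] for t in schema}
    return {t: result.get(t, []) for t in schema}

demo = '{"person": ["Robert Allenby", "Miguel Angel Martin"], "location": ["Australia", "Spain"]}'
parsed = parse_output(demo, ['person', 'organization', 'else', 'location'])
```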
If your GPU has limited memory, you can use quantization to reduce memory usage. Below is the inference setup using 4-bit quantization.

```python
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    load_in_4bit=True,
    device_map="auto",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = PeftModel.from_pretrained(
    model,
    lora_path,
)
model.eval()
```