gbyuvd committed
Commit f65e301
1 Parent(s): f388626

Update README.md

Files changed (1): README.md (+56 -6)
README.md CHANGED
@@ -13,6 +13,56 @@ license: cc-by-nc-sa-4.0
  - Hands-on learning, research and experimentation in molecular generation
  - Baseline for ablation studies and comparisons with more advanced models
 
+ ## Use
+
+ ```python
+ from transformers import PreTrainedTokenizerFast, AutoModelForCausalLM
+ import torch
+
+ # Load the tokenizer from the repo's tokenizer file
+ tokenizer = PreTrainedTokenizerFast(
+     tokenizer_file="gpt2_tokenizer.json",
+     model_max_length=512,
+     unk_token="<unk>",
+     pad_token="<pad>",
+     eos_token="</s>",
+     bos_token="<s>",
+     mask_token="<mask>",
+ )
+
+ model = AutoModelForCausalLM.from_pretrained("gbyuvd/chemfie-gpt-experiment-1")
+
+ # Generate some sample outputs
+ def generate_molecules(model, tokenizer, num_samples=5, max_length=100):
+     model.eval()
+     generated = []
+     for _ in range(num_samples):
+         # Seed each sequence with the BOS token, then sample a completion
+         input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device)
+         output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, do_sample=True)
+         generated.append(tokenizer.decode(output[0], skip_special_tokens=True))
+     return generated
+
+ sample_molecules = generate_molecules(model, tokenizer)
+ print("Sample generated molecules:")
+ for i, mol in enumerate(sample_molecules, 1):
+     print(f"{i}. {mol}")
+ ```
+
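+ Generation can be steered through `generate`'s optional sampling parameters. The values below are illustrative starting points, not settings tuned for this checkpoint:
+
+ ```python
+ # Illustrative (untuned) sampling settings: lower temperature makes sampling
+ # more conservative; top_k/top_p restrict the candidate token pool.
+ input_ids = torch.tensor([[tokenizer.bos_token_id]]).to(model.device)
+ output = model.generate(
+     input_ids,
+     max_length=100,
+     do_sample=True,
+     temperature=0.8,  # example value, not tuned for this model
+     top_k=50,
+     top_p=0.95,
+     pad_token_id=tokenizer.pad_token_id,  # avoids the missing-pad-token warning
+ )
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
+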
+ Converting a tokenized SELFIES output back to SMILES:
+
+ ```python
+ import selfies as sf
+
+ test = "[C] [Branch1] [O] [=C] [C] [C] [C] [C] [C] [C] [C] [=Branch1] [=O] [O] [=C] [C] [C] [C] [Ring1]"
+ # Strip the spaces between tokens before decoding
+ test = test.replace(' ', '')
+ print(sf.decoder(test))
+
+ """
+ C(CCCCCCCCO)=CCC=C
+ """
+ ```
+
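+ SELFIES is designed so that decoding yields syntactically valid strings, but round-tripping the result through RDKit gives a canonical SMILES and flags anything RDKit cannot parse. A minimal sketch, assuming RDKit is installed (`pip install rdkit`); `tokens_to_canonical_smiles` is an illustrative helper name, not part of this repo:
+
+ ```python
+ import selfies as sf
+ from rdkit import Chem
+
+ def tokens_to_canonical_smiles(tokens: str):
+     # Illustrative helper: decode a space-separated SELFIES token string
+     # and return canonical SMILES, or None if RDKit cannot parse it.
+     smiles = sf.decoder(tokens.replace(' ', ''))
+     mol = Chem.MolFromSmiles(smiles)
+     return Chem.MolToSmiles(mol) if mol is not None else None
+
+ print(tokens_to_canonical_smiles("[C] [C] [O]"))  # CCO (ethanol)
+ ```
+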
  ## Training Data
  - **Source**: Curated and merged from the COCONUTDB (Sorokina et al., 2021), ChEMBL34 (Zdrazil et al., 2023), and SuperNatural3 (Gallo et al., 2023) databases
  - **Total**: 2,346,680 samples
 
@@ -28,12 +78,12 @@ license: cc-by-nc-sa-4.0
  ## Training Logs
 
 
- | Chunk | Training Loss | Validation Loss | Status |
- | ----- | ------------- | --------------- | --------- |
- | I | 1.346400 | 1.065180 | Done |
- | II | | | Ongoing |
- | III | | | Scheduled |
- | IV | | | Scheduled |
+ | Chunk | Training Loss | Validation Loss | Status |
+ | :---: | :-----------: | :-------------: | :-------: |
+ | I | 1.346400 | 1.065180 | Done |
+ | II | | | Ongoing |
+ | III | | | Scheduled |
+ | IV | | | Scheduled |
 
 
  ## Evaluation Results