evantso committed on
Commit
9d21b7b
1 Parent(s): b6700b7

Update README.md

  1. README.md +105 -1
README.md CHANGED
@@ -1 +1,105 @@
1
- Hello
1
+ ---
2
+ metrics:
3
+ - accuracy
4
+ - precision
5
+ - recall
6
+ - f1
7
+ pipeline_tag: tabular-classification
8
+ tags:
9
+ - medical
10
+ - biology
11
+ - code
12
+ ---
13
+ # HCC TIIC Random Forest Model
14
+ **Developed by:** Yifu (Evan) Zuo
15
+
16
+ This is a Random Forest classifier that automatically classifies tumor-infiltrating immune cells (TIICs) in hepatocellular carcinoma (HCC) tumor microenvironments into 40 categories based on expression data from 107 CD45+ genes.
17
+
18
+ ## How to use it
19
+
20
+ #### 1. Download the model from Files
21
+ This is straightforward: head to the Files tab of this repository and download the model. The RF model in pickle format is 2.1 GB.
22
+
23
+ #### 2. Create a New Interactive Python Notebook
24
+ Open Jupyter Notebook or Google Colab, and create a new notebook file. This environment will allow you to interactively run Python commands and visualize outputs step-by-step.
25
+
26
+ #### 3. Import Required Libraries
27
+ Start by importing the required libraries in your notebook:
28
+ ```
29
+ import joblib
30
+ import pandas as pd
31
+ from sklearn.impute import SimpleImputer
32
+ import matplotlib.pyplot as plt
33
+ ```
34
+
35
+ These libraries are needed to load the model, handle the data, and create visualizations.
36
+
37
+ #### 4. Load the Downloaded Model
38
+ Use the following command to load the model into your notebook:
39
+ ```
40
+ loaded_rf_model = joblib.load('path_to_downloaded_model.pkl')
41
+ ```
42
+ Replace `'path_to_downloaded_model.pkl'` with the actual file path of the downloaded model.
43
+ #### 5. Load the Data in CSV Format
44
+ Read the expression matrix into a pandas DataFrame:
45
+ `data = pd.read_csv('path_to_csv_file.csv')`
46
+
47
+ • Each row should represent a cell.
48
+
49
+ • Each column should represent a gene.
50
+
51
+ • The required genes must be present in the data (see Step 9 for the full list).
52
+
53
+ Before exporting the data to CSV, make sure the UMI counts for each gene are normalized. As is standard practice, the UMI counts should be scaled to 10,000. R and Seurat are recommended for the conversion to CSV.
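If the counts are still raw when they reach Python, the scaling can be sketched in pandas. This is a minimal sketch, assuming "scaled to 10,000" means that each cell's counts sum to 10,000; the README does not say whether a log transform was also applied, so none is shown. The toy values and the names `raw_counts` and `scale_factor` are illustrative only.

```python
import pandas as pd

# Toy raw UMI counts: rows are cells, columns are genes (values are made up).
raw_counts = pd.DataFrame(
    {"CD3D": [2, 0, 5], "CD8A": [1, 3, 0], "NKG7": [1, 1, 5]},
    index=["cell_1", "cell_2", "cell_3"],
)

# Assumption: every cell is rescaled so that its counts sum to 10,000.
scale_factor = 10_000
cell_totals = raw_counts.sum(axis=1)
normalized = raw_counts.div(cell_totals, axis=0) * scale_factor

print(normalized.round(1))
```

In a real dataset the per-cell totals would normally be computed over all measured genes, not only the genes the model uses.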
54
+
55
+ #### 7. Preprocess the Data for Model Compatibility
56
+ Prepare the data before feeding it to the model.
57
+
58
+ • Replace hyphens in column names with dots:
59
+ ```
60
+ data.columns = data.columns.str.replace('-', '.')
61
+ ```
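As a quick illustration of what this replacement does, the sketch below applies it to a toy DataFrame. The column names and values are made up; `regex=False` is passed explicitly so the hyphen is treated as a literal character in every pandas version.

```python
import pandas as pd

# Toy frame with hyphenated HLA gene names, as they often appear in exports.
df = pd.DataFrame([[0.0, 1.2, 3.4]], columns=["CD3D", "HLA-DRA", "HLA-DQB1"])

# Replace hyphens with dots so the names match the model's feature names.
df.columns = df.columns.str.replace("-", ".", regex=False)

print(list(df.columns))  # ['CD3D', 'HLA.DRA', 'HLA.DQB1']
```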
62
+ • Rename columns to match the feature names the model expects:
63
+ ```
64
+ # Rename columns based on the mapping dictionary
65
+ data.rename(columns=feature_mapping, inplace=True)
66
+ ```
67
+ Ensure that `feature_mapping`, a dictionary that maps your column names to the gene names the model expects, is defined before you run this step.
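The README does not show what `feature_mapping` contains, so the dictionary below is purely hypothetical. It sketches one plausible use: translating common protein aliases in a CSV into the gene symbols used in the feature list.

```python
import pandas as pd

# Hypothetical mapping: keys are column names in your CSV,
# values are the gene symbols the model expects.
feature_mapping = {"CD20": "MS4A1", "CD25": "IL2RA"}

data = pd.DataFrame([[1.0, 2.0, 3.0]], columns=["CD20", "CD25", "CD3D"])

# Columns that are not keys in the mapping (here CD3D) are left unchanged.
data = data.rename(columns=feature_mapping)

print(list(data.columns))  # ['MS4A1', 'IL2RA', 'CD3D']
```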
68
+
69
+ #### 9. Select the Required Features for Prediction
70
+ Define the list of genes to be used by the model:
71
+ ```
72
+ selected_features = ['CD3D', 'CD3E', 'CD3G', 'CCR7', 'LEF1', 'SELL', 'TCF7', 'S1PR1', 'ANXA1', 'ANXA2',
73
+ 'IL7R', 'CD74', 'TYROBP', 'CD4', 'HAVCR2', 'PDCD1', 'GZMB', 'ITGAE', 'CXCL13', 'FOXP3',
74
+ 'CTLA4', 'IL2RA', 'MKI67', 'STMN1', 'CMC1', 'CD8A', 'CD8B', 'CX3CR1', 'KLRG1', 'FCGR3A',
75
+ 'FGFBP2', 'GZMH', 'GZMK', 'CCL4', 'CCL5', 'NKG7', 'KLRD1', 'KLRF1', 'GNLY', 'IL32',
76
+ 'SLC4A10', 'KLRB1', 'ZBTB16', 'NCR3', 'NCAM1', 'CCL3', 'IFNG', 'CD69', 'HSPA1A',
77
+ 'XCL1', 'AREG', 'CD160', 'TIGIT', 'CXCR4', 'ZNF331', 'DNAJB1', 'HSPA1B', 'HSPA6',
78
+ 'TUBB', 'CST3', 'LYZ', 'CD14', 'VCAN', 'S100A9', 'RNASE2', 'S100A12', 'FCER1G', 'LST1',
79
+ 'AIF1', 'IFITM3', 'CD1C', 'FCER1A', 'CLEC10A', 'VEGFA', 'IRF4', 'RGS2', 'CLEC9A',
80
+ 'IRF8', 'IDO1', 'CLNK', 'XCR1', 'LAMP3', 'CD274', 'LTB', 'CCL19', 'CCL21', 'CD68',
81
+ 'THBS1', 'S100A8', 'CD163', 'SIGLEC1', 'C1QA', 'SLC40A1', 'GPNMB', 'APOE', 'SAT1',
82
+ 'HLA.DQB1', 'S100A4', 'HLA.DRA', 'HLA.DQA1', 'MARCO', 'CD79A', 'CPA3', 'KIT', 'CD19',
83
+ 'MS4A1', 'CD22']
84
+ X_test_data = data[selected_features]
85
+ ```
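Indexing with `data[selected_features]` raises a `KeyError` if any gene is missing, so it can help to check first. The sketch below uses a shortened stand-in for the 107-gene list and a toy DataFrame; both are illustrative.

```python
import pandas as pd

# Shortened stand-in for the full 107-gene list defined in this step.
selected_features = ["CD3D", "CD8A", "HLA.DRA", "MS4A1"]

# Toy data that lacks one of the required genes.
data = pd.DataFrame([[0.5, 1.0, 2.0]], columns=["CD3D", "CD8A", "MS4A1"])

# Report missing genes before attempting the selection.
missing = [gene for gene in selected_features if gene not in data.columns]
if missing:
    print(f"Missing {len(missing)} required gene(s): {missing}")
else:
    X_test_data = data[selected_features]

# prints: Missing 1 required gene(s): ['HLA.DRA']
```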
86
+ #### 10. Handle Missing Values in the Data
87
+ Replace missing values (NaN) with the mean of each column using SimpleImputer:
88
+ ```
89
+ imputer = SimpleImputer(strategy='mean')
90
+ X_test_data = imputer.fit_transform(X_test_data)
91
+ ```
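Note that `fit_transform` returns a NumPy array, so the gene names are no longer attached to the columns. If you would rather keep a labelled DataFrame, the same mean imputation can be sketched in plain pandas. This is a swapped-in alternative to `SimpleImputer`, not the method used above, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy feature matrix with one missing value per gene.
X = pd.DataFrame({"CD3D": [1.0, np.nan, 3.0], "CD8A": [0.0, 2.0, np.nan]})

# Fill each column's NaNs with that column's mean; column names are preserved.
X_filled = X.fillna(X.mean())

print(X_filled)
```

A column that contains only NaN values stays NaN with this approach, because its mean is undefined. Call `X_filled.to_numpy()` if you want to pass a plain array to the model, as the `SimpleImputer` version does.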
92
+ #### 11. Make Predictions with the Loaded Model
93
+ Use the model to make predictions:
94
+ ```
95
+ predictions = loaded_rf_model.predict(X_test_data)
96
+ ```
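If the loaded model is a scikit-learn `RandomForestClassifier` (an assumption; the README only says it is a Random Forest loaded with joblib), `predict_proba` can give a rough per-cell confidence next to each label. The sketch below trains a tiny stand-in model on random data so it runs without the 2.1 GB file; the labels and feature values are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny stand-in model trained on random data (NOT the real HCC TIIC model).
rng = np.random.default_rng(0)
X_train = rng.random((60, 5))
y_train = np.array(["T cell", "B cell", "NK cell"] * 20)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

X_new = rng.random((4, 5))
predictions = toy_model.predict(X_new)
probabilities = toy_model.predict_proba(X_new)  # shape: (n_cells, n_classes)
confidence = probabilities.max(axis=1)          # highest class probability per cell

for label, score in zip(predictions, confidence):
    print(f"{label}: {score:.2f}")
```

With the real model, the same two calls would be made on `loaded_rf_model` and `X_test_data`.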
97
+ #### 12. Add Predictions to the Data and Display the Updated Data
98
+ ```
99
+ data['label'] = predictions
100
+ print(data.head())
101
+ plt.figure(figsize=(10, 4))
102
+ plt.title('Predicted Cell Type Distribution')
103
+ data['label'].value_counts().plot.bar(rot=0)
104
+ plt.show()
105
+ ```
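Alongside the bar chart, a small summary table can make the predicted composition easier to read or export. This is a minimal sketch on a toy `label` column; the cell-type names and the column names `n_cells` and `fraction` are illustrative.

```python
import pandas as pd

# Toy predictions standing in for data['label'].
data = pd.DataFrame({"label": ["T cell", "B cell", "T cell", "NK cell", "T cell"]})

# Count cells per predicted label and add each label's share of all cells.
summary = (
    data["label"]
    .value_counts()
    .rename_axis("label")
    .reset_index(name="n_cells")
)
summary["fraction"] = summary["n_cells"] / summary["n_cells"].sum()

print(summary)
```

The table can then be written out with `summary.to_csv(...)` using a path of your choice.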