
fasttext-med-en-zh-identification


This model is an intermediate artifact of the EPCD (Easy-Data-Clean-Pipeline) project, designed primarily to distinguish Chinese from English samples in medical pretraining datasets with high accuracy. It is built with fastText.

Data Composition

General Chinese Pretraining Dataset

Medical Chinese Pretraining Dataset

General English Pretraining Dataset

Medical English Pretraining Dataset

All of the above are high-quality open-source datasets, which saves considerable data-cleaning effort. Many thanks to their developers for supporting the open-source data community!

Data Cleaning Process

  • Initial dataset processing (see the sketch after this list):

    • For the Chinese training datasets, the pretraining corpus is split on \n and leading and trailing whitespace is stripped.
    • For the English training datasets, the pretraining corpus is split on \n, all letters are lowercased, and leading and trailing whitespace is stripped.
  • Word counting:

    • For Chinese, the jieba package is used for tokenization, and stopwords and non-Chinese characters are then filtered out with jionlp.
    • For English, the nltk package is used for tokenization, with its built-in stopword list used for filtering.
  • Sample filtering by word count (a heuristic threshold):

    • For both Chinese and English, only samples with more than 5 words are kept.
  • Dataset splitting: 90% of the data is used for training and 10% for testing.
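A minimal sketch of this preprocessing pipeline, assuming jieba, jionlp, and nltk are installed. The helper names are illustrative, and the jionlp stopword call is an assumption about its API that may differ across versions:

import random
import re

import jieba
import jionlp as jio
import nltk
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
EN_STOPWORDS = set(stopwords.words("english"))

def split_corpus(raw_text, lowercase=False):
    # Split a pretraining corpus on \n and strip leading/trailing whitespace.
    lines = [line.strip() for line in raw_text.split("\n")]
    if lowercase:
        lines = [line.lower() for line in lines]
    return [line for line in lines if line]

def zh_word_count(text):
    # Tokenize with jieba, drop stopwords via jionlp, keep Chinese-only tokens.
    tokens = jio.remove_stopwords(jieba.lcut(text))  # assumed jionlp API
    return sum(1 for tok in tokens if re.fullmatch(r"[\u4e00-\u9fff]+", tok))

def en_word_count(text):
    # Tokenize with nltk and drop its built-in English stopwords.
    tokens = nltk.word_tokenize(text)
    return sum(1 for tok in tokens if tok.isalpha() and tok not in EN_STOPWORDS)

def filter_and_split(samples, word_count, min_words=5, train_ratio=0.9):
    # Keep samples with more than min_words words, then split 90/10.
    kept = [s for s in samples if word_count(s) > min_words]
    random.shuffle(kept)
    cut = int(len(kept) * train_ratio)
    return kept[:cut], kept[cut:]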

Model Performance

Dataset   Accuracy   Precision   Recall
Train     0.9994     0.9987      0.9987
Test      0.9998     0.9962      0.9962
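Precision and recall here come from fastText's built-in evaluation; since each sample carries exactly one language label, precision@1 and recall@1 coincide. A hedged sketch of how such a model is trained and scored (file paths and label names are placeholders, not the project's actual ones):

import fasttext

# train.txt holds one sample per line, prefixed with a fastText label, e.g.:
#   __label__en this is a lowercased english sample
#   __label__zh 这是一条中文样本
model = fasttext.train_supervised(input="train.txt")  # placeholder path

# test() returns (sample count, precision@1, recall@1).
n, precision, recall = model.test("test.txt")  # placeholder path
print(f"samples={n} precision={precision:.4f} recall={recall:.4f}")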

Usage Example

import fasttext
from huggingface_hub import hf_hub_download

def to_low(text):
    # The English training data was lowercased, so queries should be normalized the same way.
    return text.strip().lower()

# Download the trained classifier from the Hugging Face Hub and load it.
model_path = hf_hub_download(repo_id="ytzfhqs/fasttext-med-en-zh-identification", filename="model.bin")
model = fasttext.load_model(model_path)

# predict() returns a (labels, probabilities) tuple for the input text.
print(model.predict(to_low('Hello, world!')))
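Labels carry fastText's __label__ prefix; for a binary language identifier like this one, model.predict(text, k=2) returns both labels ranked by probability.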
