This is an automated archive made by the Lemmit Bot.

The original was posted on /r/machinelearning by /u/Seankala on 2024-04-01 05:12:58.


I recently pre-trained custom BERT and ELECTRA models for the fashion domain that could also handle English alongside my native language (I'm not in the US). Performance wasn't as good as I anticipated, and I felt the effort wasn't worth it.

Are there any papers or resources on when it's worth creating your own pre-trained LM from scratch? I recall reading a biomedical-domain paper a while ago, "Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art" (Lewis et al., 2020), which seems to show that pre-training from scratch can help with biomedical and clinical tasks, but I'm not sure if there are other papers out there.

Also, are there any tips or good-to-know things when assessing a newly pre-trained LM? For example, checking the tokenizer's OOV rate on in-domain text, along the lines of the sketch below.
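
Roughly what I have in mind for that check (a minimal sketch assuming a Hugging Face tokenizer; the model name and the sample sentences are just placeholders, not from my actual setup):

```python
# Rough sketch: measure how well a tokenizer's vocabulary covers domain text.
# "bert-base-multilingual-cased" and the sample sentences are placeholders.
from transformers import AutoTokenizer


def vocab_coverage(tokenizer_name: str, sentences: list[str]) -> None:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    total_tokens = 0
    unk_tokens = 0
    total_words = 0
    for text in sentences:
        words = text.split()
        tokens = tokenizer.tokenize(text)
        total_words += len(words)
        total_tokens += len(tokens)
        unk_tokens += sum(t == tokenizer.unk_token for t in tokens)
    # A high UNK rate or heavy subword fragmentation suggests the vocabulary
    # doesn't fit the domain well.
    print(tokenizer_name)
    print(f"  UNK rate:          {unk_tokens / total_tokens:.2%}")
    print(f"  subwords per word: {total_tokens / total_words:.2f}")


domain_sentences = [
    "Oversized bouclé cardigan with dropped shoulders and patch pockets.",
    "Water-repellent anorak, relaxed fit, two-way zipper, taped seams.",
]
vocab_coverage("bert-base-multilingual-cased", domain_sentences)
```

Is comparing numbers like these between a general-purpose checkpoint and the new domain model a reasonable sanity check, or are there better standard diagnostics?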

Thanks in advance.