Real-world hate speech datasets as evaluation
18. 4. 2023 | Human Rights and Minorities
Andraž Pelicon, our colleague from the Jožef Stefan Institute, recently presented a paper titled “Don’t Start Your Data Labeling from Scratch: OpSaLa – Optimized Data Sampling Before Labeling” at the Symposium on Intelligent Data Analysis (IDA). His work focuses on text classification tasks in which severe class imbalance limits the ability to train high-performance models. The problem is partly due to the small number of minority-class instances, which leaves the minority class patterns poorly represented. A common remedy is data augmentation; however, such techniques have shown mixed results on text data. The proposed solution is to Optimize the data Sampling prior to Labeling (OpSaLa), so that the minority class(es) are overrepresented in the training dataset. The approach was evaluated on three real-world hate speech datasets and compared to four commonly used approaches: training on the “natural” class distribution, class weighting, and two oversampling approaches (minority oversampling and backtranslation). The results confirm that OpSaLa yields better models while the labeling budget stays the same.
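For readers unfamiliar with the baselines mentioned above, the following is a minimal sketch of two of them, class weighting and minority oversampling, on toy data. It is not the paper's implementation: the function names, the inverse-frequency weighting formula, and the placeholder posts are all assumptions made for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def oversample_minority(texts, labels, seed=0):
    """Duplicate randomly chosen examples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for cls, cnt in counts.items():
        pool = [t for t, y in zip(texts, labels) if y == cls]
        extra = rng.choices(pool, k=target - cnt)  # empty for the majority class
        out_texts.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_texts, out_labels


# Toy data: 8 "ok" posts and 2 "hate" posts (placeholders, not from the paper).
texts = [f"post {i}" for i in range(10)]
labels = ["ok"] * 8 + ["hate"] * 2

weights = balanced_class_weights(labels)
bal_texts, bal_labels = oversample_minority(texts, labels)
print(weights)
print(Counter(bal_labels))
```

Both baselines act on data that has already been labeled: they reweight or repeat the few minority examples available. OpSaLa, as described above, instead changes which instances are sent for labeling in the first place.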
Reference: Pelicon, A., Montariol, S., Kralj Novak, P. (2023). Don’t Start Your Data Labeling from Scratch: OpSaLa – Optimized Data Sampling Before Labeling. In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham. https://doi.org/10.1007/978-3-031-30047-9_28