Journal
APPLIED SCIENCES-BASEL
Volume 13, Issue 17, Pages -
Publisher
MDPI
DOI: 10.3390/app13179837
Keywords
unsupervised learning; clustering; multimodal; network mapping
With the proliferation of internet technology, the rise of illicit websites such as gambling and pornography has become a serious concern due to the threats they pose to people's well-being and financial security. Current governance measures rely on manual detection, but the need for effective and efficient solutions is urgent. This paper proposes a method that utilizes web mapping engine big data to perform unsupervised multimodal clustering for the discovery of illicit websites, achieving high accuracy in identification and classification.
With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people's physical and mental health as well as their financial security. Currently, the governance of such websites relies mainly on limited-scale detection through manual annotation. Effective governance, however, requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. In this paper, we therefore propose a method that combines web mapping engine big data with unsupervised multimodal clustering (MDC) for illicit website discovery. We extract features from webpage screenshots and OCR text using contrastive learning methods, then cluster websites by feature similarity to identify illicit ones. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels, and an accuracy of 92.39% at a confidence level of 0.999 or higher. Applying the MDC model to 3.7 million real web mapping records, we obtained a dataset of 397,275 illicit websites, primarily focused on gambling and pornography, with 14 attributes. This dataset is made publicly available.
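The feature similarity clustering step described in the abstract can be illustrated with a minimal sketch. This is not the paper's MDC model: the contrastive encoders are replaced here by small synthetic embeddings, and the function names (`fuse_features`, `cluster_by_similarity`), the concatenation-based fusion, and the 0.8 cosine threshold are all assumptions chosen for illustration. Clustering is done by linking any two sites whose fused features are sufficiently similar, which is one simple stand-in for a similarity-based clustering method.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def fuse_features(image_emb, text_emb):
    """Hypothetical fusion: normalize each modality, concatenate, renormalize."""
    fused = np.concatenate([l2_normalize(image_emb), l2_normalize(text_emb)], axis=1)
    return l2_normalize(fused)


def cluster_by_similarity(features, threshold=0.8):
    """Group rows into connected components of the graph whose edges join
    pairs with cosine similarity >= threshold (an assumed value)."""
    sim = features @ features.T
    n = len(features)
    labels = np.full(n, -1, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(sim[i] >= threshold):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Synthetic stand-ins for screenshot and OCR-text embeddings of 10 websites:
# two groups of 5, each scattered tightly around its own direction.
rng = np.random.default_rng(0)
group = np.array([0] * 5 + [1] * 5)
image_emb = np.eye(16)[[0, 1]][group] + 0.02 * rng.normal(size=(10, 16))
text_emb = np.eye(16)[[2, 3]][group] + 0.02 * rng.normal(size=(10, 16))

labels = cluster_by_similarity(fuse_features(image_emb, text_emb))
print(labels)
```

In a real pipeline the synthetic arrays would be replaced by the outputs of an image encoder and a text encoder, and the threshold would need tuning against labeled samples.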