Article

Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

Journal

APPLIED SCIENCES-BASEL
Volume 13, Issue 17

Publisher

MDPI
DOI: 10.3390/app13179837

Keywords

unsupervised learning; clustering; multimodal; network mapping

With the proliferation of internet technology, illicit websites such as gambling and pornography sites have become a serious concern because they threaten people's well-being and financial security. Current governance measures rely on manual detection, which is limited in scale, so effective and efficient solutions are urgently needed. This paper proposes a method that uses big data from web mapping engines to perform unsupervised multimodal clustering for discovering illicit websites, achieving high accuracy in identification and classification.
With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people's physical and mental health as well as their financial security. Currently, the governance of such websites relies mainly on limited-scale detection through manual annotation. Effective governance, however, urgently requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. In this paper, we therefore propose a method that uses web mapping engine big data to perform unsupervised multimodal clustering (MDC) for illicit website discovery. We extract features from webpage screenshots and OCR text using contrastive learning methods, then cluster them by feature similarity to identify illicit websites. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels and an accuracy of 92.39% at a confidence level of 0.999 or higher. Applying the MDC model to 3.7 million real web mapping records, we obtained 397,275 illicit websites, primarily gambling and pornography sites, with 14 attributes each. This dataset is made publicly available.
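The clustering and confidence-filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' MDC implementation: the abstract does not say how the two modalities are fused, which clustering algorithm is used, or how confidence is defined. The sketch assumes precomputed screenshot and OCR-text embeddings (here tiny hand-made vectors), fuses them by concatenation, clusters them with a simple spherical k-means, and treats a softmax over centroid similarities as the confidence. The function names, the `temperature` parameter, and the toy data are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(image_vec, text_vec):
    """Assumed fusion: concatenate the normalized embeddings, then renormalize."""
    return l2_normalize(l2_normalize(image_vec) + l2_normalize(text_vec))

def init_centroids(features, k):
    """Deterministic farthest-point start: repeatedly add the least similar point."""
    centroids = [features[0]]
    while len(centroids) < k:
        centroids.append(
            min(features, key=lambda f: max(dot(f, c) for c in centroids))
        )
    return centroids

def cluster(features, k, iters=20, temperature=0.05):
    """Spherical k-means on cosine similarity; returns (labels, confidences)."""
    centroids = init_centroids(features, k)
    for _ in range(iters):
        assigned = [max(range(k), key=lambda c: dot(f, centroids[c]))
                    for f in features]
        for c in range(k):
            members = [f for f, a in zip(features, assigned) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = l2_normalize(mean)
    labels, confidences = [], []
    for f in features:
        # Assumed confidence: softmax over temperature-scaled similarities.
        logits = [dot(f, c) / temperature for c in centroids]
        peak = max(logits)
        weights = [math.exp(z - peak) for z in logits]
        best = max(range(k), key=lambda c: weights[c])
        labels.append(best)
        confidences.append(weights[best] / sum(weights))
    return labels, confidences

# Toy (screenshot embedding, OCR-text embedding) pairs, purely illustrative.
pages = [
    ([1.0, 0.0], [1.0, 0.1]),  # both modalities point to category A
    ([0.9, 0.1], [1.0, 0.0]),  # category A
    ([0.0, 1.0], [0.1, 1.0]),  # both modalities point to category B
    ([0.1, 0.9], [0.0, 1.0]),  # category B
    ([1.0, 0.0], [0.0, 1.0]),  # modalities disagree: ambiguous page
]
features = [fuse(img, txt) for img, txt in pages]
labels, confs = cluster(features, k=2)

# Keep only pages whose cluster assignment clears the 0.999 threshold.
kept = [i for i, c in enumerate(confs) if c >= 0.999]
print(labels, [round(c, 4) for c in confs], kept)
```

In this toy run the page whose screenshot and text embeddings disagree still receives a cluster label, but its confidence falls below 0.999 and it is filtered out. That is one plausible reading of why accuracy in the abstract is higher at the 0.999 level than across all confidence levels.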
