4.6 Article

Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

Journal

APPLIED SCIENCES-BASEL
Volume 13, Issue 17

Publisher

MDPI
DOI: 10.3390/app13179837

Keywords

unsupervised learning; clustering; multimodal; network mapping

With the proliferation of internet technology, the rise of illicit websites, such as gambling and pornography sites, has become a serious concern because of the threats they pose to people's well-being and financial security. Current governance measures rely on manual detection, which does not scale, so more effective and efficient solutions are urgently needed. This paper proposes a method that uses web mapping engine big data to perform unsupervised multimodal clustering for the discovery of illicit websites, and reports high accuracy in identification and classification.
With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people's physical and mental health as well as their financial security. Currently, the governance of such websites relies mainly on limited-scale detection through manual annotation. Effective governance, however, urgently requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. In this paper, we therefore propose a method that combines web mapping engine big data with unsupervised multimodal clustering (MDC) for illicit website discovery. We extract features from webpage screenshots and OCR text using contrastive learning methods, and then cluster websites by feature similarity to identify illicit ones. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels, and an accuracy of 92.39% at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping records, we obtained 397,275 illicit websites, primarily gambling and pornography sites, each described by 14 attributes. This dataset is made publicly available.
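
The pipeline described in the abstract (per-modality features, similarity clustering, confidence filtering) can be illustrated with a minimal sketch. It is not the authors' MDC model. It assumes the screenshot and OCR-text embeddings have already been produced by some encoder, and every design choice here is a stand-in picked for illustration: fusion by concatenation, spherical k-means with farthest-point initialization, a softmax over cosine similarities as the "confidence", and the `temperature` value. All function names are hypothetical. Only the 0.999 cutoff is borrowed from the abstract, and the confidence it is applied to here is not necessarily defined the way the paper defines it.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def fuse(image_emb, text_emb):
    """Assumed fusion: normalize each modality, concatenate, renormalize."""
    joint = np.concatenate([l2_normalize(image_emb), l2_normalize(text_emb)], axis=1)
    return l2_normalize(joint)


def farthest_point_init(x, k):
    """Deterministic init: start at row 0, then add the least similar row each time."""
    chosen = [0]
    for _ in range(1, k):
        best_sim = (x @ x[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(best_sim)))
    return x[chosen].copy()


def spherical_kmeans(x, k, n_iter=50):
    """K-means on the unit sphere, using cosine similarity for assignment."""
    centroids = farthest_point_init(x, k)
    for _ in range(n_iter):
        labels = np.argmax(x @ centroids.T, axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
        centroids = l2_normalize(centroids)
    return centroids


def assign_with_confidence(x, centroids, temperature=0.05):
    """Return (cluster label, softmax confidence) for every row of x."""
    logits = (x @ centroids.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)


def high_confidence_mask(confidence, threshold=0.999):
    """Boolean mask of samples at or above the confidence threshold."""
    return confidence >= threshold


if __name__ == "__main__":
    # Synthetic stand-ins for screenshot and OCR-text embeddings.
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(100, 32))
    text_emb = rng.normal(size=(100, 32))
    x = fuse(image_emb, text_emb)
    centroids = spherical_kmeans(x, k=4)
    labels, confidence = assign_with_confidence(x, centroids)
    kept = high_confidence_mask(confidence)
    print("samples kept at >= 0.999 confidence:", int(kept.sum()))
```

Filtering on confidence trades coverage for precision, which matches the abstract's report of higher accuracy at the 0.999 level than across all confidence levels.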
