Article

Developing Software Signature Search Engines Using Paragraph Vector Model: A Triage Approach for Digital Forensics

Journal

IEEE ACCESS
Volume 9, Pages 55814-55832

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/ACCESS.2021.3071795

Keywords

Software; Digital forensics; Search engines; Tools; File systems; Indexes; Metadata; triage solution; software signature; forensic differential analysis; search engine; paragraph vector


With the advancement of information and communication technology, digital crimes have become more prevalent. This paper introduces a software signature search engine (S3E) that identifies the software on a system under investigation, addressing the data-volume challenge faced by digital forensic investigators. Experimental results on controlled systems and pseudo-real systems show that some S3E models achieve perfect recall and many exceed 90% recall.
As information and communication technology has grown, digital crime has spread with it. Advanced storage technologies and their low cost have significantly increased storage use, so the high volume of digital data to be analyzed is a challenge for digital forensic investigators. Digital forensic triage solutions aim to alleviate the resulting forensic backlog. A promising triage technique is to quickly find the software packages run on the target system in order to narrow down the search space. In this paper, we propose a software signature search engine (S3E) to identify software on the system under investigation. The document collection of this search engine consists of software signatures, and the query is the set of features extracted from the system's hard disk. We propose a forensic differential analysis model to build the software signatures. In addition, we use the paragraph vector model to construct a vector for each software signature and to find similarities between the query vector and the signature vectors. Several design parameters are involved in building a software signature search engine, and different values of these parameters lead to different models. We measured the performance of these S3E models on several controlled systems and some pseudo-real systems. The experimental results on both datasets show that some S3E models achieve perfect recall, and many have a recall of more than 90%. We also find that the recall of the S3E models on both datasets is higher than that of the averaged word2vec model and the TF-IDF model.
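The pipeline described in the abstract (build signatures by differential analysis, then rank them against features from the disk) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a signature is simply the set of file paths that appear after installation, and it swaps in TF-IDF weighting with cosine similarity (one of the baselines named in the abstract) in place of the paragraph vector model so that it runs with the standard library alone. All paths, software names, and function names below are hypothetical.

```python
import math
from collections import Counter


def differential_signature(before, after):
    """Features that appear only after installation (after minus before)."""
    return sorted(set(after) - set(before))


def weigh(tokens, idf):
    """TF-IDF weights for tokens known to the index; unknown tokens are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def build_index(signatures):
    """Return (vectors, idf) for a dict mapping software name -> feature list."""
    n = len(signatures)
    df = Counter()
    for tokens in signatures.values():
        df.update(set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = {name: weigh(tokens, idf) for name, tokens in signatures.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_tokens, vectors, idf):
    """Rank every signature by similarity to the query, best match first."""
    q = weigh(query_tokens, idf)
    scored = [(cosine(q, v), name) for name, v in vectors.items()]
    return sorted(scored, reverse=True)


# Hypothetical snapshots of a clean system and two post-install states.
baseline = ["/usr/bin/ls", "/etc/hosts"]
after_editor = baseline + ["/opt/editor/bin/editor", "/opt/editor/lib/core.so"]
after_player = baseline + ["/opt/player/bin/player", "/opt/player/codecs/h264.so"]

signatures = {
    "editor": differential_signature(baseline, after_editor),
    "player": differential_signature(baseline, after_player),
}
vectors, idf = build_index(signatures)

# Hypothetical features extracted from the disk under investigation.
disk_features = [
    "/usr/bin/ls",
    "/etc/hosts",
    "/opt/editor/bin/editor",
    "/opt/editor/lib/core.so",
    "/home/user/notes.txt",
]
ranked = search(disk_features, vectors, idf)
```

Here `ranked` lists the hypothetical `editor` signature ahead of `player`, because only the editor's paths occur among the disk features. In a paragraph-vector variant, `build_index` and `weigh` would be replaced by a trained document-embedding model, while the differential step and the similarity ranking would keep the same shape.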
