Analysing and Carving MS Word and PDF Files from RAM Images on Windows

In this study, software has been developed to recover readable data by carving MS Word and PDF files from a RAM image. String searching, signature scanning, and data carving methods are used in the design of the software. The analysis was performed on a 14 GB RAM image using the developed software. The success rate for each file was determined by comparing the recovered data with the data in the original file. The rate of data recovery was found to decrease as the size of the MS Word or PDF files loaded into RAM increases. The proposed study is intended to serve as an example of obtaining electronic evidence from volatile data in forensic informatics.


INTRODUCTION
In the realm of cybercrime, data about the crime are obtained using digital evidence. The first step in a cybercrime investigation is to acquire images of the RAM and hard disk for later examination. Particularly when collecting evidence from RAM, the data must be obtained before they are deleted [1]. Therefore, the data in the RAM need to be copied using image acquisition software.
In the Windows operating system, only active processes in virtual memory can be accessed. Full access to RAM is only available at the kernel mode level in Windows. Therefore, RAM image acquisition software runs at the kernel mode level. Both open source and commercial software packages exist that acquire RAM images on the Windows operating system [2][3][4]. The RAM images acquired by these packages are used in RAM analysis and data carving operations.
Various data such as user passwords, images, documents, installed programs, and visited web addresses can be acquired from RAM through RAM image analysis [3][4][5][6][7]. String searching, signature scanning, file carving, and data structure analysis methods are used to recover data from the RAM image. The method to be used in the analysis depends on the desired data. The string search technique is used to access the user's social media or application passwords [7]. Data about the processes that are running or have been loaded in the system are kept in data structures in RAM. A data structure analysis method is used to access these data [8]. To obtain the data belonging to files in RAM, the file first needs to be recovered. The files in the image are recovered using signature search followed by file carving [9]. Different data recovery techniques are then applied, depending on the file type, to access the data in the recovered files [10][11].
In this study, software was developed to carve readable MS Word and PDF data from RAM images. Data carving from a 14 GB RAM image was carried out with the developed software, separately for MS Word and PDF files, on a 64-bit Windows 10 operating system. At the end of the analysis, 10 MS Word files with the .doc extension were recovered in 41 minutes 10 seconds, 10 MS Word files with the .docx extension in 37 minutes 45 seconds, and 10 PDF files in 45 minutes 1 second. Each PDF and .doc file was decoded to access its data. The data in the MS Word files with the .docx extension were accessed using the string search method. When the recovered data were examined, the average success rate was found to be 40% for the MS Word files and 16% for the PDF files.

RELATED WORK
Digital Forensics is a multi-disciplinary scientific field that deals with the collection, examination, and preservation of existing data as evidence to be presented in court. The first Digital Forensics work is understood to be the identification of when and how someone entered a system administrator's system without permission in 2001 [12]. Digital evidence, rather than physical evidence, is needed to shed light on crimes in Digital Forensics. It must be proved that the digital evidence has not been changed since it was collected. The hash signatures of the digital data can be computed using the MD5 and SHA1 algorithms. These signatures make it possible to determine whether the evidence has been changed after collection [13].
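The integrity check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the cited work: the function name `evidence_digests` and the in-memory stand-in for an acquired image are assumptions introduced here, and a real image would be opened from disk in binary mode.

```python
import hashlib
import io


def evidence_digests(stream, chunk_size=1 << 20):
    """Return (md5_hex, sha1_hex) for a binary stream, read chunk by chunk."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


# Hypothetical stand-in for an acquired RAM image.
image = io.BytesIO(b"\x00" * 4096 + b"%PDF-1.4 sample")

at_collection = evidence_digests(image)
image.seek(0)
at_analysis = evidence_digests(image)

# Equal digests indicate that the evidence was not altered in between.
unchanged = at_collection == at_analysis
```

Reading in chunks keeps memory use constant, which matters for images in the multi-gigabyte range.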
Much of the data that can be considered digital evidence is temporarily stored in RAM. It has been observed that the data stored in RAM are not erased for a certain period of time after the power is turned off. Crypto keys in RAM were first recovered without any special hardware through an attack scenario called Cold Boot [14].
KnTTools, developed in 2005, is understood to be the first application for acquiring and analysing a RAM image from within the operating system. Running processes and threads were searched for in the RAM image using KnTTools [4]. Image acquisition using external hardware was carried out with the AfterLife application. The system is restarted after a USB memory stick containing AfterLife is plugged in. During boot, the application takes control of the system and copies the content of the RAM to the free space on the USB memory. Since USB memory is used in this method, the BIOS boot settings of the target system must be changed to boot from USB. In addition, the application cannot acquire RAM images greater than 4 GB [15].
Access to RAM from user mode was restricted after Windows Vista [16]. Because of these restrictions, RAM image acquisition software must run in kernel mode through a RAM driver [17].
Belkasoft Live RAM Capturer, DumpIt, FTK Imager, and WinEn run in kernel mode [8][9][10]. These packages have been developed as commercial or open source software. There are no studies in the literature on developing kernel mode RAM imaging software. In general, current studies have used RAM image analysis and data recovery techniques [10][11][12].
The most comprehensive analysis of RAM images can be carried out with Volatility, an open source tool. Image analysis with Volatility gives access to data about registry files, running processes, network activity, and malware within the image [18]. In addition, digital evidence about files is obtained through RAM image analysis. Images and document files within the RAM image are accessed using the signature scanning method [19]. Image files in RAM were accessed using the signature scanning method in [8].
PDF files are widely used across operating systems, and many important data are stored in them. PDF files therefore have an important place in obtaining digital evidence and are also examined during RAM image analysis. Methods for recovering PDF files from RAM have been proposed in [16]. The proposed methods apply to operating systems preceding Windows Vista. With these methods, PDF files cannot be accessed from RAM on more recent operating systems such as Windows 7, 8, 8.1, and 10.
One of the file types used to obtain digital evidence from RAM is MS Word. In their 2017 study, Al-Sharif et al. accessed MS Word files with the .docx extension in a RAM image using the string searching method. The study was conducted on the Windows 7 operating system, and the average success rate was 6% [17].

PDF (Portable Document Format)
PDF is a digital format developed for creating portable and printable documents that are independent of software, hardware, and operating systems. The format grew out of the Camelot project, started in the early 1990s by John Warnock, one of Adobe's founders. Today, it is an open standard managed by the International Organization for Standardization (ISO). PDF files contain text, audio, video, and image data [20]. The data in PDF files are encoded with the techniques given in Tab. 1. The encoding applied to a PDF file is indicated by labels in the file itself. Today, PDF programs typically compress the data in PDF files with FlateDecode. The encoded data are stored between the stream and endstream keywords [21]. As shown in Fig. 1, the algorithm used to encode the data is given in the Filter label.
Word files with the .docx extension consist of compressed XML files [22]. Data compression programs can be used to unpack Word files into their XML parts. When a Word document is unpacked, the textual data are found in the document.xml file. The image and video files used in the document are extracted into the media folder. The styles and layout used in the document are stored in the styles.xml file [23]. Fig. 3 shows the parsed structure of a sample MS Word file with the .docx extension.
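The packaging of a .docx file can be illustrated with a short Python sketch, given here for illustration only and not taken from the software of this study. It builds a tiny stand-in package in memory and reads it back with the standard `zipfile` module; the helper names `list_docx_parts` and `read_document_xml` and the sample content are assumptions.

```python
import io
import zipfile


def list_docx_parts(docx_bytes):
    """Return the sorted names of the parts stored in a .docx package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def read_document_xml(docx_bytes):
    """Return the main text part (word/document.xml) as a string."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read("word/document.xml").decode("utf-8")


# Build a minimal stand-in package (not a complete, openable Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr(
        "word/document.xml",
        "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>",
    )
    archive.writestr("word/styles.xml", "<w:styles/>")
sample_docx = buffer.getvalue()

parts = list_docx_parts(sample_docx)
body = read_document_xml(sample_docx)
```

This approach only works when the ZIP container is intact. A package carved from RAM is often damaged, which is why plain string searching over the XML is the more robust option for carved files.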

File Management in RAM
When a new file is opened in the operating system, it is first loaded into RAM. The address of the opened file is kept in the EPROCESS (Executive Process) structure until the process is terminated. The device data, file name, and data for the open file are kept in the FILE_OBJECT structure. FILE_OBJECT keeps the data for the file loaded into RAM in the DeviceObject, SectionObjectPointer, and Filename structures. The DataSectionObject, SharedCacheMap, and ImageSectionObject sections in the SectionObjectPointer structure are used to access readable data about the file. This arrangement is shown in Fig. 4. When a file is loaded into RAM, the DataSectionObject structure is first triggered by the SectionObjectPointer [24]. Then the SharedCacheMap and ImageSectionObject structures are called. When the file is deleted or terminated, the record within the EPROCESS is deleted. However, FILE_OBJECT keeps the deleted or terminated file until a new address is assigned [25]. Therefore, access to a file in RAM may be possible even if the process has been terminated.

RAM Image Analysis Methods
After the RAM image has been acquired, various methods are used to obtain digital evidence. The data in the divided or undivided areas of the RAM are extracted using the signature, string, and header-footer searching methods. Some applications encrypt their data in RAM using certain techniques, while others store their data in RAM without encryption [26].

String Searching
String searching is an analysis process carried out directly on the RAM image. In this method, there is no need to use the data structures in the RAM image. ASCII and Unicode strings with specific formats can be searched for by selecting various search options within the RAM image [14].
The applications running on the operating system keep the user's data in RAM for a certain period of time without encryption. These data can be accessed from the RAM image using the string searching method. The search is carried out by defining different search strings for each application. Sample strings created for scanning for the usernames and passwords of social media accounts are given in Tab. 2 [27].
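A minimal Python sketch of such a search is given here for illustration. It looks for one keyword in both its ASCII form and its UTF-16LE form, the encoding Windows typically uses for Unicode strings in memory. The keyword `passwd=`, the function name `find_strings`, and the sample buffer are hypothetical and are not the strings of Tab. 2.

```python
def find_strings(data, keyword, context=32):
    """Search `data` for `keyword` in ASCII and UTF-16LE form.

    Returns a sorted list of (offset, encoding, snippet) tuples, where the
    snippet holds the match plus up to `context` following bytes.
    """
    hits = []
    for encoding in ("ascii", "utf-16-le"):
        needle = keyword.encode(encoding)
        offset = data.find(needle)
        while offset != -1:
            snippet = data[offset:offset + len(needle) + context]
            hits.append((offset, encoding, snippet))
            offset = data.find(needle, offset + 1)
    return sorted(hits)


# Hypothetical memory fragment holding the keyword in both encodings.
sample = (
    b"\x00" * 10
    + b"passwd=secret"
    + b"\xff" * 5
    + "passwd=top".encode("utf-16-le")
)
hits = find_strings(sample, "passwd=")
```

For a multi-gigabyte image the buffer would be read in chunks instead of being held in memory at once.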

File Carving
File carving is a search-and-recovery process for terminated or deleted files stored as binary data in RAM. The files used in the Windows operating system are stored in RAM between header and footer signatures. The files in RAM are extracted by scanning for the header and footer signatures. The header and footer signatures can differ depending on the version of the file type [28]. The header and footer signatures of various file types are given in Tab. 3. Some file types do not leave footer signatures in RAM. For files that lack a footer signature, the carved region is terminated during RAM image analysis at a specified maximum size after the header signature [29].
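The header-footer procedure, including the maximum-size fallback, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of this study: the function name `carve` and the sample buffer are hypothetical, and the first footer found after a header is taken as the end of the file, which a PDF with several `%%EOF` markers would violate.

```python
def carve(data, header, footer=None, max_size=1 << 20):
    """Carve every region that starts with `header`.

    A region ends after the first `footer` found within `max_size` bytes.
    When no footer is given or found, it ends at `max_size` bytes or at the
    end of the data. Returns a list of (start, end, blob) tuples.
    """
    results = []
    start = data.find(header)
    while start != -1:
        limit = min(start + max_size, len(data))
        end = limit
        if footer is not None:
            foot = data.find(footer, start + len(header), limit)
            if foot != -1:
                end = foot + len(footer)
        results.append((start, end, data[start:end]))
        start = data.find(header, start + 1)
    return results


# Hypothetical buffer with one PDF-like region and one OLE header.
OLE_HEADER = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
image = b"junk" + b"%PDF-1.4 body %%EOF" + b"more" + OLE_HEADER + b"x" * 10

pdf_hits = carve(image, b"%PDF", b"%%EOF")
doc_hits = carve(image, OLE_HEADER, footer=None, max_size=6)
```

The `max_size` value trades completeness against noise: a larger limit is more likely to include the whole file but also drags in unrelated memory content.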

IMPLEMENTATION OF THE SOFTWARE
The software implementation is composed of two parts. In the first part, binary scanning is carried out using signature traces in RAM. The address ranges of the files are determined at the end of this scan. The files whose addresses in RAM have been determined are then extracted to disk through file carving. In the second part, the content is extracted from the recovered files. The XML string search method is used for MS Word files with the .docx extension, and the stream decode method is used for files with the .pdf and .doc extensions. The operating model of the implemented software is given in Fig. 5.
The RAM image used in the analysis was acquired with the image acquisition software developed within the scope of this study. The study was carried out on a 14 GB RAM image from the system described in Tab. 4. The address range of the image is 0x000000000000 to 0x36EFFF904.

File Carving
The header and footer signatures in the signature file loaded into the system are used to identify the structure of the files and their addresses in RAM. The developed software searches for signatures at every byte of the RAM image file. The beginning address is taken from the header signature and the end address from the footer signature. As in the scanning result given in Fig. 7, the data between the beginning and end addresses are written to a newly generated binary file. Some files do not have a footer signature. In this case, the address range is determined by specifying a maximum size after the header signature.
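Because a 14 GB image does not fit comfortably in memory, a byte-wise signature search is usually run over fixed-size chunks. The Python sketch below shows one way to do this while still finding signatures that straddle a chunk boundary. It is an assumption about how such a scan could be organised, not a description of the developed software, and `scan_offsets` is a hypothetical name.

```python
import io


def scan_offsets(stream, signature, chunk_size=1 << 20):
    """Return the absolute offsets of `signature` in a binary stream.

    The last len(signature) - 1 bytes of each window are carried over to the
    next one, so a match split across two chunks is still found exactly once.
    """
    overlap = len(signature) - 1
    offsets = []
    tail = b""
    position = 0  # absolute offset of the first byte of the current chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        base = position - len(tail)
        index = window.find(signature)
        while index != -1:
            offsets.append(base + index)
            index = window.find(signature, index + 1)
        tail = window[-overlap:] if overlap else b""
        position += len(chunk)
    return offsets


# Hypothetical image with two PDF headers, scanned in 3-byte chunks so that
# both headers are split across chunk boundaries.
data = b"A" * 5 + b"%PDF" + b"B" * 3 + b"%PDF"
found = scan_offsets(io.BytesIO(data), b"%PDF", chunk_size=3)
```

A real image would be passed as a file object opened in binary mode, with a much larger `chunk_size`.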
Since the MS Word and PDF files recovered by file carving have lost parts of their structure, the recovered files cannot be opened directly. The stream decode method should therefore be used for files with the .pdf and .doc extensions, and the string search method for files with the .docx extension.

DOC Stream Decoder
A Word file with the .doc extension is encoded by dividing it into 7 parts. The data in the file are located in the WordDocument section. The WordDocument block in the file begins with the (Content_Types) tag. As shown in Fig. 8, the encoded data in the file are extracted from the addresses between this beginning and the STX tag. The encoded data blocks in the carved MS Word file with the .doc extension are decoded using Codepage 1252 in the C# programming language. The code block used for the decoding operation is shown in Fig. 9.
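The decoding step can be illustrated with the following Python sketch, which is not the C# code of Fig. 9. It decodes a carved block as code page 1252 and keeps only runs of printable characters. The function name `decode_cp1252_text`, the minimum run length, and the sample block are assumptions made for this example.

```python
import re


def decode_cp1252_text(block, min_run=4):
    """Decode bytes as code page 1252 and return the printable text runs.

    Bytes that are undefined in cp1252 are replaced by U+FFFD, which is then
    treated as a separator together with the control characters.
    """
    text = block.decode("cp1252", errors="replace")
    pattern = r"[^\x00-\x1f\x7f\ufffd]{%d,}" % min_run
    return re.findall(pattern, text)


# Hypothetical carved block: text fragments separated by control bytes.
block = b"\x00\x01Caf\xe9 men\xfc\x00\x02\x93quoted\x94\x00ab"
runs = decode_cp1252_text(block)
```

The `errors="replace"` argument matters because a few byte values (for example 0x81) have no mapping in code page 1252 and would otherwise raise an exception on arbitrary memory content.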

DOCX String Searching
The textual data in the MS Word and XML files recovered from the RAM image are stored in <x:w> and <w:tbl> tags in the XML tree structure. The data stored between the tags are not encrypted. Separate string searches are carried out for all the files, some of which are shown in Fig. 10. At the end of this scan, the resulting data are written to text files. The comparison of the original file with the recovered data is given in Fig. 11.
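A tag-based string search over a carved XML fragment can be sketched in Python as shown here. The sketch assumes the standard WordprocessingML convention in which the text of a run is held in `w:t` elements, which differs from the tag names quoted above. A regular expression is used instead of an XML parser because carved fragments are frequently truncated and would not parse. The name `extract_text_runs` and the sample fragment are hypothetical.

```python
import html
import re

# Matches <w:t> or <w:t attr="..."> but not <w:tbl>, <w:tab/> or <w:tc>.
TEXT_RUN = re.compile(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", re.DOTALL)


def extract_text_runs(fragment):
    """Return the text of every complete w:t element in a carved fragment."""
    return [html.unescape(match) for match in TEXT_RUN.findall(fragment)]


# Hypothetical carved fragment: leading garbage, a table tag, two complete
# text runs, and a final run that was cut off during carving.
fragment = (
    "garbage<w:tbl><w:p><w:r><w:t>Hello</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> R&amp;D</w:t></w:r></w:p>'
    "<w:t>trunc"
)
runs = extract_text_runs(fragment)
```

Text in a run that lost its closing tag is skipped by this pattern. A more permissive pattern could keep it at the cost of more false matches.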

PDF Stream Decoder
When PDF files are loaded into RAM, the data in them are stored in encoded form. To collect the contents of the recovered PDF files, the encoding method applied to the file needs to be known.

Figure 12 An encoded PDF file
As seen in Fig. 12, the applied method can be identified from the Filter tag. The data in the PDF file are encoded between the stream and endstream keywords. In this study, the recovered data are placed into data blocks in a blank PDF file to be decoded. When this PDF file is opened in Windows, the data are automatically decoded and the content is displayed.
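As an alternative to re-embedding the blocks in a blank PDF file, FlateDecode data can also be inflated directly, since FlateDecode is the zlib/deflate compression method. The Python sketch below illustrates this with the standard `zlib` module and is not the method used in this study. It locates the payload between the stream and endstream keywords instead of using the /Length entry, which is a simplification, and the names `inflate_partial` and `decode_pdf_streams` as well as the sample file are assumptions.

```python
import re
import zlib

# Simplification: the payload is taken as everything between "stream" and
# the next "endstream"; the lookbehind avoids matching inside "endstream".
STREAM = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_partial(blob, step=16):
    """Inflate as much of a zlib stream as possible.

    The input is fed in small pieces so that output produced before a
    damaged region is kept instead of being lost with the exception.
    """
    decoder = zlib.decompressobj()
    recovered = bytearray()
    for start in range(0, len(blob), step):
        if decoder.eof:
            break
        try:
            recovered += decoder.decompress(blob[start:start + step])
        except zlib.error:
            break
    return bytes(recovered)


def decode_pdf_streams(pdf_bytes):
    """Return the inflated content of every stream found in `pdf_bytes`."""
    return [inflate_partial(raw) for raw in STREAM.findall(pdf_bytes)]


# Hypothetical single-stream PDF fragment.
text = b"BT /F1 12 Tf (Evidence recovered from RAM) Tj ET"
packed = zlib.compress(text)
sample_pdf = (
    b"%PDF-1.4\n1 0 obj\n<< /Filter /FlateDecode /Length "
    + str(len(packed)).encode("ascii")
    + b" >>\nstream\n" + packed + b"\nendstream\nendobj\n%%EOF\n"
)
decoded = decode_pdf_streams(sample_pdf)
```

With a truncated block, `inflate_partial` returns whatever prefix of the original content could still be decoded, which may be empty.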

RESULTS
Four different image files were processed by the implemented software. The MS Word and PDF files were carved from these image files. As seen in Tab. 5, the PDF files were recovered in between 23 and 37 minutes, depending on the size of the image file. The scanning time for MS Word files was between 17 and 45 minutes. Since XML files are also included in the scanning process, scanning for MS Word files takes longer. The files carved from the 14 GB RAM image were used for recovering the data in the MS Word and PDF files. This image file was acquired with the developed RAM image acquisition software. Ten files for each of the .doc, .docx, and .pdf extensions, which had previously been opened and closed in the operating system, were included in the analysis. The analysis was carried out by the software developed in the C# programming language.
The data in the Word files with the .doc extension were decoded using codepage 1252. The comparison of the recovered data with the original file is given in Tab. 6. The success rate was over 50% for files under 300 KB. As the file size increases, the recovery rate of the data decreases. Overall, the recovery rate of the data in the 10 MS Word files with the .doc extension was 35.6%. The comparison of the data recovered by string searching in the carved files with the .docx extension with the original files is given in Tab. 7. The average success rate in the recovery of the data in the MS Word files.
The data in the PDF files are compressed with FlateDecode. The data were accessed by decoding each block in the recovered data. The recovery rates for the 10 PDF files can be seen in Tab. 8. According to these figures, the recovery rate for PDF files under 250 KB is over 50%. The recovery rate for the data in files over 250 KB ranges between 8.6% and 17.4%.
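The exact metric behind the recovery rates in Tab. 6 to Tab. 8 is not spelled out in the text. Purely as an illustration, one plausible word-level definition is sketched in Python here: the percentage of words in the original file that also appear in the recovered text, counted with multiplicity. The function `recovery_rate` and this definition are assumptions and may differ from the comparison actually used.

```python
from collections import Counter


def recovery_rate(original, recovered):
    """Percentage of the words in `original` that also occur in `recovered`.

    Words are compared as multisets, so a word that appears three times in
    the original must be recovered three times to count fully.
    """
    wanted = Counter(original.split())
    got = Counter(recovered.split())
    total = sum(wanted.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, got[word]) for word, count in wanted.items())
    return 100.0 * matched / total


half = recovery_rate("a b c d", "b d x")
third = recovery_rate("a a b", "a")
```

A word-level measure ignores word order. A character-level or sequence-based comparison would be stricter.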
MS Word and PDF files that have been closed in the operating system continue to stay in RAM data structures. However, the addresses of newly opened files may conflict with the addresses of previous MS Word and PDF files. This conflict causes the data belonging to the closed MS Word and PDF files to be overwritten.
Each paragraph within a PDF file is compressed in a separate data block with FlateDecode. The loss of a single byte in a compressed data block during carving prevents the recovery of the data in that block. The losses in the compressed blocks result in a lower recovery rate for PDF file data. Different methods are applied to files with the .doc and .docx extensions during data recovery. The data in files with the .docx extension are stored in the XML structure without encryption. Therefore, the data in the carved files are accessed through string searching. The data blocks in files with the .doc extension are encoded in codepage 1252. Therefore, the encoded data need to be decoded to recover the data in files with the .doc extension.
Further studies can be carried out to improve the performance and data recovery success rates of the developed forensic software. Extending the signature database used in the proposed software will enable more MS Word and PDF files to be carved. In addition, upgrading the hardware on which carving is performed and accelerating the processing on the GPU will reduce carving times. The decoding of compressed data blocks in data recovery software can be strengthened with character-based reverse-engineering algorithms. As a result, the data recovery success rate can be expected to increase.

CONCLUSION
When a process is terminated in the Windows operating system, the address information of the process is deleted. However, the data belonging to the process are not deleted from the RAM data structures. These data can be accessed by applying file carving and data recovery methods to the RAM image. The software has been developed for carving MS Word and PDF files from a RAM image for use in forensic informatics. Because a small image file reduces the chance of reaching deleted files, a 14 GB image file was selected. With the proposed software, 10 files each of the PDF, DOC, and DOCX types, randomly selected from the 14 GB RAM image, were compared with the data in their original files. As a result of the comparison, the recovery success rate was obtained for each file.
For future studies, it is planned to increase the success rate of the recovered data by applying different carving techniques and algorithms in the developed software.

Figure 1 The structure of a PDF file

Word Document File
Text, audio, image, and video data can be saved in the Word file format. The .doc and .docx file formats are used for Word files. Word files with the .doc extension are in the OLE Compound Binary File (CBF) format, whose structure is shown in Fig. 2. The OLE Compound Binary File is a method for storing multiple data streams in a single file. The data in files with the .doc extension are encoded using codepage 1252 within the WordDocument blocks. The encoded data in the file are stored between the (Content_Types) and STX tags.

Figure 2 The parsed structure of a Word file (.doc)

Figure 4 The structure of File_Object

Figure 5 The operating model of the software

Signature Database
MS Word and PDF files can leave different signatures in RAM depending on their versions. For a complete scan of the RAM image, all signatures of MS Word and PDF files need to be scanned. The textual data of MS Word files with the .docx extension are saved in the document.xml file. Therefore, XML files were also added to the signature scan. As shown in Fig. 6, all signatures for files with the .doc, .xml, and .pdf extensions are collected in the source database named source.a2s. The signatures to be scanned are loaded into the system when the software is running.
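A signature database of this kind can be sketched in Python as a small text table that is parsed into byte patterns. The table layout shown here is hypothetical and is not the format of source.a2s, and the names `load_signatures` and `identify` are introduced only for this example. The byte values are the widely documented ones: `%PDF` and `%%EOF` for PDF, the OLE compound file header for .doc, and the `<?xml` declaration for XML.

```python
# Hypothetical table layout: type, header in hex, footer in hex ("-" if none).
SIGNATURE_TABLE = """\
# type  header             footer
pdf     25504446           2525454F46
doc     D0CF11E0A1B11AE1   -
xml     3C3F786D6C         -
"""


def load_signatures(table_text):
    """Parse the table into {file_type: (header_bytes, footer_bytes_or_None)}."""
    signatures = {}
    for line in table_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        file_type, header_hex, footer_hex = line.split()
        footer = None if footer_hex == "-" else bytes.fromhex(footer_hex)
        signatures[file_type] = (bytes.fromhex(header_hex), footer)
    return signatures


def identify(block, signatures):
    """Return the file type whose header starts `block`, or None."""
    for file_type, (header, _footer) in signatures.items():
        if block.startswith(header):
            return file_type
    return None


signatures = load_signatures(SIGNATURE_TABLE)
```

Keeping the signatures in a data file rather than in code means that new file versions can be supported by adding a line to the table.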

Figure 6 The signatures for MS Word, XML and PDF files

Figure 7 Signature scan for a PDF file

Figure 8 An encoded MS Word file with doc extension

Figure 9 Decoding operation for encoded data by using Codepage 1252

Figure 10 The text view of the carved file

Figure 11 The comparison of the original MS Word file to the recovered MS Word file

Table 1
Encryption techniques used in PDF files

Table 2
Sample strings used for social media accounts.

Table 3
The signatures of header and footer

Table 4
System data about RAM image

Table 5
Period of scraped PDF and Word files

Table 6
Recovery analysis for the Word (.doc) files

Table 7
Recovery analysis for the Word (.docx) files

Table 8
Recovery analysis for the PDF files