Hiding Data and Detecting Hidden Data in Raw Video Components Using SIFT Points

Steganography is the science of hiding data in a medium, whereas steganalysis consists of attacks that aim to find the hidden data in a cover medium. Since hiding data in a text file would disturb the coherence of the text or make it suspicious, systematically changing the pixels of a visual is a more common method. This process is performed on pixels that are spatially (and/or temporally, for video components) distant from each other so that a viewer's eye can be deceived. Online media are subject to modifications such as compression, resolution change, and visual alterations, which makes Scale Invariant Feature Transform (SIFT) points appropriate candidates for steganography. The current paper has two aims. The first is to propose a method that uses the SIFT points of a video for steganography. The second is to use Convolutional Neural Networks (CNN) as a steganalysis tool to detect the suspicious pixels of a video. The results indicate that the proposed steganography method is effective: it yields a higher peak signal-to-noise ratio (PSNR = 95.41 dB) than other techniques described in the cybersecurity literature, and the CNN, with an accuracy of only 52%, cannot reliably detect the hidden data.


INTRODUCTION
Steganography, a word of Greek origin meaning "covered writing" [1], is an important sub-discipline of data hiding that involves the process of hiding data in a medium. The medium may be a picture, an audio file, a video, a web page, and so on. This technique is often employed by illegal groups who want to disseminate information online in an untraceable way. Therefore, it is important to investigate possible more sophisticated methods and their discoverability in the context of cybersecurity and cryptography. The most important difference between steganography and cryptography is that, with the latter, a third party is able to detect that there is meaningful data in the target object. The purpose of steganography is to keep both the confidential data and its existence hidden so that neither can be discerned by third parties. In short, steganography is described as "the art of hiding data" [2]. Similarly, steganalysis is a method of attack that aims to uncover confidential data in a cover medium and, if hidden data is found, to obtain or, at least, alter this information. Most steganalysis applications are based on mathematical/statistical analysis [3]. However, in recent years, deep-learning-based steganalysis studies have become popular [4]. Such methods are designed to operate on data scattered over space (pictures), over time (sound), and over both time and space (video).
The current study proposes a real-time method that uses SIFT (Scale-Invariant Feature Transform) [10] points of a video to hide information. It further investigates whether a CNN can detect a frame with hidden data embedded with this technique. Fig. 1 explains the pipeline of the study.
SIFT is a computer vision algorithm used to detect and describe keypoints in an image [10]. Objects may change in appearance across subsequent frames, since images are limited to two dimensions while real objects are three-dimensional. The changes that may be encountered in object tracking take the form of "Dimensional Change", "Angular Change", "Spatial Variation", "Noise in the Environment" and "Brightness Changes in the Environment". The SIFT algorithm is robust to changes in the size of the image, the amount of light, the camera angle, contrast and noise [10]. The authors believe that SIFT keypoints are useful for steganography because videos streamed online are usually modified to speed up their delivery; these points will generally stay intact under such modifications, which helps the hidden message persist. Since the embedding relies on a complex set of operations, only a deep learning technique that can automatically extract complex features should be able to capture the existence of the hidden data.

Steganography
Steganography aims to hide confidential data in a different medium. Audio, static images, video, text, and similar media are used to hide the data. The idea is to ensure that only the person to whom the information is sent, and who is in possession of the key, can obtain the confidential data [5]. The medium that will hold the confidential data is called the "cover media"; after encoding, it is called the "covered media". The confidential data to be encoded in the cover media can be a text file or an image.
The most important criteria in steganography are the non-detectability and impalpability of the hidden data. Impalpability means that the data is undetectable by human senses; non-detectability means that it is immune to mathematical analysis. As in cryptology, where the encryption method is not assumed to be secret and the whole responsibility for communication security rests on the keys, steganography employs a data-hiding method that is not assumed to be secret. Each secret creates a potential point of failure, and secrecy is the main cause of fragility [6].
"Stego" media is the name given to the message to be hidden. This message could be plain text, ciphertext, another image, or anything that can be digitized into bits. As a result of the embedding process, the cover medium and the message together constitute the covered media.
Data-hiding science is divided into three parts: "Algorithm Domain", "Data Environment" and "Perception". In general, the "Spatial Domain" and "Frequency Domain" methods, which are subdivisions of the "Algorithm Domain", are used [7]. Several steganography methods have been developed to hide information in image files. These can be classified under three titles [8,9]: "Adding the Least Significant Bits", "Masking and Filtering" and "Algorithms and Transformations".
The current study proposes a method that falls under the Spatial Domain of the Algorithm Domain because it uses the blue channel of the best 5 SIFT points to hide text data. Using the SIFT algorithm is intended to make the hidden data mathematically hard to discover, as examined with the CNN-based steganalysis method in the following section.

Steganalysis
Steganalysis, which is based on mathematical and statistical methods, is generally applied to images, sound and video to look for hidden data. It is generally assumed that the attacker (steganalyst) knows the steganographic system. If the steganalyst does not know the system used, the job becomes considerably more complicated. Steganalysis methods are divided into three categories with respect to their aims: "Passive Steganalysis", which only identifies the existence of the secret message; "Active Steganalysis", which aims to find some or all of the secret message; and "Distorting Steganalysis", a top level of active steganalysis which aims to detect and destroy the hidden message and/or to replace it with a fake message. The data to be hidden can be encrypted before being embedded in the carrier. Although encrypting the confidential data is ineffective against passive steganalysis, it provides protection against active steganalysis methods.
There is no general steganalysis method that reveals data hidden by any steganographic method. However, most steganalysis applications are based on mathematical/statistical models [3]. The ultimate purpose of steganalysis is to develop a large-scale system that can be applied to all data-hiding methods rather than to a single one. In practice, however, a separate steganalysis method generally needs to be developed for each steganographic method: a steganalysis method that yields reliable results for one steganographic method may not work well for another. Steganalysis methods are also used to measure the durability of a steganographic system. Steganalysis methods are organized into three categories according to their type of attack: "Sensory Attacks", "Structural (Signature/Pattern) Attacks" and "Statistical Attacks". Examples of statistical attacks are "Neural Networks", "Clustering Algorithms", "Artificial Intelligence", "Machine Learning" and "Deep Learning". The current study employs a deep-learning-based steganalysis attack on the proposed steganography algorithm.

METHOD
For the steganography part, data hiding was performed on raw videos captured in real time, using the non-repetitive, best-quality SIFT keypoints of each frame. The LSBs (Least Significant Bits) of the blue channel at the top 5 SIFT points are used to hide the data, as given in Algorithm 1. In an unpublished PhD dissertation [15], the time complexity of the original SIFT algorithm [10] is given as Θ(αβN²), where N² is the size of a frame, α is the fraction of local extrema in a frame, and β is the fraction of local extrema that turn out to be SIFT descriptors. Both of these fractions are between 0 and 1 and depend on the visual. For the proposed Algorithm 1, assuming that the number of encoding bits is less than or equal to the total number of frames, the time complexity is [...]. Changing only the LSB ensures that the encoding is not visible to the human eye.
An example application to a ".bmp" file with a width of 2231 pixels, a height of 3361 pixels and a bit depth of 24 is given in Fig. 2. Fig. 2a shows all SIFT keypoints. Fig. 2b contains enriched partial SIFT keypoints. In Fig. 2c, there are two SIFT keypoints that share the same x and y coordinates in the spatial domain but have different angles, as given in Tab. 1. The only difference between these repetitive keypoints is the angle (kp.angle). As seen in Fig. 3, this means that although there are 5 top-quality SIFT keypoints for each frame of the real-time raw video, only 4 non-repetitive SIFT keypoints can be used in the spatial domain in this case.
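The selection of non-repetitive keypoints described above can be illustrated with a minimal Python sketch. It is not Algorithm 1 itself: the KeyPoint tuple is a hypothetical stand-in for an OpenCV keypoint (pt, response, angle), the sample values are invented for illustration, and truncating the coordinates to whole numbers is an assumption about how repeated points are identified.

```python
from collections import namedtuple

# Hypothetical stand-in for an OpenCV keypoint: pt = (x, y), response, angle.
KeyPoint = namedtuple("KeyPoint", ["pt", "response", "angle"])


def top_unique_keypoints(keypoints, k=5):
    """Take the k strongest keypoints, then drop spatial repeats."""
    strongest = sorted(keypoints, key=lambda kp: kp.response, reverse=True)[:k]
    seen = set()
    unique_pixels = []
    for kp in strongest:
        # Assumption: two keypoints repeat if their whole-number coordinates match.
        pixel = (int(kp.pt[0]), int(kp.pt[1]))
        if pixel not in seen:
            seen.add(pixel)
            unique_pixels.append(pixel)
    return unique_pixels


# Invented sample data: the first two keypoints differ only in angle.
keypoints = [
    KeyPoint((120.4, 85.7), 0.091, 12.0),
    KeyPoint((120.4, 85.7), 0.091, 251.0),
    KeyPoint((300.2, 40.1), 0.080, 90.0),
    KeyPoint((15.9, 400.3), 0.075, 33.0),
    KeyPoint((610.0, 222.8), 0.070, 180.0),
    KeyPoint((50.5, 60.5), 0.010, 5.0),  # weaker, falls outside the top 5
]

pixels = top_unique_keypoints(keypoints)
print(pixels)  # 4 usable pixel positions out of 5 top keypoints
```

With this sample, the top 5 keypoints collapse to 4 usable pixel positions, which mirrors the 5-versus-4 situation described for Fig. 2c.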
Tab. 2 contains the feature information of the top 5 SIFT keypoints of the cover media, i.e., of the ".bmp" file shown in Fig. 2a before the data was hidden. The "kp.response" value is used to rank the strongest keypoints. As seen in Tab. 2, the x and y values of the kp1 and kp2 keypoints correspond to the same whole-number coordinates in the spatial domain. The only difference between these overlapping points is the angle, as seen in the "kp.angle" column. Therefore, a frame with a nominal embedding capacity of 5 bits can store only 4 bits of data when two of its keypoints are repeated (only their angles differ). Data were hidden in the RGB (Red-Green-Blue) channels at the top-quality 5 SIFT keypoints of the ".bmp" image in Fig. 2a by means of the LSB method. Tab. 3 shows the keypoints obtained when the top-quality 5 SIFT keypoints of the covered media are acquired again after data hiding. The differences between Tab. 3 and Tab. 2 are indicated in bold. After the data-hiding process, there was little structural change in the quality order of the SIFT keypoints. Consequently, the text stored in the cover media by the steganography application (encode process) was observed to be identical to the text obtained from the covered media (decode process). Tab. 4 shows the RGB values at the best-quality 5 SIFT keypoints before and after data hiding. The RGB changes at the non-repeating points in the spatial domain (kp3, kp4 and kp5) are +1 each, while the RGB change at the repeating points (kp1, kp2) is +2. Because the keypoints themselves remain stable before and after the data-hiding process, the hidden text can be recovered from the covered media.
This is repeated for the required number of frames until the entire text is hidden in the video. To retrieve the hidden text, the process is reversed: the top 5 SIFT points of each frame are retrieved from the covered media, and the blue channel values at these points are converted back to text. Since the order of these points in each frame is identical to the order of the letters in the text, the hidden text is successfully recovered.
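The encode and decode steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: one bit is written per keypoint, frames are nested lists indexed as frame[y][x] = [r, g, b] (OpenCV itself stores channels in B, G, R order), and the keypoint positions are hypothetical fixed pixels instead of positions detected by SIFT.

```python
def text_to_bits(text):
    """ASCII text -> list of bits, most significant bit first."""
    return [(byte >> shift) & 1
            for byte in text.encode("ascii")
            for shift in range(7, -1, -1)]


def bits_to_text(bits):
    """Inverse of text_to_bits; trailing bits that do not fill a byte are ignored."""
    usable = len(bits) - len(bits) % 8
    data = bytearray()
    for start in range(0, usable, 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        data.append(byte)
    return data.decode("ascii")


def embed_bits(frame, pixels, bits):
    """Write one bit into the blue LSB of each listed pixel; return bits written."""
    for (x, y), bit in zip(pixels, bits):
        frame[y][x][2] = (frame[y][x][2] & ~1) | bit
    return min(len(pixels), len(bits))


def extract_bits(frame, pixels):
    """Read the blue LSB of each listed pixel, in keypoint order."""
    return [frame[y][x][2] & 1 for (x, y) in pixels]


def blank_frame(width, height):
    return [[[128, 128, 128] for _ in range(width)] for _ in range(height)]


message = "Hi"
bits = text_to_bits(message)
# Hypothetical keypoint positions (x, y), assumed identical in every frame.
pixels_per_frame = [(1, 1), (3, 2), (5, 0), (6, 3)]
frames = [blank_frame(8, 4) for _ in range(4)]

cursor = 0
for frame in frames:  # encode: continue frame by frame until the text is hidden
    cursor += embed_bits(frame, pixels_per_frame, bits[cursor:])

recovered = []
for frame in frames:  # decode: same keypoints, same order
    recovered.extend(extract_bits(frame, pixels_per_frame))

print(bits_to_text(recovered))
```

Decoding only works here because the same pixel positions are visited in the same order on both sides; in the proposed method this role is played by the stability of the SIFT keypoint ranking before and after embedding.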
For the second part of the study, steganalysis, deep-learning-based CNN detectors are employed to detect the existence of hidden data. For this purpose, two sets of training data are prepared: one set of 1020 images with hidden data embedded by the proposed steganography method, and one set of 1020 clean original images. A Python deep learning library, Keras, was used to create a sequential model [13]. Keras provides easy and rapid prototyping and supports convolutional networks, recurrent networks and hybrid networks consisting of both [12]. Tab. 5 presents the model used for steganalysis. The model has 4 convolutional [14], 4 max-pooling, 1 flatten and 2 dense layers and a total of 45,281 trainable parameters. The results are given in the following section.
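To make the notion of a trainable-parameter total concrete, the short sketch below tallies the parameters of a stack with 4 convolutional and 2 dense layers. The filter counts, kernel size and flattened feature size are hypothetical, since the layer sizes of Tab. 5 are not restated here, so the result is not expected to reproduce the reported 45,281. Max-pooling and flatten layers contribute no trainable parameters.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units


# Hypothetical layer sizes, chosen only for illustration.
conv_filters = [8, 8, 16, 16]
in_channels = 3  # RGB input
total = 0
for filters in conv_filters:
    total += conv2d_params(3, 3, in_channels, filters)
    in_channels = filters

flattened = 16 * 4 * 4  # hypothetical feature-map size after the last pooling
total += dense_params(flattened, 32)
total += dense_params(32, 1)  # single output unit: cover vs. stego

print(total)
```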

RESULTS
In a similar study in the cybersecurity literature [11], texts with different contents and sizes were stored in images. To compare the cover media before data hiding with the covered media after data hiding, a full-reference image quality metric, PSNR (Peak Signal-to-Noise Ratio), was suggested as a measure of the method's success.
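For reference, PSNR is commonly computed as 10·log10(MAX² / MSE), where MAX is the peak channel value (255 for 8-bit channels) and MSE is the mean squared error between cover and covered media. The sketch below applies this standard definition to flat sequences of channel values; the image size and the modified positions are hypothetical, and it is not claimed to reproduce the exact evaluation procedure of this study.

```python
import math


def psnr(cover, stego, peak=255):
    """PSNR in dB between two equal-length flat sequences of channel values."""
    if len(cover) != len(stego) or not cover:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 64 x 48 RGB image, flattened, with 4 channel values changed by 1.
cover = [100] * (64 * 48 * 3)
stego = list(cover)
for index in (2, 500, 4000, 9001):
    stego[index] += 1

print(round(psnr(cover, stego), 2))
```

Because only a handful of channel values change by one level, the MSE is very small and the PSNR is correspondingly high; modifying more values lowers it.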
In this study, data hiding is not performed on pixels that follow a predetermined static pattern derived from some mathematical operation. Instead, videos are captured in real time and the highest-quality SIFT keypoints of each frame are detected dynamically. Therefore, the initial state of the snapshot, that is, the cover media, is not available, which makes it difficult for another party to perform steganalysis. The current approach was applied to the highest-quality 5 SIFT keypoints of each frame in the digital images taken in real time. The performance tests were performed on 10-second recordings of 300 frames (30 f/s). As seen in Tab. 6, the loss of instant frames was found to be approximately 82%. Tab. 7 shows the individual SIFT keypoints in which data were hidden, frame by frame. For the 7 frames discussed, 35 SIFT keypoints would be expected, but only 23 were obtained because of the repeated points in the spatial domain (different angles of the same keypoint).
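The keypoint counts above translate into embedding capacity as in the following sketch. It assumes one hidden bit per non-repeating keypoint and uses the 7-frame figures quoted in the text; the 5-character message is a hypothetical example.

```python
import math


def usable_fraction(obtained_keypoints, frames, per_frame=5):
    """Share of the nominal keypoints that are actually usable."""
    return obtained_keypoints / (frames * per_frame)


def frames_needed(n_bits, bits_per_frame):
    """Frames required to hide n_bits at a given average capacity per frame."""
    return math.ceil(n_bits / bits_per_frame)


nominal = 7 * 5                    # 35 keypoints expected for 7 frames
fraction = usable_fraction(23, 7)  # 23 keypoints actually obtained
average_bits = 23 / 7              # assumed average capacity per frame

# Hypothetical 5-character ASCII message (40 bits).
print(nominal, round(fraction, 3), frames_needed(8 * 5, average_bits))
```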
In the carrier medium, the content of the extracted hidden data is the same as the content of the original data. The SIFT-based steganography model had a PSNR of 95.41 dB, which is higher than the values reported in previous studies. For example, an LSB steganography algorithm based on quantum circuits reports a best PSNR value of 51.14 dB [16], and another study reports 77.45 dB [20]. A method that employs pseudo-random number generators to change the LSB for data hiding reports a peak PSNR value of 59.73 dB [17]. Hajduk et al. hide a QR code in various images and achieve 71.44 dB [18]. Similarly, Zhou et al. propose another LSB-based steganography technique for color images and obtain a PSNR of 56.513 dB [19]. Besides its higher PSNR value, the method proposed in this paper also provides an easier way to retrieve the hidden text: since the quality and order of the SIFT points do not change after the data-hiding process, the original text can be retrieved from the covered media. Tab. 9 summarizes the overall characteristics of this steganography algorithm.
To test whether the data hidden by the proposed method can be detected, the CNN was employed in multiple tests. In the "accuracy" graph in Fig. 4, the accuracy value reaches 0.52 when the number of epochs is 100. Across the tests, the accuracy value varied between 0.49 and 0.52. This result is considered normal, because neither steganalysis attacks based on the pixel neighboring matrix nor steganalysis attacks with CNN detectors are successful against the proposed method. In other words, the CNN used in this study could not reliably detect data hidden by the proposed method.
As shown in Tab. 10, a total of 2040 ".bmp" files of 640 × 480 pixels were used for training: 1020 were "cover" files and 1020 were "covered/stego" files. Similarly, a total of 510 images were used for testing: 255 "cover" and 255 "covered/stego" files. The number of epochs was set to 10, 20, 50 and 100. The "cover?" column shows the number of images that the CNN reported as containing no hidden data, and the "stego?" column shows the number of images that it reported as covered files. The values in the "stego accuracy" column are inconsistent because the validation accuracy varied between 0.49 and 0.52.
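An accuracy between 0.49 and 0.52 on a test set that is evenly split between cover and stego images is close to what a detector with no discriminating power achieves. The sketch below illustrates this with a trivial baseline that labels every test image as "cover"; the baseline and the label names are illustrative assumptions and are not taken from Tab. 10.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Balanced test set as described in the text: 255 cover and 255 stego images.
true_labels = ["cover"] * 255 + ["stego"] * 255

# Hypothetical baseline detector that always answers "cover".
always_cover = ["cover"] * 510

print(accuracy(true_labels, always_cover))  # 0.5
```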

CONCLUSION
In this study, steganography and steganalysis studies were performed for instantly taken real-time raw videos. Text data was hidden using up to 5 non-repeating SIFT keypoints within each frame. Data-hiding was applied to the related pixel's blue channel with the LSB method in the spatial domain.
It was seen that there was little structural change between the SIFT keypoints before and after data hiding. As a result, the hidden message embedded in the cover media (encode process) was observed to be identical to the message extracted from the covered media (decode process). Because of the invariance of SIFT keypoints, this technique allows data to be hidden persistently in streaming media.
The success of data hiding was measured with PSNR values. The steganography application achieved a higher PSNR than the other studies in the literature [16][17][18][19][20]. In addition, steganalysis was implemented using deep-learning-based CNN detectors. The accuracy was found to vary between 0.49 and 0.52, which makes the proposed method hard to detect. These steganalysis attacks were not very successful because the PSNR of the SIFT-based steganography is high and because there are few structural differences between the "cover file" and the "covered file" to be captured.

Applications
The applications were developed in the Python programming language. For the computer vision algorithms, the "opencv" library was used on Python 2.7. The deep learning class (cnnKerasSteganalysis), based on a CNN, was implemented with the Keras 2.2.4 library on Python 3.6. The test machine had an "Intel(R) Core(TM) i5-4210U CPU @ 1.70 GHz 2.4 GHz" processor, "8.00 GB RAM" of memory and an "x64-based 64-bit operating system". The mean epoch time was calculated as 750 seconds.