Cryptanalysis of Polyalphabetic Cipher Using Differential Evolution Algorithm

Abstract: Keeping information secure is essential today, and cryptography is the most common technique for data security. The Vigenere cipher, one of the polyalphabetic encryption algorithms, has historically been used to substitute plaintext letters with other letters of the alphabet using a secret keyword and a systematic table. Cryptanalysis is used to make the ciphertext readable without the key. However, exhaustively enumerating all possible keys is computationally prohibitive, and frequency analysis alone is ineffective at recovering the letters from the cipher. Therefore, this study proposes an efficient cryptanalysis of the polyalphabetic Vigenere cipher using the Differential Evolution algorithm on English and Turkish texts of different lengths. The efficiency of the Differential Evolution algorithm is compared to that of the Genetic Algorithm and Particle Swarm Optimization in terms of the number of key letters recovered correctly. The results show that Vigenere cipher analysis using the Differential Evolution algorithm is more effective in polyalphabetic cryptanalysis.


INTRODUCTION
Cryptography and cryptanalysis are the two main branches of cryptology. The former is concerned with designing algorithms for encoding and decoding messages to keep them confidential, whereas the latter aims to recover the plaintext from the ciphertext without prior knowledge of the keyword shared between the sender and the receiver of the message [1]. Typically, classical ciphers are classified into two subgroups: transposition (or permutation) ciphers and substitution ciphers. In a transposition cipher, the encoder divides the plaintext into blocks of a certain size and applies a particular permutation to each block, or otherwise interchanges the letters in a systematic way [2][3][4][5].
Substitution ciphers may be further sub-categorized as monoalphabetic and polyalphabetic. The most common polyalphabetic algorithm is the Vigenere cipher [6,7]; it replaces each plaintext letter with another letter, found by adding the index number of the plaintext character to that of the corresponding character of an arbitrarily chosen keyword. The original message is encoded using a table of rows and columns formed by the alphabet letters of English, Turkish or any other language, by replacing the letters in the plaintext with the letters in the table based on the indices [8][9][10]. For instance, the number of possible keywords is $26^m$ in English or $29^m$ in Turkish, where $m$ is the key length. The plaintext letters and the key letters are both rewritten as sequences of integers. The integer string of the message is split into blocks whose size depends on the key size. Eq. (1) and Eq. (2) are used for encryption and decryption, respectively:
$$C_i = (P_i + K_i) \bmod N \tag{1}$$
$$P_i = (C_i - K_i) \bmod N \tag{2}$$
where $P = (P_1, P_2, P_3, \ldots, P_n)$ is a plaintext block, $K = (K_1, K_2, K_3, \ldots, K_n)$ is the key, $C = (C_1, C_2, C_3, \ldots, C_n)$ is the ciphertext block and $N$ is the number of alphabet letters in the target language.
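The encryption and decryption rules referred to as Eq. (1) and Eq. (2) can be illustrated with a minimal Python sketch. The function names, the cyclic repetition of the key over the message, and the uppercase-only alphabets are assumptions made for illustration; the Turkish alphabet string is the standard 29-letter alphabet.

```python
import string

ENGLISH = string.ascii_uppercase            # N = 26
TURKISH = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"   # N = 29

def to_indices(text, alphabet):
    """Rewrite letters as integers; characters outside the alphabet are skipped."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    return [lookup[ch] for ch in text if ch in lookup]

def vigenere(text, key, alphabet=ENGLISH, decrypt=False):
    """Apply C_i = (P_i + K_i) mod N, or P_i = (C_i - K_i) mod N when decrypt=True."""
    n = len(alphabet)
    sign = -1 if decrypt else 1
    k = to_indices(key, alphabet)
    if not k:
        raise ValueError("key must contain at least one alphabet letter")
    t = to_indices(text, alphabet)
    # Assumption: the key is repeated cyclically over the whole message.
    return "".join(alphabet[(ti + sign * k[i % len(k)]) % n]
                   for i, ti in enumerate(t))

cipher = vigenere("ATTACKATDAWN", "LEMON")
print(cipher)                                   # LXFOPVEFRNHR
print(vigenere(cipher, "LEMON", decrypt=True))  # ATTACKATDAWN
```

Passing `alphabet=TURKISH` applies the same arithmetic with N = 29.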
Since each key position can take any letter of the alphabet, the number of candidate keys grows exponentially with the key length; for example, a 25-letter English key gives $26^{25} \approx 2.4 \times 10^{35}$ candidates. Therefore, manual cryptanalysis and brute-force cryptanalysis are ineffective due to their computational cost and effort. Accordingly, metaheuristic algorithms are useful for performing a systematic search and finding the optimal key.
Nature-inspired algorithms have been utilised by researchers in the cryptanalysis of classical cryptosystems, and positive outcomes have been reported in many studies. Spillman et al. [11] implemented a Genetic Algorithm (GA) to break a monoalphabetic substitution cipher. Furthermore, Genalyst was proposed by Matthews [12] to break the transposition cipher. Clark [13] presented GA, Tabu Search (TS) and Simulated Annealing (SA) to cryptanalyze the substitution cipher. Moreover, Clark et al. [14] were the first to recommend the adoption of GA to attack a polyalphabetic substitution cipher. Clark and Dawson [15] improved the work with a parallel GA to attack the Vigenere cipher. Clark and Dawson [16] also compared SA, GA and TS on simple substitution ciphers. Dimovski and Gligoroski [17] applied SA, GA and TS to the cryptanalysis of the transposition cipher. Verma et al. [18] presented cryptanalysis of the monoalphabetic substitution cipher based on GA and TS and compared the overall efficiency of these algorithms. An automated approach to the cryptanalysis of the transposition cipher was developed in the works of Song et al. [19] and Garg [20] based on GA, TS and SA algorithms. In addition, Omran et al. [21] developed a GA to attack the Vigenere cipher. Bhateja and Kumar [22] adopted elitism in GA with a novel fitness function and applied it to cryptanalyze the Vigenere cipher. Boryczka and Dworak [23] considered evolutionary algorithms to increase the speed of cryptanalysis of the transposition cipher. Uddin and Youssef [24] applied Ant Colony Optimization (ACO) to attack simple substitution ciphers. Bhateja et al. [25] investigated the performance of the Cuckoo Search (CS) algorithm in the cryptanalysis of the Vigenere cipher, whilst Luthra and Pal [26] examined the integration of mutation and crossover with the Firefly Algorithm (FA) for cryptanalysis of the monoalphabetic cipher. Sabonchi and Akay [27,28] presented the Artificial Bee Colony (ABC) algorithm for cryptanalysis of substitution ciphers.
Nonetheless, such techniques are inefficient in analysing the cipher if the key size exceeds 15 characters. Differential Evolution (DE) [29], one of the successful evolutionary algorithms in problem solving, has performed well on many problems in various research fields [30]. This encourages further work on the DE algorithm in the cryptanalysis of the Vigenere cipher, which is the aim of this study. Several encrypted English and Turkish texts of different lengths and several keyword sizes are used to evaluate the efficiency of DE, GA and Particle Swarm Optimization (PSO) in Vigenere cipher analysis.
The rest of the paper is organized as follows: in Section 2, brief descriptions of the algorithms used in this study are presented. In Section 3, the proposed cryptanalysis approach based on the DE algorithm is provided. In Section 4, the experimental study is explained and the results are discussed. Finally, Section 5 is dedicated to the conclusion and future work.

BRIEF DESCRIPTION OF THE ALGORITHMS USED IN THE STUDY
Metaheuristic algorithms search for solutions systematically through directed and randomized search, especially for problems that are computationally intractable. We used three of these algorithms in our study: Genetic Algorithm, Particle Swarm Optimization and Differential Evolution. Brief descriptions of these algorithms are provided below.

Genetic Algorithm
Genetic Algorithms (GAs) were presented by Holland [31]; they modify the idea of the Evolutionary Algorithm through the addition of a phase referred to as crossover. A GA includes a random number generator, genetic operators for reproduction, and a fitness evaluation unit. The main steps of the GA are as follows: 1: Initialize Population, 2: repeat, 3: Evaluation, 4: Reproduce, 5: Crossover, 6: Mutation, 7: until requirements are met. In the initialization step, random solutions (x_i) are generated and then the value of the cost function f(x) is evaluated for every chromosome in the population.
In the reproduction step, two chromosomes are selected from the population, and a chromosome with a higher fitness value has a better chance of being chosen. In the crossover step, two new offspring are generated by applying the crossover operator to the chosen parents. In the mutation step, if a random number within the range (0, 1) is less than the mutation rate (MP), one or more parameters of the offspring are mutated to introduce diversity between the parents and the offspring. Then the parents are discarded and the offspring are kept in the population.
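The GA steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: chromosomes are lists of integers, selection is fitness-proportionate with non-negative fitness values, crossover is one-point, mutation is applied per gene, and all parameter values and names are hypothetical.

```python
import random

def genetic_algorithm(fitness, length, n_symbols, pop_size=30, generations=50,
                      mutation_rate=0.05, seed=0):
    """Maximize `fitness` over chromosomes of `length` integers in [0, n_symbols)."""
    rng = random.Random(seed)
    # Initialization: random chromosomes.
    pop = [[rng.randrange(n_symbols) for _ in range(length)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Evaluation; the small offset keeps the weights valid if all scores are 0.
        weights = [fitness(c) + 1e-9 for c in pop]
        offspring = []
        while len(offspring) < pop_size:
            # Reproduction: fitter chromosomes are more likely to be chosen.
            p1, p2 = rng.choices(pop, weights=weights, k=2)
            # Crossover: one cut point yields two offspring.
            cut = rng.randrange(1, length) if length > 1 else 0
            for child in (p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]):
                # Mutation: a gene changes when a random number is below the rate.
                for j in range(length):
                    if rng.random() < mutation_rate:
                        child[j] = rng.randrange(n_symbols)
                offspring.append(child)
        pop = offspring[:pop_size]              # parents are discarded
        best = max(pop + [best], key=fitness)   # remember the best seen so far
    return best
```

The best-so-far chromosome is only remembered for reporting; it is not reinserted into the population, so the sketch does not use elitism.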

Particle Swarm Optimization
Particle Swarm Optimization (PSO), presented by Kennedy and Eberhart [32], models the collective behavior of flocking birds or schooling fish. In the algorithm, each particle uses its previous experience when moving its position towards the best position found so far. The main steps of the PSO algorithm are as follows: 1: Initialize Population, 2: repeat, 3: Evaluate, 4: Update the best experience of all particles, 5: Choose the best particle, 6: Calculate particles' velocities, 7: Update particles' positions, 8: until requirements are met.
In the initialization step, a random position (x_i) is generated for each particle using Eq. (3), and then the fitness function f(x) of each particle is computed in the evaluation step:
$$x_{ij} = x_j^{\min} + \mathrm{rand}(0, 1)\,\left(x_j^{\max} - x_j^{\min}\right) \tag{3}$$
where $x_{ij}$ is the position of the $i$-th particle in dimension $j$, $i = 1, \ldots,$ swarm size, $j = 1, \ldots,$ dimension of the problem, and $x_j^{\min}$ and $x_j^{\max}$ refer to the lower and upper bounds, respectively.
The position of each particle is updated using Eq. (4) in the update step:
$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{4}$$
where $x_i^{t}$ is the position of each particle at iteration $t$, and $v_i^{t}$ is the velocity of each particle at iteration $t$.
Then the particle with the best fitness value f(x) is chosen, and the velocity is updated using Eq. (5):
$$v_i^{t+1} = w\,v_i^{t} + c_1 r_1 \left(pb_i - x_i^{t}\right) + c_2 r_2 \left(gb - x_i^{t}\right) \tag{5}$$
where $w$ is the inertia weight, $c_1$ and $c_2$ are the cognitive and social components, respectively, $pb_i$ is the personal best position of the $i$-th particle, $gb$ is the global best position found by any particle, and $r_1$ and $r_2$ are random values within the range (0, 1).
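The PSO steps above can be sketched in a few lines of Python. The sketch assumes minimization of a cost function, zero initial velocities, positions clamped to the bounds, and illustrative values for w, c1 and c2; none of these choices are taken from the study.

```python
import random

def pso_minimize(f, dim, lower, upper, swarm_size=20, iterations=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over [lower, upper]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    # Initialization: uniform random positions inside the bounds.
    x = [[lower + rng.random() * (upper - lower) for _ in range(dim)]
         for _ in range(swarm_size)]
    v = [[0.0] * dim for _ in range(swarm_size)]   # assumption: start at rest
    pb = [xi[:] for xi in x]                       # personal best positions
    pb_cost = [f(xi) for xi in x]
    g = min(range(swarm_size), key=pb_cost.__getitem__)
    gb, gb_cost = pb[g][:], pb_cost[g]             # global best position
    for _ in range(iterations):
        for i in range(swarm_size):
            for j in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive pull + social pull.
                v[i][j] = (w * v[i][j]
                           + c1 * r1 * (pb[i][j] - x[i][j])
                           + c2 * r2 * (gb[j] - x[i][j]))
                # Position update, clamped to the search bounds.
                x[i][j] = min(upper, max(lower, x[i][j] + v[i][j]))
            cost = f(x[i])
            if cost < pb_cost[i]:
                pb[i], pb_cost[i] = x[i][:], cost
                if cost < gb_cost:
                    gb, gb_cost = x[i][:], cost
    return gb, gb_cost
```

For a quick check, the function can be called on a simple cost such as `lambda x: sum(t * t for t in x)`.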

Differential Evolution Algorithm (DE)
The Differential Evolution algorithm is an intelligent search algorithm proposed by Storn and Price [29]. In DE, all variables are represented as real numbers, and the mutation, crossover and selection operators are applied iteratively. The main steps of the DE algorithm are as follows: 1: Initialize Population, 2: Evaluation, 3: repeat, 4: Mutation, 5: Crossover, 6: Evaluation, 7: Selection, 8: until requirements are met.
In the initialization step, a random individual (x_i) is generated using Eq. (3), and then the cost function f(x_i) of each individual is computed in the evaluation step.
In the mutation step, a mutant individual ($x_i^{mutation}$) is generated using Eq. (6):
$$x_i^{mutation} = x_{r_1} + F\,\left(x_{r_2} - x_{r_3}\right) \tag{6}$$
where $r_1$, $r_2$ and $r_3$ are randomly generated integer indices that are mutually different and all different from $i$, and $F$ is the scaling factor, chosen in the range (0, 1).
In the crossover step, a trial individual ($x_i^{trial}$) is generated using Eq. (7):
$$x_{ij}^{trial} = \begin{cases} x_{ij}^{mutation}, & \text{if } \mathrm{rand}(0, 1) \le CR \text{ or } j = j_{rand} \\ x_{ij}, & \text{otherwise} \end{cases} \tag{7}$$
where $j_{rand}$ is a random integer index and $CR$ is the crossover rate, chosen within the range (0, 1).
In the selection step, each trial individual ($x_i^{trial}$) competes with the individual (x_i), and the better one is kept in the population.
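The mutation, crossover and selection operators can be sketched as below. This is a minimal sketch of the common DE/rand/1/bin scheme under stated assumptions: the cost is minimized, F and CR are fixed (illustrative values), mutants are clamped to the bounds, and the population holds at least four individuals. The function names are hypothetical.

```python
import random

def mutate(pop, i, F, rng):
    """Mutant built from three distinct individuals, none of which is individual i."""
    r1, r2, r3 = rng.sample([k for k in range(len(pop)) if k != i], 3)
    return [a + F * (b - c) for a, b, c in zip(pop[r1], pop[r2], pop[r3])]

def crossover(target, mutant, CR, rng):
    """Binomial crossover; position j_rand is always taken from the mutant."""
    j_rand = rng.randrange(len(target))
    return [m if (rng.random() <= CR or j == j_rand) else t
            for j, (t, m) in enumerate(zip(target, mutant))]

def de_minimize(f, dim, lower, upper, pop_size=20, generations=100,
                F=0.5, CR=0.9, seed=0):
    """Minimize f over [lower, upper]^dim."""
    rng = random.Random(seed)
    pop = [[lower + rng.random() * (upper - lower) for _ in range(dim)]
           for _ in range(pop_size)]
    cost = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            mutant = [min(upper, max(lower, m)) for m in mutate(pop, i, F, rng)]
            trial = crossover(pop[i], mutant, CR, rng)
            trial_cost = f(trial)
            # Selection: the trial replaces the target only if it is not worse.
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    b = min(range(pop_size), key=cost.__getitem__)
    return pop[b], cost[b]
```

Because selection is greedy, the best cost in the population never gets worse from one generation to the next.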

PROPOSED DE-BASED CRYPTANALYSIS
In this study, the cryptanalysis steps follow the pseudocode below:
• Initialize population pop(i) using Eq. (3), control parameters,
• while requirements are not met do,
• for every suggested key(i) ∈ pop(i) do
  o Evaluate f (suggested key(i)) using Eq. (8),
  o Apply mutation operator to create trial suggested key(i) using Eq. (6),
  o Apply crossover operator to create offspring suggested key*(i) using Eq. (7),
The cost function (fitness function) has a critical role in the efficiency of a metaheuristic algorithm, since it measures the quality of a candidate key. The objective is to provide a meaningful and comparable value to guide the algorithm. A solution with better fitness has a chance to remain in the next generation and to continue towards the optimal solution. A search guided by the fitness function may return locally optimal solutions; the quality of the result is higher if the global optimum is reached.
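A DE-based key search along these lines might look as follows. The sketch makes several assumptions that are not stated in the text: each individual is a real vector in [0, N) per key position, it is decoded to a keyword by truncation, mutated values wrap around modulo N, the cost is minimized, and the parameter values are illustrative. The `cost` argument stands in for the fitness function and is supplied by the caller.

```python
import random

def vector_to_key(x, alphabet):
    """Decode a real-valued individual into a keyword (assumed: truncate, then clamp)."""
    n = len(alphabet)
    return "".join(alphabet[min(n - 1, max(0, int(xj)))] for xj in x)

def de_key_search(cost, key_length, alphabet, pop_size=40, generations=200,
                  F=0.5, CR=0.9, seed=0):
    """Search for the keyword that minimizes cost(key)."""
    rng = random.Random(seed)
    n = len(alphabet)
    pop = [[rng.random() * n for _ in range(key_length)] for _ in range(pop_size)]
    costs = [cost(vector_to_key(x, alphabet)) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            j_rand = rng.randrange(key_length)
            trial = []
            for j in range(key_length):
                if rng.random() <= CR or j == j_rand:
                    # Mutation, wrapped so the value stays in the letter range.
                    trial.append((pop[r1][j] + F * (pop[r2][j] - pop[r3][j])) % n)
                else:
                    trial.append(pop[i][j])
            trial_cost = cost(vector_to_key(trial, alphabet))
            # Selection: keep the trial key if it is at least as good.
            if trial_cost <= costs[i]:
                pop[i], costs[i] = trial, trial_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return vector_to_key(pop[best], alphabet), costs[best]
```

With a toy cost such as the number of positions that differ from a known keyword, the function can be exercised without any language statistics.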
In the present study, we employed a fitness function defined using the unigram and bigram statistics of the language considered [22,25]. The fitness function f of a suggested key K can be defined by Eq. (8):
$$f(K) = \lambda_1 \sum_{i} \left| EFM(i) - OFM(i) \right| + \lambda_2 \sum_{i} \left| EFB(i) - OFB(i) \right| \tag{8}$$
where K is the key used to decode the message, OFM(i) and EFM(i) are the observed and expected frequencies of the i-th unigram, respectively, and OFB(i) and EFB(i) are the observed and expected frequencies of the i-th bigram, respectively. λ_1 and λ_2 are the weights assigned to the unigram and bigram statistics, respectively. The optimal weights, λ_1 = 0.23 and λ_2 = 0.77, were determined in [33] based on the percentage of correctly retrieved words at different ciphertext and keyword lengths. In this study, we considered texts written in English and Turkish. The most frequent unigrams and bigrams in these languages are found through computation. The frequency values observed for unigrams are subtracted from the expected frequencies, and the sum of the absolute differences is calculated. The same procedure is performed for bigrams. Tab. 1 and Tab. 2 present the expected values for unigrams and bigrams [34], generated using around 4.5 billion characters in English; similarly, Tab. 3 and Tab. 4 present the expected values in Turkish [35].
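A fitness function of this kind can be sketched as follows. The sketch assumes that differences are taken in absolute value, that frequencies are relative frequencies, that a lower value indicates a better key, and that the expected tables are passed in as dictionaries; the function names are hypothetical, and only the weights 0.23 and 0.77 come from the text.

```python
from collections import Counter

def ngram_frequencies(text, n):
    """Relative frequency of every n-gram in text (n=1 unigrams, n=2 bigrams)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return {g: c / len(grams) for g, c in Counter(grams).items()}

def frequency_distance(observed, expected):
    """Sum of absolute differences between expected and observed frequencies."""
    return sum(abs(expected.get(g, 0.0) - observed.get(g, 0.0))
               for g in set(observed) | set(expected))

def decrypt(ciphertext, key, alphabet):
    """Vigenere decryption with the key repeated cyclically."""
    n = len(alphabet)
    idx = {ch: i for i, ch in enumerate(alphabet)}
    return "".join(alphabet[(idx[c] - idx[key[i % len(key)]]) % n]
                   for i, c in enumerate(ciphertext))

def fitness(key, ciphertext, expected_uni, expected_bi, alphabet,
            lam1=0.23, lam2=0.77):
    """Weighted unigram and bigram deviation of the text decoded with `key`."""
    text = decrypt(ciphertext, key, alphabet)
    return (lam1 * frequency_distance(ngram_frequencies(text, 1), expected_uni)
            + lam2 * frequency_distance(ngram_frequencies(text, 2), expected_bi))
```

In practice `expected_uni` and `expected_bi` would be filled from language tables such as those cited for English and Turkish; for a quick experiment they can be estimated from any reference text with `ngram_frequencies`.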

EXPERIMENTS
In the first part of the experiments, the results of the proposed DE algorithm on the cryptanalysis of Vigenere cipher are presented. In the second part, the results of the DE, GA and PSO algorithms on the cryptanalysis of Vigenere cipher are compared to examine the efficiency of the DE method over the GA and PSO algorithms.
In all experiments, we assume that both the plaintext and the ciphertext contain only letters of the English (26 letters) or Turkish (29 letters) alphabet. We investigated keywords of five different sizes (5, 10, 15, 20 and 25) and plaintexts of four different lengths (250, 500, 750 and 1000). The best values of the control parameters used in the experiments were obtained by grid search and are presented in Tab. 5. Each experiment is repeated 30 times, and statistics of these runs are reported in the results.
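The per-run measure (number of key characters recovered correctly) and its statistics over repeated runs can be computed as in the sketch below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which form is reported.

```python
import statistics

def correct_key_characters(recovered, true_key):
    """Number of positions at which the recovered key matches the true key."""
    return sum(r == t for r, t in zip(recovered, true_key))

def summarize_runs(recovered_keys, true_key):
    """Min, max, mean and standard deviation of correct characters over runs."""
    counts = [correct_key_characters(k, true_key) for k in recovered_keys]
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        # Population form (assumption); statistics.stdev gives the sample form.
        "std": statistics.pstdev(counts),
    }

# Hypothetical outcome of three runs against the key "LEMON".
print(summarize_runs(["LEMON", "LEMOM", "LEMON"], "LEMON"))
```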

Experiment 1: DE Algorithm in the Vigenere Cipher Cryptanalysis
Tab. 6 and Tab. 7 display the retrieved key characters and the fitness levels obtained for English and Turkish ciphertexts using DE algorithm, respectively.
When the ciphertext size is 250 characters, the minimum, maximum and mean numbers of key characters recovered correctly are lower than those obtained for ciphertexts of size 500, 750 and 1000. The standard deviation of the number of key characters recovered correctly is also higher for ciphertexts of size 250 than for ciphertexts of size 500, 750 and 1000.
As the ciphertext length increases (> 250 characters), the number of key characters recovered correctly increases as well, because a longer ciphertext makes the fitness more reliable and yields a good approximation to the expected values. The results also show that the number of iterations required is directly related to the key and ciphertext lengths: shorter ciphertexts require more iterations, whereas encrypted texts with more characters make key estimation more effective and reduce the number of iterations needed. Interestingly, the accuracy of key recovery is typically higher for Turkish than for English texts, even if the ciphertext is short (≤ 250 characters). The average word length in Turkish is 6.1 letters, about 30% more than in English; moreover, short words (with 3 to 8 letters) represent over 60% of total usage in Turkish, which provides a wealth of information for cryptanalysis [35].

Experiment 2: Comparison of DE, GA and PSO in the Vigenere Cipher Analysis
In this part of the study, the proposed DE algorithm is compared to other search algorithms, namely GA and PSO, on Vigenere cipher analysis. The best, mean and standard deviation of the number of key characters recovered correctly by the DE, GA and PSO algorithms are considered to validate the results. These results are shown in Tab. 8 for English ciphertexts and in Tab. 9 for Turkish ciphertexts.
When the key size is 5 characters or fewer, the best value is the same for all three algorithms, although in all cases DE has the highest mean value and the lowest standard deviation compared to GA and PSO. When the ciphertext size is greater than 250 characters, GA produces a better best value and a higher mean value with a lower standard deviation than PSO. The best values produced by GA and DE are very close, but the mean value and standard deviation produced by DE are better. The results show that GA retrieves almost all characters when the keyword size is greater than 15 and the ciphertext size is greater than 250. PSO is efficient when the Vigenere cipher uses a short key (less than 5 characters), while it is not so efficient when dealing with longer keys (greater than 5 characters). It is also found that DE is more efficient than GA when the ciphertext length is 500 characters or fewer. Furthermore, the mean and standard deviation of the number of key characters recovered correctly by DE are better than those obtained from GA and PSO.

CONCLUSION
In Vigenere cryptanalysis, there is a huge range of possible keys, and manual cryptanalysis and statistical techniques are inefficient when the key is long. This study aimed to analyze the suitability of the DE algorithm as a cryptanalytic tool. The results show that it is an efficient method for the Vigenere cipher. DE is able to retrieve all characters of the keyword when the key size is 25 characters and the ciphertext size is more than 250 characters, while GA and PSO can retrieve the entire key correctly only when the key is short. Based on the experimental results, we conclude that the results of DE are better than those obtained from GA and PSO in the cryptanalysis of the Vigenere cipher. The computational results and comparisons also demonstrate that the number of iterations is directly related to the key and ciphertext lengths, and that the accuracy of key recovery is typically higher for Turkish texts than for English texts, even if the ciphertext is short (≤ 250 characters). Tailoring efficient fitness functions remains to be studied in future work.