Simulation of Academic Computer Networks Using Probability Distributions: A Case Study in A Campus Network

Computer networks are becoming more complex as technology advances, which makes their installation more complicated and costly. For this reason, many parameters of existing or planned networks, such as requirements, limits and performance, are modelled with simulators, which saves both time and cost. Campus networks are established by consolidating many local area networks. The aim of this study is to model, through simulators, campus networks that follow a general daily behaviour pattern. The data used in the study were collected in real time from Siirt University. The daily behaviour of the network during working hours is divided into four separate time intervals according to the network traffic and in line with similar studies in the literature. The most appropriate distributions for modelling the transmission times of the incoming and outgoing packets in each time interval are identified, and the results are compared with previous studies that model campus networks. In addition, the most generic distributions for modelling the daily incoming and outgoing traffic of the network are identified. The distribution that best models the transmission times of the network packets was found to be the lognormal distribution for TCP packets and the Generalized Pareto distribution for UDP packets. The fit of the distributions was assessed with the Kolmogorov-Smirnov and Chi-Squared tests.


INTRODUCTION
To accurately assess the effect of the protocols, applications and users involved in a network simulation, it is very important to generate simulated traffic. There are two types of traffic to consider when modelling with simulators. The first is the application-specific traffic to be modelled for the target application, and the second is the background traffic generated by other applications on the network. Background traffic has a significant effect on the behaviour of the target application with regard to the use of network resources [1]. The extent of this effect has been analysed by various researchers.
Venkatesh and Vahdat [2] conducted a behavioural analysis on the synthetic and real background traffic for different applications. According to the results they obtained, it was concluded that each application was affected by the intensity in traffic at certain levels, depending on the type of application.
In another study conducted by Venkatesh and Vahdat [3], it was demonstrated that structural traffic models specific to applications can be successfully established. When generating new traffic, the transmission frequency and time, distribution of packet sizes, characteristics of the flows and destination internet protocol (IP) and destination port addresses of the packets from the original traffic were taken into consideration. They demonstrated through their study that the traffic they generated with a different application, different network and user conditions was compatible with the real network traffic.
It was demonstrated by Nahum et al. [4] that WAN (Wide Area Network) conditions had a significant effect on network performance. In this study, parameters (file size, request transmission time, etc.) that might cause traffic density on the servers were identified.
Eylen and Bazlamaçı [5] needed background traffic in order to obtain traffic similar to real traffic conditions. For this purpose, background traffic was generated with a Poisson distribution in order to add random delays to the test packets used in the study. The real traffic was modelled by generating traffic at three different rates, which allowed the proposed method to be analysed more accurately under background traffic.
When the conducted studies are examined, it can clearly be seen that the background traffic has a significant effect on both the applications and servers. The size of this effect varies according to many different parameters. For this reason, to ensure realistic analysis of the application, it is necessary to model the network traffic generated outside the application (background traffic).
So far, many different distributions have been used to model the traffic that occurs on a network during the day. In earlier work, exponential modelling of packet transmission times, as discussed by Paxson and Floyd [6], was accepted as a convenient method. In later years, the Poisson distribution was shown to be accurate for designing a flow-based internet traffic model [7]. In 2008, Fras et al. [8] modelled the statistical processes of network traffic by using the probability density function. Histograms of the measured traffic were used to determine the parameters of the Pareto, Weibull and exponential distributions used in the study. The best-fitting distribution was selected with the Kolmogorov-Smirnov, Anderson-Darling and Chi-Squared goodness of fit tests. In terms of packet size, the Weibull distribution was found to be more suitable than the other distributions in all three tests.
Bhattacharjee and Nandi [9] compared the Log-Normal and Pareto distributions for modelling the transmission times of academic network data. In that study, which was based on the statistical analysis of data in terms of location and time, the authors concluded that the Log-Normal distribution fits their data better than the Pareto distribution.
However, it was shown that a single probability distribution is not suitable for the different behaviour of the network over different time periods [10,11]. Garsva et al. [10] conducted a statistical analysis of academic network data collected with Netflow. In this study, the network traffic was divided into eight time intervals. In general, the Pareto 2 distribution was found suitable for modelling the packet transmission times during the more intensive (heavy tail) time intervals, and the Weibull and Pareto 2 distributions were more suitable for modelling the packet transmission times during the low-intensity traffic hours.
The studies conducted so far have identified fitting distributions for modelling different types of networks, but they did not describe the architecture of the modelled network [9][10][11]. Moreover, there are only a few studies on networks with periodic behaviour throughout the day. In this study, both the architecture of the network from which the data are obtained and the probability distributions that model the packet transmission times within different time intervals are provided in comparison with the previous studies.
The rest of the paper is organized as follows. The second section describes the modelled network architecture. The third section presents a statistical analysis of the network traffic data. The fourth section briefly describes the tests used to model packet transmission times. The statistical analysis of the packet transmission times is provided in the fifth section. We complete the paper with the conclusion and some guidelines for future work.

MODELED NETWORK ARCHITECTURE
The network structure of the university is shown in detail in Fig. 1. Ulaknet provides the access of the University to the internet and the infrastructure is provided by the service provider (Türk Telekom). The bandwidth of the University is 500 Mbps [12]. There are local area networks between the faculties and vocational schools in the University. Each client connected to the network sends a request to open a port from the university to access the Internet. This request then passes through the firewall to reach Ulaknet and then internet access is provided. To access a server in the university network, again, a request is sent to open a port. However, direct access to the server is provided for this request without having to pass through the firewall. Communication between clients is provided through the Cisco Switch, without the need to open a port.

GENERAL STATISTICAL ANALYSIS OF NETWORK
In this part of the study, the Z-score method, which is used for eliminating outliers, is described first. Then, the detailed statistical analysis of the modelled campus network is presented.

Outlier Analysis
Various distributions are used in the literature to model different networks. The structure and overall behaviour of the network should be taken into consideration in determining the compatibility of the distributions. In our study, by reference to a prior study [13], modelling was performed by taking only weekdays into account when campus network data are obtained. The sample data belonging to the collected dataset is shown in Fig. 2. Traffic for one day is first classified as incoming and outgoing traffic. Each case is then divided into 4 different time intervals. Heavy network traffic conditions (working hours) are taken into consideration in determining these time intervals. In Tab. 1, the packets for incoming traffic are divided between time intervals 1 to 4 and the packets for outgoing traffic are divided between time intervals 5 to 8. When the table is analysed, it can be seen that the majority of the incoming and outgoing packets are transmitted in time intervals 2 (20.29%) and 6 (27.61%). A significant increase in network traffic was observed with the start of the workday and a significant decrease was observed with the end of the workday. After the incoming and outgoing traffic is divided into time intervals and collected, before proceeding with the analysis of the data, the outlier values within the data need to be identified.
Observations that are numerically distant from the rest of the data for some reason are called outliers. Outliers often have negative effects such as increasing the error variance, influencing estimation results, and reducing the power of statistical tests [14]. Therefore, outlier analysis methods are applied to data that do not have a normal distribution and contain too many outliers [15].
In this study, the Z-score method, which is a statistical approach, is used to determine the outliers. In the Z-score method, the mean (µ) and standard deviation (σ) of the data are used to compute a z value for each observation x and to decide whether that observation is an outlier (Eq. (1)).
z = (x - µ) / σ    (1)
The z value obtained from Eq. (1) is considered normal if it lies within the (-3, 3) range; all values outside this range are treated as outliers [16]. In our study, the calculations are based on the number of packets that fall within each time interval: the average number of packets was calculated for each time interval, and the z values were computed from it. The percentage of outliers reported by Garsva et al. [13] for TCP and UDP (TCP: 3.39, UDP: 3.77) is considerably higher than the values obtained in our study (TCP: 0.77, UDP: 0.37), roughly 4 times for TCP and 10 times for UDP. Fig. 3 compares the data obtained with the interquartile range (IQR) technique used by Garsva et al. [13] with the data obtained when the z-score method is applied. The results of the IQR method are shown on the left of the figure and the results of the z-score method on the right. The IQR method behaves inconsistently on our data: in Fig. 3a and Fig. 3b it cuts the data at lower values than the z-score method, but in Fig. 3c it fails to eliminate even very high values. Thus, the z-score method was preferred in our study because it eliminates outliers more consistently.
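The two outlier rules discussed here can be illustrated with a minimal sketch. It assumes the per-interval packet counts are available as a plain list of numbers; the function names, the use of the population standard deviation, the conventional IQR multiplier of 1.5 and the sample data are all illustrative assumptions, not details taken from the study.

```python
from statistics import mean, pstdev, quantiles


def zscore_filter(values, threshold=3.0):
    """Keep values whose z-score lies strictly inside (-threshold, threshold)."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs((v - mu) / sigma) < threshold]


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is an assumption)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical packet counts: 50 ordinary values plus one extreme burst.
counts = [100 + (i % 7) for i in range(50)] + [5000]
kept_z = zscore_filter(counts)
kept_iqr = iqr_filter(counts)
print(len(counts), len(kept_z), len(kept_iqr))
```

On this toy input both rules drop only the extreme value. On real traffic the two rules can disagree, which is the behaviour the comparison in Fig. 3 examines.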
On the left side of Tab. 2, for example, time interval 1 represents the incoming packets within the interval 07:00-10:59. While 78.34% of these packets are TCP packets, 21.61% of them are UDP packets. Of the incoming TCP packets within this interval, 0.26% are outliers, and of the UDP packets, 0.15% are outliers; these were eliminated. When the table is analysed as a whole, it is observed that the proportion of outliers is higher in the periods when the network traffic is intense (2-3 for incoming traffic, 6-7 for outgoing traffic). At the same time, an average outlier ratio of 0.77% is observed for the TCP protocol, while the average value is 0.37% for the UDP protocol. The majority of outliers for the TCP protocol are observed in the incoming packets (1.04%), whereas for the UDP protocol, more outliers are observed in the outgoing packets (0.45%).

Statistical Analysis of Network
After the outliers in the data are eliminated, the graphical distribution of the daily traffic is obtained. The graphs are given in Fig. 3. Both the per-protocol graphs and the graphs containing all protocols are given in detail. The x-axis of the graphs represents the time of day and the y-axis represents the number of flows within the relevant time interval. The representation of the graphs follows Garsva et al. [13]. It can be seen that the distribution of network traffic in the graphs is consistent with the distribution over the time intervals presented in Tab. 1.
In Tab. 3 and Tab. 4, packet and flow information on incoming and outgoing traffic is given in detail, respectively. In terms of incoming traffic, TCP traffic is 7.9 times the UDP traffic. The traffic generated by the ICMP protocol is very low compared to TCP and UDP traffic. The number of TCP flows is 3.4 times the number of UDP flows, but the number of packets per flow is almost the same for TCP and UDP in incoming traffic. The average size of TCP packets is higher than that of UDP packets (around 2.4 times).
In terms of outgoing traffic, TCP traffic is 6.3 times the UDP traffic. The number of TCP flows is 4.3 times the number of UDP flows, but unlike incoming traffic, the number of packets per flow for TCP is 1.6 times that of UDP. When the average packet sizes are compared for outgoing traffic, no significant difference is seen.

Figure 2 Sample dataset
The right side of the tables presents the data for the incoming and outgoing traffic of the network modelled by Garsva et al. [13]. In terms of total traffic, the values in study [13] and in our study are close. However, when the number of flows is compared per protocol, the number of TCP flows in our study is approximately 3 times that in study [13], while the numbers of UDP and ICMP flows in study [13] are higher than those in the traffic we modelled. The average number of packets per flow and the average packet size in study [13] are higher than the values of the network in our study.
Incoming and outgoing data traffic is also analysed with regard to some well-known ports. The number of transmitted packets and the packet size information by port type are presented in Tab. 5 and Tab. 6. There are 65536 ports available for use in TCP or UDP. They are divided into three ranges [10,17]:
- 0-1023: Well-known ports
- 1024-49151: Registered ports
- 49152-65535: Dynamic ports
Well-known ports, such as those of HTTP, FTP and SMTP, are port numbers used for standard pre-defined operations. Registered ports can be used by common user operations or programs executed by common users in most systems, whereas dynamic ports can be used dynamically by any application [10,17]. When Tab. 5 and Tab. 6 are analysed, it can be concluded that, for incoming packets, the highest number of packets is transmitted from ports in the range 0-1023, and the port with the largest average packet size is port 443. For outgoing packets, the highest number of packets is transmitted from ports in the range 49152-65535, which are used by dynamic applications; however, the port with the largest average packet size is port 80.
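The port grouping used for Tab. 5 and Tab. 6 can be sketched as a small classifier. This is only an illustration under assumptions: the function name and the sample port list are hypothetical, and the range boundaries are the standard ones (0-1023, 1024-49151, 49152-65535).

```python
from collections import Counter


def classify_port(port: int) -> str:
    """Map a TCP/UDP port number to its range name."""
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


# Hypothetical source ports taken from a handful of packets.
sample_ports = [80, 443, 443, 53, 8080, 3306, 49152, 60001]
packets_per_range = Counter(classify_port(p) for p in sample_ports)
print(packets_per_range)
```

Counting packets (or summing packet sizes) per range in this way gives the kind of per-range totals that the port tables summarise.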

GOODNESS OF FIT TESTS
In this section, the probability distributions that most accurately model the time intervals presented in Tab. 1 and the general traffic of one day are identified. Kolmogorov-Smirnov and Chi-Squared tests are performed to determine the most suitable distribution. Both are nonparametric tests. Nonparametric tests are widely used when prior knowledge about the data to be modelled is not available. They can also work with a limited amount of data and, in general, they process data faster than parametric tests. In this study, nonparametric tests were preferred because the data were sparse in some of the time intervals to be modelled. Also, since it would not be correct to judge suitability with a single test, the two most commonly used nonparametric tests, Kolmogorov-Smirnov and Chi-Squared, were chosen. The descriptions of these tests and the parameters of the fitted distributions are presented in detail in the subsections.

Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (K-S) test is a nonparametric goodness of fit test used to detect differences between distributions. It provides more successful results than parametric tests in cases where the assumptions about the data are insufficient. The K-S test is applied in the modelling of internet traffic as well as in fields such as astronomy and wireless sensor networks. The aim of the K-S test is to compare the Cumulative Distribution Function (CDF) of the data with the proposed CDF [18,19]. This comparison is performed by following the steps below [19,20].
Step 1: The hypotheses are stated. Hypothesis 0 states that the observed frequencies are equal to the expected frequencies; Hypothesis 1 states that they are not.
Step 2: The test statistic is calculated with the formula in Eq. (2). In Eq. (2), D represents the test statistic, F_o represents the observed cumulative frequency and F_e represents the expected cumulative frequency.
D = max |F_o - F_e|    (2)
Step 3: The critical value is calculated with the formula in Eq. (3). In Eq. (3), N represents the number of observations. If the test statistic is greater than the critical value, Hypothesis 0 is rejected in favour of Hypothesis 1 at significance level α. Otherwise, Hypothesis 0 is retained.
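The three steps above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the study: the function names are hypothetical, the exponential CDF is only a stand-in for a candidate distribution, and the critical value uses the common large-sample approximation c(α)/sqrt(N) with c = 1.358 for α = 0.05, which is an assumption and may differ from the formula in Eq. (3).

```python
import math


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of the samples and a candidate CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def ks_critical_value(n, c_alpha=1.358):
    """Asymptotic critical value; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha / math.sqrt(n)


def exponential_cdf(rate):
    """Return the CDF of an exponential distribution with the given rate."""
    def cdf(x):
        return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0
    return cdf


# Hypothetical inter-arrival times placed at the mid-quantiles of Exp(1).
n = 100
samples = [-math.log(1.0 - (i - 0.5) / n) for i in range(1, n + 1)]
d = ks_statistic(samples, exponential_cdf(1.0))
reject_h0 = d > ks_critical_value(n)
print(d, ks_critical_value(n), reject_h0)
```

Because the toy samples sit exactly on the quantiles of the candidate distribution, the statistic is small and Hypothesis 0 is retained. With measured inter-arrival times, the candidate CDF would be the fitted distribution under test.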

Chi-Squared Test
The Chi-Squared test is also often used to test whether two qualitative criteria are independent. The steps are almost identical to those of the Kolmogorov-Smirnov (K-S) test. Hypothesis 0 states that the two criteria are independent and Hypothesis 1 states that there is a relationship between them. The only difference from the K-S test is that the test statistic is calculated as shown in Eq. (4). However, for the Chi-Squared test to be performed, the expected frequencies must be greater than 5 [21]. This is a significant disadvantage compared to the Kolmogorov-Smirnov test.
χ² = Σ (O - E)² / E    (4)
In Eq. (4), O represents the observed frequency, E represents the expected frequency and χ² represents the chi-square value.
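As a worked illustration of the chi-square statistic, the sketch below sums (O - E)² / E over histogram bins and refuses to run when an expected frequency is too small. The function name, the bin counts and the treatment of the boundary case (expected frequencies of exactly 5 are accepted) are assumptions made for the example.

```python
def chi_squared_statistic(observed, expected, min_expected=5.0):
    """Sum of (O - E)^2 / E over bins, with a check on small expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e < min_expected for e in expected):
        raise ValueError("expected frequencies are too small for this test")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical histogram: 100 packets in 5 bins, 20 expected per bin.
observed = [18, 22, 20, 25, 15]
expected = [20.0, 20.0, 20.0, 20.0, 20.0]
chi2 = chi_squared_statistic(observed, expected)
print(chi2)  # (4 + 4 + 0 + 25 + 25) / 20 = 2.9
```

In practice, bins with small expected counts are usually merged with their neighbours before the statistic is computed, which is one way to satisfy the expected-frequency requirement when data are sparse.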

PACKET INTER ARRIVAL TIME STATISTICAL ANALYSIS
In Tab. 7, the distributions that fit the packet transmission times in each time interval are given for both TCP and UDP packets. The table shows that the Pareto 2 distribution is the most prominent fit for modelling the packet transmission time for both protocols. Other notable results in the table are the fit of the Log Logistic distribution for the TCP protocol and of the Weibull distribution for UDP packets during the low-traffic time intervals 4 and 8. The parameters of the distributions listed in the table and the goodness of fit values according to the Kolmogorov-Smirnov and Chi-Squared tests are listed in detail. In the Parameter 1 column of Tab. 7, the symbols α and k represent the shape parameter and the symbol σ represents the standard deviation of the Lognormal distribution; in the Parameter 2 column, the symbols β and σ represent the scale parameter; and the symbol µ, which is the third parameter, represents the location parameter of the Generalized Extreme Value distribution and the mean of the Lognormal distribution. The distributions obtained for each traffic interval by Garsva et al. [13] are compared with the distributions obtained in our study in Tab. 7. Garsva et al. concluded that the Pareto 2 and Weibull distributions fit in general; likewise, the Pareto 2 distribution fits various intervals in our study. When the studies are compared interval by interval, the same distribution fits intervals 1 and 3 for the TCP protocol and intervals 2 and 6 for the UDP protocol. Unlike Garsva et al., our study also presents the pdf (probability density function) of the fitted distributions in Fig. 4. In the graphs in Fig. 4, the y-axis represents the pdf and the x-axis represents the transmission time of the packets in seconds.
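For readers who want to plot or evaluate the Pareto 2 fit, a minimal sketch of its density and distribution function is given below. It assumes the common Pareto 2 (Lomax) parameterisation with shape α and scale β and support starting at zero; the function names and the parameter values in the example are hypothetical and are not taken from Tab. 7.

```python
def pareto2_pdf(x, alpha, beta):
    """Pareto 2 (Lomax) density: (alpha/beta) * (1 + x/beta)^-(alpha+1), x >= 0."""
    if x < 0:
        return 0.0
    return (alpha / beta) * (1.0 + x / beta) ** (-(alpha + 1.0))


def pareto2_cdf(x, alpha, beta):
    """Pareto 2 (Lomax) distribution function: 1 - (1 + x/beta)^-alpha, x >= 0."""
    if x < 0:
        return 0.0
    return 1.0 - (1.0 + x / beta) ** (-alpha)


# Hypothetical parameters: shape 2.0, scale 1.0 (seconds).
for t in (0.0, 0.5, 1.0, 5.0):
    print(t, pareto2_pdf(t, 2.0, 1.0), pareto2_cdf(t, 2.0, 1.0))
```

The heavy tail is visible in the slow, polynomial decay of the density, which is why this family suits intervals where occasional long gaps between packets occur.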
Fig. 4 shows that the number of packets sent for the TCP protocol is higher during time intervals 2-6 and 3-7, and that the transmission time between packets is shorter in these intervals. These time intervals correspond to working hours with intensive traffic. The frequency of packet transmission decreases as traffic decreases. As can be seen in Fig. 4a, the frequency of packet transmission decreases notably in the distribution graphs of time intervals 1-5 and 4-8. Fig. 4b presents the packet transmission time distributions for UDP traffic. The time interval with the lowest UDP traffic is observed to be 1-5. However, no significant difference in the packet transmission frequency is observed in the other time intervals.
Figure 5 Pdf of packet inter-arrival times according to distributions
In the literature, the distributions commonly used for modelling computer networks are listed [13] as Weibull, Pareto, Gamma, Exponential and Lognormal. Taking these distributions into consideration, the distributions that best represent the modelled general traffic (incoming and outgoing) are listed in Tab. 8. Since the Generalized Extreme Value and Generalized Pareto distributions model the data better than the Exponential and Gamma distributions, the results for the Exponential and Gamma distributions are not included in the table. Unlike the results in Tab. 7, in terms of general traffic, the distribution that best modelled the TCP protocol was the lognormal distribution, and the distribution that best modelled the UDP protocol was the Generalized Pareto distribution. In the study conducted by Garsva et al. [13], by contrast, the Pareto 2 distribution was the best-fitting distribution for modelling an academic network for both the TCP and UDP protocols. The graphs for the distributions are presented in detail in Fig. 5.
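Once a distribution has been selected, a simulator can draw packet inter-arrival times from it. The sketch below shows one way to do this for the two distributions named for general traffic: the lognormal via the standard library and the Generalized Pareto via inverse-transform sampling. All function names and parameter values are hypothetical placeholders, not the fitted values from Tab. 8.

```python
import math
import random


def gpd_quantile(u, k, sigma, mu=0.0):
    """Inverse CDF of the Generalized Pareto (shape k, scale sigma, location mu)."""
    if k == 0.0:
        return mu - sigma * math.log(1.0 - u)
    return mu + sigma * ((1.0 - u) ** (-k) - 1.0) / k


def sample_generalized_pareto(rng, k, sigma, mu=0.0):
    """Draw one value by passing a uniform number through the inverse CDF."""
    return gpd_quantile(rng.random(), k, sigma, mu)


def sample_lognormal(rng, mu, sigma):
    """Draw one lognormal value; mu and sigma describe the underlying normal."""
    return rng.lognormvariate(mu, sigma)


def arrival_times(sampler, count):
    """Accumulate inter-arrival gaps into absolute packet arrival times."""
    t, times = 0.0, []
    for _ in range(count):
        t += sampler()
        times.append(t)
    return times


rng = random.Random(42)  # fixed seed so a simulation run is repeatable
tcp_times = arrival_times(lambda: sample_lognormal(rng, -5.0, 1.0), 1000)
udp_times = arrival_times(lambda: sample_generalized_pareto(rng, 0.3, 0.01), 1000)
print(tcp_times[-1], udp_times[-1])
```

The resulting lists of arrival times can then be fed to a traffic generator as background TCP and UDP traffic, with a separate parameter set per time interval if the periodic behaviour of the network is to be reproduced.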
In our study, the statistical modelling of an academic network is performed in comparison with the study conducted by Garsva et al. [13]. The results can be used to model any campus network that has the same physical conditions and working hours as the network modelled in this study. In addition, our results are compared with those of Bhattacharjee and Nandi [9], who identified the distributions that fit academic network data collected between 16:15 and 17:30 and concluded that the Log-Normal distribution models the data better than the Pareto distribution in that time interval. The data collected in our study are the daily data collected from 15:00 to 00:00 on 25.04.2018 and from 00:00 to 15:00 on 26.04.2018. In order to compare with the study conducted by Bhattacharjee and Nandi [9], the portion of the data in the 16:15-17:30 interval is selected and the fitting distributions for this portion are obtained. The results are shown in detail in Tab. 9.
Bhattacharjee and Nandi stated that lognormal distribution in lower tail regions showed better compatibility than Pareto distribution [9]. However, according to the table obtained in our study, Generalized Pareto distribution (0.1481) produced more consistent results than lognormal distribution (0.22836).

CONCLUSION
This study was conducted with the aim of modelling, with simulators, a campus network that shows periodic behaviour, using real-time data collected from the Siirt University campus network. The collected data are analysed statistically on the basis of both incoming/outgoing traffic and ports. The results are presented in comparison with the results of the prior study conducted by Garsva et al. [13] to analyse a campus network. The results of the comparison with the study conducted by Bhattacharjee and Nandi [9] on academic networks are also presented.
When the results are analysed on the basis of general traffic, the distribution that best models the transmission times of TCP packets in the network is the Lognormal distribution, and the distribution that best models the transmission times of UDP packets is the Generalized Pareto distribution. In terms of daily periods, the Pareto 2, Weibull, Logistic, Lognormal and Generalized Extreme Value distributions are found to be the fitting distributions. The results are given in detail along with the network architecture to enable modelling with simulators.