Using Statistical Analysis for Improving Information Security

Oct 23, 2022

Introduction

To paraphrase the famous claim by Carl Friedrich Gauss about mathematics [1], one might say that in today’s world information security is the queen of modern technology and data is the queen of security. This is because information security is at the core of any modern technology. Of course, mathematics remains the queen of the sciences, and in this paper we will talk about two of these queens: security and mathematics.

One of the most serious problems of modern times is the ransomware attack. According to predictions, ransomware will remain the number one threat today and probably for many years to come [2]. Ransomware is literally a deadly weapon: several years ago it contributed to the first reported death related to a cyber-attack. Ransomware locked a hospital in Germany out of more than 30 of its servers, leaving it unable to treat patients. As a result, a woman in need of urgent care was transferred to another nearby hospital but did not survive [3].

According to PurpleSec, a survey of 582 information security professionals found that 50% of them do not believe their organization is prepared to repel a ransomware attack. In addition, 75% of companies infected with ransomware were running up-to-date endpoint protection, which did not prevent the infection [4].

As a reminder, ransomware is a form of malware (malicious software) that encrypts files on a computer, making them unusable unless the victim pays a ransom in exchange for decryption.

After they infect computers, cyber criminals often exfiltrate all sensitive data, such as client lists, personal information, images, and so on. Besides that, ransomware actors often threaten to sell or leak this data or authentication information if the ransom is not paid.

With the exception of a limited number of variants (e.g., Vipassana), the majority of ransomware “calls home” to report a successful infection. It may also use worm-like capabilities to propagate itself across infected networks (e.g., Ryuk) and attack other “innocent” computers [5].

Another real threat is cryptojacking, the unauthorised use of your employees’ computers to mine cryptocurrency (e.g., Coinhive). Cryptocurrency mining malware is typically stealthy but drains computer resources [6]. Compromised computers are often used to attack other networks (e.g., as part of a botnet). Such attacks, when successful, not only drain computing power but may also dramatically increase network traffic.

The human factor is, and in our opinion will remain, the most important and the most challenging one to control. Besides running regular security awareness campaigns, practising proactive security defence requires detecting an attack or anomalous event as early as possible. This means quickly processing incoming data and comparing it with historical data to locate any deviation.

Besides that, one must be honest: antivirus applications alone cannot protect a network from malware, no matter how sophisticated or advanced the antivirus algorithm is.

To be able to respond rapidly to a potential cybersecurity incident, we decided to investigate volatility, a well-known concept in statistical analysis.

Our assumption was that network traffic is usually fairly predictable. When people come to work, go for lunch, and leave the office at the end of the day, the network traffic reflects this schedule. Because the volume of network traffic is usually huge, several hours of logging may easily produce several gigabytes of data. Therefore, analysing such raw data directly may be infeasible and is certainly inefficient. In practice, not only does proper statistical sampling play a major role in the analysis, but different sampling techniques may also affect the final result.

We understand that, to be admissible for analysis in our tests, data must meet several basic criteria. For example, it must be of adequate sample size (to be usable in statistical models), stable, reliable (to eliminate false alarms), detectable (so that further samples can be collected), and have discriminating power (to distinguish between the two or more groups being assessed).

Normal Distribution

A normal distribution is a probability distribution used to model the results of an event that has a default behaviour and possible deviations from that behaviour.

We decided to use Bollinger Bands in our network traffic analysis. In the financial world, Bollinger Bands (BB) are used for the technical analysis of financial assets. They characterize prices and volatility over time using a mathematical formula proposed by John Bollinger in the 1980s. Among other things, Bollinger Bands are used for pattern recognition.

To understand how BB work, let’s first talk about the concept of the normal distribution (also known as the Gaussian distribution), which is the quintessence of statistics and the premise of statistical theory. We see its bell-shaped curve in statistical analysis reports (Fig. 1).

Fig 1. Standard normal distribution

For example, an experienced darts player puts almost all of his or her darts inside the Triple Ring. Occasionally, however, a dart lands in the Double Ring (Fig 2).

Fig 2. Dartboard. By Tijmen Stam — https://commons.wikimedia.org/wiki/File:Dartboard.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=31323956

The average distance of the darts from the center (the Inner or Outer Bullseye) is known as accuracy (the mean in statistics), while the variation in the distances is precision (the standard deviation). Therefore, the darts player’s proficiency can be measured with two values: a mean and a standard deviation.

In a normal distribution there is about a 68% probability that a dart will land within one standard deviation of the player’s average accuracy (between -1-sigma and 1-sigma). There is about a 95% probability that a dart will land within two standard deviations (between -2-sigma and 2-sigma) of the player’s average accuracy. Also, there is about a 99.7% probability that a dart will land within three standard deviations, and so on, increasing towards 100% (Fig 3). This is known as the 68–95–99.7 (empirical) rule, or the 3-sigma rule [7].
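The percentages in the 68–95–99.7 rule can be checked numerically. The short Python sketch below is only an illustration: it relies on the standard identity that, for a normal distribution, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)). The function name prob_within is our own label, not something from the cited sources.

```python
import math


def prob_within(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Print the probabilities for 1, 2 and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

Running it prints roughly 0.6827, 0.9545 and 0.9973, which are the rounded 68%, 95% and 99.7% figures quoted above.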

Fig 3. Normal Distribution. By M. W. Toews — Own work, based (in concept) on figure by Jeremy Kemp, on 2005–02–09, CC BY 2.5, https://commons.wikimedia.org/w/index.php?curid=1903871

One more thing to understand is the moving average (“moving mean”). A moving average captures the average change in a data series over time. Bollinger Bands are volatility bands that capture the standard deviation, or change in volatility, over time. BB are displayed above and below a moving average. When volatility increases the bands widen, and when volatility decreases they contract (Fig 4).

Fig 4. Bollinger Bands example: the orange line is the middle band (mean), the green line is the upper band (2-sigma), and the blue line is the lower band (-2-sigma)

How to Calculate Bollinger Bands

As we mentioned above, the bands serve as a volatility indicator: they automatically widen when volatility increases and contract when it decreases. The formula for the middle band (a Simple Moving Average, or SMA) is:

$$J = \frac{x_1 + x_2 + \dots + x_n}{n}$$

where J is the sum of the values x_1, ..., x_n in a range of data divided by the number of time periods n in that range. The upper band is calculated using the following formula:

$$\text{Upper} = J + 2D$$

The lower Bollinger Band is calculated using the following formula:

$$\text{Lower} = J - 2D$$

where D is the standard deviation of the values in the same range.

The upper and lower bands are thus shifted up and down, respectively, from the middle band by two standard deviations (the 2-sigma bands used throughout this article).
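To make the band calculation concrete, here is a minimal sketch using only the Python standard library. It is an illustration under our own assumptions, not the authors’ implementation: the function name bollinger_bands, the window of 3 samples and the port counts are made up, and the multiplier of 2 follows the 2-sigma bands used in this article. The sample standard deviation (statistics.stdev) is used because that is also what the pandas rolling std() computes by default.

```python
from statistics import fmean, stdev


def bollinger_bands(values, period=3, num_std=2.0):
    """Return (middle, upper, lower) for every full window of `period` samples."""
    bands = []
    for end in range(period, len(values) + 1):
        window = values[end - period:end]
        middle = fmean(window)            # simple moving average of the window
        spread = num_std * stdev(window)  # num_std sample standard deviations
        bands.append((middle, middle + spread, middle - spread))
    return bands


ports = [10, 12, 11, 13, 12]  # made-up open-port counts
for middle, upper, lower in bollinger_bands(ports):
    print(f"middle={middle:.1f} upper={upper:.1f} lower={lower:.1f}")
```

For the first window (10, 12, 11) the mean is 11 and the sample standard deviation is 1, so the bands sit at 13 and 9.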

Test

Taking all this into consideration, we used the netstat command to count the number of open network ports at predefined time slots and collect model samples (run man netstat for more information).

The working hypothesis of our study was that unusual network activity (e.g., a ransomware infection, crypto mining, or a botnet attack) suddenly increases the number of open network ports. Using free and native network monitoring tools, we wanted to capture this sudden increase in open network ports and separate it from the regular network traffic “noise”.

As mentioned above, to monitor all open network ports we used the netstat command-line tool, which is available on both UNIX and Windows platforms.
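The article does not show how the netstat output was turned into samples, so the following Python sketch is only one possible approach under stated assumptions. It assumes that “open ports” can be approximated by counting the TCP and UDP lines printed by netstat -an, and that samples are appended to a CSV file that already has a Time,Ports header, matching the column names the analysis script expects. The helper names count_open_ports, sample_netstat and append_sample are hypothetical.

```python
import csv
import subprocess
from datetime import datetime


def count_open_ports(netstat_output: str) -> int:
    """Count lines describing a TCP or UDP socket
    (an assumed, simplified definition of an 'open port')."""
    count = 0
    for line in netstat_output.splitlines():
        # Unix prints 'tcp'/'udp', Windows prints 'TCP'/'UDP' with leading spaces.
        if line.strip().lower().startswith(("tcp", "udp")):
            count += 1
    return count


def sample_netstat() -> int:
    """Run `netstat -an` once and return the socket count."""
    result = subprocess.run(
        ["netstat", "-an"], capture_output=True, text=True, check=True
    )
    return count_open_ports(result.stdout)


def append_sample(path: str, ports: int) -> None:
    """Append one timestamped row; the file is assumed to
    already contain a 'Time,Ports' header line."""
    with open(path, "a", newline="") as handle:
        timestamp = datetime.now().isoformat(timespec="seconds")
        csv.writer(handle).writerow([timestamp, ports])
```

Calling append_sample("netstat.csv", sample_netstat()) once per time slot, for example from a scheduler, would build up a file of the shape the script reads. Counting every TCP and UDP line is a deliberate simplification; filtering by state (for instance, only LISTEN or ESTABLISHED) may suit other environments better.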

In this way we collected data on the number of open network ports for a few hours. Periodically, we launched various malware applications (ransomware, botnet, and cryptojacking). Peaks in the number of open network ports after launching the malware applications are clearly visible on the graph in Fig. 5 (circled in red).

To analyze the number of open ports and create a Bollinger Band graph, we used a simple Python script (inspired by [8]).

#!/usr/bin/python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

# Read the data, using the 'Time' column as a datetime index
df = pd.read_csv('netstat.csv', parse_dates=['Time'], index_col=['Time'])

# Sort the index
df.sort_index(inplace=True)

# Show the data
# print(df)

# Get the time period (10 min)
period = 10

# Calculate the Simple Moving Average
df['SMA'] = df['Ports'].rolling(window=period).mean()

# Get the standard deviation
df['STD'] = df['Ports'].rolling(window=period).std()

# Calculate the Upper Bollinger Band
df['Upper'] = df['SMA'] + (df['STD'] * 2)

# Calculate the Lower Bollinger Band
df['Lower'] = df['SMA'] - (df['STD'] * 2)

# Create a list of columns to keep
column_list = ['Ports', 'SMA', 'Upper', 'Lower']

# Plot the data
df[column_list].plot(figsize=(22.2, 8.4))
plt.title('Bollinger Band for Open Ports')
plt.ylabel('Number of Open Ports')
plt.show()

Fig 5. Bollinger Bands graph of our experiment, where SMA is the Simple Moving Average, and Upper and Lower are the upper and lower Bollinger Bands, respectively.

In the graph above, the number of open ports is shown in blue. The SMA (Simple Moving Average, in red) is the sum of the open-port counts in a range divided by the number of time periods in that range (in our case, 10 minutes). The upper band (2-sigma) is in yellow and the lower band (-2-sigma) is in green.

Every time the blue line (number of open ports) crossed above the upper Bollinger Band, it coincided with a moment when we launched our malware samples. This agrees very well with our initial assumption that malware activity can be tracked using statistical analysis of data, and the Bollinger Bands indicator in particular. We repeated this test several times, and each time we were able to accurately identify the malware launch event.

This allowed us to identify the exact time when malware was executed.
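The crossing of the upper band can also be detected programmatically rather than by eye. The sketch below is a minimal illustration under our own assumptions, not the procedure used in the experiment: the function name find_breaches and the sample series are made up, and each window includes the current sample, mirroring how the rolling calculation in the script above works.

```python
from statistics import fmean, stdev


def find_breaches(values, period=10, num_std=2.0):
    """Return indices of samples that rise above the upper band of their own window."""
    breaches = []
    for i in range(period - 1, len(values)):
        # The last `period` samples, current one included.
        window = values[i - period + 1:i + 1]
        upper = fmean(window) + num_std * stdev(window)
        if values[i] > upper:
            breaches.append(i)
    return breaches


# Made-up series: a steady 20 open ports with one burst to 60.
counts = [20] * 12 + [60] + [20] * 5
for index in find_breaches(counts):
    print(f"ALERT: sample {index} has {counts[index]} open ports")
```

Because the current sample is part of its own window, a spike also inflates the standard deviation it is compared against. A single large burst still breaks through a 2-sigma band with a window of 10, but a design that compares each sample against the bands of the preceding window would be more sensitive.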

Conclusion

Malware evasion techniques allow cybercriminals to install and operate malicious applications. These malware applications sometimes go undetected by companies for dozens of years [9].

Obviously, the technique described above must be researched further before it can become a practical component of modern intrusion detection systems.

However, by monitoring these events one may set network security alerts that draw security personnel’s attention to unusual network activity.