Random Sampling using "R"

Dennis Steenbergen

Assessors wade through seas of complicated, interrelated devices and diagrams to make sense of how data flows through them. A sample list can contain a population of hundreds of items, so which ones should we sample? Most companies can justify sampling 20% of any given population. But which 20%? As humans, we know we have selection biases: when a person picks the items, the choice is not truly random. For assessors, that means the sample may not be representative of the population. That is neither objective nor scientific!

We can eliminate selection bias by using a reliable randomization tool: “R”. Why “R”? Because it is the leading open-source environment for statistical computing and graphics, widely used by statisticians and data miners for data analysis. “R”’s sampling is unbiased because its “sample” function draws from a random permutation of the dataset, whatever the population size, so every item has the same chance of being selected. Instructions for downloading and installing the free software can be found here: https://www.r-project.org/

Let’s walk through a device selection example to learn how. We can use this rule of thumb where the number in our population determines the percentage sampled:

Population	% Sampled
1-3	100%
4-20	50%
21-50	10%
>= 51	5%

In other words, the size of each population determines how much of it we sample, and only populations sampled at less than 100% need to be numbered in a spreadsheet for a random draw. If we have 3 staff members, we interview all of them. If we have 100 routers, we sample 5% of them (5 routers). If we have 20 Windows Servers, we sample 50% of them (10 servers), and so on.
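The rule of thumb can be written as a small “R” function. This is only a sketch: the name “sample_size” is made up for illustration, and rounding fractional results up to the next whole item is an assumption, since the table does not say how to round.

```r
# Hypothetical helper: turn a population size into a sample size
# using the rule-of-thumb table (1-3: 100%, 4-20: 50%, 21-50: 10%, >= 51: 5%).
sample_size <- function(population) {
  stopifnot(population >= 1)
  if (population <= 3) {
    pct <- 100
  } else if (population <= 20) {
    pct <- 50
  } else if (population <= 50) {
    pct <- 10
  } else {
    pct <- 5
  }
  # Assumption: round up so that a fraction of an item becomes a whole item.
  ceiling(population * pct / 100)
}

sample_size(3)    # 3 staff members: interview all 3
sample_size(20)   # 20 Windows Servers: sample 10
sample_size(100)  # 100 routers: sample 5
```

Under the round-up assumption, a population of 21 (10% is 2.1) gives a sample of 3.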

Cut and paste your population (from your customer’s asset inventory) into a spreadsheet and number the items “1, 2, 3, 4, …, n”. The numbering tells you which population item each number in the result refers to. Let’s sample 50% of 20 Windows Servers, which makes our sample size 10.
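If the inventory is already available as a list of names, the numbering step could also be done in “R” itself. The following is a sketch only: the server names are made-up placeholders, and “inventory” and “sample_n” are names chosen for illustration.

```r
# Hypothetical asset names standing in for a customer's inventory export.
servers <- sprintf("WINSRV-%02d", 1:20)

# Number the items 1..n, just as you would in the spreadsheet.
inventory <- data.frame(id = seq_along(servers),
                        asset = servers,
                        stringsAsFactors = FALSE)

# 50% of 20 Windows Servers gives the sample size.
sample_n <- nrow(inventory) * 50 / 100

head(inventory, 3)  # show the first three numbered items
sample_n            # 10
```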

The example below is a script that you can cut and paste into an “R” command window; then just hit “Enter”. Change the population and sample size to fit your needs. As written, the dataset size is set to 20. We will sample without replacement, which means that once an element is drawn it stays out of the dataset for the next draw, so no item can be selected twice. This is what “replace=FALSE” means.

Below is the simple “R” syntax I used; the pound symbol “#” marks a comment line.

#
# Create a dataset of 20 numbers, 1 through 20, called “mydata”.
#
mydata <- c(1:20)
#
# Draw a random sample of 10 from the dataset (“mydata”) containing 20
# numbers, without replacement, using the “sample” function.
#
sample(mydata, 10, replace=FALSE)
#
# Below is the result (yours will differ, because each run draws a new random sample)
#
# [1] 12 19 10  5  9 13  4  2  8 18
#
# The leading “[1]” is not part of the sample; it is the index of the first value printed on that line
#

Thus, we would sample the 10 devices (50%) numbered in our spreadsheet as “2”, “4”, “5”, “8”, “9”, “10”, “12”, “13”, “18”, and “19”.
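Sorting the result makes it easier to match against the spreadsheet, and an assessor may also want to be able to reproduce a draw later. The sketch below shows one possible way to do both. Recording a seed with “set.seed” is a suggestion of this example rather than part of the procedure above, and the seed value is arbitrary.

```r
# The result shown in the example above, typed in by hand.
drawn <- c(12, 19, 10, 5, 9, 13, 4, 2, 8, 18)

# Sort the numbers for easy lookup in the spreadsheet.
sorted_ids <- sort(drawn)
sorted_ids  # 2 4 5 8 9 10 12 13 18 19

# Sampling without replacement never repeats an item.
anyDuplicated(drawn) == 0  # TRUE

# Assumption: fixing a seed (any number you record in your work papers)
# lets the same draw be repeated later.
set.seed(2024)
first <- sample(1:20, 10, replace = FALSE)
set.seed(2024)
second <- sample(1:20, 10, replace = FALSE)
identical(first, second)  # TRUE
```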

Now repeat the draw for each of the other populations where the percentage sampled is less than 100%, adjusting the population and sample size each time, and you’re on your way to unbiased, randomized sampling!
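Repeating the draw for several populations could look like the sketch below. The population names and counts are invented for illustration, “pct_for” is a hypothetical helper that encodes the rule-of-thumb table, and rounding up is again an assumption.

```r
# Hypothetical helper: look up the percentage for a population size
# from the rule-of-thumb table (breaks at 1, 4, 21 and 51).
pct_for <- function(n) {
  c(100, 50, 10, 5)[findInterval(n, c(1, 4, 21, 51))]
}

# Made-up populations from an imaginary asset inventory.
populations <- c(staff = 3, windows_servers = 20, firewalls = 40, routers = 100)

# For each population, draw the item numbers to sample, without replacement.
draws <- lapply(populations, function(n) {
  k <- ceiling(n * pct_for(n) / 100)  # assumption: round up
  sort(sample(seq_len(n), k, replace = FALSE))
})

lengths(draws)  # sample sizes: staff 3, windows_servers 10, firewalls 4, routers 5
draws$routers   # the router numbers to look up in the spreadsheet
```

Populations sampled at 100% simply come back with every item number, so they can stay in the same loop.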


