Random Sampling using "R"

Dennis Steenbergen

Random sampling helps ensure that results obtained from your sample approximate what would have been obtained if the entire population had been measured (Shadish et al., 2002). The simplest random sample gives every unit in the population an equal chance of being selected.

Sampling is an option for assessors conducting PCI DSS assessments to facilitate the assessment process when there are large numbers of items in a population being tested.

This is one method.

Assessors wade through seas of complicated, interrelated devices and diagrams to make sense of how data flows through them. Some sample lists can contain a population of hundreds, but which items should we sample? Most companies can justify sampling 20% of any given population. But which 20%? As humans, we know we have selection biases. When items are hand-picked, the selection is not truly random, and for assessors that means the sample may not be representative of the population. This is not objective and not science!

We can eliminate selection bias by using a reliable randomization software tool, “R”. Why “R”? Because it is a leading open-source environment for statistical computing and graphics, widely used by statisticians and data miners for data analysis. “R”’s sampling is unbiased because its “sample” function draws items at random from the dataset, giving each item an equal chance of selection by default, regardless of the population size. Information about how to download and install the free software can be found here: https://www.r-project.org/

Let’s walk through a device selection example to learn how. We can use this rule of thumb where the number in our population determines the percentage sampled:

Population    % Sampled
1-3           100%
4-20          50%
21-50         10%
>= 51         5%



This means that, for any population where the sampled percentage is less than 100%, the items are numbered in a spreadsheet and the sample size is set by the criteria above. If we have 3 staff members, we interview all of them. If we have 100 routers, we sample 5% of them. If we have 20 Windows Servers, we sample 50% of them, and so on.
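The rule of thumb above can be sketched as a small “R” function. This is only an illustration: the name “sample_size” is hypothetical, and rounding fractional results up is an assumption, since the table does not say how to round.

```r
# Map a population size to a sample size using the rule-of-thumb table.
# Assumption: fractional results are rounded up (ceiling).
sample_size <- function(n) {
  stopifnot(n >= 1)
  # Percent sampled for each population band in the table
  pct <- if (n <= 3) 100 else if (n <= 20) 50 else if (n <= 50) 10 else 5
  ceiling(n * pct / 100)
}

sample_size(3)    # 3 staff members: interview all 3
sample_size(20)   # 20 Windows Servers: sample 10
sample_size(100)  # 100 routers: sample 5
```

Rounding up is the more conservative choice, because a population such as 21 items then yields 3 samples instead of 2.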

Copy and paste your population (from your customer’s asset inventory) into a spreadsheet and number the items “1, 2, 3, 4,…(n)”. This will tell you which number is associated with which population item when we get our result. Let’s sample 50% of 20 Windows Servers. This means “10” is our sample size.

The example below is a script that you can copy and paste into an “R” console, pressing Enter to run it. Change the population and sample size to fit your needs. As written, the dataset size is set to “20”. We will sample without replacement, which means that once an element is drawn it is not returned to the dataset before the next draw. In other words, the element drawn stays out of the dataset for the next draw, so no item can be selected twice. This is what “replace=FALSE” means.

Below is the simple “R” syntax I used; the pound symbol “#” marks a comment line.

#
# Create a data set of the numbers 1 through 20 called “mydata”.
#
mydata <- 1:20
#
# Draw a random sample of 10 from the dataset (“mydata”) containing 20
# numbers without replacement using the “sample” function.
#
sample(mydata, 10, replace=FALSE)
#
# Below is the result (yours will differ, since each draw is random)
#
# [1] 12 19 10  5  9 13  4  2  8 18
#
# The leading “[1]” is not part of the sample; it is the index of the first
# value printed on that line of output.
#


Thus, we would sample 10 devices (50%): those numbered in our spreadsheet as “2”, “4”, “5”, “8”, “9”, “10”, “12”, “13”, “18”, and “19”.
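The lookup from drawn numbers back to inventory items can also be done in “R” instead of by hand. The sketch below assumes a made-up inventory of 20 server names (“WINSRV-01” and so on are hypothetical). The “set.seed” call is an optional addition that is not part of the script above; it makes the draw repeatable so the same sample can be reproduced later.

```r
# Hypothetical inventory of 20 Windows Servers; in practice these names
# would come from the customer's asset inventory.
servers <- sprintf("WINSRV-%02d", 1:20)

# Optional: fixing the seed makes the draw repeatable. The value is arbitrary.
set.seed(2024)

# Draw 10 positions without replacement, then sort them for easier lookup.
picked <- sort(sample(seq_along(servers), 10, replace = FALSE))

# Look up the server names at the drawn positions.
servers[picked]
```

Sorting the drawn positions does not change which items were selected; it only makes them easier to match against the spreadsheet.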

Now repeat the draw, adjusting the population and sample size, for each of the other sets where the percent sampled is less than 100%, and you’re on your way to true randomized sampling!
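Repeating the draw for several populations might look like the sketch below. The population names and sizes are hypothetical, the helper “pct_for” is an illustrative name, and rounding fractional sample sizes up is an assumption rather than something the rule of thumb specifies.

```r
# Hypothetical population sizes taken from an asset inventory
populations <- c(staff = 3, windows_servers = 20, firewalls = 35, routers = 100)

# Percent sampled for a population of size n, per the rule-of-thumb table
pct_for <- function(n) {
  if (n <= 3) 100 else if (n <= 20) 50 else if (n <= 50) 10 else 5
}

# For each population, draw item numbers without replacement.
# Assumption: fractional sample sizes are rounded up.
draws <- lapply(populations, function(n) {
  k <- ceiling(n * pct_for(n) / 100)
  sort(sample(seq_len(n), k, replace = FALSE))
})

draws  # a named list: one vector of drawn item numbers per population
```

For a population sampled at 100%, the draw simply returns every item number, so it is harmless to leave those sets in the loop.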
