There are many reasons you might want to pick some lines from a file at random. How?

Here are a few ways to choose from; I also list the cost of each so you have a reference.

Test setup: an SL6 server with 4 CPUs and 12 GB of memory. I chose a file with nearly half a million lines and loaded it into memory first to eliminate I/O effects.

In the examples below, the goal is to extract 10k lines from that file, timing the cost of each method.
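The original file isn't included with the article, so as an assumption, a synthetic stand-in of the same size (half a million lines; the contents don't matter for these tests) can be generated like this:

```shell
# Build a ~500k-line stand-in for the test file used in the examples below.
seq 1 500000 > testfile
wc -l testfile
```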

1. shell, random function

The basic idea is to generate a random number less than the total number of lines in the file, then extract that line from the file.

Here is the script

$ cat 

RANGE=`wc -l ./testfile | cut -d' ' -f1`

for i in `seq 1 10000`
do
  number=$RANDOM
  let "number %= $RANGE"
  awk -v line=$number 'FNR == line { print $1 }' < testfile
done


It works, but it is painfully slow: every iteration rescans the entire file with awk, so the loop makes 10,000 full passes over the data.

Time cost

$ time ./ >/dev/null

real    15m31.180s
user    14m5.030s
sys    1m18.567s
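One more caveat with this approach: bash's $RANDOM only produces 15-bit values (0 to 32767), so `number %= $RANGE` with a half-million-line file can only ever select from the first 32768 lines. A common workaround (a sketch, not from the original script) is to combine two draws:

```shell
# $RANDOM yields 15 bits (0..32767); combining two draws gives a 30-bit
# value, so the modulo can reach every line of a ~500k-line file.
RANGE=500000
number=$(( (RANDOM * 32768 + RANDOM) % RANGE ))
echo "$number"
```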

2. sort

Use sort's random option (-R) and take the top 10,000 lines. Command and cost:

$ time sort -R testfile | head -n 10000 >/dev/null

real    0m21.537s
user    0m21.493s
sys    0m0.031s
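One quirk worth knowing about this method: GNU sort -R shuffles by sorting on a random hash of each line, so identical lines hash the same and come out adjacent rather than independently scattered. A tiny demonstration:

```shell
# GNU sort -R sorts by a random hash of each line, so duplicate lines
# always end up grouped together in the output.
printf 'a\nb\na\nb\na\n' | sort -R
```

Whatever order the two distinct lines land in, the three a's and two b's stay in contiguous runs.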

3. awk

Use awk's rand() function; 0.02 ≈ 10,000 / 500,000, so the expected sample size is about 10k:

$ time awk 'rand() <0.02' <testfile |wc
  10066   10066  355210

real    0m0.089s
user    0m0.092s
sys    0m0.005s
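Two things to note here. First, each line is kept independently with probability 0.02, so the sample size is only approximate (10,066 above, not exactly 10,000). Second, awk's rand() starts from the same seed on every run unless you call srand(), so the one-liner above returns the identical "random" sample each time. A seeded variant (a sketch; the seed source is just an example):

```shell
# Without srand(), awk's rand() produces the same sequence every run.
# Passing in a seed makes each invocation draw a different sample.
seq 1 500000 | awk -v seed="$RANDOM" 'BEGIN { srand(seed) } rand() < 0.02' | wc -l
```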

4. shuf

shuf is a Linux (GNU coreutils) command that generates random permutations: by default it writes a random permutation of its input lines to standard output.

The example below uses the -n option to randomly pick 10k lines.

$ time shuf -n 10000 lto4_pnfsid.lst |wc
  10000   10000  353776

real    0m0.043s
user    0m0.022s
sys    0m0.028s
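Unlike the awk method, shuf samples without replacement, so -n 10000 returns exactly 10,000 distinct input lines (note the exact count above versus awk's 10,066). A quick check of that property, using a generated stand-in file rather than the article's original:

```shell
# shuf -n samples without replacement: exactly 10000 distinct lines.
seq 1 500000 > sample_input    # stand-in for the article's test file
shuf -n 10000 sample_input | sort -un | wc -l
```

GNU shuf also accepts --random-source if you need a reproducible draw.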


As for me, I choose shuf; you can decide which one is best for you.

