Bigdata


How to Process Large Files … ?

Large is a relative term: 700 GB is large for me, while it could be a small piece for others.

Assuming you need to count the lines … even this simple task can take minutes!

Size

[user@host /tmp]$ du -sh bigfile
745G bigfile

Wordcount -> 10 min

If you need to count the lines, use the word count command wc -l and you get the exact number … but you have to wait for minutes, depending on your disk subsystem and, of course, the file size.

[user@host /tmp]$ time wc -l bigfile

1265723263 bigfile

real  10m42.255s
user  1m1.684s
sys   5m6.303s

Estimate Lines Script (Linux)

If you can live with an estimate, just try this script. It samples 100 lines from the beginning of the file, measures their size, and extrapolates that to the size of the whole file.

cat << 'EOF' > linestimate.sh
#!/usr/bin/env bash
# sample 100 lines from the top of the file (skipping line 1)
head -101 "$1" | tail -100 > "$1_fewlines"
# apparent size in bytes of the whole file and of the 100-line sample
filesize=$(du -b "$1" | cut -f 1)
linesize=$(du -b "$1_fewlines" | cut -f 1)
rm "$1_fewlines"
# extrapolate: number of 100-line samples that fit into the file, times 100
echo $(expr $filesize / $linesize \* 100) "$1"
EOF

chmod 755 linestimate.sh
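
To make the arithmetic behind the estimate explicit, here is the same formula with made-up numbers (the byte counts below are purely hypothetical, just to illustrate filesize / linesize * 100):

# hypothetical values: ~800 GB file, 100 sampled lines of ~633 bytes each
filesize=800000000000
linesize=63300
expr $filesize / $linesize \* 100
# -> 1263823000  (dividing first rounds the estimate to a multiple of 100)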

Estimate Lines Script (OpenBSD)

OpenBSD needs gdu from the coreutils package; the onboard du command is not able to report sizes in bytes :(

doas pkg_add coreutils

cat << 'EOF' > linestimate.sh
#!/usr/bin/env bash
# sample 100 lines from the top of the file (skipping line 1)
head -101 "$1" | tail -100 > "$1_fewlines"
# apparent size in bytes of the whole file and of the 100-line sample (gdu from coreutils)
filesize=$(gdu -b "$1" | cut -f 1)
linesize=$(gdu -b "$1_fewlines" | cut -f 1)
rm "$1_fewlines"
# extrapolate: number of 100-line samples that fit into the file, times 100
echo $(expr $filesize / $linesize \* 100) "$1"
EOF

chmod 755 linestimate.sh
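
One more assumption: bash itself is not part of the OpenBSD base system either, so install it as well (or switch the shebang to #!/bin/sh, since the script only uses POSIX features):

doas pkg_add bash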

Run Script -> 10 ms

[user@host /tmp]$ time ./linestimate.sh bigfile
1263427700 bigfile

real  0m0.011s
user  0m0.005s
sys   0m0.010s

Deviation: 0.2 %

Depending on the type of data, you will get a fairly accurate estimate, with a deviation of only a few parts per thousand.

wc -l:  1265723263
script: 1263427700

Diff:     +2295563
->      2295563 / 1265723263 = 0.001813637362214 -> 0.18 Percent :)
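
If you want to redo that calculation on the shell, a one-liner with bc (assuming it is installed) does the job:

echo "scale=15; (1265723263 - 1263427700) / 1265723263 * 100" | bc
# -> roughly .1813, i.e. 0.18 percent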

Any Comments ?
