An attempt at a file offset-based `wc` implementation that can use multiple cores to read the same file. Outline:
- Get number of CPUs on the machine
- Create that many goroutines, each of which reads its own chunk of the file.
- Compute one offset per goroutine from that number, so that each chunk is read starting from its own offset.
Mostly works on *nix machines
Warning: Code in here is crap, don't read it.
I haven't created releases or per-OS packages, so the only way to try this out is via `go get`, which means you need to have a working Go installation.
go get github.com/kgrz/kwc
That should compile and install the binary into your `$GOPATH`. Then run the binary as `kwc`. If it's not there, then `cd` into `$GOPATH/src/github.com/kgrz/kwc` and run `go install`.
- I'm finding it non-trivial to do UTF-8-aware reading: if a chunk boundary cuts a multi-byte character in the middle, that character shouldn't be counted twice! If we use `utf8.RuneCount()` on a slice that contains a partial multi-byte character, the count can end up being wrong.
  Update: I think I have a solution for this! Will implement it soon.
It's fast™
- The `os.File.ReadAt` Go function internally uses the `pread` syscall, which works well with multi-threaded access of the same file: http://man7.org/linux/man-pages/man2/pread.2.html
- The initial implementation used a naive `isspace` function I wrote that only catered to spaces and tabs (ASCII 32 and 9). But as per the man pages of `wc` and the `isspace` function it uses, a "space" for the purposes of `wc` covers both whitespace characters and newlines or their equivalents:
  - ASCII space (32)
  - ASCII tab (9) \t
  - new line (10) \n
  - vertical tab (11) \v
  - form feed (12) \f
  - carriage return (13) \r
  - non-breaking space (0xA0)
  - next line character (0x85)
- Avoiding `bufio.Scanner` is maybe something you'd want to consider if you're looking for speed. Its `Scan()` method does a lot of extra things, like basic consistent error handling, and it's very useful if you want to store the scanned bytes as lines/words on every iteration. We don't need to do that when just counting characters or words, so we avoid using it. The perf impact is considerable.
  To do a basic test of this hypothesis, try running the program on `cat`-ed output, which uses the scanner codepath, and compare it with `wc`.