AIO op latency measurement for squidng/butterfly
Methodology
Squid and the kernel have been modified to timestamp the aiocb (the
AIO control block) as it passes certain processing stages. Whenever
an aio_* operation finishes, squid writes the information collected in
its aiocb to a file. The unpack.pl script then converts the binary
statistics file to a format suitable for the graph utility (GNU
plotutils).
The samples are numerous, and latency jitter is rather high, so
samples are averaged before plotting. Every point plotted is an
average of 1000 successive samples.
What is plotted (the legend)
- ENQUEUED, solid red line (-----)
- Time passed from the moment squid does the aio_* syscall to the
moment the kernel puts the request onto internal queue.
- DEQUEUED, solid green line (-----)
- Time passed from the ENQUEUED timemark to the moment when a
slave KAIO thread takes the request from the queue.
- IO, solid blue line (-----)
- Time passed from the DEQUEUED timemark to the moment when the
I/O operation proper is complete; all that may remain is copying the
data from the kernel to user space. This timemark is set only for
BFREAD and BFWRITE requests.
- COPY, solid magenta line (-----)
- Time passed from the IO timemark to the moment when copying
data from the kernel to userspace is complete. This only appears in
BFREAD requests.
- COMPLETE, solid cyan line (-----)
- Time from the previous (COPY for BFREAD, IO for WRITE, DEQUEUED
for others) timemark to the moment when aio request processing in the
kernel is complete, and the op completion status is set up in the aiocb.
The time spent on sending the completion signal is not counted here
because the time statistics must be copied to userspace before the
completion signal is sent.
- FINISHED, dotted red line (.....)
- Time from the COMPLETE timemark to the moment when aio request
completion is noticed and serviced by squid.
Tests done
- boiler
- boiler with pre-filled cache. 4 cache disks, threads=8 (=32
threads total).
- polymix-3 at 250 RPS
- polymix-3 with shortened fill phase (only as long as is
necessary to develop the WSS). 4 cache disks, threads=8 (=32 threads
total).
Observations of the results
- Under full load, AIO requests spend the most considerable
time (up to 1 second during the periods of full load) in the in-kernel
queue, waiting to be served by a kernel thread. The actual I/O
operation typically completes in under 100ms. To me, this was a
surprise. We seem to be in need of more kernel threads, or some KAIO
change that will increase parallelism in AIO processing without
increasing the number of processes too much.
- The FINISHED time is not too bad---for the requests that take
priority in op completion processing (BFOPEN, BFREAD, BFCLOSE).
For low-priority requests (BFCREAT, BFWRITE) it is much worse, but
that doesn't matter.
- The close(2) call seems to be so short (about 10us) that we may
be better off doing it synchronously.
- There seems to be a bug in the butterfly code that makes it
lose some op completions. This causes the "ops in flight"
graph for polymix-3 not to go down to zero during the p-idle
phase as it ought to.
Another strange thing is the gaps in the BFCLOSE graph. Chances
are that this is related to the above.
Tests planned for the nearest future
- Increase the threads= and see if it helps to shift the bottleneck
from the aio request queue to the elevator.
- Change squid back to signal-driven AIO completion processing (but
with Nikita's vector sigtimedwait or Henrik's
no-schedule-in-sigtimedwait patch), and see if and how much this
improves the FINISHED latency.
sizif@botik.ru
Last modified: Tue Oct 3 17:07:12 MSD 2000