AIO op latency measurement for squidng/butterfly

Methodology

Squid and the kernel have been modified to timestamp the aiocb (the AIO control block) as it passes certain processing stages. Whenever an aio_* operation finishes, squid writes the information collected in its aiocb to a file. The unpack.pl script is then used to convert the binary statistics file to a format suitable for the graph utility (GNU plotutils).

The samples are numerous, and latency jitter is rather high, so samples are averaged before plotting. Every point plotted is an average of 1000 successive samples.
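The averaging step can be sketched as follows. This is only an illustration in Python, not the code actually used for the plots: the function name block_averages is hypothetical, and dropping a trailing partial block is an assumption, since the text above does not say how leftover samples are handled.

```python
from typing import Iterable, Iterator


def block_averages(samples: Iterable[float], block: int = 1000) -> Iterator[float]:
    """Yield the mean of each run of `block` successive samples."""
    total = 0.0
    count = 0
    for sample in samples:
        total += sample
        count += 1
        if count == block:
            yield total / block
            total = 0.0
            count = 0
    # Assumption: a trailing run shorter than `block` is discarded,
    # so every plotted point averages exactly `block` samples.
```

With block=1000, a run of 2,500 samples would yield two plotted points under this assumption.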

What is plotted (the legend)

ENQUEUED, solid red line (-----)
Time passed from the moment squid makes the aio_* syscall to the moment the kernel puts the request onto its internal queue.
DEQUEUED, solid green line (-----)
Time passed from the ENQUEUED timemark to the moment when a slave KAIO thread takes the request from the queue.
IO, solid blue line (-----)
Time passed from the DEQUEUED timemark to the moment when the I/O operation proper is complete; all that may remain is copying the data from the kernel to user space. This timemark is set only for BFREAD and BFWRITE requests.
COPY, solid magenta line (-----)
Time passed from the IO timemark to the moment when copying data from the kernel to userspace is complete. This only appears in BFREAD requests.
COMPLETE, solid cyan line (-----)
Time from the previous timemark (COPY for BFREAD, IO for BFWRITE, DEQUEUED for others) to the moment when aio request processing in the kernel is complete and the op completion status is set up in the aiocb. The time spent sending the completion signal is not counted here, because the time statistics must be copied to userspace before the completion signal is sent.
FINISHED, dotted red line (.....)
Time from the COMPLETE timemark to the moment when aio request completion is noticed and serviced by squid.
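Since every plotted value is the time elapsed since the previous timemark that was actually set, the per-stage latencies can be derived from raw timestamps roughly as below. This is a sketch under assumptions: the function stage_latencies, the dictionary layout, and the integer time units are hypothetical, and the real binary record format read by unpack.pl is not described here.

```python
from typing import Dict

# Order in which the timemarks are set, following the legend above.
STAGES = ["ENQUEUED", "DEQUEUED", "IO", "COPY", "COMPLETE", "FINISHED"]


def stage_latencies(start: int, marks: Dict[str, int]) -> Dict[str, int]:
    """Return each stage's latency: its timemark minus the previous set timemark.

    `start` is the moment squid makes the aio_* syscall. Timemarks missing
    from `marks` (e.g. COPY for a BFWRITE) are skipped, so the next stage
    is measured from the last timemark that was recorded.
    """
    latencies: Dict[str, int] = {}
    previous = start
    for name in STAGES:
        if name not in marks:
            continue  # this timemark is not set for this request type
        latencies[name] = marks[name] - previous
        previous = marks[name]
    return latencies


# Hypothetical BFWRITE request: no COPY timemark, so COMPLETE is measured from IO.
bfwrite = {"ENQUEUED": 5, "DEQUEUED": 40, "IO": 140, "COMPLETE": 143, "FINISHED": 200}
write_latencies = stage_latencies(0, bfwrite)
```

For a request type that sets neither IO nor COPY, the same function measures COMPLETE from DEQUEUED, matching the rule given in the COMPLETE entry.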

Tests done

boiler
boiler with pre-filled cache. 4 cache disks, threads=8 (=32 threads total).
polymix-3 at 250 RPS
polymix-3 with shortened fill phase (only as long as is necessary to develop the WSS). 4 cache disks, threads=8 (=32 threads total).

Observations of the results

Tests planned for the nearest future

  1. Increase threads= and see whether this shifts the bottleneck from the aio request queue to the elevator.
  2. Change squid back to signal-driven AIO completion processing (but with Nikita's vector sigtimedwait or Henrik's no-schedule-in-sigtimedwait patch), and see whether, and by how much, this improves the FINISHED latency.

sizif@botik.ru
Last modified: Tue Oct 3 17:07:12 MSD 2000