I'm troubleshooting a master-slave Redis 2.8.x setup on AWS/Linux that shows puzzling behavior: when the setup is under maximum load (CPU usage on both master and slave is ~100%), localhost connection attempts to the slave often time out, whereas connections to the master don't. The socket timeout is set to 1 second.
Both servers live on same-sized VMs. Memory usage is low and there's no swap activity. The master serves all of the read-write traffic. The only connections to the slave come from the Redis sentinels, the master (replication stream), and monitoring agents.
Quick profiles captured with `perf top` on both hosts show the same two top offenders:

```
64.63%  redis-server  [.] compareStringObjectsWithFlags
20.02%  redis-server  [.] listTypeNext
```
Which brings me to the final question: what difference between a Redis master and its slave causes this behavior? Does the slave, for example, apply replicated updates in batches that lock it up for extended periods, whereas the same updates would be interleaved with other work on the master?
Update: forgot to add that `strace` of the slave shows extended waits after big (16 kB) reads from the master.