Diagnosing MySQL performance problems

Diagnosing and Fixing
MySQL Performance Problems
Percona, Inc.
http://www.percona.com/

1

Table Of Contents

0. Welcome 5. Your Toolchest

1. Defining Performance 6. CPUs and Tasks

2. The Stack at 10000 Feet 7. Utilization versus Backlog

3. Isolating Problems 8. External-to-the-database Problems

4. Approaching the Stack 9. Conclusion

2

Introduction
★ Welcome Everybody!
★ Your hosts for the next 3 1/2 hours are @jswanhart

and @RoomieGunns from @percona.

3

House Keeping
★ Restrooms? -> Leave if you need to.
★ Questions? -> Ask at any time.

★ Heckling? -> There will be a prize for the best
heckle.

★ I hope you enjoyed lunch. We‟ll be taking one
scheduled break in the middle of the tutorial..

4

Wifi / Power?!
★ Everyone connected, are the wifi instructions posted?
★ Be kind to your neighbour - share the power strips in

the room.

5

Where does this talk come from?
★ If you‟re curious – some of this presentation comes
from the slide decks we use at our training.
★ However, we‟ve prepared this material for this

conference.

6

Can we get a show of hands:
★ MySQL 5 / MySQL 5.1 / MySQL 5.5 / Anything else?
★ Linux / Windows / Solaris / BSD / Something else?

★ Oracle / SQL Server experience?

★ What are you interested in knowing -

✦ Problems you are currently having?
✦ Weird things that happened you could never explain?
✦ Something else?

7

Can you have this slide deck?
★ We‟ll give you the URL at the very end of the day.
★ Be careful to write it down!

✦ O‟Reilly doesn‟t normally upload tutorial slides the same
way they do for conference sessions.

8

Table Of Contents






9

What about Performance?
★ Performance optimization should start with a goal.
★ That goal can be to -

✦ Improve time for users?
✦ Reduce load?
✦ Grow the service?

10

What performance is not:
★ CPU Usage
★ Memory Usage

★ Load Average

11

Performance is best measured as:
★ The response (R) time for a user‟s task.
★ What is a task?

✦ a unit of work
✦ a business unit of work
✦ something a user wants to do
✦ some function of the application

12

Related Concepts

Load Utilization Scalability

Throughput Concurrency Capacity

13

Related Concepts (cont.)

Load:how much
work is incoming?
Utilization Scalability
or, how big is the
backlog?


14


Load:how much Utilization:
work is incoming? how much of a
Scalability
or, how big is the system's resources
backlog? are used?


15


Scalability:
what is the
relationship
between utilization
backlog? are used?
and R?


16


Scalability:
what is the
relationship
between utilization
backlog? are used?
and R?

Throughput:
X - how many tasks Concurrency Capacity
can be done per
unit of time?

17


Scalability:
what is the
relationship
between utilization
backlog? are used?
and R?

Throughput:
Concurrency:
X - how many tasks Capacity
how many tasks
can be done per
can we do at once?
unit of time?

18


Scalability:
what is the
relationship
between utilization
backlog? are used?
and R?

Capacity:
Throughput:
Concurrency: how big can X go
X - how many tasks
how many tasks without making
can be done per
can we do at once? other things
unit of time?
unacceptable?

19

R = Time / Task
X = Tasks / Time

Why Throughput != Performance
★ Big Niagara machines with many
cores, each of them very slow.
✦ Also not reciprocal of performance,
because....

Sun T2000. Photo Credit:
http://www.flickr.com/photos/clone
dmilkmen/310565533/

22

Throughput != Performance
(cont.)
★ Parallelism
✦ 10 tasks at once take 10 seconds.
✦ X = 10/10 = 1, R = 10
★ Non-uniform distribution
✦ 100 tasks take 100 seconds, what is R? Don't know, it
could be 99 fast and 1 slow task.

23

What is important...
★ Is the relationship between throughput, utilization,
response time and capacity.
★ Queuing may occur:

✦ R is the combination of service time and wait time.

24

What to take away:
★ Above all, focus on time.
✦ For performance optimization, reduce time spent waiting
for response.
✦ For load reduction, reduce total time consumed by the task.

25

Table Of Contents






26

What is the performance
problem?

Web Web

Web Web
15 second login? Database Server

Web Web

Web Web

27

If you said that...
★ The Database
✦ You‟ll be right most of the time - but you‟re not being 100%
honest with yourself.
✦ The database has more scalability challenges than the
other components. For the most part we can just add web
servers.

28

However;
★ We can lead ourselves into a real trap by guessing
based on previous experience.
★ Proving is probably a lot more important than knowing.

29

What‟s interesting...
★ What‟s more interesting than drawing the stack is
drawing the flow of information between each
component of the stack.
★ It‟s important to be able to do this while users execute

tasks.

30

Following the Flow:
★ For a given task, measure the breakdown in time:

31

Wait, what.!?
★ Updating the last_login_date takes a sizeable
amount of time?
★ For the value that it provides, why are we spending so

long on that sub-task?

32

My analysis:
★ Query is:
✦ UPDATE users SET last_login_date=NOW()
WHERE id = N;
★ Schema is:
✦ CREATE TABLE users (
id INT NOT NULL PRIMARY KEY auto_increment,
username CHAR(32) NOT NULL,
..
last_login_date DATETIME,
UNIQUE (username)
) ENGINE=MyISAM;

33

$ uptime
15:00 up 11 days, 16:58, 5 users, load averages: 0.88 0.61 0.44

35

$ uptime

mysql> show global status like 'slow_queries%';
+---------------+-------+
+---------------+-------+
| Slow_queries | 3 |
+---------------+-------+

mysql> show global status like 'table_lock%';
+-----------------------+-------+
+-----------------------+-------+
| Table_locks_immediate | 2267 |
| Table_locks_waited | 15640 |
+-----------------------+-------+

37

Why do I show you this example?
★ A lock is a form of queue.
✦ R = Queue + Service Time
★ Some queuing shows up as “Load”.
✦ CPUs have a run queue.
✦ IO requests are queued before execution.

38

However;
★ Not all queuing shows up in the Operating System -
such as internal locking contention.
Cause: Possible Effect:
More people request a System may look busy
resource than current (CPU or Disks).
capacity that can be offered.

★ It‟s always better to instrument for the cause - as it
may still be there without any „effect‟ present.

39

Back to the Sequence Diagram
★ You want to find what doesn’t look right. Then take
those quick wins.
★ This sounds trivial - but without instrumentation your

guesses are often not correct.

40

Another Sequence Diagram:

41

The Lesson
★ In this case, the external API looks like a great piece
to attack first.
★ How do we improve it?

✦ Maybe a developer can help here..
✦ Can we cache the credentials?

42

The Important Lesson (cont.)
★ We had a customer exactly like this:
✦ They looked at their webservers.
✦ They weren‟t loaded.
✦ They looked at their database servers.
✦ They weren‟t loaded either.
✦ They tried to optimize their database servers because „it
was a database problem last time‟.
✦ Performance wasn‟t much better.

43

Why do we set different goals for
each task?

Two different Tasks:
★ Sequence #1: (Total time 0.01 seconds)

Start Update my status Done

★ Sequence #2: (Total time 60 seconds)

Start Prepare Report Done

45

The “Optimization”
★ The customer was worried about a 60 second long
query that ran at 3am every night.
★ Oh no, let‟s optimize it by adding an index!

46

The outcome:
★ Sequence #1: (Total time 0.02 seconds)

Start Update my status Done

★ Sequence #2: (Total time 10 seconds)

Start Prepare Report Done

47

Optimize what matters:
★ The (ideally user facing) tasks matter.
★ Write instrumentation so that you can think like a

sequence diagram does.

48

Limits of Sequence Diagrams:
★ Why you might not see everyone using a sequence
diagram in production -
✦ It breaks down when there‟s too many sub-tasks involved
in execution.

49

Introducing the Profile

50

Introducing the Profile (cont.)
★ The profile is related to the sequence diagram;
✦ What we do in a profile is aggregate similar sub-tasks into
one entry.
✦ Then we order the profile from total R of a given task.

51

Table Of Contents






52

The Method for Isolation
★ In general:
✦ Start at the user, measure time taken from the front to the
backend.
★ The absolutely certain method:
✦ Plod through the sequence diagram, measure each step.

53

The sequence diagram
★ Advantage:
✦ You will inevitably find the solution and be able to prove it.
★ Disadvantages:
✦ might take too much time
✦ things might be too hard or impossible to measure
✦ there might be too much information

54

Method for Isolation (cont.)
★ Slightly more practical preferred way:
✦ 1. Use a tool that makes it easy
✦ 2. If that's not possible, guess and measure.
✦ 3. If you can't measure anything, reason from logic.

55

Method #1: Instrument by time
★ Our favourite - and where you should start.
★ Only Con:

✦ Not always that easy to do in a complicated system.
✦ May not always be feasible to install instrumentation

56

Method #2: Guess and Measure
★ Look for a part of the system that “looks bad”
★ How much time does it consume?

★ Example: If logging in takes 2.1 seconds and DNS

takes 2 seconds, you found the problem.

57

Method #3 - Reason from Logic
★ Really hard, really slow.
★ If you can't do any of those, you can

✦ ask for help
✦ trial-and-error (more on this later)
★ We suggest asking for help.

58

Work on Unimportant Things
★ [DO] Apply Amdahl's Law: a 5% time consumer can't
benefit you more than 5%.
★ [DON’T] Apply "best practices"

✦ These can have unexpected side effects.
✦ Example: I see many sort_merge_passes, maybe I need to
increase sort_buffer_size.
✦ Example: I've heard that I should put the logs and data on
different disks.

60

Guess-and-don't-measure
★ If you skip measuring, then you cause problems.
✦ This is also known as trial-and-error.
★ This method has 3 interesting properties:
✦ It sometimes works.
✦ It often wastes a lot of time and makes things worse.
✦ It can cost you your job.

61

Guess and Feel-Good-Measure
★ Apply BS-Benchmarks:
✦ Time how long it takes to create 10,000 tables and
conclude InnoDB is slower.
✦ Write an application benchmark that doesn‟t mimic that of
your application (not enough data, single threaded only).

62

The best tool to make things easy
is a profiler.

System-wide
★ System-wide:
✦ oprofile, strace -c
★ Perl: Devel::NYTProf
★ Ruby: ruby -r profile (or just NewRelic)

★ PHP:

✦ Instrumentation-for-PHP -
http://code.google.com/p/instrumentation-for-php
✦ Xdebug + KCachegrind -- but not in production
✦ XHProf http://mirror.facebook.net/facebook/xhprof/
✦ New Relic - http://www.newrelic.com

64

Instrumentation Demo
★ We‟ve got it running here:
✦ http://204.236.183.243/my-movies

65

Table Of Contents






66

Real Life.
★ Custom written profiling code from boardreader.com.
✦ Works similar to what you can achieve with NewRelic and
XHProf or Instrumentation-for-PHP

67

User Tasks
★ If I had to guess it - I would say that they have a few
different tasks that people do:
✦ Loading Static Pages (like the home page)
✦ Browse by a popular topic.
✦ Returning results for a custom search.

71

Task Goals
★ Each of these items (in my mind) has a different
response time goal. Let's make some up for context:
✦ Loading the homepage 100ms.
✦ Browsing by a popular topic 500ms
✦ Search requests 2000ms (2 seconds)

72

Measuring Goals
★ Here's how boardreader.com stores the information:
CREATE TABLE `performance_log_090721` (
`ip` varchar(15) NOT NULL,
`server_ip` varchar(25) NOT NULL,
`page` varchar(3000) NOT NULL,
`utime` float NOT NULL,
`stime` float NOT NULL,
`wtime` float NOT NULL,
`mysql_time` float NOT NULL,
`sphinx_time` float NOT NULL,
`mysql_count_queries` int(11) NOT NULL,
`mysql_queries` text NOT NULL,
`sphinx_count_queries` int(11) NOT NULL,
`sphinx_queries` text NOT NULL,
..
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

73

Measuring Goals (cont.)
★ Here's an aggregation of a day's worth of search
responses:
mysql> select avg(wtime) request, avg(stime+utime)/avg(wtime) cpu
,avg(mysql_time)/avg(wtime) mysql, avg(sphinx_time)/avg(wtime)
sphinx, avg(wtime-stime-utime-sphinx_time-mysql_time)/avg(wtime)
rest from performance_log_090721 where page_type='search' G
*************************** 1. row ***************************
request: 1.2175869055517
cpu: 0.16983144536072
mysql: 0.1544487152423
sphinx: 0.61537297006254
rest: 0.060346869334443

74

Measuring Goals (cont.)
★ mysql> select date_format(logged,"%H") h ,round(avg(wtime),3) r,
round(avg(stime+utime)/avg(wtime),2) cpup ,round(avg(mysql_time)/avg(wtime),2) mp,
round(avg(sphinx_time)/avg(wtime),2) sp, round(avg(wtime-stime-utime-sphinx_time-
mysql_time)/avg(wtime),2) rst from performance_log_090721 where page_type='search' group by h;
+------+-------+------+------+------+------+
| h | r | cpup | mp | sp | rst |
+------+-------+------+------+------+------+
| 00 | 1.816 | 0.11 | 0.14 | 0.70 | 0.05 |
| 01 | 1.480 | 0.17 | 0.18 | 0.59 | 0.06 |
| 02 | 1.394 | 0.16 | 0.22 | 0.53 | 0.09 |
....
| 08 | 1.384 | 0.13 | 0.09 | 0.74 | 0.04 |
| 09 | 1.315 | 0.17 | 0.11 | 0.67 | 0.04 |
| 10 | 0.950 | 0.20 | 0.15 | 0.60 | 0.05 |
| 11 | 0.874 | 0.21 | 0.16 | 0.57 | 0.06 |
| 12 | 1.139 | 0.17 | 0.13 | 0.65 | 0.05 |
| 13 | 1.191 | 0.16 | 0.14 | 0.65 | 0.05 |
| 14 | 1.349 | 0.16 | 0.19 | 0.58 | 0.06 |
| 15 | 1.076 | 0.20 | 0.21 | 0.53 | 0.06 |
| 16 | 1.526 | 0.14 | 0.14 | 0.58 | 0.13 |
| 17 | 0.853 | 0.24 | 0.19 | 0.50 | 0.07 |
| 18 | 0.978 | 0.25 | 0.23 | 0.43 | 0.09 |
| 19 | 0.924 | 0.23 | 0.17 | 0.54 | 0.06 |
| 20 | 1.310 | 0.18 | 0.26 | 0.47 | 0.09 |
| 21 | 1.211 | 0.17 | 0.24 | 0.51 | 0.08 |
| 22 | 1.538 | 0.14 | 0.19 | 0.59 | 0.08 |
| 23 | 1.450 | 0.15 | 0.18 | 0.60 | 0.06 |
+------+-------+------+------+------+------+
75

Interpreting Results
★ Always interesting to see the difference in timing on a
production system with real load!
★ 1 hour as I've got on my slides is probably not the best

aggregate - some „blips‟ in service may not show up.

76

Interpreting Results (cont.)
★ Average (mean) is easy to do in MySQL - but 95th
percentile is probably better.
★ What is the problem with min/max/mean?

77

Actioning Results
★ It is only when you've got these metrics in place that
you can start to look at where to fix the problem.
★ This is where the toolchest comes in.

78

Table Of Contents






79

Top
★ Best use it to confirm what is really running.
✦ On a DB server mysqld should be at the top.
✦ Most of the time there should only be one mysqld process.
★ Example usage:
✦ top -bn1

81

ps
★ Quick Trick - verify that the sum of VSZ adds up to
roughly the amount of memory used by the system:
✦ ps -e -o vsz | awk '{size += $1}END{print(size)}'
★ Check for all running servers:
✦ ps aux | grep mysqld

82

free
★ Doesn‟t really show anything vmstat won‟t.
✦ But very handy one line math to show caches. Example:
$ free -m
total used free shared buffers cached
Mem: 32177 30446 1730 0 368 16649
-/+ buffers/cache: 13428 18748
Swap: 4095 2 4093

83

vmstat
★ Best run when server is busy. You can watch what is
happening right now. Example:
✦ vmstat 5
★ General-purpose, but a good way to sense what the
system as a whole is doing.

84

vmstat (cont.)
★ Most important bits:
✦ si/so should be zero
✦ bi/bo are blocks read and written, so you can see IO
✦ What does 12% cpu usage mean?

85

vmstat (cont)
$ vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 3 0 44604 15980 1585532 0 0 3 281 6 9 0 0 97 3 0
0 3 0 44232 16072 1586572 0 0 1 17718 110 138 0 5 0 94 1
0 4 0 46960 16152 1581112 0 0 30 18514 108 152 0 5 0 94 1
0 3 0 46712 16284 1581196 0 0 29 26166 159 201 0 7 0 91 2
0 3 0 47332 16388 1581220 0 0 1 18990 114 145 0 5 0 94 1

86

mpstat
★ Installed as part of sysstat.
★ More useful than vmstat because it shows individual

CPUs. Example:
✦ mpstat -P ALL 5

87

mpstat (cont)
# mpstat -P ALL 5
Linux 2.6.18-238.el5 (livecd.localdomain) 03/10/2011

11:01:49 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
11:01:54 PM all 10.31 0.00 4.64 0.00 0.00 0.17 0.00 84.89 1004.00
11:01:54 PM 0 0.20 0.00 0.00 0.00 0.00 0.20 0.00 99.60 1000.20
11:01:54 PM 1 0.00 0.00 0.00 0.00 0.00 0.40 0.00 99.60 0.20
11:01:54 PM 2 15.40 0.00 7.40 0.00 0.00 0.20 0.00 77.00 0.00
11:01:54 PM 3 8.40 0.00 3.60 0.00 0.00 0.20 0.00 87.80 0.00
11:01:54 PM 4 31.60 0.00 14.40 0.00 0.00 0.20 0.00 53.80 0.00
11:01:54 PM 5 6.20 0.00 2.40 0.00 0.00 0.20 0.00 91.20 3.60

88

netstat (cont.)
★ Show count of states:
✦ netstat -antp | awk '{print $6}' | sort | uniq -c | sort -rn
★ Show count of peers:
✦ netstat -antp | awk '{print $5}' | cut -d: -f1 | sort | uniq -c |
sort -rn

89

netstat
★ netstat -antp
✦ "ss -ant" works acceptably as well.

$ sudo netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 943/sshd
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN 965/sendmail: accep
tcp 0 0 0.0.0.0:3306 0.0.0.0:* LISTEN 20237/mysqld
tcp 0 52 10.170.182.171:22 24.6.173.25:52944 ESTABLISHED 20470/sshd: ec2-use
tcp 0 0 :::22 :::* LISTEN 943/sshd

90

netstat (cont.)
★ Things to inspect:
✦ Are there a lot of sockets in TIME_WAIT?
✦ A lot going to port 53 (DNS) ?
✦ Who's connecting to mysqld?
★ You need to know the TCP handshake process:
✦ SYN, SYN-ACK, ACK, ESTABLISHED.

91

ping
★ useful for checking for packet loss.
★ let it run, watch the rtt, cancel,

★ look for missing packets.

★ icmp packets may be deprioritized

92

iostat
★ Much better IO statistics than what vmstat provides.
Example usage:
✦ iostat -dx 5 (need x for extended statistics)
✦ iostat -kx 5 (show CPU stats at the same time)
★ Main problem is that it lumps reads and writes
together.
✦ Look for what the disks are doing and ask if this is
reasonable.

93

iostat (cont.)
★ Main items to look at:
✦ What is the queue length?
✦ How much is being read and written?
✦ What is the average wait, and what is the service time?

94

iostat (cont)
avg-cpu: %user %nice %system %iowait %steal %idle
0.11 0.00 5.13 93.91 0.85 0.00

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
xvdap1 0.00 35.15 0.00 14.85 0.00 200.00 26.94 1.97 2.49 1.70 2.52
xvdap2 0.00 2819.23 0.00 302.24 0.00 32481.20 82.59 74.39 23.71 1.76 53.29
xvdap3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

95

Aspersa toolkit
Open source tools which improve upon the toolchest
✦ Disk statistics
• diskstat – does not lump reads/writes, interactive
• ioprofile – shows exactly where a process is using IO

✦ Diagnostics
• stalk / collect – get well scoped diagnostic data
• pmp – diagnose mutex contention
• sift – quickly filter information from collect
• summary / mysql-summary – important info at a glance

96

Scope of Problem:
★ Overnight the query performance went from <1ms to
50x worse.
★ Nothing changed in terms of server configuration,

schema, etc.
★ Tried throttling the server to 1/2 of its workload

✦ from 20k QPS to 10k QPS
✦ no improvement.

98

Train of thought
★ Change in config client doesn't know about?
★ Hardware problem such as a failing disk?

★ Load increase: data growth or QPS crossed a "tipping

point"?
★ Schema changes client doesn't know about (missing

index?)
★ Network component such as DNS?

99

Elimination of easy possibilities:
★ ALL queries are found to be slower in slow-query-log
✦ eliminates DNS as a possibility.
★ Queries are slow when run via Unix socket
✦ eliminates network.
★ No errors in dmesg or RAID controller
✦ suggests (doesn't eliminate) that hardware is not the
problem.

100

Diagnosing MySQL performance problems

More Related Content

What's hot

Similar to Diagnosing MySQL performance problems

More from Justin Swanhart