博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Linux 性能诊断 perf使用指南
阅读量:6293 次
发布时间:2019-06-22

本文共 71016 字,大约阅读时间需要 236 分钟。

标签

Linux , perf , 性能诊断 , stap , systemtap , strace , dtrace , dwarf , profiler , perf_events


背景

Linux在服务端已占据非常大的比例,很多业务很多服务都跑在Linux上面。

软件运行在Linux下,软件本身、以及Linux系统的性能诊断也成为热门的话题。

例如,你要如何回答这些问题

Why is the kernel on-CPU so much? What code-paths?  Which code-paths are causing CPU level 2 cache misses?  Are the CPUs stalled on memory I/O?  Which code-paths are allocating memory, and how much?  What is triggering TCP retransmits?  Is a certain kernel function being called, and how often?  What reasons are threads leaving the CPU?

又或者你是一名DBA或者开发人员,想知道数据库在跑某些benchmark时,性能瓶颈在哪里,是IO,是等待,还是网络,代码瓶颈在哪里?

在Linux下诊断的工具比较多,比如systemtap, dtrace, perf。

本文将介绍一下perf的用法,网上很多叫法如perf_events , perf profiler , Performance Counters for Linux。叫法不同,都指perf。

什么是perf

perf是Linux 2.6+内核中的一个工具,在内核源码包中的位置 tools/perf。

perf利用Linux的trace特性,可以用于实时跟踪,统计event计数(perf stat);或者使用采样(perf record),报告(perf report|script|annotate)的使用方式进行诊断。

perf命令行接口并不能利用所有的Linux trace特性,有些trace需要通过ftrace接口得到。

参考 

perf工作原理

这张图大致列出了perf支持的跟踪事件,从kernerl到user space,支持块设备、网络、CPU、文件系统、内存等,同时还支持系统调用,用户库的事件跟踪。

你可以使用perf list输出当前内核perf 支持的预置events

perf listList of pre-defined events (to be used in -e):  ref-cycles                                         [Hardware event]  alignment-faults                                   [Software event]  context-switches OR cs                             [Software event]  cpu-clock                                          [Software event]  cpu-migrations OR migrations                       [Software event]  dummy                                              [Software event]  emulation-faults                                   [Software event]  major-faults                                       [Software event]  minor-faults                                       [Software event]  page-faults OR faults                              [Software event]  task-clock                                         [Software event].....略.......  writeback:writeback_pages_written                  [Tracepoint event]  writeback:writeback_queue                          [Tracepoint event]  writeback:writeback_task_start                     [Tracepoint event]  writeback:writeback_task_stop                      [Tracepoint event]

perf background

我们看到perf支持这么多的事件和trace,它依赖了很多的接口来干这件事情。

1. Symbols

没有符号表,无法将内存地址翻译成函数和变量名。

例如,无符号表的跟踪显示如下

57.14%     sshd  libc-2.15.so        [.] connect                          |               --- connect                  |                            |--25.00%-- 0x7ff3c1cddf29                  |                            |--25.00%-- 0x7ff3bfe82761                  |          0x7ff3bfe82b7c                  |                            |--25.00%-- 0x7ff3bfe82dfc                   --25.00%-- [...]

有符号表的跟踪显示如下

57.14%     sshd  libc-2.15.so        [.] __GI___connect_internal               |               --- __GI___connect_internal                  |                            |--25.00%-- add_one_listen_addr.isra.0                  |                            |--25.00%-- __nscd_get_mapping                  |          __nscd_get_map_ref                  |                            |--25.00%-- __nscd_open_socket                   --25.00%-- [...]

如何安装符号表?

对于内核代码的符号表,在编译内核时,使用CONFIG_KALLSYMS=y。 检查如下

# cat /boot/config-`uname -r` |grep CONFIG_KALLSYMSCONFIG_KALLSYMS=yCONFIG_KALLSYMS_ALL=yCONFIG_KALLSYMS_EXTRA_PASS=y

对于用户安装软件的符号表,如果是yum安装的,可以安装对应的debuginfo包。

如果是用户自己编译的,例如使用GCC编译时加上-g选项。

2. perf annotate

perf annotate can generate sourcecode level information if the application is compiled with -ggdb.

3. Stack Traces (使用perf record -g收集stack traces)

要跟踪完整的stack,编译时需要注意几个东西。

Always compile with frame pointers. Omitting frame pointers is an evil compiler optimization that breaks debuggers, and sadly, is often the default. Without them, you may see incomplete stacks from perf_events, like seen in the earlier sshd symbols example. There are two ways to fix this: either using dwarf data to unwind the stack, or returning the frame pointers.

3.1 编译perf时包含libunwind和-g dwarf,需要3.9以上的内核版本。

Since about the 3.9 kernel, perf_events has supported a workaround for missing frame pointers in user-level stacks: libunwind, which uses dwarf. This can be enabled using "-g dwarf".

3.2 有些编译优化项会忽略frame pointer,所以编译软件时必须指定 -fno-omit-frame-pointer ,才能跟踪完整的stack trace.

The earlier sshd example was a default build of OpenSSH, which uses compiler optimizations (-O2), which in this case has omitted the frame pointer. Here's how it looks after recompiling OpenSSH with -fno-omit-frame-pointer

3.3 编译内核时包含 CONFIG_FRAME_POINTER=y

总结一下,要愉快的跟踪更完备的信息,就要在编译软件时打开符号表的支持(gcc -g),开启annotate的支持(gcc -ggdb),以及Stack trace的支持(gcc -fno-omit-frame-pointer)。

perf pre-defined event说明

Hardware [Cache] Events:   CPU相关计数器  CPU周期、指令重试,内存间隔周期、L2CACHE miss等  These instrument low-level processor activity based on CPU performance counters.   For example, CPU cycles, instructions retired, memory stall cycles, level 2 cache misses, etc.   Some will be listed as Hardware Cache Events.Software Events:   内核相关计数器  These are low level events based on kernel counters.   For example, CPU migrations, minor faults, major faults, etc.Tracepoint Events:   内核ftrace框架相关,例如系统调用,TCP事件,文件系统IO事件,块设备事件等。    根据LIBRARY归类。如sock表示socket事件。    This are kernel-level events based on the ftrace framework. These tracepoints are placed in interesting and logical locations of the kernel, so that higher-level behavior can be easily traced.   For example, system calls, TCP events, file system I/O, disk I/O, etc.   These are grouped into libraries of tracepoints;     eg, "sock:" for socket events, "sched:" for CPU scheduler events.Dynamic Tracing:   动态跟踪,可以在代码中的任何位置创建事件跟踪节点。很好很强大。    内核跟踪使用kprobe,user-level跟踪使用uprobe。    Software can be dynamically instrumented, creating events in any location.   For kernel software, this uses the kprobes framework.   For user-level software, uprobes.Timed Profiling:   采样频度,按指定频率采样,被用于perf record。    Snapshots can be collected at an arbitrary frequency, using perf record -FHz.   This is commonly used for CPU usage profiling, and works by creating custom timed interrupt events.

了解了perf event后,我们可以更精细的,有针对性的对事件进行跟踪,采样,报告。

当然,你也可以不指定事件,全面采样。

build perf

例如centos你可以使用yum安装,也可以使用源码安装。

perf在内核源码的tools/perf中,所以下载与你的内核大版本一致的内核源码即可

uname -a wget https://cdn.kernel.org/pub/linux/kernel/v3.x/linux-3.10.104.tar.xztar -xvf linux-3.10.104.tar.xzcd linux-3.10.104/tools/perf/

安装依赖库,有一个小窍门可以找到依赖的库

$cat Makefile |grep found                msg := $(warning No libelf found, disables 'probe' tool, please install elfutils-libelf-devel/libelf-dev);                msg := $(error No gnu/libc-version.h found, please install glibc-dev[el]/glibc-static);                msg := $(warning No libdw.h found or old libdw.h found or elfutils is older than 0.138, disables dwarf support. Please install new elfutils-devel/libdw-dev);        msg := $(warning No libunwind found, disabling post unwind support. Please install libunwind-dev[el] >= 0.99);                msg := $(warning No libaudit.h found, disables 'trace' tool, please install audit-libs-devel or libaudit-dev);                msg := $(warning slang not found, disables TUI support. Please install slang-devel or libslang-dev);                msg := $(warning GTK2 not found, disables GTK2 support. Please install gtk2-devel or libgtk2.0-dev);  $(if $(1),$(warning No $(1) was found))                                                msg := $(warning No bfd.h/libbfd found, install binutils-dev[el]/zlib-static to gain symbol demangling)                msg := $(warning No numa.h found, disables 'perf bench numa mem' benchmark, please install numa-libs-devel or libnuma-dev);

通常依赖 gcc make bison flex elfutils libelf-dev libdw-dev libaudit-dev python-dev binutils-dev

perf依赖的kernel宏

并不是每个开关都需要,但是有些没有就不方便或者功能缺失,例如没有打开符号表的话,看到的是一堆内存地址。

# for perf_events:CONFIG_PERF_EVENTS=y# for stack traces:CONFIG_FRAME_POINTER=y# kernel symbols:CONFIG_KALLSYMS=y# tracepoints:CONFIG_TRACEPOINTS=y# kernel function trace:CONFIG_FTRACE=y# kernel-level dynamic tracing:CONFIG_KPROBES=yCONFIG_KPROBE_EVENTS=y# user-level dynamic tracing:CONFIG_UPROBES=yCONFIG_UPROBE_EVENTS=y# full kernel debug info:CONFIG_DEBUG_INFO=y# kernel lock tracing:CONFIG_LOCKDEP=y# kernel lock tracing:CONFIG_LOCK_STAT=y# kernel dynamic tracepoint variables:CONFIG_DEBUG_INFO=y

一些开关的用途介绍

1. Kernel-level symbols are in the kernel debuginfo package, or when the kernel is compiled with CONFIG_KALLSYMS.

2. The kernel stack traces are incomplete. Now a similar profile with CONFIG_FRAME_POINTER=y

3. 当我们使用perf record [stack traces (-g)]时,可以跟踪stack,但是如果内核编译时没有指定CONFIG_FRAME_POINTER=y,perf report时就会看到缺失的信息。

不包含CONFIG_FRAME_POINTER=y时

99.97%  swapper  [kernel.kallsyms]  [k] default_idle            |            --- default_idle     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16                  |               --- iowrite16                   __write_nocancel                   (nil)

包含CONFIG_FRAME_POINTER=y时 (Much better -- the entire path from the write() syscall (__write_nocancel) to iowrite16() can be seen.)

99.97%  swapper  [kernel.kallsyms]  [k] default_idle            |            --- default_idle                cpu_idle               |                         |--87.50%-- start_secondary               |                          --12.50%-- rest_init                          start_kernel                          x86_64_start_reservations                          x86_64_start_kernel     0.03%     sshd  [kernel.kallsyms]  [k] iowrite16               |               --- iowrite16                   vp_notify                   virtqueue_kick                   start_xmit                   dev_hard_start_xmit                   sch_direct_xmit                   dev_queue_xmit                   ip_finish_output                   ip_output                   ip_local_out                   ip_queue_xmit                   tcp_transmit_skb                   tcp_write_xmit                   __tcp_push_pending_frames                   tcp_sendmsg                   inet_sendmsg                   sock_aio_write                   do_sync_write                   vfs_write                   sys_write                   system_call_fastpath                   __write_nocancel

4. 如果要使用动态跟踪,即跟踪任意指定代码,则需要打开这些开关:

For kernel analysis, using CONFIG_KPROBES=y and CONFIG_KPROBE_EVENTS=y, to enable kernel dynamic tracing. and CONFIG_FRAME_POINTER=y, for frame pointer-based kernel stacks.

For user-level analysis, CONFIG_UPROBES=y and CONFIG_UPROBE_EVENTS=y, for user-level dynamic tracing.

5. 如果打开了CONFIG_DEBUG_INFO,则可以在动态跟踪中打印内核变量的值。

If your kernel has debuginfo (CONFIG_DEBUG_INFO=y), you can fish out kernel variables from functions. This is a simple example of examining a size_t (integer)

例如

1. Listing variables available for tcp_sendmsg():# perf probe -V tcp_sendmsgAvailable variables at tcp_sendmsg        @
size_t size struct kiocb* iocb struct msghdr* msg struct sock* sk2. Creating a probe for tcp_sendmsg() with the "size" variable:# perf probe --add 'tcp_sendmsg size'Added new event: probe:tcp_sendmsg (on tcp_sendmsg with size)3. You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1

通过以下命令可以查看linux的config

cat /boot/config-`uname -r`## Automatically generated make config: don't edit# Linux kernel version: 2.6.32-573.el6.x86_64# Thu Jul 23 15:38:20 2015#CONFIG_64BIT=y# CONFIG_X86_32 is not setCONFIG_X86_64=yCONFIG_X86=yCONFIG_OUTPUT_FORMAT="elf64-x86-64"CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"CONFIG_GENERIC_CMOS_UPDATE=y.......

perf使用说明

perf --help

先了解一下概貌

perf 命令用法还是挺简单的,根据功能区分了COMMAND,每个COMMAND有各自的用法。

用得比较多的有list, record, report, script, stat, top。

usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS] The most commonly used perf commands are:   annotate        Read perf.data (created by perf record) and display annotated code   archive         Create archive with object files with build-ids found in perf.data file   bench           General framework for benchmark suites   buildid-cache   Manage build-id cache.   buildid-list    List the buildids in a perf.data file   data            Data file related processing   diff            Read perf.data files and display the differential profile   evlist          List the event names in a perf.data file   inject          Filter to augment the events stream with additional information   kmem            Tool to trace/measure kernel memory properties   kvm             Tool to trace/measure kvm guest os   list            List all symbolic event types   lock            Analyze lock events   mem             Profile memory accesses   record          Run a command and record its profile into perf.data   report          Read perf.data (created by perf record) and display the profile   sched           Tool to trace/measure scheduler properties (latencies)   script          Read perf.data (created by perf record) and display trace output   stat            Run a command and gather performance counter statistics   test            Runs sanity tests.   timechart       Tool to visualize total system behavior during a workload   top             System profiling tool.   probe           Define new dynamic tracepoints   trace           strace inspired tool See 'perf help COMMAND' for more information on a specific command.

perf help COMMAND

要得到每个command的用法也蛮简单,可以使用perf help COMMAND得到。

例如

perf help recordPERF-RECORD(1)                    perf Manual                   PERF-RECORD(1)NAME       perf-record - Run a command and record its profile into perf.dataSYNOPSIS       perf record [-e 
| --event=EVENT] [-l] [-a]
perf record [-e
| --event=EVENT] [-l] [-a] —
[
]DESCRIPTION This command runs a command and gathers a performance counter profile from it, into perf.data - without displaying anything. This file can then be inspected later on, using perf report.OPTIONS
... Any command you can specify in a shell......

perf top跟踪实时信息

跟踪时可以指定事件,CPU,以及是否跟踪stack trace。

perf top -ag        -a, --all-cpus           System-wide collection. (default)       -g           Enables call-graph (stack chain/backtrace) recording.

输出如下

Samples: 240  of event 'cpu-clock', Event count (approx.): 19122881                                                                                                                                                                            Children      Self  Shared Object      Symbol  +   14.64%    14.64%  [kernel]           [k] _spin_unlock_irqrestore   +   10.91%    10.91%  libslang.so.2.2.1  [.] SLtt_smart_puts       +    6.02%     6.02%  perf               [.] symbols__insert   +    6.02%     6.02%  [kernel]           [k] kallsyms_expand_symbol   +    6.01%     6.01%  [kernel]           [k] copy_page   +    3.96%     0.64%  libc-2.12.so       [.] _int_malloc   +    3.61%     3.61%  [kernel]           [k] number          +    3.31%     3.31%  [kernel]           [k] clear_page   +    2.71%     2.71%  [kernel]           [k] pointer     ....

输入 ? 可以得到top的帮助介绍

│                                                                                     │UP/DOWN/PGUP                                                 ││PGDN/SPACE    Navigate                                       ││q/ESC/CTRL+C  Exit browser                                   ││                                                             ││For multiple event sessions:                                 ││                                                             ││TAB/UNTAB     Switch events                                  ││                                                             ││For symbolic views (--sort has sym):                         ││                                                             ││->            Zoom into DSO/Threads & Annotate current symbol││<-            Zoom out                                       ││a             Annotate current symbol                        ││C             Collapse all callchains                        ││d             Zoom into current DSO                          ││E             Expand all callchains                          ││F             Toggle percentage of filtered entries          ││H             Display column headers                         ││P             Print histograms to perf.hist.N                ││t             Zoom into current Thread                       ││V             Verbose (DSO names in callchains, etc)         ││z             Toggle zeroing of samples                      ││/             Filter symbol by name

输入E全部展开,展开后可以得到stack trace的结果, 如果发现有地址的信息,但是没有符号表的信息,可能是软件编译时没有指定-g,如果stack信息不完整,则软件编译时需要加-fno-omit-frame-pointer(参考前面章节的介绍)。

Samples: 1K of event 'cpu-clock', Event count (approx.): 17510648        Children      Self  Shared Object        Symbol                      -   28.51%    28.51%  [kernel]             [k] _spin_unlock_irqrestore    -→_spin_unlock_irqrestore [kernel]         17.24% __write_nocancel libc-2.12.so                                                                                                                                       13.79% 0xe7cd libpthread-2.12.so                                       10.34% 0x7f670b15b923 unknown                                          10.34% __fdatasync_nocancel libc-2.12.so                                  0x1b00000017 unknown        4                                       10.34% __libc_fork libc-2.12.so                                        6.90% __select libc-2.12.so                                            3.45% 0x7f670b15b822 unknown                                           3.45% 0x7f670b14d560 unknown                                           3.45% 0x7f670b11ede7 unknown                                           3.45% 0x7f670b11eb5d unknown                                              0x2ad1 pam_unix.so                                                     0xdcdd4c000000000 unknown                                           3.45% __clone libc-2.12.so                                             3.45% __libc_writev libc-2.12.so                                       3.45% __read_nocancel libc-2.12.so                                     3.45% 0x7b294 libslang.so.2.2.1                                        3.45% 0x7efee6ee2560 unknown

输入C全部收起

Samples: 240  of event 'cpu-clock', Event count (approx.): 19122881                                                                                                                                                                            Children      Self  Shared Object      Symbol  +   14.64%    14.64%  [kernel]           [k] _spin_unlock_irqrestore   +   10.91%    10.91%  libslang.so.2.2.1  [.] SLtt_smart_puts       +    6.02%     6.02%  perf               [.] symbols__insert   +    6.02%     6.02%  [kernel]           [k] kallsyms_expand_symbol   +    6.01%     6.01%  [kernel]           [k] copy_page   +    3.96%     0.64%  libc-2.12.so       [.] _int_malloc   +    3.61%     3.61%  [kernel]           [k] number          +    3.31%     3.31%  [kernel]           [k] clear_page   +    2.71%     2.71%  [kernel]           [k] pointer     ....

Symbol列[]中字符代表的含义如下,通常是k或者.表示kernel和user space事件。

[.]: user level[k]: kernel level[g]: guest kernel level (virtualization)[u]: guest os user space[H]: hypervisor

perf record 收集统计信息

perf record用来收集统计信息,通常的用法包括

1. 使用record收集统计信息,可以收集全局的,指定PID或者线程ID的,指定CPU的,指定USER ID的。

-a, --all-cpus           System-wide collection from all CPUs.       -p, --pid=           Record events on existing process ID (comma separated list).       -t, --tid=           Record events on existing thread ID (comma separated list). This option also disables inheritance by default. Enable it by adding --inherit.       -u, --uid=           Record events in threads owned by uid. Name or number.

2. 收集间隔,可以指定采样的频率,采样的EVENT数,采样时间。

-c, --count=           Event period to sample.       -F, --freq=           Profile at this frequency.

3. 采集的详细程度,可以指定event,使用CPU实时调度策略的进程(可以参考RHEL 讲CGROUP cpu部分的文章),是否跟踪stack chain,

-r, --realtime=           Collect data with this RT SCHED_FIFO priority.       -g           Enables call-graph (stack chain/backtrace) recording.       --call-graph           Setup and enable call-graph (stack chain/backtrace) recording, implies -g.               Allows specifying "fp" (frame pointer) or "dwarf"               (DWARF′s CFI - Call Frame Information) or "lbr"               (Hardware Last Branch Record facility) as the method to collect               the information used to show the call graphs.               In some systems, where binaries are build with gcc               --fomit-frame-pointer, using the "fp" method will produce bogus               call graphs, using "dwarf", if available (perf tools linked to               the libunwind library) should be used instead.               Using the "lbr" method doesn′t require any compiler options. It               will produce call graphs from the hardware LBR registers. The               main limition is that it is only available on new Intel               platforms, such as Haswell. It can only get user call chain. It               doesn′t work with branch stack sampling at the same time.       -e, --event=           Select the PMU event. Selection can be:           ·   a symbolic event name (use perf list to list all events)           ·   a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.           ·   a symbolically formed PMU event like pmu/param1=0x3,param2/ where param1, param2, etc are defined as formats for the PMU in /sys/bus/event_sources/devices/
/format/*. · a symbolically formed event like pmu/config=M,config1=N,config3=K/ where M, N, K are numbers (in decimal, hex, octal format). Acceptable values for each of ′config′, ′config1′ and ′config2′ are defined by corresponding entries in /sys/bus/event_sources/devices/
/format/* param1 and param2 are defined as formats for the PMU in: /sys/bus/event_sources/devices/
/format/* · a group of events surrounded by a pair of brace ("{event1,event2,...}"). Each event is separated by commas and the group should be quoted to prevent the shell interpretation. You also need to use --group on "perf report" to view group events together. --filter=
Event filter.

例子

开始跑一个benchmark,同时收集统计信息(指定-a, -g选项)。  CFLAGS='-g -ggdb -fno-omit-frame-pointer' ./configure --prefix=/home/digoal/pgsql9.6make world -j 32make install-world -j 32初始化略pgbench -i -s 100跟踪perf record -ag  Ctrl-C,退出perf record,统计信息已输出到perf.data。

perf report

解读前面收集到的perf.data.

常用的开关如下,--tui是交互式的文本显示窗口,--stdio是文本显示窗口。

-i, --input=           Input file name. (default: perf.data)       -v, --verbose           Be more verbose. (show symbol address, etc)       -n, --show-nr-samples           Show the number of samples for each symbol       -g [type,min], --call-graph           Display call chains using type and min percent threshold. type can be either:           ·   flat: single column, linear exposure of call chains.           ·   graph: use a graph tree, displaying absolute overhead rates.           ·   fractal: like graph, but displays relative rates. Each branch of the tree is considered as a new profiled object.               Default: fractal,0.5.       --stdio           Use the stdio interface.       --tui           Use the TUI interface, that is integrated with annotate and allows zooming into DSOs or threads, among other features. Use of --tui requires a tty, if one is not present, as when piping to other           commands, the stdio interface is used.

交互式显示例子,看概貌挺方便的 (常用的交互命令: E扩展,C收敛,q退出)

#perf report -g --tuiEvents: 52K cycles+     11.17%         postgres  postgres                             [.] hash_seq_search     [.]表示user call, [k]表示kernel call+      4.41%          pgbench  libc-2.12.so                         [.] __GI_vfprintf                           +      4.02%         postgres  postgres                             [.] NextCopyFrom                            +      3.27%         postgres  [kernel.kallsyms]                    [k] copy_user_generic_string                +      2.83%         postgres  postgres                             [.] CopyReadLineText                        +      2.25%         postgres  postgres                             [.] pg_comp_crc32c_sse42                    +      2.18%         postgres  libc-2.12.so                         [.] memcpy                                  +      1.99%         postgres  postgres                             [.] CopyReadAttributesText                  +      1.72%         postgres  [xfs]                                [k] ftrace_raw_init_event_xfs_dir2_leafn_add+      1.65%         postgres  postgres                             [.] heap_fill_tuple                         +      1.52%         postgres  libc-2.12.so                         [.] __GI_____strtoll_l_internal             +      1.26%         postgres  postgres                             [.] heap_form_tuple                         +      1.22%          pgbench  libc-2.12.so                         [.] _IO_default_xsputn_internal             +      1.17%         postgres  postgres                             [.] AllocSetAlloc                           +      1.16%         postgres  postgres                             [.] heap_compute_data_size                  +      1.10%         postgres  postgres                             [.] heap_multi_insert                       +      1.03%         postgres  postgres                             [.] pg_atoi                                 +      1.03%         postgres  libc-2.12.so                         [.] __memset_sse2                           +      1.01%         postgres  postgres                             [.] pg_verify_mbstr_len                     +      0.95%          pgbench  libc-2.12.so                         [.] _itoa_word                              +      0.89%         postgres  postgres                             [.] int4in                                  +      0.83%         postgres  postgres                             [.] resetStringInfo                         +      0.81%         postgres  postgres                             [.] InputFunctionCall                       +      0.81%         postgres  postgres                             [.] hash_search_with_hash_value             +      0.79%         postgres  postgres                             [.] ExecConstraints                         +      0.71%         postgres  postgres                             [.] CopyFrom                                +      0.71%          pgbench  libc-2.12.so                         [.] __strchrnul                             +      0.67%         postgres  postgres                             [.] PageAddItemExtended                     +      0.66%         postgres  [kernel.kallsyms]                    [k] find_get_page                           +      0.65%         postgres  postgres                             [.] CopyReadLine                            +      0.62%         postgres  [kernel.kallsyms]                    [k] page_fault                              +      0.58%         postgres  postgres                             [.] heap_prepare_insert                     +      0.57%         postgres  postgres                             [.] heap_prune_chain                        +      0.52%         postgres  postgres                             [.] bpcharin                                +      0.52%          pgbench  libpq.so.5.9                         [.] PQputCopyData                           +      0.52%         postgres  postgres                             [.] AllocSetFreeIndex                       +      0.52%         postgres  postgres                             [.] pq_getbytes                             +      0.51%         postgres  postgres                             [.] RelationPutHeapTuple                    +      0.49%         postgres  postgres                             [.] BufferGetBlockNumber                    +      0.47%         postgres  postgres                             [.] enlargeStringInfo                       +      0.47%         postgres  postgres                             [.] bpchar_input                            +      0.46%         postgres  postgres                             [.] CopyGetData                             +      0.45%         postgres  postgres                             [.] lazy_scan_heap                          +      0.45%         postgres  [kernel.kallsyms]                    [k] radix_tree_lookup_slot                  +      0.44%          pgbench  libpq.so.5.9                         [.] PQputline                               +      0.44%         postgres  postgres                             [.] CopyLoadRawBuf                          +      0.42%         postgres  postgres                             [.] HeapTupleSatisfiesVacuum                +      0.42%         postgres  [kernel.kallsyms]                    [k] _spin_lock_irq                          +      0.42%         postgres  [kernel.kallsyms]                    [k] list_del                                +      0.38%       flush-8:16  [xfs]                                [k] ftrace_raw_init_event_xfs_dir2_leafn_add+      0.37%         postgres  postgres                             [.] ExecStoreTuple                          +      0.36%         postgres  postgres                             [.] appendBinaryStringInfo                  +      0.35%         postgres  postgres                             [.] pq_getmessage                           +      0.35%          pgbench  libpq.so.5.9                         [.] pqPutMsgEnd                             Press '?' for help on key bindings

文本显示例子,看细节挺方便

#perf report -v -n --showcpuutilization -g --stdio

首先显示了一些异常,如果你发现少了符号表或者什么的,可以根据提示安装debuginfo,或者重新编译内核或软件。

Looking at the vmlinux_path (6 entries long)dso__load_sym: cannot get elf header.Using /proc/kallsyms for symbolsLooking at the vmlinux_path (6 entries long)No kallsyms or vmlinux with build-id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 was found[bianque_driver] with build id 3187a0b0fc53e27c19f9fad3e63f9437402e5548 not found, continuing without symbolsLooking at the vmlinux_path (6 entries long)No kallsyms or vmlinux with build-id e6c06734499e665685cd28ac846d8d69d95cce8c was found[missing_slash] with build id e6c06734499e665685cd28ac846d8d69d95cce8c not found, continuing without symbols/lib64/ld-2.12.so was updated, restart the long running apps that use it!/usr/lib64/libsyslog-ng-3.6.so.0.0.0 was updated, restart the long running apps that use it!/disk1/gpdb_20160101/bin/postgres was updated, restart the long running apps that use it!Looking at the vmlinux_path (6 entries long)No kallsyms or vmlinux with build-id 253ed41ca3926c96e4181ac4df30cf3cde868089 was found[pdeath_signal] with build id 253ed41ca3926c96e4181ac4df30cf3cde868089 not found, continuing without symbolsno symbols found in /sbin/iscsid, maybe install a debug package?no symbols found in /bin/egrep, maybe install a debug package?no symbols found in /usr/sbin/dmidecode, maybe install a debug package?/opt/aegis/lib64/aegis_monitor.so was updated, restart the long running apps that use it!

报告的细节如下

# Events: 52K cycles## Overhead  Samples    sys    us            Command                                   Shared Object                                                               # 事件百分比 采样百分比  系统消耗占比  系统消耗u秒  命令  共享对象# ........ ..........  ...............  ............................................  ...............................................................#第一条   事件百分比  采样百分比  系统消耗占比  系统消耗u秒    命令                                         共享对象                 11.17%     0.00%     5.56%       2173               postgres                              /home/digoal/pgsql9.6/bin/postgres       0x50eb35         B [.] hash_seq_search                   |                   --- hash_seq_search  (输出call stack trace,需要在编译软件时开启-g开关, -fno-omit-frame-pointer开关)                       GetRunningTransactionLocks                       LogStandbySnapshot    (有两个分支会调用 LogStandbySnapshot)                      |                                |--50.23%-- CreateCheckPoint    (分支开销占比)                      |          CheckpointerMain                      |          AuxiliaryProcessMain                      |          StartChildProcess                      |          reaper                      |          __restore_rt                      |          PostmasterMain                      |          startup_hacks                      |          __libc_start_main                      |                                 --49.77%-- BackgroundWriterMain  (分支开销占比)                                 AuxiliaryProcessMain                                 StartChildProcess                                 reaper                                 __restore_rt                                 PostmasterMain                                 startup_hacks                                 __libc_start_main下面这条全部是系统开销       4.41%     0.00%     4.41%       1808            pgbench                                   /lib64/libc-2.12.so                          0x44005          B [.] __GI_vfprintf                    |                    --- __GI_vfprintf                       |                                 |--99.48%-- __vsnprintf                       |          |          后面。。。。。。略

perf script

一条条打印perf.data内的数据,输出的是perf record收集到的原始数据。

生成热力图、火焰图也需要perf script的输出,从最原始的信息中提取数据,生成svg。

perf-script - Read perf.data (created by perf record) and display trace output  主要的options       -G, --hide-call-graph           When printing symbols do not display call chain.    -v, --verbose         be more verbose (show symbol address, etc)    -L, --Latency         show latency attributes (irqs/preemption disabled, etc)    -d, --debug-mode      do various checks like samples ordering and lost events    -D, --dump-raw-trace  dump raw trace in ASCII#perf script -v -G|wc -l52958 刚好等于前面使用perf report输出的# Events: 52K cycles  详细内容就不列出了,非常多

perf annotate

解读前面收集到的perf.data,辅以汇编指令显示。

perf-annotate - Read perf.data (created by perf record) and display annotated code         -i, --input=           Input file name. (default: perf.data)       -v, --verbose           Be more verbose. (Show symbol address, etc)       -D, --dump-raw-trace           Dump raw trace in ASCII.       -l, --print-line           Print matching source lines (may be slow).       -P, --full-paths           Don’t shorten the displayed pathnames.       --stdio           Use the stdio interface.       --tui           Use the TUI interface Use of --tui requires a tty, if one is not present, as when piping to other commands, the stdio interface is used. This interfaces starts by centering on the line with more           samples, TAB/UNTAB cycles through the lines with more samples.

例子

Sorted summary for file /home/digoal/pgsql9.6/bin/postgres----------------------------------------------   32.43 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)   16.20 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)   14.16 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)   10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1406   10.68 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1)    9.97 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1412    3.76 /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) Percent |      Source code & Disassembly of /home/digoal/pgsql9.6/bin/postgres------------------------------------------------                          50eac0: 466                          50eac4: 1                          50eac7: 3                          50eaca: 1                          50eaec: 435                          50eaf5: 12                          50eafb: 7                          50eb00: 20                          50eb08: 5                          50eb10: 8                          50eb14: 1                          50eb18: 5                          50eb1b: 4                          50eb1e: 15                          50eb22: 466                          50eb26: 2                          50eb2e: 2                          50eb32: 6                          50eb35: 618                          50eb38: 707                          50eb3c: 164                          50eb41: 1415                          h->sum: 4363         :         :         :         :      Disassembly of section .text:         :         :      000000000090e9cc 
: : register_seq_scan(hashp); : } : : void * : hash_seq_search(HASH_SEQ_STATUS *status) : { 0.00 : 90e9cc: 55 push %rbp 0.00 : 90e9cd: 48 89 e5 mov %rsp,%rbp 0.00 : 90e9d0: 48 83 ec 60 sub $0x60,%rsp 0.00 : 90e9d4: 48 89 7d a8 mov %rdi,-0x58(%rbp) : long segment_ndx; : HASHSEGMENT segp; : uint32 curBucket; : HASHELEMENT *curElem; : : if ((curElem = status->curEntry) != NULL) 0.00 : 90e9d8: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9dc: 48 8b 40 10 mov 0x10(%rax),%rax 0.00 : 90e9e0: 48 89 45 d8 mov %rax,-0x28(%rbp) 0.00 : 90e9e4: 48 83 7d d8 00 cmpq $0x0,-0x28(%rbp) 0.00 : 90e9e9: 74 3a je 90ea25
: { : /* Continuing scan of curBucket... */ : status->curEntry = curElem->link; 0.00 : 90e9eb: 48 8b 45 d8 mov -0x28(%rbp),%rax 0.00 : 90e9ef: 48 8b 10 mov (%rax),%rdx 0.00 : 90e9f2: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9f6: 48 89 50 10 mov %rdx,0x10(%rax) : if (status->curEntry == NULL) /* end of this bucket */ 0.00 : 90e9fa: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90e9fe: 48 8b 40 10 mov 0x10(%rax),%rax 0.00 : 90ea02: 48 85 c0 test %rax,%rax 0.00 : 90ea05: 75 11 jne 90ea18
: ++status->curBucket; 0.00 : 90ea07: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea0b: 8b 40 08 mov 0x8(%rax),%eax 0.00 : 90ea0e: 8d 50 01 lea 0x1(%rax),%edx 0.00 : 90ea11: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea15: 89 50 08 mov %edx,0x8(%rax) : return (void *) ELEMENTKEY(curElem); 0.00 : 90ea18: 48 8b 45 d8 mov -0x28(%rbp),%rax 0.00 : 90ea1c: 48 83 c0 10 add $0x10,%rax 0.00 : 90ea20: e9 54 01 00 00 jmpq 90eb79
: } : : /* : * Search for next nonempty bucket starting at curBucket. : */ : curBucket = status->curBucket; 0.00 : 90ea25: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea29: 8b 40 08 mov 0x8(%rax),%eax 0.00 : 90ea2c: 89 45 e4 mov %eax,-0x1c(%rbp) : hashp = status->hashp; 0.00 : 90ea2f: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea33: 48 8b 00 mov (%rax),%rax 0.00 : 90ea36: 48 89 45 d0 mov %rax,-0x30(%rbp) : hctl = hashp->hctl; 0.00 : 90ea3a: 48 8b 45 d0 mov -0x30(%rbp),%rax 0.00 : 90ea3e: 48 8b 00 mov (%rax),%rax 0.00 : 90ea41: 48 89 45 c8 mov %rax,-0x38(%rbp) : ssize = hashp->ssize; 0.00 : 90ea45: 48 8b 45 d0 mov -0x30(%rbp),%rax 0.00 : 90ea49: 48 8b 40 50 mov 0x50(%rax),%rax 0.00 : 90ea4d: 48 89 45 c0 mov %rax,-0x40(%rbp) : max_bucket = hctl->max_bucket; 0.00 : 90ea51: 48 8b 45 c8 mov -0x38(%rbp),%rax 0.00 : 90ea55: 8b 80 10 03 00 00 mov 0x310(%rax),%eax 0.00 : 90ea5b: 89 45 bc mov %eax,-0x44(%rbp) : : if (curBucket > max_bucket) 0.00 : 90ea5e: 8b 45 e4 mov -0x1c(%rbp),%eax 0.00 : 90ea61: 3b 45 bc cmp -0x44(%rbp),%eax 0.00 : 90ea64: 76 16 jbe 90ea7c
: { : hash_seq_term(status); 0.00 : 90ea66: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ea6a: 48 89 c7 mov %rax,%rdi 0.00 : 90ea6d: e8 09 01 00 00 callq 90eb7b
: return NULL; /* search is done */ 0.00 : 90ea72: b8 00 00 00 00 mov $0x0,%eax 0.00 : 90ea77: e9 fd 00 00 00 jmpq 90eb79
: } : : /* : * first find the right segment in the table directory. : */ : segment_num = curBucket >> hashp->sshift; 0.00 : 90ea7c: 48 8b 45 d0 mov -0x30(%rbp),%rax 0.00 : 90ea80: 8b 40 58 mov 0x58(%rax),%eax 0.00 : 90ea83: 8b 55 e4 mov -0x1c(%rbp),%edx 0.00 : 90ea86: 89 c1 mov %eax,%ecx 0.00 : 90ea88: d3 ea shr %cl,%edx 0.00 : 90ea8a: 89 d0 mov %edx,%eax 0.00 : 90ea8c: 89 c0 mov %eax,%eax 0.00 : 90ea8e: 48 89 45 f8 mov %rax,-0x8(%rbp) : segment_ndx = MOD(curBucket, ssize); 0.00 : 90ea92: 8b 45 e4 mov -0x1c(%rbp),%eax 0.00 : 90ea95: 48 8b 55 c0 mov -0x40(%rbp),%rdx 0.00 : 90ea99: 48 83 ea 01 sub $0x1,%rdx 0.00 : 90ea9d: 48 21 d0 and %rdx,%rax 0.00 : 90eaa0: 48 89 45 f0 mov %rax,-0x10(%rbp) : : segp = hashp->dir[segment_num]; 0.00 : 90eaa4: 48 8b 45 d0 mov -0x30(%rbp),%rax 0.00 : 90eaa8: 48 8b 40 08 mov 0x8(%rax),%rax 0.00 : 90eaac: 48 8b 55 f8 mov -0x8(%rbp),%rdx 0.00 : 90eab0: 48 c1 e2 03 shl $0x3,%rdx 0.00 : 90eab4: 48 01 d0 add %rdx,%rax 0.00 : 90eab7: 48 8b 00 mov (%rax),%rax 0.00 : 90eaba: 48 89 45 e8 mov %rax,-0x18(%rbp) : * Pick up the first item in this bucket's chain. If chain is not empty : * we can begin searching it. Otherwise we have to advance to find the : * next nonempty bucket. We try to optimize that case since searching a : * near-empty hashtable has to iterate this loop a lot. : */ : while ((curElem = segp[segment_ndx]) == NULL) 0.00 : 90eabe: eb 62 jmp 90eb22
: { : /* empty bucket, advance to next */ : if (++curBucket > max_bucket) /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1406 10.68 : 90eac0: 83 45 e4 01 addl $0x1,-0x1c(%rbp) 0.02 : 90eac4: 8b 45 e4 mov -0x1c(%rbp),%eax 0.07 : 90eac7: 3b 45 bc cmp -0x44(%rbp),%eax 0.02 : 90eaca: 76 20 jbe 90eaec
: { : status->curBucket = curBucket; 0.00 : 90eacc: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90ead0: 8b 55 e4 mov -0x1c(%rbp),%edx 0.00 : 90ead3: 89 50 08 mov %edx,0x8(%rax) : hash_seq_term(status); 0.00 : 90ead6: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90eada: 48 89 c7 mov %rax,%rdi 0.00 : 90eadd: e8 99 00 00 00 callq 90eb7b
: return NULL; /* search is done */ 0.00 : 90eae2: b8 00 00 00 00 mov $0x0,%eax 0.00 : 90eae7: e9 8d 00 00 00 jmpq 90eb79
: } : if (++segment_ndx >= ssize) /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1412 9.97 : 90eaec: 48 83 45 f0 01 addq $0x1,-0x10(%rbp) 0.00 : 90eaf1: 48 8b 45 f0 mov -0x10(%rbp),%rax 0.28 : 90eaf5: 48 3b 45 c0 cmp -0x40(%rbp),%rax 0.00 : 90eaf9: 7c 27 jl 90eb22
: { : segment_num++; 0.16 : 90eafb: 48 83 45 f8 01 addq $0x1,-0x8(%rbp) : segment_ndx = 0; 0.46 : 90eb00: 48 c7 45 f0 00 00 00 movq $0x0,-0x10(%rbp) 0.00 : 90eb07: 00 : segp = hashp->dir[segment_num]; 0.11 : 90eb08: 48 8b 45 d0 mov -0x30(%rbp),%rax 0.00 : 90eb0c: 48 8b 40 08 mov 0x8(%rax),%rax 0.18 : 90eb10: Can't annotate copy_user_generic_string: No vmlinux file with build id 7f980c04dad5ff7955a0035ceddb54b83d0b1ab1 was found in the path.Please use 'perf buildid-cache -av vmlinux' or --vmlinux vmlinux.symbol__annotate: filename=/root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be, sym=__GI_vfprintf, start=0x7fed22c73d10, end=0x7fed22c78e78annotating [0xba0c30] /lib64/libc-2.12.so : [0x1c7c310] __GI_vfprintfExecuting: objdump --start-address=0x0000000000043d10 --stop-address=0x0000000000048e78 -dS -C /root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be|grep -v /root/.debug/.build-id/3a/c348a69f62bfc2280da1a8188173961bb2e9be|expa 48 8b 55 f8 mov -0x8(%rbp),%rdx 0.02 : 90eb14: 48 c1 e2 03 shl $0x3,%rdx 0.11 : 90eb18: 48 01 d0 add %rdx,%rax 0.09 : 90eb1b: 48 8b 00 mov (%rax),%rax 0.34 : 90eb1e: 48 89 45 e8 mov %rax,-0x18(%rbp) : * Pick up the first item in this bucket's chain. If chain is not empty : * we can begin searching it. Otherwise we have to advance to find the : * next nonempty bucket. We try to optimize that case since searching a : * near-empty hashtable has to iterate this loop a lot. : */ : while ((curElem = segp[segment_ndx]) == NULL) /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 10.68 : 90eb22: 48 8b 45 f0 mov -0x10(%rbp),%rax 0.05 : 90eb26: 48 8d 14 c5 00 00 00 lea 0x0(,%rax,8),%rdx 0.00 : 90eb2d: 00 0.05 : 90eb2e: 48 8b 45 e8 mov -0x18(%rbp),%rax 0.14 : 90eb32: 48 01 d0 add %rdx,%rax 14.16 : 90eb35: 48 8b 00 mov (%rax),%rax 16.20 : 90eb38: 48 89 45 d8 mov %rax,-0x28(%rbp) /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 3.76 : 90eb3c: 48 83 7d d8 00 cmpq $0x0,-0x28(%rbp) /home/digoal/postgresql-9.6.1/src/backend/utils/hash/dynahash.c:1403 (discriminator 1) 32.43 : 90eb41: 0f 84 79 ff ff ff je 90eac0
: segp = hashp->dir[segment_num]; : } : } : : /* Begin scan of curBucket... */ : status->curEntry = curElem->link; 0.00 : 90eb47: 48 8b 45 d8 mov -0x28(%rbp),%rax 0.00 : 90eb4b: 48 8b 10 mov (%rax),%rdx 0.00 : 90eb4e: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90eb52: 48 89 50 10 mov %rdx,0x10(%rax) : if (status->curEntry == NULL) /* end of this bucket */ 0.00 : 90eb56: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90eb5a: 48 8b 40 10 mov 0x10(%rax),%rax 0.00 : 90eb5e: 48 85 c0 test %rax,%rax 0.00 : 90eb61: 75 04 jne 90eb67
: ++curBucket; 0.00 : 90eb63: 83 45 e4 01 addl $0x1,-0x1c(%rbp) : status->curBucket = curBucket; 0.00 : 90eb67: 48 8b 45 a8 mov -0x58(%rbp),%rax 0.00 : 90eb6b: 8b 55 e4 mov -0x1c(%rbp),%edx 0.00 : 90eb6e: 89 50 08 mov %edx,0x8(%rax) : return (void *) ELEMENTKEY(curElem); 0.00 : 90eb71: 48 8b 45 d8 mov -0x28(%rbp),%rax 0.00 : 90eb75: 48 83 c0 10 add $0x10,%rax : } 0.00 : 90eb79: c9 leaveq Sorted summary for file /lib64/libc-2.12.so---------------------------------------------- 4.34 ??:0 3.71 ??:0 3.24 ??:0 2.93 ??:0 2.93 ??:0 略

perf bench

用来测试系统的一些常见指标的性能(如IPC, message or pipe, memcpy)。

perf-bench - General framework for benchmark suitesSYNOPSIS       perf bench [
]
[
]SUBSYSTEM sched Scheduler and IPC mechanisms. SUITES FOR sched messaging Suite for evaluating performance of scheduler and IPC mechanisms. Based on hackbench by Rusty Russell. Options of messaging -p, --pipe Use pipe() instead of socketpair() -t, --thread Be multi thread instead of multi process -g, --group= Specify number of groups -l, --loop= Specify number of loops Example of messaging .ft C % perf bench sched messaging # run with default options (20 sender and receiver processes per group) (10 groups == 400 processes run) Total time:0.308 sec % perf bench sched messaging -t -g 20 # be multi-thread, with 20 groups (20 sender and receiver threads per group) (20 groups == 800 threads run) Total time:0.582 sec .ft pipe Suite for pipe() system call. Based on pipe-test-1m.c by Ingo Molnar. Options of pipe -l, --loop= Specify number of loops. Example of pipe .ft C % perf bench sched pipe (executing 1000000 pipe operations between two tasks) Total time:8.091 sec 8.091833 usecs/op 123581 ops/sec % perf bench sched pipe -l 1000 # loop 1000 (executing 1000 pipe operations between two tasks) Total time:0.016 sec 16.948000 usecs/op 59004 ops/sec .ft

例子

测试进程或线程通信性能

使用200个进程#perf bench sched messaging -g 200 -l 100# Running sched/messaging benchmark...# 20 sender and receiver processes per group# 200 groups == 8000 processes run     Total time: 2.665 [sec]
使用200个线程代替进程#perf bench sched messaging -g 200 -l 100 -t# Running sched/messaging benchmark...# 20 sender and receiver threads per group# 200 groups == 8000 threads run     Total time: 2.025 [sec]
使用线程, 使用pipe()代替socketpair()#perf bench sched messaging -g 200 -l 100 -t -p# Running sched/messaging benchmark...# 20 sender and receiver threads per group# 200 groups == 8000 threads run     Total time: 1.028 [sec]
使用进程, 使用pipe()代替socketpair()#perf bench sched messaging -g 200 -l 100 -p# Running sched/messaging benchmark...# 20 sender and receiver processes per group# 200 groups == 8000 processes run     Total time: 1.126 [sec]

从测试来看,线程, pipe()通信效率是最高的。

测试pipe()

#perf bench sched pipe -l 100000# Running sched/pipe benchmark...# Executed 100000 pipe operations between two tasks     Total time: 0.785 [sec]       7.851860 usecs/op         127358 ops/sec

测试所有,包括memcpy

#perf bench all# Running sched/messaging benchmark...# 20 sender and receiver processes per group# 10 groups == 400 processes run     Total time: 0.102 [sec]# Running sched/pipe benchmark...# Executed 1000000 pipe operations between two tasks     Total time: 7.967 [sec]       7.967202 usecs/op         125514 ops/sec# Running mem/memcpy benchmark...# Copying 1MB Bytes from 0x7f360c80c010 to 0x7f360f5d9010 ...     855.431993 MB/Sec

perf stat

perf 除了可以采样(使用perf record)(包括call stack trace),还可以用于event计数。

perf stat就是用于event计数的,可以跟踪指定命令的event计数。

NAME       perf-stat - Run a command and gather performance counter statisticsSYNOPSIS       perf stat [-e 
| --event=EVENT] [-a]
perf stat [-e
| --event=EVENT] [-a] —
[
]DESCRIPTION This command runs a command and gathers performance counter statistics from it.跟踪指定command,或者pid, tid,或指定cgroup的event计数
... Any command you can specify in a shell. -r, --repeat=
重复执行N次命令 repeat command and print average + stddev (max: 100) -p, --pid=
stat events on existing process id -t, --tid=
stat events on existing thread id -G name, --cgroup name monitor only in the container (cgroup) called "name". This option is available only in per-cpu mode. The cgroup filesystem must be mounted. All threads belonging to container "name" are monitored when they run on the monitored CPUs. Multiple cgroups can be provided. Each cgroup is applied to the corresponding event, i.e., first cgroup to first event, second cgroup to second event and so on. It is possible to provide an empty cgroup (monitor all the time) using, e.g., -G foo,,bar. Cgroups must have corresponding events, i.e., they always refer to events defined earlier on the command line.只跟踪指定事件 -e, --event= Select the PMU event. Selection can be a symbolic event name (use perf list to list all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a hexadecimal event descriptor.跟踪所有CPU -a, --all-cpus system-wide collection from all CPUs输出相关 -c, --scale scale/normalize counter values -B, --big-num print large numbers with thousands′ separators according to locale -v, --verbose be more verbose (show counter open errors, etc)

例子

124331 digoal  20   0 93.0g  38g  38g S  0.0 20.6   0:28.21 /home/digoal/pgsql9.6/bin/postgres                                                                                                                                           124589 digoal  20   0  125m  972  464 S  0.0  0.0   0:00.03 postgres: logger process                                                                                                                                                       124592 digoal  20   0 93.0g 2.1g 2.1g S  0.0  1.1   0:11.64 postgres: checkpointer process                                                                                                                                                 124593 digoal  20   0 93.0g 2.1g 2.1g S  0.0  1.1   0:13.18 postgres: writer process                                                                                                                                                       124594 digoal  20   0 93.0g 129m 128m S  0.0  0.1   0:03.82 postgres: wal writer process                                                                                                                                                   124595 digoal  20   0 93.0g 2172 1360 S  0.0  0.0   0:00.43 postgres: autovacuum launcher process                                                                                                                                          124596 digoal  20   0  127m 1076  548 S  0.0  0.0   0:00.33 postgres: archiver process   failed on 00000001000000820000002D                                                                                                                124597 digoal  20   0  127m 1132  560 S  0.0  0.0   0:00.93 postgres: stats collector processpgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 1000
跟踪10秒 wal writer进程#perf stat -p 124594 -a -c -B -v sleep 10task-clock: 240196875907 240196865902 240196818048context-switches: 966775 240197106316 240197074990CPU-migrations: 24926 240197787932 240197772224page-faults: 350142 240196568475 240196568475cycles: 376379394416 239735813498 198569141526stalled-cycles-frontend: 305098988594 239733245755 199380423076stalled-cycles-backend: 205940630988 239730830300 160173568784instructions: 233667165879 239725298183 200717526856branches: 38305737741 239719766523 201249886312branch-misses: 524692310 239712708191 199567372892 Performance counter stats for process id '124594':     240196.875907 task-clock                #   23.997 CPUs utilized           [100.00%]           966,775 context-switches          #    0.004 M/sec                   [100.00%]            24,926 CPU-migrations            #    0.000 M/sec                   [100.00%]           350,142 page-faults               #    0.001 M/sec                     376,379,394,416 cycles                    #    1.567 GHz                     [82.83%]    305,098,988,594 stalled-cycles-frontend   #   81.06% frontend cycles idle    [83.17%]   205,940,630,988 stalled-cycles-backend    #   54.72% backend  cycles idle    [66.81%]   233,667,165,879 instructions              #    0.62  insns per cycle     注意一个cycle只执行了0.62条指令,说明有大量的等待,并没有充分发挥CPU的性能,也说明了系统存在瓶颈。                                                    #    1.31  stalled cycles per insn [83.73%]  每个指令的空闲CPU周期,越大说明CPU越闲,没干活。      38,305,737,741 branches                  #  159.476 M/sec                   [83.95%]       524,692,310 branch-misses             #    1.37% of all branches         [83.25%]      10.009649400 seconds time elapsed如果insns per cycle很大,但是实际软件的运行效率确很低时,可能是类似SPIN LOCK,导致CPU空转,实际没有干活。

如果你想跟踪PostgreSQL数据库,可以把数据库的所有进程塞到CGROUP里,然后使用perf stat -G cgname统计整个CGROUP。

perf stat还支持-e指定事件,事件支持通配符。

$ sudo perf stat -e block:block_rq_*,syscalls:sys_enter_write,syscalls:sys_enter_fsync -a -r 5 -- psql -q -U postgres craig -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;"; Performance counter stats for 'psql -U postgres craig -c drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;' (5 runs):     0 block:block_rq_abort                       [100.00%]     0 block:block_rq_requeue                     [100.00%]    97 block:block_rq_complete      ( +- 14.82% ) [100.00%]    96 block:block_rq_insert        ( +- 14.97% ) [100.00%]     98 block:block_rq_issue         ( +- 14.67% ) [100.00%]     0 block:block_rq_remap                       [100.00%]10,607 syscalls:sys_enter_write     ( +-  0.17% ) [100.00%]     1 syscalls:sys_enter_fsync                                    0.908835058 seconds time elapsed    ( +- 18.31% )或者perf stat -e block:*,syscalls:* -a -r 5 -- psql -q -U postgres craig -c "drop table if exists x; create table x as select a FROM generate_series(1,1000000) a;";

perf event modifier

使用instruction:modifier可以指定要跟踪的instruction在哪里?在kernel space或user space,又或者在虚拟机,虚拟机OS,宿主机OS等。

modifier用法如下,写在event:后面。

Modifiers Description Example
u monitor at priv level 3, 2, 1 (user) event:u
k monitor at priv level 0 (kernel) event:k
h monitor hypervisor events on a virtualization environment event:h
H monitor host machine on a virtualization environment event:H
G monitor guest machine on a virtualization environment event:G

例子 -e instructions:u

perf stat -e instructions:u dd if=/dev/zero of=/dev/null count=100000

perf高级用法 - 动态跟踪

perf probe可以实现动态跟踪,指哪打哪。静态跟踪是预置的,而动态跟踪是补充预置不足的。

比如你想跟踪kernel的某个function, 甚至某一行代码,某些变量的值。

或者你想跟踪用户软件的某个function,甚至某一行代码,某些变量的值。

首先要添加需要动态跟踪的对象(function, var, ...)

然后record,和report分析,这和前面的用法是一样的。

例子

Listing variables available for tcp_sendmsg():# perf probe -V tcp_sendmsgAvailable variables at tcp_sendmsg        @
size_t size struct kiocb* iocb struct msghdr* msg struct sock* skCreating a probe for tcp_sendmsg() with the "size" variable:# perf probe --add 'tcp_sendmsg size'Added new event: probe:tcp_sendmsg (on tcp_sendmsg with size)You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1Tracing this probe:# perf record -e probe:tcp_sendmsg -a^C[ perf record: Woken up 1 times to write data ][ perf record: Captured and wrote 0.052 MB perf.data (~2252 samples) ]# perf script# ========# captured on: Fri Jan 31 23:49:55 2014# hostname : dev1# os release : 3.13.1-ubuntu-12-opt# perf version : 3.13.1# arch : x86_64# nrcpus online : 2# nrcpus avail : 2# cpudesc : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz# cpuid : GenuineIntel,6,44,2# total memory : 1796024 kB# cmdline : /usr/bin/perf record -e probe:tcp_sendmsg -a # event : name = probe:tcp_sendmsg, type = 2, config = 0x1dd, config1 = 0x0, config2 = ...# HEADER_CPU_TOPOLOGY info available, use -I to display# HEADER_NUMA_TOPOLOGY info available, use -I to display# pmu mappings: software = 1, tracepoint = 2, breakpoint = 5# ========# sshd 1301 [001] 502.424719: probe:tcp_sendmsg: (ffffffff81505d80) size=b0 sshd 1301 [001] 502.424814: probe:tcp_sendmsg: (ffffffff81505d80) size=40 sshd 2371 [000] 502.952590: probe:tcp_sendmsg: (ffffffff81505d80) size=27 sshd 2372 [000] 503.025023: probe:tcp_sendmsg: (ffffffff81505d80) size=3c0 sshd 2372 [001] 503.203776: probe:tcp_sendmsg: (ffffffff81505d80) size=98 sshd 2372 [001] 503.281312: probe:tcp_sendmsg: (ffffffff81505d80) size=2d0 sshd 2372 [001] 503.461358: probe:tcp_sendmsg: (ffffffff81505d80) size=30 sshd 2372 [001] 503.670239: probe:tcp_sendmsg: (ffffffff81505d80) size=40 sshd 2372 [001] 503.742565: probe:tcp_sendmsg: (ffffffff81505d80) size=140 sshd 2372 [001] 503.822005: probe:tcp_sendmsg: (ffffffff81505d80) size=20 sshd 2371 [000] 504.118728: probe:tcp_sendmsg: (ffffffff81505d80) size=30 sshd 2371 [000] 504.192575: probe:tcp_sendmsg: (ffffffff81505d80) size=70[...]The size is shown as hexadecimal.

跟踪某行代码

# perf probe -L tcp_sendmsg
0 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg, size_t size) 2 { struct iovec *iov; struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; 6 int iovlen, flags, err, copied = 0; 7 int mss_now = 0, size_goal, copied_syn = 0, offset = 0; bool sg; long timeo;[...] 79 while (seglen > 0) { int copy = 0; 81 int max = size_goal; skb = tcp_write_queue_tail(sk); 84 if (tcp_send_head(sk)) { 85 if (skb->ip_summed == CHECKSUM_NONE) max = mss_now; 87 copy = max - skb->len; } 90 if (copy <= 0) { new_segment:[...]# perf probe -V tcp_sendmsg:81Available variables at tcp_sendmsg:81 @
bool sg int copied int copied_syn int flags int mss_now int offset int size_goal long int timeo size_t seglen struct iovec* iov struct sock* sk unsigned char* fromNow lets trace line 81, with the seglen variable that is checked in the loop:# perf probe --add 'tcp_sendmsg:81 seglen'Added new event: probe:tcp_sendmsg (on tcp_sendmsg:81 with seglen)You can now use it in all perf tools, such as: perf record -e probe:tcp_sendmsg -aR sleep 1# perf record -e probe:tcp_sendmsg -a^C[ perf record: Woken up 1 times to write data ][ perf record: Captured and wrote 0.188 MB perf.data (~8200 samples) ]# perf script sshd 4652 [001] 2082360.931086: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x80 app_plugin.pl 2400 [001] 2082360.970489: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x20 postgres 2422 [000] 2082360.970703: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x52 app_plugin.pl 2400 [000] 2082360.970890: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x7b postgres 2422 [001] 2082360.971099: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0xb app_plugin.pl 2400 [000] 2082360.971140: probe:tcp_sendmsg: (ffffffff81642ca9) seglen=0x55[...]

跟踪用户软件的指定function

# perf probe -x /lib/x86_64-linux-gnu/libc-2.15.so --add mallocAdded new event:  probe_libc:malloc    (on 0x82f20)You can now use it in all perf tools, such as:    perf record -e probe_libc:malloc -aR sleep 1Tracing it system-wide:# perf record -e probe_libc:malloc -a^C[ perf record: Woken up 12 times to write data ][ perf record: Captured and wrote 3.522 MB perf.data (~153866 samples) ]The report:# perf report -n[...]# Samples: 45K of event 'probe_libc:malloc'# Event count (approx.): 45158## Overhead       Samples          Command  Shared Object      Symbol# ........  ............  ...............  .............  ..........#    42.72%         19292       apt-config  libc-2.15.so   [.] malloc    19.71%          8902             grep  libc-2.15.so   [.] malloc     7.88%          3557             sshd  libc-2.15.so   [.] malloc     6.25%          2824              sed  libc-2.15.so   [.] malloc     6.06%          2738            which  libc-2.15.so   [.] malloc     4.12%          1862  update-motd-upd  libc-2.15.so   [.] malloc     3.72%          1680             stat  libc-2.15.so   [.] malloc     1.68%           758            login  libc-2.15.so   [.] malloc     1.21%           546        run-parts  libc-2.15.so   [.] malloc     1.21%           545               ls  libc-2.15.so   [.] malloc     0.80%           360        dircolors  libc-2.15.so   [.] malloc     0.56%           252               tr  libc-2.15.so   [.] malloc     0.54%           242              top  libc-2.15.so   [.] malloc     0.49%           222       irqbalance  libc-2.15.so   [.] malloc     0.44%           200             dpkg  libc-2.15.so   [.] malloc     0.38%           173         lesspipe  libc-2.15.so   [.] malloc     0.29%           130  update-motd-fsc  libc-2.15.so   [.] malloc     0.25%           112            uname  libc-2.15.so   [.] malloc     0.24%           108              cut  libc-2.15.so   [.] malloc     0.23%           104           groups  libc-2.15.so   [.] malloc     0.21%            94  release-upgrade  libc-2.15.so   [.] malloc     0.18%            82        00-header  libc-2.15.so   [.] malloc     0.14%            62             mesg  libc-2.15.so   [.] malloc     0.09%            42  update-motd-reb  libc-2.15.so   [.] malloc     0.09%            40             date  libc-2.15.so   [.] malloc     0.08%            35             bash  libc-2.15.so   [.] malloc     0.08%            35         basename  libc-2.15.so   [.] malloc     0.08%            34          dirname  libc-2.15.so   [.] malloc     0.06%            29               sh  libc-2.15.so   [.] malloc     0.06%            26        99-footer  libc-2.15.so   [.] malloc     0.05%            24              cat  libc-2.15.so   [.] malloc     0.04%            18             expr  libc-2.15.so   [.] malloc     0.04%            17         rsyslogd  libc-2.15.so   [.] malloc     0.03%            12             stty  libc-2.15.so   [.] malloc     0.00%             1             cron  libc-2.15.so   [.] mallocThis shows the most malloc() calls were by apt-config, while I was tracing.User: malloc() with sizeAs of the Linux 3.13.1 kernel, this is not supported yet:# perf probe -x /lib/x86_64-linux-gnu/libc-2.15.so --add 'malloc size'Debuginfo-analysis is not yet supported with -x/--exec option.  Error: Failed to add events. (-38)As a workaround, you can access the registers (on Linux 3.7+). For example, on x86_64:# perf probe -x /lib64/libc-2.17.so '--add=malloc size=%di'       probe_libc:malloc    (on 0x800c0 with size=%di)

perf tracepoint

event中的一种类型,实际上是一些比较常见的系统调用。

不在里面的可以使用前面介绍的动态跟踪的方式进行跟踪。

支持哪些tracepoint

perf list | awk -F: '/Tracepoint event/ { lib[$1]++ } END {for (l in lib) { printf "  %-16s %d\n", l, lib[l] } }' | sort | column    block          18       jbd2           11       kvmmmu         9        napi           1        sched          15       skb            3        timer          12       writeback      16    ext4           46       kmem           42       mce            1        net            4        scsi           5        sock           2        udp            1        xfs            314    irq            5        kvm            21       module         5        power          3        signal         2        syscalls       548      workqueue      4perf list ......  xfs:xfs_attr_list_sf                               [Tracepoint event]  xfs:xfs_attr_list_sf_all                           [Tracepoint event]  xfs:xfs_attr_list_leaf                             [Tracepoint event]  xfs:xfs_attr_list_leaf_end                         [Tracepoint event]  xfs:xfs_attr_list_full                             [Tracepoint event]  xfs:xfs_attr_list_add                              [Tracepoint event]......

主要包含以下tracepoint subtype

block: block device I/Oext3, ext4: file system operationskmem: kernel memory allocation eventsrandom: kernel random number generator eventssched: CPU scheduler eventssyscalls: system call enter and exitstask: task events

例子

I used perf_events to record the block request (disk I/O) issue and completion static tracepoints:# perf record -e block:block_rq_issue -e block:block_rq_complete -a sleep 120[ perf record: Woken up 36 times to write data ][ perf record: Captured and wrote 8.885 MB perf.data (~388174 samples) ]# perf script[...]     randread.pl  2522 [000]  6011.824759: block:block_rq_issue: 254,16 R 0 () 7322849 + 16 [randread.pl]     randread.pl  2520 [000]  6011.824866: block:block_rq_issue: 254,16 R 0 () 26144801 + 16 [randread.pl]         swapper     0 [000]  6011.828913: block:block_rq_complete: 254,16 R () 31262577 + 16 [0]     randread.pl  2521 [000]  6011.828970: block:block_rq_issue: 254,16 R 0 () 70295937 + 16 [randread.pl]         swapper     0 [000]  6011.835862: block:block_rq_complete: 254,16 R () 26144801 + 16 [0]     randread.pl  2520 [000]  6011.835932: block:block_rq_issue: 254,16 R 0 () 5495681 + 16 [randread.pl]         swapper     0 [000]  6011.837988: block:block_rq_complete: 254,16 R () 7322849 + 16 [0]     randread.pl  2522 [000]  6011.838051: block:block_rq_issue: 254,16 R 0 () 108589633 + 16 [randread.pl]         swapper     0 [000]  6011.850615: block:block_rq_complete: 254,16 R () 108589633 + 16 [0][...]

绘制perf火焰图

使用perf report -tui或者-stdio输出的文本不够直观的话,使用火焰图可以很直观的表现出哪些代码是瓶颈所在。

压测$pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 100收集统计信息#perf record -a -g -v sleep 30

生成火焰图

# git clone https://github.com/brendangregg/FlameGraph      # or download it from github# mv perf.data FlameGraph/# cd FlameGraph# perf script | ./stackcollapse-perf.pl > out.perf-folded# cat out.perf-folded | ./flamegraph.pl > perf-kernel.svg

绘制perf热力图

压测$pgbench -M prepared -n -r -P 1 -c 32 -j 32 -T 100收集统计信息#perf record -a -g -v sleep 30

生成热力图

# git clone https://github.com/brendangregg/HeatMap      # or download it from github# mv perf.data HeatMap/# cd HeatMap# perf script | awk '{ gsub(/:/, "") } $5 ~ /issue/ { ts[$6, $10] = $4 }    $5 ~ /complete/ { if (l = ts[$6, $9]) { printf "%.f %.f\n", $4 * 1000000,    ($4 - l) * 1000000; ts[$6, $10] = 0 } }' > out.lat_us# ./trace2heatmap.pl --unitstime=us --unitslat=us --maxlat=50000 out.lat_us > out.svg

小结

要完备的跟踪和打印跟踪(符号表、call stack trace、汇编指令)信息,建议内核编译时加上

CONFIG_KALLSYMS=yCONFIG_FRAME_POINTER=y

编译perf时需要支持libunwind, 并加上

gcc -g dwarf

软件编译时加上

gcc -g -ggdb -fno-omit-frame-pointer

如果是yum安装的软件,可以安装对应的debuginfo包。

参考

1. 

2. 

3. 

4. 

5. 火焰图

6. 热力图

7. 

8. 

转载地址:http://tntta.baihongyu.com/

你可能感兴趣的文章
Mellanox公司计划利用系统芯片提升存储产品速度
查看>>
白帽子守护网络安全,高薪酬成大学生就业首选!
查看>>
ARM想将芯片装进人类大脑 降低能耗是一大挑战
查看>>
Oracle数据库的备份方法
查看>>
Selenium 自动登录考勤系统
查看>>
关于如何以编程的方式执行TestNG
查看>>
智能照明造福千家万户 家居智能不再是梦
查看>>
物联网如何跳出“看起来很美”?
查看>>
浅谈MySQL 数据库性能优化
查看>>
《UNIX/Linux 系统管理技术手册(第四版)》——1.10 其他的权威文档
查看>>
灵动空间 创享生活
查看>>
《UNIX网络编程 卷1:套接字联网API(第3版)》——8.6 UDP回射客户程序:dg_cli函数...
查看>>
不要将时间浪费到编写完美代码上
查看>>
《算法基础:打开算法之门》一3.4 归并排序
查看>>
高德开放平台开放源代码 鼓励开发者创新
查看>>
《高并发Oracle数据库系统的架构与设计》一2.5 索引维护
查看>>
Firefox 是 Pwn2own 2014 上攻陷次数最多的浏览器
查看>>
阿里感悟(十八)- 应届生Review
查看>>
话说模式匹配(5) for表达式中的模式匹配
查看>>
《锋利的SQL(第2版)》——1.7 常用函数
查看>>