Note/CUDA/Profiling on Titan
http://pepper.is.sci.toho-u.ac.jp/pepper/index.php?%A5%CE%A1%BC%A5%C8%2FCUDA%2FTitan%A4%C7%A5%D7%A5%ED%A5%D5%A5%A1%A5%A4%A5%EB
Last updated: 2013-07-31 (Wed) 11:49:35
Previous page: Note/CUDA/CUDA5.5+Fedora19
The GUI-based profiler will be covered later.
I struggled to get VNC working → Note/CUDA/VNC on Titan (remote windows) → why on earth? (It feels as if the GNOME desktop somehow doesn't agree with where Fedora 19 puts its files, but everything works fine without VNC, i.e. on the native display, so what exactly is different?)
A thread here means the same thing as an ordinary thread: one flow of execution through the program.
On the hardware side, threads are executed on the GPU's multiprocessors in groups of 32 (warps).
On the program side, each thread identifies itself through blockIdx and threadIdx and processes its own share of the work, as in the sketch below.
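To make the program-side view concrete, here is a minimal sketch of a gpupi-like numerical-integration kernel. It is not the actual gpupi source: only the Kernel(double*) signature is taken from the nvprof output below, while N, the grid-stride loop, and the host code are assumptions for illustration.

 #include <cstdio>
 
 #define N 1000000000L                 // number of integration strips (assumed value)
 
 __global__ void Kernel(double *sum)   // same signature as "Kernel(double*)" in the nvprof output
 {
     long tid     = (long)blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global ID
     long nthread = (long)gridDim.x  * blockDim.x;                // total number of threads
     double h = 1.0 / (double)N, s = 0.0;
     for (long i = tid; i < N; i += nthread) {                    // grid-stride loop
         double x = h * ((double)i + 0.5);
         s += 4.0 / (1.0 + x * x);      // pi = integral of 4/(1+x^2) over [0,1]
     }
     sum[tid] = s * h;                  // per-thread partial sum
 }
 
 int main(void)
 {
     const int THREAD = 8, BLOCK = 4;   // the launch configuration varied in the runs below
     double *d, partial[THREAD * BLOCK], pi = 0.0;
     cudaMalloc(&d, sizeof(double) * THREAD * BLOCK);
     Kernel<<<BLOCK, THREAD>>>(d);
     cudaMemcpy(partial, d, sizeof(double) * THREAD * BLOCK, cudaMemcpyDeviceToHost);
     for (int i = 0; i < THREAD * BLOCK; i++) pi += partial[i];   // host adds the partial sums
     printf("GPU result = %.12f\n", pi);
     cudaFree(d);
     return 0;
 }

Each thread is literally its own flow through Kernel(); the hardware then bundles 32 of them into a warp, which is why configurations with fewer than 32 threads per block leave most of each warp idle (see the measurements below).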
Reference articles
The sites I referred to:
Running nvprof --help shows which parameters can be specified:
Usage: nvprof [options] [CUDA-application] [application-arguments] Options: -o, --output-profile <file name> Output the result file which can be imported later or opened by the NVIDIA Visual Profiler. "%p" in the file name string is replaced with the process ID of the application being profiled. "%h" in the file name string is replaced with the hostname of the system. "%%" in the file name string is replaced with "%". Any other character following "%" is illegal. By default, this option disables the summary output. NOTE: If the application being profiled creates child processes, or if '--profile-all-processes' is used, the "%p" format is needed to get correct output files for each process. -i, --import-profile <file name> Import a result profile from a previous run. -s, --print-summary Print a summary of the profiling result on screen. NOTE: This is the default unless "--output-profile" or the print trace options are used. --print-gpu-trace Print individual kernel invocations (including CUDA memcpy's/memset's) and sort them in chronological order. In event/metric profiling mode, show events/metrics for each kernel invocation. --print-api-trace Print CUDA runtime/driver API trace. --csv Use comma-separated values in the output. -u, --normalized-time-unit <s|ms|us|ns|col|auto> Specify the unit of time that will be used in the output. Allowed values: s - second, ms - millisecond, us - microsecond, ns - nanosecond col - a fixed unit for each column auto (default) - nvprof chooses the scale for each time value based on its length -t, --timeout <seconds> Set an execution timeout (in seconds) for the CUDA application. NOTE: Timeout starts counting from the moment the CUDA driver is initialized. If the application doesn't call any CUDA APIs, timeout won't be triggered. --demangling <on|off> Turn on/off C++ name demangling of kernel names. Allowed values: on - turn on demangling (default) off - turn off demangling --events <event names> Specify the events to be profiled on certain device(s). Multiple event names separated by comma can be specified. Which device(s) are profiled is controlled by the "--devices" option. Otherwise events will be collected on all devices. For a list of available events, use "--query-events". Use "--devices" and "--kernels" to select a specific kernel invocation. --metrics <metric names> Specify the metrics to be profiled on certain device(s). Multiple metric names separated by comma can be specified. Which device(s) are profiled is controlled by the "--devices" option. Otherwise metrics will be collected on all devices. For a list of available metrics, use "--query-metrics". Use "--devices" and "--kernels" to select a specific kernel invocation. --analysis-metrics Collect profiling data that can be imported to Visual Profiler's "analysis" mode. NOTE: Use "--output-profile" to specify an output file. --devices <device ids> This option changes the scope of subsequent "--events", "--metrics", "--query-events" and "--query-metrics" options. Allowed values: all - change scope to all valid devices comma-separated device IDs - change scope to specified devices --kernels <kernel path syntax> This option changes the scope of subsequent "--events", "--metrics" options The syntax is as following: <context id/name>:<stream id/name>:<kernel name> :<invocation> The context/stream IDs, names and invocation count can be regular expressions. Empty string matches any number or characters. 
If <context id/name> or <stream id/name> is a number, it's matched against both the context/stream id and name specified by the NVTX library. Otherwise it's matched against the context/stream name. Example: --kernels "1:foo:bar:2" - profile any kernel whose name contains "bar" and was the 2nd instance on context 1 and on stream named "foo". --query-events List all the events available on the device(s). Device(s) queried can be controlled by the "--devices" option. --query-metrics List all the metrics available on the device(s). Device(s) queried can be controlled by the "--devices" option. --concurrent-kernels <on|off> Turn on/off concurrent kernel execution. If concurrent kernel execution is off, all kernels running on one device will be serialized. Allowed values: on - turn on concurrent kernel execution (default) off - turn off concurrent kernel execution --profile-from-start <on|off> Enable/disable profiling from the start of the application. If it's disabled, the application can use {cu,cuda}Profiler{Start,Stop} to turn on/off profiling. Allowed values: on - enable profiling from start (default) off - disable profiling from start --aggregate-mode <on|off> This option turns on/off aggregate mode for events and metrics specified by subsequent "--events" and "--metrics" options. Those event/metric values will be collected for each domain instance, instead of the whole device. Allowed values: on - turn on aggregate mode (default) off - turn off aggregate mode --system-profiling <on|off> Turn on/off power, clock, and thermal profiling. Allowed values: on - turn on system profiling off - turn off system profiling (default) --log-file <file name> Make nvprof send all its output to the specified file, or one of the standard channels. The file will be overwritten. If the file doesn't exist, a new one will be created. "%1" as the whole file name indicates standard output channel (stdout). "%2" as the whole file name indicates standard error channel (stderr). NOTE: This is the default. "%p" in the file name string is replaced with nvprof's process ID. "%h" in the file name string is replaced with the hostname of the system. "%%" in the file name is replaced with "%". Any other character following "%" is illegal. --quiet Suppress all nvprof output. --profile-child-processes Profile the application and all child processes launched by it. --profile-all-processes Profile all processes launched by the same user who launched this nvprof instance. NOTE: Only one instance of nvprof can run with this option at the same time. Under this mode, there's no need to specify an application to run. -V --version Print version information of this tool. -h, --help Print this help information.
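Among these options, --profile-from-start off is useful when only part of the run should be profiled: the application then brackets the interesting region itself with the CUDA profiler API. A minimal sketch, assuming the standard cuda_profiler_api.h runtime calls (the work kernel is just a stand-in, not gpupi):

 #include <cuda_profiler_api.h>            // cudaProfilerStart / cudaProfilerStop
 
 __global__ void work(double *x) { x[0] = 1.0; }   // stand-in for the real kernel
 
 int main(void)
 {
     double *d;
     cudaMalloc(&d, sizeof(double));
     work<<<8, 1024>>>(d);                 // setup / warm-up: not captured
     cudaDeviceSynchronize();
     cudaProfilerStart();                  // nvprof begins collecting here
     work<<<8, 1024>>>(d);                 // only this launch is captured
     cudaDeviceSynchronize();
     cudaProfilerStop();                   // collection stops here
     cudaFree(d);
     return 0;
 }

Run it as nvprof --profile-from-start off ./a.out and only the bracketed launch shows up in the summary.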
The kinds of events that can be measured are listed by running:
 nvprof --query-events ./gpupi
Available Events: Name Description Device 0 (GeForce GTX TITAN): Domain domain_a: tex0_cache_sector_queries: Number of texture cache 0 requests. This increments by 1 for each 32-byte access. tex1_cache_sector_queries: Number of texture cache 1 requests. This increments by 1 for each 32-byte access. tex2_cache_sector_queries: Number of texture cache 2 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units. tex3_cache_sector_queries: Number of texture cache 3 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units. tex0_cache_sector_misses: Number of texture cache 0 misses. This increments by 1 for each 32-byte access. tex1_cache_sector_misses: Number of texture cache 1 misses. This increments by 1 for each 32-byte access. tex2_cache_sector_misses: Number of texture cache 2 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units. tex3_cache_sector_misses: Number of texture cache 3 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units. rocache_subp0_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. Increments per warp. rocache_subp1_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. Increments per warp. rocache_subp2_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 2 of read-only data cache. Increments per warp. rocache_subp3_gld_warp_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 3 of read-only data cache. Increments per warp. rocache_subp0_gld_warp_count_64b: Number of 64-bit global load requests via slice 0 of read-only data cache. Increments per warp. rocache_subp1_gld_warp_count_64b: Number of 64-bit global load requests via slice 1 of read-only data cache. Increments per warp. rocache_subp2_gld_warp_count_64b: Number of 64-bit global load requests via slice 2 of read-only data cache. Increments per warp. rocache_subp3_gld_warp_count_64b: Number of 64-bit global load requests via slice 3 of read-only data cache.Increments per warp. rocache_subp0_gld_warp_count_128b: Number of 128-bit global load requests via slice 0 of read-only data cache. Increments per warp. rocache_subp1_gld_warp_count_128b: Number of 128-bit global load requests via slice 1 of read-only data cache. Increments per warp. rocache_subp2_gld_warp_count_128b: Number of 128-bit global load requests via slice 2 of read-only data cache. Increments per warp. rocache_subp3_gld_warp_count_128b: Number of 128-bit global load requests via slice 3 of read-only data cache. Increments per warp. rocache_subp0_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp1_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp2_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. 
rocache_subp3_gld_thread_count_32b: Number of 8-bit, 16-bit, and 32-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp0_gld_thread_count_64b: Number of 64-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp1_gld_thread_count_64b: Number of 64-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp2_gld_thread_count_64b: Number of 64-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp3_gld_thread_count_64b: Number of 64-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp0_gld_thread_count_128b: Number of 128-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp1_gld_thread_count_128b: Number of 128-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp2_gld_thread_count_128b: Number of 128-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. rocache_subp3_gld_thread_count_128b: Number of 128-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction. elapsed_cycles_sm: Elapsed clocks Domain domain_b: fb_subp0_read_sectors: Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access. fb_subp1_read_sectors: Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access. fb_subp0_write_sectors: Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access. fb_subp1_write_sectors: Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access. l2_subp0_write_sector_misses: Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_write_sector_misses: Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_write_sector_misses: Number of write misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_write_sector_misses: Number of write misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_sector_misses: Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_sector_misses: Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_sector_misses: Number of read misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_sector_misses: Number of read misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_write_l1_sector_queries: Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access. 
l2_subp1_write_l1_sector_queries: Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_write_l1_sector_queries: Number of write requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_write_l1_sector_queries: Number of write requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_l1_sector_queries: Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_l1_sector_queries: Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_l1_sector_queries: Number of read requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_l1_sector_queries: Number of read requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_l1_hit_sectors: Number of read requests from L1 that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_tex_sector_queries: Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_tex_sector_queries: Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_tex_sector_queries: Number of read requests from Texture cache to slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_tex_sector_queries: Number of read requests from Texture cache to slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_tex_hit_sectors: Number of read requests from Texture cache that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_read_sysmem_sector_queries: Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_read_sysmem_sector_queries: Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access. l2_subp2_read_sysmem_sector_queries: Number of system memory read requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_read_sysmem_sector_queries: Number of system memory read requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_write_sysmem_sector_queries: Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access. l2_subp1_write_sysmem_sector_queries: Number of system memory write requests to slice 1 of L2 cache. 
This increments by 1 for each 32-byte access. l2_subp2_write_sysmem_sector_queries: Number of system memory write requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access. l2_subp3_write_sysmem_sector_queries: Number of system memory write requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access. l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp1_total_read_sector_queries: Total read requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp2_total_read_sector_queries: Total read requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp3_total_read_sector_queries: Total read requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp1_total_write_sector_queries: Total write requests to slice 1 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp2_total_write_sector_queries: Total write requests to slice 2 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. l2_subp3_total_write_sector_queries: Total write requests to slice 3 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access. Domain domain_c: gld_inst_8bit: Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks. gld_inst_16bit: Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks. gld_inst_32bit: Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks. gld_inst_64bit: Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks. gld_inst_128bit: Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks. gst_inst_8bit: Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks. gst_inst_16bit: Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks. gst_inst_32bit: Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks. gst_inst_64bit: Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks. gst_inst_128bit: Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks. rocache_gld_inst_8bit: Total number of 8-bit global load via read-only data cache that are executed by all the threads across all thread blocks. rocache_gld_inst_16bit: Total number of 16-bit global load via read-only data cache that are executed by all the threads across all thread blocks. 
rocache_gld_inst_32bit: Total number of 32-bit global load via read-only data cache that are executed by all the threads across all thread blocks. rocache_gld_inst_64bit: Total number of 64-bit global load via read-only data cache that are executed by all the threads across all thread blocks. rocache_gld_inst_128bit: Total number of 128-bit global load via read-only data cache that are executed by all the threads across all thread blocks. Domain domain_d: prof_trigger_00: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_01: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_02: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_03: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_04: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_05: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_06: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. prof_trigger_07: User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp. warps_launched: Number of warps launched on a multiprocessor. threads_launched: Number of threads launched on a multiprocessor. inst_issued1: Number of single instruction issued per cycle inst_issued2: Number of dual instructions issued per cycle inst_executed: Number of instructions executed, do not include replays. thread_inst_executed: Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction. not_predicated_off_thread_inst_executed: Number of not predicated off instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads that execute this instruction. atom_count: Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction. gred_count: Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction shared_load: Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor. shared_store: Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor. local_load: Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor. local_store: Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor. gld_request: Number of executed load instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space. 
gst_request: Number of executed store instructions where the state space is not specified and hence generic addressing is used, increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space. active_cycles: Number of cycles a multiprocessor has at least one active warp. This event can increment by 0 - 1 on each cycle. active_warps: Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64. sm_cta_launched: Number of thread blocks launched on a multiprocessor. local_load_transactions: Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B. local_store_transactions: Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B. l1_shared_load_transactions: Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B. l1_shared_store_transactions: Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B. __l1_global_load_transactions: Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B. __l1_global_store_transactions: Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B. l1_local_load_hit: Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively. l1_local_load_miss: Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively. l1_local_store_hit: Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively. l1_local_store_miss: Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively. l1_global_load_hit: Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively. l1_global_load_miss: Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively. uncached_global_load_transaction: Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B. global_store_transaction: Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B. shared_load_replay: Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be loaded in one cycle (256 bytes). 
shared_store_replay: Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) or when there is no conflict but the total number of words accessed by all threads in the warp executing that instruction exceed the number of words that can be stored in one cycle. global_ld_mem_divergence_replays: global ld is replayed due to divergence global_st_mem_divergence_replays: global st is replayed due to divergence
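Of the events above, prof_trigger_00 through prof_trigger_07 are user-driven: device code can increment them with the __prof_trigger() intrinsic. A small sketch (the kernel and the condition being counted are invented for illustration):

 __global__ void counted(double *out, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n && (i & 1)) {
         __prof_trigger(0);               // counted as prof_trigger_00 (increments per warp)
         out[i] = 1.0;
     }
 }
 
 int main(void)
 {
     const int n = 1 << 20;
     double *d;
     cudaMalloc(&d, n * sizeof(double));
     counted<<<n / 256, 256>>>(d, n);
     cudaDeviceSynchronize();
     cudaFree(d);
     return 0;
 }

Profiling it with nvprof --events prof_trigger_00 ./a.out then reports how many warps reached that line.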
#THREADs per block=4, #BLOCKs per grid=4 or anything smaller results in a runtime error.
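I have not tracked down the cause of that error; a generic way to see what the runtime actually complains about is to check the error codes around the launch. This is standard CUDA error checking, not code from gpupi (the work kernel is a stand-in):

 #include <cstdio>
 
 __global__ void work(double *x) { x[0] = 1.0; }   // stand-in for the real kernel
 
 int main(void)
 {
     double *d;
     cudaMalloc(&d, sizeof(double));
     work<<<4, 4>>>(d);                                    // a small configuration, as above
     cudaError_t err = cudaGetLastError();                 // launch/configuration errors
     if (err != cudaSuccess) fprintf(stderr, "launch: %s\n", cudaGetErrorString(err));
     err = cudaDeviceSynchronize();                        // errors raised while the kernel runs
     if (err != cudaSuccess) fprintf(stderr, "run: %s\n", cudaGetErrorString(err));
     cudaFree(d);
     return 0;
 }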
#THREADs per block=8, #BLOCKs per grid=4
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==23989== NVPROF is profiling process 23989, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=8, BLOCK=4
 ==23989== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 13760.455078 (ms)   GPU result = 3.141592653590
 Host computation time = 5755.757812 (ms)   Host result = 3.141592653590
 ==23989== Profiling application: ./gpupi
 ==23989== Profiling result:
 ==23989== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  4  4  4
     1  threads_launched  32  32  32
     1  inst_executed  1277165633  1277165633  1277165633
     1  inst_issued1  2426404937  2426404937  2426404937
     1  inst_issued2  130023454  130023454  130023454
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_warps  2.1919e+10  2.1919e+10  2.1919e+10
#THREADs per block=4, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==25010== NVPROF is profiling process 25010, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=4, BLOCK=8
 ==25010== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 14021.825195 (ms)   GPU result = 3.141592653590
 Host computation time = 5759.763184 (ms)   Host result = 3.141592653590
 ==25010== Profiling application: ./gpupi
 ==25010== Profiling result:
 ==25010== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  8  8  8
     1  threads_launched  32  32  32
     1  inst_executed  2554331261  2554331261  2554331261
     1  inst_issued1  4852809857  4852809857  4852809857
     1  inst_issued2  260046905  260046905  260046905
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_warps  4.4812e+10  4.4812e+10  4.4812e+10
 ==25010== Warning: One or more event counters overflowed. Rerun with "--print-gpu-trace" for detail.
#THREADs per block=8, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==14764== NVPROF is profiling process 14764, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=8, BLOCK=8
 ==14764== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 7617.770020 (ms)   GPU result = 3.141592653590
 Host computation time = 5757.698242 (ms)   Host result = 3.141592653590
 ==14764== Profiling application: ./gpupi
 ==14764== Profiling result:
 ==14764== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  8  8  8
     1  threads_launched  64  64  64
     1  inst_executed  1277165693  1277165693  1277165693
     1  inst_issued1  2426404993  2426404993  2426404993
     1  inst_issued2  130023481  130023481  130023481
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  2.2276e+10  2.2276e+10  2.2276e+10
     1  active_warps  2.2276e+10  2.2276e+10  2.2276e+10
#THREADs per block=8, #BLOCKs per grid=64
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==15821== NVPROF is profiling process 15821, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=8, BLOCK=64
 ==15821== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 1343.349976 (ms)   GPU result = 3.141592653590
 Host computation time = 5754.979980 (ms)   Host result = 3.141592653590
 ==15821== Profiling application: ./gpupi
 ==15821== Profiling result:
 ==15821== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  64  64  64
     1  threads_launched  512  512  512
     1  inst_executed  1277166533  1277166533  1277166533
     1  inst_issued1  2429240687  2429240687  2429240687
     1  inst_issued2  130023880  130023880  130023880
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  5007483146  5007483146  5007483146
     1  active_warps  2.1803e+10  2.1803e+10  2.1803e+10
#THREADs per block=8, #BLOCKs per grid=256
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==15185== NVPROF is profiling process 15185, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=8, BLOCK=256
 ==15185== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 1256.272949 (ms)   GPU result = 3.141592653590
 Host computation time = 5755.872070 (ms)   Host result = 3.141592653590
 ==15185== Profiling application: ./gpupi
 ==15185== Profiling result:
 ==15185== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  256  256  256
     1  threads_launched  2048  2048  2048
     1  inst_executed  1277169413  1277169413  1277169413
     1  inst_issued1  2428738320  2428738320  2428738320
     1  inst_issued2  130025057  130025057  130025057
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  2615620088  2615620088  2615620088
     1  active_warps  2.8689e+10  2.8689e+10  2.8689e+10
#THREADs per block=64, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==17098== NVPROF is profiling process 17098, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=64, BLOCK=8
 ==17098== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 1336.156006 (ms)   GPU result = 3.141592653590
 Host computation time = 5754.904785 (ms)   Host result = 3.141592653590
 ==17098== Profiling application: ./gpupi
 ==17098== Profiling result:
 ==17098== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  16  16  16
     1  threads_launched  512  512  512
     1  inst_executed  319291637  319291637  319291637
     1  inst_issued1  606601457  606601457  606601457
     1  inst_issued2  32505973  32505973  32505973
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  2777686703  2777686703  2777686703
     1  active_warps  5538596160  5538596160  5538596160
#THREADs per block=256, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==17438== NVPROF is profiling process 17438, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=256, BLOCK=8
 ==17438== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 640.708984 (ms)   GPU result = 3.141592653590
 Host computation time = 5755.224121 (ms)   Host result = 3.141592653590
 ==17438== Profiling application: ./gpupi
 ==17438== Profiling result:
 ==17438== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  64  64  64
     1  threads_launched  2048  2048  2048
     1  inst_executed  319292357  319292357  319292357
     1  inst_issued1  609564842  609564842  609564842
     1  inst_issued2  32506306  32506306  32506306
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  661725553  661725553  661725553
     1  active_warps  5252705640  5252705640  5252705640
#THREADs per block=512, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==17777== NVPROF is profiling process 17777, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=512, BLOCK=8
 ==17777== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 238.332993 (ms)   GPU result = 3.141592653590
 Host computation time = 5756.290039 (ms)   Host result = 3.141592653590
 ==17777== Profiling application: ./gpupi
 ==17777== Profiling result:
 ==17777== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  128  128  128
     1  threads_launched  4096  4096  4096
     1  inst_executed  319293317  319293317  319293317
     1  inst_issued1  606658886  606658886  606658886
     1  inst_issued2  32506698  32506698  32506698
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  496165075  496165075  496165075
     1  active_warps  7915666172  7915666172  7915666172
#THREADs per block=1024, #BLOCKs per grid=8
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==18118== NVPROF is profiling process 18118, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=1024, BLOCK=8
 ==18118== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 235.184998 (ms)   GPU result = 3.141592653590
 Host computation time = 5759.145996 (ms)   Host result = 3.141592653590
 ==18118== Profiling application: ./gpupi
 ==18118== Profiling result:
 ==18118== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  256  256  256
     1  threads_launched  8192  8192  8192
     1  inst_executed  319295237  319295237  319295237
     1  inst_issued1  606604750  606604750  606604750
     1  inst_issued2  32507329  32507329  32507329
     1  thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
     1  active_cycles  485943938  485943938  485943938
     1  active_warps  1.5533e+10  1.5533e+10  1.5533e+10
#THREADs per block=1024, #BLOCKs per grid=12
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==19145== NVPROF is profiling process 19145, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=1024, BLOCK=12
 ==19145== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 161.235001 (ms)   GPU result = 3.141592428766
 Host computation time = 5762.456055 (ms)   Host result = 3.141592428766
 ==19145== Profiling application: ./gpupi
 ==19145== Profiling result:
 ==19145== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  384  384  384
     1  threads_launched  12288  12288  12288
     1  inst_executed  319261445  319261445  319261445
     1  inst_issued1  589668833  589668833  589668833
     1  inst_issued2  40957360  40957360  40957360
     1  thread_inst_executed  1.0216e+10  1.0216e+10  1.0216e+10
     1  active_cycles  485207288  485207288  485207288
     1  active_warps  1.5506e+10  1.5506e+10  1.5506e+10
#THREADs per block=1024, #BLOCKs per grid=13
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==19486== NVPROF is profiling process 19486, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=1024, BLOCK=13
 ==19486== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 149.960007 (ms)   GPU result = 3.141591485519
 Host computation time = 5758.164062 (ms)   Host result = 3.141591485519
 ==19486== Profiling application: ./gpupi
 ==19486== Profiling result:
 ==19486== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  416  416  416
     1  threads_launched  13312  13312  13312
     1  inst_executed  320855813  320855813  320855813
     1  inst_issued1  576697370  576697370  576697370
     1  inst_issued2  48234685  48234685  48234685
     1  thread_inst_executed  1.0267e+10  1.0267e+10  1.0267e+10
     1  active_cycles  484923626  484923626  484923626
     1  active_warps  1.5496e+10  1.5496e+10  1.5496e+10
#THREADs per block=1024, #BLOCKs per grid=14
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==19895== NVPROF is profiling process 19895, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=1024, BLOCK=14
 ==19895== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 139.876007 (ms)   GPU result = 3.141592017752
 Host computation time = 5757.771973 (ms)   Host result = 3.141592017752
 ==19895== Profiling application: ./gpupi
 ==19895== Profiling result:
 ==19895== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  448  448  448
     1  threads_launched  14336  14336  14336
     1  inst_executed  320861189  320861189  320861189
     1  inst_issued1  576705383  576705383  576705383
     1  inst_issued2  48235663  48235663  48235663
     1  thread_inst_executed  1.0268e+10  1.0268e+10  1.0268e+10
     1  active_cycles  485014897  485014897  485014897
     1  active_warps  1.5492e+10  1.5492e+10  1.5492e+10
With this, the time improves by 14022 ms (GPU, 4*8) / 140 ms (GPU, 1024*14), i.e. roughly a factor of 100, while the number of threads grows by (1024*14)/(4*8) = 448 times. (Presumably part of the gap is that with only 4 to 8 threads per block most of each 32-thread warp sits idle: thread_inst_executed stays at about 1.02e+10 in every run, while the per-warp inst_executed only reaches its floor once a block holds at least one full warp.)
#THREADs per block=1024, #BLOCKs per grid=15
 $ nvprof --events thread_inst_executed,,active_cycles,warps_launched,threads_launched,active_cycles,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
 ==20234== NVPROF is profiling process 20234, command: ./gpupi
 GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5
 THREAD=1024, BLOCK=15
 ==20234== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
 Computation time = 250.242004 (ms)   GPU result = 3.141592428766
 Host computation time = 5756.960938 (ms)   Host result = 3.141592428766
 ==20234== Profiling application: ./gpupi
 ==20234== Profiling result:
 ==20234== Event result:
 Invocations  Event Name  Min  Max  Avg
 Device "GeForce GTX TITAN (0)"
   Kernel: Kernel(double*)
     1  warps_launched  480  480  480
     1  threads_launched  15360  15360  15360
     1  inst_executed  319262885  319262885  319262885
     1  inst_issued1  589669996  589669996  589669996
     1  inst_issued2  40957835  40957835  40957835
     1  thread_inst_executed  1.0216e+10  1.0216e+10  1.0216e+10
     1  active_cycles  485185993  485185993  485185993
     1  active_warps  1.6542e+10  1.6542e+10  1.6542e+10
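For reference, the "Computation time" lines in the runs above are the kind of figure one gets from CUDA event timers; the actual gpupi timing code is not shown here, so the following is only an assumed sketch of that pattern (the work kernel is a stand-in):

 #include <cstdio>
 
 __global__ void work(double *x) { x[0] += 1.0; }   // stand-in for the real kernel
 
 int main(void)
 {
     double *d;
     float ms = 0.0f;
     cudaEvent_t start, stop;
     cudaMalloc(&d, sizeof(double));
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     cudaEventRecord(start, 0);
     work<<<8, 1024>>>(d);                    // e.g. BLOCK=8, THREAD=1024
     cudaEventRecord(stop, 0);
     cudaEventSynchronize(stop);
     cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
     printf("Computation time = %f (ms)\n", ms);
     cudaEventDestroy(start);
     cudaEventDestroy(stop);
     cudaFree(d);
     return 0;
 }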
I don't fully understand it, but the reference article 高速演算記 No. 25, "Kepler Explained, Part 2 – New Features of the Kepler Generation", discusses these points at length.
The results measured with my own numerical-integration program were as shown in the figure below.