Notes
Visitors: 2555  Last updated: 2013-07-31 (Wed) 11:49:35


Profiling on the Titan (2013-07-13)

The GUI-based profiler is still to come.
Struggled with setting up VNC ⇒ ノート/CUDA/TitanでVNC(遠隔ウィンドウ) ⇒ Why on earth? (It feels as though the GNOME desktop does not quite agree with where Fedora 19 puts its files, yet everything works fine without VNC, i.e. on the native display, so what is the difference?)

Threads, blocks, grids, and warps (a quick review)

A thread has the usual meaning: one flow of program execution.

On the hardware side:

On the program side:
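Even with the hardware/program details above still to be filled in, the index arithmetic itself can be sketched. The following is plain shell arithmetic, not CUDA, assuming the 1-D THREAD=8, BLOCK=4 launch configuration used with gpupi later in this note; the warp size of 32 is fixed on all CUDA GPUs to date.

```shell
# How a global thread index decomposes for blockDim.x=8, gridDim.x=4.
BLOCKDIM=8
for gid in 0 7 8 31; do
  block=$(( gid / BLOCKDIM ))    # blockIdx.x
  thread=$(( gid % BLOCKDIM ))   # threadIdx.x
  warp=$(( thread / 32 ))        # warp index within the block
  echo "global=$gid blockIdx=$block threadIdx=$thread warp=$warp"
done
# With only 8 threads per block, each block still occupies one full
# 32-lane warp; 24 of its 32 lanes simply sit idle.
```

This idle-lane point is why very small block sizes waste the machine: the scheduler issues whole warps, not individual threads.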

Reference articles

The command-line profiler nvprof (the version bundled with CUDA 5.5)

Sites used for reference:

Running nvprof --help shows which parameters can be specified.

Usage: nvprof [options] [CUDA-application] [application-arguments]
Options:
 -o,  --output-profile <file name>
                           Output the result file which can be imported later
                           or opened by the NVIDIA Visual Profiler.

                           "%p" in the file name string is replaced with the
                           process ID of the application being profiled.

                           "%h" in the file name string is replaced with the
                           hostname of the system.

                           "%%" in the file name string is replaced with "%".

                           Any other character following "%" is illegal.

                           By default, this option disables the summary output.

                           NOTE: If the application being profiled creates
                           child processes, or if '--profile-all-processes' is
                           used, the "%p" format is needed to get correct
                           output files for each process.

 -i,  --import-profile <file name>
                           Import a result profile from a previous run.

 -s,  --print-summary      Print a summary of the profiling result on screen.

                           NOTE: This is the default unless "--output-profile"
                           or the print trace options are used.

      --print-gpu-trace    Print individual kernel invocations (including CUDA
                           memcpy's/memset's) and sort them in
                           chronological order. In event/metric profiling mode,
                           show events/metrics for each kernel invocation.

      --print-api-trace    Print CUDA runtime/driver API trace.

      --csv                Use comma-separated values in the output.

 -u,  --normalized-time-unit <s|ms|us|ns|col|auto>
                           Specify the unit of time that will be used in the
                           output.
                           Allowed values:
                               s - second, ms - millisecond, us - microsecond,
                               ns - nanosecond
                               col - a fixed unit for each column
                               auto (default) - nvprof chooses the scale for
                               each time value based on its length

 -t,  --timeout <seconds>  Set an execution timeout (in seconds) for the CUDA
                           application.

                           NOTE: Timeout starts counting from the moment the
                           CUDA driver is initialized. If the application
                           doesn't call any CUDA APIs, timeout won't be
                           triggered.

      --demangling <on|off>
                           Turn on/off C++ name demangling of kernel names.
                           Allowed values:
                               on - turn on demangling (default)
                               off - turn off demangling

      --events <event names>
                           Specify the events to be profiled on certain
                           device(s). Multiple event names separated by comma
                           can be specified. Which device(s) are profiled is
                           controlled by the "--devices" option. Otherwise
                           events will be collected on all devices.
                           For a list of available events, use
                           "--query-events".
                           Use "--devices" and "--kernels" to select a
                           specific kernel invocation.

      --metrics <metric names>
                           Specify the metrics to be profiled on certain
                           device(s). Multiple metric names separated by comma
                           can be specified. Which device(s) are profiled is
                           controlled by the "--devices" option. Otherwise
                           metrics will be collected on all devices.
                           For a list of available metrics, use
                           "--query-metrics".
                           Use "--devices" and "--kernels" to select a
                           specific kernel invocation.

      --analysis-metrics   Collect profiling data that can be imported to
                           Visual Profiler's "analysis" mode.

                           NOTE: Use "--output-profile" to specify an output
                           file.

      --devices <device ids>
                           This option changes the scope of subsequent
                           "--events", "--metrics", "--query-events" and
                           "--query-metrics" options.
                           Allowed values:
                               all - change scope to all valid devices
                               comma-separated device IDs - change scope to
                               specified devices

      --kernels <kernel path syntax>
                           This option changes the scope of subsequent
                           "--events", "--metrics" options
                           The syntax is as following:
                               <context id/name>:<stream id/name>:<kernel name>
                               :<invocation>
                           The context/stream IDs, names and invocation count
                            can be regular expressions. An empty string matches
                            any number of characters.
                           If <context id/name> or <stream id/name> is a
                           number, it's matched against both the context/stream
                           id and name specified by the NVTX library. Otherwise
                           it's matched against the context/stream name.
                           Example: --kernels "1:foo:bar:2" -
                               profile any kernel whose name contains "bar"
                               and was the 2nd instance on context 1 and on
                               stream named "foo".

      --query-events       List all the events available on the device(s).
                           Device(s) queried can be controlled by the
                           "--devices" option.

      --query-metrics      List all the metrics available on the device(s).
                           Device(s) queried can be controlled by the
                           "--devices" option.

      --concurrent-kernels <on|off>
                           Turn on/off concurrent kernel execution.
                           If concurrent kernel execution is off, all kernels
                           running on one device will be serialized.
                           Allowed values:
                               on - turn on concurrent kernel execution
                                    (default)
                               off - turn off concurrent kernel execution

      --profile-from-start <on|off>
                           Enable/disable profiling from the start of the
                           application. If it's disabled, the application can
                           use {cu,cuda}Profiler{Start,Stop} to turn on/off
                           profiling.
                           Allowed values:
                               on - enable profiling from start (default)
                               off - disable profiling from start

      --aggregate-mode <on|off>
                           This option turns on/off aggregate mode for events
                           and metrics specified by subsequent "--events" and
                           "--metrics" options. Those event/metric values will
                           be collected for each domain instance, instead of
                           the whole device.
                           Allowed values:
                               on - turn on aggregate mode (default)
                               off - turn off aggregate mode

      --system-profiling <on|off>
                           Turn on/off power, clock, and thermal profiling.
                           Allowed values:
                               on - turn on system profiling
                               off - turn off system profiling (default)

      --log-file <file name>
                           Make nvprof send all its output to the specified
                           file, or one of the standard channels. The file will
                           be overwritten. If the file doesn't exist, a new
                           one will be created.

                           "%1" as the whole file name indicates standard
                           output channel (stdout).

                           "%2" as the whole file name indicates standard
                           error channel (stderr).

                           NOTE: This is the default.

                           "%p" in the file name string is replaced with
                           nvprof's process ID.

                           "%h" in the file name string is replaced with
                           the hostname of the system.

                           "%%" in the file name is replaced with "%".

                           Any other character following "%" is illegal.

      --quiet              Suppress all nvprof output.

      --profile-child-processes
                           Profile the application and all child processes
                           launched by it.

      --profile-all-processes
                           Profile all processes launched by the same user who
                           launched this nvprof instance.

                           NOTE: Only one instance of nvprof can run with this
                           option at the same time. Under this mode, there's
                           no need to specify an application to run.
 -V   --version            Print version information of this tool.

 -h,  --help               Print this help information.
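The "%h"/"%p" patterns accepted by --output-profile and --log-file can be previewed as follows. This is only a sketch of the substitution nvprof performs itself: the hostname and PID in the real file name are those of the profiled process, not of this shell.

```shell
# Preview what nvprof's "%h"/"%p" file-name patterns would expand to.
pattern="gpupi.%h.%p.nvprof"
expanded=$(printf '%s\n' "$pattern" | sed "s/%h/$(hostname)/; s/%p/$$/")
echo "nvprof --output-profile $pattern ./gpupi   # would write e.g. $expanded"
```

One file per process is exactly why the NOTE under --output-profile insists on "%p" when child processes are profiled: without it, every process would fight over the same output file.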

To see which kinds of events can be measured, run

nvprof --query-events ./gpupi

and the list below is displayed.

Available Events:
                           Name   Description
Device 0 (GeForce GTX TITAN):
       Domain domain_a:
      tex0_cache_sector_queries:  Number of texture cache 0 requests. This increments by 1 for each 32-byte access.

      tex1_cache_sector_queries:  Number of texture cache 1 requests. This increments by 1 for each 32-byte access.

      tex2_cache_sector_queries:  Number of texture cache 2 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.

      tex3_cache_sector_queries:  Number of texture cache 3 requests. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.

       tex0_cache_sector_misses:  Number of texture cache 0 misses. This increments by 1 for each 32-byte access.

       tex1_cache_sector_misses:  Number of texture cache 1 misses. This increments by 1 for each 32-byte access.

       tex2_cache_sector_misses:  Number of texture cache 2 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.

       tex3_cache_sector_misses:  Number of texture cache 3 misses. This increments by 1 for each 32-byte access. Value will be 0 for devices that contain only 2 texture units.

rocache_subp0_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp2_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 2 of read-only data cache. Increments per warp.

rocache_subp3_gld_warp_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 3 of read-only data cache. Increments per warp.

rocache_subp0_gld_warp_count_64b:  Number of 64-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_64b:  Number of 64-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp2_gld_warp_count_64b:  Number of 64-bit global load requests via slice 2 of read-only data cache. Increments per warp.

rocache_subp3_gld_warp_count_64b:  Number of 64-bit global load requests via slice 3 of read-only data cache. Increments per warp.

rocache_subp0_gld_warp_count_128b:  Number of 128-bit global load requests via slice 0 of read-only data cache. Increments per warp.

rocache_subp1_gld_warp_count_128b:  Number of 128-bit global load requests via slice 1 of read-only data cache. Increments per warp.

rocache_subp2_gld_warp_count_128b:  Number of 128-bit global load requests via slice 2 of read-only data cache. Increments per warp.

rocache_subp3_gld_warp_count_128b:  Number of 128-bit global load requests via slice 3 of read-only data cache. Increments per warp.

rocache_subp0_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp2_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp3_gld_thread_count_32b:  Number of 8-bit, 16-bit, and 32-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp0_gld_thread_count_64b:  Number of 64-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_64b:  Number of 64-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp2_gld_thread_count_64b:  Number of 64-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp3_gld_thread_count_64b:  Number of 64-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp0_gld_thread_count_128b:  Number of 128-bit global load requests via slice 0 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp1_gld_thread_count_128b:  Number of 128-bit global load requests via slice 1 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp2_gld_thread_count_128b:  Number of 128-bit global load requests via slice 2 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

rocache_subp3_gld_thread_count_128b:  Number of 128-bit global load requests via slice 3 of read-only data cache. For each instruction it increments by the number of threads in the warp that execute the instruction.

              elapsed_cycles_sm:  Elapsed clocks

       Domain domain_b:
          fb_subp0_read_sectors:  Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.

          fb_subp1_read_sectors:  Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.

         fb_subp0_write_sectors:  Number of DRAM write requests to sub partition 0, increments by 1 for 32 byte access.

         fb_subp1_write_sectors:  Number of DRAM write requests to sub partition 1, increments by 1 for 32 byte access.

   l2_subp0_write_sector_misses:  Number of write misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp1_write_sector_misses:  Number of write misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp2_write_sector_misses:  Number of write misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp3_write_sector_misses:  Number of write misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp0_read_sector_misses:  Number of read misses in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp1_read_sector_misses:  Number of read misses in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp2_read_sector_misses:  Number of read misses in slice 2 of L2 cache. This increments by 1 for each 32-byte access.

    l2_subp3_read_sector_misses:  Number of read misses in slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_write_l1_sector_queries:  Number of write requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_write_l1_sector_queries:  Number of write requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp2_write_l1_sector_queries:  Number of write requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp3_write_l1_sector_queries:  Number of write requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_l1_sector_queries:  Number of read requests from L1 to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_read_l1_sector_queries:  Number of read requests from L1 to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp2_read_l1_sector_queries:  Number of read requests from L1 to slice 2 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp3_read_l1_sector_queries:  Number of read requests from L1 to slice 3 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp0_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp1_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp2_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.

   l2_subp3_read_l1_hit_sectors:  Number of read requests from L1 that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_tex_sector_queries:  Number of read requests from Texture cache to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_read_tex_sector_queries:  Number of read requests from Texture cache to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp2_read_tex_sector_queries:  Number of read requests from Texture cache to slice 2 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp3_read_tex_sector_queries:  Number of read requests from Texture cache to slice 3 of L2 cache. This increments by 1 for each 32-byte access.

  l2_subp0_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 0 of L2 cache. This increments by 1 for each 32-byte access.

  l2_subp1_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 1 of L2 cache. This increments by 1 for each 32-byte access.

  l2_subp2_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 2 of L2 cache. This increments by 1 for each 32-byte access.

  l2_subp3_read_tex_hit_sectors:  Number of read requests from Texture cache that hit in slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_read_sysmem_sector_queries:  Number of system memory read requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_read_sysmem_sector_queries:  Number of system memory read requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp2_read_sysmem_sector_queries:  Number of system memory read requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp3_read_sysmem_sector_queries:  Number of system memory read requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_write_sysmem_sector_queries:  Number of system memory write requests to slice 0 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp1_write_sysmem_sector_queries:  Number of system memory write requests to slice 1 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp2_write_sysmem_sector_queries:  Number of system memory write requests to slice 2 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp3_write_sysmem_sector_queries:  Number of system memory write requests to slice 3 of L2 cache. This increments by 1 for each 32-byte access.

l2_subp0_total_read_sector_queries:  Total read requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp1_total_read_sector_queries:  Total read requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp2_total_read_sector_queries:  Total read requests to slice 2 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp3_total_read_sector_queries:  Total read requests to slice 3 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp0_total_write_sector_queries:  Total write requests to slice 0 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp1_total_write_sector_queries:  Total write requests to slice 1 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp2_total_write_sector_queries:  Total write requests to slice 2 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp3_total_write_sector_queries:  Total write requests to slice 3 of L2 cache. This includes requests from  L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

       Domain domain_c:
                  gld_inst_8bit:  Total number of 8-bit global load instructions that are executed by all the threads across all thread blocks.

                 gld_inst_16bit:  Total number of 16-bit global load instructions that are executed by all the threads across all thread blocks.

                 gld_inst_32bit:  Total number of 32-bit global load instructions that are executed by all the threads across all thread blocks.

                 gld_inst_64bit:  Total number of 64-bit global load instructions that are executed by all the threads across all thread blocks.

                gld_inst_128bit:  Total number of 128-bit global load instructions that are executed by all the threads across all thread blocks.

                  gst_inst_8bit:  Total number of 8-bit global store instructions that are executed by all the threads across all thread blocks.

                 gst_inst_16bit:  Total number of 16-bit global store instructions that are executed by all the threads across all thread blocks.

                 gst_inst_32bit:  Total number of 32-bit global store instructions that are executed by all the threads across all thread blocks.

                 gst_inst_64bit:  Total number of 64-bit global store instructions that are executed by all the threads across all thread blocks.

                gst_inst_128bit:  Total number of 128-bit global store instructions that are executed by all the threads across all thread blocks.

          rocache_gld_inst_8bit:  Total number of 8-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

         rocache_gld_inst_16bit:  Total number of 16-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

         rocache_gld_inst_32bit:  Total number of 32-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

         rocache_gld_inst_64bit:  Total number of 64-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

        rocache_gld_inst_128bit:  Total number of 128-bit global load via read-only data cache that are executed by all the threads across all thread blocks.

       Domain domain_d:
                prof_trigger_00:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_01:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_02:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_03:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_04:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_05:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_06:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                prof_trigger_07:  User profiled generic trigger that can be inserted in any place of the code to collect the related information. Increments per warp.

                 warps_launched:  Number of warps launched on a multiprocessor.

               threads_launched:  Number of threads launched on a multiprocessor.

                   inst_issued1:  Number of single instruction issued per cycle

                   inst_issued2:  Number of dual instructions issued per cycle

                  inst_executed:  Number of instructions executed, do not include replays.

           thread_inst_executed:  Number of instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads in the warp that execute the instruction.

not_predicated_off_thread_inst_executed:  Number of not predicated off instructions executed by all threads, does not include replays. For each instruction it increments by the number of threads that execute this instruction.

                     atom_count:  Number of warps executing atomic reduction operations. Increments by one if at least one thread in a warp executes the instruction.

                     gred_count:  Number of warps executing reduction operations on global and shared memory. Increments by one if at least one thread in a warp executes the instruction

                    shared_load:  Number of executed load instructions where state space is specified as shared, increments per warp on a multiprocessor.

                   shared_store:  Number of executed store instructions where state space is specified as shared, increments per warp on a multiprocessor.

                     local_load:  Number of executed load instructions where state space is specified as local, increments per warp on a multiprocessor.

                    local_store:  Number of executed store instructions where state space is specified as local, increments per warp on a multiprocessor.

                    gld_request:  Number of executed load instructions where the state space is not specified and hence generic addressing is used, 
                                  increments per warp on a multiprocessor. It can include the load operations from global,local and shared state space.

                    gst_request:  Number of executed store instructions where the state space is not specified and hence generic addressing is used,
                                  increments per warp on a multiprocessor. It can include the store operations to global,local and shared state space.

                  active_cycles:  Number of cycles a multiprocessor has at least one active warp. This event can increment by 0 - 1 on each cycle.

                   active_warps:  Accumulated number of active warps per cycle. For every cycle it increments by the number of active warps in the cycle which can be in the range 0 to 64.

                sm_cta_launched:  Number of thread blocks launched on a multiprocessor.

        local_load_transactions:  Number of local load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

       local_store_transactions:  Number of local store transactions to L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

    l1_shared_load_transactions:  Number of shared load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

   l1_shared_store_transactions:  Number of shared store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

  __l1_global_load_transactions:  Number of global load transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

 __l1_global_store_transactions:  Number of global store transactions from L1 cache. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

              l1_local_load_hit:  Number of cache lines that hit in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_local_load_miss:  Number of cache lines that miss in L1 cache for local memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

             l1_local_store_hit:  Number of cache lines that hit in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

            l1_local_store_miss:  Number of cache lines that miss in L1 cache for local memory store accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32,64 and 128 bit accesses by a warp respectively.

             l1_global_load_hit:  Number of cache lines that hit in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

            l1_global_load_miss:  Number of cache lines that miss in L1 cache for global memory load accesses. In case of perfect coalescing this increments by 1,2, and 4 for 32, 64 and 128 bit accesses by a warp respectively.

uncached_global_load_transaction:  Number of uncached global load transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

       global_store_transaction:  Number of global store transactions. Increments by 1 per transaction. Transaction can be 32/64/96/128B.

             shared_load_replay:  Replays caused due to shared load bank conflict (when the addresses for two or more shared memory load requests fall in the same memory bank) 
                               or when there is no conflict but the total number of words accessed by all threads in the warp executing that 
                               instruction exceed the number of words that can be loaded in one cycle (256 bytes).

            shared_store_replay:  Replays caused due to shared store bank conflict (when the addresses for two or more shared memory store requests fall in the same memory bank) 
                               or when there is no conflict but the total number of words accessed by all threads in the warp executing that 
                               instruction exceed the number of words that can be stored in one cycle.
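The shared_load_replay / shared_store_replay descriptions above hinge on which bank each address maps to. As a hedged sketch (assuming Kepler's default 4-byte bank mode with 32 banks; Kepler also offers an 8-byte mode), the bank of a byte address is (addr / 4) mod 32, and a k-way conflict costs k-1 replays:

```python
# Sketch: predict shared-memory bank conflicts for one warp's addresses.
# Assumes the default 4-byte bank mode with 32 banks (a simplification).
from collections import Counter

NUM_BANKS = 32
WORD_SIZE = 4  # bytes per bank word in 4-byte mode

def bank_of(byte_addr):
    """Bank index a byte address falls into."""
    return (byte_addr // WORD_SIZE) % NUM_BANKS

def max_conflict_degree(addrs):
    """Worst-case number of requests to one bank.
    1 = conflict-free; k = k-way conflict (k-1 replays)."""
    banks = Counter(bank_of(a) for a in addrs)
    return max(banks.values())

# Warp reading float element [tid]: each lane hits a different bank.
stride1 = [tid * 4 for tid in range(32)]
# Warp reading float element [tid * 32]: every lane hits bank 0.
stride32 = [tid * 32 * 4 for tid in range(32)]

print(max_conflict_degree(stride1))   # 1 (conflict-free)
print(max_conflict_degree(stride32))  # 32 (32-way conflict)
```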

global_ld_mem_divergence_replays:  global ld is replayed due to divergence
 
global_st_mem_divergence_replays:  global st is replayed due to divergence

Trying the command-line profiler nvprof on gpupi

With #THREADs per block=4, #BLOCKs per grid=4 or fewer, the run fails with a runtime error.
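The gpupi source is not shown in this note, but the printed value 3.141592653590 suggests a standard numerical integration of 4/(1+x²) over [0,1]. A hedged host-side sketch of that computation (the quadrature rule and interval count are assumptions, not gpupi's actual parameters):

```python
# Host-side sketch of a pi-by-numerical-integration program like gpupi.
# pi = integral of 4/(1+x^2) over [0,1], midpoint rule.
# N is a placeholder; gpupi's real interval count is not shown in this note.

def compute_pi(n):
    h = 1.0 / n
    s = 0.0
    for i in range(n):
        x = (i + 0.5) * h          # midpoint of interval i
        s += 4.0 / (1.0 + x * x)   # integrand 4/(1+x^2)
    return s * h

print(compute_pi(1_000_000))  # close to 3.141592653589793
```

On the GPU each thread would handle a strided subset of the intervals, which is why the wall-clock time below scales with the total thread count.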

#THREADs per block=8, #BLOCKs per grid=4

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==23989== NVPROF is profiling process 23989, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=8, BLOCK=4
==23989== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =13760.455078(ms)
GPU result = 3.141592653590
Host computation time =5755.757812(ms)
Host result = 3.141592653590
==23989== Profiling application: ./gpupi
==23989== Profiling result:
==23989== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched           4           4           4
          1                threads_launched          32          32          32
          1                   inst_executed  1277165633  1277165633  1277165633
          1                    inst_issued1  2426404937  2426404937  2426404937
          1                    inst_issued2   130023454   130023454   130023454
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                    active_warps  2.1919e+10  2.1919e+10  2.1919e+10
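A consistency check on these counters: with only 8 threads per block, each 32-lane warp carries just 8 active lanes (warps_launched=4 matches 4 blocks of 8 threads each), so thread_inst_executed should be about 8 × inst_executed:

```python
# Cross-check the THREAD=8, BLOCK=4 counters above.
# With 8-thread blocks each warp has only 8 active lanes, so
# thread-level instructions ~= 8 * warp-level instructions.
inst_executed = 1277165633        # warp-level instructions (from the table)
thread_inst_executed = 1.0217e10  # thread-level, as printed (5 sig. digits)

active_lanes = thread_inst_executed / inst_executed
print(round(active_lanes, 2))  # ~8.0
```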

#THREADs per block=4, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==25010== NVPROF is profiling process 25010, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=4, BLOCK=8
==25010== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =14021.825195(ms)
GPU result = 3.141592653590
Host computation time =5759.763184(ms)
Host result = 3.141592653590
==25010== Profiling application: ./gpupi
==25010== Profiling result:
==25010== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched           8           8           8
          1                threads_launched          32          32          32
          1                   inst_executed  2554331261  2554331261  2554331261
          1                    inst_issued1  4852809857  4852809857  4852809857
          1                    inst_issued2   260046905   260046905   260046905
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                    active_warps  4.4812e+10  4.4812e+10  4.4812e+10
==25010== Warning: One or more event counters overflowed. Rerun with "--print-gpu-trace" for detail.

#THREADs per block=8, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==14764== NVPROF is profiling process 14764, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=8, BLOCK=8
==14764== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =7617.770020(ms)
GPU result = 3.141592653590
Host computation time =5757.698242(ms)
Host result = 3.141592653590
==14764== Profiling application: ./gpupi
==14764== Profiling result:
==14764== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched           8           8           8
          1                threads_launched          64          64          64
          1                   inst_executed  1277165693  1277165693  1277165693
          1                    inst_issued1  2426404993  2426404993  2426404993
          1                    inst_issued2   130023481   130023481   130023481
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles  2.2276e+10  2.2276e+10  2.2276e+10
          1                    active_warps  2.2276e+10  2.2276e+10  2.2276e+10

#THREADs per block=8, #BLOCKs per grid=64

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==15821== NVPROF is profiling process 15821, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=8, BLOCK=64
==15821== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =1343.349976(ms)
GPU result = 3.141592653590
Host computation time =5754.979980(ms)
Host result = 3.141592653590
==15821== Profiling application: ./gpupi
==15821== Profiling result:
==15821== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched          64          64          64
          1                threads_launched         512         512         512
          1                   inst_executed  1277166533  1277166533  1277166533
          1                    inst_issued1  2429240687  2429240687  2429240687
          1                    inst_issued2   130023880   130023880   130023880
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles  5007483146  5007483146  5007483146
          1                    active_warps  2.1803e+10  2.1803e+10  2.1803e+10

#THREADs per block=8, #BLOCKs per grid=256

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==15185== NVPROF is profiling process 15185, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=8, BLOCK=256
==15185== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =1256.272949(ms)
GPU result = 3.141592653590
Host computation time =5755.872070(ms)
Host result = 3.141592653590
==15185== Profiling application: ./gpupi
==15185== Profiling result:
==15185== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         256         256         256
          1                threads_launched        2048        2048        2048
          1                   inst_executed  1277169413  1277169413  1277169413
          1                    inst_issued1  2428738320  2428738320  2428738320
          1                    inst_issued2   130025057   130025057   130025057
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles  2615620088  2615620088  2615620088
          1                    active_warps  2.8689e+10  2.8689e+10  2.8689e+10

#THREADs per block=64, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==17098== NVPROF is profiling process 17098, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=64, BLOCK=8
==17098== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =1336.156006(ms)
GPU result = 3.141592653590
Host computation time =5754.904785(ms)
Host result = 3.141592653590
==17098== Profiling application: ./gpupi
==17098== Profiling result:
==17098== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched          16          16          16
          1                threads_launched         512         512         512
          1                   inst_executed   319291637   319291637   319291637
          1                    inst_issued1   606601457   606601457   606601457
          1                    inst_issued2    32505973    32505973    32505973
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles  2777686703  2777686703  2777686703
          1                    active_warps  5538596160  5538596160  5538596160
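At 64 threads per block the warps are finally full: warps_launched = 8 blocks × 64/32 = 16, and inst_executed drops to roughly thread_inst_executed / 32, confirming all 32 lanes are active per warp:

```python
# Cross-check the THREAD=64, BLOCK=8 counters above.
blocks, threads_per_block, warp_size = 8, 64, 32

warps = blocks * (threads_per_block // warp_size)
print(warps)  # 16, matching warps_launched

# Full warps: warp-level count ~= thread-level count / 32.
inst_executed = 319291637
thread_inst_executed = 1.0217e10  # as printed (5 sig. digits)
print(round(thread_inst_executed / inst_executed, 1))  # ~32.0
```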

#THREADs per block=256, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==17438== NVPROF is profiling process 17438, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=256, BLOCK=8
==17438== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =640.708984(ms)
GPU result = 3.141592653590
Host computation time =5755.224121(ms)
Host result = 3.141592653590
==17438== Profiling application: ./gpupi
==17438== Profiling result:
==17438== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched          64          64          64
          1                threads_launched        2048        2048        2048
          1                   inst_executed   319292357   319292357   319292357
          1                    inst_issued1   609564842   609564842   609564842
          1                    inst_issued2    32506306    32506306    32506306
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles   661725553   661725553   661725553
          1                    active_warps  5252705640  5252705640  5252705640

#THREADs per block=512, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==17777== NVPROF is profiling process 17777, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=512, BLOCK=8
==17777== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =238.332993(ms)
GPU result = 3.141592653590
Host computation time =5756.290039(ms)
Host result = 3.141592653590
==17777== Profiling application: ./gpupi
==17777== Profiling result:
==17777== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         128         128         128
          1                threads_launched        4096        4096        4096
          1                   inst_executed   319293317   319293317   319293317
          1                    inst_issued1   606658886   606658886   606658886
          1                    inst_issued2    32506698    32506698    32506698
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles   496165075   496165075   496165075
          1                    active_warps  7915666172  7915666172  7915666172

#THREADs per block=1024, #BLOCKs per grid=8

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==18118== NVPROF is profiling process 18118, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=1024, BLOCK=8
==18118== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =235.184998(ms)
GPU result = 3.141592653590
Host computation time =5759.145996(ms)
Host result = 3.141592653590
==18118== Profiling application: ./gpupi
==18118== Profiling result:
==18118== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         256         256         256
          1                threads_launched        8192        8192        8192
          1                   inst_executed   319295237   319295237   319295237
          1                    inst_issued1   606604750   606604750   606604750
          1                    inst_issued2    32507329    32507329    32507329
          1            thread_inst_executed  1.0217e+10  1.0217e+10  1.0217e+10
          1                   active_cycles   485943938   485943938   485943938
          1                    active_warps  1.5533e+10  1.5533e+10  1.5533e+10

#THREADs per block=1024, #BLOCKs per grid=12

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==19145== NVPROF is profiling process 19145, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=1024, BLOCK=12
==19145== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =161.235001(ms)
GPU result = 3.141592428766
Host computation time =5762.456055(ms)
Host result = 3.141592428766
==19145== Profiling application: ./gpupi
==19145== Profiling result:
==19145== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         384         384         384
          1                threads_launched       12288       12288       12288
          1                   inst_executed   319261445   319261445   319261445
          1                    inst_issued1   589668833   589668833   589668833
          1                    inst_issued2    40957360    40957360    40957360
          1            thread_inst_executed  1.0216e+10  1.0216e+10  1.0216e+10
          1                   active_cycles   485207288   485207288   485207288
          1                    active_warps  1.5506e+10  1.5506e+10  1.5506e+10

#THREADs per block=1024, #BLOCKs per grid=13

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==19486== NVPROF is profiling process 19486, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=1024, BLOCK=13
==19486== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =149.960007(ms)
GPU result = 3.141591485519
Host computation time =5758.164062(ms)
Host result = 3.141591485519
==19486== Profiling application: ./gpupi
==19486== Profiling result:
==19486== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         416         416         416
          1                threads_launched       13312       13312       13312
          1                   inst_executed   320855813   320855813   320855813
          1                    inst_issued1   576697370   576697370   576697370
          1                    inst_issued2    48234685    48234685    48234685
          1            thread_inst_executed  1.0267e+10  1.0267e+10  1.0267e+10
          1                   active_cycles   484923626   484923626   484923626
          1                    active_warps  1.5496e+10  1.5496e+10  1.5496e+10

#THREADs per block=1024, #BLOCKs per grid=14

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==19895== NVPROF is profiling process 19895, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=1024, BLOCK=14
==19895== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =139.876007(ms)
GPU result = 3.141592017752
Host computation time =5757.771973(ms)
Host result = 3.141592017752
==19895== Profiling application: ./gpupi
==19895== Profiling result:
==19895== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         448         448         448
          1                threads_launched       14336       14336       14336
          1                   inst_executed   320861189   320861189   320861189
          1                    inst_issued1   576705383   576705383   576705383
          1                    inst_issued2    48235663    48235663    48235663
          1            thread_inst_executed  1.0268e+10  1.0268e+10  1.0268e+10
          1                   active_cycles   485014897   485014897   485014897
          1                    active_warps  1.5492e+10  1.5492e+10  1.5492e+10

So the time went from 14022 ms (GPU, 4*8) to 140 ms (GPU, 1024*14), about a 100x speedup, while the thread count grew by (1024*14)/(4*8) = 448x.
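The arithmetic in the line above, checked against the measured times:

```python
# Speedup vs. thread-count growth, from the measurements above.
t_slow = 14021.825195  # ms, THREAD=4,    BLOCK=8
t_fast = 139.876007    # ms, THREAD=1024, BLOCK=14

speedup = t_slow / t_fast
thread_growth = (1024 * 14) // (4 * 8)

print(round(speedup))  # ~100x
print(thread_growth)   # 448x
```

So the thread count grew 4.5x faster than the runtime shrank; with only 4-8 threads per block the warps were mostly idle, so much of that extra parallelism was recovering wasted lanes rather than adding work.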

#THREADs per block=1024, #BLOCKs per grid=15

$ nvprof --events thread_inst_executed,active_cycles,warps_launched,threads_launched,active_warps,inst_executed,inst_issued1,inst_issued2 ./gpupi
==20234== NVPROF is profiling process 20234, command: ./gpupi
GPU Device 0: "GeForce GTX TITAN" with compute capability 3.5

THREAD=1024, BLOCK=15
==20234== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Computation time =250.242004(ms)
GPU result = 3.141592428766
Host computation time =5756.960938(ms)
Host result = 3.141592428766
==20234== Profiling application: ./gpupi
==20234== Profiling result:
==20234== Event result:
Invocations                      Event Name         Min         Max         Avg
Device "GeForce GTX TITAN (0)"
        Kernel: Kernel(double*)
          1                  warps_launched         480         480         480
          1                threads_launched       15360       15360       15360
          1                   inst_executed   319262885   319262885   319262885
          1                    inst_issued1   589669996   589669996   589669996
          1                    inst_issued2    40957835    40957835    40957835
          1            thread_inst_executed  1.0216e+10  1.0216e+10  1.0216e+10
          1                   active_cycles   485185993   485185993   485185993
          1                    active_warps  1.6542e+10  1.6542e+10  1.6542e+10
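One plausible reading of the jump from 140 ms (14 blocks) back up to 250 ms (15 blocks): GTX TITAN has 14 SMX units, so with one block per SMX the 15th block forces a second scheduling "wave" in which a single SMX runs a second block while the rest sit idle. A sketch of that wave model (a simplification; real residency depends on registers and shared memory):

```python
# Wave model for block scheduling on GTX TITAN (14 SMX units).
# Assumes one resident 1024-thread block per SMX at a time -- a
# simplification, not a measured occupancy figure.
import math

NUM_SMX = 14

def waves(num_blocks):
    """Number of sequential rounds needed to run all blocks."""
    return math.ceil(num_blocks / NUM_SMX)

print(waves(14))  # 1 wave  -> ~140 ms measured above
print(waves(15))  # 2 waves -> ~250 ms measured (second wave nearly empty)
```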

The reasons are not entirely clear to me, but the reference article 高速演算記 第25回 「Kepler解説その2 〜Kepler世代の新機能〜」 ("Kepler Explained, Part 2: New Features of the Kepler Generation") discusses this in some detail.

The results measured with my own numerical-integration program are shown in the figure below.

2013-07-31_Titan_performance.png

