Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

CUDA, Supercomputing for the Masses: Part 4


The CUDA Memory Model

Figure 1 schematically illustrates a thread that executes on the device has access to global memory and the on-chip memory through the memory types.

Figure 1

Each multiprocessor, illustrated as Block (0, 0) and Block (1, 0) above, contains the following four memory types:

  • One set of local registers per thread.
  • A parallel data cache or shared memory that is shared by all the threads and implements the shared memory space.
  • A read-only constant cache that is shared by all the threads and speeds up reads from the constant memory space, which is implemented as a read-only region of device memory. (Constant memory will be discussed in a later column. Until then, please refer to section 5.1.2.2 of the CUDA Programming Guide for more information.)
  • A read-only texture cache that is shared by all the processors and speeds up reads from the texture memory space, which is implemented as a read-only region of device memory. (Texture memory will be discussed in a subsequent article. Until then, refer to section 5.1.2.3 of the CUDA Programming Guide for more information.)

Don't be confused by the fact the illustration includes a block labeled "local memory" within the multi-processor. Local memory implies "local in the scope of each thread". It is a memory abstraction, not an actual hardware component of the multi-processor. In actuality, local memory gets allocated in global memory by the compiler and delivers the same performance as any other global memory region. Local memory is basically used by the compiler to keep anything the programmer considers local to the thread but does not fit in faster memory for some reason. Normally, automatic variables declared in a kernel reside in registers, which provide very fast access. In some cases the compiler might choose to place these variables local memory, which might be the case when there are too many register variables, an array contains more than four elements, some structure or array would consume too much register space, or when the compiler cannot determine if an array is indexed with constant quantities.

Be careful because local memory can cause slow performance. Inspection of the ptx assembly code (obtained by compiling with the -ptx or -keep option) will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. If it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture.

Until the next column installment, I recommend using the occupancy calculator to get a solid understanding of how the execution model and kernel launch execution configuration affects the number of registers and amount of shared memory.

For More Information

Click here for more information on CUDA and here for more information on NVIDIA.


Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at [email protected].


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.