
Let's talk about: HSA·103 - UMA, NUMA, hUMA and HSA



    After the pause in the HSA series of articles, let's discuss the memory layouts, or memory configurations, used in parallel architectures: systems where more than one compute unit (CU) can access the computer's RAM simultaneously.

    Originally written by my colleague f5inet. Translated from Spanish; apologies for the Bing translation.

    Accelerators: let's go over a little history.
    The PC has classically been a UMA system, that is, Uniform Memory Access: the single core it was fitted with accessed all memory through its MMU. Saying a system is UMA means that its compute units (CUs) can access all of the RAM freely and without obstacles.

    It is true that the PC architecture has carried accelerators all its life, such as the first SoundBlaster cards or the first graphics accelerators (the 3dfx Voodoo, the pre-HD ATi Radeon series, or nVidia's RivaTNT and GeForce 1/2/3/4/FX), but the memory of these accelerators was separate from the PC's CPU and was reached through another subsystem called DMA (Direct Memory Access), which was nothing more than a specific chip/subsystem in charge of making large data transfers between the accelerator modules and the system's main memory.

    Every good CU deserves its own foreign ministry:
    The DMA subsystem also handles disk access, talks to USB devices, and in general manages all 'memory' that is not RAM: everything that is not directly visible or addressable by the system's CUs.

    That said, just to settle terminology: if a CU wants to access RAM, it does so through its MMU; and if it wants to access memory beyond its direct reach, it makes a request to the DMA subsystem either to bring the data into RAM, where it can operate on it freely, or to send the data via DMA to an accelerator so that the accelerator can operate on it.

    UMA for all, and all via UMA:
    Of course, having many CUs operate simultaneously on the same RAM brings a new problem: COLLISIONS.

    If two or more CUs want to access the same region of RAM simultaneously, they can step on each other, each modifying the RAM the other is operating on, usually with disastrous results. A typical case: one CU modifies the contents of a RAM region while those contents sit in another CU's L2 cache, and the modification is never signalled to it.

    To handle this, starting with the Intel i386, the MMU gained various extensions by which a CU could 'protect' regions of memory against outside access, ushering in the concept called 'protected mode'. If a CU needs a region of RAM to be left untouched, it marks the page in the MMU as exclusive-access. If another CU then asks the MMU for access to that same region of RAM, the MMU notifies the CU that held it so far, so that it flushes its caches to RAM and releases the region for the other CU to access.

    All of this is what UMA amounts to: a way to keep a single pool of RAM consistent under simultaneous access by different CUs that were not designed to talk to each other, and whose main task is to compute, not to engage in bureaucracy. The bureaucracy is what the MMU is for.

    Is all this GPU yours?:
    But graphics accelerators grew to the point where their CUs are now complete, fully programmable processors, no longer simple ASICs/DSPs/fixed-function accelerators. At this point a GPU consists of several CUs and has its own pool of memory; dedicated AMD GCN 1.0 graphics cards even have their own MMU. But the GPU's RAM is physically separate from the CPU's RAM: two separate pools, each with its own CUs and its own MMU:

    This configuration is called NUMA (Non-Uniform Memory Access): the CUs that depend on one MMU know nothing about the layout of the RAM in the other pool, governed by the other MMU. In this configuration, the DMA subsystems are used to move data between the different RAM pools so that one set of CUs or the other can operate on that data. A splendid example of a recent, successful NUMA configuration is the IBM Cell in the PlayStation 3.

    Halfway between a fixed-function accelerator and a programmable CU, the Cell's SPEs have no direct access to main RAM, since they have no MMU synchronized with the CPU's main MMU, so they can only operate within their small 256 KB local pool of RAM. If they need any data from external RAM, they must issue a request via DMA to bring the data into their 256 KB pool before operating on it.

    Of course, such RAM transfers between different pools add latency and complexity to processes that need to use LCUs (latency-optimized CUs, i.e. CPU cores) and TCUs (throughput-optimized CUs, i.e. GPU cores) simultaneously. And it is here where HSA and hUMA shine.

    hUMA to the rescue

    In an APU that supports HSA, the CPU and GPU MMUs work together to provide the same virtual pool of RAM, notifying each other of the page changes and mappings they perform as they happen. So there is no need to make any kind of data copy between different RAM pools: the GPU is simply passed a pointer to the data it has to work on, and the GPU can access that data directly and work with it.

    BiG Porras
    Last edited by BiG Porras; 04-11-2016, 03:47 PM.