
Xbox One Secret Sauce™ Hardware


  • #76
    Gaming consoles mentioned using Ray tracing chips
    Last edited by srenia_ia; 07-01-2015, 05:52 PM.


    • srenia_ia
      PowerVR GR6500 is the first member of the new Wizard family of cores designed to bring real-time, interactive ray traced graphics to a broad range of consumer and mobile platforms, as well as gaming consoles and mainstream gaming PCs, workstations and servers.

      David Helgason, CEO of Unity Technologies says: “Unity integrated Imagination’s ray tracing software into the Unity 5 editor due to its high performance on the broadest range of consumer laptops, ease of integration and unique future hardware roadmap. We will continue to work with Imagination to fully utilize the PowerVR Ray Tracing GPU IP within the Unity game engine to further enhance the in-game experience and simplify content creation for our developers.”

      “While PowerVR Wizard ray tracing GPUs are a disruptive technology, they remain as familiar and easy to use as the PowerVR GPUs that developers are already using. With PowerVR Wizard IP, the ray tracing capabilities are exposed directly through the existing programmable shading hardware, so that software developers can take advantage of the ultra-fast ray tracing for some effects, while continuing to leverage the substantial investment in software designed to run on GPUs already available in the market today.”
      Last edited by srenia_ia; 07-01-2015, 05:38 PM.

    • srenia_ia

      Partner Type Strategic Partners

      Products Browsers Gaming GPU compute Navigation/Automotive Operating Systems PowerVR Ecosystem Partners PowerVR Graphics PowerVR Video User Interface

      Markets Automotive Digital Home Embedded/Emerging Markets Handheld Multimedia Home Electronics Mobile Phones Tablets

      Design Services Hardware Providers Software Developers

    • srenia_ia

      With Imagination’s Wizard Ray Tracing family, patented custom hardware engines ensure the GPU’s processor clusters are kept free to run sophisticated shaders, producing stunning results with minimal developer effort. The unique hardware ray tracing processors, including coherence gathering engines, fixed function intersection testing array and accelerated scene hierarchy generator, receive rays from the shaders, determines their intersection and initiates more shaders. All of this is made possible with entirely dynamic scenes through the addition of a novel hardware pipeline to build scene hierarchies required for ray tracing in parallel with other operations, all while remaining compatible with existing programmable vertex operations.

      With all of these additions, Wizard cores address the real-world use cases of ray tracing with fully interactive dynamic content and photorealistic lighting. The net effect is an architecture that can perform ray tracing operations around 100 times as efficiently as performing the same operation using GPGPU functionality.

      PowerVR GR6500 delivers:
      •Ray tracing: 300 MRPS (million rays per second) and 100 MDTPS (million dynamic triangles per second) at 600 MHz
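      For scale, the quoted rates can be turned into per-clock figures. This is only a minimal Python sketch using the two numbers in the bullet above; the helper name `per_clock` is made up for illustration and is not from Imagination's material.

```python
def per_clock(rate_millions_per_s: float, clock_mhz: float) -> float:
    """Events per clock cycle, given a rate in millions/s and a clock in MHz."""
    return rate_millions_per_s / clock_mhz


# Figures quoted for the GR6500 at 600 MHz.
rays_per_clock = per_clock(300, 600)        # 300 MRPS
triangles_per_clock = per_clock(100, 600)   # 100 MDTPS

print(f"{rays_per_clock:.2f} rays per clock")
print(f"{triangles_per_clock:.3f} dynamic triangles per clock")
```

      In other words, the quoted figures work out to about one ray every two clocks and one dynamic triangle every six clocks.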

      While power-efficient enough for mobile applications, the PowerVR Wizard architecture is able to scale up to configurations capable of rendering interactive cinema for high-end gaming machines and consumer devices.

      Says Yasuhiro Kondo, general manager, Arcade Machine Development, SEGA Co Ltd.: “SEGA welcomes the announcement of PowerVR Ray Tracing IP from Imagination Technologies. We expect that the Wizard cores will create a great revolution in the graphics experience of the gaming market.”

      So 100 times better at ray tracing than GPU compute. Got to love MS using this IP in the Xbox One. 100 times better at ray tracing than PS4. Well, what can you say but game over PS4.
      Last edited by srenia_ia; 07-01-2015, 06:38 PM.

    • srenia_ia

      Vision Platform

      Our vision platform is a complete, integrated vision platform, combining GPU, video and vision cores, that saves power and bandwidth for today's camera applications, and can provide the basis for next-generation context-aware applications such as facial and gesture recognition, augmented reality and more.

      Imagination’s PowerVR graphics technologies are licensed by world-leading companies to power iconic products delivering the best in smartphone, tablet, TV and console apps, including the most advanced user interfaces and highest performance gaming.

      By using PowerVR video technologies, our customers are able to build the efficient decoding required for 4K broadcasts into their silicon today. They can also deliver better image quality using the industry’s first complete IP cores with 10-bit colour depth support throughout.

      Notice "console apps"! This is a 2014 console using PowerVR IP.

      Visualizer: delivering ray tracing to people who ‘make’
      Visualizer is a new B2C product brand from Imagination. Its core product is Visualizer for SketchUp, a revolutionary new app which allows anybody to take virtual photos of their 3D designs with a ‘real life look’ in SketchUp with only one click. At its core, Visualizer for SketchUp relies on Imagination’s highly optimized PowerVR ray tracing software to produce photorealistic pictures in realtime. Visualizer is not only delivering a compelling product to the SketchUp community but also helping drive our efforts in creating an ecosystem around our ray tracing technologies.

      HoloLens using ray tracing from Imagination IP! This ray tracing IP is big, guys, and can comfortably be linked with the One. The fact that Intel uses these GPUs cements how much Imagination IP is being used by MS. The One is based on mobile tech scaled up for the best gaming possible within its thermals.
      Last edited by srenia_ia; 07-01-2015, 07:27 PM.

    • QuazL
      It should be noted that the article speaks of console(s) - Plural.

    • srenia_ia

      Partner Type IP Core Licensees

      Products MIPS Processors PowerVR Graphics Silicon Vendors Silicon Vendors

      Markets Connected home Digital Home Gaming Home Electronics Mobile

      Design Services Hardware Providers Software Developers

      Academic Links No terms selected

      You're right, Sony licenses the IP but isn't a Strategic Partner. :-) There is a lot more involvement with MS, which is almost a legal partner. This is the smoking gun for the major hidden hardware. The GPU is a PowerVR setup on an AMD GPU: a mobile GPU scaled up. :-)

      A strategic partnership is a formal alliance between two commercial enterprises, usually formalized by one or more business contracts, but it falls short of forming a legal partnership, agency, or corporate affiliate relationship.
      Last edited by srenia_ia; 07-02-2015, 05:35 PM.

    • srenia_ia

      A main release in 2015 or later is predicted for this tech: DirectX 12.x, Windows 10 and RT in 2015/16 on the One. :-)

  • #77
    The X1 has 3 cache parts:
    1. eSRAM with a direct link to the CPU, which has moderate BW (Jaguar can access it too; it is logically located near Jaguar).
    2. eSRAM without a direct link to the CPU, which has extremely high BW. High BW is already higher than moderate, so extremely high is going to be more. It is logically close to the gfx core or, as in my diagram, stacked.
    3. The color buffer and depth buffer, similar to those on the X360, which have a dedicated cache with faster access than either type of eSRAM.

    You can cross-check this against the XDK; the link is on page one of this thread.

    [Attached images: eSRAM_highspeed_with_CPU.jpg, eSRAM_highspeed_no_CPU.jpg, CB_DB_Substantial_cache.jpg]


    • QuazL
      I see where you are going with this, MisterC!

      Very confusing:

      1. The first picture states that the CPU CAN be a "memory client", unless they are saying that a CPU is a "memory client" of the system as a whole and not a "memory client" of the ESRAM.
      2. The GPU has direct access to 32 MB of ESRAM. Then it states NO CPU access to ESRAM.
      3. It specifically states that the depth and color buffers have access to a cache that is even faster than ESRAM, although smaller.
      4. One moment they say that there is a "moderate difference" between ESRAM and main RAM.
      5. Then they say it is extremely fast.

      Very Confusing.

      Question for debate:

      1. Is the CPU a "memory client" of ESRAM?
      2. What is the name of the cache for depth and color buffer?
      3. Is ESRAM moderately faster than Main Ram or Extremely Fast?
      4. How big are these "substantial" caches that are NOT big enough to hold entire render targets?
      Last edited by QuazL; 07-02-2015, 01:10 PM. Reason: Better wording.

    • mistercteam
      Because there are 2 eSRAMs.
      The one with the CPU as a client is the one near Jaguar,
      just like the one shown at Hot Chips.

      The extremely high BW one is >>>> 102 GB/s or 204 GB/s because it is the one stacked with the gfx core.
      This is one of the X1 secrets.

      MS was clever to name both eSRAM, but it means 2 variants,

      just like MS uses the name "GPU block" for everything,

      when the block we see in the Chipworks shot is actually a back end, like the DX12 requirement per the Intel paper.

      Basically, the X1 currently only utilizes the main SoC, not the real eSRAM block, which has a GPU + CPU too: a PIM-like device.

      Look at my slide above in the previous post.

      The 32MB moderate-BW eSRAM is near Jag, in 4 layers. That is why sites like AnandTech ask
      why there is 8x4 on a 4x256 controller, as eSRAM usually uses the full 1024 bits.

      Now we have all the answers.

      Extremely high BW is not 102 or 204, which is less than the eDRAM BW.

      That is why the insider is somehow clearly right in this case, but he does not know how the final block will look.
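
      For reference, this is how a 102 / 204 GB/s pair falls out of a 4x256-bit interface. It is only a back-of-the-envelope Python sketch: the 800 MHz clock is an assumption (it is not stated anywhere in this thread), and `peak_bandwidth_gb_s` is a made-up helper name.

```python
def peak_bandwidth_gb_s(bus_bits: int, clock_mhz: float, directions: int = 1) -> float:
    """Theoretical peak bandwidth in GB/s for a bus moving bus_bits per clock."""
    bytes_per_clock = bus_bits / 8
    return bytes_per_clock * clock_mhz * 1e6 * directions / 1e9


BUS_BITS = 4 * 256   # 4 controllers x 256 bits = 1024 bits, as discussed above
CLOCK_MHZ = 800      # ASSUMPTION: clock chosen for illustration, not from the thread

one_way = peak_bandwidth_gb_s(BUS_BITS, CLOCK_MHZ)
both_ways = peak_bandwidth_gb_s(BUS_BITS, CLOCK_MHZ, directions=2)

print(f"one direction:  {one_way:.1f} GB/s")
print(f"read and write: {both_ways:.1f} GB/s")
```

      Under that assumed clock, 1024 bits per clock gives roughly 102 GB/s in one direction and roughly 204 GB/s if reads and writes overlap fully. It says nothing either way about a second, faster eSRAM.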

    • mistercteam
      The CPU is a memory client of the moderate-BW eSRAM; you can check the Hot Chips slide.
      There are small lines from the 8MB x 4 embedded SRAM.
      Interestingly, of course, they named it embedded SRAM too.

      In fact, you can guess which of the 2 eSRAM variants can be accessed by the CPU/Jag: the moderate-speed SRAM one, for sure.
      In the Chipworks shot it is positioned below Jag, and on the Hot Chips slide there are small lines to it.

      But the super fassssssttt one is named enhanced SRAM.
      Yep, as SRAM is always designed for lower latency.

      Plus, as it is part of the gfx core block, MS won't show it in the diagram.
      That is also why MS had to show the 64 KB SRAM on the audio vector cores as a clue: that 64 KB SRAM is taken from a slice of the faast SRAM.
      Last edited by mistercteam; 07-02-2015, 02:51 PM.

  • #78
    Now the comparison for the small eSRAM, which in 1T = 8MB, seems accurate, and it will certainly have moderate BW!!

    [Attached image: 0_sram_fact1.jpg]


  • #79
    I think this is a specific question we can ask. As toms has already said, the public research points to it as well. Let's start tweeting devs etc. and see if we can get some confirmation?


    • LuvOfThaGame
      I agree but with extreme caution. We've been burned by one wrong letter or word many times. We would need to make sure and have the terminology perfect in order to get an accurate response. Would be an interesting collaboration to all agree on the perfect 140ch to ask a question. Lol. But would save a misunderstanding.

    • F00xm4n
      Yeah, good point. A categorical no means we can concentrate elsewhere, but no answer or a confirmation would be such great news. Can you imagine a confirmation? I think it's worth it.

    • LuvOfThaGame
      I agree as well. Just want to make sure we ask the right question, in right way. Confirmation would be very exciting.

    • mistercteam
      I already asked them, remember, but they are tight-lipped and only sometimes favourite my tweets.
      For example, I asked Holmes why the media for Halo 5 is all at 1440p.
      He did not answer any of it. He could have said, for example, that it was just for magazines, but no, he did not answer.

      So I think it is better, if you can, to DM the developer rather than ask on the public Twitter timeline,
      as on the public timeline others will sometimes see the answer from a different perspective.
      So we have to be careful.

  • #80
    Loving the way your thread is coming together, C! Very clear and well organized in topic. I'm going to have to camp out in here once finished. Lots of BOOMAGE!


    • David Michael
      was going to like more likes?

  • #81
    OK, the big picture of the X1 is:

    2 big blocks (physically 2 eSRAM, 3D IC/stacked; 4 clusters per block)

    The CPU side of this block acts as a Command Processor based on a CPU-like core, ARM or PowerA2.
    Each block holds 16 integer ALUs supporting 2 contexts (@ 32-bit),
    which means a total of 32 SQ with 2 contexts.
    The CPU controls the scalar part of the VSP.

    VSP side:
    there are 48 VSP blocks;
    each VSP virtually has 16 SPUs;
    each SPU physically has a scalar part and a vector part.

    The CPU + the 768 scalar parts of the SPUs are basically the HP-APU. This is ~1.2TF-1.3TF, plus massive integer operations for program control, branching, etc.

    The 768 vector parts of the SPUs are basically the whole gfx core.
    These 768 vector parts are partitioned into 6 CU groups: 2 CU groups belong to the CB-DB and
    4 CU groups belong to the 48 TCP.

    Currently, with DX11 and without the locked part,
    the system/app can only access the 2 CU groups that connect directly to the CB-DB.
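
    The unit counts in this post can be checked with a few lines of Python. This is only a sketch of the post's own arithmetic: the 2 flops per lane per clock (one multiply-add) and the 800 to 853 MHz clock range are assumptions added here to see what would land on the ~1.2TF-1.3TF figure, and `peak_tflops` is a hypothetical helper.

```python
VSP_BLOCKS = 48    # from the post
SPU_PER_VSP = 16   # from the post

spu_total = VSP_BLOCKS * SPU_PER_VSP  # one scalar part per SPU


def peak_tflops(lanes: int, clock_ghz: float, flops_per_lane_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS. ASSUMPTION: 2 flops/clock means one multiply-add."""
    return lanes * flops_per_lane_per_clock * clock_ghz / 1000


low = peak_tflops(spu_total, 0.800)    # ASSUMPTION: 800 MHz
high = peak_tflops(spu_total, 0.853)   # ASSUMPTION: 853 MHz

print(f"{spu_total} scalar parts")
print(f"{low:.2f} to {high:.2f} TF")
```

    So 48 x 16 does give 768, and under those assumed clocks 768 lanes would give roughly 1.2 to 1.3 TF. That only shows the numbers are internally consistent, not that the hardware exists.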


    • #82
      Regarding my reference to IBM SRAM tech at 22nm + FF or SOI:
      I am shocked to see that the IBM 22nm SRAM bit cell is 0.6 um2,
      a bit smaller than TSMC's at 14/16nm, which is 0.7 um2.

      Then I suddenly remembered that Xbox World magazine said the X1
      has its main part at 28nm and another part at 22nm. Aha .......

      How could XBW magazine know? Suddenly we have a strong case/proof
      that the eSRAM is indeed a 3D IC and at least 22nm or 20nm SOI.

      Some references: you can check that IBM 22nm is ~0.6 um2.

      SemiAccurate: the Oban test chip is huge, 500-600mm2
      22nm IBM SRAM is 0.6 um2, smaller than even TSMC at 14/16nm
      XBW's last issue hinted at 16 cores per block in 4 clusters, a monster CPU, and 22nm!!!

      [Attached images: wXaZ2jb.jpg, shSMxIh.jpg, 6X0RFss.jpg]
      Last edited by mistercteam; 07-03-2015, 10:11 AM.


      • #83
        Be aware that this patent is about proof of the TCC/TCP listed in the XDK,
        as we know people downplay TCP/TCC as nonexistent on AMD GCN.

        So this is about understanding why data moves the way it does, etc. Of course there is no Xbox name in the patent,
        but how it works ..... is what matters to me.

        Finally, PATENT proof of a GPU with 24 TCP with TCP/TA/TD.
        The X1 is modified from this concept; more on the secret sauce!!!!

        full image:

        What it tells: simple. This confirms my digging: the TCP is controlled per CU, as it is actually L1, and TA/TD on the X1 are modified a bit. Plus, look at the single TCC in this patent; the X1 has 16 TCC!!! (That is because the X1 can support 16 virtual addresses, which is why there is a GPUMMU. PS4 or AMD GCN has only 1 TCC so far, as in the image below, as they still use an IOMMU!!!)

        from XDK

        The X1 is modified into 3 separate blocks rather than TA/TD/TCP,
        each with its own ALUs.

        Below is my old slide about it,
        full res.

        Notice that the X1 texture unit is virtual, as it is processed by at least 3 blocks.
        You can guess which part belongs to the 12 SC,
        which belongs to the 2 blocks in the Hot Chips slide,
        and which belongs to the back end.

        [Attached image: pTxU6Ex.jpg]

        This patent confirms:
        1. The TCP is L1, which is why it is per CU; the X1 has 48 TCP.
        2. The TCC maintains address segments, which is why it is called L2. This patent uses only 1 TCC, just like most AMD GPUs or the PS4 GPU; the X1 has 16 TCC.
        3. This patent is an old model but is still important for showing points 1 & 2. The old model represents the current texture flow in GCN: TA (texture address) ---> TCP (cache) ---> TD (texture destination). The X1 splits this into 3 blocks.
        4. Point 3 is the most important difference compared to PS4 or GCN. It is why the X1 has 3 blocks for texture processing. "Texture" is general: it can be any data, not exclusively textures.

        OK, basically this is what happens in the X1:
        1. TA block (processed by the 12 SC, 768 scalar + branch). As the name suggests, it is the producer. Clock: 426-853 MHz.
        2. TCP block, the middle process (processed by 48 CU; this is confirmed by the XDK showing 48 TCP). Low clock, 426 MHz (which is why the gfx core is at the bottom, with the CPU at the top below the super fast eSRAM). The gfx core can adapt to a higher clock but targets a low clock!!
        3. TD block, the back-end process (24 CU). This is confirmed by the XDK as 24 TD. It is shared with the Jaguar block (main SoC), so when the PIM block of points 1-2 is not used, this block behaves as the GPU. Low clock too.
        In normal operation (when unlocked, with the low clock for TCP/TD) this is the X1 performance:
        TA block: as we know, produces 1.3 TF of scalar flops (this acts like a CPU on the gfx core)
        TCP block: produces 2.6 TF at 426 MHz
        TD block: produces 1.3 TF at 426 MHz
        Total: 5.2 TF, with main SoC (which has the TD block) : gfx core at 1:3 (like the insider or local cloud rumor back then)

        Large spare headroom for 10 years (the TCP and TD blocks can be clocked higher later on, but cannot run high all the time, especially as the TCP block is at the bottom of the eSRAM PIM block to reduce heat).
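
        The total and the 1:3 split quoted above can be reproduced from the three per-block figures. A minimal Python sketch follows; the grouping (main SoC = TD block, gfx core = TA + TCP blocks) is one reading of the post and is an assumption, and the variable names are made up.

```python
# Per-block figures as claimed in the post, in TFLOPS.
blocks_tf = {"TA": 1.3, "TCP": 2.6, "TD": 1.3}

total_tf = sum(blocks_tf.values())

# ASSUMPTION: "main SoC" is read as the TD block, "gfx core" as TA + TCP.
main_soc_tf = blocks_tf["TD"]
gfx_core_tf = blocks_tf["TA"] + blocks_tf["TCP"]
ratio = gfx_core_tf / main_soc_tf

print(f"total: {total_tf:.1f} TF")
print(f"main SoC : gfx core = 1:{ratio:.0f}")
```

        With that grouping the numbers do add up to 5.2 TF and split 1:3. The check covers the arithmetic only; the inputs are still the post's claims.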

        Timeline and DX11
        2013-2015, DX11. The currently active blocks are:
        -Jaguar main SoC --> 100-200 Gflops
        -TD block (when DX12 is not used, this block is used to do flops)

        The main GPU block has (gfx core + CPU-like core block (12 SC + 4 CP))

        Gfx core =
        The gfx core is TCP block + TD block = TCP block (the part in the PIM eSRAM) + TD block (what you see in the X-ray). The TD block's main function under DX12 is back-end processing, but under DX11 it acts as the gfx core for Jaguar.
        So = 48 CU + 24 CU (this fits with 6 CU groups of 12 each). That is the reason the insider rumored an 8970 XTX, but as it is low-clocked it behaves like a 7970/680 = 2.6TF (+ 1.3TF of back-end flops processing)

        Acting as CPU = 12 SC/scheduler/CP
        768 scalar ALUs = 1.3 TF (which is why the X1 was rumored to have a 1.2 TF CPU). As DX12 needs a very powerful CPU, the X1 moves this CPU-like core into the GPU block.

        A simple slide is coming. As for me, I am already close to 80-90%.
        Last edited by mistercteam; 07-05-2015, 12:47 PM.


        • revben
          I agree on the component, but disagree on the block configuration.

      • #84
        To Mister X,

        In VERY BASIC Terms this is a 'little' bit of what MisterC is trying to explain.
        [Attached images: Explanation_1.png, Explanation_2.png]

        The key to the DSP cores and the accelerators lies in what MS have said about Halo 5 and how the physics/lighting is 'linked' to 60fps. My view is that this has to do with using the other cores, DSPs, etc. within the limitations of DX11 at the moment, which is very timing critical. This also explains why all MS first-party games have looked significantly better than third-party stuff, as coding for these other Compute Cores (as MS calls them) without DX12 would be a BIG pain in the ass.

        However, that is where DX12 will let programmers see all the cores in the system as compute cores and code for them really easily, which was the goal of AMD etc. with HSA; you will also see that they refer to all CPU/GPU cores as Compute Units. It is the same with the Xbox One, and that is why Windows 10 and DX12 will be such a big deal.

        Last edited by G-Force; 07-05-2015, 01:21 PM.


        • Misterx
          That's better

          Keep it going guys

          I remember someone we all know once said: "coding for Cell is x5 harder, coding for X-engine not simple too, x3 harder compared to pc/360."

        • mistercteam
          Yep ....
          Also, in the picture, that is why PS4 or other GCN is still locked to 1 virtual address: the IOMMU only serves 1 TCC.

        • BiG Porras
          great paper, thanks!

        • LuvOfThaGame
          I really like this diagram. It shows clearly how the media/public sees the X1 versus reality. The X1 is only 1.3TF if you strip it of its customizations. But that is NOT reality. :D

      • #85
        G-Force: I fixed your Pony/Media edition of the Xbox SOC a little bit. I think you are expecting too much of them ;----). This is how the media and ponies see the Xbox One:


        • G-Force
          I know :), I was just in a hurry to put something up, after looking at it I was going to do what you have done :)

        • QuazL
          Very funny!

        • LuvOfThaGame
          Haha. Ain't that right! That is some funny truth right there!

        • David Michael

      • #86
        Basically, from a bird's-eye view, it is like this vs. PS4:


        • revben
          I do not agree with how you are trying to fit anything within the block diagram from JOHN SELL.

        • LuvOfThaGame
          mistercteam, thank you. Loving the new diagrams.

          revben, could you be more specific? MrC is very detailed in explaining his arguments. To have a good discussion, could you go into more detail on what you disagree with, so good counterarguments can be made?

        • mistercteam
          revben, in that simple diagram, the parts where the actual ALUs are can also be counted as flops.
          Look at the audio block as an example: 1 branch (ALU) block + 1 scalar (ALU) block + 4 vector ALU blocks. That explains how MS hinted that the ALUs are split like that at the very inner level.

          The 768 scalar come from the scalar part which, like I said, is similar to the CPU rumor back then.
          For this part in the front end it will be difficult to know where the actual blocks are.
          Check my previous post about ILP-TLP or the flexible scalar unit.

          But the real thing is:
          there are 48 TCP besides 24 TD (all are CUs);
          48 TCP is the middle processing;
          24 TD is the back-end processing.

          You can check from above or from John Sell which are the L2 (TCC)
          and which are the CB-DB (L2).

          Remember, the CB-DB is actually also L2, besides being fixed function.
          The TCC path is what is called the 2nd texture unit (there are 4 blocks of it, 4x L2 16-way, with 4 TCC per L2).
          The TD-CB-DB path is the CUs for the back end.

          The other accelerators are for control etc., which is why I am not counting them as flops.
          *) That is why I said a bird's-eye view.
          Last edited by mistercteam; 07-06-2015, 10:52 PM.

        • revben
          mistercteam, I do not believe TD has its own CU, as the XDK says that TD, TA, and TCP are within the CU. I agree on all the functions etc., like DB/CB having its own ALU and cache, given that neither TCC nor TCP links to it. And there is a separate TRIANGLE/VERTEX BLOCK with 8 VGT blocks, probably with its own cache, but I am not sure, given that the vertex read, which is 16 elements/clock!!!, goes through the TCP, TCC, TD and TA. So I agree on the blocks, just not the configuration.

        • mistercteam
          You miss my point. I am not saying TD has CUs.
          It is the marking for the CB-DB;
          the actual CUs are on the CB-DB.

          Just like TA: the actual block is on the 12 SC that generate the 768 scalar.

          This is a streaming model. Don't think in the old paradigm,
          where TA-TCP-TD all happens in one block called 1 CU.

          Generating addresses and modifying data so that it is ready with no dependencies needs ALUs.
          It is virtual; MS uses the same names just to mark the process.

          The not so confusing one is, yes, TCP,
          but TA/TD will be a bit confusing, as on GCN they are part of the same CU too.

          The XDK never said TA/TD/TCP are in one block. It is virtual (it is processed in at least 3 blocks!!!).
          That is why I showed the real GCN texture unit residing in 1 CU vs. the XDK!!!!
          Last edited by mistercteam; 07-07-2015, 01:28 AM.

      • #87
        revben, see this slide for more clarity:

        TA - TCP - TD
        That is a comparison of the 1-CU GCN model
        and the XDK virtual block.

        That is why I said to check those slides
        and then look at the patent too.
        MS modified the original CU into a streaming model.

        That is why they said (check the slide) that calling it a texture unit is wrong, as the unit does not exist.
        They kept the same names, TA - TCP - TD, to preserve the flow of the process that actually happened in the CU,
        but now they have changed it into a streaming model .....

        The XDK never said TA-TCP-TD are in one block.
        They said it all happens in at least 3 virtual blocks.

        TA - TCP - TD is only a marking for it.
        Last edited by mistercteam; 07-07-2015, 01:41 AM.


        • #88
          This is the lowest level of processing element in the X1.
          You will see the similarity, to start with, with Intel's newest gfx core,
          and also with Echelon
          and basically all future GPGPU systems.

          Be cautious: this is just a simple representation.
          Focus on the ALU side; don't focus on TCP-TA-TD. Take it as it is.

          *) IN FACT, MS hinted at this very seriously when they showed the audio block (SCALAR + branch control + 2 x vector unit (4 wide)).
          *) It is also why MS said the branch is as costly as the computation, in the TA block (producer) sense, as on GCN 1 branch can cost a lot more than just 1 ALU op.

          *) In reality the real flops processing is on the TCP ---> 2.6TF (768 x 4 x 2 x 0.426 = 2.6TF)
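
          The product in that last line can be spelled out as a small Python sketch. The meaning given to each factor here (768 units, 4-wide vectors, 2 flops per lane per clock for a multiply-add, 0.426 GHz) is an interpretation of the post, not something the XDK is known to state, and `peak_gflops` is a hypothetical helper.

```python
def peak_gflops(units: int, lanes_per_unit: int,
                flops_per_lane_per_clock: int, clock_ghz: float) -> float:
    """Theoretical peak = units x lanes x flops per lane per clock x clock (GHz)."""
    return units * lanes_per_unit * flops_per_lane_per_clock * clock_ghz


# Factors exactly as quoted in the post: 768 x 4 x 2 x 0.426
tcp_gflops = peak_gflops(768, 4, 2, 0.426)

print(f"{tcp_gflops:.1f} GFLOPS = {tcp_gflops / 1000:.1f} TF")
```

          The product comes to about 2617 GFLOPS, which rounds to the 2.6TF quoted. This is a theoretical peak from the post's own factors, not a measurement.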

          Last edited by mistercteam; 07-07-2015, 04:16 AM.


          • #89
            About the IOMMU on GCN:
            this is from the official GCN white paper from AMD, page 11:

            "GCN incorporates an I/O Memory Management Unit (IOMMU)"


            • #90
              [Attached images: 1718469367a097e5d77774abcceba465.png, 056121d648b027c197b148130f05be0e.png]

              Ray tracing is being included in Vulkan. DirectX 12.x on the One already includes it, with MS's strategic partnership with imgtec.
              Last edited by srenia_ia; 07-07-2015, 06:30 PM.


              • srenia_ia

                A MS tiled based graphics never used... All PowerVR GPUs are based on Imagination’s unique Tile Based Deferred Rendering (TBDR) architecture; the only true deferred rendering GPU architecture in the world.


                Also this tidbit should raise some eyebrows:

                "All PowerVR GPUs are managed by firmware which controls all higher-level GPU events. This approach offers numerous advantages including full offloading of virtually all interrupt handling from the main host CPU while maintaining maximum flexibility.

                Rogue GPUs feature a dedicated multi-threaded microcontroller to run the microkernel, which allows full debugging functionality of the GPU. The software-based management of the GPU ensures the ability to adapt to future market requirements as well as providing optimal performance through priority-based execution of GPU tasks. The microkernel also has the ability to help SoC designers implement advanced power management features by, for example, signalling workload information to DVFS and power-gating logic within the SoC."

                The Xbox SOC ability to ramp down to 3% power usage is a curiosity when AMD APU's do not.

                "support for power islands and flexible power control mechanisms that can seamlessly interact with SoC-level power"


                "Alexandru Voica > ronch • 4 months ago

                You would need to bring a true differentiator to make a dent into the desktop/console space. This is why we've introduced our ray tracing GPUs

                This is a unique piece of IP that no one has at the moment; we believe that a console, a gaming laptop or any other kind of similar device integrating this type of GPU would really bring something new to the table."

                With Vulkan & DirectX 12.x support, then yes: the RTU will make a dent in the console space.


                Although you usually tend to hear people talking about virtualization in the context of CPUs, it is actually a system-level requirement that can only be implemented optimally if all the components in the chip support it.

                PowerVR Series7XT and Series7XE GPUs implement a set of extensions to the Rogue architecture that enable efficient hardware support for virtualization. These virtual GPUs can be added to virtual machines running on hypervisors. A priority based mechanism ensures each virtual machine efficiently gets the required amount of performance, ensuring robust performance across all virtual clients.
                Last edited by srenia_ia; 07-07-2015, 06:27 PM.

              • srenia_ia
                Oh ya, notice the Xbox 360 reference... The IP from the imgtec/MS strategic partnership is evident in the Xbox One: the power islands, microkernels, master/slave host & co-processors, the game DVR, gesture control, etc... The AMD GPU arrangement is a mobile graphics solution scaled up. The compression technology is top notch, allowing for local processing without using external memory as well. The more I look into the design, the more the graphics section in the One looks made for low external memory usage, while the HSA compute section uses the higher bandwidth. Two different caches.

                In the PS4 everything contends for bandwidth, resulting in lower than stated bandwidth in actual game usage. The stated bandwidth in the One stays the stated bandwidth. The GPU being mobile in nature (tiling) makes external bandwidth usage much, much smaller than even what we saw on the AMD 285 GPU. The processing takes place within the GPU caches because of the small tiles being processed. There is no back and forth with external memory within the tiling processing. So near-TB/s bandwidth within tiling game scenarios. The ray tracing and other HSA compute does not compress, so it uses the bandwidth outside the cache. This is all ideal game engine design, not what current game engines have shown.

                The MS project showing tiling was canceled because of the tight integration of components it needed. Only in a closed system could MS see the full benefit of this tiled gaming. BC could be explained by this in-cache graphics engine design versus the PS4's graphics engine with 10X more memory usage. The PS4 is easier to use, but the One, using tiling, will have a much longer shelf life.

                So yes, the GPU uses compression on the eSRAM and DDR3, but the compression is for use within the caches of the GPU, so that the GPU only rarely has to use the DDR3 and eSRAM. It's a mobile GPU setup. So really I don't believe GDDR5 was a real option for MS when tiling could give them 10 times more bandwidth. Why use an external memory when PIM works so much better? The same goes for the 100X more efficient ray tracing than GPU compute that the RTUs provide in the One.
                Last edited by srenia_ia; 07-07-2015, 05:40 PM.

              • G-Force
                That last comment, was that all in your own words or did you get that from somewhere? I have also said for over 3 years that Tiled Resources, and rendering the screen in small chunks, was what was going to make the Xbox One come to life. And I also mentioned before that it's what mobiles do, so it is VERY scalable.

                I CAN'T Wait to have tiled resources games. Watching the Ponies mouths drop to the floor is going to be soooo funny.

              • srenia_ia
                Last comment is mine. Trying to separate research from my words. For certain we've talked lightly about it before. :-) There is such a plethora of info on such a complex SOC. We probably should have put the new info coming out of this company front and center sooner. Yes, I'm so looking forward to watching the ponies squirm as well. Going to have them scratching their heads. You are right, this mobile architecture is perfect for scaling from phones on upwards.

            • #91
              Certainly this is another milestone.
              I forget that the proof is sometimes overlooked;
              they hinted at it, but I overlooked it.

              Now, before checking the slides, remember some facts about GCN (also in the slides):
              1. The scheduler and branch unit are part of the CU; the scheduler is an extension of the CP.
              2. From point 1, there is no way a branch on the GPU is executed in the scheduler, as both are part of the CU!
              3. A branch is very costly; the worst case is 1/16, from 16 ALUs per SIMD.

              On the other hand, this is a CPU-like characteristic:
              4. A CPU can do a branch at the same cost as compute!
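
A minimal sketch of the divergence cost in point 3, assuming the post's figure of 16 lanes per SIMD and assuming each branch path costs the same number of cycles. The function name `simd_utilization` and its parameters are hypothetical, chosen only for illustration; they do not come from the XDK or from any AMD document.

```python
def simd_utilization(active_lanes_per_path, lanes=16):
    """Fraction of lane-cycles doing useful work when a SIMD executes each
    branch path one after another with the inactive lanes masked off.

    active_lanes_per_path: number of lanes that take each path.
    lanes: SIMD width (16 is the figure used in the post, an assumption here).
    """
    # Paths that no lane takes are skipped entirely and cost nothing.
    paths = [n for n in active_lanes_per_path if n > 0]
    if not paths:
        return 0.0
    useful = sum(paths)          # lane-cycles that do real work
    total = lanes * len(paths)   # every executed path occupies the full SIMD
    return useful / total


uniform = simd_utilization([16])    # all lanes agree: no penalty
split = simd_utilization([8, 8])    # half the lanes take each side
worst = simd_utilization([1])       # a single lane still has work to do

print(uniform, split, worst)
```

Under these assumptions the single-lane case gives the 1/16 utilization the post calls the worst case, while a scalar CPU core has no masked lanes to waste.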

              With points 1-4 in mind, it was a shock to me that the XDK even hinted at it.

              They even said that a branch is as costly as compute.
              That is certainly a CPU-like characteristic! Funnily, they highlight that in the XDK --> from the XDK: "Just like they do on CPUs"!

              Now the slides, enjoy. Hard proof,
              plus an added LinkedIn fact about the HP-APU, and a gfx pipeline lecturer saying, like I said, that branching is very costly!

              Additional slides:

              [Attached images: P7fIAuM.jpg, F5YgGHc.jpg, CHmwVguUwAAOQao.jpg]


              • #92
                Now a bit about ROPs and the RBE.

                This is the 7970 32-ROP test.
                As you can see, the 7970 has more than the PS4, or than the X1 is supposed to have.
                People will think 32 ROPs will do at least 29 Gpix/sec,
                but look at the result...

                From AnandTech:

                We’ll start with 3DMark Vantage and its color fill test. This is basically a ROP test that attempts to have a GPU’s ROPs blend as many pixels as it can. Theoretically AMD can do 32 color operations per clock on Tahiti, which at 925MHz for 7970 means the theoretical limit is 29.6Gpix/sec;
                You can see that in the end it can only push 13.3 Gpix/sec, and it can be worse if the middle or front-end processing takes more rendering cycle time and bandwidth!
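
The arithmetic behind those two numbers can be sketched as follows. The 32 ROPs, 925 MHz and 13.3 Gpix/sec values are the ones quoted above; the function name and the one-pixel-per-ROP-per-clock default are assumptions made for illustration.

```python
def theoretical_fill_rate_gpix(rops, clock_mhz, pixels_per_rop_per_clock=1):
    """Peak color fill rate in Gpix/sec: ROPs x pixels per clock x clock."""
    return rops * pixels_per_rop_per_clock * clock_mhz / 1000.0


tahiti_peak = theoretical_fill_rate_gpix(32, 925)   # 7970 theoretical limit
measured = 13.3                                      # figure cited in the post
efficiency = measured / tahiti_peak                  # fraction of peak reached

print(f"peak {tahiti_peak:.1f} Gpix/sec, measured {measured} Gpix/sec, "
      f"{efficiency:.0%} of peak")
```

With these inputs the measured result is a little under half of the theoretical limit, which is the gap the post is pointing at.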

                Now a bit more about the RBE.
                The RBE on the 7970, or GCN 1.0, is a big RBE block: 1 RBE = 16 ROPs.
                You can cross-check this in the GCN white paper (page 13).

                The RBE of newer GCN is split into smaller ones (1 big RBE is made up of 4 parts);
                that is why the R9 290X has 16 RBEs, which are actually the same as 4 old big RBEs.

                The X1 XDK actually tells us why John Sell (IEEE) showed 4 RBEs; from that you can guess how many ROPs are in that part.
                Also remember that the X1's 13.2 Gpix/sec is a balanced resolve-out rate. It is the actual final pixel-out rate, basically like the 7970 figure above (the 7970 could end up worse, as it is not balanced).
                What balanced means is that no matter what happens in the middle or front end, the X1 can do a maximum of 13.2 Gpix/sec as its actual pixel resolve rate!
                The X1 has 4 big RBEs, but clocked low!

                *) BTW, the PS4 has 2 RBEs and the X1 has 4 RBEs, go figure... (all using the big RBE block a la 7970). Also see the pic below; it is also a hint at
                how many shader engines the X1 has.

                [Attached image: 174785_original.jpg]

                My final calculation based on this is:
                X360 can do 4xMSAA at 720p --> 4 Gpix/sec
                X1 resolve rate --> 16 Gpix/sec at 1 GHz, or 13.2 Gpix/sec at 853 MHz

                This is a resolve rate, not a 3D rendering metric.
                Based on that, the X1 can do 1080p at 8xMSAA,
                or 1440p (upscaled to 4K) at 4xMSAA,
                or 4K with no AA (but it can use post AA; at 4K maybe there is no need for AA at all)
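
For reference, the sample counts per frame for the modes listed above can be worked out as below. This is only a sketch: the standard pixel dimensions for each mode name are assumed, and it counts MSAA samples per frame only. Whether resolve throughput scales in proportion to sample count is the post's assumption, not something this code shows.

```python
# (width, height, MSAA samples per pixel) for each mode named in the post.
modes = {
    "720p 4xMSAA": (1280, 720, 4),
    "1080p 8xMSAA": (1920, 1080, 8),
    "1440p 4xMSAA": (2560, 1440, 4),
    "4K no AA": (3840, 2160, 1),
}


def samples_per_frame(width, height, msaa):
    """Number of color samples that have to be resolved for one frame."""
    return width * height * msaa


baseline = samples_per_frame(*modes["720p 4xMSAA"])
ratios = {name: samples_per_frame(*m) / baseline for name, m in modes.items()}

for name, ratio in ratios.items():
    print(f"{name}: {samples_per_frame(*modes[name]):,} samples, "
          f"{ratio:.2f}x the 720p 4xMSAA frame")
```

These ratios can be set against the ratio of the quoted resolve rates (13.2 or 16 Gpix/sec versus 4 Gpix/sec) to see how much headroom each mode would have under the post's reasoning.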


                • G-Force
                  And on top of that, when they use more compute (and the Xbox One has MANY compute cores), the RBEs are not needed as much.


                • mistercteam
                  Yep, for the programmable back end they will use the last 2 CU groups.
                  I am starting to understand why the XDK says processing goes through the CB/DB and then to the ROPs.
                  That is a huge hint that the CB/DB works like the X360 eDRAM; that's why it has a huge cache.
                  Then the last one is the real RBE.

              • #93
                Microsoft Research Faculty Summit 2015



                • #94
                  Now, to make it clearer, from the AnandTech article: 32 ROPs are supposed to do 32 pix/cycle, but instead the result is 13.3 Gpix/sec.
                  *) Also, 64 ROPs (R9 290X) have only a marginally higher resolve rate than 32 ROPs on GCN! (On the slide AnandTech put the wrong label; it should say Gigapixel.)

                  ================================================== ================================================== =====
                  [Attached image: 0_7970_resolve1.jpg]

                  [Attached image: 59318.png]

                  VS X360 4 Gpix/Sec serving 720p 4xMSAA
                  [Attached image: 0_x360_resolverate.jpg]

                  VS Xbox One at 853 MHz = 13.3 Gpix/sec resolve rate, or 12.8 Gpix/sec at 800 MHz
                  ================================================== =======
                  [Attached image: 0_pixel_resolve_rate_s.jpg]


                  • srenia_ia
                    Good research. For all the PC-only gaming crowd boasting about numbers, the numbers don't mean nearly as much as they believe. Games using 50 percent or less of high-end setups is abysmal. Without the ability to use all the silicon at a consistent performance level, a system like the PS4 looks good on paper but doesn't deliver on those paper numbers.

                • #95
                  The Xbox One ADK (app development kit, e.g. for universal apps) surprisingly has fewer DX features, unless these are processed in a different block, as our digging here suggests.


                  • #96
                    To add more perspective that the X1 is not weak:
                    this gives you an idea of how much power is needed to encode 4K 60 FPS 10-bit HEVC.
                    It needs 36 cores (2x E5 v3) = 72 threads!
                    Let's say 1080p 30 FPS needs 4 times less CPU power;
                    if the scaling is linear, it still needs approximately 9 cores!

                    So whether the X1 uses the HP-APU or a DSP, the block responsible for encoding has to be powerful.
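
The scaling estimate above can be sketched like this. The 36-core figure and the factor of 4 are the post's own numbers. The second estimate, scaling by raw pixel throughput, is an added assumption shown for comparison, and real encoder cost is not necessarily linear in either.

```python
def pixel_rate(width, height, fps):
    """Pixels per second that the encoder has to process."""
    return width * height * fps


cores_4k60 = 36  # dual Xeon E5 v3 core count as given in the post

# The post's estimate: 1080p30 needs 4 times less CPU than 4K60.
cores_post_estimate = cores_4k60 / 4

# Alternative: scale linearly with pixel throughput (4x the pixels, 2x the fps).
throughput_ratio = pixel_rate(3840, 2160, 60) / pixel_rate(1920, 1080, 30)
cores_by_throughput = cores_4k60 / throughput_ratio

print(cores_post_estimate, throughput_ratio, cores_by_throughput)
```

Either way the estimate lands at several general-purpose cores' worth of work for real-time 1080p30 encoding, which is the point the post is making about a dedicated encode block.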


                    "In Las Vegas, Nevada, this week, MulticoreWare will be demonstrating high quality real-time 4K 10 bit HEVC at frame rates in excess of 60 FPS on a dual Intel Xeon E5 v3 server, occupying only one standard rack unit."

                    *) Also, native HEVC decode starts with Carrizo; other AMD CPUs do not have it natively in hardware!
                    Even Carrizo's native support is only 8-bit. The X1 supports up to 10-bit.


                    • #97
                      Looking back, MS surely tried to deal with all of this in the X1 design, as they stated in the interview.
                      The CP is a bottleneck, as current AMD hardware only supports up to 9 streams (8 compute + 1 gfx);
                      on the other hand, the X1 supports up to 64 workload streams. And don't forget one thing:
                      there are many kinds of bottleneck that the X1 tries to deal with...
                      [Attached image: 187494_600 (1).jpg]

                      [Attached image: FAcojBK.jpg]


                      • #98
                        Raytracing on Console interview. Imgtec
                        Hi everyone, My name is Alex and I work for Imagination, the company that designs PowerVR GPUs, MIPS CPUs and other IP technologies used in many...


                        • srenia_ia
                          "I think the next evolution of graphics in mobile (and console) is ray tracing. A jump in performance is always nice to have (and we are always working to deliver better performance) but if you can deliver photorealistic graphics across a range of devices, then people will finally get the leaps made by GPUs in the last decade."

                          "Interest for our ray tracing technology remains high but I can't disclose the companies that we are talking to. Obviously, the number of companies that can take this type of technology into production is limited but we're not leaving any stone unturned."

                          "Finally, to answer your question about scaling: if you look at the 16-cluster PowerVR GT7900 GPU, it already gets quite close to current generation GPUs inside game consoles in terms of performance. However, we are an IP company which means we are interested in volume and in market disruptions; it is very difficult to break into a very saturated/mature market. But if a customer wants to enter that market using PowerVR, the IP is ready and available."

                          "The announcement of a processor and it being released to customers don't necessarily happen simultaneously.

                          There are also lead licensees that have access to pre-release versions (alpha, beta, etc.) - this can accelerate the process."


                          My words: Just some extra info from the strategic partners of MS. This Alex is talking about console gaming being able to use their ray tracing IP. Also, the IP being in the hands of MS (lead licensees) well before it was announced is revealing. The PowerVR GPU is not currently being used in the console market, unless someone wants to show that we're not looking at the whole IP package in the One. HEVC 10-bit, RTUs and other IP can be used as their customer desires. I believe that the One had a possibility of being solely an Imagination SOC at one point. The Rogue GR6500 Wizard SOC, which is HSA enabled, has the same attributes we see in the Xbox One. Being HSA made using parts of Imagination tech compatible with the AMD SOC. With MS research, the best components were put together with custom MS IP, bringing us the One. This partnership with the creators of the Dreamcast (Imgtec) is unique.

                          -10 watt envelope for 16 cluster SOC... PIM on the HBM Esram? 1 TF at FP16. Funny that he knows how expensive stacking memory on his SOC is.

                          "memory bandwidth is the main limiting factor."
                          "HBM is very expensive; packaging and the interposer adds significant cost. Even if there's no interposer and the memory is stacked on top of the SoC, just packaging with TSVs is hard and expensive to implement."

                          My words: Remember that even if this SOC isn't being specifically used in the One, this is the strategic partner of MS. A variation of this PIM is very likely with cross licenses. Also, the RTU is balanced for a 10 watt envelope SOC. Would that mean a 100 watt plus SOC would see a 10x more powerful RTU?

                          Let's assume that there are two 16-cluster PIM Imgtec HSA-enabled SOCs under the ESRAM in the One. That adds around 26 watts total with stacks of memory, and adds ray tracing and 2 TF of compute processing at 16 bit. This is assuming MS didn't customize it more.
                          Last edited by srenia_ia; 07-14-2015, 11:48 PM.

                      • #99
                        More and more in line with the SDK.
                        This is what happens
                        when the Game OS is suspended.
                        Focus on these:
                        - CE RAM
                        - CP internal memory

                        "Suspend and Resume events
                        On Xbox One, Game OS apps can be suspended and resumed by the Process Lifetime Manager (PLM). In the suspended state, the app’s memory is left intact, but the app has no CPU or GPU resources. It is an XR requirement for Xbox One that Game OS apps implement suspend and resume.

                        If an app receives the Suspending event, the app must call Suspend, otherwise the app will be terminated. Both the Suspend and Resume calls must operate on the title's render thread, to ensure that GPU state is saved off correctly by the Suspend call.

                        When the Suspend call is made, the Direct3D runtime will save the state of the context registers, CE RAM, ESRAM, GDS, Index buffers, some GPU registers (not all the GPU registers are readable), and CP internal memory. This state will be restored on the Resume call."

                        Think about it...
                        I said that MS modified the CP to be like a CPU, also related to the 768 scalar.
                        So if the Game OS is suspended, and the Game OS runs on Jaguar, they would certainly need to suspend part of the memory in the 8 GB DDR!
                        Why does it actually suspend the CP internal memory?

                        Unless the CP has its own memory, like the insider said, and, per my digging,
                        the 1024-bit memory controller is related to the large cache plus the X1's internal HBM.


                        • srenia_ia
                          +1. CP memory being the small cache between the CPUs?

                        • mistercteam
                          The internal memory, I believe, is where the 1GB eDRAM rumor comes from.
                          It also acts as a substantial cache for the CB-DB.
                          The eSRAM is still the fast 32MB one, acting like an L2/L3 for CPU-GPU on the gfx core.
                          The eSRAM is also 32MB as a general cache for the entire system, where Jaguar can see it too.
                          Basically the move engines copy back and forth from slower DDR into eDRAM,
                          and from slower eSRAM into faster eSRAM.

                      • Now more and more data from the patent; the front-end
                        part is starting to be more in line with the XDK spec.

                        I will put up the slides later
                        when I have some time.

                        Now, from the AMD patent:
                        X1 has 8 VGT, 4 IA, 8 SX
                        PS4 has 2 VGT, 2 IA, 2 SX

                        *) The CUs below are CU groups (each can have 6-12 CUs active)

                        also about VGT structures in more detail