HiSilicon Kirin 960: A Closer Look at Performance and Power

by Matt Humrick on March 14, 2017 7:00 AM EST

    HiSilicon’s Kirin 950 proved to be a breakout product for the Huawei subsidiary, ultimately finding a home in many of Huawei’s flagship phones, including the Mate 8, P9, P9 Plus, and Honor 8. Its big.LITTLE combination of four A72 and four A53 CPU cores manufactured on TSMC’s 16nm FF+ FinFET process delivered excellent performance and efficiency. Somewhat surprisingly, it turned out to be one of the best, if not the best, implementations of ARM’s IP we’ve seen.

    Because of the 950’s success, we were eager to see what improvements the Kirin 960 could offer. In our review of the Huawei Mate 9, the first device to use the new SoC, we saw gains in most of our performance and battery life tests relative to the Mate 8 and its Kirin 950 SoC. Now it’s time to dive a little deeper and answer some of our remaining questions: How does IPC compare between the A73, A72, and other CPU cores? How is memory performance impacted by the A73’s microarchitecture changes? Does CPU efficiency improve? How much more power do the extra GPU cores consume?

    Kirin 960 vs. Kirin 950

    Memory
        Kirin 950: 2x 32-bit LPDDR3 @ 933MHz (14.9GB/s) or 2x 32-bit LPDDR4 @ 1333MHz (21.3GB/s) (hybrid controller)
    Interconnect
        Kirin 960: ARM CCI-550
        Kirin 950: ARM CCI-400
    Storage
        Kirin 960: UFS 2.1
        Kirin 950: eMMC 5.0
    ISP/Camera
        Kirin 960: Dual 14-bit ISP (Improved)
        Kirin 950: Dual 14-bit ISP, 940MP/s
    Encode/Decode
        Kirin 960: 2160p30 HEVC & H.264 Decode & Encode; 2160p60 HEVC Decode
        Kirin 950: 1080p H.264 Decode & Encode; 2160p30 HEVC Decode
    Integrated Modem
        Kirin 960: Kirin 960 Integrated LTE (Category 12/13); DL = 600Mbps (4x20MHz CA, 64-QAM); UL = 150Mbps (2x20MHz CA, 64-QAM)
        Kirin 950: Balong Integrated LTE (Category 6); DL = 300Mbps (2x20MHz CA, 64-QAM); UL = 50Mbps (1x20MHz CA, 16-QAM)
    Sensor Hub
    Mfc. Process
        Kirin 960: TSMC 16nm FFC
        Kirin 950: TSMC 16nm FF+
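
    The bandwidth figures quoted for the LPDDR3 and LPDDR4 configurations above can be reproduced with simple arithmetic. The sketch below assumes double data rate signaling (two transfers per clock) and decimal gigabytes; the function name and its parameters are illustrative and do not come from any vendor tool.

```python
def peak_bandwidth_gbs(channels: int, bus_width_bits: int,
                       clock_mhz: float, transfers_per_clock: int = 2) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumes DDR signaling (two transfers per clock) unless overridden.
    """
    bytes_per_transfer = channels * bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# 2x 32-bit channels, as listed in the table
lpddr3 = peak_bandwidth_gbs(channels=2, bus_width_bits=32, clock_mhz=933)
lpddr4 = peak_bandwidth_gbs(channels=2, bus_width_bits=32, clock_mhz=1333)

print(f"LPDDR3 @ 933MHz:  {lpddr3:.1f} GB/s")   # 14.9 GB/s
print(f"LPDDR4 @ 1333MHz: {lpddr4:.1f} GB/s")   # 21.3 GB/s
```

    These are theoretical peaks; sustained bandwidth in real workloads will be lower.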

    The Kirin 960 is the first SoC to use ARM’s latest A73 CPU cores, which seems fitting considering the Kirin 950 was the first to use ARM’s A72. Its CPU core frequencies see a negligible increase relative to the Kirin 950: 1.81GHz to 1.84GHz for the four A53s, and 2.30GHz for the 950’s four A72s to 2.36GHz for the 960’s four A73s. Setting the peak operating point for the A73 cores lower than the 2.52GHz used by Kirin 955’s A72 cores, and lower still than the 2.8GHz that ARM targets for 16nm, is an interesting and deliberate choice by HiSilicon to limit the CPU’s power envelope, allowing the bigger GPU to take a larger chunk.

    We’ve already discussed the A73’s microarchitecture in depth, so I’ll just summarize a few of the highlights. For starters, the A73 stems from the A17 and does not belong to the A15/A57/A72 Austin family tree. This means the differences between the A72 and A73 are more substantial than the small change in product numbering would suggest, particularly in the CPU’s front end.

    The biggest difference is a reduction in decoder width, which is now 2-wide instead of 3-wide like the A72. This sounds like a downgrade on paper; however, there are likely some workloads where the A72’s instruction fetch block fails to consistently saturate the decoder, so the actual performance impact of the A73’s narrower decode stage may not be that severe.

    In many cases, instruction dispatch throughput should actually improve relative to the A72. The A73’s shorter pipeline reduces front-end latency, including 1-2 fewer cycles for the decoder, which can decode most instructions in a single cycle, and 1 fewer cycle for the fetch stage. The L1 instruction cache doubles in size and is optimized for better throughput, and changes to the instruction fetch block reduce instruction bubbles. ARM also says the A73 includes a new, more accurate branch predictor, with a larger BTAC (Branch Target Address Cache) structure and a new 64-entry “micro-BTAC” for accelerating branch prediction.

    There are several other changes to the front end too, not to mention further along the pipeline, but it should be obvious by now that the A73 is a very different beast than the A72, grown from a different design philosophy. While the Austin family (A72) targeted industrial and low-power server applications in addition to mobile, the A73 focuses specifically on mobile, where power and area become an even higher priority. ARM says the A73 consumes 20%-30% less power than the A72 (same process, same frequency) and is up to 25% smaller (same process, same performance targets).

    When it comes to Kirin 960’s GPU, however, HiSilicon is clearly chasing performance instead of efficiency. With its previous SoCs, the Kirin 950/955 in particular, HiSilicon was criticized for using four-core Mali configurations while Samsung packed in eight or twelve Mali cores in its Exynos SoCs and Qualcomm squeezed more ALU resources into its Adreno GPUs. This was not entirely justified, though, because the Kirin 950’s Mali-T880MP4 GPU was capable of playing nearly any game available at acceptable frame rates and the performance difference between the Mate 8 (Kirin 950), Samsung Galaxy S7 edge (Snapdragon 820), and Galaxy S7 (Exynos 8890) after reaching thermal equilibrium is minimal.

    Whether in response to this criticism or to enable future use cases such as VR/AR, HiSilicon has significantly increased the Kirin 960’s peak GPU performance. Not only is it the first to use ARM’s latest Mali-G71 GPU, but it doubles core count to eight and boosts the peak frequency to 1037MHz, 15% higher than the 950’s smaller GPU.
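
    As a rough sanity check on those numbers, the sketch below works out the GPU clock implied for the Kirin 950 and a naive upper bound on peak scaling. It assumes throughput scales linearly with core count and frequency and ignores the microarchitecture change from the Mali-T880 to the Mali-G71, memory bandwidth, and thermal limits, so it is a back-of-the-envelope figure and not a performance prediction. The variable names are illustrative.

```python
# Figures taken from the text: core count doubles from four to eight,
# and the 1037MHz peak clock is 15% higher than the Kirin 950's.
old_cores, new_cores = 4, 8
new_clock_mhz = 1037
clock_gain = 0.15

# Clock implied for the older GPU by the quoted 15% increase
implied_old_clock_mhz = new_clock_mhz / (1 + clock_gain)

# Naive scaling: assumes identical per-core, per-clock throughput,
# which does not hold across different GPU microarchitectures.
naive_scaling = (new_cores / old_cores) * (1 + clock_gain)

print(f"Implied Kirin 950 GPU clock: ~{implied_old_clock_mhz:.0f} MHz")
print(f"Naive peak scaling factor:   {naive_scaling:.2f}x")
```

    Under these assumptions the cores-times-clock product grows by about 2.3x, which helps explain why the bigger GPU needs a larger share of the power envelope.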

    The Mali-G71 uses ARM’s new Bifrost microarchitecture, which moves from a SIMD ISA that relied on Instruction Level Parallelism (ILP) to a scalar ISA designed to take advantage of Thread Level Parallelism (TLP) like modern desktop GPU architectures from Nvidia and AMD. I’m not going to explain the difference in depth here, but basically this change allows better utilization of the shader cores, increasing throughput and performance. ARM’s previous Midgard microarchitecture needed to extract 4 instructions from a single thread and execute them concurrently to achieve full utilization of a single shader core, which is not easy to do consistently. In contrast, Bifrost can group 4 separate threads together on a shader core and execute a single instruction from each one, which is more in line with modern graphics and compute workloads.
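
    The utilization argument can be illustrated with a toy model. The sketch below is not ARM’s scheduler; it simply treats a shader core as four issue slots per cycle and compares filling them from one thread’s independent instructions (ILP) against filling them with one instruction from each of four threads (TLP). The workload numbers and function names are made up for illustration.

```python
LANES = 4  # issue slots per cycle in this toy shader core


def ilp_utilization(independent_ops_per_thread):
    """Midgard-style toy model: each cycle serves one thread, which must
    supply up to LANES independent instructions on its own."""
    used = sum(min(ops, LANES) for ops in independent_ops_per_thread)
    return used / (LANES * len(independent_ops_per_thread))


def tlp_utilization(ready_threads):
    """Bifrost-style toy model: threads are grouped LANES at a time and
    each thread contributes a single instruction per cycle."""
    if ready_threads == 0:
        return 0.0
    groups = -(-ready_threads // LANES)  # ceiling division
    return ready_threads / (groups * LANES)


# Hypothetical workload: independent instructions each thread can offer
ilp_per_thread = [1, 2, 4, 3, 1, 2, 4, 2]

print(ilp_utilization(ilp_per_thread))       # 19 of 32 slots filled
print(tlp_utilization(len(ilp_per_thread)))  # 8 threads fill 2 full groups
```

    In this made-up workload the ILP-dependent design fills 19 of 32 slots, while the thread-grouping design fills every slot as long as the thread count is a multiple of four. Real shaders also contend with divergence and memory stalls, which this model ignores.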

    Now that we have a better understanding of Kirin 960’s design goals (better efficiency for the CPU and higher peak performance for the GPU) and a summary of the hardware changes HiSilicon made to achieve them, we’re ready to see how the performance and power consumption of the Kirin 960 compare to the 950/955 and other recent SoCs.


    niva - Tuesday, March 14, 2017 - link

    Well I for one am glad to see this deep dive into the performance. It gives a much more complete picture of what's happening. People on Android Central were giving this chip such glowing reviews and I really wasn't sold on it yet. That being said I'm fairly confident AC is sponsored by Huawei because any phone they push out gets glowing reviews despite its Chinese hackware. Only Huawei phone worth buying is the Nexus 6P and still remains so. This won't change even if/when they actually do make better hardware.

    close - Tuesday, March 14, 2017 - link

    "Only Huawei phone worth buying is the Nexus 6P [...]. This won't change even if/when they actually do make better hardware."
    So you "know" that even if they make better hardware only the 6P will be worth buying? Crystal ball much? Are you also sponsored by somebody or did you just choose ignorance? And I'm being delicate here.

    Alexvrb - Tuesday, March 14, 2017 - link

    A lot of reviewers take free bread. Especially ones that aren't making enough off ads alone, or that have a personal slant, etc. I'm not saying that includes AC, but you shouldn't dismiss it so casually either. The most effective place to apply grease is reviewers and review sites. The Kirin 960 is obviously a step sideways, so to give it glowing reviews is hilarious.

    However... even though the process and design they chose may be hindering power consumption at these clocks, they did have another goal in mind. Cost. If they really achieved such massive increases in density, and have a similar % yield, they can sell bucketloads of these chips for cheap. So at least for the mid-range devices, these would be plenty good for the foreseeable future.

    close - Wednesday, March 15, 2017 - link

    A lot of internet commentators are paid to praise one side or company and accuse competing ones. This is an undisputed fact so you shouldn't dismiss this casually.

    The point wasn't whether AC is or isn't on Huawei's payroll but rather the attitude of the person commenting who basically disqualified themselves by saying that "even with better products they will never be worth buying". That sounds like the user doesn't actually care about the product or the facts, and is only here to criticize Huawei and anything that might be related to them.

    And no matter what he says in the future none of it can be taken seriously and contain any trace of objective value ;).