[A-48] ARMv9/v8 - Power State Management Mechanism (PSCI Coordination Mechanism)

Published: 2026/7/5 20:33:32
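ver0.1

Preface

In the previous article we introduced the software architecture of the PSCI mechanism (including the virtualization architecture). The architecture looks simple, and in fact it is not hard at all, provided you have some background. If it feels like heavy going, the honest move is to go back and build up the fundamentals; in other words, if your boss asks you to handle stability issues and optimization in the power area, you are not ready for it yet.

The PSCI architecture can be divided into two parts, the proxy and the implementation. Depending on how the exception model is implemented, the PSCI proxy mainly lives at EL1 and EL2, while the PSCI implementation is at EL3. The proxy does essentially one thing: based on the state of each VM, it decides the state of a physical core and notifies EL3. There, the PSCI implementation in the firmware decides a (PE-Core/Cluster/System) state and passes it to the SCP. The SCP then programs the PPUs distributed across the system, which changes the power modes of the corresponding power domains and completes the conversion from software-level power states to hardware-level power modes.

One question remains: how does the power state of a single physical core become a set of (PE-Core/Cluster/System) states that is handed to the SCP? This is where PSCI's coordination role comes in, and it is what we discuss in detail today.

As usual, before reading this article, please read the earlier articles in the series to pick up the basics and get a feel for the topic:

(1) [V-02] Virtualization Basics - CPU Architecture (based on AArch64)
(2) [A-03] ARMv8/ARMv9 - Multi-level Cache Architecture
(3) [A-21] ARMv8/v9 - SMMU System Architecture and Functional Overview
(4) [A-25] ARMv8/v9 - GIC System Architecture (the hardware basis of interrupts)
(5) [A-38] ARMv8/v9 - Generic Timer System Architecture
(6) [A-41] ARMv9/v8 - Power Management System Architecture
(7) [A-42] ARMv9/v8 - How Power Management Works (SCP Service Overview)
(8) [A-43] ARMv9/v8 - Introduction to the Power Control Framework (PCF Overview)
(9) [A-0x2c] ARMv9/v8 - Power Management Domains (Voltage Domain/Power Domain)
(10) [A-45] ARMv9/v8 - Power Modes
(11) [A-46] ARMv9/v8 - Power States
(12) [V-05] Virtualization Basics - Exception Model (AArch64)
(13) [A-47] ARMv9/v8 - Power State Management Software Architecture (PSCI Architecture)

Main Text

1.1 Background of the coordination mechanism

From the introduction to the PSCI architecture, it should be clear that PSCI exists to decouple two things: the power state management that the upper-layer operating system (EL0/EL1) performs on Arm PE-Cores, and the control of the power modes of the real physical hardware. Let us briefly review, with the help of Figure 1-1, the mapping between PE-Core power states and power modes.

Figure 1-1 DSU-120 power domains

The microarchitecture of a modern CPU has become very complex. Its most notable feature is that it integrates many functional units that assist the CPU, and from a power management point of view each of these units is a power domain (Power Domain). While running, these power domains are in different power modes (Power Mode) depending on the scenario, as shown in Figure 1-2.

Figure 1-2 DSU-120 power domains

Let us zoom in on the DSU in Figure 1-1, because the power domains PSCI cares about most are the PE-Cores of the CPU cluster (DSU). The power modes supported by the PE-Cores are shown in Figures 1-3 and 1-4.

Figure 1-3 Cortex-A720 core power modes (Part-1)

Figure 1-4 Cortex-A720 core power modes (Part-2)

A single PE-Core also has 6 power modes, as described in the figures above. Under different operating conditions the core can move between these power modes, as shown in Figure 1-5.

Figure 1-5 Cortex-A720 core power mode transitions

The PE-Core is the most important power domain in the DSU, so a change in its power mode also changes the operating conditions inside the CPU cluster that serves the PE-Cores. For example, once all PE-Cores in the DSU are OFF, the DSU clearly no longer needs to stay in its previous power mode. In other words, the DSU power mode follows the PE-Core power modes, as shown in Figure 1-6.

Figure 1-6 DSU-120 DynamIQ cluster PPU mode transitions

Strictly speaking this is not fully rigorous, but in the vast majority of cases these power mode changes start from the PE-Cores; after all, every component on the DSU exists to serve them. The trigger for a PE-Core power mode switch is intervention by the OSPM: by changing the power state it brings about a power mode change, which sets off the chain of reactions that follows. We covered this in detail when we described the PCSA system architecture and how it works, so here we only include one figure to help locate the hardware/software boundary of the Arm power management system, as shown in Figure 1-7.

Figure 1-7 Mapping between power states and power modes

On the hardware side, the SCP absorbs the policy for power mode transitions and the PPU carries out the power mode switch. On the software side, the OSPM has to absorb the power state transitions. But as Figure 1-1 shows, one DSU contains many PE-Cores. Taking DSU-120 as an example:

You can configure the cluster to have between one and 14 cores. Each core within a complex counts towards the total number of cores in the cluster. This is in addition to any cores in the cluster that are not in complexes (stand-alone cores).

This latest-generation cluster can have up to 14 PE-Cores. One PE-Core being idle does not mean the others are idle, and once the whole software world is virtualized its operating conditions become even more complex. Power state management therefore has to be decoupled from the OSPM: the OSPM only needs to choose a suitable power state for its own conditions and call the PSCI interface directly. This part of the work is called the PSCI-Proxy; it collects the power states that the various software modules want under the current conditions.

If everyone wants the same power state, fine. If not, for example PE-Core-1 wants ON while PE-Core-2 wants OFF, the software world has conflicting power state demands and the PSCI mechanism has to step in. It arbitrates according to the established power state management policy, produces an arbitrated result, and outputs it to the SCP, which then runs the rest of the power mode management flow. This arbitration is precisely how PSCI coordinates all parties, and the concrete implementation can be called the PSCI-Server.

1.2 The PSCI coordination mechanism

Following the background above, we split PSCI's work into two parts, Proxy and Server. The Proxy collects states, while the actual power state coordination is done in the PSCI-Server. Let us look at this part.

1.2.1 Power States (shallow or deep)

As the coordinator, the PSCI-Server has to find a balance between power consumption and system wakeup latency, and the metric for this is the depth of the power state:

When a core is idle, the OSPM transitions it into a low-power state. Typically, a choice of states is available, with different entry and exit latencies, and different levels of power consumption associated with each state. The state that is used typically depends on how quickly the core will be needed again. The power states that can be used at any one time might also depend on the activity of other components in a SoC, in addition to the cores. Each state is defined by the set of components that are clock-gated or power-gated when the state is entered. States are sometimes described as being shallow or deep.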
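Typically, a state X is said to be deeper than a state Y if:
• The set of components that are powered down in state X subsumes and is a superset of the corresponding set for state Y.
• The set of components that is powered down in state X is the same as the corresponding set for state Y, but various power modes are supported, and the modes used in state X save more power than those used in state Y.
The time required to move from a low-power state to a running state is known as the wakeup latency. Generally, deeper power states have longer wakeup latencies, but this is not necessarily always the case.

The manual says that deeper power states generally have longer wakeup latencies. So how do we tell which power state is deeper?

Deeper:

(1) Take the cluster as an example. Once an SoC design is finalized, the components on the DSU are partitioned into different power domains. The more power domains a power state has to power down, the deeper that state is. As the manual describes:

In addition to retention features, the DynamIQ™ Shared Unit-120 (DSU-120) can further reduce static leakage power, using three powerdown features.
• Optionally power down half, or all except one, of the L3 cache slices.
• Within each L3 cache slice, power down a portion of the L3 cache RAM that the cache slice contains.
• Use Quick Nap with L3 data RAMs, for fine-grained automatic transitions to a low-leakage power mode.

Even within retention mode, the L3 cache can be further subdivided and sliced depending on requirements. If power state X has to manage the state of the L3 cache and power state Y does not, then X is deeper than Y.

(2) If two power states manage the same set of power domains, the one that saves more power is deeper. Take the L3 cache again: suppose both power states cover the power mode of the L3 cache, but state X powers the L3 cache down and state Y does not. Then X is deeper than Y, because X saves more power. In this case X is also slower to wake than Y, because one more power domain, the L3 cache, has to be powered back up.

With rules (1) and (2) in hand, we can apply them to Figure 1-8 and read off the depth ordering of the PE-Core power states.

Figure 1-8 Cortex-A55: mapping AP core power states to modes

1.2.2 Power state system topologies

The power domains typically involved in the power management of an SoC are shown in Figure 1-9.

Figure 1-9 SoC voltage and power domain partitioning example

All the power domains share the same turf, so each has to look out for the feelings of the other domains in the organization (the SoC). Domains with dependencies between them in particular have to get along and must not cause trouble. Even the PE-Core power states that PSCI covers have to take the whole picture into account when they change. The second factor the PSCI-Server has to consider is therefore the hierarchical relationship between the power states of the power domains:

Although idle power management is driven by thread behavior on a core, the OSPM can place the platform into states that affect many other components beyond the core itself.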
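For example, if the last core in a SoC goes idle, the OSPM can target power states that affect the whole SoC. The choice is also driven by the use of other components in the system, and therefore might require coordination among multiple agents. A typical example is placing the system into a state where memory is in self-refresh when all cores, and any other requesters, are idle. The OSPM has to provide the necessary power management software infrastructure to determine the correct choice of state.

Each component in a power domain has a set of power states that affect the components in the domain. Although physically the power domains are not necessarily built in a hierarchical fashion, from a software control point of view, they are arranged in a logical hierarchy. The hierarchy arises out of ordering dependencies that are required when placing the power domains into different power states. For example, consider a power domain that encompasses a shared cache, and power domains for the cores that use it. In such a system, the core power domains must be powered down before the shared cache domain, to guarantee correct operation.

Let us simplify Figure 1-9 and keep only the parts closely related to PSCI, as shown in Figure 1-10.

Figure 1-10 Example power domain topology

Reading the manual's description again alongside Figures 1-1, 1-9 and 1-10 should make it sink in further:

(1) After heavy abstraction, a PE-Core is a leaf node of the power domain hierarchy, and its power state affects at least two more levels of power domains besides itself (Cluster and System).

(2) Each of these two levels is itself associated with many other power domains, for example the important L3 cache inside the cluster. So for the same Retention state, Cluster Retention is deeper than PE-Core Retention. Likewise, System Retention is deeper than Cluster Retention, because the System is associated with more power domain components.

(3) When PSCI intervenes in a PE-Core power state, it indirectly affects the Cluster and System power states. Take the example from the manual: once all PE-Cores in a cluster have entered Retention, the cluster no longer needs to stay in Run. After the cache and bus have been taken care of, a policy can also put the cluster into a sleep state. An important job of the PSCI-Server is to keep a global view and not look only at its own small patch.

(4) Note also that PSCI supports system-level power state operations, which means the power state of the root node in turn affects the power states of its child nodes.

Below we quote the manual's description in full. The manual takes a much broader view (multiple SoCs, multiple software architectures), and it is hard to decide which part to excerpt. (We plan a dedicated article on this later; the initial idea is to base it on the software implementation of a specific MTK or Qualcomm chip.)

PSCI provides an interface to allow an OS to request system shutdown, system reset, and system suspend (suspend to RAM and suspend to disk).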
This allows a silicon vendor to provide a common implementation of these functions that is independent of the supervisory software running on the device.

The usage of the term system in the PSCI function definitions refers to the machine view that is available to the calling OS. If the caller is a guest running in a virtual machine, system shutdown, reset, and suspend operations affect the virtual machine and might not result in any physical power changes.

However, if a hypervisor is not present, or the caller is a hypervisor, the result is physical changes in power. Even if the caller is running on a physical machine, the term system might not mean the entire physical machine. For example, consider an advanced server system consisting of multiple boards, each with a board management controller (BMC), and each containing multiple SoCs. Such a system could run an OS instance per SoC. In this example, a PSCI command to shut down the system applies to a single SoC, while powering down the entire board requires access to the BMC through an administration interface that is beyond the reach of the calling OS or a PSCI implementation. In this document, the term system refers only to the machine view that is visible to an OS. In the example above (Figure 1-10), this maps to a single SoC.

1.2.3 Power state coordination

Now that the two dimensions along which PSCI coordinates power states (depth and topology) are clear, we come to the core of the PSCI mechanism:

Entry into local power states for high-level nodes in a power topology (for example, clusters or system) requires coordinating children nodes. For example, entry into a cluster powerdown state is only possible when all cores in the cluster are powered down.
To achieve this, every core but the last one must be placed into a powerdown state, and the last one places itself and the cluster into a powerdown state.

PSCI supports two modes of power state coordination, platform-coordinated mode, and OS-initiated mode.

The PSCI-Server can decide the power state of a parent node (Cluster/System) only after it has coordinated the states of the individual PE-Cores at the system architecture level. Arm defines two ways of implementing the PSCI-Server: platform-coordinated mode and OS-initiated mode.

1.2.3.1 Platform-coordinated mode

Before PSCI 1.0, platform-coordinated mode was the only mode supported.

This is the default mode of coordination. In this mode, the PSCI implementation is responsible for coordinating power states. When a core has no more work to do, the OSPM requests the deepest state it can tolerate for that core and its parent nodes. For power state requests that affect a topology node above the core level, the implementation chooses the deepest power state that can be tolerated by all the cores in the node. In effect, the power state request expresses the following two constraints:
• The caller allows entry to states up to this depth, but no deeper.
• The caller cannot tolerate a higher wakeup latency than that associated with the requested state.
The PSCI implementation then determines the deepest state that satisfies the constraints expressed by each core in a given node.

Coordination in this mode has to respect two constraints:

(1) After arbitration by the PSCI-Server policy, the requesting node must not be placed in a state deeper than the one it requested. Suppose the depth ordering is State A < State B < State C. If a PE-Core requests State B, it must never be placed in State C.

(2) The wakeup latency must not exceed that of the requested state. Suppose the depth ordering is State A < State B < State C, but the wakeup latency ordering is State A < State C < State B (this is unusual). If a PE-Core requests State C, rule (1) alone would allow State B or State A. With rule (2) added, the PE-Core cannot be placed in State B, because State B has a longer wakeup latency than State C; the possible choices are State A or staying with State C. In practice, vendors' power state implementations very rarely look like this.

With these rules in place, let us see how platform-coordinated mode works, as shown in Figure 1-11.

Figure 1-11 Example platform coordination of power state requests

The PSCI-Proxy receives the power state requests that the OSPM sends through the PSCI interface. Arm recommends that each request be expressed as an encoded value, as shown in Figure 1-12.

Figure 1-12 StateID encodings for local and composite states in example system

Reading Figures 1-11 and 1-12 together makes things clearer; matching colors can be used to cross-check them.

(1) The PSCI-Proxy collects the state of the current PE-Core and forwards it to the PSCI-Server. Each request carries the three-level state (Core, Cluster, System) of the power topology branch the PE-Core belongs to.

(2) Once the PSCI-Server has the states of PE-Core 0 and PE-Core 1, it decides the final three-level state. For example, the System can enter Retention only after every topology branch that contains a PE-Core has requested Retention at all three levels (Core, Cluster, System).

(3) The other state transitions can be worked out with the same rules. Note that this depends on each chip vendor's implementation; in a production project, consult the chip's manual for the details.

1.2.3.2 OS-initiated mode

PSCI introduced OS-initiated mode in version 1.0.

Introduced in PSCI 1.0, OS-initiated mode places the responsibility for coordination on the calling OS. In the OS-initiated coordination scheme, OSPM only requests an idle state for a particular topology node when the last underlying core goes idle.

When a core goes idle it always selects an idle state for itself, but idle states for higher-level nodes such as clusters are only selected when the last running core in the node goes idle. In addition, the implementation only considers the most recent request for a particular node when deciding on its idle state.

Compared with platform-coordinated mode, a request in this mode must state clearly whether the current PE-Core is the last one. That is, the first column of Figure 1-12 (Last in Level) can no longer be all zeros and has to be set explicitly, as shown in Figure 1-13.

Figure 1-13 StateID sample encoding

Again, an example helps in understanding OS-initiated coordination, as shown in Figure 1-14.

Figure 1-14 Example flow: Cluster powerdown entry

Here is how the manual describes this mode:

As the table illustrates, there are periods (marked in red) where the OS view of core state and the implementation view of core state do not match. This might happen after the OS requests a state, but before the implementation has processed the request. This can also happen when a core powers up, as the implementation sees the core before the OS. To implement OS-initiated mode, it is necessary to deal with the races that arise due to the differing views of core state. Solving the races gives rise to the following requirements:
• The implementation must deny any requests from the calling OS that are inconsistent with its view of core state.
• The calling OS must indicate when the calling core is the last running core at a particular power hierarchy level.
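It must also specify which power hierarchy level the core is last in, for example, whether it is the last core in the cluster or the last core in the system.

In OS-initiated mode, a power state switch issued by a PE-Core takes one cycle to complete, so the OS view of the power state and the PSCI-Server view only converge at the end of that cycle. In other words, the request cannot be answered immediately as in platform-coordinated mode. The main reason is that OS-initiated mode is generally used in more complex system architectures (virtualization, security architectures, multi-SoC environments), where the arbitration policy needs one cycle to combine the demands of all parties before it can give the SCP a definite power state setting. The response policy of platform-coordinated mode, by contrast, is simple and blunt, and far less fine-grained and flexible than OS-initiated mode. This flexibility comes at a price:

(1) Before it issues the operation, the current PE-Core (vPE-Core) has to indicate its Last In Level information. This requires the PSCI-Proxy to track the running state of these PE-Cores at both the cluster level and the SoC level.

(2) The PSCI-Server, as the PSCI implementation, has to resolve races between concurrent requests from the PE-Cores within one cycle, and in particular has to reject invalid state requests. These include, but are not limited to, the following cases (this also depends on each chip vendor's implementation, especially how the power handling policy is divided between the SCP and the AP):
• A power state transition that does not correspond to a valid transition of the power mode state machine it maps to (Figures 1-5 and 1-6).
• Inconsistent power state requests from the same PE-Core within one processing cycle, for example requesting RUN for the PE-Core and then requesting PD or Retention for that PE-Core's cluster.
• Competing state requests from different PE-Cores within one processing cycle. These must not conflict and have to be coordinated according to the depth and wakeup latency rules.

1.2.3.3 Differences between the two coordination modes in implementation

We have already covered the PSCI architecture, so we will not expand on it here and will only discuss the differences between the two modes. First, the typical architecture of platform-coordinated mode is shown in Figure 1-15.

Figure 1-15 Platform-coordinated mode SW ARCH

Next, the typical architecture of OS-initiated mode is shown in Figure 1-16.

Figure 1-16 OS-initiated mode SW ARCH

The earlier sections already touched on some differences between the two modes; here we only add a few software-level considerations.

(1) Who makes the decision
• In platform-coordinated mode the decision is made in the firmware at EL3, and the VM vendors and the virtualization vendor have almost no discretion.
• In OS-initiated mode the decision is made in the hypervisor that runs on top of the EL3 firmware, or even directly at EL1 (depending on the virtualization solution). This gives software vendors more room for customization. (Unless you are fully confident, we recommend staying with the chip vendor's default policy.)

(2) Efficiency
• Platform-coordinated mode requires each VM to trap into EL3 frequently, passing through EL2 on the way. If PE-Core power states switch often, these ELx transitions are a significant overhead.
• OS-initiated mode can skip some of the traps into EL3, which serves as an optimization.

(3) Trend
• Platform-coordinated mode is better suited to simple system architectures with uncomplicated power states, such as MCUs and early Arm SoCs. Its biggest advantage is fast decisions.
• OS-initiated mode is better suited to complex system architectures (phones, servers) with complex power states. Its biggest advantage is that it handles more scenarios, but its decisions are somewhat slower than in platform-coordinated mode, since it has to consider more of the system architecture and more PE-Core states.

Conclusion

That concludes our discussion of the PSCI coordination mechanism. We started from the mapping between power states and power modes and discussed in detail why PSCI has to step in to coordinate the power states of the PE-Cores. Before introducing the coordination mechanism itself, we covered two important pieces of PSCI background: power state depth and power state topology. With that groundwork we introduced PSCI's two current coordination modes, platform-coordinated mode and OS-initiated mode, and analyzed both in detail, including the merging rules, the encoding conventions, and some practical aspects. Finally, we described the main differences between the two modes at the software architecture level. We had intended to cover the implementation and workflow of the concrete PSCI interfaces in this article as well, but for reasons of length that has to go into a second article. That is all for today. Thank you, and please follow, share, and comment.

Reference

[01] DEN0050D_Power_Control_System_Architecture.pdf
[02] armv8_a_power_management_100960_0100_en.pdf
[03] Power_Policy_Unit_Architecture_Specification_V_ARM_DEN_0051E.pdf
[04] DEN0024A_v8_architecture_PG.pdf
[05] 79-LX-LD-s003-Linux设备驱动开发详解4_0内核-3rd.pdf
[06] 80-PGxxx-35_QNX_Thermal_Manager_Overview.pdf
[07] 80-pgxxx-7_n_qnx_power_management_software_architecture_ref.pdf
[08] 80-ARM-POWER-HK0001_一文搞懂ARM_SoC功耗控制架构.pdf
[09] Arm_Power_and_Performance_Management_SCMI_White_Paper.pdf
[10] 80-ARM-POWER-cs0001_Arm-SoC-power功耗控制架构.pdf
[11] 80-LX-LK-cl0009_深入理解Linux电源管理.pdf
[12] DEN0056D_System_Control_and_Management_Interface.pdf
[13] arm_total_compute_2021_reference_design_software_developer_guide_en.pdf
[14] arm_total_compute_2022_reference_design_software_developer_guide_en.pdf
[15] arm_cortex_m85_processor_trm_en.pdf
[16] DEN0108_00eac0_smcf-archl-Specification.pdf
[17] DEN0022F.b_Power_State_Coordination_Interface.pdf
[18] MTxxxx_SCP_User_Manual_V1.0.pdf
[19] learn_the_architecture_arm_system_architectures_en.pdf
[20] arm_dsu_110_trm_101381_0400_11_en.pdf
[21] DEN0077A_Firmware_Framework_Arm_A_profile_EAC0.pdf
[22] 80-LX-POWER-PSCI-cs0001_Linux-PSCI框架.pdf
[23] learn_the_architecture_realm_management_extension_guide.pdf

Glossary

AP - application processor
OSPM - Operating System Power Management
WFI - Wait For Interrupt
WFE - Wait For Event
DVFS - Dynamic Voltage and Frequency Scaling
SCU - Snoop Control Unit
OPP - Operating Performance Point
PSCI - Power State Coordination Interface
PPU - Power Policy Unit
PCSA - Power Control System Architecture
SoC - System-on-Chip
PCF - Power Control Framework
SCP - System Control Processor
BSP - board support package
SCMI - System Control and Management Interface
EAS - Energy Aware Scheduling
IPA - Intelligent Power Allocation
ACPI - Advanced Configuration and Power Interface
LPI - Low-Power Idle
CPPC - Collaborative Processor Performance Control
PCSM - power control state machine
AOSS - Always-on subsystem
PMIC - Power Management Integrated Circuit
JM - job manager
AON - always on domain
SBSA - Server Base System Architecture
CLK_CTRL - Clock Controller
LPD - Low Power Distributor
LPC - Low Power Combiner
P2Q - P-Channel to Q-Channel Convertor
GPIO - General Purpose IO
RAS - Reliability, Availability, and Serviceability
STR - Suspend to RAM
PPF - Privileged platform firmware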
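
As a companion to section 1.2.3.1, the two platform-coordinated constraints (no deeper than requested, no higher wakeup latency than the requested state) can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the `PowerState` class, the functions `allowed_states` and `resolve_node_state`, and all depth and latency numbers are hypothetical and are not taken from the PSCI specification or from any firmware implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    """Hypothetical model of a local power state (not a PSCI data structure)."""
    name: str
    depth: int               # larger value = deeper state
    wakeup_latency_us: int   # illustrative numbers only


def allowed_states(requested, states):
    """States the implementation may pick for one caller's request.

    Constraint 1: no deeper than the requested state.
    Constraint 2: wakeup latency no higher than that of the requested state.
    """
    return [s for s in states
            if s.depth <= requested.depth
            and s.wakeup_latency_us <= requested.wakeup_latency_us]


def resolve_node_state(requests, states):
    """Deepest state that satisfies the constraints of every core in the node."""
    candidates = [s for s in states
                  if all(s in allowed_states(r, states) for r in requests)]
    # Assumes `states` contains a shallowest, lowest-latency state (for example
    # RUN), so at least one candidate always exists.
    return max(candidates, key=lambda s: s.depth)


# Usual case: deeper states also wake up more slowly.
RUN = PowerState("RUN", 0, 0)
RET = PowerState("RETENTION", 1, 50)
PD = PowerState("POWERDOWN", 2, 500)
NORMAL = [RUN, RET, PD]

# Core 0 tolerates cluster powerdown, core 1 only cluster retention.
cluster_state = resolve_node_state([PD, RET], NORMAL)

# The article's unusual case: depth A < B < C, wakeup latency A < C < B.
A = PowerState("A", 0, 10)
B = PowerState("B", 1, 300)
C = PowerState("C", 2, 100)
ODD = [A, B, C]
choices_for_c = [s.name for s in allowed_states(C, ODD)]

print(cluster_state.name)   # RETENTION
print(choices_for_c)        # ['A', 'C']
```

In the first case the cluster ends up in retention, because that is the deepest state both cores tolerate. In the second case a request for State C can be resolved to State A or State C but never to State B, which matches the reasoning under rule (2). A real implementation would apply the same resolution per topology level (Core, Cluster, System) using the vendor's actual state tables.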