diff --git a/doc/release-notes/skiboot-6.0.rst b/doc/release-notes/skiboot-6.0.rst new file mode 100644 index 000000000000..309734557a81 --- /dev/null +++ b/doc/release-notes/skiboot-6.0.rst @@ -0,0 +1,1027 @@ +.. _skiboot-6.0: + +skiboot-6.0 +=========== + +skiboot v6.0 was released on Friday May 11th 2018. It is the first +release of skiboot 6.0, which is the new stable release of skiboot +following the 5.11 release, first released April 6th 2018. + +Skiboot 6.0 is the basis for op-build v2.0 and will is *required* for +POWER9 systems. + +skiboot v6.0 contains all bug fixes as of :ref:`skiboot-5.11`, +:ref:`skiboot-5.10.5`, and :ref:`skiboot-5.4.9` (the currently maintained +stable releases). We do *not* expect any further stable releases in the +5.10.x series, nor in the 5.11.x series. + +For how the skiboot stable releases work, see :ref:`stable-rules` for details. + +Over skiboot-5.11, we have the following changes: + + +New Features +------------ + +Since 6.0-rc1: + +- Update default stop-state-disable mask to cut only stop11 + + Stability improvements in microcode for stop4/stop5 are + available in upstream hcode images. Stop4 and stop5 can + be safely enabled by default. + + Use ~0xE0000000 to cut all but stop0,1,2 in case there + are any issues with stop4/5. + + example: :: + + nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0x1FFFFFFF + + **Note**: that DD2.1 chips that have a frequency <1867Mhz possible *need* to + run a hcode image *different* than the default in op-build (set + `BR2_HCODE_LATEST_VERSION=y` in your config) +- ibm,firmware-versions: add hcode to device tree + + op-build commit 736a08b996e292a449c4996edb264011dfe56a40 + added hcode to the VERSION partition, let's parse it out + and let the user know. +- ipmi: Add BMC firmware version to device tree + + BMC Get device ID command gives BMC firmware version details. Lets add this + to device tree. User space tools will use this information to display BMC + version details. + +Since 5.11: + +- Disable stop states from OPAL + + On ZZ, stop4,5,11 are enabled for PowerVM, even though doing + so may cause problems with OPAL due to bugs in hcode. + + For other platforms, this isn't so much of an issue as + we can just control stop states by the MRW. However the + rebuild-the-world approach to changing values there is a bit + annoying if you just want to rule out a specific stop state + from being problematic. + + Provide an nvram option to override what's disabled in OPAL. + + The OPAL mask is currently ~0xE0000000 (i.e. all but stop 0,1,2) + + You can set an NVRAM override with: :: + + nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0xFFFFFFF + + This nvram override will disable *all* stop states. +- interrupts: Create an "interrupts" property in the OPAL node + + Deprecate the old "opal-interrupts", it's still there, but the new + property follows the standard and allow us to specify whether an + interrupt is level or edge sensitive. + + Similarly create "interrupt-names" whose content is identical to + "opal-interrupts-names". +- SBE: Add timer support on POWER9 + + SBE on P9 provides one shot programmable timer facility. We can use this + to implement OPAL timers and hence limit the reliance on the Linux + heartbeat (similar to HW timer facility provided by SLW on P8). +- Add SBE driver support + + SBE (Self Boot Engine) on P9 has two different jobs: + - Boot the chip up to the point the core is functional + - Provide various services like timer, scom, stash MPIPL, etc., at runtime + + We will use SBE for various purposes like timer, MPIPL, etc. + +- opal:hmi: Add missing processor recovery reason string. + + With this patch now we see reason string printed for CORE_WOF[43] bit. :: + + [ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred. + [ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error: + [ 477.352242181,7] HMI: PC - Thread hang recovery +- Add DIMM actual speed to device tree + + Recent HDAT provides DIMM actuall speed. Lets add this to device tree. +- Fix DIMM size property + + Today we parse vpd blob to get DIMM size information. This is limited + to FSP based system. HDAT provides DIMM size value. Lets use that to + populate device tree. So that we can get size information on BMC based + system as well. + +- PCI: Set slot power limit when supported + + The PCIe slot capability can be implemented in a root or switch + downstream port to set the maximum power a card is allowed to draw + from the system. This patch adds support for setting the power limit + when the platform has defined one. +- hdata/spira: parse vpd to add part-number and serial-number to xscom@ node + + Expected by FWTS and associates our processor with the part/serial + number, which is obviously a good thing for one's own sanity. + + +Improved HMI Handling +^^^^^^^^^^^^^^^^^^^^^ + +- opal/hmi: Add documentation for opal_handle_hmi2 call +- opal/hmi: Generate hmi event for recovered HDEC parity error. +- opal/hmi: check thread 0 tfmr to validate latched tfmr errors. + + Due to P9 errata, HDEC parity and TB residue errors are latched for + non-zero threads 1-3 even if they are cleared. But these are not + latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr + value and ignore them on non-zero threads if they are not present on + thread 0. +- opal/hmi: Print additional debug information in rendezvous. +- opal/hmi: Fix handling of TFMR parity/corrupt error. + + While testing TFMR parity/corrupt error it has been observed that HMIs are + delivered twice for this error + + - First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1. + - Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB. + + On second HMI we end up throwing "HMI: TB invalid without core error + reported" even though TB is in a valid state. +- opal/hmi: Stop flooding HMI event for TOD errors. + + Fix the issue where every thread on the chip sends HMI event to host for + TOD errors. TOD errors are reported to all the core/threads on the chip. + Any one thread can fix the error and send event. Rest of the threads don't + need to send HMI event unnecessarily. +- opal/hmi: Fix soft lockups during TOD errors + + There are some TOD errors which do not affect working of TOD and TB. They + stay in valid state. Hence we don't need rendez vous for TOD errors that + does not affect TB working. + + TOD errors that affects TOD/TB will report a global error on TFMR[44] + alongwith bit 51, and they will go in rendez vous path as expected. + + But the TOD errors that does not affect TB register sets only TFMR bit 51. + The TFMR bit 51 is cleared when any single thread clears the TOD error. + Once cleared, the bit 51 is reflected to all the cores on that chip. Any + thread that reads the TFMR register after the error is cleared will see + TFMR bit 51 reset. Hence the threads that see TFMR[51]=1, falls through + rendez-vous path and threads that see TFMR[51]=0, returns doing + nothing. This ends up in a soft lockups in host kernel. + + This patch fixes this issue by not considering TOD interrupt (TFMR[51]) + as a core-global error and hence avoiding rendez-vous path completely. + Instead threads that see TFMR[51]=1 will now take different path that + just do the TOD error recovery. +- opal/hmi: Do not send HMI event if no errors are found. + + For TOD errors, all the cores in the chip get HMIs. Any one thread from any + core can fix the issue and TFMR will have error conditions cleared. Rest of + the threads need take any action if TOD errors are already cleared. Hence + thread 0 of every core should get a fresh copy of TFMR before going ahead + recovery path. Initialize recover = -1, so that if no errors found that + thread need not send a HMI event to linux. This helps in stop flooding host + with hmi event by every thread even there are no errors found. +- opal/hmi: Initialize the hmi event with old value of HMER. + + Do this before we check for TFAC errors. Otherwise the event at host console + shows no error reported in HMER register. + + Without this patch the console event show HMER with all zeros :: + + [ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered] + [ 216.753498] Error detail: Timer facility experienced an error + [ 216.753509] HMER: 0000000000000000 + [ 216.753518] TFMR: 3c12000870e04000 + + After this patch it shows old HMER values on host console: :: + + [ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered] + [ 2237.652651] Error detail: Timer facility experienced an error + [ 2237.652766] HMER: 0840000000000000 + [ 2237.652837] TFMR: 3c12000870e04000 +- opal/hmi: Rework HMI handling of TFAC errors + + This patch reworks the HMI handling for TFAC errors by introducing + 4 rendez-vous points improve the thread synchronization while handling + timebase errors that requires all thread to clear dirty data from TB/HDEC + register before clearing the errors. +- opal/hmi: Don't bother passing HMER to pre-recovery cleanup + + The test for TFAC error is now redundant so we remove it and + remove the HMER argument. +- opal/hmi: Move timer related error handling to a separate function + + Currently no functional change. This is a first step to completely + rewriting how these things are handled. +- opal/hmi: Add a new opal_handle_hmi2 that returns direct info to Linux + + It returns a 64-bit flags mask currently set to provide info + about which timer facilities were lost, and whether an event + was generated. +- opal/hmi: Remove races in clearing HMER + + Writing to HMER acts as an "AND". The current code writes back the + value we originally read with the bits we handled cleared. This is + racy, if a new bit gets set in HW after the original read, we'll end + up clearing it without handling it. + + Instead, use an all 1's mask with only the bit handled cleared. +- opal/hmi: Don't re-read HMER multiple times + + We want to make sure all reporting and actions are based + upon the same snapshot of HMER in case bits get added + by HW while we are in OPAL. + +libflash and ffspart +^^^^^^^^^^^^^^^^^^^^ + +Many improvements to the `ffspart` utility and `libflash` have come +in this release, making `ffspart` suitable for building bit-identical +PNOR images as the existing tooling used by `op-build`. The plan is to +switch `op-build` to use this infrastructure in the not too distant +future. + +- libflash/blocklevel: Make read/write be ECC agnostic for callers + + The blocklevel abstraction allows for regions of the backing store to be + marked as ECC protected so that blocklevel can decode/encode the ECC + bytes into the buffer automatically without the caller having to be ECC + aware. + + Unfortunately this abstraction is far from perfect, this is only useful + if reads and writes are performed at the start of the ECC region or in + some circumstances at an ECC aligned position - which requires the + caller be aware of the ECC regions. + + The problem that has arisen is that the blocklevel abstraction is + initialised somewhere but when it is later called the caller is unaware + if ECC exists in the region it wants to arbitrarily read and write to. + This should not have been a problem since blocklevel knows. Currently + misaligned reads will fail ECC checks and misaligned writes will + overwrite ECC bytes and the backing store will become corrupted. + + This patch add the smarts to blocklevel_read() and blocklevel_write() to + cope with the problem. Note that ECC can always be bypassed by calling + blocklevel_raw_() functions. + + All this work means that the gard tool can can safely call + blocklevel_read() and blocklevel_write() and as long as the blocklevel + knows of the presence of ECC then it will deal with all cases. + + This also commit removes code in the gard tool which compensated for + inadequacies no longer present in blocklevel. +- libflash/blocklevel: Return region start from ecc_protected() + + Currently all ecc_protected() does is say if a region is ECC protected + or not. Knowing a region is ECC protected is one thing but there isn't + much that can be done afterwards if this is the only known fact. A lot + more can be done if the caller is told where the ECC region begins. + + Knowing where the ECC region start it allows to caller to align its + read/and writes. This allows for more flexibility calling read and write + without knowing exactly how the backing store is organised. +- libflash/ecc: Add helpers to align a position within an ecc buffer + + As part of ongoing work to make ECC invisible to higher levels up the + stack this function converts a 'position' which should be ECC agnostic + to the equivalent position within an ECC region starting at a specified + location. +- libflash/ecc: Add functions to deal with unaligned ECC memcpy +- external/ffspart: Improve error output +- libffs: Fix bad checks for partition overlap + + Not all TOCs are written at zero +- libflash/libffs: Allow caller to specifiy header partition + + An FFS TOC is comprised of two parts. A small header which has a magic + and very minimmal information about the TOC which will be common to all + partitions, things like number of patritions, block sizes and the like. + Following this small header are a series of entries. Importantly there + is always an entry which encompases the TOC its self, this is usually + called the 'part' partition. + + Currently libffs always assumes that the 'part' partition is at zero. + While there is always a TOC and zero there doesn't actually have to be. + PNORs may have multiple TOCs within them, therefore libffs needs to be + flexible enough to allow callers to specify TOCs not at zero. + + The 'part' partition is otherwise a regular partition which may have + flags associated with it. libffs should allow the user to set the flags + for the 'part' partition. + + This patch achieves both by allowing the caller to specify the 'part' + partition. The caller can not and libffs will provide a sensible + default. +- libflash/libffs: Refcount ffs entries + + Currently consumers can add an new ffs entry to multiple headers, this + is fine but freeing any of the headers will cause the entry to be freed, + this causes double free problems. + + Even if only one header is uses, the consumer of the library still has a + reference to the entry, which they may well reuse at some other point. + + libffs will now refcount entries and only free when there are no more + references. + + This patch also removes the pointless return value of ffs_hdr_free() +- libflash/libffs: Switch to storing header entries in an array + + Since the libffs no longer needs to sort the entries as they get added + it makes little sense to have the complexity of a linked list when an + array will suffice. +- libflash/libffs: Remove backup partition from TOC generation code + + It turns out this code was messy and not all that reliable. Doing it at + the library level adds complexity to the library and restrictions to the + caller. + + A simpler approach can be achived with the just instantiating multiple + ffs_header structures pointing to different parts of the same file. +- libflash/libffs: Remove the 'sides' from the FFS TOC generation code + + It turns out this code was messy and not all that reliable. Doing it at + the library level adds complexity to the library and restrictions to the + caller. + + A simpler approach can be achived with the just instantiating multiple + ffs_header structures pointing to different parts of the same file. +- libflash/libffs: Always add entries to the end of the TOC + + It turns out that sorted order isn't the best idea. This removes + flexibility from the caller. If the user wants their partitions in + sorted order, they should insert them in sorted order. +- external/ffspart: Remove side, order and backup options + + These options are currently flakey in libflash/libffs so there isn't + much point to being able to use them in ffspart. + + Future reworks planned for libflash/libffs will render these options + redundant anyway. +- libflash/libffs: ffs_close() should use ffs_hdr_free() +- libflash/libffs: Add setter for a partitions actual size +- pflash: Use ffs_entry_user_to_string() to standardise flag strings +- libffs: Standardise ffs partition flags + + It seems we've developed a character respresentation for ffs partition + flags. Currently only pflash really prints them so it hasn't been a + problem but now ffspart wants to read them in from user input. + + It is important that what libffs reads and what pflash prints remain + consistent, we should move the code into libffs to avoid problems. +- external/ffspart: Allow # comments in input file\ + +p9dsu Platform changes +---------------------- + +The p9dsu platform from SuperMicro (also known as 'Boston') has received +a number of updates, and the patches once carried by SuperMicro are now +upstream. + +Since 6.0-rc1: + +- p9dsu: timeout for variant detection, default to 2uess + + +Since 5.11: + +- p9dsu: detect p9dsu variant even when hostboot doesn't tell us + + The SuperMicro BMC can tell us what riser type we have, which dictates + the PCI slot tables. Usually, in an environment that a customer would + experience, Hostboot will do the query with an SMC specific patch + (not upstream as there's no platform specific code in hostboot) + and skiboot knows what variant it is based on the compatible string. + + However, if you're using upstream hostboot, you only get the bare + 'p9dsu' compatible type. We can work around this by asking the BMC + ourselves and setting the slot table appropriately. We do this + syncronously in platform init so that we don't start probing + PCI before we setup the slot table. +- p9dsu: add slot power limit. +- p9dsu: add pci slot table for Boston LC 1U/2U and Boston LA/ESS. +- p9dsu HACK: fix system-vpd eeprom +- p9dsu: change esel command from AMI to IBM 0x3a. + +ZZ Platform Changes +------------------- + +- hdata/i2c: Fix up pci hotplug labels + + These labels are used on the devices used to do PCIe slot power control + for implementing PCIe hotplug. I'm not sure how they ended up as + "eeprom-pgood" and "eeprom-controller" since that doesn't make any sense. +- hdata/i2c: Ignore multi-port I2C devices + + Recent FSP firmware builds add support for multi-port I2C devices such + as the GPIO expanders used for the presence detect of OpenCAPI devices + and the PCIe hotplug controllers used to power cycle PCIe slots on ZZ. + + The OpenCAPI driver inside of skiboot currently uses a platform-specific + method to talk to the relevant I2C device rather than relying on HDAT + since not all platforms correctly report the I2C devices (hello Zaius). + Additionally the nature of multi-port devices require that we a device + specific handler so that we generate the correct DT bindings. Currently + we don't and there is no immediate need for this support so just ignore + the multi-port devices for now. +- hdata/i2c: Replace `i2c_` prefix with `dev_` + + The current naming scheme makes it easy to conflate "i2cm_port" and + "i2c_port." The latter is used to describe multi-port I2C devices such + as GPIO expanders and multi-channel PCIe hotplug controllers. Rename + i2c_port to dev_port to make the two a bit more distinct. + + Also rename i2c_addr to dev_addr for consistency. +- hdata/i2c: Ignore CFAM I2C master + + Recent FSP firmware builds put in information about the CFAM I2C master + in addition the to host I2C masters accessible via XSCOM. Odds are this + information should not be there since there's no handshaking between the + FSP/BMC and the host over who controls that I2C master, but it is so + we need to deal with it. + + This patch adds filtering to the HDAT parser so it ignores the CFAM I2C + master. Without this it will create a bogus i2cm@ which migh cause + issues. +- ZZ: hw/imc: Add support to load imc catalog lid file + + Add support to load the imc catalog from a lid file packaged + as part of the system firmware. Lid number allocated + is 0x80f00103.lid. + + +Bugs Fixed +---------- + +Since 6.0-rc2: + +- core/opal: Fix recursion check in opal_run_pollers() + + An earlier commit introduced a counter variable poller_recursion to + limit to the number number of error messages shown when opal_pollers + are run recursively. However the check for the counter value was + placed in a way that the poller recursion was only detected first 16 + times and then allowed afterwards. + + This patch fixes this by moving the check for the counter value inside + the conditional branch with some re-factoring so that opal_poller + recursion is not erroneously allowed after poll_recursion is detected + first 16 times. +- phb4: Print WOF registers on fence detect + + Without the WOF registers it's hard to figure out what went wrong first, + so print those when we print the FIRs when a fence is detected. +- p9dsu: detect variant in init only if probe fails to found. + + Currently the slot table init happens twice in both probe and init + functions due to the variant detection logic called with in-correct + condition check. + +Since 6.0-rc1: + +- core/direct-controls: improve p9_stop_thread error handling + + p9_stop_thread should fail the operation if it finds the thread was + already quiescd. This implies something else is doing direct controls + on the thread (e.g., pdbg) or there is some exceptional condition we + don't know how to deal with. Proceeding here would cause things to + trample on each other, for example the hard lockup watchdog trying to + send a sreset to the core while it is stopped for debugging with pdbg + will end in tears. + + If p9_stop_thread times out waiting for the thread to quiesce, do + not hit it with a core_start direct control, because we don't know + what state things are in and doing more things at this point is worse + than doing nothing. There is no good recipe described in the workbook + to de-assert the core_stop control if it fails to quiesce the thread. + After timing out here, the thread may eventually quiesce and get + stuck, but that's simpler to debug than undefied behaviour. + +- core/direct-controls: fix p9_cont_thread for stopped/inactive threads + + Firstly, p9_cont_thread should check that the thread actually was + quiesced before it tries to resume it. Anything could happen if we + try this from an arbitrary thread state. + + Then when resuming a quiesced thread that is inactive or stopped (in + a stop idle state), we must not send a core_start direct control, + clear_maint must be used in these cases. +- hmi: Clear unknown debug trigger + + On some systems, seeing hangs like this when Linux starts: :: + + [ 170.027252763,5] OCC: All Chip Rdy after 0 ms + [ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes) + [ 171.238270428,5] OPAL: Switch to little-endian OS + + If you look at the in memory skiboot console (or do `nvram -p + ibm,skiboot --update-config log-level-driver=7`) we see the console get + spammed with: :: + + [ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 + [ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 + [ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 + [ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 + + We're taking the debug trigger (bit 17) early on, before the + hmi_debug_trigger function in the kernel is set up. + + This clears the HMI in Skiboot and reports to the kernel instead of + bringing down the machine. + +- core/hmi: assign flags=0 in case nothing set by handle_hmi_exception + + Theoretically we could have returned junk to the OS in this parameter. + +- SLW: Fix mambo boot to use stop states + + After commit 35c66b8ce5a2 ("SLW: Move MAMBO simulator checks to + slw_init"), mambo boot no longer calls add_cpu_idle_state_properties() + and as such we never enable stop states. + + After adding the call back, we get more testing coverage as well + as faster mambo SMT boots. + +- phb4: Hardware init updates + + CFG Write Request Timeout was incorrectly set to informational and not + fatal for both non-CAPI and CAPI, so set it to fatal. This was a + mistake in the specification. Correcting this fixes a niche bug in + escalation (which is necessary on pre-DD2.2) that can cause a checkstop + due to a NCU timeout. + + In addition, set the values in the timeout control registers to match. + This fixes an extremely rare and unreproducible bug, though the current + timings don't make sense since they're higher than the NCU timeout (16) + which will checkstop the machine anyway. + +- SLW: quieten 'Configuring self-restore' for DARN,NCU_SPEC_BAR and HRMOR + +Since 5.11: + +- core: Fix iteration condition to skip garded cpu +- uart: fix uart_opal_flush to take console lock over uart_con_flush + This bug meant that OPAL_CONSOLE_FLUSH didn't take the appropriate locks. + Luckily, since this call is only currently used in the crash path. +- xive: fix missing unlock in error path +- OPAL_PCI_SET_POWER_STATE: fix locking in error paths + + Otherwise we could exit OPAL holding locks, potentially leading + to all sorts of problems later on. +- hw/slw: Don't assert on a unknown chip + + For some reason skiboot populates nodes in /cpus/ for the cores on + chips that are deconfigured. As a result Linux includes the threads + of those cores in it's set of possible CPUs in the system and attempts + to set the SPR values that should be used when waking a thread from + a deep sleep state. + + However, in the case where we have deconfigured chip we don't create + a xscom node for that chip and as a result we don't have a proc_chip + structure for that chip either. In turn, this results in an assertion + failure when calling opal_slw_set_reg() since it expects the chip + structure to exist. Fix this up and print an error instead. +- opal/hmi: Generate one event per core for processor recovery. + + Processor recovery is per core error. All threads on that core receive + HMI. All threads don't need to generate HMI event for same error. + + Let thread 0 only generate the event. +- sensors: Dont add DTS sensors when OCC inband sensors are available + + There are two sets of core temperature sensors today. One is DTS scom + based core temperature sensors and the second group is the sensors + provided by OCC. DTS is the highest temperature among the different + temperature zones in the core while OCC core temperature sensors are + the average temperature of the core. DTS sensors are read directly by + the host by SCOMing the DTS sensors while OCC sensors are read and + updated by OCC to main memory. + + Reading DTS sensors by SCOMing is a heavy and slower operation as + compared to reading OCC sensors which is as good as reading memory. + So dont add DTS sensors when OCC sensors are available. +- core/fast-reboot: Increase timeout for dctl sreset to 1sec + + Direct control xscom can take more time to complete. We seem to + wait too little on Boston failing fast-reboot for no good reason. + + Increase timeout to 1 sec as a reasonable value for sreset to be delivered + and core to start executing instructions. +- occ: sensors-groups: Add DT properties to mark HWMON sensor groups + + Fix the sensor type to match HWMON sensor types. Add compatible flag + to indicate the environmental sensor groups so that operations on + these groups can be handled by HWMON linux interface. +- core: Correctly load initramfs in stb container + + Skiboot does not calculate the actual size and start location of the + initramfs if it is wrapped by an STB container (for example if loading + an initramfs from the ROOTFS partition). + + Check if the initramfs is in an STB container and determine the size and + location correctly in the same manner as the kernel. Since + load_initramfs() is called after load_kernel() move the call to + trustedboot_exit_boot_services() into load_and_boot_kernel() so it is + called after both of these. +- hdat/i2c.c: quieten "v2 found, parsing as v1" +- hw/imc: Check for pause_microcode_at_boot() return status + + pause_microcode_at_boot() loops through all the chip's ucode + control block and pause the ucode if it is in the running state. + But it does not fail if any of the chip's ucode is not initialised. + + Add code to return a failure if ucode is not initialized in any + of the chip. Since pause_microcode_at_boot() is called just before + attaching the IMC device nodes in imc_init(), add code to check for + the function return. + + +Slot location code fixes: + +- npu2: Use ibm, loc-code rather than ibm, slot-label + + The ibm,slot-label property is to name the slot that appears under a + PCIe bridge. In the past we (ab)used the slot tables to attach names + to GPU devices and their corresponding NVLinks which resulted in npu2.c + using slot-label as a location code rather than as a way to name slots. + + Fix this up since it's confusing. +- hdata/slots: Apply slot label to the parent slot + + Slot names only really make sense when applied to an actual slot rather + than a device. On witherspoon the GPU devices have a name associated with + the device rather than the slot for the GPUs. Add a hack that moves the + slot label to the parent slot rather than on the device itself. +- pci-dt-slot: Big ol' cleanup + + The underlying data that we get from HDAT can only really describe a + PCIe system. As such we can simplify the devicetree slot lookup code + by only caring about the important cases, namly, root ports and switch + downstream ports. + + This also fixes a bug where root port didn't get a Slot label applied + which results in devices under that port not having ibm,loc-code set. + This results in the EEH core being unable to report the location of + EEHed devices under that port. + +opal-prd +^^^^^^^^ +- opal-prd: Insert powernv_flash module + + Explictly load powernv_flash module on BMC based system so that we are sure + that flash device is created before starting opal-prd daemon. + + Note that I have replaced pnor_available() check with is_fsp_system(). As we + want to load module on BMC system only. Also pnor_init has enough logic to + detect flash device. Hence pnor_available() becomes redundant check. + +NPU2/NVLINK2 +^^^^^^^^^^^^ +- npu2/hw-procedures: fence bricks on GPU reset + + The NPU workbook defines a way of fencing a brick and + getting the brick out of fence state. We do have an implementation + of bringing the brick out of fenced/quiesced state. We do + the latter in our procedures, but to support run time reset + we need to do the former. + + The fencing ensures that access to memory behind the links + will not lead to HMI's, but instead SUE's will be populated + in cache (in the case of speculation). The expectation is then + that prior to and after reset, the operating system components + will flush the cache for the region of memory behind the GPU. + + This patch does the following: + + 1. Implements a npu2_dev_fence_brick() function to set/clear + fence state + 2. Clear FIR bits prior to clearing the fence status + 3. Clear's the fence status + 4. We take the powerbus out of CQ fence much later now, + in credits_check() which is the last hardware procedure + called after link training. +- hw/npu2.c: Remove static configuration of NPU2 register + + The NPU_SM_CONFIG0 register currently needs to be configured in Skiboot to + select NVLink mode, however Hostboot should configure other bits in this + register. + + For some reason Skiboot was explicitly clearing bit-6 + (CONFIG_DISABLE_VG_NOT_SYS). It is unclear why this bit was getting cleared + as recent Hostboot versions explicitly set it to the correct value based on + the specific system configuration. Therefore Skiboot should not alter it. + + Bit-58 (CONFIG_NVLINK_MODE) selects if NVLink mode should be enabled or + not. Hostboot does not configure this bit so Skiboot should continue to + configure it. +- npu2: Improve log output of GPU-to-link mapping + + Debugging issues related to unconnected NVLinks can be a little less + irritating if we use the NPU2DEV{DBG,INF}() macros instead of prlog(). + + In short, change this: :: + + NPU2: comparing GPU 'GPU2' and NPU2 'GPU1' + NPU2: comparing GPU 'GPU3' and NPU2 'GPU1' + NPU2: comparing GPU 'GPU4' and NPU2 'GPU1' + NPU2: comparing GPU 'GPU5' and NPU2 'GPU1' + : + npu2_dev_bind_pci_dev: No PCI device for NPU2 device 0006:00:01.0 to bind to. If you expect a GPU to be there, this is a problem. + + to this: :: + + NPU6:0:1.0 Comparing GPU 'GPU2' and NPU2 'GPU1' + NPU6:0:1.0 Comparing GPU 'GPU3' and NPU2 'GPU1' + NPU6:0:1.0 Comparing GPU 'GPU4' and NPU2 'GPU1' + NPU6:0:1.0 Comparing GPU 'GPU5' and NPU2 'GPU1' + : + NPU6:0:1.0 No PCI device found for slot 'GPU1' +- npu2: Move NPU2_XTS_BDF_MAP_VALID assignment to context init + + A bad GPU or other condition may leave us with a subset of links that + never get initialized. If an ATSD is sent to one of those bricks, it + will never complete, leaving us waiting forever for a response: :: + + watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [acos:2050] + ... + Modules linked in: nvidia_uvm(O) nvidia(O) + CPU: 23 PID: 2050 Comm: acos Tainted: G W O 4.14.0 #2 + task: c0000000285cfc00 task.stack: c000001fea860000 + NIP: c0000000000abdf0 LR: c0000000000acc48 CTR: c0000000000ace60 + REGS: c000001fea863550 TRAP: 0901 Tainted: G W O (4.14.0) + MSR: 9000000000009033 CR: 28004484 XER: 20040000 + CFAR: c0000000000abdf4 SOFTE: 1 + GPR00: c0000000000acc48 c000001fea8637d0 c0000000011f7c00 c000001fea863820 + GPR04: 0000000002000000 0004100026000000 c0000000012778c8 c00000000127a560 + GPR08: 0000000000000001 0000000000000080 c000201cc7cb7750 ffffffffffffffff + GPR12: 0000000000008000 c000000003167e80 + NIP [c0000000000abdf0] mmio_invalidate_wait+0x90/0xc0 + LR [c0000000000acc48] mmio_invalidate.isra.11+0x158/0x370 + + + ATSDs are only sent to bricks which have a valid entry in the XTS_BDF + table. So to prevent the hang, don't set NPU2_XTS_BDF_MAP_VALID unless + we make it all the way to creating a context for the BDF. + +Secure and Trusted Boot +^^^^^^^^^^^^^^^^^^^^^^^ +- hdata/tpmrel: detect tpm not present by looking up the stinfo->status + + Skiboot detects if tpm is present by checking if a secureboot_tpm_info + entry exists. However, if a tpm is not present, hostboot also creates a + secureboot_tpm_info entry. In this case, hostboot creates an empty + entry, but setting the field tpm_status to TPM_NOT_PRESENT. + + This detects if tpm is not present by looking up the stinfo->status. + + This fixes the "TPMREL: TPM node not found for chip_id=0 (HB bug)" + issue, reproduced when skiboot is running on a system that has no tpm. + +PCI +^^^ +- phb4: Restore bus numbers after CRS + + Currently we restore PCIe bus numbers right after the link is + up. Unfortunately as this point we haven't done CRS so config space + may not be accessible. + + This moves the bus number restore till after CRS has happened. +- romulus: Add a barebones slot table +- phb4: Quieten and improve "Timeout waiting for electrical link" + + This happens normally if a slot doesn't have a working HW presence + detect and relies instead of inband presence detect. + + The message we display is scary and not very useful unless ou + are debugging, so quiten it up and change it to something more + meaningful. +- pcie-slot: Don't fail powering on an already on switch + + If the power state is already the required value, return + OPAL_SUCCESS rather than OPAL_PARAMETER to avoid spurrious + errors during boot. + +CAPI/OpenCAPI +^^^^^^^^^^^^^ +- capi: Keep the current mmio windows in the mbt cache table. + + When the phb is used as a CAPI interface, the current mmio windows list + is cleaned before adding the capi and the prefetchable memory (M64) + windows, which implies that the non-prefetchable BAR is no more + configured. + This patch allows to set only the mbt bar to pass capi mmio window and + to keep, as defined, the other mmio values (M32 and M64). +- npu2-opencapi: Fix 'link internal error' FIR, take 2 + + When setting up an opencapi link, we set the transport muxes first, + then set the PHY training config register, which includes disabling + nvlink mode for the bricks. That's the order of the init sequence, as + found in the NPU workbook. + + In reality, doing so works, but it raises 2 FIR bits in the PowerBus + OLL FIR Register for the 2 links when we configure the transport + muxes. Presumably because nvlink is not disabled yet and we are + configuring the transport muxes for opencapi. + + bit 60: + link0 internal error + bit 61: + link1 internal error + + Overall the current setup ends up being correct and everything works, + but we raise 2 FIR bits. + + So tweak the order of operations to disable nvlink before configuring + the transport muxes. Incidentally, this is what the scripts from the + opencapi enablement team were doing all along. +- npu2-opencapi: Fix 'link internal error' FIR, take 1 + + When we setup a link, we always enable ODL0 and ODL1 at the same time + in the PHY training config register, even though we are setting up + only one OTL/ODL, so it raises a "link internal error" FIR bit in the + PowerBus OLL FIR Register for the second link. The error is harmless, + as we'll eventually setup the second link, but there's no reason to + raise that FIR bit. + + The fix is simply to only enable the ODL we are using for the link. +- phb4: Do not set the PBCQ Tunnel BAR register when enabling capi mode. + + The cxl driver will set the capi value, like other drivers already do. +- phb4: set TVT1 for tunneled operations in capi mode + + The ASN indication is used for tunneled operations (as_notify and + atomics). Tunneled operation messages can be sent in PCI mode as + well as CAPI mode. + + The address field of as_notify messages is hijacked to encode the + LPID/PID/TID of the target thread, so those messages should not go + through address translation. Therefore bit 59 is part of the ASN + indication. + + This patch sets TVT#1 in bypass mode when capi mode is enabled, + to prevent as_notify messages from being dropped. + +Debugging/Testing improvements +------------------------------ + +Since 6.0-rc1: +- mambo: Enable XER CA32 and OV32 bits on P9 + + POWER9 adds 32 bit carry and overflow bits to the XER, but we need to + set the relevant CTRL1 bit to enable them. +- Makefile: Fix building natively on ppc64le + + When on ppc64le and CROSS is not set by the environment, make assumes + ppc64 and sets a default CROSS. Check for ppc64le as well, so that + 'make' works out of the box on ppc64le. +- Experimental support for building with Clang +- Improvements to testing and Travis CI + +Since 5.11: + +- core/stack: backtrace unwind basic OPAL call details + + Put OPAL callers' r1 into the stack back chain, and then use that to + unwind back to the OPAL entry frame (as opposed to boot entry, which + has a 0 back chain). + + From there, dump the OPAL call token and the caller's r1. A backtrace + looks like this: :: + + CPU 0000 Backtrace: + S: 0000000031c03ba0 R: 000000003001a548 ._abort+0x4c + S: 0000000031c03c20 R: 000000003001baac .opal_run_pollers+0x3c + S: 0000000031c03ca0 R: 000000003001bcbc .opal_poll_events+0xc4 + S: 0000000031c03d20 R: 00000000300051dc opal_entry+0x12c + --- OPAL call entry token: 0xa caller R1: 0xc0000000006d3b90 --- + + This is pretty basic for the moment, but it does give you the bottom + of the Linux stack. It will allow some interesting improvements in + future. + + First, with the eframe, all the call's parameters can be printed out + as well. The ___backtrace / ___print_backtrace API needs to be + reworked in order to support this, but it's otherwise very simple + (see opal_trace_entry()). + + Second, it will allow Linux's stack to be passed back to Linux via + a debugging opal call. This will allow Linux's BUG() or xmon to + also print the Linux back trace in case of a NMI or MCE or watchdog + lockup that hits in OPAL. +- asm/head: implement quiescing without stack or clobbering regs + + Quiescing currently is implmeented in C in opal_entry before the + opal call handler is called. This works well enough for simple + cases like fast reset when one CPU wants all others out of the way. + + Linux would like to use it to prevent an sreset IPI from + interrupting firmware, which could lead to deadlocks when crash + dumping or entering the debugger. Linux interrupts do not recover + well when returning back to general OPAL code, due to r13 not being + restored. OPAL also can't be re-entered, which may happen e.g., + from the debugger. + + So move the quiesce hold/reject to entry code, beore the stack or + r1 or r13 registers are switched. OPAL can be interrupted and + returned to or re-entered during this period. + + This does not completely solve all such problems. OPAL will be + interrupted with sreset if the quiesce times out, and it can be + interrupted by MCEs as well. These still have the issues above. +- core/opal: Allow poller re-entry if OPAL was re-entered + + If an NMI interrupts the middle of running pollers and the OS + invokes pollers again (e.g., for console output), the poller + re-entrancy check will prevent it from running and spam the + console. + + That check was designed to catch a poller calling opal_run_pollers, + OPAL re-entrancy is something different and is detected elsewhere. + Avoid the poller recursion check if OPAL has been re-entered. This + is a best-effort attempt to cope with errors. +- core/opal: Emergency stack for re-entry + + This detects OPAL being re-entered by the OS, and switches to an + emergency stack if it was. This protects the firmware's main stack + from re-entrancy and allows the OS to use NMI facilities for crash + / debug functionality. + + Further nested re-entry will destroy the previous emergency stack + and prevent returning, but those should be rare cases. + + This stack is sized at 16kB, which doubles the size of CPU stacks, + so as not to introduce a regression in primary stack size. The 16kB + stack originally had a 4kB machine check stack at the top, which was + removed by 80eee1946 ("opal: Remove machine check interrupt patching + in OPAL."). So it is possible the size could be tightened again, but + that would require further analysis. + +- hdat_to_dt: hash_prop the same on all platforms + Fixes this unit test on ppc64le hosts. +- mambo: Add persistent memory disk support + + This adds support to for mapping disks images using persistent + memory. Disks can be added by setting this ENV variable: + + PMEM_DISK="/mydisks/disk1.img,/mydisks/disk2.img" + + These will show up in Linux as /dev/pmem0 and /dev/pmem1. + + This uses a new feature in mambo "mysim memory mmap .." which is only + available since mambo commit 0131f0fc08 (from 24/4/2018). + + This also needs the of_pmem.c driver in Linux which is only available + since v4.17. It works with powernv_defconfig + CONFIG_OF_PMEM. +- external/mambo: Add di command to decode instructions + + By default you get 16 instructions but you can specify the number you + want. i.e. :: + + systemsim % di 0x100 4 + 0x0000000000000100: Enc:0xA64BB17D : mtspr HSPRG1,r13 + 0x0000000000000104: Enc:0xA64AB07D : mfspr r13,HSPRG0 + 0x0000000000000108: Enc:0xF0092DF9 : std r9,0x9F0(r13) + 0x000000000000010C: Enc:0xA6E2207D : mfspr r9,PPR + + Using di since it's what xmon uses. +- mambo/mambo_utils.tcl: Inject an MCE at a specified address + + Currently we don't support injecting an MCE on a specific address. + This is useful for testing functionality like memcpy_mcsafe() + (see https://patchwork.ozlabs.org/cover/893339/) + + The core of the functionality is a routine called + inject_mce_ue_on_addr, which takes an addr argument and injects + an MCE (load/store with UE) when the specified address is accessed + by code. This functionality can easily be enhanced to cover + instruction UE's as well. + + A sample use case to create an MCE on stack access would be :: + + set addr [mysim display gpr 1] + inject_mce_ue_on_addr $addr + + This would cause an mce on any r1 or r1 based access +- external/mambo: improve helper for machine checks + + Improve workarounds for stop injection, because mambo often will + trigger on 0x104/204 when injecting sreset/mces. + + This also adds a workaround to skip injecting on reservations to + avoid infinite loops when doing inject_mce_step. +- travis: Enable ppc64le builds + + At least on the IBM Travis Enterprise instance, we can now do + ppc64le builds! + + We can only build a subset of our matrix due to availability of + ppc64le distros. The Dockerfiles need some tweaking to only + attempt to install (x86_64 only) Mambo binaries, as well as the + build scripts. +- external: Add "lpc" tool + + This is a little front-end to the lpc debugfs files to access + the LPC bus from userspace on the host. +- core/test/run-trace: fix on ppc64el