Commit graph

631 commits

Author SHA1 Message Date
Reza Arbab a6030d7e0b spapr: Add a new level of NUMA for GPUs
NUMA nodes corresponding to GPU memory currently have the same
affinity/distance as normal memory nodes. Add a third NUMA associativity
reference point enabling us to give GPU nodes more distance.

This is guest visible information, which shouldn't change under a
running guest across migration between different qemu versions, so make
the change effective only in new (pseries > 5.0) machine types.

Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):

node distances:
node   0   1   2   3   4   5
  0:  10  40  40  40  40  40
  1:  40  10  40  40  40  40
  2:  40  40  10  40  40  40
  3:  40  40  40  10  40  40
  4:  40  40  40  40  10  40
  5:  40  40  40  40  40  10

After:

node distances:
node   0   1   2   3   4   5
  0:  10  40  80  80  80  80
  1:  40  10  80  80  80  80
  2:  80  80  10  80  80  80
  3:  80  80  80  10  80  80
  4:  80  80  80  80  10  80
  5:  80  80  80  80  80  10

These are the same distances as on the host, mirroring the change made
to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
Add a new level of NUMA for GPU's").

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Message-Id: <20200716225655.24289-1-arbab@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-07-20 09:21:39 +10:00
Gustavo Romero 7861e083f8 spapr: Fix typos in comments and macro indentation
This commit fixes typos in spapr_vio_reg_to_irq() comments and a macro
indentation.

Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Message-Id: <1590710681-12873-1-git-send-email-gromero@linux.ibm.com>
Acked-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-06-26 09:22:30 +10:00
Markus Armbruster 2f35254aa0 pnv/psi: Correct the pnv-psi* devices not to be sysbus devices
pnv_chip_power8_instance_init() creates a "pnv-psi-POWER8" sysbus
device in a way that leaves it unplugged.
pnv_chip_power9_instance_init() and pnv_chip_power10_instance_init()
do the same for "pnv-psi-POWER9" and "pnv-psi-POWER10", respectively.

These devices aren't actually sysbus devices.  Correct that.

Cc: "Cédric Le Goater" <clg@kaod.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: qemu-ppc@nongnu.org
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200609122339.937862-18-armbru@redhat.com>
2020-06-15 21:36:21 +02:00
Leonardo Bras 0911a60c76 ppc/spapr: Add hotremovable flag on DIMM LMBs on drmem_v2
On reboot, all memory that was previously added using object_add and
device_add is placed in this DIMM area.

The new SPAPR_LMB_FLAGS_HOTREMOVABLE flag helps Linux to put this memory in
the correct memory zone, so no unmovable allocations are made there,
allowing the object to be easily hot-removed by device_del and
object_del.

This new flag was accepted in Power Architecture documentation.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Reviewed-by: Bharata B Rao <bharata@linux.ibm.com>
Message-Id: <20200511200201.58537-1-leobras.c@gmail.com>
[dwg: Fixed syntax error spotted by Cédric Le Goater]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-27 15:29:36 +10:00
Markus Armbruster 40c2281cc3 Drop more @errp parameters after previous commit
Several functions can't fail anymore: ich9_pm_add_properties(),
device_add_bootindex_property(), ppc_compat_add_property(),
spapr_caps_add_properties(), PropertyInfo.create().  Drop their @errp
parameter.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200505152926.18877-16-armbru@redhat.com>
2020-05-15 07:08:14 +02:00
Greg Kurz 087820e37f spapr: Drop CAS reboot flag
The CAS reboot flag is false by default and all the locations that
could set it to true have been dropped. This means that all code
blocks depending on the flag being set is dead code and the other
code blocks should be executed always.

Just do that and drop the now uneeded CAS reboot flag. Fix a
comment on the way to make checkpatch happy.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514994893.478799.11772512888322840990.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-07 11:10:50 +10:00
Alexey Kardashevskiy 91067db1ab spapr/cas: Separate CAS handling from rebuilding the FDT
At the moment "ibm,client-architecture-support" ("CAS") is implemented
in SLOF and QEMU assists via the custom H_CAS hypercall which copies
an updated flatten device tree (FDT) blob to the SLOF memory which
it then uses to update its internal tree.

When we enable the OpenFirmware client interface in QEMU, we won't need
to copy the FDT to the guest as the client is expected to fetch
the device tree using the client interface.

This moves FDT rebuild out to a separate helper which is going to be
called from the "ibm,client-architecture-support" handler and leaves
writing FDT to the guest in the H_CAS handler.

This should not cause any behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200310050733.29805-3-aik@ozlabs.ru>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514994229.478799.2178881312094922324.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-05-07 11:10:50 +10:00
Cédric Le Goater 25f3170b06 ppc/pnv: Create BMC devices only when defaults are enabled
Commit e2392d4395 ("ppc/pnv: Create BMC devices at machine init")
introduced default BMC devices which can be a problem when the same
devices are defined on the command line with :

  -device ipmi-bmc-sim,id=bmc0 -device isa-ipmi-bt,bmc=bmc0,irq=10

QEMU fails with :

  qemu-system-ppc64: error creating device tree: node: FDT_ERR_EXISTS

Use defaults_enabled() when creating the default BMC devices to let
the user provide its own BMC devices using '-nodefaults'. If no BMC
device are provided, output a warning but let QEMU run as this is a
supported configuration. However, when multiple BMC devices are
defined, stop QEMU with a clear error as the results are unexpected.

Fixes: e2392d4395 ("ppc/pnv: Create BMC devices at machine init")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200404153655.166834-1-clg@kaod.org>
Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-04-07 08:55:11 +10:00
Nicholas Piggin edfdbf9c6b ppc/spapr: Add FWNMI System Reset state
The FWNMI option must deliver system reset interrupts to their
registered address, and there are a few constraints on the handler
addresses specified in PAPR. Add the system reset address state and
checks.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-4-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviwed-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 17:00:22 +11:00
Nicholas Piggin 8af7e1fe6f ppc/spapr: Change FWNMI names
The option is called "FWNMI", and it involves more than just machine
checks, also machine checks can be delivered without the FWNMI option,
so re-name various things to reflect that.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20200316142613.121089-3-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 17:00:22 +11:00
David Gibson 91335a5e15 spapr: Rename DT functions to newer naming convention
In the spapr code we've been gradually moving towards a convention that
functions which create pieces of the device tree are called spapr_dt_*().
This patch speeds that along by renaming most of the things that don't yet
match that so that they do.

For now we leave the *_dt_populate() functions which are actual methods
used in the DRCClass::dt_populate method.

While we're there we remove a few comments that don't really say anything
useful.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
2020-03-17 17:00:19 +11:00
Alexey Kardashevskiy 4dba872219 spapr/rtas: Reserve space for RTAS blob and log
At the moment SLOF reserves space for RTAS and instantiates the RTAS blob
which is 20 bytes binary blob calling an hypercall. The rest of the RTAS
area is a log which SLOF has no idea about but QEMU does.

This moves RTAS sizing to QEMU and this overrides the size from SLOF.
The only remaining problem is that SLOF copies the number of bytes it
reserved (2KB for now) so QEMU needs to reserve at least this much;
SLOF will be fixed separately to check that rtas-size from QEMU is
enough for those 20 bytes for the H_RTAS hcall.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200316011841.99970-1-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 15:08:50 +11:00
Alexey Kardashevskiy 395a20d3cc ppc/spapr: Move GPRs setup to one place
At the moment "pseries" starts in SLOF which only expects the FDT blob
pointer in r3. As we are going to introduce a OpenFirmware support in
QEMU, we will be booting OF clients directly and these expect a stack
pointer in r1, Linux looks at r3/r4 for the initramdisk location
(although vmlinux can find this from the device tree but zImage from
distro kernels cannot).

This extends spapr_cpu_set_entry_state() to take more registers. This
should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200310050733.29805-2-aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 15:08:50 +11:00
David Gibson 1052ab67f4 spapr: Don't clamp RMA to 16GiB on new machine types
In spapr_machine_init() we clamp the size of the RMA to 16GiB and the
comment saying why doesn't make a whole lot of sense.  In fact, this was
done because the real mode handling code elsewhere limited the RMA in TCG
mode to the maximum value configurable in LPCR[RMLS], 16GiB.

But,
 * Actually LPCR[RMLS] has been able to encode a 256GiB size for a very
   long time, we just didn't implement it properly in the softmmu
 * LPCR[RMLS] shouldn't really be relevant anyway, it only was because we
   used to abuse the RMOR based translation mode in order to handle the
   fact that we're not modelling the hypervisor parts of the cpu

We've now removed those limitations in the modelling so the 16GiB clamp no
longer serves a function.  However, we can't just remove the limit
universally: that would break migration to earlier qemu versions, where
the 16GiB RMLS limit still applies, no matter how bad the reasons for it
are.

So, we replace the 16GiB clamp, with a clamp to a limit defined in the
machine type class.  We set it to 16 GiB for machine types 4.2 and earlier,
but set it to 0 meaning unlimited for the new 5.0 machine type.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
2020-03-17 09:41:15 +11:00
David Gibson 8897ea5a9f spapr: Don't attempt to clamp RMA to VRMA constraint
The Real Mode Area (RMA) is the part of memory which a guest can access
when in real (MMU off) mode.  Of course, for a guest under KVM, the MMU
isn't really turned off, it's just in a special translation mode - Virtual
Real Mode Area (VRMA) - which looks like real mode in guest mode.

The mechanics of how this works when using the hash MMU (HPT) put a
constraint on the size of the RMA, which depends on the size of the
HPT.  So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
we advertise to the guest based on this VRMA limit.

There are several things wrong with this:
 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
    of Node 0 memory size and the VRMA limit.  That will *often* work the
    same as clamping, but there can be other constraints on RMA size which
    supersede Node 0 memory size.  We have real bugs caused by this
    (currently worked around in the guest kernel)
 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
    we're past the point that we can actually advertise an RMA limit to the
    guest
 3) But most fundamentally, the VRMA limit depends on host configuration
    (page size) which shouldn't be visible to the guest, but this partially
    exposes it.  This can cause problems with migration in certain edge
    cases, although we will mostly get away with it.

In practice, this clamping is almost never applied anyway.  With 64kiB
pages and the normal rules for sizing of the HPT, the theoretical VRMA
limit will be 4x(guest memory size) and so never hit.  It will hit with
4kiB pages, where it will be (guest memory size)/4.  However all mainstream
distro kernels for POWER have used a 64kiB page size for at least 10 years.

So, simply replace this logic with a check that the RMA we've calculated
based only on guest visible configuration will fit within the host implied
VRMA limit.  This can break if running HPT guests on a host kernel with
4kiB page size.  As noted that's very rare.  There also exist several
possible workarounds:
  * Change the host kernel to use 64kiB pages
  * Use radix MMU (RPT) guests instead of HPT
  * Use 64kiB hugepages on the host to back guest memory
  * Increase the guest memory size so that the RMA hits one of the fixed
    limits before the RMA limit.  This is relatively easy on POWER8 which
    has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
  * Use a guest NUMA configuration which artificially constrains the RMA
    within the VRMA limit (the RMA must always fit within Node 0).

Previously, on KVM, we also temporarily reduced the rma_size to 256M so
that the we'd load the kernel and initrd safely, regardless of the VRMA
limit.  This was a) confusing, b) could significantly limit the size of
images we could load and c) introduced a behavioural difference between
KVM and TCG.  So we remove that as well.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
2020-03-17 09:41:15 +11:00
Greg Kurz ad334d89a6 spapr: Handle pending hot plug/unplug requests at CAS
If a hot plug or unplug request is pending at CAS, we currently trigger
a CAS reboot, which severely increases the guest boot time. This is
because SLOF doesn't handle hot plug events and we had no way to fix
the FDT that gets presented to the guest.

We can do better thanks to recent changes in QEMU and SLOF:

- we now return a full FDT to SLOF during CAS

- SLOF was fixed to correctly detect any device that was either added or
  removed since boot time and to update its internal DT accordingly.

The right solution is to process all pending hot plug/unplug requests
during CAS: convert hot plugged devices to cold plugged devices and
remove the hot unplugged ones, which is exactly what spapr_drc_reset()
does. Also clear all hot plug events that are currently queued since
they're no longer relevant.

Note that SLOF cannot currently populate hot plugged PCI bridges or PHBs
at CAS. Until this limitation is lifted, SLOF will reset the machine when
this scenario occurs : this will allow the FDT to be fully processed when
SLOF is started again (ie. the same effect as the CAS reboot that would
occur anyway without this patch).

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158257222352.4102917.8984214333937947307.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-03-17 09:41:14 +11:00
Paolo Bonzini ca6155c0f2 Merge tag 'patchew/20200219160953.13771-1-imammedo@redhat.com' of https://github.com/patchew-project/qemu into HEAD
This series removes ad hoc RAM allocation API (memory_region_allocate_system_memory)
and consolidates it around hostmem backend. It allows to

* resolve conflicts between global -mem-prealloc and hostmem's "policy" option,
  fixing premature allocation before binding policy is applied

* simplify complicated memory allocation routines which had to deal with 2 ways
  to allocate RAM.

* reuse hostmem backends of a choice for main RAM without adding extra CLI
  options to duplicate hostmem features.  A recent case was -mem-shared, to
  enable vhost-user on targets that don't support hostmem backends [1] (ex: s390)

* move RAM allocation from individual boards into generic machine code and
  provide them with prepared MemoryRegion.

* clean up deprecated NUMA features which were tied to the old API (see patches)
  - "numa: remove deprecated -mem-path fallback to anonymous RAM"
  - (POSTPONED, waiting on libvirt side) "forbid '-numa node,mem' for 5.0 and newer machine types"
  - (POSTPONED) "numa: remove deprecated implicit RAM distribution between nodes"

Introduce a new machine.memory-backend property and wrapper code that aliases
global -mem-path and -mem-alloc into automatically created hostmem backend
properties (provided memory-backend was not set explicitly given by user).
A bulk of trivial patches then follow to incrementally convert individual
boards to using machine.memory-backend provided MemoryRegion.

Board conversion typically involves:

* providing MachineClass::default_ram_size and MachineClass::default_ram_id
  so generic code could create default backend if user didn't explicitly provide
  memory-backend or -m options

* dropping memory_region_allocate_system_memory() call

* using convenience MachineState::ram MemoryRegion, which points to MemoryRegion
   allocated by ram-memdev

On top of that for some boards:

* missing ram_size checks are added (typically it were boards with fixed ram size)

* ram_size fixups are replaced by checks and hard errors, forcing user to
  provide correct "-m" values instead of ignoring it and continuing running.

After all boards are converted, the old API is removed and memory allocation
routines are cleaned up.
2020-02-25 09:19:00 +01:00
Greg Kurz 4b63db1289 spapr: Don't use spapr_drc_needed() in CAS code
We currently don't support hotplug of devices between boot and CAS. If
this happens a CAS reboot is triggered. We detect this during CAS using
the spapr_drc_needed() function which is essentially a VMStateDescription
.needed callback. Even if the condition for CAS reboot happens to be the
same as for DRC migration, it looks wrong to piggyback a migration helper
for this.

Introduce a helper with slightly more explicit name and use it in both CAS
and DRC migration code. Since a subsequent patch will enhance this helper
to cover the case of hot unplug, let's go for spapr_drc_transient(). While
here convert spapr_hotplugged_dev_before_cas() to the "transient" wording as
well.

This doesn't change any behaviour.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158169248180.3465937.9531405453362718771.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 09:15:04 +11:00
Alexey Kardashevskiy 87262806cb spapr: Allow changing offset for -kernel image
This allows moving the kernel in the guest memory. The option is useful
for step debugging (as Linux is linked at 0x0); it also allows loading
grub which is normally linked to run at 0x20000.

This uses the existing kernel address by default.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200203032943.121178-6-aik@ozlabs.ru>
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 09:15:04 +11:00
Shivaprasad G Bhat b5fca656f7 spapr: Add Hcalls to support PAPR NVDIMM device
This patch implements few of the necessary hcalls for the nvdimm support.

PAPR semantics is such that each NVDIMM device is comprising of multiple
SCM(Storage Class Memory) blocks. The guest requests the hypervisor to
bind each of the SCM blocks of the NVDIMM device using hcalls. There can
be SCM block unbind requests in case of driver errors or unplug(not
supported now) use cases. The NVDIMM label read/writes are done through
hcalls.

Since each virtual NVDIMM device is divided into multiple SCM blocks,
the bind, unbind, and queries using hcalls on those blocks can come
independently. This doesn't fit well into the qemu device semantics,
where the map/unmap are done at the (whole)device/object level granularity.
The patch doesnt actually bind/unbind on hcalls but let it happen at the
device_add/del phase itself instead.

The guest kernel makes bind/unbind requests for the virtual NVDIMM device
at the region level granularity. Without interleaving, each virtual NVDIMM
device is presented as a separate guest physical address range. So, there
is no way a partial bind/unbind request can come for the vNVDIMM in a
hcall for a subset of SCM blocks of a virtual NVDIMM. Hence it is safe to
do bind/unbind everything during the device_add/del.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Message-Id: <158131059899.2897.11515211602702956854.stgit@lep8c.aus.stglabs.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 09:15:04 +11:00
Shivaprasad G Bhat ee3a71e366 spapr: Add NVDIMM device support
Add support for NVDIMM devices for sPAPR. Piggyback on existing nvdimm
device interface in QEMU to support virtual NVDIMM devices for Power.
Create the required DT entries for the device (some entries have
dummy values right now).

The patch creates the required DT node and sends a hotplug
interrupt to the guest. Guest is expected to undertake the normal
DR resource add path in response and start issuing PAPR SCM hcalls.

The device support is verified based on the machine version unlike x86.

This is how it can be used ..
Ex :
For coldplug, the device to be added in qemu command line as shown below
-object memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
-device nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0

For hotplug, the device to be added from monitor as below
object_add memory-backend-file,id=memnvdimm0,prealloc=yes,mem-path=/tmp/nvdimm0,share=yes,size=1073872896
device_add nvdimm,label-size=128k,uuid=75a3cdd7-6a2f-4791-8d15-fe0a920e8e9e,memdev=memnvdimm0,id=nvdimm0,slot=0

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
               [Early implementation]
Message-Id: <158131058078.2897.12767731856697459923.stgit@lep8c.aus.stglabs.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-21 09:15:04 +11:00
Igor Mammedov b28f01880e ppc/{ppc440_bamboo, sam460ex}: use memdev for RAM
memory_region_allocate_system_memory() API is going away, so
replace it with memdev allocated MemoryRegion. The later is
initialized by generic code, so board only needs to opt in
to memdev scheme by providing
  MachineClass::default_ram_id
and using MachineState::ram instead of manually initializing
RAM memory region.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200219160953.13771-67-imammedo@redhat.com>
2020-02-19 16:50:00 +00:00
Igor Mammedov a0258e4afa ppc/{ppc440_bamboo, sam460ex}: drop RAM size fixup
If user provided non-sense RAM size, board will complain and
continue running with max RAM size supported or sometimes
crash like this:
  %QEMU -M bamboo -m 1
    exec.c:1926: find_ram_offset: Assertion `size != 0' failed.
    Aborted (core dumped)
Also RAM is going to be allocated by generic code, so it won't be
possible for board to fix things up for user.

Make it error message and exit to force user fix CLI,
instead of accepting non-sense CLI values.
That also fixes crash issue, since wrongly calculated size
isn't used to allocate RAM

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200219160953.13771-66-imammedo@redhat.com>
2020-02-19 16:50:00 +00:00
Aravinda Prasad 2500fb423a migration: Include migration support for machine check handling
This patch includes migration support for machine check
handling. Especially this patch blocks VM migration
requests until the machine check error handling is
complete as these errors are specific to the source
hardware and is irrelevant on the target hardware.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[Do not set FWNMI cap in post_load, now its done in .apply hook]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Message-Id: <20200130184423.20519-7-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 11:33:11 +11:00
Aravinda Prasad f03496bc12 ppc: spapr: Handle "ibm,nmi-register" and "ibm,nmi-interlock" RTAS calls
This patch adds support in QEMU to handle "ibm,nmi-register"
and "ibm,nmi-interlock" RTAS calls.

The machine check notification address is saved when the
OS issues "ibm,nmi-register" RTAS call.

This patch also handles the case when multiple processors
experience machine check at or about the same time by
handling "ibm,nmi-interlock" call. In such cases, as per
PAPR, subsequent processors serialize waiting for the first
processor to issue the "ibm,nmi-interlock" call. The second
processor that also received a machine check error waits
till the first processor is done reading the error log.
The first processor issues "ibm,nmi-interlock" call
when the error log is consumed.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[Register fwnmi RTAS calls in core_rtas_register_types()
 where other RTAS calls are registered]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Message-Id: <20200130184423.20519-6-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 11:33:11 +11:00
Aravinda Prasad 81fe70e443 target/ppc: Build rtas error log upon an MCE
Upon a machine check exception (MCE) in a guest address space,
KVM causes a guest exit to enable QEMU to build and pass the
error to the guest in the PAPR defined rtas error log format.

This patch builds the rtas error log, copies it to the rtas_addr
and then invokes the guest registered machine check handler. The
handler in the guest takes suitable action(s) depending on the type
and criticality of the error. For example, if an error is
unrecoverable memory corruption in an application inside the
guest, then the guest kernel sends a SIGBUS to the application.
For recoverable errors, the guest performs recovery actions and
logs the error.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[Assume SLOF has allocated enough room for rtas error log]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200130184423.20519-5-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 11:33:10 +11:00
Aravinda Prasad 9ac703ac5f target/ppc: Handle NMI guest exit
Memory error such as bit flips that cannot be corrected
by hardware are passed on to the kernel for handling.
If the memory address in error belongs to guest then
the guest kernel is responsible for taking suitable action.
Patch [1] enhances KVM to exit guest with exit reason
set to KVM_EXIT_NMI in such cases. This patch handles
KVM_EXIT_NMI exit.

[1] https://www.spinics.net/lists/kvm-ppc/msg12637.html
    (e20bbd3d and related commits)

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <20200130184423.20519-4-ganeshgr@linux.ibm.com>
[dwg: #ifdefs to fix compile for 32-bit target]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 11:33:10 +11:00
Aravinda Prasad 9d953ce447 ppc: spapr: Introduce FWNMI capability
Introduce fwnmi an spapr capability and add a helper function
which tries to enable it, which would be used by following patch
of the series. This patch by itself does not change the existing
behavior.

Signed-off-by: Aravinda Prasad <arawinda.p@gmail.com>
[eliminate cap_ppc_fwnmi, add fwnmi cap to migration state
 and reprhase the commit message]
Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200130184423.20519-3-ganeshgr@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-03 11:33:10 +11:00
Cédric Le Goater 9ae1329ee2 ppc/pnv: Add models for POWER8 PHB3 PCIe Host bridge
This is a model of the PCIe Host Bridge (PHB3) found on a POWER8
processor. It includes the PowerBus logic interface (PBCQ), IOMMU
support, a single PCIe Gen.3 Root Complex, and support for MSI and LSI
interrupt sources as found on a POWER8 system using the XICS interrupt
controller.

The POWER8 processor comes in different flavors: Venice, Murano,
Naple, each having a different number of PHBs. To make things simpler,
the models provides 3 PHB3 per chip. Some platforms, like the
Firestone, can also couple PHBs on the first chip to provide more
bandwidth but this is too specific to model in QEMU.

XICS requires some adjustment to support the PHB3 MSI. The changes are
provided here but they could be decoupled in prereq patches.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144506.11132-3-clg@kaod.org>
[dwg: Use device_class_set_props()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 14:07:57 +11:00
Benjamin Herrenschmidt 4f9924c4d4 ppc/pnv: Add models for POWER9 PHB4 PCIe Host bridge
These changes introduces models for the PCIe Host Bridge (PHB4) of the
POWER9 processor. It includes the PowerBus logic interface (PBCQ),
IOMMU support, a single PCIe Gen.4 Root Complex, and support for MSI
and LSI interrupt sources as found on a POWER9 system using the XIVE
interrupt controller.

POWER9 processor comes with 3 PHB4 PEC (PCI Express Controller) and
each PEC can have several PHBs. By default,

  * PEC0 provides 1 PHB  (PHB0)
  * PEC1 provides 2 PHBs (PHB1 and PHB2)
  * PEC2 provides 3 PHBs (PHB3, PHB4 and PHB5)

Each PEC has a set  "global" registers and some "per-stack" (per-PHB)
registers. Those are organized in two XSCOM ranges, the "Nest" range
and the "PCI" range, each range contains both some "PEC" registers and
some "per-stack" registers.

No default device layout is provided and PCI devices can be added on
any of the available PCIe Root Port (pcie.0 .. 2 of a Power9 chip)
with address 0x0 as the firwware (skiboot) only accepts a single
device per root port. To run a simple system with a network and a
storage adapters, use a command line options such as :

  -device e1000e,netdev=net0,mac=C0:FF:EE:00:00:02,bus=pcie.0,addr=0x0
  -netdev bridge,id=net0,helper=/usr/libexec/qemu-bridge-helper,br=virbr0,id=hostnet0

  -device megasas,id=scsi0,bus=pcie.1,addr=0x0
  -drive file=$disk,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none
  -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2

If more are needed, include a bridge.

Multi chip is supported, each chip adding its set of PHB4 controllers
and its PCI busses. The model doesn't emulate the EEH error handling.

This model is not ready for hotplug yet.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[ clg: - numerous cleanups
       - commit log
       - fix for broken LSI support
       - PHB pic printinfo
       - large QOM rework ]
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144506.11132-2-clg@kaod.org>
[dwg: Use device_class_set_props()]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 14:07:57 +11:00
Stefan Berger 864674fa29 spapr: Implement get_dt_compatible() callback
For devices that cannot be statically initialized, implement a
get_dt_compatible() callback that allows us to ask the device for
the 'compatible' value.

Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200121152935.649898-3-stefanb@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 14:07:57 +11:00
Cédric Le Goater 08c3f3a734 ppc/pnv: Add support for "hostboot" mode
When the "hb-mode" option is activated on the powernv machine, the
firmware is mapped at 0x8000000 and the HRMOR of the HW threads are
set to the same address.

The PNOR mapping on the FW address space of the LPC bus is left enabled
to let the firmware load any other images required to boot the host.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200127144154.10170-4-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 14:07:57 +11:00
Thomas Huth b2ce76a073 hw/ppc/prep: Remove the deprecated "prep" machine and the OpenHackware BIOS
It's been deprecated since QEMU v3.1. The 40p machine should be
used nowadays instead.

Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Acked-by: Hervé Poussineau <hpoussin@reactos.org>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20200114114617.28854-1-thuth@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-02-02 14:07:57 +11:00
Cédric Le Goater fc2527fb02 ppc/pnv: fix check on return value of blk_getlength()
blk_getlength() returns an int64_t but the result is stored in a
uint32_t. Errors (negative values) won't be caught by the check in
pnv_pnor_realize() and blk_blockalign() will allocate a very large
buffer in such cases.

Fixes Coverity issue CID 1412226.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200107171809.15556-3-clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 12:01:14 +11:00
Greg Kurz 806fed593d pnv/xive: Deduce the PnvXive pointer from XiveTCTX::xptr
And use it instead of reaching out to the machine. This allows to get
rid of pnv_get_chip().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-11-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Cédric Le Goater 479509463b xive: Add a "presenter" link property to the TCTX object
This will be used in subsequent patches to access the XIVE associated to
a TCTX without reaching out to the machine through qdev_get_machine().

Signed-off-by: Cédric Le Goater <clg@kaod.org>
[ groug: - split patch
         - write subject and changelog ]
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-9-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz d8137bb729 ppc/pnv: Add a "pnor" const link property to the BMC internal simulator
This allows to get rid of a call to qdev_get_machine().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-8-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz 764f9b2559 ppc/pnv: Add an "nr-threads" property to the base chip class
Set it at chip creation and forward it to the cores. This allows to drop
a call to qdev_get_machine().

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200106145645.4539-7-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz d1214b819f spapr, pnv, xive: Add a "xive-fabric" link to the XIVE router
In order to get rid of qdev_get_machine(), first add a pointer to the
XIVE fabric under the XIVE router and make it configurable through a
QOM link property.

Configure it in the spapr and pnv machine. In the case of pnv, the XIVE
routers are under the chip, so this is done with a QOM alias property of
the POWER9 pnv chip.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20200106145645.4539-5-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz 0da41d3c5a pnv/xive: Use device_class_set_parent_realize()
The XIVE router base class currently inherits an empty realize hook
from the sysbus device base class, but it will soon implement one
of its own to perform some sanity checks. Do the preliminary plumbing
to have it called.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200106145645.4539-4-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Cédric Le Goater 245cdb7f54 ppc/pnv: Introduce a "xics" property under the POWER8 chip
POWER8 is the only chip using the XICS interface. Add a "xics" link
and a XICSFabric attribute under this chip to remove the use of
qdev_get_machine()

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20200106145645.4539-3-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz 6cc64796f2 spapr/xive: Use device_class_set_parent_realize()
The XIVE router base class currently inherits an empty realize hook
from the sysbus device base class, but it will soon implement one
of its own to perform some sanity checks. Do the preliminary plumbing
to have it called.

Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20191219181155.32530-6-clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2020-01-08 11:01:59 +11:00
Greg Kurz 5084c8b763 ppc/pnv: Drop PnvChipClass::type
It isn't used anymore.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623844102.360005.12070225703151669294.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz 70c059e926 ppc/pnv: Introduce PnvChipClass::xscom_pcba() method
The XSCOM bus is implemented with a QOM interface, which is mostly
generic from a CPU type standpoint, except for the computation of
addresses on the Pervasive Connect Bus (PCB) network. This is handled
by the pnv_xscom_pcba() function with a switch statement based on
the chip_type class level attribute of the CPU chip.

This can be achieved using QOM. Also the address argument is masked with
PNV_XSCOM_SIZE - 1, which is for POWER8 only. Addresses may have different
sizes with other CPU types. Have each CPU chip type handle the appropriate
computation with a QOM xscom_pcba() method.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623843543.360005.13996472463887521794.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz 3caf7bd0a2 ppc/pnv: Drop pnv_chip_is_power9() and pnv_chip_is_power10() helpers
They aren't used anymore.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623842986.360005.1787401623906380181.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz c396c58a02 ppc/pnv: Pass content of the "compatible" property to pnv_dt_xscom()
Since pnv_dt_xscom() is called from chip specific dt_populate() hooks,
it shouldn't have to guess the chip type in order to populate the
"compatible" property. Just pass the compat string and its size as
arguments.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623842430.360005.9513965612524265862.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz 3f5b45ca4f ppc/pnv: Pass XSCOM base address and address size to pnv_dt_xscom()
Since pnv_dt_xscom() is called from chip specific dt_populate() hooks,
it shouldn't have to guess the chip type in order to populate the "reg"
property. Just pass the base address and address size as arguments.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623841868.360005.17577624823547136435.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz c4b2c40c0e ppc/pnv: Introduce PnvChipClass::xscom_core_base() method
The pnv_chip_core_realize() function configures the XSCOM MMIO subregion
for each core of a single chip. The base address of the subregion depends
on the CPU type. Its computation is currently open-code using the
pnv_chip_is_powerXX() helpers. This can be achieved with QOM. Introduce
a method for this in the base chip class and implement it in child classes.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623841311.360005.4705705734873339545.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:11 +11:00
Greg Kurz 85913070a6 ppc/pnv: Introduce PnvChipClass::intc_print_info() method
The pnv_pic_print_info() callback checks the type of the chip in order
to forward to the request appropriate interrupt controller. This can
be achieved with QOM. Introduce a method for this in the base chip class
and implement it in child classes.

This also prepares ground for the upcoming interrupt controller of POWER10
chips.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623840755.360005.5002022339473369934.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:10 +11:00
Greg Kurz acc39abb31 ppc/pnv: Drop pnv_is_power9() and pnv_is_power10() helpers
They aren't used anymore.

Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157623840200.360005.1300941274565357363.stgit@bahia.lan>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2019-12-17 10:59:10 +11:00