Commit graph

12 commits

Author SHA1 Message Date
Alex Williamson 89d5202edc vfio/pci: Allow relocating MSI-X MMIO
Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table.  This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64).  In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device.  However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X.  Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location.  There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.

This new x-msix-relocation option accepts the following choices:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Use a known good combination for the platform/device (none yet)
  bar0..bar5: Specify the target BAR for MSI-X data structures

If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.

The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area.  Take for
example:

# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
	...
	Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
	Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
	...
	Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000

This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance.  The data sheet specifically refers
to this as an MSI-X BAR.  This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.

However, here's another example:

# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
	...
	Region 0: I/O ports at c000 [size=256]
	Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
	Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
	...
	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=1 offset=0000e000
		PBA: BAR=1 offset=0000f000

Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR.  If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform.  At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.

Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures.  A few key rules to keep in mind for this selection
include:

 * There are only 6 BAR slots, bar0..bar5
 * 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
 * PCI BARs are always a power of 2 in size, extending == doubling
 * The maximum size of a 32-bit BAR is 2GB
 * MSI-X data structures must reside in an MMIO BAR

Using these rules, we can evaluate each BAR of the second example
device above as follows:

 bar0: I/O port BAR, incompatible with MSI-X tables
 bar1: BAR could be extended, incurring another 64KB of MMIO
 bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
 bar3: BAR could be extended, incurring another 256KB of MMIO
 bar4: Unavailable, bar3 is 64bit, this register is used by bar3
 bar5: Available, empty BAR, minimum additional MMIO

A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3.  The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users.  This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.

Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06 11:08:26 -07:00
Alexey Kardashevskiy 07bc681a33 vfio/spapr: Use iommu memory region's get_attr()
In order to enable TCE operations support in KVM, we have to inform
the KVM about VFIO groups being attached to specific LIOBNs. The KVM
already knows about VFIO groups, the only bit missing is which
in-kernel TCE table (the one with user visible TCEs) should update
the attached broups. There is an KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
attribute of the VFIO KVM device which receives a groupfd/tablefd couple.

This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd
and calls KVM to establish the link.

As get_attr() is not implemented yet, this should cause no behavioural
change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2018-02-06 11:08:24 -07:00
Vladimir Sementsov-Ogievskiy 8908eb1a4a trace-events: fix code style: print 0x before hex numbers
The only exception are groups of numers separated by symbols
'.', ' ', ':', '/', like 'ab.09.7d'.

This patch is made by the following:

> find . -name trace-events | xargs python script.py

where script.py is the following python script:
=========================
 #!/usr/bin/env python

import sys
import re
import fileinput

rhex = '%[-+ *.0-9]*(?:[hljztL]|ll|hh)?(?:x|X|"\s*PRI[xX][^"]*"?)'
rgroup = re.compile('((?:' + rhex + '[.:/ ])+' + rhex + ')')
rbad = re.compile('(?<!0x)' + rhex)

files = sys.argv[1:]

for fname in files:
    for line in fileinput.input(fname, inplace=True):
        arr = re.split(rgroup, line)
        for i in range(0, len(arr), 2):
            arr[i] = re.sub(rbad, '0x\g<0>', arr[i])

        sys.stdout.write(''.join(arr))
=========================

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Message-id: 20170731160135.12101-5-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-08-01 12:13:07 +01:00
Philippe Mathieu-Daudé 87e0331c5a docs: fix broken paths to docs/devel/tracing.txt
With the move of some docs/ to docs/devel/ on ac06724a71,
no references were updated.

Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
2017-07-31 13:12:53 +03:00
Peter Xu 3213835720 vfio: trace map/unmap for notify as well
We traces its range, but we don't know whether it's a MAP/UNMAP. Let's
dump it as well.

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2017-02-17 21:52:31 +02:00
Stefan Hajnoczi 7f4076c1bb trace: clean up trace-events files
There are a number of unused trace events that
scripts/cleanup-trace-events.pl finds.  The "hw/vfio/pci-quirks.c"
filename was typoed and "qapi/qapi-visit-core.c" was missing the qapi/
directory prefix.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20170126171613.1399-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-01-31 17:12:15 +00:00
Eric Auger 1a22aca1d0 vfio/pci: Conversion to realize
This patch converts VFIO PCI to realize function.

Also original initfn errors now are propagated using QEMU
error objects. All errors are formatted with the same pattern:
"vfio: %s: the error description"

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-10-17 10:58:01 -06:00
Laurent Vivier e723b87103 trace-events: fix first line comment in trace-events
Documentation is docs/tracing.txt instead of docs/trace-events.txt.

find . -name trace-events -exec \
     sed -i "s?See docs/trace-events.txt for syntax documentation.?See docs/tracing.txt for syntax documentation.?" \
     {} \;

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Message-id: 1470669081-17860-1-git-send-email-lvivier@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-08-12 10:36:01 +01:00
Alexey Kardashevskiy 2e4109de8e vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)
New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management.
This adds ability to VFIO common code to dynamically allocate/remove
DMA windows in the host kernel when new VFIO container is added/removed.

This adds a helper to vfio_listener_region_add which makes
VFIO_IOMMU_SPAPR_TCE_CREATE ioctl and adds just created IOMMU into
the host IOMMU list; the opposite action is taken in
vfio_listener_region_del.

When creating a new window, this uses heuristic to decide on the TCE table
levels number.

This should cause no guest visible change in behavior.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[dwg: Added some casts to prevent printf() warnings on certain targets
 where the kernel headers' __u64 doesn't match uint64_t or PRIx64]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-05 14:31:08 +10:00
Alexey Kardashevskiy 318f67ce13 vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)
This makes use of the new "memory registering" feature. The idea is
to provide the userspace ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do locked pages
accounting (once) and not spent time on doing that in real time with
possible failures which cannot be handled nicely in some cases.

This adds a prereg memory listener which listens on address_space_memory
and notifies a VFIO container about memory which needs to be
pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped.

The feature is only enabled for SPAPR IOMMU v2. The host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does
not call it when v2 is detected and enabled.

This enforces guest RAM blocks to be host page size aligned; however
this is not new as KVM already requires memory slots to be host page
size aligned.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[dwg: Fix compile error on 32-bit host]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2016-07-05 14:30:54 +10:00
Alex Williamson e37dac06dc vfio/pci: Hide SR-IOV capability
The kernel currently exposes the SR-IOV capability as read-only
through vfio-pci.  This is sufficient to protect the host kernel, but
has the potential to confuse guests without further virtualization.
In particular, OVMF tries to size the VF BARs and comes up with absurd
results, ending with an assert.  There's not much point in adding
virtualization to a read-only capability, so we simply hide it for
now.  If the kernel ever enables SR-IOV virtualization, we should
easily be able to test it through VF BAR sizing or explicit flags.

Testing whether we should parse extended capabilities is also pulled
into the function to keep these assumptions in one place.

Tested-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2016-06-30 13:00:23 -06:00
Daniel P. Berrange 1cf6ebc73f trace: split out trace events for hw/vfio/ directory
Move all trace-events for files in the hw/vfio/ directory to
their own file.

Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 1466066426-16657-30-git-send-email-berrange@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-06-20 17:22:16 +01:00