Commit graph

2477 commits

Author SHA1 Message Date
Denis V. Lunev f575f145f4 qcow2: fix condition in is_zero_cluster
We should check for (res & BDRV_BLOCK_ZERO) only. The situation when we
will have !(res & BDRV_BLOCK_DATA) and will not have BDRV_BLOCK_ZERO is
not possible for images with bdi.unallocated_blocks_are_zero == true.

For those images where it's false, however, it can happen and we must
not consider the data zeroed then or we would corrupt the image.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-19 16:45:31 +02:00
Max Reitz b97511c7bc block: Propagate AioContext change to all children
Instead of propagating any change of a BDS's AioContext only to its file
and backing children and letting driver-specific code do the rest, just
propagate it to all and drop the thus superfluous implementations of
bdrv_{at,de}tach_aio_context() in Quorum, blkverify and VMDK.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 1f0c461b82 block: Remove BlockDriverState.blk
This patch removes the remaining users of bs->blk, which will allow us
to have multiple BBs on top of a single BDS. In the meantime, all checks
that are currently in place to prevent the user from creating such
setups can be switched to bdrv_has_blk() instead of accessing BDS.blk.

Future patches can allow them and e.g. enable users to mirror to a block
device that already has a BlockBackend on it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 79c719b755 block: Don't return throttling info in query-named-block-nodes
query-named-block-nodes should not return information that is related
to the attached BlockBackend rather than the node itself, so throttling
information needs to be removed from it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 7c8eece45b block: Avoid bs->blk in bdrv_next()
We need to introduce a separate BdrvNextIterator struct that can keep
more state than just the current BDS in order to avoid using the bs->blk
pointer.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf dde33812a8 block: Add bdrv_has_blk()
In many cases we just want to know whether a BDS has at least one BB
attached, without needing to know the exact BB that is attached. In
contrast to bs->blk, this is still a valid question when more than one
BB can be attached, so just answer it by checking the parents list.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 91c6e4b7bb block: Remove bdrv_aio_multiwrite()
Since virtio-blk implements request merging itself these days, the only
remaining users are test cases for the function. That doesn't make the
function exactly useful any more.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 66a0fae438 blockjob: Don't touch BDS iostatus
Block jobs don't actually make use of the iostatus for their BDSes, but
they manage a separate block job iostatus. Still, they require that it
is enabled for the source BDS and they enable it automatically for the
target and set the error handling mode - which ends up never being used
by the job.

This patch removes all of the BDS iostatus handling from the block job,
which removes another few bs->blk accesses.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 81e254dc83 blockjob: Don't set iostatus of target
When block job errors were introduced, we assigned the iostatus of the
target BDS "just in case". The field has never been accessible for the
user because the target isn't listed in query-block.

Before we can allow the user to have a second BlockBackend on the
target, we need to clean this up. If anything, we would want to set the
iostatus for the internal BB of the job (which we can always do later),
but certainly not for a separate BB which the job doesn't even use.

As a nice side effect, this gets us rid of another bs->blk use.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 4c265bf9f4 block: User BdrvChild callback for device name
In order to get rid of bs->blk for bdrv_get_device_name() and
bdrv_get_device_or_node_name(), ask all parents for their name and
simply pick the first one.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 5c8cab4808 block: Use BdrvChild callbacks for change_media/resize
We want to get rid of BlockDriverState.blk in order to allow multiple
BlockBackends per BDS. Converting the device callbacks in block.c (which
assume a single BlockBackend) to per-child callbacks gets us rid of the
first few instances.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf cbe1beb7a1 block: Don't check throttled reqs in bdrv_requests_pending()
Checking whether there are throttled requests requires going to the
associated BlockBackend, which we want to avoid.

All users of bdrv_requests_pending() in block/io.c already call
bdrv_parent_drained_begin() first, which restarts all throttled
requests, so no throttled requests can be left here and this is removal
of dead code.

The remaining users (assertions during graph manipulation in block.c)
don't care about requests that are still queued in the BlockBackend and
haven't been issued for a BlockDriverState yet.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:31 +02:00
Kevin Wolf 7ca7f0f6db block: Decouple throttling from BlockDriverState
This moves the throttling related part of the BDS life cycle management
to BlockBackend. The throttling group reference is now kept even when no
medium is inserted.

With this commit, throttling isn't disabled and then re-enabled any more
during graph reconfiguration. This fixes the temporary breakage of I/O
throttling when used with live snapshots or block jobs that manipulate
the graph.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf bb9aaecaf1 block/io: Quiesce parents between drained_begin/end
So far, bdrv_parent_drained_begin/end() was called for the duration of
the actual bdrv_drain() at the beginning of a drained section, but we
really should keep parents quiesced until the end of the drained
section.

This does not actually change behaviour at this point because the only
user of the .drained_begin/end BdrvChildRole callback is I/O throttling,
which already doesn't send any new requests after flushing its queue in
.drained_begin. The patch merely removes a trap for future users.

Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf c2066af051 block: Drain throttling queue with BdrvChild callback
This removes the last part of I/O throttling from block/io.c and moves
it to the BlockBackend.

Instead of having knowledge about throttling inside io.c, we can call a
BdrvChild callback .drained_begin/end, which happens to drain the
throttled requests for BlockBackend parents.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf 22aa8b246a block: Introduce BdrvChild.opaque
BlockBackends use it to get a back pointer from BdrvChild to
BlockBackend in any BdrvChildRole callbacks.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf 97148076e8 block: Move I/O throttling configuration functions to BlockBackend
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf 441565b279 block: Move actual I/O throttling to BlockBackend
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf 27ccdd5259 block: Move throttling fields from BDS to BB
This patch changes where the throttling state is stored (used to be the
BlockDriverState, now it is the BlockBackend), but it doesn't actually
make it a BB level feature yet. For example, throttling is still
disabled when the BDS is detached from the BB.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:30 +02:00
Kevin Wolf 49d2165d7d block: Convert throttle_group_get_name() to BlockBackend
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:29 +02:00
Kevin Wolf 31dce3ccca block: throttle-groups: Use BlockBackend pointers internally
As a first step towards moving I/O throttling to the BlockBackend level,
this patch changes all pointers in struct ThrottleGroup from referencing
a BlockDriverState to referencing a BlockBackend.

This change is valid because we made sure that throttling can only be
enabled on BDSes which have a BB attached.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:29 +02:00
Kevin Wolf f2cd875d54 block: Introduce BlockBackendPublic
Some features, like I/O throttling, are implemented outside
block-backend.c, but still want to keep information in BlockBackend,
e.g. list entries that allow keeping a list of BlockBackends.

In order to avoid exposing the whole struct layout in the public header
file, this patch introduces an embedded public struct where such
information can be added and a pair of functions to convert between
BlockBackend and BlockBackendPublic.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:29 +02:00
Kevin Wolf a5614993d7 block: Make sure throttled BDSes always have a BB
It was already true in principle that a throttled BDS always has a BB
attached, except that the order of operations while attaching or
detaching a BDS to/from a BB wasn't careful enough.

This commit breaks graph manipulations while I/O throttling is enabled.
It would have been possible to keep things working with some temporary
hacks, but quite cumbersome, so it's not worth the hassle. We'll fix
things again in a minute.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-05-19 16:45:29 +02:00
Paolo Bonzini 58369e22cf qemu-common: stop including qemu/bswap.h from qemu-common.h
Move it to the actual users.  There are still a few includes of
qemu/bswap.h in headers; removing them is left for future work.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-19 16:42:28 +02:00
Peter Maydell f68419eee9 Block layer patches
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJXNIcBAAoJEH8JsnLIjy/WiUQP/Rzfo8pe7TWA2InxdcDOPsx4
 2/tHHJdVkffnNX5rdBvc0mOUZNJxej0NtJu2e63BB+ydYju//xw8gruKU7TR+Nd3
 nPNSsqk80prK3RNgWu7qymBvIkHDDcDQhlp48HKq+dxrfConXtHmoXapGsqc0S47
 xu03oC6WzSIyLf7TLytcjUmEprQSaCGOwsb/XaHAWL750fFAGcdy/K5PWBpUv6DN
 T0jZ3u4UneE1jeabRmqAwjgDJXC9l6riH9fP/ZtYhgNlNj84zlMXajUHSULhGknP
 cTGjwwg9tOvhcjTdhdRmWlvG1m0T77ZX3icfZLhcTdb/Uz68NXVqs8P25IGV9McD
 DPrb3T/M8JUoqLXJxIpxUm2Levof5v0dUF1PHmN5bT7pshcqv/1J7v8Fdtf9l9mp
 zI0+FK1TZ102C0H2F7AWYZSlo2EfNUSd02QQx6MbfDokDIlIxY+EgP1/Es5XlkqC
 wc7HrJvq+uix2zXw9bn9Vg9p/nDuxlRx+ppRRarNNRonaqTrx/1qAaas4bsqc9Gz
 H6gxw7BHybm0TZFdHqAdIonpesecYw6yWUXT/mQehbfphsmQmu/d2HvF2C9uUm4X
 O0JduBlKOTm2hMcg5qL6Gko8WaQIctdCJH/1Onts92cZnm8Vr/9zcmMgwGoCd7sE
 +t6Yg0jqpTUJwhZhIuCw
 =NbjJ
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches

# gpg: Signature made Thu 12 May 2016 14:37:05 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"

* remotes/kevin/tags/for-upstream: (69 commits)
  qemu-iotests: iotests: fail hard if not run via "check"
  block: enable testing of LUKS driver with block I/O tests
  block: add support for encryption secrets in block I/O tests
  block: add support for --image-opts in block I/O tests
  qemu-io: Add 'write -z -u' to test MAY_UNMAP flag
  qemu-io: Add 'write -f' to test FUA flag
  qemu-io: Allow unaligned access by default
  qemu-io: Use bool for command line flags
  qemu-io: Make 'open' subcommand more like command line
  qemu-io: Add missing option documentation
  qmp: add monitor command to add/remove a child
  quorum: implement bdrv_add_child() and bdrv_del_child()
  Add new block driver interface to add/delete a BDS's child
  qemu-img: check block status of backing file when converting.
  iotests: fix the redirection order in 083
  block: Inactivate all children
  block: Drop superfluous invalidating bs->file from drivers
  block: Invalidate all children
  nbd: Simplify client FUA handling
  block: Honor BDRV_REQ_FUA during write_zeroes
  ...

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-05-12 16:33:40 +01:00
Wen Congyang 98292c61bc quorum: implement bdrv_add_child() and bdrv_del_child()
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: zhanghailiang <zhang.zhanghailiang@huawei.com>
Signed-off-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com>
Message-id: 1462865799-19402-3-git-send-email-xiecl.fnst@cn.fujitsu.com
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-05-12 15:33:23 +02:00
Fam Zheng c9e9e9c66c block: Drop superfluous invalidating bs->file from drivers
Now they are invalidated by the block layer, so it's not necessary to
do this in block drivers' implementations of .bdrv_invalidate_cache.

Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Eric Blake 52a4650574 nbd: Simplify client FUA handling
Now that the block layer honors per-bds FUA support, we don't
have to duplicate the fallback flush at the NBD layer.  The
static function nbd_co_writev_flags() is no longer needed, and
the driver can just directly use nbd_client_co_writev().

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Eric Blake 465fe887cc block: Honor BDRV_REQ_FUA during write_zeroes
The block layer has a couple of cases where it can lose
Force Unit Access semantics when writing a large block of
zeroes, such that the request returns before the zeroes
have been guaranteed to land on underlying media.

SCSI does not support FUA during WRITESAME(10/16); FUA is only
supported if it falls back to WRITE(10/16).  But where the
underlying device is new enough to not need a fallback, it
means that any upper layer request with FUA semantics was
silently ignoring BDRV_REQ_FUA.

Conversely, NBD has situations where it can support FUA but not
ZERO_WRITE; when that happens, the generic block layer fallback
to bdrv_driver_pwritev() (or the older bdrv_co_writev() in qemu
2.6) was losing the FUA flag.

The problem of losing flags unrelated to ZERO_WRITE has been
latent in bdrv_co_do_write_zeroes() since commit aa7bfbff, but
back then, it did not matter because there was no FUA flag.  It
became observable when commit 93f5e6d8 paved the way for flags
that can impact correctness, when we should have been using
bdrv_co_writev_flags() with modified flags.  Compare to commit
9eeb6dd, which got flag manipulation right in
bdrv_co_do_zero_pwritev().

Symptoms: I tested with qemu-io with default writethrough cache
(which is supposed to use FUA semantics on every write), and
targetted an NBD client connected to a server that intentionally
did not advertise NBD_FLAG_SEND_FUA.  When doing 'write 0 512',
the NBD client sent two operations (NBD_CMD_WRITE then
NBD_CMD_FLUSH) to get the fallback FUA semantics; but when doing
'write -z 0 512', the NBD client sent only NBD_CMD_WRITE.

The fix is do to a cleanup bdrv_co_flush() at the end of the
operation if any step in the middle relied on a BDS that does
not natively support FUA for that step (note that we don't
need to flush after every operation, if the operation is broken
into chunks based on bounce-buffer sizing).  Each BDS gains a
new flag .supported_zero_flags, which parallels the use of
.supported_write_flags but only when accessing a zero write
operation (the flags MUST be different, because of SCSI having
different semantics based on WRITE vs. WRITESAME; and also
because BDRV_REQ_MAY_UNMAP only makes sense on zero writes).

Also fix some documentation to describe -ENOTSUP semantics,
particularly since iscsi depends on those semantics.

Down the road, we may want to add a driver where its
.bdrv_co_pwritev() honors all three of BDRV_REQ_FUA,
BDRV_REQ_ZERO_WRITE, and BDRV_REQ_MAY_UNMAP, and advertise
this via bs->supported_write_flags for blocks opened by that
driver; such a driver should NOT supply .bdrv_co_write_zeroes
nor .supported_zero_flags.  But none of the drivers touched
in this patch want to do that (the act of writing zeroes is
different enough from normal writes to deserve a second
callback).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Eric Blake 4df863f336 block: Make supported_write_flags a per-bds property
Pre-patch, .supported_write_flags lives at the driver level, which
means we are blindly declaring that all block devices using a
given driver will either equally support FUA, or that we need a
fallback at the block layer.  But there are drivers where FUA
support is a per-block decision: the NBD block driver is dependent
on the remote server advertising NBD_FLAG_SEND_FUA (and has
fallback code to duplicate the flush that the block layer would do
if NBD had not set .supported_write_flags); and the iscsi block
driver is dependent on the mode sense bits advertised by the
underlying device (and is currently silently ignoring FUA requests
if the underlying device does not support FUA).

The fix is to make supported flags as a per-BDS option, set during
.bdrv_open().  This patch moves the variable and fixes NBD and iscsi
to set it only conditionally; later patches will then further
simplify the NBD driver to quit duplicating work done at the block
layer, as well as tackle the fact that SCSI does not support FUA
semantics on WRITESAME(10/16) but only on WRITE(10/16).

Signed-off-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Denis V. Lunev 2928abce6d qcow2: improve qcow2_co_write_zeroes()
There is a possibility that qcow2_co_write_zeroes() will be called
with the partial block. This could be synthetically triggered with
    qemu-io -c "write -z 32k 4k"
and can happen in the real life in qemu-nbd. The latter happens under
the following conditions:
    (1) qemu-nbd is started with --detect-zeroes=on and is connected to the
        kernel NBD client
    (2) third party program opens kernel NBD device with O_DIRECT
    (3) third party program performs write operation with memory buffer
        not aligned to the page
In this case qcow2_co_write_zeroes() is unable to perform the operation
and mark entire cluster as zeroed and returns ENOTSUP. Thus the caller
switches to non-optimized version and writes real zeroes to the disk.

The patch creates a shortcut. If the block is read as zeroes, f.e. if
it is unallocated, the request is extended to cover full block.
User-visible situation with this block is not changed. Before the patch
the block is filled in the image with real zeroes. After that patch the
block is marked as zeroed in metadata. Thus any subsequent changes in
backing store chain are not affected.

Kevin, thank you for a cool suggestion.

Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Eric Blake 7b1deac84e block: Kill unused sector-based blk_* functions
Now that there are no remaining clients, we can drop the
sector-based blk_read(), blk_write(), blk_aio_readv(), and
blk_aio_writev().  Sadly, there are still remaining
sector-based interfaces, such as blk_*discard(), or
blk_write_compressed(); those will have to wait for another
day.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:09 +02:00
Eric Blake 60cb2fa7eb block: Introduce byte-based aio read/write
blk_aio_readv() and blk_aio_writev() are annoying in that they
can't access sub-sector granularity, and cannot pass flags.
Also, they require the caller to pass redundant information
about the size of the I/O (qiov->size in bytes must match
nb_sectors in sectors).

Add new blk_aio_preadv() and blk_aio_pwritev() functions to fix
the flaws. The next few patches will upgrade callers, then
finally delete the old interfaces.

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Eric Blake 983a160050 block: Switch blk_*write_zeroes() to byte interface
Sector-based blk_write() should die; convert the one-off
variant blk_write_zeroes() to use an offset/count interface
instead.  Likewise for blk_co_write_zeroes() and
blk_aio_write_zeroes().

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Eric Blake b7d17f9fa4 block: Switch blk_read_unthrottled() to byte interface
Sector-based blk_read() should die; convert the one-off
variant blk_read_unthrottled().

Signed-off-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Eric Blake 8341f00dc2 block: Allow BDRV_REQ_FUA through blk_pwrite()
We have several block drivers that understand BDRV_REQ_FUA,
and emulate it in the block layer for the rest by a full flush.
But without a way to actually request BDRV_REQ_FUA during a
pass-through blk_pwrite(), FUA-aware block drivers like NBD are
forced to repeat the emulation logic of a full flush regardless
of whether the backend they are writing to could do it more
efficiently.

This patch just wires up a flags argument; followup patches
will actually make use of it in the NBD driver and in qemu-io.

Signed-off-by: Eric Blake <eblake@redhat.com>
Acked-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Janne Karhunen f249924e96 Allow users to specify the vmdk virtual hardware version.
Vmdk images have metadata to indicate the vmware virtual
hardware version image was created/tested to run with.
Allow users to specify that version via new 'hwversion'
option.

[ kwolf: Adjust qemu-iotests common.filter ]

Signed-off-by: Janne Karhunen <Janne.Karhunen@gmail.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Zhou Jie ed79f37d9b block: always compile-check debug prints
Files with conditional debug statements should ensure that the printf is
always compiled. This prevents bitrot of the format string of the debug
statement. And switch debug output to stderr.

Signed-off-by: Zhou Jie <zhoujie2011@cn.fujitsu.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf e3ddef25e9 block: Remove BlockDriver.bdrv_read/write
There are no block drivers left that implement the old .bdrv_read/write
interface, so it can be removed now. This gets us rid of the
corresponding emulation functions, too.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 4575eb496d vvfat: Implement .bdrv_co_preadv/pwritev interfaces
This doesn't really convert any of the actual vvfat logic to use
vectored I/O (and it's doubtful whether that would make sense), but
instead just adapts the wrappers to the modern interface.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 513b0f026b vpc: Implement .bdrv_co_pwritev() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf d46b7cc680 vpc: Implement .bdrv_co_preadv() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 37b1d7d8c9 vmdk: Implement .bdrv_co_pwritev() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf f10cc24359 vmdk: Implement .bdrv_co_preadv() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf a844a2b0d4 vmdk: Add vmdk_find_offset_in_cluster()
This is a byte granularity version of vmdk_find_index_in_cluster().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf fde9d56f5b vdi: Implement .bdrv_co_pwritev() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 0865bb6f04 vdi: Implement .bdrv_co_preadv() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 3edf1e73d5 dmg: Implement .bdrv_co_preadv() interface
This implements .bdrv_co_preadv() for the cloop block driver. While
updating the error paths, change -1 to a valid -errno code.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 5cd230819e cloop: Implement .bdrv_co_preadv() interface
This implements .bdrv_co_preadv() for the cloop block driver. While
updating the error paths, change -1 to a valid -errno code.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 3b8fd33011 bochs: Implement .bdrv_co_preadv() interface
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 3fb06697ae block: Introduce .bdrv_co_preadv/pwritev BlockDriver function
Many parts of the block layer are already byte granularity. The block
driver interface, however, was still missing an interface that allows
making use of this. This patch introduces a new BlockDriver interface,
which is based on coroutines, vectored, has flags and uses a byte
granularity. This is now the preferred interface for new drivers.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf cab3a3563c block: Rename bdrv_co_do_preadv/writev to bdrv_co_preadv/writev
It used to be an internal helper function just for implementing
bdrv_co_do_readv/writev(), but now that it's a public interface, it
deserves a name without "do" in it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:08 +02:00
Kevin Wolf 0884447382 block: Support AIO drivers in bdrv_driver_preadv/pwritev()
Instead of registering emulation functions as .bdrv_co_writev, just
directly check whether the function is there or not, and use the AIO
interface if it isn't. This makes the read/write functions more
consistent with how things are done in other places (flush, discard,
etc.)

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:07 +02:00
Kevin Wolf 78a07294d5 block: Introduce bdrv_driver_pwritev()
This is a function that simply calls into the block driver for doing a
write, providing the byte granularity interface we want to eventually
have everywhere, and using whatever interface that driver supports.

This one is a bit more interesting than the version for reads: It adds
support for .bdrv_co_writev_flags() everywhere, so that drivers
implementing this function can drop .bdrv_co_writev() now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:07 +02:00
Kevin Wolf 166fe96051 block: Introduce bdrv_driver_preadv()
This is a function that simply calls into the block driver for doing a
read, providing the byte granularity interface we want to eventually
have everywhere, and using whatever interface that driver supports.

For now, this is just a wrapper for calling bs->drv->bdrv_co_readv().

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini dd7f7ed104 linux-aio: make it more type safe
Replace void* with an opaque LinuxAioState type.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini 6b98bd6495 block: plug whole tree at once, introduce bdrv_io_unplugged_begin/end
Extract the handling of io_plug "depth" from linux-aio.c and let the
main bdrv_drain loop do nothing but wait on I/O.

Like the two newly introduced functions, bdrv_io_plug and bdrv_io_unplug
now operate on all children.  The visit order is now symmetrical between
plug and unplug, making it possible for formats to implement plug/unplug.

Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini ce0f141259 block: introduce bdrv_no_throttling_begin/end
Extract the handling of throttling from bdrv_flush_io_queue.  These
new functions will soon become BdrvChildRole callbacks, as they can
be generalized to "beginning of drain" and "end of drain".

Reviewed-by: Alberto Garcia <berto@igalia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini b6e84c97ed block: extract bdrv_drain_poll/bdrv_co_yield_to_drain from bdrv_drain/bdrv_co_drain
Do not call bdrv_drain_recurse twice in bdrv_co_drain.  A small
tweak to the logic in Fam's patch, which is harmless since no
one implements bdrv_drain anyway.  But better get it right.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini a72f641407 block: move restarting of throttled reqs to block/throttle-groups.c
We want to remove throttled_reqs from block/io.c.  This is the easy
part---hide the handling of throttled_reqs during disable/enable of
throttling within throttle-groups.c.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Paolo Bonzini 733bbc8cea block: make bdrv_start_throttled_reqs return void
The return value is unused and I am not sure why it would be useful.

Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Kevin Wolf 90c78624f1 block: Don't disable I/O throttling on sync requests
We had to disable I/O throttling with synchronous requests because we
didn't use to run timers in nested event loops when the code was
introduced. This isn't true any more, and throttling works just fine
even when using the synchronous API.

The removed code is in fact dead code since commit a8823a3b ('block: Use
blk_co_pwritev() for blk_write()') because I/O throttling can only be
set on the top layer, but BlockBackend always uses the coroutine
interface now instead of using the sync API emulation in block.c.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <1458660792-3035-2-git-send-email-kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-05-12 15:22:07 +02:00
Eric Blake 15c2f669e3 qapi: Split visit_end_struct() into pieces
As mentioned in previous patches, we want to call visit_end_struct()
functions unconditionally, so that visitors can release resources
tied up since the matching visit_start_struct() without also having
to worry about error priority if more than one error occurs.

Even though error_propagate() can be safely used to ignore a second
error during cleanup caused by a first error, it is simpler if the
cleanup cannot set an error.  So, split out the error checking
portion (basically, input visitors checking for unvisited keys) into
a new function visit_check_struct(), which can be safely skipped if
any earlier errors are encountered, and leave the cleanup portion
(which never fails, but must be called unconditionally if
visit_start_struct() succeeded) in visit_end_struct().

Generated code in qapi-visit.c has diffs resembling:

|@@ -59,10 +59,12 @@ void visit_type_ACPIOSTInfo(Visitor *v,
|         goto out_obj;
|     }
|     visit_type_ACPIOSTInfo_members(v, obj, &err);
|-    error_propagate(errp, err);
|-    err = NULL;
|+    if (err) {
|+        goto out_obj;
|+    }
|+    visit_check_struct(v, &err);
| out_obj:
|-    visit_end_struct(v, &err);
|+    visit_end_struct(v);
| out:

and in qapi-event.c:

@@ -47,7 +47,10 @@ void qapi_event_send_acpi_device_ost(ACP
|         goto out;
|     }
|     visit_type_q_obj_ACPI_DEVICE_OST_arg_members(v, &param, &err);
|-    visit_end_struct(v, err ? NULL : &err);
|+    if (!err) {
|+        visit_check_struct(v, &err);
|+    }
|+    visit_end_struct(v);
|     if (err) {
|         goto out;

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1461879932-9020-20-git-send-email-eblake@redhat.com>
[Conflict with a doc fixup resolved]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2016-05-12 09:47:55 +02:00
Kevin Wolf d208c50d9d vvfat: Fix default volume label
Commit d5941dd documented that it leaves the default volume name as it
was ("QEMU VVFAT"), but it doesn't actually implement this. You get an
empty name (eleven space characters) instead.

This fixes the implementation to apply the advertised default.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-04-29 11:14:13 +02:00
Kevin Wolf ebb72c9f06 vvfat: Fix volume name assertion
Commit d5941dd made the volume name configurable, but it didn't consider
that the rw code compares the volume name string to assert that the
first directory entry is the volume name. This made vvfat crash in rw
mode.

This fixes the assertion to compare with the configured volume name
instead of a literal string.

Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-04-29 11:14:08 +02:00
Fam Zheng ab27c3b5e7 mirror: Workaround for unexpected iohandler events during completion
Commit 5a7e7a0ba moved mirror_exit to a BH handler but didn't add any
protection against new requests that could sneak in just before the
BH is dispatched. For example (assuming a code base at that commit):

        main_loop_wait # 1
          os_host_main_loop_wait
            g_main_context_dispatch
              aio_ctx_dispatch
                aio_dispatch
                  ...
                    mirror_run
                      bdrv_drain
    (a)               block_job_defer_to_main_loop
          qemu_iohandler_poll
            virtio_queue_host_notifier_read
              ...
                virtio_submit_multiwrite
    (b)           blk_aio_multiwrite

        main_loop_wait # 2
          <snip>
                aio_dispatch
                  aio_bh_poll
    (c)             mirror_exit

At (a) we know the BDS has no pending request. However, the same
main_loop_wait call is going to dispatch iohandlers (EventNotifier
events), which may lead to a new I/O from guest. So the invariant is
already broken at (c). Data loss.

Commit f3926945c8 made iohandler to use aio API.  The order of
virtio_queue_host_notifier_read and block_job_defer_to_main_loop within
a main_loop_wait becomes unpredictable, and even worse, if the host
notifier event arrives at the next main_loop_wait call, the
unpredictable order between mirror_exit and
virtio_queue_host_notifier_read is also a trouble. As shown below, this
commit made the bug easier to trigger:

    - Bug case 1:

        main_loop_wait # 1
          os_host_main_loop_wait
            g_main_context_dispatch
              aio_ctx_dispatch (qemu_aio_context)
                ...
                  mirror_run
                    bdrv_drain
    (a)             block_job_defer_to_main_loop
              aio_ctx_dispatch (iohandler_ctx)
                virtio_queue_host_notifier_read
                  ...
                    virtio_submit_multiwrite
    (b)               blk_aio_multiwrite

        main_loop_wait # 2
          ...
                aio_dispatch
                  aio_bh_poll
    (c)             mirror_exit

    - Bug case 2:

        main_loop_wait # 1
          os_host_main_loop_wait
            g_main_context_dispatch
              aio_ctx_dispatch (qemu_aio_context)
                ...
                  mirror_run
                    bdrv_drain
    (a)             block_job_defer_to_main_loop

        main_loop_wait # 2
          ...
            aio_ctx_dispatch (iohandler_ctx)
              virtio_queue_host_notifier_read
                ...
                  virtio_submit_multiwrite
    (b)             blk_aio_multiwrite
              aio_dispatch
                aio_bh_poll
    (c)           mirror_exit

In both cases, (b) breaks the invariant wanted by (a) and (c).

Until then, the request loss has been silent. Later, 3f09bfbc7b added
asserts at (c) to check the invariant (in
bdrv_replace_in_backing_chain), and Max reported an assertion failure
first visible there, by doing active committing while the guest is
running bonnie++.

2.5 added bdrv_drained_begin at (a) to protect the dataplane case from
similar problems, but we never realize the main loop bug until now.

As a bandage, this patch disables iohandler's external events
temporarily together with bs->ctx.

Launchpad Bug: 1570134

Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-22 16:44:09 +02:00
Fam Zheng 4150ae60eb mirror: Don't extend the last sub-chunk
The last sub-chunk is rounded up to the copy granularity in the target
image, resulting in a larger size than the source.

Add a function to clip the copied sectors to the end.

This undoes the "wrong" changes to tests/qemu-iotests/109.out in
e5b43573e2. The remaining two offset changes are okay.

[ kwolf: Use DIV_ROUND_UP to calculate nb_chunks now ]

Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
2016-04-20 16:52:55 +02:00
Max Reitz f27a274259 block/mirror: Refresh stale bitmap iterator cache
If the drive's dirty bitmap is dirtied while the mirror operation is
running, the cache of the iterator used by the mirror code may become
stale and not contain all dirty bits.

This only becomes an issue if we are looking for contiguously dirty
chunks on the drive. In that case, we can easily detect the discrepancy
and just refresh the iterator if one occurs.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-20 16:52:55 +02:00
Max Reitz 9c83625bdd block/mirror: Revive dead yielding code
mirror_iteration() is supposed to wait if the current chunk is subject
to a still in-flight mirroring operation. However, it mixed checking
this conflict situation with checking the dirty status of a chunk. A
simplification for the latter condition (the first chunk encountered is
always dirty) led to neglecting the former: We just skip the first chunk
and thus never test whether it conflicts with an in-flight operation.

To fix this, pull out the code which waits for in-flight operations on
the first chunk of the range to be mirrored to settle.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-20 16:52:55 +02:00
Jeff Cody d85fa9eb87 block/gluster: prevent data loss after i/o error
Upon receiving an I/O error after an fsync, by default gluster will
dump its cache.  However, QEMU will retry the fsync, which is especially
useful when encountering errors such as ENOSPC when using the werror=stop
option.  When using caching with gluster, however, the last written data
will be lost upon encountering ENOSPC.  Using the write-behind-cache
xlator option of 'resync-failed-syncs-after-fsync' should cause gluster
to retain the cached data after a failed fsync, so that ENOSPC and other
transient errors are recoverable.

Unfortunately, we have no way of knowing if the
'resync-failed-syncs-after-fsync' xlator option is supported, so for now
close the fd and set the BDS driver to NULL upon fsync error.

Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-04-19 12:24:59 -04:00
Jeff Cody 5d4343e6c2 block/gluster: code movement of qemu_gluster_close()
Move qemu_gluster_close() further up in the file, in preparation
for the next patch, to avoid a forward declaration.

Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-04-19 12:24:59 -04:00
Jeff Cody a882745356 block/gluster: return correct error value
Upon error, gluster will call the aio callback function with a
ret value of -1, with errno set to the proper error value.  If
we set the acb->ret value to the return value in the callback,
that results in every error being EPERM (i.e. 1).  Instead, set
it to the proper error result.

Reviewed-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-04-19 12:24:59 -04:00
Kevin Wolf 16aaf975ee block: Don't ignore flags in blk_{,co,aio}_write_zeroes()
Commit 57d6a428 neglected to pass the given flags to blk_aio_prwv(),
which broke discard by WRITE SAME for scsi-disk (the UNMAP bit would be
ignored).

Commit fc1453cd introduced the same bug for blk_write_zeroes(). This is
used for 'qemu-img convert' without has_zero_init (e.g. on a block
device) and for preallocation=falloc in parallels.

Commit 8896e088 is the version for blk_co_write_zeroes(). This function
is only used in qemu-io.

Reported-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody 9c057d0b68 block/vpc: update comments to be compliant w/coding guidelines
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody 32f6439cf7 block/vpc: set errp in vpc_open
Add more useful error information to failure paths in vpc_open

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody 66176fc6a7 block/vpc: make checks on max table size a bit more lax
The check on the max_table_size field not being larger than required is
valid, and in accordance with the VHD spec.  However, there have been
VHD images encountered in the wild that have an out-of-spec max table
size that is technically too large.

There is no issue in allowing this larger table size, as we also
later verify that the computed size (used for the pagetable) is
large enough to fit all sectors.  In addition, max_table_entries
is bounds checked against SIZE_MAX and INT_MAX.

Remove the strict check, so that we can accomodate these sorts of
images that are benignly out of spec.

Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Reported-by: Grant Wu <grantwwu@gmail.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody c23fb11bbb block/vpc: Use the correct max sector count for VHD images
The old VHD_MAX_SECTORS value is incorrect, and is a throwback
to the CHS calculations.  The VHD specification allows images up to 2040
GiB, which (using 512 byte sectors) corresponds to a maximum number of
sectors of 0xff000000, rather than the old value of 0xfe0001ff.

Update VHD_MAX_SECTORS to reflect the correct value.

Also, update comment references to the actual size limit, and correct
one compare so that we can have sizes up to the limit.

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody bab246db1d block/vpc: use current_size field for XenConverter VHD images
XenConverter VHD images are another VHD image where current_size is
different from the CHS values in the the format header.  Use
current_size as the default, by looking at the creator_app signature
field.

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Stefan Hajnoczi 9bdfb9e8ac vpc: use current_size field for XenServer VHD images
The vpc driver has two methods of determining virtual disk size.  The
correct one to use depends on the software that generated the image
file.  Add the XenServer creator_app signature so that image size is
correctly detected for those images.

Reported-by: Grant Wu <grantwwu@gmail.com>
Reported-by: Spencer Baugh <sbaugh@catern.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:12 +02:00
Jeff Cody 0211b9becc block/vpc: set errp in vpc_create
Add more useful error information to failure paths in vpc_create().

Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-15 17:22:11 +02:00
Kevin Wolf 7fa84cd8d4 block: Fix blk_aio_write_zeroes()
Commit 57d6a428 broke blk_aio_write_zeroes() because in some write
functions in the call path don't have an explicit length argument but
reuse qiov->size instead. Which is great, except that write_zeroes
doesn't have a qiov, which this commit interprets as 0 bytes.
Consequently, blk_aio_write_zeroes() didn't effectively do anything.

This patch introduces an explicit acb->bytes in BlkAioEmAIOCB and uses
that instead of acb->rwco.size.

The synchronous version of the function is okay because it does pass a
qiov (with the right size and a NULL pointer as its base).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-04-15 17:22:11 +02:00
Max Reitz 4e876bcf2b qcow2: Prevent backing file names longer than 1023
We reject backing file names with a length of more than 1023 characters
when opening a qcow2 file, so we should not produce such files
ourselves.

Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-12 18:06:51 +02:00
Paolo Bonzini 40a99aace3 vpc: fix return value check for blk_pwrite
bdrv_pwrite_sync used to return zero or negative error, while blk_pwrite returns
the number of written bytes when successful.  This caused VPC image creation
to fail spectacularly: it wrote the first 512 bytes, and then exited immediately
because of the non-zero answer from blk_pwrite.  But the truly spectacular part
is that it returns a positive value (the 512 that blk_pwrite returned) causing
everyone to believe that it succeeded.

This fixes qemu-iotests with vpc format.

Fixes: b8f45cdf78
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-04-12 18:06:51 +02:00
Fam Zheng 39bf92dd70 mirror: Replace bdrv_drain(bs) with bdrv_co_drain(bs)
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1459855253-5378-3-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-04-11 16:59:09 +01:00
Fam Zheng a77fd4bb29 block: Fix bdrv_drain in coroutine
Using the nested aio_poll() in coroutine is a bad idea. This patch
replaces the aio_poll loop in bdrv_drain with a BH, if called in
coroutine.

For example, the bdrv_drain() in mirror.c can hang when a guest issued
request is pending on it in qemu_co_mutex_lock().

Mirror coroutine in this case has just finished a request, and the block
job is about to complete. It calls bdrv_drain() which waits for the
other coroutine to complete. The other coroutine is a scsi-disk request.
The deadlock happens when the latter is in turn pending on the former to
yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
(assuming a qcow2 image):

  mirror coroutine               scsi-disk coroutine
  -------------------------------------------------------------
  do last write

    qcow2:qemu_co_mutex_lock()
    ...
                                 scsi disk read

                                   tracked request begin

                                   qcow2:qemu_co_mutex_lock.enter

    qcow2:qemu_co_mutex_unlock()

  bdrv_drain
    while (has tracked request)
      aio_poll()

In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
because the mirror coroutine is blocked in the aio_poll(blocking=true).

With this patch, the added qemu_coroutine_yield() allows the scsi-disk
coroutine to make progress as expected:

  mirror coroutine               scsi-disk coroutine
  -------------------------------------------------------------
  do last write

    qcow2:qemu_co_mutex_lock()
    ...
                                 scsi disk read

                                   tracked request begin

                                   qcow2:qemu_co_mutex_lock.enter

    qcow2:qemu_co_mutex_unlock()

  bdrv_drain.enter
>   schedule BH
>   qemu_coroutine_yield()
>                                  qcow2:qemu_co_mutex_lock.return
>                                  ...
                                   tracked request end
    ...
    (resumed from BH callback)
  bdrv_drain.return
  ...

Reported-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Message-id: 1459855253-5378-2-git-send-email-famz@redhat.com
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-04-11 16:59:09 +01:00
Peter Maydell 31370dbe5d Block layer patches for 2.6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJXA9qJAAoJEH8JsnLIjy/W+W8P/2xTXH8+h0qurUvv5Rz6HUbD
 HHNlGnqa3M5yMLMqmtlb0J/9dj0pTlGNkp7d/9blh3MlZH/ZpeUMtq8Lro23YhUF
 0J94sRKGhK3T5GqYSA/BFbVvXQJ3yX7cKcYaQjmh7rK6Ua+65Mv/dulci+jbfGuu
 BkiVgumGAalSeaFqXZR685g61ZHbz+mQJnd3VFcvletnPBu0j1GMkuU0THAcy09q
 CTUjwWlL9CHu1lYkAa0KxgFtj6mZ+gEu5ws5Lvk8yFtSB+af/mJtzoHuq/7+Ske2
 7SiVXotFW8kR7ic1TnWiEku8+31FSBVJp6xUcRVTDOHVG7oSQxBDg5bGwPn8TxXy
 bvLvTJDIFodGhkiDTuGLuttvX+U2xCl4GmBS01OiFF53UGWgjjY+pkDZiaNC4nFW
 vwItj7/KGKL2Nq6cVfGCDOYYjFtHAPGI3yyJ2babXecv+9nKr0WeJpk6cfKVnP17
 rZs28Y3Ub/P2M4oOt4YdhRSanQZbe5eIQOsdfWX4q12hujL0zbsCtV6dpjeTsY74
 J4CBLzBYCj6y9Jc8R+D6XLYJBtJQGaSj99Oqe9WdUuHEqTGQt4HMuYHAj77wImVG
 ccURYiBpmB+FChLhG+yIlo1PHS0kpgeD+ZkZDHC0gYdiFqdKnFeQ7lj3Jj2tSWXY
 7Y7qbaOhtXp+20M6oO+G
 =ax/b
 -----END PGP SIGNATURE-----

Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging

Block layer patches for 2.6

# gpg: Signature made Tue 05 Apr 2016 16:32:25 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"

* remotes/kevin/tags/for-upstream:
  crypto: Avoid memory leak on failure
  qemu-iotests: 149: Use "/usr/bin/env python"
  block: Forbid I/O throttling on nodes with multiple parents for 2.6
  block: forbid x-blockdev-del from acting on DriveInfo

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
2016-04-05 17:03:32 +01:00
Eric Blake 95c3df5a24 crypto: Avoid memory leak on failure
Commit 7836857 introduced a memory leak due to invalid use of
Error vs. visit_type_end().  If visiting the intermediate
members fails, we clear the error and unconditionally use
visit_end_struct() on the same error object; but if that
cleanup succeeds, we then skip the qapi_free call.

Until a later patch adds visit_check_struct(), the only safe
approach is to use two separate error objects.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-id: 1459526222-30052-1-git-send-email-eblake@redhat.com
Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-04-05 17:23:21 +02:00
Eric Blake a89ef0c357 nbd: don't request FUA on FLUSH
The NBD protocol does not clearly document what will happen
if a client sends NBD_CMD_FLAG_FUA on NBD_CMD_FLUSH.
Historically, both the qemu and upstream NBD servers silently
ignored that flag, but that feels a bit risky.  Meanwhile, the
qemu NBD client unconditionally sends the flag (without even
bothering to check whether the caller cares; at least with
NBD_CMD_WRITE the client only sends FUA if requested by a
higher layer).

There is ongoing discussion on the NBD list to fix the
protocol documentation to require that the server MUST ignore
the flag (unless the kernel folks can better explain what FUA
means for a flush), but until those doc improvements land, the
current nbd.git master was recently changed to reject the flag
with EINVAL (see nbd commit ab22e082), which now makes it
impossible for a qemu client to use FLUSH with an upstream NBD
server.

We should not send FUA with flush unless the upstream protocol
documents what it will do, and even then, it should be something
that the caller can opt into, rather than being unconditional.

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1459526902-32561-1-git-send-email-eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-05 11:46:52 +02:00
Stefan Hajnoczi 0d94b74655 block/nfs: add missing #include "qemu/cutils.h"
parse_uint_full() used to be included from qemu-common.h but was moved
to qemu/cutils.h in commit f348b6d1a5
("util: move declarations out of qemu-common.h").

Cc: Veronia Bahaa <veroniabahaa@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1459341994-20567-3-git-send-email-stefanha@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-03-30 16:50:39 -04:00
Stefan Hajnoczi d165b8cb8b block/nfs: add missing #include "qapi/error.h"
error_setg() used to be included indirectly through qemu/osdep.h.  Since
commit da34e65cb4 ("include/qemu/osdep.h:
Don't include qapi/error.h") it requires an explicit include.

Cc: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1459341994-20567-2-git-send-email-stefanha@redhat.com
Signed-off-by: Jeff Cody <jcody@redhat.com>
2016-03-30 16:50:39 -04:00
Max Reitz a90639270d block/null-{co,aio}: Implement get_block_status()
Signed-off-by: Max Reitz <mreitz@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-03-30 12:16:04 +02:00
Max Reitz cd219eb1e5 block/null-{co,aio}: Allow reading zeroes
This is optional so that it does not impede the null block driver's
performance unless this behavior is desired.

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Acked-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-03-30 12:16:03 +02:00
Kevin Wolf 09cf9db1bc block: Remove bdrv_(set_)enable_write_cache()
The only remaining users were block jobs (mirror and backup) which
unconditionally enabled WCE on the BlockBackend of the target image. As
these block jobs don't go through BlockBackend for their I/O requests,
they aren't affected by this setting anyway but always get a writeback
mode, so that call can be removed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:03 +02:00
Kevin Wolf 61de4c6808 block: Remove BDRV_O_CACHE_WB
The previous patches have successively made blk->enable_write_cache the
true source for the information whether a writethrough mode must be
implemented. The corresponding BDRV_O_CACHE_WB is only useless baggage
we're carrying around, so now's the time to remove it.

At the same time, we remove the 'cache.writeback' option parsing on the
BDS level as the only effect was setting the BDRV_O_CACHE_WB flag.

This change requires test cases that explicitly enabled the option to
drop it. Other than that and the change of the error message when
writethrough is enabled on the BDS level (from "Can't set writethrough
mode" to "doesn't support the option"), there should be no change in
behaviour.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:03 +02:00
Kevin Wolf 5481531154 raw: Support BDRV_REQ_FUA
Pass through the FUA flag to the lower layer so that the separate flush
can be saved in practically relevant cases where a (raw) format driver
sits on top of the protocol driver.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00
Kevin Wolf 2b556518c3 nbd: Support BDRV_REQ_FUA
The NBD server already used to send a FUA flag when the writethrough
mode was set. This code was a remnant from the times where protocol
drivers actually had to implement writethrough modes. Since nowadays the
block layer sends flushes in writethrough mode and non-root nodes are
always writeback, this was mostly dead code - only mostly because if NBD
was configured to be used without a format, we sent _both_ FUA and an
explicit flush afterwards, which makes the code not technically dead,
but useless overhead.

This patch changes the code so that the block layer's FUA flag is
recognised and translated into a NBD FUA flag. The additional flush is
avoided now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00
Kevin Wolf 9f0eb9e129 iscsi: Support BDRV_REQ_FUA
This replaces the existing hack in the iscsi driver that sent the FUA
bit in writethrough mode and ignored the following flush in order to
optimise the number of roundtrips (see commit 73b5394e).

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00
Kevin Wolf 93f5e6d88a block: Introduce bdrv_co_writev_flags()
This function will allow drivers to implement BDRV_REQ_FUA natively
instead of sending a separate flush after the write.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00
Kevin Wolf c83f9fba2a block/qapi: Use blk_enable_write_cache()
Now that WCE is handled on the BlockBackend level, the flag is
meaningless for BDSes. As the schema requires us to fill the field,
we return an enabled write cache for them.

Note that this means that querying the BlockBackend name may return
writethrough as the cache information, whereas querying the node-name of
the root of that same BlockBackend will return writeback.

This may appear odd at first, but it actually makes sense because it
correctly repesents the layer that implements the WCE handling. This
becomes more apparent when you consider nodes that are the root node of
multiple BlockBackends, where each BB can have its own WCE setting.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00
Kevin Wolf bfd18d1e0b block: Move enable_write_cache to BB level
Whether a write cache is used or not is a decision that concerns the
user (e.g. the guest device) rather than the backend. It was already
logically part of the BB level as bdrv_move_feature_fields() always kept
it on top of the BDS tree; with this patch, the core of it (the actual
flag and the additional flushes) is also implemented there.

Direct callers of bdrv_open() must pass BDRV_O_CACHE_WB now if bs
doesn't have a BlockBackend attached.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
2016-03-30 12:16:02 +02:00