Commit graph

620 commits

Author SHA1 Message Date
Kevin Wolf 94282e7146 raw-posix: Fix build without is_allocated support
Move the declaration of s into the #ifdef sections that actually make
use of it.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2012-06-24 01:04:45 +02:00
Stefan Hajnoczi af7b708db2 qcow2: fix autoclear image header update
The autoclear feature bits can be used for qcow2 file format features
that are safe to "drop" by old programs that do not understand the
feature.  Upon opening the image file unknown autoclear feature bits are
cleared and the image file header is rewritten, but this was happening
too early in the code when critical header fields were not yet loaded.

Process autoclear feature bits after all necessary header information
has been loaded.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Kevin Wolf b7ab0fea37 qcow2: Fix avail_sectors in cluster allocation code
avail_sectors should really be the number of sectors from the start of
the allocation, not from the start of the write request.

We're lucky enough that this mistake didn't cause any real bug.
avail_sectors is only used in the intialiser of QCowL2Meta:

  .nb_available   = MIN(requested_sectors, avail_sectors),

m->nb_available in turn is only used for COW at the end of the
allocation. A COW occurs only if the request wasn't cluster aligned,
which in turn would imply that requested_sectors was less than
avail_sectors (both in the original and in the fixed version). In this
case avail_sectors is ignored and therefore the mistake doesn't cause
any misbehaviour.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Kevin Wolf cdba7fee1d qcow2: Simplify calculation for COW area at the end
copy_sectors() always uses the sum (cluster_offset + n_start) or
(start_sect + n_start), so if some value is added to both cluster_offset
and start_sect, and subtracted from n_start, it's cancelled out anyway.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
Paolo Bonzini 6af4e9ead4 qcow2: always operate caches in writeback mode
Writethrough does not need special-casing anymore in the qcow2 caches.
The block layer adds flushes after every guest-initiated data write,
and these will also flush the qcow2 caches to the OS.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:43 +02:00
MORITA Kazutaka e0d93a89b9 sheepdog: add coroutine_fn markers to coroutine functions
Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Josh Durgin b11f38fcdf rbd: hook up cache options
Writeback caching was added in Ceph 0.46, and writethrough will be in
0.47. These are controlled by general config options, so there's no
need to check for librbd version.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf 166acf546f qcow2: Support for fixing refcount inconsistencies
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf ccf34716ee qemu-img check: Print fixed clusters and recheck
When any inconsistencies have been fixed, print the statistics and run
another check to make sure everything is correct now.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf 4534ff5426 qemu-img check -r for repairing images
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().

This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini 6ef228fc0d stream: move rate limiting to a separate header file
Make the code reusable.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini 188a7bbf94 stream: move is_allocated_above to block.c
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini f9749f28b7 stream: tweak usage of bdrv_co_is_allocated
is_allocated_base has complex semantics that are not really usable
outside streaming.  Split the check in two parts, where the allocated
state for the top bs is moved to the caller.  The resulting function
is more generally useful.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Paolo Bonzini 5500316ded block: implement is_allocated for raw
Either FIEMAP, or SEEK_DATA+SEEK_HOLE can be used to implement the
is_allocated callback for raw files.  On Linux ext4, btrfs and XFS
all support it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Zhi Yong Wu 87267753a3 qcow2: fix endianness conversion
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Zhi Yong Wu 833e40858c qcow2: remove a line of unnecessary code
Commit 3948d1d4 removed the pointer argument we filled in with l2_offset
but forgot to remove the unnecessary l2_offset assignment.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-06-15 14:03:42 +02:00
Kevin Wolf 1417d7e40e qcow2: Silence false warning
Some gcc versions seem not to be able to figure out that the switch
statement covers all possible values and that c is therefore always
initialised. Add a default branch for them.

Reported-by: malc <av1474@comtv.ru>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: malc <av1474@comtv.ru>
2012-06-15 15:52:45 +04:00
Paolo Bonzini 7456e4ce8d build: move block/ objects to nested Makefile.objs
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-06-07 09:21:13 +02:00
Jim Meyering c2d76497b6 block: prevent snapshot mode $TMPDIR symlink attack
In snapshot mode, bdrv_open creates an empty temporary file without
checking for mkstemp or close failure, and ignoring the possibility
of a buffer overrun given a surprisingly long $TMPDIR.
Change the get_tmp_filename function to return int (not void),
so that it can inform its two callers of those failures.
Also avoid the risk of buffer overrun and do not ignore mkstemp
or close failure.
Update both callers (in block.c and vvfat.c) to propagate
temp-file-creation failure to their callers.

get_tmp_filename creates and closes an empty file, while its
callers later open that presumed-existing file with O_CREAT.
The problem was that a malicious user could provoke mkstemp failure
and race to create a symlink with the selected temporary file name,
thus causing the qemu process (usually root owned) to open through
the symlink, overwriting an attacker-chosen file.

This addresses CVE-2012-2652.
http://bugzilla.redhat.com/CVE-2012-2652

Signed-off-by: Jim Meyering <meyering@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-30 10:18:20 +02:00
MORITA Kazutaka 6f3c714eb7 sheepdog: fix return value of do_load_save_vm_state
bdrv_save_vmstate and bdrv_load_vmstate should return the vmstate size
on success, and -errno on error.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-30 09:58:39 +02:00
Anthony Liguori 306761537f Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
  fdc-test: introduced qtest no_media_on_start and cmos qtest for floppy
  fdc: fix media detection
  fdc: floppy drive should be visible after start without media
  qemu-iotests: mark 035 qcow2-only
  qcow2: Check qcow2_alloc_clusters_at() return value
  sheepdog: use heap instead of stack for BDRVSheepdogState
  sheepdog: return -errno on error
  sheepdog: mark image as snapshot when tag is specified
  qemu-img: Explain how rebase operation can be used to perform a 'diff' operation.
  qcow2: don't leak buffer for unexpected qcow_version in header
2012-05-29 04:30:49 -05:00
Ronnie Sahlberg f4dfa67f04 ISCSI: Switch to using READ16/WRITE16 for I/O to the LUN
This allows using LUNs bigger than 2TB.  Keep using READ10 for other
device types such as MMC.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:16 +02:00
Ronnie Sahlberg 6bcd1346bb ISCSI: Only call READCAPACITY16 for SBC devices, use READCAPACITY10 for MMC
Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:15 +02:00
Ronnie Sahlberg dbfff6d776 ISCSI: get device type at connection time
This is needed to avoid READ CAPACITY(16) for MMC devices.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-28 14:04:14 +02:00
Paolo Bonzini c7b4a95202 ISCSI: change num_blocks to 64-bit
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-28 14:04:14 +02:00
Ronnie Sahlberg c9b9f6824f ISCSI: redo how we set up the events
Call qemu_notify_event() after updating events.  Otherwise, If we add
an event for -is-writeable but the socket is already writeable there
may be a delay before the event callback is actually triggered.

Those delays would in particular hurt performance during BIOS boot and
when the GRUB bootloader reads the kernel and initrd.

But first call out to the socket write functions directly, and only set up
the write event if the socket is full.  This will happen very rarely and
this improves performance.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
2012-05-28 14:04:06 +02:00
Kevin Wolf df02179189 qcow2: Check qcow2_alloc_clusters_at() return value
When using qcow2_alloc_clusters_at(), the cluster allocation code
checked the wrong variable for an error code.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka b6fc8245e9 sheepdog: use heap instead of stack for BDRVSheepdogState
bdrv_create() is called in coroutine context now, so we cannot use
more stack than 1 MB in the function if we use ucontext coroutine.
This patch allocates BDRVSheepdogState, whose size is 4 MB, on the
heap in sd_create().

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka cb595887cc sheepdog: return -errno on error
On error, BlockDriver APIs should return -errno instead of -1.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
MORITA Kazutaka 622b6057be sheepdog: mark image as snapshot when tag is specified
When a snapshot tag is specified in the filename, the opened image is
a snapshot.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@lab.ntt.co.jp>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
Jim Meyering b6c147622d qcow2: don't leak buffer for unexpected qcow_version in header
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-25 18:12:54 +02:00
Kevin Wolf c44bfe4637 qcow2: Don't ignore failure to clear autoclear flags
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-14 17:02:19 +02:00
Anthony Liguori 04120e3bb0 block: fix warning introduced in efcc7a23
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
2012-05-10 09:10:42 -05:00
Paolo Bonzini efcc7a2324 stream: do not copy unallocated sectors from the base
Unallocated sectors should really never be accessed by the guest,
so there's no need to copy them during the streaming process.
If they are read by the guest during streaming, guest-initiated
copy-on-read will copy them (we're in the base == NULL case, which
enables copy on read).  If they are read after we disconnect the
image from the base, they will read as zeroes anyway.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini b21d677ee9 stream: fix ratelimiting corner case
This fixes inability to make progress in streaming if the quota is set
to less than the amount of data that an I/O operation has to write.

In this case, limit->dispatched + n will always be above the quota and,
due to the "goto retry" to recheck cancellation and allocation, streaming
will livelock.

This can be reproduced with "block_job_set_speed ide0-hd0 1b".  Of course,
with this patch the requested limit will not be obeyed.  That could be
done with another patch that caps is_allocated's n argument by the slice
quota.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini f6133def92 stream: pass new base image format to bdrv_change_backing_file
When an image is modified to point to the new backing file, the backing
file format is set to NULL, which means auto-probe.  This is wrong, in
fact it is a small security problem.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 11:01:59 +02:00
Paolo Bonzini fa4478d5c8 block: wait for job callback in block_job_cancel_sync
The limitation on not having I/O after cancellation cannot really be
kept.  Even streaming has a very small race window where you could
cancel a job and have it report completion.  If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.

So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false.  We can set a very lax
condition:

- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).

- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.

Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled.  Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time.  This condition is obeyed by the block_job_sleep_ns function.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini 4513eafe92 block: add block_job_sleep_ns
This function abstracts the pretty complex semantics of the "busy"
member of BlockJob.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini e023b2e244 block: fix snapshot on QED
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.

The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.

Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:12 +02:00
Paolo Bonzini 469ef350e1 block: update in-memory backing file and format
These are needed to print "info block" output correctly.  QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:11 +02:00
Paolo Bonzini 5f3777945d block: push bdrv_change_backing_file error checking up from drivers
This check applies to all drivers, but QED lacks it.

Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-10 10:32:11 +02:00
Anthony Liguori 7c652c1eaf Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
  fdc: simplify media change handling
  qcow2: lock on prealloc
  block: make bdrv_create adopt coroutine
  qcow2: Limit COW to where it's needed
  sheepdog: switch to writethrough mode if cluster doesn't support flush
2012-05-08 09:38:41 -05:00
Zhi Yong Wu 15552c4ad3 qcow2: lock on prealloc
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
Kevin Wolf 54e6814360 qcow2: Limit COW to where it's needed
This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.

Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.

If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.

This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
MORITA Kazutaka 115c2b5a68 sheepdog: switch to writethrough mode if cluster doesn't support flush
This is necessary for qemu to work with the older version of Sheepdog
which doesn't support SD_OP_FLUSH_VDI.

Signed-off-by: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-07 19:33:18 +02:00
Ronnie Sahlberg fa6acb0c2f ISCSI: Add support for thin-provisioning via discard/UNMAP and bigger LUNs
Update the configure test for libiscsi support to detect version 1.3
or later.  Version 1.3 of libiscsi provides both READCAPACITY16 as well
as UNMAP commands.

Update the iscsi block layer to use READCAPACITY16 to detect the size of
the LUN instead of READCAPACITY10. This allows support for LUNs larger
than 2TB.

Update to implement bdrv_aio_discard() using the UNMAP command.
This allows us to use thin-provisioned LUNs from TGTD and other iSCSI
targets that support thin-provisioning.

Signed-off-by: Ronnie Sahlberg <ronniesahlberg@gmail.com>
[squashed in subsequent patch from Ronnie to fix off-by-one in LBA count]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2012-05-04 10:39:18 +02:00
Josh Durgin 787f31330e rbd: add discard support
Change the write flag to an operation type in RBDAIOCB, and make the
buffer optional since discard doesn't use it.

Discard is first included in librbd 0.1.2 (which is in Ceph 0.46).
If librbd is too old, leave out qemu_rbd_aio_discard entirely,
so the old behavior is preserved.

Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:41:42 +02:00
Zhi Yong Wu 647cc47223 qcow2: fix the return value -ENOENT -> -EEXIST
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Kevin Wolf 7242411460 qcow2: Don't hold cache references across yield
If cache references are held while the coroutine has yielded, the cache
may get used up and abort() when it can't find a free entry.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00
Kevin Wolf 60651f901a qcow2: Remove unused parameter in do_alloc_cluster_offset
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2012-05-02 18:39:39 +02:00