As we are about to access the APRs from the GICv2 uaccess interface,
make this logic generally available.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
The ARM-ARM has two bits in the ESR/HSR relevant to external aborts.
A range of {I,D}FSC values (of which bit 5 is always set) and bit 9 'EA'
which provides:
> an IMPLEMENTATION DEFINED classification of External Aborts.
This bit is in addition to the {I,D}FSC range, and has an implementation
defined meaning. KVM should always ignore this bit when handling external
aborts from a guest.
Remove the ESR_ELx_EA definition and rewrite its helper
kvm_vcpu_dabt_isextabt() to check the {I,D}FSC range. This merges
kvm_vcpu_dabt_isextabt() and the recently added is_abort_sea() helper.
CC: Tyler Baicar <tbaicar@codeaurora.org>
Reported-by: gengdongjiu <gengdj.1984@gmail.com>
Signed-off-by: James Morse <james.morse@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>
According to the SDM, if the "use TPR shadow" VM-execution control is
1, bits 11:0 of the virtual-APIC address must be 0 and the address
should set any bits beyond the processor's physical-address width.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
------------[ cut here ]------------
WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc4+ #11
RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
Call Trace:
? kvm_multiple_exception+0x149/0x170 [kvm]
? handle_emulation_failure+0x79/0x230 [kvm]
? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
? check_chain_key+0x137/0x1e0
? reexecute_instruction.part.168+0x130/0x130 [kvm]
nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
vmx_queue_exception+0x197/0x300 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
? preempt_count_sub+0x18/0xc0
? restart_apic_timer+0x17d/0x300 [kvm]
? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
The flag "nested_run_pending", which can override the decision of which should run
next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
is especially intended to avoid switching to L1 in the injection decision-point.
This can be handled just like the other cases in vmx_check_nested_events, instead of
having a special case in vmx_queue_exception.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_complete_interrupts() assumes that the exception is always injected,
so it can be dropped by kvm_clear_exception_queue(). However,
an exception cannot be injected immediately if it is: 1) originally
destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
This patch applies to exceptions the same algorithm that is used for
NMIs, replacing exception.reinject with "exception.injected" (equivalent
to nmi_injected).
exception.pending now represents an exception that is queued and whose
side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
If exception.pending is true, the exception might result in a nested
vmexit instead, too (in which case the side effects must not be applied).
exception.injected instead represents an exception that is going to be
injected into the guest at the next vmentry.
Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_event_needs_reinjection() encapsulation.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
update_permission_bitmask currently does a 128-iteration loop to,
essentially, compute a constant array. Computing the 8 bits in parallel
reduces it to 16 iterations, and is enough to speed it up substantially
because many boolean operations in the inner loop become constants or
simplify noticeably.
Because update_permission_bitmask is actually the top item in the profile
for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
clock cycles, or up to 30%:
before after
cpuid 35173 25954
vmcall 35122 27079
inl_from_pmtimer 52635 42675
inl_from_qemu 53604 44599
inl_from_kernel 38498 30798
outl_to_kernel 34508 28816
wr_tsc_adjust_msr 34185 26818
rd_tsc_adjust_msr 37409 27049
mmio-no-eventfd:pci-mem 50563 45276
mmio-wildcard-eventfd:pci-mem 34495 30823
mmio-datamatch-eventfd:pci-mem 35612 31071
portio-no-eventfd:pci-io 44925 40661
portio-wildcard-eventfd:pci-io 29708 27269
portio-datamatch-eventfd:pci-io 31135 27164
(I wrote a small C program to compare the tables for all values of CR0.WP,
CR4.SMAP and CR4.SMEP, and they match).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch exposes 5 level page table feature to the VM.
At the same time, the canonical virtual address checking is
extended to support both 48-bits and 57-bits address width.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extends the shadow paging code, so that 5 level shadow page
table can be constructed if VM is running in 5 level paging
mode.
Also extends the ept code, so that 5 level ept table can be
constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
shadow logic, KVM should still use 4 level ept table for a VM
whose physical address width is less than 48 bits, even when
the VM is running in 5 level paging mode.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
[Unconditionally reset the MMU context in kvm_cpuid_update.
Changing MAXPHYADDR invalidates the reserved bit bitmasks.
- Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now we have 4 level page table and 5 level page table in 64 bits
long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
helpful to make the code more clear.
Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
redefine it to 5 whenever a replacement is needed for 5 level
paging.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
reserved bits in CR3. Yet the length of reserved bits in
guest CR3 should be based on the physical address width
exposed to the VM. This patch changes CR3 check logic to
calculate the reserved bits at runtime.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return false in kvm_cpuid() when it fails to find the cpuid
entry. Also, this routine(and its caller) is optimized with
a new argument - check_limit, so that the check_cpuid_limit()
fall back can be avoided.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A guest may not be configured to support XSAVES/XRSTORS, even when the host
does. If the guest does not support XSAVES/XRSTORS, clear the secondary
execution control so that the processor will raise #UD.
Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
the VMCS02.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A guest may not be configured to support RDSEED, even when the host
does. If the guest does not support RDSEED, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A guest may not be configured to support RDRAND, even when the host
does. If the guest does not support RDRAND, intercept the instruction
and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting
in the IA32_VMX_PROCBASED_CTLS2 MSR.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, secondary execution controls are divided in three groups:
- static, depending mostly on the module arguments or the processor
(vmx_secondary_exec_control)
- static, depending on CPUID (vmx_cpuid_update)
- dynamic, depending on nested VMX or local APIC state
Because walking CPUID is expensive, prepare_vmcs02 is using only
the first group. This however is unnecessarily complicated. Just
cache the static secondary execution controls, and then prepare_vmcs02
does not need to compute them every time. Computation of all static
secondary execution controls is now kept in a single function,
vmx_compute_secondary_exec_control.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enable the Virtual GIF feature. This is done by setting bit 25 at position
60h in the vmcb.
With this feature enabled, the processor uses bit 9 at position 60h as the
virtual GIF when executing STGI/CLGI instructions.
Since the execution of STGI by the L1 hypervisor does not cause a return to
the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window
are modified.
The IRQ window will be opened even if GIF is not set, under the assumption
that on resuming the L1 hypervisor the IRQ will be held pending until the
processor executes the STGI instruction.
For the NMI window, the STGI intercept is set. This will assist in opening
the window only when GIF=1.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a new cpufeature definition for Virtual GIF.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We already always set that type but don't check if it is supported. Also
for nVMX, we only support WB for now. Let's just require it.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Don't use shifts, tag them correctly as EPTP and use better matching
names (PWL vs. GAW).
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
There is currently some confusion between nested and L1 GPAs. The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough. What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache. This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).
The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Changed comment to say "GPAs" instead of "L1's physical addresses", as
per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.
Currently, emulator_read_write_onepage() makes use of gpa_available flag
to avoid a guest page walk for a known MMIO regions. Lets not limit
the gpa_available optimization to just MMIO region. The patch extends
the check to avoid page walk whenever gpa_available flag is set.
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
new code. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Moved "ret < 0" to the else brach, as per David's review. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Calling handle_mmio_page_fault() has been unnecessary since commit
e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to
kvm_mmu_page_fault()", 2016-02-22).
handle_mmio_page_fault() can now be made static.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow
local APIC state transition constraints, but the value written must be
valid.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bailing out immediately if there is no available mmu page to alloc.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
irq event stamp: 20532
hardirqs last enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
softirqs last enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
CPU: 5 PID: 3089 Comm: warn_test Tainted: G OE 4.13.0-rc3+ #8
RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
Call Trace:
make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
kvm_mmu_load+0x1cf/0x410 [kvm]
kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
This can be reproduced readily by ept=N and running syzkaller tests since
many syzkaller testcases don't setup any memory regions. However, if ept=Y
rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
which just hide the issue.
I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
softlockup.
This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
according to whether or not there is mmu page zapped.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let's reuse the function introduced with eptp switching.
We don't explicitly have to check against enable_ept_ad_bits, as this
is implicitly done when checking against nested_vmx_ept_caps in
valid_ept_address().
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the same as commit 147277540bbc ("kvm: svm: Add support for
additional SVM NPF error codes", 2016-11-23), but for Intel processors.
In this case, the exit qualification field's bit 8 says whether the
EPT violation occurred while translating the guest's final physical
address or rather while translating the guest page tables.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error
codes", 2016-11-23) added a new error code to aid nested page fault
handling. The commit unprotects (kvm_mmu_unprotect_page) the page when
we get a NPF due to guest page table walk where the page was marked RO.
However, if an L0->L2 shadow nested page table can also be marked read-only
when a page is read only in L1's nested page table. If such a page
is accessed by L2 while walking page tables it can cause a nested
page fault (page table walks are write accesses). However, after
kvm_mmu_unprotect_page we may get another page fault, and again in an
endless stream.
To cover this use case, we qualify the new error_code check with
vcpu->arch.mmu_direct_map so that the error_code check would run on L1
guest, and not the L2 guest. This avoids hitting the above scenario.
Fixes: 147277540bbc54119172481c8ef6d930cc9fbfc2
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
The kvm-intel.unrestricted_guest=0
WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
Call Trace:
? put_pid+0x3a/0x50
? rcu_read_lock_sched_held+0x79/0x80
? kmem_cache_free+0x2f2/0x350
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
The syszkaller folks reported a residual mmio emulation request to userspace
due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
several threads to launch the same vCPU, the thread which lauch this vCPU after
the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
trigger the warning.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/kvm.h>
#include <stdio.h>
int kvmcpu;
struct kvm_run *run;
void* thr(void* arg)
{
int res;
res = ioctl(kvmcpu, KVM_RUN, 0);
printf("ret1=%d exit_reason=%d suberror=%d\n",
res, run->exit_reason, run->internal.suberror);
return 0;
}
void test()
{
int i, kvm, kvmvm;
pthread_t th[4];
kvm = open("/dev/kvm", O_RDWR);
kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
srand(getpid());
for (i = 0; i < 4; i++) {
pthread_create(&th[i], 0, thr, 0);
usleep(rand() % 10000);
}
for (i = 0; i < 4; i++)
pthread_join(th[i], 0);
}
int main()
{
for (;;) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
test();
exit(0);
}
int status;
while (waitpid(pid, &status, __WALL) != pid) {}
}
return 0;
}
This patch fixes it by resetting the vcpu->mmio_needed once we receive
the triple fault to avoid the residue.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This implements the kvm_arch_vcpu_in_kernel() for ARM, and adjusts
the calls to kvm_vcpu_on_spin().
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This implements kvm_arch_vcpu_in_kernel() for s390. DIAG is a privileged
operation, so it cannot be called from problem state (user mode).
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
get_cpl requires vcpu_load, so we must cache the result (whether the
vcpu was preempted when its cpl=0) in kvm_vcpu_arch.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a vcpu exits due to request a user mode spinlock, then
the spinlock-holder may be preempted in user mode or kernel mode.
(Note that not all architectures trap spin loops in user mode,
only AMD x86 and ARM/ARM64 currently do).
But if a vcpu exits in kernel mode, then the holder must be
preempted in kernel mode, so we should choose a vcpu in kernel mode
as a more likely candidate for the lock holder.
This introduces kvm_arch_vcpu_in_kernel() to decide whether the
vcpu is in kernel-mode when it's preempted. kvm_vcpu_on_spin's
new argument says the same of the spinning VCPU.
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry().
Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has().
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid,
X86_FEATURE_XYZ), which gets rid of many very similar helpers.
When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but
this information isn't in common code, so we recreate it for KVM.
Add some BUILD_BUG_ONs to make sure that it runs nicely.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
bit(X86_FEATURE_NRIPS) is 3 since 2ccd71f1b278 ("x86/cpufeature: Move
some of the scattered feature bits to x86_capability"), so we can
simplify the code.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 uses vmfunc, L0 utilizes the associated vmexit to
emulate a switching of the ept pointer by reloading the
guest MMU.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enable VMFUNC in the secondary execution controls. This simplifies the
changes necessary to expose it to nested hypervisors. VMFUNCs still
cause #UD when invoked.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Let's also just use the underlying functions directly here.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
[Rebased on top of 9f744c597460 ("KVM: nVMX: do not pin the VMCS12")]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
nested_get_page() just sounds confusing. All we want is a page from G1.
This is even unrelated to nested.
Let's introduce kvm_vcpu_gpa_to_page() so we don't get too lengthy
lines.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Squash pasto fix from Wanpeng Li. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the "Enable INVPCID" secondary execution control to the guest
and properly reflect the exit reason.
In addition, before this patch the guest was always running with
INVPCID enabled, causing pcid.flat's "Test on INVPCID when disabled"
test to fail.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It has been experimentally confirmed that supporting these two MSRs is one
of the necessary conditions for nested Hyper-V to use the TSC page. Modern
Windows guests are noticeably slower when they fall back to reading
timestamps from the HV_X64_MSR_TIME_REF_COUNT MSR instead of using the TSC
page.
The newly supported MSRs are advertised with the AccessFrequencyRegs
partition privilege flag and CPUID.40000003H:EDX[8] "Support for
determining timer frequencies is available" (both outside of the scope of
this KVM patch).
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull MIPS fixes from Ralf Baechle:
"This fixes two build issues for ralink platforms, both due to missing
#includes which used to be included indirectly via other headers"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: ralink: mt7620: Add missing header
MIPS: ralink: Fix build error due to missing header
ARM:
- Yet another race with VM destruction plugged
- A set of small vgic fixes
x86:
- Preserve pending INIT
- RCU fixes in paravirtual async pf, VM teardown, and VMXOFF emulation
- nVMX interrupt injection and dirty tracking fixes
- initialize to make UBSAN happy
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJZhNZ2AAoJEED/6hsPKofoKDEH/iIw1pcgdEW2NP/kFtKXSMCK
josdFwGPQMjBGzx6No4tfMCNDOjW2FKYXapN6CASAqMJo5H2krj8VHMVwm0h3lUl
4RdbbkFTdfl/Znp8M39efFheWrjX+L37AltKV7xAgA7n8cO39KV4RReimzSc7aVq
5dDt4k0dbF9/zXHxkGiKEhwaSSbZEEznQQ/09annSoOVe6om5esUrUtnUF5P99uz
IhAsmJbZxE5VmowjT5MjaR1mXSLLNL55HWKvkf3B3ZGnyxQU+3Vz7IGf2Ma2j+jV
IrdXA11NHDY1anDYgDhFlr3rTCPu9CBmTv4O8zsDRlX9TGpr8bBX2dvjRKl7uOo=
=KM80
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"ARM:
- Yet another race with VM destruction plugged
- A set of small vgic fixes
x86:
- Preserve pending INIT
- RCU fixes in paravirtual async pf, VM teardown, and VMXOFF
emulation
- nVMX interrupt injection and dirty tracking fixes
- initialize to make UBSAN happy"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchg
KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit"
KVM: nVMX: mark vmcs12 pages dirty on L2 exit
kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown
KVM: nVMX: do not pin the VMCS12
KVM: avoid using rcu_dereference_protected
KVM: X86: init irq->level in kvm_pv_kick_cpu_op
KVM: X86: Fix loss of pending INIT due to race
KVM: async_pf: make rcu irq exit if not triggered from idle task
KVM: nVMX: fixes to nested virt interrupt injection
KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12
KVM: arm/arm64: Handle hva aging while destroying the vm
KVM: arm/arm64: PMU: Fix overflow interrupt injection
KVM: arm/arm64: Fix bug in advertising KVM_CAP_MSI_DEVID capability
Pull x86 fix from Thomas Gleixner:
"The recent irq core changes unearthed API abuse in the HPET code,
which manifested itself in a suspend/resume regression.
The fix replaces the cruft with the proper function calls and cures
the regression"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hpet: Cure interface abuse in the resume path
This comes a bit later than I planned, and as a consequence is a larger
than it should be.
Most of the changes are devicetree fixes, across lots of platforms:
Renesas, Samsung Exynos, Marvell EBU, TI OMAP, Rockchips, Amlogic Meson,
Sigma Desings Tango, Allwinner SUNxi and TI Davinci.
Also across many platforms, I applied an older series of simple randconfig
build fixes. This includes making the CONFIG_MTD_XIP option compile again,
which had been broken for many years and probably has not been missed, but
it felt wrong to just remove it completely.
The only other changes are:
- We enable HWSPINLOCK in defconfig to get some Qualcomm boards
to work out of the box.
- A few regression fixes for Texas Instruments OMAP2+.
- A boot regression fix for the Renesas regulator quirk.
- A suspend/resume fix for Uniphier SoCs, fixing the resume of the
system bus.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIVAwUAWYTonGCrR//JCVInAQJTrw/+Kv+Y1EybtFFJuMgVqH0FserTKAonNv25
OJxby86GQQTKkju9U7jYM4MmpufBEMKnY9kn83rviUKfiuVDsJcRWp2jMynJtE5m
W1aIFCu1L8TsKAQJpmBPe8G/cSG7SRH2OoX9Ee5ozii0cBUTOCy6UGDYJqsY0MKy
8KPUWcHqYhLCT+0F0raf1+F5LqJlDh44q1+UVDPphqO4pbguio7PfXA+ut8VMHS7
YSDgiXb3UoXy8tSXRX4JcYJfFG+jIqg0vWpkOW9mlLc5e/+Ndmj5CWlDZPt4hWaR
3Us5VVkI8SXq7VGcJV/mA6JO7M4L4oP1caTATxzDdK1C2u/SuhdnMcenOy8LtRLi
gImdeZLK7+RFPS+Sp/mhQQ0UtULw0kQ5BEzpN0jVI1CKybK4TN2+GrJdZLR9Usas
TkavAUl0X5E7YXws7yb8qTJqDLXxTxyN8H4gMaTh+rCssh9+I8bvChoJ0gpGRIMl
iAHodqGFBvlVQWtUefc4BfDCqj7sVmtfBcAeOK7LVQPjb9D+hPvHJ50yfN3qMpMa
eVvEz4l91W4k/vxI5HsA/r7RybDujTLvgR/BGU0kLZAsEP/FpkrIKgmi5yuzyf1u
Sd3OiK8V7FXB3dBPf0i0wagaBjuEUcEuE0GQU+bM7qX5aW8flLxbEhj4SEv0NqvL
bHiaTEi9uAU=
=a2dO
-----END PGP SIGNATURE-----
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"This comes a bit later than I planned, and as a consequence is a
larger than it should be.
Most of the changes are devicetree fixes, across lots of platforms:
Renesas, Samsung Exynos, Marvell EBU, TI OMAP, Rockchips, Amlogic
Meson, Sigma Desings Tango, Allwinner SUNxi and TI Davinci.
Also across many platforms, I applied an older series of simple
randconfig build fixes. This includes making the CONFIG_MTD_XIP option
compile again, which had been broken for many years and probably has
not been missed, but it felt wrong to just remove it completely.
The only other changes are:
- We enable HWSPINLOCK in defconfig to get some Qualcomm boards to
work out of the box.
- A few regression fixes for Texas Instruments OMAP2+.
- A boot regression fix for the Renesas regulator quirk.
- A suspend/resume fix for Uniphier SoCs, fixing the resume of the
system bus"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (43 commits)
ARM: dts: tango4: Request RGMII RX and TX clock delays
bus: uniphier-system-bus: set up registers when resuming
ARM64: dts: marvell: armada-37xx: Fix the number of GPIO on south bridge
ARM: shmobile: rcar-gen2: Fix deadlock in regulator quirk
arm64: defconfig: enable missing HWSPINLOCK
ARM: pxa: select both FB and FB_W100 for eseries
ARM: ixp4xx: fix ioport_unmap definition
ARM: ep93xx: use ARM_PATCH_PHYS_VIRT correctly
ARM: mmp: mark usb_dma_mask as __maybe_unused
ARM: omap2: mark unused functions as __maybe_unused
ARM: omap1: avoid unused variable warning
ARM: sirf: mark sirfsoc_init_late as __maybe_unused
ARM: ixp4xx: use normal prototype for {read,write}s{b,w,l}
ARM: omap1/ams-delta: warn about failed regulator enable
ARM: rpc: rename RAM_SIZE macro
ARM: w90x900: normalize clk API
ARM: ep93xx: normalize clk API
ARM: dts: sun8i: a83t: Switch to CCU device tree binding macros
arm64: allwinner: sun50i-a64: Correct emac register size
ARM: dts: sunxi: h3/h5: Correct emac register size
...
- Report correct timer frequency to userspace when trapping CNTFRQ_EL0
- Fix race with hardware page table updates when updating access flags
- Silence clang overflow warning in VA_START and PAGE_OFFSET calculations
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJZhIc7AAoJELescNyEwWM0EaUH/0f1/MnKwZHAUoSOWrZynKna
0y59hNvlr47FCGxaMVSVWq/+mtSqxsQTPyLkLSULh8PR7OEwjq5qLxLMhFWF/7B4
/nmAGhFOYHDR95/0Mo0cMOMyamIvGPIqd304yld0P13S5z26UdRCoTrWgeKFJ2sm
wcKCJeAK+U4NZbASNthDKH7Mo0lM25oLIJ8gqj4o0NrxyZr86mXVTypV/nM82qdm
WRMFqGh5iRrmnGSCajgcTjqnYkFOYwHirKvdrV68+GFOYqVIS+9+RdwSnRNOZ9hl
MsIOUDRn3doHDe4tK9NcQlLoaRIqA9fM4+94XGJu+8uUNfDRYMueO6XUvGmnmOQ=
=FER5
-----END PGP SIGNATURE-----
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"Here are some more arm64 fixes for 4.13. The main one is the PTE race
with the hardware walker, but there are a couple of other things too.
- Report correct timer frequency to userspace when trapping
CNTFRQ_EL0
- Fix race with hardware page table updates when updating access
flags
- Silence clang overflow warning in VA_START and PAGE_OFFSET
calculations"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: avoid overflow in VA_START and PAGE_OFFSET
arm64: Fix potential race with hardware DBM in ptep_set_access_flags()
arm64: Use arch_timer_get_rate when trapping CNTFRQ_EL0