Commit babe3939 authored by Linus Torvalds

Merge tag 'docs-6.7' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "The number of commits for documentation is not huge this time around,
  but there are some significant changes nonetheless:

   - Some more Spanish-language and Chinese translations

   - The much-discussed documentation of the confidential-computing
     threat model

   - Powerpc and RISCV documentation move under Documentation/arch -
     these complete this particular bit of documentation churn

   - A large traditional-Chinese documentation update

   - A new document on backporting and conflict resolution

   - Some kernel-doc and Sphinx fixes

  Plus the usual smattering of smaller updates and typo fixes"

* tag 'docs-6.7' of git://git.lwn.net/linux: (40 commits)
  scripts/kernel-doc: Fix the regex for matching -Werror flag
  docs: backporting: address feedback
  Documentation: driver-api: pps: Update PPS generator documentation
  speakup: Document USB support
  doc: blk-ioprio: Bring the doc in line with the implementation
  docs: usb: fix reference to nonexistent file in UVC Gadget
  docs: doc-guide: mention 'make refcheckdocs'
  Documentation: fix typo in dynamic-debug howto
  scripts/kernel-doc: match -Werror flag strictly
  Documentation/sphinx: Remove the repeated word "the" in comments.
  docs: sparse: add SPDX-License-Identifier
  docs/zh_CN: Add subsystem-apis Chinese translation
  docs/zh_TW: update contents for zh_TW
  docs: submitting-patches: encourage direct notifications to commenters
  docs: add backporting and conflict resolution document
  docs: move riscv under arch
  docs: update link to powerpc/vmemmap_dedup.rst
  mm/memory-hotplug: fix typo in documentation
  docs: move powerpc under arch
  PCI: Update the devres documentation regarding to pcim_*()
  ...
parents 7dc0e9c7 cf63348b
......@@ -8,7 +8,7 @@ Description:
more bits set in the dimm-health-bitmap retrieved in
response to H_SCM_HEALTH hcall. The details of the bit
flags returned in response to this hcall are available
at 'Documentation/powerpc/papr_hcalls.rst' . Below are
at 'Documentation/arch/powerpc/papr_hcalls.rst' . Below are
the flags reported in this sysfs file:
* "not_armed"
......
......@@ -364,7 +364,7 @@ Note, however, not all failures are truly "permanent". Some are
caused by over-heating, some by a poorly seated card. Many
PCI error events are caused by software bugs, e.g. DMAs to
wild addresses or bogus split transactions due to programming
errors. See the discussion in Documentation/powerpc/eeh-pci-error-recovery.rst
errors. See the discussion in Documentation/arch/powerpc/eeh-pci-error-recovery.rst
for additional detail on real-life experience of the causes of
software errors.
......@@ -404,7 +404,7 @@ That is, the recovery API only requires that:
.. note::
Implementation details for the powerpc platform are discussed in
the file Documentation/powerpc/eeh-pci-error-recovery.rst
the file Documentation/arch/powerpc/eeh-pci-error-recovery.rst
As of this writing, there is a growing list of device drivers with
patches implementing error recovery. Not all of these patches are in
......
......@@ -2030,7 +2030,7 @@ IO Priority
~~~~~~~~~~~
A single attribute controls the behavior of the I/O priority cgroup policy,
namely the blkio.prio.class attribute. The following values are accepted for
namely the io.prio.class attribute. The following values are accepted for
that attribute:
no-change
......@@ -2059,9 +2059,11 @@ The following numerical values are associated with the I/O priority policies:
+----------------+---+
| no-change | 0 |
+----------------+---+
| rt-to-be | 2 |
| promote-to-rt | 1 |
+----------------+---+
| all-to-idle | 3 |
| restrict-to-be | 2 |
+----------------+---+
| idle | 3 |
+----------------+---+
The numerical value that corresponds to each I/O priority class is as follows:
......@@ -2081,7 +2083,7 @@ The algorithm to set the I/O priority class for a request is as follows:
- If I/O priority class policy is promote-to-rt, change the request I/O
priority class to IOPRIO_CLASS_RT and change the request I/O priority
level to 4.
- If I/O priorityt class is not promote-to-rt, translate the I/O priority
- If I/O priority class policy is not promote-to-rt, translate the I/O priority
class policy into a number, then change the request I/O priority class
into the maximum of the I/O priority class policy number and the numerical
I/O priority class.
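
The algorithm above can be sketched in a few lines of Python. This is only an illustration that follows the text literally, not the kernel implementation: ``apply_policy`` is a hypothetical helper, the policy numbers come from the table above, and the class numbers are assumed to match the ``IOPRIO_CLASS_*`` constants in ``include/uapi/linux/ioprio.h``.

```python
# Numerical values of the I/O priority classes (assumed to match
# IOPRIO_CLASS_NONE/RT/BE/IDLE in include/uapi/linux/ioprio.h).
IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE = 0, 1, 2, 3

# Numerical values of the I/O priority policies, from the table above.
POLICY_VALUE = {
    "no-change": 0,
    "promote-to-rt": 1,
    "restrict-to-be": 2,
    "idle": 3,
}


def apply_policy(policy, req_class, req_level):
    """Return the (class, level) a request ends up with.

    Hypothetical helper that mirrors the two bullet points above.
    """
    if policy == "promote-to-rt":
        # Promote the request to the RT class and set its level to 4.
        return IOPRIO_CLASS_RT, 4
    # Otherwise take the maximum of the policy number and the request class;
    # the priority level is left untouched.
    return max(POLICY_VALUE[policy], req_class), req_level
```

For instance, a request in the RT class under the ``restrict-to-be`` policy ends up in the BE class, because ``max(2, 1)`` is 2, while ``no-change`` never alters the class because its number is 0.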
......
......@@ -259,7 +259,7 @@ Debug Messages at Module Initialization Time
When ``modprobe foo`` is called, modprobe scans ``/proc/cmdline`` for
``foo.params``, strips ``foo.``, and passes them to the kernel along with
params given in modprobe args or ``/etc/modprob.d/*.conf`` files,
params given in modprobe args or ``/etc/modprobe.d/*.conf`` files,
in the following order:
1. parameters given via ``/etc/modprobe.d/*.conf``::
......
......@@ -15,7 +15,7 @@ between architectures is in drivers/firmware/efi/libstub.
For arm64, there is no compressed kernel support, so the Image itself
masquerades as a PE/COFF image and the EFI stub is linked into the
kernel. The arm64 EFI stub lives in arch/arm64/kernel/efi-entry.S
kernel. The arm64 EFI stub lives in drivers/firmware/efi/libstub/arm64.c
and drivers/firmware/efi/libstub/arm64-stub.c.
By using the EFI boot stub it's possible to boot a Linux kernel
......
......@@ -102,9 +102,19 @@ The possible values in this file are:
* - 'Vulnerable'
- The processor is vulnerable, but no mitigation enabled
* - 'Vulnerable: Clear CPU buffers attempted, no microcode'
- The processor is vulnerable but microcode is not updated.
The mitigation is enabled on a best effort basis. See :ref:`vmwerv`
- The processor is vulnerable but microcode is not updated. The
mitigation is enabled on a best effort basis.
If the processor is vulnerable but the availability of the microcode
based mitigation mechanism is not advertised via CPUID, the kernel
selects a best effort mitigation mode. This mode invokes the mitigation
instructions without a guarantee that they clear the CPU buffers.
This is done to address virtualization scenarios where the host has the
microcode update applied, but the hypervisor is not yet updated to
expose the CPUID to the guest. If the host has updated microcode the
protection takes effect; otherwise a few CPU cycles are wasted
pointlessly.
* - 'Mitigation: Clear CPU buffers'
- The processor is vulnerable and the CPU buffer clearing mitigation is
enabled.
......@@ -119,24 +129,6 @@ to the above information:
'SMT Host state unknown' Kernel runs in a VM, Host SMT state unknown
======================== ============================================
.. _vmwerv:
Best effort mitigation mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the processor is vulnerable, but the availability of the microcode based
mitigation mechanism is not advertised via CPUID the kernel selects a best
effort mitigation mode. This mode invokes the mitigation instructions
without a guarantee that they clear the CPU buffers.
This is done to address virtualization scenarios where the host has the
microcode update applied, but the hypervisor is not yet updated to expose
the CPUID to the guest. If the host has updated microcode the protection
takes effect otherwise a few cpu cycles are wasted pointlessly.
The state in the mds sysfs file reflects this situation accordingly.
Mitigation mechanism
-------------------------
......
......@@ -225,8 +225,19 @@ The possible values in this file are:
* - 'Vulnerable'
- The processor is vulnerable, but no mitigation enabled
* - 'Vulnerable: Clear CPU buffers attempted, no microcode'
- The processor is vulnerable, but microcode is not updated. The
- The processor is vulnerable but microcode is not updated. The
mitigation is enabled on a best effort basis.
If the processor is vulnerable but the availability of the microcode
based mitigation mechanism is not advertised via CPUID, the kernel
selects a best effort mitigation mode. This mode invokes the mitigation
instructions without a guarantee that they clear the CPU buffers.
This is done to address virtualization scenarios where the host has the
microcode update applied, but the hypervisor is not yet updated to
expose the CPUID to the guest. If the host has updated microcode the
protection takes effect; otherwise a few CPU cycles are wasted
pointlessly.
* - 'Mitigation: Clear CPU buffers'
- The processor is vulnerable and the CPU buffer clearing mitigation is
enabled.
......
......@@ -98,7 +98,19 @@ The possible values in this file are:
* - 'Vulnerable'
- The CPU is affected by this vulnerability and the microcode and kernel mitigation are not applied.
* - 'Vulnerable: Clear CPU buffers attempted, no microcode'
- The system tries to clear the buffers but the microcode might not support the operation.
- The processor is vulnerable but microcode is not updated. The
mitigation is enabled on a best effort basis.
If the processor is vulnerable but the availability of the microcode
based mitigation mechanism is not advertised via CPUID, the kernel
selects a best effort mitigation mode. This mode invokes the mitigation
instructions without a guarantee that they clear the CPU buffers.
This is done to address virtualization scenarios where the host has the
microcode update applied, but the hypervisor is not yet updated to
expose the CPUID to the guest. If the host has updated microcode the
protection takes effect; otherwise a few CPU cycles are wasted
pointlessly.
* - 'Mitigation: Clear CPU buffers'
- The microcode has been updated to clear the buffers. TSX is still enabled.
* - 'Mitigation: TSX disabled'
......@@ -106,25 +118,6 @@ The possible values in this file are:
* - 'Not affected'
- The CPU is not affected by this issue.
.. _ucode_needed:
Best effort mitigation mode
^^^^^^^^^^^^^^^^^^^^^^^^^^^
If the processor is vulnerable, but the availability of the microcode-based
mitigation mechanism is not advertised via CPUID the kernel selects a best
effort mitigation mode. This mode invokes the mitigation instructions
without a guarantee that they clear the CPU buffers.
This is done to address virtualization scenarios where the host has the
microcode update applied, but the hypervisor is not yet updated to expose the
CPUID to the guest. If the host has updated microcode the protection takes
effect; otherwise a few CPU cycles are wasted pointlessly.
The state in the tsx_async_abort sysfs file reflects this situation
accordingly.
Mitigation mechanism
--------------------
......
......@@ -75,7 +75,7 @@ Memory hotunplug consists of two phases:
(1) Offlining memory blocks
(2) Removing the memory from Linux
In the fist phase, memory is "hidden" from the page allocator again, for
In the first phase, memory is "hidden" from the page allocator again, for
example, by migrating busy memory to other memory locations and removing all
relevant free pages from the page allocator. After this phase, the memory is no
longer visible in memory statistics of the system.
......@@ -250,15 +250,15 @@ Observing the State of Memory Blocks
The state (online/offline/going-offline) of a memory block can be observed
either via::
% cat /sys/device/system/memory/memoryXXX/state
% cat /sys/devices/system/memory/memoryXXX/state
Or alternatively (1/0) via::
% cat /sys/device/system/memory/memoryXXX/online
% cat /sys/devices/system/memory/memoryXXX/online
For an online memory block, the managing zone can be observed via::
% cat /sys/device/system/memory/memoryXXX/valid_zones
% cat /sys/devices/system/memory/memoryXXX/valid_zones
Configuring Memory Hot(Un)Plug
==============================
......@@ -326,7 +326,7 @@ however, a memory block might span memory holes. A memory block spanning memory
holes cannot be offlined.
For example, assume 1 GiB memory block size. A device for a memory starting at
0x100000000 is ``/sys/device/system/memory/memory4``::
0x100000000 is ``/sys/devices/system/memory/memory4``::
(0x100000000 / 1GiB = 4)
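
The same arithmetic as a small Python sketch. The helper names are hypothetical, and the 1 GiB block size is the assumption used in the example above; on a real system the block size would have to be looked up rather than assumed.

```python
GIB = 1 << 30  # block size assumed in the example above


def memory_block_id(phys_addr, block_size=GIB):
    """Index of the memory block that contains phys_addr (hypothetical helper)."""
    return phys_addr // block_size


def memory_block_path(phys_addr, block_size=GIB):
    """sysfs directory of the memory block that contains phys_addr."""
    return "/sys/devices/system/memory/memory%d" % memory_block_id(phys_addr, block_size)
```

With these definitions, ``memory_block_path(0x100000000)`` yields the ``memory4`` directory named above.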
......
......@@ -7,7 +7,7 @@ Last modified on Mon Sep 27 14:26:31 2010
Document version 1.3
Copyright (c) 2005 Gene Collins
Copyright (c) 2008 Samuel Thibault
Copyright (c) 2008, 2023 Samuel Thibault
Copyright (c) 2009, 2010 the Speakup Team
Permission is granted to copy, distribute and/or modify this document
......@@ -83,8 +83,7 @@ spkout -- Speak Out
txprt -- Transport
dummy -- Plain text terminal
Note: Speakup does * NOT * support usb connections! Speakup also does *
NOT * support the internal Tripletalk!
Note: Speakup does * NOT * support the internal Tripletalk!
Speakup does support two other synthesizers, but because they work in
conjunction with other software, they must be loaded as modules after
......@@ -94,6 +93,12 @@ These are as follows:
decpc -- DecTalk PC (not available at boot up)
soft -- One of several software synthesizers (not available at boot up)
By default speakup looks for the synthesizer on the ttyS0 serial port. This can
be changed with the device parameter of the modules, for instance for
DoubleTalk LT:
speakup_ltlk.dev=ttyUSB0
See the sections on loading modules and software synthesizers later in
this manual for further details. It should be noted here that the
speakup.synth boot parameter will have no effect if Speakup has been
......
......@@ -42,16 +42,16 @@ pre-allocation or re-sizing of any kernel data structures.
dentry-state
------------
This file shows the values in ``struct dentry_stat``, as defined in
``linux/include/linux/dcache.h``::
This file shows the values in ``struct dentry_stat_t``, as defined in
``fs/dcache.c``::
struct dentry_stat_t dentry_stat {
int nr_dentry;
int nr_unused;
int age_limit; /* age in seconds */
int want_pages; /* pages requested by system */
int nr_negative; /* # of unused negative dentries */
int dummy; /* Reserved for future use */
long nr_dentry;
long nr_unused;
long age_limit; /* age in seconds */
long want_pages; /* pages requested by system */
long nr_negative; /* # of unused negative dentries */
long dummy; /* Reserved for future use */
};
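
As a rough illustration, the six values can be read from user space and labelled with the field names of the structure above. The helper names are hypothetical, and the sketch assumes the file is ``/proc/sys/fs/dentry-state`` containing six whitespace-separated integers in the order shown.

```python
# Field names in the order they appear in the structure above.
FIELDS = ("nr_dentry", "nr_unused", "age_limit",
          "want_pages", "nr_negative", "dummy")


def parse_dentry_state(text):
    """Map the whitespace-separated values to their field names."""
    values = [int(v) for v in text.split()]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d values, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read and parse the sysctl file (path assumed, Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())
```

Keeping the parsing separate from the file access makes it possible to check the parser on any machine, including ones without ``/proc``.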
Dentries are dynamically allocated and deallocated.
......
......@@ -742,8 +742,8 @@ overcommit_memory
This value contains a flag that enables memory overcommitment.
When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.
When this flag is 0, the kernel compares the userspace memory request
size against total memory plus swap and rejects obvious overcommits.
When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.
......
......@@ -18,8 +18,8 @@ implementation.
nios2/index
openrisc/index
parisc/index
../powerpc/index
../riscv/index
powerpc/index
riscv/index
s390/index
sh/index
sparc/index
......
......@@ -32,7 +32,7 @@ Introduction
responsible for the initialization of the adapter, setting up the
special path for user space access, and performing error recovery. It
communicates directly with the Flash Accelerator Functional Unit (AFU)
as described in Documentation/powerpc/cxl.rst.
as described in Documentation/arch/powerpc/cxl.rst.
The cxlflash driver supports two, mutually exclusive, modes of
operation at the device (LUN) level:
......
......@@ -202,7 +202,7 @@ PPC_FEATURE2_VEC_CRYPTO
PPC_FEATURE2_HTM_NOSC
System calls fail if called in a transactional state, see
Documentation/powerpc/syscall64-abi.rst
Documentation/arch/powerpc/syscall64-abi.rst
PPC_FEATURE2_ARCH_3_00
The processor supports the v3.0B / v3.0C userlevel architecture. Processors
......@@ -217,11 +217,11 @@ PPC_FEATURE2_DARN
PPC_FEATURE2_SCV
The scv 0 instruction may be used for system calls, see
Documentation/powerpc/syscall64-abi.rst.
Documentation/arch/powerpc/syscall64-abi.rst.
PPC_FEATURE2_HTM_NO_SUSPEND
A limited Transactional Memory facility that does not support suspend is
available, see Documentation/powerpc/transactional_memory.rst.
available, see Documentation/arch/powerpc/transactional_memory.rst.
PPC_FEATURE2_ARCH_3_1
The processor supports the v3.1 userlevel architecture. Processors
......
......@@ -56,7 +56,7 @@ sent to the software queue.
Then, after the requests are processed by software queues, they will be placed
at the hardware queue, a second stage queue where the hardware has direct access
to process those requests. However, if the hardware does not have enough
resources to accept more requests, blk-mq will places requests on a temporary
resources to accept more requests, blk-mq will place requests on a temporary
queue, to be sent in the future, when the hardware is able.
Software staging queues
......
......@@ -138,6 +138,10 @@ times, but it's highly important. If we can actually eliminate warnings
from the documentation build, then we can start expecting developers to
avoid adding new ones.
In addition to warnings from the regular documentation build, you can also
run ``make refcheckdocs`` to find references to nonexistent documentation
files.
Languishing kerneldoc comments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......
......@@ -322,10 +322,8 @@ IOMAP
devm_platform_ioremap_resource_byname()
devm_platform_get_and_ioremap_resource()
devm_iounmap()
pcim_iomap()
pcim_iomap_regions() : do request_region() and iomap() on multiple BARs
pcim_iomap_table() : array of mapped addresses indexed by BAR
pcim_iounmap()
Note: For the PCI devices the specific pcim_*() functions may be used, see below.
IRQ
devm_free_irq()
......@@ -392,8 +390,16 @@ PCI
devm_pci_alloc_host_bridge() : managed PCI host bridge allocation
devm_pci_remap_cfgspace() : ioremap PCI configuration space
devm_pci_remap_cfg_resource() : ioremap PCI configuration space resource
pcim_enable_device() : after success, all PCI ops become managed
pcim_iomap() : do iomap() on a single BAR
pcim_iomap_regions() : do request_region() and iomap() on multiple BARs
pcim_iomap_regions_request_all() : do request_region() on all and iomap() on multiple BARs
pcim_iomap_table() : array of mapped addresses indexed by BAR
pcim_iounmap() : do iounmap() on a single BAR
pcim_iounmap_regions() : do iounmap() and release_region() on multiple BARs
pcim_pin_device() : keep PCI device enabled after release
pcim_set_mwi() : enable Memory-Write-Invalidate PCI transaction
PHY
devm_usb_get_phy()
......
......@@ -200,11 +200,17 @@ Generators
Sometimes one needs to be able not only to catch PPS signals but to produce
them also. For example, running a distributed simulation, which requires
computers' clock to be synchronized very tightly. One way to do this is to
invent some complicated hardware solutions but it may be neither necessary
nor affordable. The cheap way is to load a PPS generator on one of the
computers (master) and PPS clients on others (slaves), and use very simple
cables to deliver signals using parallel ports, for example.
computers' clock to be synchronized very tightly.
Parallel port generator
------------------------
One way to do this is to invent some complicated hardware solutions but it
may be neither necessary nor affordable. The cheap way is to load a PPS
generator on one of the computers (master) and PPS clients on others
(slaves), and use very simple cables to deliver signals using parallel
ports, for example.
Parallel port cable pinout::
......
......@@ -111,13 +111,13 @@ channel that was exported. The following properties will then be available:
duty_cycle
The active time of the PWM signal (read/write).
Value is in nanoseconds and must be less than the period.
Value is in nanoseconds and must be less than or equal to the period.
polarity
Changes the polarity of the PWM signal (read/write).
Writes to this property only work if the PWM chip supports changing
the polarity. The polarity can only be changed if the PWM is not
enabled. Value is the string "normal" or "inversed".
the polarity.
Value is the string "normal" or "inversed".
enable
Enable/disable the PWM signal (read/write).
......
......@@ -1585,7 +1585,7 @@ The transaction sequence looks like this:
2. The second transaction contains a physical update to the free space btrees
of AG 3 to release the former BMBT block and a second physical update to the
free space btrees of AG 7 to release the unmapped file space.
Observe that the the physical updates are resequenced in the correct order
Observe that the physical updates are resequenced in the correct order
when possible.
Attached to the transaction is an extent free done (EFD) log item.
The EFD contains a pointer to the EFI logged in transaction #1 so that log
......
......@@ -101,7 +101,7 @@ to do something different in the near future.
../doc-guide/maintainer-profile
../nvdimm/maintainer-entry-profile
../riscv/patch-acceptance
../arch/riscv/patch-acceptance
../driver-api/media/maintainer-entry-profile
../driver-api/vfio-pci-device-specific-driver-acceptance
../nvme/feature-and-quirk-policy
......
......@@ -8,8 +8,7 @@ The Linux kernel supports the following overcommit handling modes
Heuristic overcommit handling. Obvious overcommits of address
space are refused. Used for a typical system. It ensures a
seriously wild allocation fails while allowing overcommit to
reduce swap usage. root is allowed to allocate slightly more
memory in this mode. This is the default.
reduce swap usage. This is the default.
1
Always overcommit. Appropriate for some scientific
......
......@@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
virtual memory manager, will need to be written so that it traverses all of the
currently five levels. This style should also be preferred for
architecture-specific code, so as to be robust to future changes.
MMU, TLB, and Page Faults
=========================
The `Memory Management Unit (MMU)` is a hardware component that handles virtual
to physical address translations. It may use relatively small caches in hardware
called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
these translations.
When the CPU accesses a memory location, it provides a virtual address to the
MMU, which checks whether there is an existing translation in the TLB or in the
Page Walk Caches (on architectures that support them). If no translation is
found, the MMU uses page walks to determine the physical address and create
the mapping.
The dirty bit for a page is set (i.e., turned on) when the page is written to.
Each page of memory has associated permission and dirty bits. The latter
indicate that the page has been modified since it was loaded into memory.
If nothing prevents it, eventually the physical memory can be accessed and the
requested operation on the physical frame is performed.
There are several reasons why the MMU can't find certain translations. It could
happen because the CPU is trying to access memory that the current task is not
permitted to, or because the data is not present in physical memory.
When these conditions happen, the MMU triggers page faults, which are types of
exceptions that signal the CPU to pause the current execution and run a special
function to handle the mentioned exceptions.
There are common and expected causes of page faults. These are triggered by
process management optimization techniques called "Lazy Allocation" and
"Copy-on-Write". Page faults may also happen when frames have been swapped out
to persistent storage (swap partition or file) and evicted from their physical
locations.
These techniques improve memory efficiency, reduce latency, and minimize space
occupation. This document won't go deeper into the details of "Lazy Allocation"
and "Copy-on-Write" because these subjects are out of scope as they belong to
Process Address Management.
Swapping differs from the other techniques mentioned above in that it is
undesirable: it is performed as a means to reduce memory usage under heavy
pressure.
Swapping can't work for memory mapped by kernel logical addresses. These are a
subset of the kernel virtual space that directly maps a contiguous range of
physical memory. Given any logical address, its physical address is determined
with simple arithmetic on an offset. Accesses to logical addresses are fast
because they avoid the need for complex page table lookups, at the expense of
frames not being evictable or pageable out.
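
The "simple arithmetic on an offset" can be pictured with the following Python sketch. It is only an illustration: the ``PAGE_OFFSET`` value is an assumed example, since the real constant depends on the architecture and configuration, and the two functions are Python stand-ins rather than the kernel's own helpers.

```python
# Assumed start of the kernel's direct mapping; illustrative value only,
# the real constant is architecture and configuration dependent.
PAGE_OFFSET = 0xC0000000


def virt_to_phys(vaddr):
    """Translate a kernel logical address to a physical address."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("not a kernel logical address")
    # No page table walk is needed: a single subtraction is enough.
    return vaddr - PAGE_OFFSET


def phys_to_virt(paddr):
    """Translate a physical address back to its kernel logical address."""
    return paddr + PAGE_OFFSET
```

Because the mapping is a fixed offset, the two functions are exact inverses of each other over the directly mapped range.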
If the kernel fails to make room for the data that must be present in the
physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
by terminating lower priority processes until pressure reduces under a safe
threshold.
Additionally, page faults may also be caused by code bugs or by maliciously
crafted addresses that the CPU is instructed to access. A thread of a process
could use instructions to address (non-shared) memory which does not belong to
its own address space, or could try to execute an instruction that wants to
write to a read-only location.
If the above-mentioned conditions happen in user-space, the kernel sends a
`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
causes the termination of the thread and of the process it belongs to.
This document is going to simplify and show a high-altitude view of how the
Linux kernel handles these page faults, creates tables and tables' entries,
checks if memory is present and, if not, requests to load data from persistent
storage or from other devices, and updates the MMU and its caches.
The first steps are architecture dependent. Most architectures jump to
`do_page_fault()`, whereas the x86 interrupt handler is defined by the
`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`.
Whatever the route, all architectures end up invoking
`handle_mm_fault()` which, in turn, (likely) ends up calling
`__handle_mm_fault()` to carry out the actual work of allocating the page
tables.
The unfortunate case of not being able to call `__handle_mm_fault()` means
that the virtual address is pointing to areas of physical memory which are not
permitted to be accessed (at least from the current context). This
condition resolves to the kernel sending the above-mentioned SIGSEGV signal
to the process and leads to the consequences already explained.
`__handle_mm_fault()` carries out its work by calling several functions to
find the entry's offsets of the upper layers of the page tables and allocate
the tables that it may need.
The functions that look for the offset have names like `*_offset()`, where the
"*" stands for pgd, p4d, pud, pmd, or pte. The functions that allocate the
corresponding tables, layer by layer, are called `*_alloc()`, following the
same convention of naming them after the corresponding types of tables in the
hierarchy.
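
To make the idea of per-level offsets concrete, the sketch below splits a virtual address into one index per table level. It is a simplified model, not kernel code: it assumes 4 KiB pages and five levels that each consume 9 bits of the address (the layout used by x86_64 with 5-level paging), and ``table_indices`` is a hypothetical helper.

```python
PAGE_SHIFT = 12       # assumed 4 KiB pages
BITS_PER_LEVEL = 9    # assumed 512 entries per table
# Levels from the lowest (closest to the page) to the highest.
LEVELS = ("pte", "pmd", "pud", "p4d", "pgd")


def table_indices(vaddr):
    """Return the index into each table level plus the offset in the page."""
    mask = (1 << BITS_PER_LEVEL) - 1
    indices = {}
    for depth, name in enumerate(LEVELS):
        # Each level looks at its own 9-bit slice of the address.
        shift = PAGE_SHIFT + depth * BITS_PER_LEVEL
        indices[name] = (vaddr >> shift) & mask
    indices["offset"] = vaddr & ((1 << PAGE_SHIFT) - 1)
    return indices
```

Under these assumptions the pmd slice starts at bit 21 and the pud slice at bit 30, so a single entry at those levels spans 2 MiB and 1 GiB of address space respectively.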
The page table walk may end at one of the middle or upper layers (PMD, PUD).
Linux supports larger page sizes than the usual 4KB (i.e., the so-called
`huge pages`). When using these kinds of larger pages, higher-level entries can
map them directly, with no need to use lower-level page entries (PTE). Huge
pages contain large contiguous physical regions that usually span from 2MB to
1GB. They are respectively mapped by the PMD and PUD page entries.
The huge pages bring with them several benefits like reduced TLB pressure,
reduced page table overhead, memory allocation efficiency, and performance
improvement for certain workloads. However, these benefits come with
trade-offs, like wasted memory and allocation challenges.
At the very end of the walk with allocations, if it didn't return errors,
`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()`
performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`.
"read", "cow", "shared" give hints about the reasons and the kind of fault it's
handling.
The actual implementation of the workflow is very complex. Its design allows
Linux to handle page faults in a way that is tailored to the specific
characteristics of each architecture, while still sharing a common overall
structure.
To conclude this high altitude view of how Linux handles page faults, let's
add that the page faults handler can be disabled and enabled respectively with
`pagefault_disable()` and `pagefault_enable()`.
Several code paths make use of the latter two functions because they need to
disable traps into the page fault handler, mostly to prevent deadlocks.
......@@ -211,7 +211,7 @@ the device (altmap).
The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst
The differences with HugeTLB are relatively minor.
......
......@@ -66,12 +66,13 @@ lack of a better place.
:maxdepth: 1
applying-patches
backporting
adding-syscalls
magic-number
volatile-considered-harmful
botching-up-ioctls
clang-format
../riscv/patch-acceptance
../arch/riscv/patch-acceptance
../core-api/unaligned-memory-access
.. only:: subproject and html
......
......@@ -327,6 +327,8 @@ politely and address the problems they have pointed out. When sending a next
version, add a ``patch changelog`` to the cover letter or to individual patches
explaining difference against previous submission (see
:ref:`the_canonical_patch_format`).
Notify people that commented on your patch about new versions by adding them to
the patches CC list.
See Documentation/process/email-clients.rst for recommendations on email
clients and mailing list etiquette.
......@@ -366,10 +368,10 @@ busy people and may not get to your patch right away.
Once upon a time, patches used to disappear into the void without comment,
but the development process works more smoothly than that now. You should
receive comments within a week or so; if that does not happen, make sure
that you have sent your patches to the right place. Wait for a minimum of
one week before resubmitting or pinging reviewers - possibly longer during
busy times like merge windows.
receive comments within a few weeks (typically 2-3); if that does not
happen, make sure that you have sent your patches to the right place.
Wait for a minimum of one week before resubmitting or pinging reviewers
- possibly longer during busy times like merge windows.
It's also ok to resend the patch or the patch series after a couple of
weeks with the word "RESEND" added to the subject line::
......
......@@ -6,6 +6,7 @@ Security Documentation
:maxdepth: 1
credentials
snp-tdx-threat-model
IMA-templates
keys/index
lsm
......
......@@ -93,7 +93,7 @@ def markup_ctype_refs(match):
#
RE_expr = re.compile(r':c:(expr|texpr):`([^\`]+)`')
def markup_c_expr(match):
return '\ ``' + match.group(2) + '``\ '
return '\\ ``' + match.group(2) + '``\\ '
#
# Parse Sphinx 3.x C markups, replacing them by backward-compatible ones
......@@ -151,7 +151,7 @@ class CObject(Base_CObject):
def handle_func_like_macro(self, sig, signode):
u"""Handles signatures of function-like macros.
If the objtype is 'function' and the the signature ``sig`` is a
If the objtype is 'function' and the signature ``sig`` is a
function-like macro, the name of the macro is returned. Otherwise
``False`` is returned. """
......
......@@ -138,7 +138,7 @@ class KernelCmd(Directive):
code_block += "\n " + l
lines = code_block + "\n\n"
line_regex = re.compile("^\.\. LINENO (\S+)\#([0-9]+)$")
line_regex = re.compile(r"^\.\. LINENO (\S+)\#([0-9]+)$")
ln = 0
n = 0
f = fname
......
......@@ -104,7 +104,7 @@ class KernelFeat(Directive):
lines = self.runCmd(cmd, shell=True, cwd=cwd, env=shell_env)
line_regex = re.compile("^\.\. FILE (\S+)$")
line_regex = re.compile(r"^\.\. FILE (\S+)$")
out_lines = ""
......
......@@ -130,7 +130,7 @@ class KernelDocDirective(Directive):
result = ViewList()
lineoffset = 0;
line_regex = re.compile("^\.\. LINENO ([0-9]+)$")
line_regex = re.compile(r"^\.\. LINENO ([0-9]+)$")
for line in lines:
match = line_regex.search(line)
if match:
......@@ -138,7 +138,7 @@ class KernelDocDirective(Directive):
lineoffset = int(match.group(1)) - 1
# we must eat our comments since they upset the markup
else:
doc = env.srcdir + "/" + env.docname + ":" + str(self.lineno)
doc = str(env.srcdir) + "/" + env.docname + ":" + str(self.lineno)
result.append(line, doc + ": " + filename, lineoffset)
lineoffset += 1
......
......@@ -309,7 +309,7 @@ def convert_image(img_node, translator, src_fname=None):
if dst_fname:
# the builder needs not to copy one more time, so pop it if exists.
translator.builder.images.pop(img_node['uri'], None)
_name = dst_fname[len(translator.builder.outdir) + 1:]
_name = dst_fname[len(str(translator.builder.outdir)) + 1:]
if isNewer(dst_fname, src_fname):
kernellog.verbose(app,
......
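The two hunks above wrap ``env.srcdir`` and ``translator.builder.outdir`` in ``str()`` before concatenating or measuring them. The diff does not state the motivation; one plausible reading is that these attributes may be path-like objects rather than plain strings in some Sphinx versions. A sketch under that assumption, using ``pathlib.PurePosixPath`` as a stand-in; ``source_label`` and ``relative_name`` are hypothetical helpers that mirror the patched expressions.

```python
from pathlib import PurePosixPath


def source_label(srcdir, docname, lineno):
    """Mirror the patched expression: coerce srcdir to str before concatenating."""
    return str(srcdir) + "/" + docname + ":" + str(lineno)


def relative_name(dst_fname, outdir):
    """Mirror the patched slice: drop the '<outdir>/' prefix from dst_fname."""
    return dst_fname[len(str(outdir)) + 1:]


if __name__ == "__main__":
    src = PurePosixPath("/src/Documentation")
    out = PurePosixPath("/out/html")
    print(source_label(src, "admin-guide/index", 7))
    print(relative_name("/out/html/_images/a.svg", out))
```

Both helpers give the same result for a ``str`` argument, so the coercion is harmless where the attribute is already a string.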
......@@ -77,7 +77,7 @@ class MaintainersInclude(Include):
line = line.rstrip()
# Linkify all non-wildcard refs to ReST files in Documentation/.
pat = '(Documentation/([^\s\?\*]*)\.rst)'
pat = r'(Documentation/([^\s\?\*]*)\.rst)'
m = re.search(pat, line)
if m:
# maintainers.rst is in a subdirectory, so include "../".
......@@ -90,11 +90,11 @@ class MaintainersInclude(Include):
output = "| %s" % (line.replace("\\", "\\\\"))
# Look for and record field letter to field name mappings:
# R: Designated *reviewer*: FullName <address@domain>
m = re.search("\s(\S):\s", line)
m = re.search(r"\s(\S):\s", line)
if m:
field_letter = m.group(1)
if field_letter and not field_letter in fields:
m = re.search("\*([^\*]+)\*", line)
m = re.search(r"\*([^\*]+)\*", line)
if m:
fields[field_letter] = m.group(1)
elif subsystems:
......@@ -112,7 +112,7 @@ class MaintainersInclude(Include):
field_content = ""
# Collapse whitespace in subsystem name.
heading = re.sub("\s+", " ", line)
heading = re.sub(r"\s+", " ", line)
output = output + "%s\n%s" % (heading, "~" * len(heading))
field_prev = ""
else:
......
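The patterns touched in the ``MaintainersInclude`` hunks above can be exercised on their own. A minimal sketch, assuming only the standard library; the function names and the sample input lines are illustrative, while the three patterns are copied from the patched lines.

```python
import re

# Raw-string patterns copied from the patched lines.
DOC_REF = re.compile(r'(Documentation/([^\s\?\*]*)\.rst)')
FIELD_LETTER = re.compile(r"\s(\S):\s")
EMPHASIS = re.compile(r"\*([^\*]+)\*")


def doc_ref(line):
    """Return the path between 'Documentation/' and '.rst', skipping wildcard refs."""
    m = DOC_REF.search(line)
    return m.group(2) if m else None


def field_mapping(line):
    """Return (letter, name) for a field description line, or None if it is not one."""
    letter = FIELD_LETTER.search(line)
    name = EMPHASIS.search(line)
    if letter and name:
        return letter.group(1), name.group(1)
    return None


def collapse_heading(line):
    """Collapse runs of whitespace in a subsystem name to single spaces."""
    return re.sub(r"\s+", " ", line)


if __name__ == "__main__":
    print(doc_ref("F:\tDocumentation/process/security-bugs.rst"))
    print(field_mapping("\tR: Designated *reviewer*: FullName <address@domain>"))
    print(collapse_heading("SOME   SUBSYSTEM\tNAME"))
```

Because the character class excludes ``*`` and ``?``, a wildcard entry never matches and is left unlinked.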
......@@ -35,6 +35,7 @@ Human interfaces
sound/index
gpu/index
fb/index
leds/index
Networking interfaces
---------------------
......@@ -70,7 +71,6 @@ Storage interfaces
fpga/index
i2c/index
iio/index
leds/index
pcmcia/index
spi/index
w1/index
......
.. include:: ../disclaimer-ita.rst
:Original: :doc:`../../../riscv/patch-acceptance`
:Original: :doc:`../../../arch/riscv/patch-acceptance`
:Translator: Federico Vaga <federico.vaga@vaga.pv.it>
arch/riscv linee guida alla manutenzione per gli sviluppatori
......
......@@ -22,3 +22,5 @@
adding-syscalls
researcher-guidelines
contribution-maturity-model
security-bugs
embargoed-hardware-issues
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-sp.rst
:Original: Documentation/process/security-bugs.rst
:Translator: Avadhut Naik <avadhut.naik@amd.com>
Errores de seguridad
====================
Los desarrolladores del kernel de Linux se toman la seguridad muy en
serio. Como tal, nos gustaría saber cuándo se encuentra un error de
seguridad para que pueda ser corregido y divulgado lo más rápido posible.
Por favor, informe sobre los errores de seguridad al equipo de seguridad
del kernel de Linux.
Contacto
--------
El equipo de seguridad del kernel de Linux puede ser contactado por correo
electrónico en <security@kernel.org>. Esta es una lista privada de
oficiales de seguridad que ayudarán a verificar el informe del error y
desarrollarán y publicarán una corrección. Si ya tiene una corrección, por
favor, inclúyala con su informe, ya que eso puede acelerar considerablemente
el proceso. Es posible que el equipo de seguridad traiga ayuda adicional
de mantenedores del área para comprender y corregir la vulnerabilidad de
seguridad.
Como ocurre con cualquier error, cuanta más información se proporcione,
más fácil será diagnosticarlo y corregirlo. Por favor, revise el
procedimiento descrito en 'Documentation/admin-guide/reporting-issues.rst'
si no tiene claro que información es útil. Cualquier código de explotación
es muy útil y no será divulgado sin el consentimiento del "reportero" (el
que envia el error) a menos que ya se haya hecho público.
Por favor, envíe correos electrónicos en texto plano sin archivos
adjuntos cuando sea posible. Es mucho más difícil tener una discusión
citada en contexto sobre un tema complejo si todos los detalles están
ocultos en archivos adjuntos. Piense en ello como un
:doc:`envío de parche regular <submitting-patches>` (incluso si no tiene
un parche todavía) describa el problema y el impacto, enumere los pasos
de reproducción, y sígalo con una solución propuesta, todo en texto plano.
Divulgación e información embargada
-----------------------------------
La lista de seguridad no es un canal de divulgación. Para eso, ver
Coordinación debajo. Una vez que se ha desarrollado una solución robusta,
comienza el proceso de lanzamiento. Las soluciones para errores conocidos
públicamente se lanzan inmediatamente.
Aunque nuestra preferencia es lanzar soluciones para errores no divulgados
públicamente tan pronto como estén disponibles, esto puede postponerse a
petición del reportero o una parte afectada por hasta 7 días calendario
desde el inicio del proceso de lanzamiento, con una extensión excepcional
a 14 días de calendario si se acuerda que la criticalidad del error requiere
más tiempo. La única razón válida para aplazar la publicación de una
solución es para acomodar la logística de QA y los despliegues a gran
escala que requieren coordinación de lanzamiento.
Si bien la información embargada puede compartirse con personas de
confianza para desarrollar una solución, dicha información no se publicará
junto con la solución o en cualquier otro canal de divulgación sin el
permiso del reportero. Esto incluye, pero no se limita al informe original
del error y las discusiones de seguimiento (si las hay), exploits,
información sobre CVE o la identidad del reportero.
En otras palabras, nuestro único interés es solucionar los errores. Toda
otra información presentada a la lista de seguridad y cualquier discusión
de seguimiento del informe se tratan confidencialmente incluso después de
que se haya levantado el embargo, en perpetuidad.
Coordinación con otros grupos
-----------------------------
El equipo de seguridad del kernel recomienda encarecidamente que los
reporteros de posibles problemas de seguridad NUNCA contacten la lista
de correo “linux-distros” hasta DESPUES de discutirlo con el equipo de
seguridad del kernel. No Cc: ambas listas a la vez. Puede ponerse en
contacto con la lista de correo linux-distros después de que se haya
acordado una solución y comprenda completamente los requisitos que al
hacerlo le impondrá a usted y la comunidad del kernel.
Las diferentes listas tienen diferentes objetivos y las reglas de
linux-distros no contribuyen en realidad a solucionar ningún problema de
seguridad potencial.
Asignación de CVE
-----------------
El equipo de seguridad no asigna CVEs, ni los requerimos para informes o
correcciones, ya que esto puede complicar innecesariamente el proceso y
puede retrasar el manejo de errores. Si un reportero desea que se le
asigne un identificador CVE, debe buscar uno por sí mismo, por ejemplo,
poniéndose en contacto directamente con MITRE. Sin embargo, en ningún
caso se retrasará la inclusión de un parche para esperar a que llegue un
identificador CVE.
Acuerdos de no divulgación
--------------------------
El equipo de seguridad del kernel de Linux no es un organismo formal y,
por lo tanto, no puede firmar cualquier acuerdo de no divulgación.
......@@ -10,7 +10,7 @@
mips/index
arm64/index
../riscv/index
../arch/riscv/index
openrisc/index
parisc/index
loongarch/index
......
.. include:: ../disclaimer-zh_CN.rst
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/riscv/boot-image-header.rst
:Original: Documentation/arch/riscv/boot-image-header.rst
:翻译:
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/riscv/index.rst
:Original: Documentation/arch/riscv/index.rst
:翻译:
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/riscv/patch-acceptance.rst
:Original: Documentation/arch/riscv/patch-acceptance.rst
:翻译:
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
.. include:: ../../disclaimer-zh_CN.rst
:Original: Documentation/riscv/vm-layout.rst
:Original: Documentation/arch/riscv/vm-layout.rst
:翻译:
......
......@@ -52,12 +52,9 @@
core-api/index
driver-api/index
subsystem-apis
内核中的锁 <locking/index>
TODOList:
* subsystem-apis
开发工具和流程
--------------
......
......@@ -89,4 +89,4 @@
../doc-guide/maintainer-profile
../../../nvdimm/maintainer-entry-profile
../../../riscv/patch-acceptance
../../../arch/riscv/patch-acceptance
.. SPDX-License-Identifier: GPL-2.0
.. include:: ./disclaimer-zh_CN.rst
:Original: Documentation/subsystem-apis.rst
:翻译:
唐艺舟 Tang Yizhou <tangyeechou@gmail.com>
==============
内核子系统文档
==============
这些书籍从内核开发者的角度,详细介绍了特定内核子系统
的如何工作。这里的大部分信息直接取自内核源代码,并
根据需要添加了补充材料(或者至少是我们设法添加的 - 可
能 *不是* 所有的材料都有需要)。
核心子系统
----------
.. toctree::
:maxdepth: 1
core-api/index
driver-api/index
mm/index
power/index
scheduler/index
locking/index
TODOList:
* timers/index
人机接口
--------
.. toctree::
:maxdepth: 1
sound/index
TODOList:
* input/index
* hid/index
* gpu/index
* fb/index
网络接口
--------
.. toctree::
:maxdepth: 1
infiniband/index
TODOList:
* networking/index
* netlabel/index
* isdn/index
* mhi/index
存储接口
--------
.. toctree::
:maxdepth: 1
filesystems/index
TODOList:
* block/index
* cdrom/index
* scsi/index
* target/index
**Fixme**: 这里还需要更多的分类组织工作。
.. toctree::
:maxdepth: 1
accounting/index
cpu-freq/index
iio/index
virt/index
PCI/index
peci/index
TODOList:
* fpga/index
* i2c/index
* leds/index
* pcmcia/index
* spi/index
* w1/index
* watchdog/index
* hwmon/index
* accel/index
* security/index
* crypto/index
* bpf/index
* usb/index
* misc-devices/index
* wmi/index
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_TW.rst
:Original: Documentation/admin-guide/bootconfig.rst
:譯者: 吳想成 Wu XiangCheng <bobwxc@email.cn>
========
引導配置
========
:作者: Masami Hiramatsu <mhiramat@kernel.org>
概述
====
引導配置擴展了現有的內核命令行,以一種更有效率的方式在引導內核時進一步支持
鍵值數據。這允許管理員傳遞一份結構化關鍵字的配置文件。
配置文件語法
============
引導配置文件的語法採用非常簡單的鍵值結構。每個關鍵字由點連接的單詞組成,鍵
和值由 ``=`` 連接。值以分號( ``;`` )或換行符( ``\n`` )結尾。數組值中每
個元素由逗號( ``,`` )分隔。::
KEY[.WORD[...]] = VALUE[, VALUE2[...]][;]
與內核命令行語法不同,逗號和 ``=`` 周圍允許有空格。
關鍵字只允許包含字母、數字、連字符( ``-`` )和下劃線( ``_`` )。值可包含
可打印字符和空格,但分號( ``;`` )、換行符( ``\n`` )、逗號( ``,`` )、
井號( ``#`` )和右大括號( ``}`` )等分隔符除外。
如果你需要在值中使用這些分隔符,可以用雙引號( ``"VALUE"`` )或單引號
( ``'VALUE'`` )括起來。注意,引號無法轉義。
鍵的值可以爲空或不存在。這些鍵用於檢查該鍵是否存在(類似布爾值)。
鍵值語法
--------
引導配置文件語法允許用戶通過大括號合併鍵名部分相同的關鍵字。例如::
foo.bar.baz = value1
foo.bar.qux.quux = value2
也可以寫成::
foo.bar {
baz = value1
qux.quux = value2
}
或者更緊湊一些,寫成::
foo.bar { baz = value1; qux.quux = value2 }
在這兩種樣式中,引導解析時相同的關鍵字都會自動合併。因此可以追加類似的樹或
鍵值。
相同關鍵字的值
--------------
禁止兩個或多個值或數組共享同一個關鍵字。例如::
foo = bar, baz
foo = qux # !錯誤! 我們不可以重定義相同的關鍵字
如果你想要更新值,必須顯式使用覆蓋操作符 ``:=`` 。例如::
foo = bar, baz
foo := qux
這樣 ``foo`` 關鍵字的值就變成了 ``qux`` 。這對於通過添加(部分)自定義引導
配置來覆蓋默認值非常有用,免於解析默認引導配置。
如果你想對現有關鍵字追加值作爲數組成員,可以使用 ``+=`` 操作符。例如::
foo = bar, baz
foo += qux
這樣, ``foo`` 關鍵字就同時擁有了 ``bar`` , ``baz`` 和 ``qux`` 。
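The semantics of the three operators described above (``=``, ``:=`` and ``+=``) can be modelled in a few lines. This is a sketch in Python, not the kernel parser: ``apply_assignments`` is a hypothetical helper, keys are treated as flat strings, and the handling of ``+=`` on a key that was never defined is an assumption.

```python
def apply_assignments(assignments):
    """Apply (key, op, values) triples in order and return the key -> values map."""
    config = {}
    for key, op, values in assignments:
        if op == "=":
            # Plain assignment may not redefine an existing key.
            if key in config:
                raise ValueError(f"{key!r} is already defined; use ':=' or '+='")
            config[key] = list(values)
        elif op == ":=":
            # Override: replace whatever was there before.
            config[key] = list(values)
        elif op == "+=":
            # Append: extend the array (assumed to create the key if missing).
            config.setdefault(key, []).extend(values)
        else:
            raise ValueError(f"unknown operator {op!r}")
    return config


if __name__ == "__main__":
    print(apply_assignments([("foo", "=", ["bar", "baz"]), ("foo", ":=", ["qux"])]))
    print(apply_assignments([("foo", "=", ["bar", "baz"]), ("foo", "+=", ["qux"])]))
```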
此外,父關鍵字下可同時存在值和子關鍵字。
例如,下列配置是可行的。::
foo = value1
foo.bar = value2
foo := value3 # 這會更新foo的值。
注意,裸值不能直接放進結構化關鍵字中,必須在大括號外定義它。例如::
foo {
bar = value1
bar {
baz = value2
qux = value3
}
}
同時,關鍵字下值節點的順序是固定的。如果值和子關鍵字同時存在,值永遠是該關
鍵字的第一個子節點。因此如果用戶先指定子關鍵字,如::
foo.bar = value1
foo = value2
則在程序(和/proc/bootconfig)中,它會按如下顯示::
foo = value2
foo.bar = value1
註釋
----
配置語法接受shell腳本風格的註釋。註釋以井號( ``#`` )開始,到換行符
( ``\n`` )結束。
::
# comment line
foo = value # value is set to foo.
bar = 1, # 1st element
2, # 2nd element
3 # 3rd element
會被解析爲::
foo = value
bar = 1, 2, 3
注意你不能把註釋放在值和分隔符( ``,`` 或 ``;`` )之間。如下配置語法是錯誤的::
key = 1 # comment
,2
/proc/bootconfig
================
/proc/bootconfig是引導配置的用戶空間接口。與/proc/cmdline不同,此文件內容以
鍵值列表樣式顯示。
每個鍵值對一行,樣式如下::
KEY[.WORDS...] = "[VALUE]"[,"VALUE2"...]
用引導配置引導內核
==================
用引導配置引導內核有兩種方法:將引導配置附加到initrd鏡像或直接嵌入內核中。
*initrd: initial RAM disk,初始內存磁盤*
將引導配置附加到initrd
----------------------
由於默認情況下引導配置文件是用initrd加載的,因此它將被添加到initrd(initramfs)
鏡像文件的末尾,其中包含填充、大小、校驗值和12字節幻數,如下所示::
[initrd][bootconfig][padding][size(le32)][checksum(le32)][#BOOTCONFIG\n]
大小和校驗值爲小端序存放的32位無符號值。
當引導配置被加到initrd鏡像時,整個文件大小會對齊到4字節。空字符( ``\0`` )
會填補對齊空隙。因此 ``size`` 就是引導配置文件的長度+填充的字節。
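The trailer layout described above can be sketched as follows. This is an illustration, not the ``tools/bootconfig`` implementation: the checksum is assumed to be a 32-bit sum of the payload bytes, and ``append_bootconfig`` and ``extract_bootconfig`` are hypothetical names.

```python
import struct

MAGIC = b"#BOOTCONFIG\n"           # 12-byte magic from the layout above
FOOTER_LEN = 4 + 4 + len(MAGIC)    # size(le32) + checksum(le32) + magic


def byte_sum(data):
    """Assumed checksum: unsigned 32-bit sum of all bytes."""
    return sum(data) & 0xFFFFFFFF


def append_bootconfig(initrd, config):
    """Return initrd + config + NUL padding + footer, total length 4-byte aligned."""
    pad = -(len(initrd) + len(config) + FOOTER_LEN) % 4
    payload = config + b"\0" * pad
    footer = struct.pack("<II", len(payload), byte_sum(payload)) + MAGIC
    return initrd + payload + footer


def extract_bootconfig(image):
    """Return the padded bootconfig payload, or None if no valid footer is found."""
    if len(image) < FOOTER_LEN or not image.endswith(MAGIC):
        return None
    size, csum = struct.unpack("<II", image[-FOOTER_LEN:-len(MAGIC)])
    start = len(image) - FOOTER_LEN - size
    if start < 0:
        return None
    payload = image[start:len(image) - FOOTER_LEN]
    return payload if byte_sum(payload) == csum else None


if __name__ == "__main__":
    image = append_bootconfig(b"x" * 11, b"a = 1\n")
    print(len(image) % 4, extract_bootconfig(image))
```

Since the parser works backwards from the end of the image, a loader that reports a longer size than the real file would make the magic check fail, which matches the caveat in the text.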
Linux內核在內存中解碼initrd鏡像的最後部分以獲取引導配置數據。由於這種“揹負式”
的方法,只要引導加載器傳遞了正確的initrd文件大小,就無需更改或更新引導加載器
和內核鏡像本身。如果引導加載器意外傳遞了更長的大小,內核將無法找到引導配置數
據。
Linux內核在tools/bootconfig下提供了 ``bootconfig`` 命令來完成此操作,管理員
可以用它從initrd鏡像中刪除或追加配置文件。你可以用以下命令來構建它::
# make -C tools/bootconfig
要向initrd鏡像添加你的引導配置文件,請按如下命令操作(舊數據會自動移除)::
# tools/bootconfig/bootconfig -a your-config /boot/initrd.img-X.Y.Z
要從鏡像中移除配置,可以使用-d選項::
# tools/bootconfig/bootconfig -d /boot/initrd.img-X.Y.Z
然後在內核命令行上添加 ``bootconfig`` 告訴內核去initrd文件末尾尋找內核配置。
將引導配置嵌入內核
------------------
如果你不能使用initrd,也可以通過Kconfig選項將引導配置文件嵌入內核中。在此情
況下,你需要用以下選項重新編譯內核::
CONFIG_BOOT_CONFIG_EMBED=y
CONFIG_BOOT_CONFIG_EMBED_FILE="/引導配置/文件/的/路徑"
``CONFIG_BOOT_CONFIG_EMBED_FILE`` 需要從源碼樹或對象樹開始的引導配置文件的
絕對/相對路徑。內核會將其嵌入作爲默認引導配置。
與將引導配置附加到initrd一樣,你也需要在內核命令行上添加 ``bootconfig`` 告訴
內核去啓用內嵌的引導配置。
注意,即使你已經設置了此選項,仍可用附加到initrd的其他引導配置覆蓋內嵌的引導
配置。
通過引導配置傳遞內核參數
========================
除了內核命令行,引導配置也可以用於傳遞內核參數。所有 ``kernel`` 關鍵字下的鍵
值對都將直接傳遞給內核命令行。此外, ``init`` 下的鍵值對將通過命令行傳遞給
init進程。參數按以下順序與用戶給定的內核命令行字符串相連,因此命令行參數可以
覆蓋引導配置參數(這取決於子系統如何處理參數,但通常前面的參數將被後面的參數
覆蓋)::
[bootconfig params][cmdline params] -- [bootconfig init params][cmdline init params]
如果引導配置文件給出的kernel/init參數是::
kernel {
root = 01234567-89ab-cdef-0123-456789abcd
}
init {
splash
}
這將被複制到內核命令行字符串中,如下所示::
root="01234567-89ab-cdef-0123-456789abcd" -- splash
如果用戶給出的其他命令行是::
ro bootconfig -- quiet
則最後的內核命令行如下::
root="01234567-89ab-cdef-0123-456789abcd" ro bootconfig -- splash quiet
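The ordering rule above can be reproduced with a short sketch. ``merge_cmdline`` is a hypothetical helper written for illustration; it splits on whitespace only, so it would not handle quoted values that contain spaces.

```python
def merge_cmdline(bootconfig_kernel, bootconfig_init, user_cmdline):
    """Join bootconfig-derived and user-supplied parameters in the documented order."""
    # Everything after ' -- ' on the user command line goes to init.
    kernel_part, sep, init_part = user_cmdline.partition(" -- ")
    kernel_args = list(bootconfig_kernel) + kernel_part.split()
    init_args = list(bootconfig_init) + (init_part.split() if sep else [])
    line = " ".join(kernel_args)
    if init_args:
        line += " -- " + " ".join(init_args)
    return line


if __name__ == "__main__":
    print(merge_cmdline(
        ['root="01234567-89ab-cdef-0123-456789abcd"'],
        ["splash"],
        "ro bootconfig -- quiet",
    ))
```

Placing the bootconfig parameters first is what lets a later command-line parameter take precedence in subsystems that keep the last value seen.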
配置文件的限制
==============
當前最大的配置大小是32KB,關鍵字總數(不是鍵值條目)必須少於1024個節點。
注意:這不是條目數而是節點數,條目必須消耗超過2個節點(一個關鍵字和一個值)。
所以從理論上講最多512個鍵值對。如果關鍵字平均包含3個單詞,則可有256個鍵值對。
在大多數情況下,配置項的數量將少於100個條目,小於8KB,因此這應該足夠了。如果
節點數超過1024,解析器將返回錯誤,即使文件大小小於32KB。(請注意,此最大尺寸
不包括填充的空字符。)
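The node arithmetic above can be checked directly. A small sketch, assuming the 1024-node limit quoted in the text and that no key words are shared between entries; ``max_entries`` is an illustrative name.

```python
MAX_NODES = 1024  # node limit quoted in the text above


def max_entries(words_per_key, values_per_key=1):
    """Upper bound on key-value entries when entries share no key words."""
    nodes_per_entry = words_per_key + values_per_key
    return MAX_NODES // nodes_per_entry


if __name__ == "__main__":
    print(max_entries(1))  # one-word keys
    print(max_entries(3))  # three-word keys
```

Entries that share a key prefix reuse those nodes, so real configurations can fit more entries than this bound suggests.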
無論如何,因爲 ``bootconfig`` 命令在附加啓動配置到initrd映像時會驗證它,用戶
可以在引導之前注意到它。
引導配置API
===========
用戶可以查詢或遍歷鍵值對,也可以查找(前綴)根關鍵字節點,並在查找該節點下的
鍵值。
如果您有一個關鍵字字符串,則可以直接使用 xbc_find_value() 查詢該鍵的值。如果
你想知道引導配置裏有哪些關鍵字,可以使用 xbc_for_each_key_value() 迭代鍵值對。
請注意,您需要使用 xbc_array_for_each_value() 訪問數組的值,例如::
vnode = NULL;
xbc_find_value("key.word", &vnode);
if (vnode && xbc_node_is_array(vnode))
xbc_array_for_each_value(vnode, value) {
printk("%s ", value);
}
如果您想查找具有前綴字符串的鍵,可以使用 xbc_find_node() 通過前綴字符串查找
節點,然後用 xbc_node_for_each_key_value() 迭代前綴節點下的鍵。
但最典型的用法是獲取前綴下的命名值或前綴下的命名數組,例如::
root = xbc_find_node("key.prefix");
value = xbc_node_find_value(root, "option", &vnode);
...
xbc_node_for_each_array_value(root, "array-option", value, anode) {
...
}
這將訪問值“key.prefix.option”的值和“key.prefix.array-option”的數組。
鎖是不需要的,因爲在初始化之後配置只讀。如果需要修改,必須複製所有數據和關鍵字。
函數與結構體
============
相關定義的kernel-doc參見:
- include/linux/bootconfig.h
- lib/bootconfig.c
......@@ -17,14 +17,14 @@
引言
=====
始終嘗試由來自kernel.org的原始碼構建的最新內核。如果您沒有信心這樣做,請將
始終嘗試由來自kernel.org的源代碼構建的最新內核。如果您沒有信心這樣做,請將
錯誤報告給您的發行版供應商,而不是內核開發人員。
找到缺陷(bug)並不總是那麼容易,不過仍然得去找。如果你找不到它,不要放棄。
儘可能多的向相關維護人員報告您發現的信息。請參閱MAINTAINERS文件以解您所
儘可能多的向相關維護人員報告您發現的信息。請參閱MAINTAINERS文件以解您所
關注的子系統的維護人員。
在提交錯誤報告之前,請閱讀「Documentation/admin-guide/reporting-issues.rst」
在提交錯誤報告之前,請閱讀“Documentation/admin-guide/reporting-issues.rst”
設備未出現(Devices not appearing)
====================================
......@@ -38,7 +38,7 @@
操作步驟:
- 從git原始碼構建內核
- 從git源代碼構建內核
- 以此開始二分 [#f1]_::
$ git bisect start
......@@ -76,7 +76,7 @@
如需進一步參考,請閱讀:
- ``git-bisect`` 的手冊頁
- `Fighting regressions with git bisect(用git bisect解決歸)
- `Fighting regressions with git bisect(用git bisect解決歸)
<https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_
- `Fully automated bisecting with "git bisect run"(使用git bisect run
來全自動二分) <https://lwn.net/Articles/317154>`_
......
......@@ -48,8 +48,8 @@
[<c1549f43>] ? sysenter_past_esp+0x40/0x6a
---[ end trace 6ebc60ef3981792f ]---
這樣的堆棧跟蹤提供了足夠的信息來識別內核原始碼中發生錯誤的那一行。根據問題的
嚴重性,它還可能包含 **「Oops」** 一詞,比如::
這樣的堆棧跟蹤提供了足夠的信息來識別內核源代碼中發生錯誤的那一行。根據問題的
嚴重性,它還可能包含 **“Oops”** 一詞,比如::
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<c06969d4>] iret_exc+0x7d0/0xa59
......@@ -58,17 +58,17 @@
...
儘管有 **Oops** 或其他類型的堆棧跟蹤,但通常需要找到出問題的行來識別和處理缺
陷。在本章中,我們將參考「Oops」來了解需要分析的各種堆棧跟蹤。
陷。在本章中,我們將參考“Oops”來了解需要分析的各種堆棧跟蹤。
如果內核是用 ``CONFIG_DEBUG_INFO`` 編譯的,那麼可以使用文件:
`scripts/decode_stacktrace.sh` 。
連結的模塊
鏈接的模塊
-----------
受到汙染或正在加載/卸載的模塊用「(…)」標記,汙染標誌在
`Documentation/admin-guide/tainted-kernels.rst` 文件中進行了描述,正在被加
」用「+」標註,「正在被卸載」用「-」標註。
受到污染或正在加載/卸載的模塊用“(…)”標記,污染標誌在
`Documentation/admin-guide/tainted-kernels.rst` 文件中進行了描述,正在被加
”用“+”標註,“正在被卸載”用“-”標註。
Oops消息在哪?
......@@ -81,19 +81,19 @@ syslog文件,通常是 ``/var/log/messages`` (取決於 ``/etc/syslog.conf``
有時 ``klogd`` 會掛掉,這種情況下您可以運行 ``dmesg > file`` 從內核緩衝區
讀取數據並保存它。或者您可以 ``cat /proc/kmsg > file`` ,但是您必須適時
中斷以停止傳輸,因爲 ``kmsg`` 是一個「永無止境的文件」
中斷以停止傳輸,因爲 ``kmsg`` 是一個“永無止境的文件”
如果機器嚴重崩潰,無法輸入命令或磁碟不可用,那還有三個選項:
如果機器嚴重崩潰,無法輸入命令或磁盤不可用,那還有三個選項:
(1) 手動複製屏幕上的文本,並在機器重新啓動後輸入。很難受,但這是突然崩潰下
唯一的選擇。或者你可以用數位相機拍下屏幕——雖然不那麼好,但總比什麼都沒
有好。如果消息滾動超出控制台頂部,使用更高解析度(例如 ``vga=791`` )
引導啓動將允許您閱讀更多文本。(警告:這需要 ``vesafb`` ,因此對「早期」
唯一的選擇。或者你可以用數碼相機拍下屏幕——雖然不那麼好,但總比什麼都沒
有好。如果消息滾動超出控制檯頂部,使用更高分辨率(例如 ``vga=791`` )
引導啓動將允許您閱讀更多文本。(警告:這需要 ``vesafb`` ,因此對“早期”
的Oppses沒有幫助)
(2) 從串口終端啓動(參見
:ref:`Documentation/admin-guide/serial-console.rst <serial_console>` ),
在另一台機器上運行數據機然後用你喜歡的通信程序捕獲輸出。
在另一臺機器上運行調制解調器然後用你喜歡的通信程序捕獲輸出。
Minicom運行良好。
(3) 使用Kdump(參閱 Documentation/admin-guide/kdump/kdump.rst ),使用
......@@ -103,7 +103,7 @@ syslog文件,通常是 ``/var/log/messages`` (取決於 ``/etc/syslog.conf``
找到缺陷位置
-------------
如果你能指出缺陷在內核原始碼中的位置,則報告缺陷的效果會非常好。這有兩種方法。
如果你能指出缺陷在內核源代碼中的位置,則報告缺陷的效果會非常好。這有兩種方法。
通常來說使用 ``gdb`` 會比較容易,不過內核需要用調試信息來預編譯。
gdb
......@@ -187,7 +187,7 @@ GNU 調試器(GNU debugger, ``gdb`` )是從 ``vmlinux`` 文件中找出OOP
objdump
^^^^^^^^
要調試內核,請使用objdump並從崩潰輸出中查找十六進位偏移,以找到有效的代碼/匯
要調試內核,請使用objdump並從崩潰輸出中查找十六進制偏移,以找到有效的代碼/匯
編行。如果沒有調試符號,您將看到所示例程的彙編程序代碼,但是如果內核有調試
符號,C代碼也將可見(調試符號可以在內核配置菜單的hacking項中啓用)。例如::
......@@ -197,7 +197,7 @@ objdump
您需要處於內核樹的頂層以便此獲得您的C文件。
如果您無法訪問原始碼,仍然可以使用以下方法調試一些崩潰轉儲(如Dave Miller的
如果您無法訪問源代碼,仍然可以使用以下方法調試一些崩潰轉儲(如Dave Miller的
示例崩潰轉儲輸出所示)::
EIP is at +0x14/0x4c0
......@@ -234,9 +234,9 @@ objdump
報告缺陷
---------
一旦你通過定位缺陷找到了其發生的地方,你可以嘗試自己修復它或者向上報告它。
一旦你通過定位缺陷找到了其發生的地方,你可以嘗試自己修復它或者向上報告它。
爲了向上報告,您應該找出用於開發受影響代碼的郵件列表。這可以使用 ``get_maintainer.pl`` 。
爲了向上報告,您應該找出用於開發受影響代碼的郵件列表。這可以使用 ``get_maintainer.pl`` 。
例如,您在gspca的sonixj.c文件中發現一個缺陷,則可以通過以下方法找到它的維護者::
......@@ -251,7 +251,7 @@ objdump
請注意它將指出:
- 最後接觸原始碼的開發人員(如果這是在git樹中完成的)。在上面的例子中是Tejun
- 最後接觸源代碼的開發人員(如果這是在git樹中完成的)。在上面的例子中是Tejun
和Bhaktipriya(在這個特定的案例中,沒有人真正參與這個文件的開發);
- 驅動維護人員(Hans Verkuil);
- 子系統維護人員(Mauro Carvalho Chehab);
......
......@@ -7,10 +7,10 @@
清除 WARN_ONCE
--------------
WARN_ONCE / WARN_ON_ONCE / printk_once 僅僅列印一次消息.
WARN_ONCE / WARN_ON_ONCE / printk_once 僅僅打印一次消息.
echo 1 > /sys/kernel/debug/clear_warn_once
可以清除這種狀態並且再次允許列印一次告警信息,這對於運行測試集後重現問題
可以清除這種狀態並且再次允許打印一次告警信息,這對於運行測試集後重現問題
很有用。
......@@ -20,13 +20,13 @@ Linux通過``/proc/stat``和``/proc/uptime``導出各種信息,用戶空間工
...
裡系統認爲在默認採樣周期內有10.01%的時間工作在用戶空間,2.92%的時
裏系統認爲在默認採樣週期內有10.01%的時間工作在用戶空間,2.92%的時
間用在系統空間,總體上有81.63%的時間是空閒的。
大多數情況下``/proc/stat``的信息幾乎真實反映了系統信息,然而,由於內
核採集這些數據的方式/時間的特點,有時這些信息根本不可靠。
那麼這些信息是如何被集的呢?每當時間中斷觸發時,內核查看此刻運行的
那麼這些信息是如何被集的呢?每當時間中斷觸發時,內核查看此刻運行的
進程類型,並增加與此類型/狀態進程對應的計數器的值。這種方法的問題是
在兩次時間中斷之間系統(進程)能夠在多種狀態之間切換多次,而計數器只
增加最後一種狀態下的計數。
......@@ -34,7 +34,7 @@ Linux通過``/proc/stat``和``/proc/uptime``導出各種信息,用戶空間工
舉例
---
假設系統有一個進程以如下方式周期性地占用cpu::
假設系統有一個進程以如下方式週期性地佔用cpu::
兩個時鐘中斷之間的時間線
|-----------------------|
......@@ -46,7 +46,7 @@ Linux通過``/proc/stat``和``/proc/uptime``導出各種信息,用戶空間工
在上面的情況下,根據``/proc/stat``的信息(由於當系統處於空閒狀態時,
時間中斷經常會發生)系統的負載將會是0
大家能夠想內核的這種行爲會發生在許多情況下,這將導致``/proc/stat``
大家能夠想內核的這種行爲會發生在許多情況下,這將導致``/proc/stat``
中存在相當古怪的信息::
/* gcc -o hog smallhog.c */
......
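The sampling problem described in the cpu-load text above can be shown with a toy model. This Python sketch is not the kernel's accounting code: it assumes the task is busy during a fixed window of every tick period and that the sampler looks only at the instant the tick fires; all names and numbers are illustrative.

```python
def true_load(tick, busy_start, busy_end):
    """Fraction of each tick period during which the task actually runs."""
    return (busy_end - busy_start) / tick


def sampled_load(tick, busy_start, busy_end, ticks=1000):
    """Fraction of timer ticks that observe the task running."""
    busy = 0
    for n in range(ticks):
        now = n * tick          # instant at which the timer interrupt fires
        offset = now % tick     # position inside the period (always 0 here)
        if busy_start <= offset < busy_end:
            busy += 1
    return busy / ticks


if __name__ == "__main__":
    # Busy from 100 to 900 time units of every 1000-unit tick period.
    print(true_load(1000, 100, 900), sampled_load(1000, 100, 900))
```

A task that always goes idle just before the tick is never seen running, so the sampled figure stays at zero however large the true load is.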
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_TW.rst
:Original: Documentation/admin-guide/cputopology.rst
:翻譯:
唐藝舟 Tang Yizhou <tangyeechou@gmail.com>
==========================
如何通過sysfs將CPU拓撲導出
==========================
CPU拓撲信息通過sysfs導出。顯示的項(屬性)和某些架構的/proc/cpuinfo輸出相似。它們位於
/sys/devices/system/cpu/cpuX/topology/。請閱讀ABI文件:
Documentation/ABI/stable/sysfs-devices-system-cpu。
drivers/base/topology.c是體系結構中性的,它導出了這些屬性。然而,die、cluster、book、
draw這些層次結構相關的文件僅在體系結構提供了下文描述的宏的條件下被創建。
對於支持這個特性的體系結構,它必須在include/asm-XXX/topology.h中定義這些宏中的一部分::
#define topology_physical_package_id(cpu)
#define topology_die_id(cpu)
#define topology_cluster_id(cpu)
#define topology_core_id(cpu)
#define topology_book_id(cpu)
#define topology_drawer_id(cpu)
#define topology_sibling_cpumask(cpu)
#define topology_core_cpumask(cpu)
#define topology_cluster_cpumask(cpu)
#define topology_die_cpumask(cpu)
#define topology_book_cpumask(cpu)
#define topology_drawer_cpumask(cpu)
``**_id macros`` 的類型是int。
``**_cpumask macros`` 的類型是 ``(const) struct cpumask *`` 。後者和恰當的
``**_siblings`` sysfs屬性對應(除了topology_sibling_cpumask(),它和thread_siblings
對應)。
爲了在所有體系結構上保持一致,include/linux/topology.h提供了上述所有宏的默認定義,以防
它們未在include/asm-XXX/topology.h中定義:
1) topology_physical_package_id: -1
2) topology_die_id: -1
3) topology_cluster_id: -1
4) topology_core_id: 0
5) topology_book_id: -1
6) topology_drawer_id: -1
7) topology_sibling_cpumask: 僅入參CPU
8) topology_core_cpumask: 僅入參CPU
9) topology_cluster_cpumask: 僅入參CPU
10) topology_die_cpumask: 僅入參CPU
11) topology_book_cpumask: 僅入參CPU
12) topology_drawer_cpumask: 僅入參CPU
此外,CPU拓撲信息由/sys/devices/system/cpu提供,包含下述文件。輸出對應的內部數據源放在
方括號("[]")中。
=========== ==================================================================
kernel_max: 內核配置允許的最大CPU下標值。[NR_CPUS-1]
offline: 由於熱插拔移除或者超過內核允許的CPU上限(上文描述的kernel_max)
導致未上線的CPU。[~cpu_online_mask + cpus >= NR_CPUS]
online: 在線的CPU,可供調度使用。[cpu_online_mask]
possible: 已被分配資源的CPU,如果它們CPU實際存在,可以上線。
[cpu_possible_mask]
present: 被系統識別實際存在的CPU。[cpu_present_mask]
=========== ==================================================================
上述輸出的格式和cpulist_parse()兼容[參見 <linux/cpumask.h>]。下面給些例子。
在本例中,系統中有64個CPU,但是CPU 32-63超過了kernel_max值,因爲NR_CPUS配置項是32,
取值範圍被限制爲0..31。此外注意CPU2和4-31未上線,但是可以上線,因爲它們同時存在於
present和possible::
kernel_max: 31
offline: 2,4-31,32-63
online: 0-1,3
possible: 0-31
present: 0-31
在本例中,NR_CPUS配置項是128,但內核啓動時設置possible_cpus=144。系統中有4個CPU,
CPU2被手動設置下線(也是唯一一個可以上線的CPU)::
kernel_max: 127
offline: 2,4-127,128-143
online: 0-1,3
possible: 0-127
present: 0-3
閱讀Documentation/core-api/cpu_hotplug.rst可瞭解開機參數possible_cpus=NUM,同時還
可以瞭解各種cpumask的信息。
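The list format used in the examples above (comma-separated CPU numbers and inclusive ranges) is simple to parse. A sketch in Python that handles only this basic form; ``parse_cpulist`` is a hypothetical helper, not the kernel's ``cpulist_parse()``, and any additional syntax the kernel accepts is deliberately ignored.

```python
def parse_cpulist(text):
    """Parse a list such as '2,4-31,32-63' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        # A bare number is a range of length one.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


if __name__ == "__main__":
    online = parse_cpulist("0-1,3")
    offline = parse_cpulist("2,4-31,32-63")
    # In the first example, every one of the 64 CPUs is either online or offline.
    print(online | offline == set(range(64)), online & offline)
```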
......@@ -3,13 +3,14 @@
.. include:: ../disclaimer-zh_TW.rst
:Original: :doc:`../../../admin-guide/index`
:Translator: 胡皓文 Hu Haowen <src.res.211@gmail.com>
:Translator: Alex Shi <alex.shi@linux.alibaba.com>
胡皓文 Hu Haowen <src.res.211@gmail.com>
Linux 內核用戶和管理員指南
==========================
下面是一組隨時間添加到內核中的面向用戶的文檔的集合。到目前爲止,還沒有一個
整體的順序或組織 - 這些材料不是一個單一的,連貫的文件!幸運的話,情況會隨
整體的順序或組織 - 這些材料不是一個單一的,連貫的文件!幸運的話,情況會隨
時間的推移而迅速改善。
這個初始部分包含總體信息,包括描述內核的README, 關於內核參數的文檔等。
......@@ -21,15 +22,15 @@ Linux 內核用戶和管理員指南
Todolist:
kernel-parameters
devices
sysctl/index
* kernel-parameters
* devices
* sysctl/index
本節介紹CPU漏洞及其緩解措施。
Todolist:
hw-vuln/index
* hw-vuln/index
下面的一組文檔,針對的是試圖跟蹤問題和bug的用戶。
......@@ -37,6 +38,7 @@ Todolist:
:maxdepth: 1
reporting-issues
reporting-regressions
security-bugs
bug-hunting
bug-bisect
......@@ -45,18 +47,17 @@ Todolist:
Todolist:
reporting-bugs
ramoops
dynamic-debug-howto
kdump/index
perf/index
* ramoops
* dynamic-debug-howto
* kdump/index
* perf/index
這是應用程式開發人員感興趣的章節的開始。可以在這裡找到涵蓋內核ABI各個
這是應用程序開發人員感興趣的章節的開始。可以在這裏找到涵蓋內核ABI各個
方面的文檔。
Todolist:
sysfs-rules
* sysfs-rules
本手冊的其餘部分包括各種指南,介紹如何根據您的喜好配置內核的特定行爲。
......@@ -64,67 +65,67 @@ Todolist:
.. toctree::
:maxdepth: 1
bootconfig
clearing-warn-once
cpu-load
cputopology
lockup-watchdogs
unicode
sysrq
mm/index
Todolist:
acpi/index
aoe/index
auxdisplay/index
bcache
binderfs
binfmt-misc
blockdev/index
bootconfig
braille-console
btmrvl
cgroup-v1/index
cgroup-v2
cifs/index
cputopology
dell_rbu
device-mapper/index
edid
efi-stub
ext4
nfs/index
gpio/index
highuid
hw_random
initrd
iostats
java
jfs
kernel-per-CPU-kthreads
laptops/index
lcd-panel-cgram
ldm
lockup-watchdogs
LSM/index
md
media/index
mm/index
module-signing
mono
namespaces/index
numastat
parport
perf-security
pm/index
pnp
rapidio
ras
rtc
serial-console
svga
sysrq
thunderbolt
ufs
vga-softcursor
video-output
xfs
* acpi/index
* aoe/index
* auxdisplay/index
* bcache
* binderfs
* binfmt-misc
* blockdev/index
* braille-console
* btmrvl
* cgroup-v1/index
* cgroup-v2
* cifs/index
* dell_rbu
* device-mapper/index
* edid
* efi-stub
* ext4
* nfs/index
* gpio/index
* highuid
* hw_random
* initrd
* iostats
* java
* jfs
* kernel-per-CPU-kthreads
* laptops/index
* lcd-panel-cgram
* ldm
* LSM/index
* md
* media/index
* module-signing
* mono
* namespaces/index
* numastat
* parport
* perf-security
* pm/index
* pnp
* rapidio
* ras
* rtc
* serial-console
* svga
* thunderbolt
* ufs
* vga-softcursor
* video-output
* xfs
.. only:: subproject and html
......
......@@ -9,8 +9,8 @@
吳想成 Wu XiangCheng <bobwxc@email.cn>
胡皓文 Hu Haowen <src.res.211@gmail.com>
解釋「No working init found.」啓動掛起消息
==========================================
解釋“No working init found.”啓動掛起消息
=========================================
:作者:
......@@ -18,41 +18,41 @@
Cristian Souza <cristianmsbr at gmail period com>
本文檔提供了加載初始化二進位(init binary)失敗的一些高層級原因(大致按執行
本文檔提供了加載初始化二進制(init binary)失敗的一些高層級原因(大致按執行
順序列出)。
1) **無法掛載根文件系統Unable to mount root FS** :請設置「debug」內核參數(在
1) **無法掛載根文件系統Unable to mount root FS** :請設置“debug”內核參數(在
引導加載程序bootloader配置文件或CONFIG_CMDLINE)以獲取更詳細的內核消息。
2) **初始化二進位不存在於根文件系統上init binary doesn't exist on rootfs** :
2) **初始化二進制不存在於根文件系統上init binary doesn't exist on rootfs** :
確保您的根文件系統類型正確(並且 ``root=`` 內核參數指向正確的分區);擁有
所需的驅動程序,例如SCSI或USB等存儲硬體;文件系統(ext3、jffs2等)是內建的
所需的驅動程序,例如SCSI或USB等存儲硬件;文件系統(ext3、jffs2等)是內建的
(或者作爲模塊由initrd預加載)。
3) **控制台設備損壞Broken console device** : ``console= setup`` 中可能存在
衝突 --> 初始控制台不可用(initial console unavailable)。例如,由於串行
IRQ問題(如缺少基於中斷的配置)導致的某些串行控制台不可靠。嘗試使用不同的
3) **控制檯設備損壞Broken console device** : ``console= setup`` 中可能存在
衝突 --> 初始控制檯不可用(initial console unavailable)。例如,由於串行
IRQ問題(如缺少基於中斷的配置)導致的某些串行控制檯不可靠。嘗試使用不同的
``console= device`` 或像 ``netconsole=`` 。
4) **二進位存在但依賴項不可用Binary exists but dependencies not available** :
例如初始化二進位的必需庫依賴項,像 ``/lib/ld-linux.so.2`` 丟失或損壞。使用
4) **二進制存在但依賴項不可用Binary exists but dependencies not available** :
例如初始化二進制的必需庫依賴項,像 ``/lib/ld-linux.so.2`` 丟失或損壞。使用
``readelf -d <INIT>|grep NEEDED`` 找出需要哪些庫。
5) **無法加載二進位Binary cannot be loaded** :請確保二進位的體系結構與您的
硬體匹配。例如i386不匹配x86_64,或者嘗試在ARM硬體上加載x86。如果您嘗試在
此處加載非二進位文件(shell腳本?),您應該確保腳本在其工作頭(shebang
5) **無法加載二進制Binary cannot be loaded** :請確保二進制的體系結構與您的
硬件匹配。例如i386不匹配x86_64,或者嘗試在ARM硬件上加載x86。如果您嘗試在
此處加載非二進制文件(shell腳本?),您應該確保腳本在其工作頭(shebang
header)行 ``#!/...`` 中指定能正常工作的解釋器(包括其庫依賴項)。在處理
腳本之前,最好先測試一個簡單的非腳本二進位文件,比如 ``/bin/sh`` ,並確認
腳本之前,最好先測試一個簡單的非腳本二進制文件,比如 ``/bin/sh`` ,並確認
它能成功執行。要了解更多信息,請將代碼添加到 ``init/main.c`` 以顯示
kernel_execve()的返回值。
當您發現新的失敗原因時,請擴展本解釋(畢竟加載初始化二進位是一個 **關鍵** 且
當您發現新的失敗原因時,請擴展本解釋(畢竟加載初始化二進制是一個 **關鍵** 且
艱難的過渡步驟,需要儘可能無痛地進行),然後向LKML提交一個補丁。
待辦事項:
- 通過一個可以存儲 ``kernel_execve()`` 結果值的結構體數組實現各種
``run_init_process()`` 調用,並在失敗時通過疊代 **所有** 結果來記錄一切
``run_init_process()`` 調用,並在失敗時通過迭代 **所有** 結果來記錄一切
(非常重要的可用性修復)。
- 試使實現本身在一般情況下更有幫助,例如在受影響的地方提供額外的錯誤消息。
- 試使實現本身在一般情況下更有幫助,例如在受影響的地方提供額外的錯誤消息。
.. include:: ../disclaimer-zh_TW.rst
:Original: Documentation/admin-guide/lockup-watchdogs.rst
:Translator: Hailong Liu <liu.hailong6@zte.com.cn>
.. _tw_lockup-watchdogs:
=================================================
Softlockup與hardlockup檢測機制(又名:nmi_watchdog)
=================================================
Linux中內核實現了一種用以檢測系統發生softlockup和hardlockup的看門狗機制。
Softlockup是一種會引發系統在內核態中一直循環超過20秒(詳見下面“實現”小節)導致
其他任務沒有機會得到運行的BUG。一旦檢測到'softlockup'發生,默認情況下系統會打
印當前堆棧跟蹤信息並進入鎖定狀態。也可配置使其在檢測到'softlockup'後進入panic
狀態;通過sysctl命令設置“kernel.softlockup_panic”、使用內核啓動參數
“softlockup_panic”(詳見Documentation/admin-guide/kernel-parameters.rst)以及使
能內核編譯選項“BOOTPARAM_SOFTLOCKUP_PANIC”都可實現這種配置。
而'hardlockup'是一種會引發系統在內核態一直循環超過10秒鐘(詳見"實現"小節)導致其
他中斷沒有機會運行的缺陷。與'softlockup'情況類似,除了使用sysctl命令設置
'hardlockup_panic'、使能內核選項“BOOTPARAM_HARDLOCKUP_PANIC”以及使用內核參數
"nmi_watchdog"(詳見:”Documentation/admin-guide/kernel-parameters.rst“)外,一旦檢
測到'hardlockup'默認情況下系統打印當前堆棧跟蹤信息,然後進入鎖定狀態。
這個panic選項也可以與panic_timeout結合使用(這個panic_timeout是通過稍具迷惑性的
sysctl命令"kernel.panic"來設置),使系統在panic指定時間後自動重啓。
實現
====
Softlockup和hardlockup分別建立在hrtimer(高精度定時器)和perf兩個子系統上而實現。
這也就意味着理論上任何架構只要實現了這兩個子系統就支持這兩種檢測機制。
Hrtimer用於週期性產生中斷並喚醒watchdog線程;NMI perf事件則以”watchdog_thresh“
(編譯時默認初始化爲10秒,也可通過”watchdog_thresh“這個sysctl接口來進行配置修改)
爲間隔週期產生以檢測 hardlockups。如果一個CPU在這個時間段內沒有檢測到hrtimer中
斷髮生,'hardlockup 檢測器'(即NMI perf事件處理函數)將會視系統配置而選擇產生內核
警告或者直接panic。
而watchdog線程本質上是一個高優先級內核線程,每調度一次就對時間戳進行一次更新。
如果時間戳在2*watchdog_thresh(這個是softlockup的觸發門限)這段時間都未更新,那麼
"softlocup 檢測器"(內部hrtimer定時器回調函數)會將相關的調試信息打印到系統日誌中,
然後如果系統配置了進入panic流程則進入panic,否則內核繼續執行。
Hrtimer定時器的週期是2*watchdog_thresh/5,也就是說在hardlockup被觸發前hrtimer有
2~3次機會產生時鐘中斷。
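The relationship between the thresholds above follows from ``watchdog_thresh`` alone. A small sketch of the arithmetic, assuming the default of 10 seconds quoted in the text; ``watchdog_timings`` is an illustrative helper, not a kernel interface.

```python
def watchdog_timings(watchdog_thresh=10):
    """Derive the detector thresholds and the hrtimer period, all in seconds."""
    return {
        "hardlockup_threshold_s": watchdog_thresh,
        "softlockup_threshold_s": 2 * watchdog_thresh,
        "hrtimer_period_s": 2 * watchdog_thresh / 5,
    }


if __name__ == "__main__":
    t = watchdog_timings()
    # Number of hrtimer periods that fit into one hardlockup window.
    print(t, t["hardlockup_threshold_s"] / t["hrtimer_period_s"])
```

With the default value the hrtimer period is 4 seconds, so between two and three timer interrupts fall inside each 10-second hardlockup window.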
如上所述,內核相當於爲系統管理員提供了一個可調節hrtimer定時器和perf事件週期長度
的調節旋鈕。如何通過這個旋鈕爲特定使用場景配置一個合理的週期值要對lockups檢測的
響應速度和lockups檢測開銷這二者之間進行權衡。
默認情況下所有在線cpu上都會運行一個watchdog線程。不過在內核配置了”NO_HZ_FULL“的
情況下watchdog線程默認只會運行在管家(housekeeping)cpu上,而”nohz_full“啓動參數指
定的cpu上則不會有watchdog線程運行。試想,如果我們允許watchdog線程在”nohz_full“指
定的cpu上運行,這些cpu上必須得運行時鐘定時器來激發watchdog線程調度;這樣一來就會
使”nohz_full“保護用戶程序免受內核干擾的功能失效。當然,副作用就是”nohz_full“指定
的cpu即使在內核產生了lockup問題我們也無法檢測到。不過,至少我們可以允許watchdog
線程在管家(non-tickless)核上繼續運行以便我們能繼續正常的監測這些cpus上的lockups
事件。
不論哪種情況都可以通過sysctl命令kernel.watchdog_cpumask來對沒有運行watchdog線程
的cpu集合進行調節。對於nohz_full而言,如果nohz_full cpu上有異常掛住的情況,通過
這種方式打開這些cpu上的watchdog進行調試可能會有所作用。
This diff is collapsed.