Commit ba49097e authored by Linus Torvalds

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next

Pull sparc updates from David Miller:
 "Of note is the addition of a driver for the Data Analytics
  Accelerator, and some small cleanups"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
  oradax: Fix return value check in dax_attach()
  sparc: vDSO: remove an extra tab
  sparc64: drop unneeded compat include
  sparc64: Oracle DAX driver
  sparc64: Oracle DAX infrastructure
parents ca0c836d 2d85ec8a
Excerpt from UltraSPARC Virtual Machine Specification
Compiled from version 3.0.20+15
Publication date 2017-09-25 08:21
Copyright © 2008, 2015 Oracle and/or its affiliates. All rights reserved.
Extracted via "pdftotext -f 547 -l 572 -layout sun4v_20170925.pdf"
Authors:
Charles Kunzman
Sam Glidden
Mark Cianchetti
Chapter 36. Coprocessor services
The following APIs provide access via the Hypervisor to hardware assisted data processing functionality.
These APIs may only be provided by certain platforms, and may not be available to all virtual machines
even on supported platforms. Restrictions on the use of these APIs may be imposed in order to support
live-migration and other system management activities.
36.1. Data Analytics Accelerator
The Data Analytics Accelerator (DAX) functionality is a collection of hardware coprocessors that provide
high-speed processing of database-centric operations. The coprocessors may support one or more of
the following data query operations: search, extraction, compression, decompression, and translation. The
functionality offered may vary by virtual machine implementation.
The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual device
compatibility property. Functionality is accessed through the submission of Command Control Blocks
(CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the status
of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has a
separate Completion Area and, unless execution order is specifically restricted through the use of serial-
conditional flags, the execution order of submitted CCBs is arbitrary. Likewise, the time to completion
for a given CCB is never guaranteed.
Guest software may implement a software timeout on CCB operations, and if the timeout is exceeded, the
operation may be cancelled or killed via the ccb_kill API function. It is recommended for guest software
to implement a software timeout to account for certain RAS errors which may result in lost CCBs. It is
recommended such implementation use the ccb_info API function to check the status of a CCB prior to
killing it in order to determine if the CCB is still in queue, or may have been lost due to a RAS error.
There is no fixed limit on the number of outstanding CCBs guest software may have queued in the virtual
machine; however, internal resource limitations within the virtual machine can cause CCB submissions
to be temporarily rejected with EWOULDBLOCK. In such cases, guests should continue to attempt
submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would
not be a guarantee that a future submission would succeed.
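The resubmission rule above can be sketched as a small retry loop. This is only an illustration under stated assumptions: the status values HV_EOK and HV_EWOULDBLOCK, the ccb_submit_fn wrapper type, and the fake_submit stand-in are hypothetical names that do not come from this specification, and the retry budget is a design choice added here so the loop cannot spin forever.

```c
#include <stdint.h>

/* Hypothetical status codes; real values come from the sun4v error table. */
enum { HV_EOK = 0, HV_EWOULDBLOCK = 1 };

/* Hypothetical wrapper type around the ccb_submit hypercall. */
typedef int (*ccb_submit_fn)(void *ccbs, uint64_t len, void *ctx);

/*
 * Resubmit until the status is no longer EWOULDBLOCK or the retry budget
 * is used up. Returns the last status seen and reports the attempt count.
 */
static int submit_with_retry(ccb_submit_fn submit, void *ccbs, uint64_t len,
                             void *ctx, unsigned max_tries,
                             unsigned *tries_used)
{
    int status = HV_EWOULDBLOCK;
    unsigned n = 0;

    while (n < max_tries) {
        status = submit(ccbs, len, ctx);
        n++;
        if (status != HV_EWOULDBLOCK)
            break;
    }
    if (tries_used)
        *tries_used = n;
    return status;
}

/* Stand-in for the hypercall: rejects until the counter at ctx is zero. */
static int fake_submit(void *ccbs, uint64_t len, void *ctx)
{
    unsigned *rejections_left = ctx;

    (void)ccbs;
    (void)len;
    if (*rejections_left > 0) {
        (*rejections_left)--;
        return HV_EWOULDBLOCK;
    }
    return HV_EOK;
}
```

Note that the loop does not wait for an outstanding CCB to complete between attempts, since the text states that doing so is neither necessary nor a guarantee of success.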
The availability of DAX coprocessor command service is indicated by the presence of the DAX virtual
device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device
node”).
36.1.1. DAX Compatibility Property
The query functionality may vary based on the compatibility property of the virtual device:
36.1.1.1. "ORCL,sun4v-dax" Device Compatibility
Available CCB commands:
• No-op/Sync
• Extract
• Scan Value
• Inverted Scan Value
• Scan Range
• Inverted Scan Range
• Translate
• Inverted Translate
• Select
See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
Only version 0 CCBs are available.
36.1.1.2. "ORCL,sun4v-dax-fc" Device Compatibility
"ORCL,sun4v-dax-fc" is compatible with the "ORCL,sun4v-dax" interface, and includes additional CCB
bit fields and controls.
36.1.1.3. "ORCL,sun4v-dax2" Device Compatibility
Available CCB commands:
• No-op/Sync
• Extract
• Scan Value
• Inverted Scan Value
• Scan Range
• Inverted Scan Range
• Translate
• Inverted Translate
• Select
See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.
Version 0 and 1 CCBs are available. Only version 0 CCBs may use Huffman encoded data, whereas only
version 1 CCBs may use OZIP.
36.1.2. DAX Virtual Device Interrupts
The DAX virtual device has multiple interrupts associated with it which may be used by the guest if
desired. The number of device interrupts available to the guest is indicated in the virtual device node of the
guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). If the device
node indicates N interrupts available, the guest may use any value from 0 to N - 1 (inclusive) in a CCB
interrupt number field. Using values outside this range will result in the CCB being rejected for an invalid
field value.
The interrupts may be bound and managed using the standard sun4v device interrupts API (Chapter 16,
Device interrupt services). Sysino interrupts are not available for DAX devices.
36.2. Coprocessor Control Block (CCB)
CCBs are either 64 or 128 bytes long, depending on the operation type. The exact contents of the CCB
are command specific, but all CCBs contain at least one memory buffer address. All memory locations
referenced by a CCB must be pinned in memory until the CCB either completes execution or is killed
via the ccb_kill API call. Changes in virtual address mappings occurring after CCB submission are not
guaranteed to be visible, and as such all virtual address updates need to be synchronized with CCB
execution.
All CCBs begin with a common 32-bit header.
Table 36.1. CCB Header Format
Bits Field Description
[31:28] CCB version. For API version 2.0: set to 1 if CCB uses OZIP encoding; set to 0 if the CCB
uses Huffman encoding; otherwise either 0 or 1. For API version 1.0: always set to 0.
[27] When API version 2.0 is negotiated, this is the Pipeline Flag. It is reserved in
API version 1.0.
[26] Long CCB flag
[25] Conditional synchronization flag
[24] Serial synchronization flag
[23:16] CCB operation code:
0x00 No Operation (No-op) or Sync
0x01 Extract
0x02 Scan Value
0x12 Inverted Scan Value
0x03 Scan Range
0x13 Inverted Scan Range
0x04 Translate
0x14 Inverted Translate
0x05 Select
[15:13] Reserved
[12:11] Table address type
0b'00 No address
0b'01 Alternate context virtual address
0b'10 Real address
0b'11 Primary context virtual address
[10:8] Output/Destination address type
0b'000 No address
0b'001 Alternate context virtual address
0b'010 Real address
0b'011 Primary context virtual address
0b'100 Reserved
0b'101 Reserved
0b'110 Reserved
0b'111 Reserved
[7:5] Secondary source address type
0b'000 No address
0b'001 Alternate context virtual address
0b'010 Real address
0b'011 Primary context virtual address
0b'100 Reserved
0b'101 Reserved
0b'110 Reserved
0b'111 Reserved
[4:2] Primary source address type
0b'000 No address
0b'001 Alternate context virtual address
0b'010 Real address
0b'011 Primary context virtual address
0b'100 Reserved
0b'101 Reserved
0b'110 Reserved
0b'111 Reserved
[1:0] Completion area address type
0b'00 No address
0b'01 Alternate context virtual address
0b'10 Real address
0b'11 Primary context virtual address
The Long CCB flag indicates whether the submitted CCB is 64 or 128 bytes long; value is 0 for 64 bytes
and 1 for 128 bytes.
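As an illustration of Table 36.1, the following sketch packs the header fields into the 32-bit header word. The struct ccb_header_fields type, the ccb_pack_header function, and the enumerator names are hypothetical helpers introduced only for this example; the bit positions are taken from the table above, and reserved bits [15:13] are left at zero.

```c
#include <stdint.h>

/* Address type codes used by every address type field in the header. */
enum ccb_addr_type {
    CCB_AT_NONE   = 0, /* no address */
    CCB_AT_ALT_VA = 1, /* alternate context virtual address */
    CCB_AT_REAL   = 2, /* real address */
    CCB_AT_PRI_VA = 3  /* primary context virtual address */
};

/* Hypothetical helper type: one member per header field. */
struct ccb_header_fields {
    unsigned version;       /* [31:28] */
    unsigned pipeline;      /* [27]    */
    unsigned long_ccb;      /* [26]    */
    unsigned conditional;   /* [25]    */
    unsigned serial;        /* [24]    */
    unsigned opcode;        /* [23:16] */
    unsigned table_at;      /* [12:11] */
    unsigned output_at;     /* [10:8]  */
    unsigned secondary_at;  /* [7:5]   */
    unsigned primary_at;    /* [4:2]   */
    unsigned completion_at; /* [1:0]   */
};

/* Pack the fields into the common 32-bit CCB header. */
static uint32_t ccb_pack_header(const struct ccb_header_fields *f)
{
    return ((uint32_t)(f->version       & 0xFu)  << 28) |
           ((uint32_t)(f->pipeline      & 0x1u)  << 27) |
           ((uint32_t)(f->long_ccb      & 0x1u)  << 26) |
           ((uint32_t)(f->conditional   & 0x1u)  << 25) |
           ((uint32_t)(f->serial        & 0x1u)  << 24) |
           ((uint32_t)(f->opcode        & 0xFFu) << 16) |
           ((uint32_t)(f->table_at      & 0x3u)  << 11) |
           ((uint32_t)(f->output_at     & 0x7u)  << 8)  |
           ((uint32_t)(f->secondary_at  & 0x7u)  << 5)  |
           ((uint32_t)(f->primary_at    & 0x7u)  << 2)  |
            (uint32_t)(f->completion_at & 0x3u);
}
```

For example, a version 0 Extract (opcode 0x01) whose completion area, primary input, and output all use real addresses packs to 0x0001020A.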
The Serial and Conditional flags allow simple relative ordering between CCBs. Any CCB with the Serial
flag set will execute sequentially relative to any previous CCB that is also marked as Serial in the same
CCB submission. CCBs without the Serial flag set execute independently, even if they are between CCBs
with the Serial flag set. CCBs marked solely with the Serial flag will execute upon the completion of the
previous Serial CCB, regardless of the completion status of that CCB. The Conditional flag allows CCBs
to conditionally execute based on the successful execution of the closest CCB marked with the Serial flag.
A CCB may only be conditional on exactly one CCB; however, a CCB may be marked both Conditional
and Serial to allow execution chaining. The flags do NOT allow fan-out chaining, where multiple CCBs
execute in parallel based on the completion of another CCB.
The Pipeline flag is an optimization that directs the output of one CCB (the "source" CCB) directly to
the input of the next CCB (the "target" CCB). The target CCB thus does not need to read the input from
memory. The Pipeline flag is advisory and may be dropped.
Both the Pipeline and Serial bits must be set in the source CCB. The Conditional bit must be set in the
target CCB. Exactly one CCB must be made conditional on the source CCB; either 0 or 2 target CCBs
is invalid. However, Pipelines can be extended beyond two CCBs: the sequence would start with a CCB
with both the Pipeline and Serial bits set, proceed through CCBs with the Pipeline, Serial, and Conditional
bits set, and terminate at a CCB that has the Conditional bit set, but not the Pipeline bit.
The input of the target CCB must start within 64 bytes of the output of the source CCB or the pipeline flag
will be ignored. All CCBs in a pipeline must be submitted in the same call to ccb_submit.
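The flag pattern required for a pipeline can be checked in guest software before submission. The sketch below is a minimal illustration, not part of the API: the macro names and ccb_pipeline_flags_ok are hypothetical, and the function checks only the header flag bits of CCBs that are intended to form one pipeline, in submission order. It does not check the 64-byte input/output adjacency rule.

```c
#include <stddef.h>
#include <stdint.h>

/* Header flag bits from Table 36.1. */
#define CCB_HDR_PIPELINE    (UINT32_C(1) << 27)
#define CCB_HDR_CONDITIONAL (UINT32_C(1) << 25)
#define CCB_HDR_SERIAL      (UINT32_C(1) << 24)

/*
 * Returns 1 if the headers follow the pipeline pattern:
 *   first CCB:   Pipeline and Serial set
 *   middle CCBs: Pipeline, Serial and Conditional set
 *   last CCB:    Conditional set, Pipeline clear
 */
static int ccb_pipeline_flags_ok(const uint32_t *hdr, size_t n)
{
    if (n < 2)
        return 0; /* a pipeline needs a source and a target */

    for (size_t i = 0; i < n; i++) {
        int p = (hdr[i] & CCB_HDR_PIPELINE) != 0;
        int c = (hdr[i] & CCB_HDR_CONDITIONAL) != 0;
        int s = (hdr[i] & CCB_HDR_SERIAL) != 0;

        if (i == 0) {
            if (!p || !s)
                return 0;
        } else if (i < n - 1) {
            if (!p || !s || !c)
                return 0;
        } else {
            if (!c || p)
                return 0;
        }
    }
    return 1;
}
```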
The various address type fields indicate how the various address values used in the CCB should be
interpreted by the virtual machine. Not all of the types specified are used by every CCB format. Types
which are not applicable to the given CCB command should be indicated as type 0 (No address). Virtual
addresses used in the CCB must have translation entries present in either the TLB or a configured TSB
for the submitting virtual processor. Virtual addresses which cannot be translated by the virtual machine
will result in the CCB submission being rejected, with the causal virtual address indicated. The CCB
may be resubmitted after inserting the translation, or the address may be translated by guest software and
resubmitted using the real address translation.
36.2.1. Query CCB Command Formats
36.2.1.1. Supported Data Formats, Elements Sizes and Offsets
Data for query commands may be encoded in multiple possible formats. The data query commands use a
common set of values to indicate the encoding formats of the data being processed. Some encoding formats
require multiple data streams for processing, requiring the specification of both primary data formats (the
encoded data) and secondary data streams (meta-data for the encoded data).
36.2.1.1.1. Primary Input Format
The primary input format code is a 4-bit field when it is used. There are 10 primary input formats available.
The packed formats are not endian neutral. Code values not listed below are reserved.
Code  Format                             Description
0x0   Fixed width byte packed            Up to 16 bytes
0x1   Fixed width bit packed             Up to 15 bits (CCB version 0) or 23 bits (CCB version
                                         1); bits are read most significant bit to least significant bit
                                         within a byte
0x2   Variable width byte packed         Data stream of lengths must be provided as a secondary
                                         input
0x4   Fixed width byte packed with run   Up to 16 bytes; data stream of run lengths must be
      length encoding                    provided as a secondary input
0x5   Fixed width bit packed with run    Up to 15 bits (CCB version 0) or 23 bits (CCB version
      length encoding                    1); bits are read most significant bit to least significant bit
                                         within a byte; data stream of run lengths must be provided
                                         as a secondary input
0x8   Fixed width byte packed with       Up to 16 bytes before the encoding; compressed stream
      Huffman (CCB version 0) or         bits are read most significant bit to least significant bit
      OZIP (CCB version 1) encoding      within a byte; pointer to the encoding table must be
                                         provided
0x9   Fixed width bit packed with        Up to 15 bits (CCB version 0) or 23 bits (CCB version
      Huffman (CCB version 0) or         1); compressed stream bits are read most significant bit to
      OZIP (CCB version 1) encoding      least significant bit within a byte; pointer to the encoding
                                         table must be provided
0xA   Variable width byte packed with    Up to 16 bytes before the encoding; compressed stream
      Huffman (CCB version 0) or         bits are read most significant bit to least significant bit
      OZIP (CCB version 1) encoding      within a byte; data stream of lengths must be provided as
                                         a secondary input; pointer to the encoding table must be
                                         provided
0xC   Fixed width byte packed with       Up to 16 bytes before the encoding; compressed stream
      run length encoding, followed by   bits are read most significant bit to least significant bit
      Huffman (CCB version 0) or         within a byte; data stream of run lengths must be provided
      OZIP (CCB version 1) encoding      as a secondary input; pointer to the encoding table must
                                         be provided
0xD   Fixed width bit packed with        Up to 15 bits (CCB version 0) or 23 bits (CCB version 1)
      run length encoding, followed by   before the encoding; compressed stream bits are read most
      Huffman (CCB version 0) or         significant bit to least significant bit within a byte; data
      OZIP (CCB version 1) encoding      stream of run lengths must be provided as a secondary
                                         input; pointer to the encoding table must be provided
If OZIP encoding is used, there must be no reserved bytes in the table.
36.2.1.1.2. Primary Input Element Size
For primary input data streams with fixed size elements, the element size must be indicated in the CCB
command. The size is encoded as the number of bits or bytes, minus one. The valid value range for this
field depends on the input format selected, as listed in the table above.
36.2.1.1.3. Secondary Input Format
For primary input data streams which require a secondary input stream, the secondary input stream is
always encoded in a fixed width, bit-packed format. The bits are read from most significant bit to least
significant bit within a byte. There are two encoding options for the secondary input stream data elements,
depending on whether the value of 0 is needed:
Secondary Input Format Code   Description
0                             Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates
                              to 2, etc.)
1                             Element is stored as value
36.2.1.1.4. Secondary Input Element Size
Secondary input element size is encoded as a two bit field:
Secondary Input Size Code   Description
0x0 1 bit
0x1 2 bits
0x2 4 bits
0x3 8 bits
36.2.1.1.5. Input Element Offsets
Bit-wise input data streams may have any alignment within the base addressed byte. The offset, specified
from most significant bit to least significant bit, is provided as a fixed 3 bit field for each input type. A
value of 0 indicates that the first input element begins at the most significant bit in the first byte, and a
value of 7 indicates it begins with the least significant bit.
This field should be zero for any byte-wise primary input data streams.
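The bit ordering and offset rules can be illustrated with a small software reader for a fixed width, bit-packed stream. This is a sketch under two assumptions that the text does not state explicitly: elements are packed back to back with no padding between them, and the first bit read for an element is that element's most significant bit. The function names read_bit_packed and secondary_value are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Read element number `index` from a bit-packed stream of `width`-bit
 * elements (width <= 32). `start_offset` is the 3-bit Input Element
 * Offset: 0 means the first element starts at the most significant bit
 * of buf[0], 7 means it starts at the least significant bit.
 */
static uint32_t read_bit_packed(const uint8_t *buf, unsigned start_offset,
                                unsigned width, size_t index)
{
    size_t bit = (size_t)start_offset + index * width;
    uint32_t value = 0;

    for (unsigned k = 0; k < width; k++, bit++) {
        /* Bits are consumed from bit 7 down to bit 0 of each byte. */
        unsigned b = (buf[bit / 8] >> (7 - bit % 8)) & 1u;
        value = (value << 1) | b;
    }
    return value;
}

/*
 * Decode a raw secondary input element. Format code 0 stores the value
 * minus 1, so a raw 0 means 1; format code 1 stores the value itself.
 */
static uint32_t secondary_value(uint32_t raw, unsigned format_code)
{
    return format_code == 0 ? raw + 1 : raw;
}
```

With the two bytes 0xB4 0x80 (bits 1011 0100 1000 0000) and 3-bit elements, offset 0 yields 5, 5, 1 and offset 1 yields 3, 2, 2.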
36.2.1.1.6. Output Format
Query commands support multiple sizes and encodings for output data streams. There are four possible
output encodings, and up to four supported element sizes per encoding. Not all output encodings are
supported for every command. The format is indicated by a 4-bit field in the CCB:
Output Format Code Description
0x0 Byte aligned, 1 byte elements
0x1 Byte aligned, 2 byte elements
0x2 Byte aligned, 4 byte elements
0x3 Byte aligned, 8 byte elements
0x4 16 byte aligned, 16 byte elements
0x5 Reserved
0x6 Reserved
0x7 Reserved
0x8 Packed vector of single bit elements
0x9 Reserved
0xA Reserved
0xB Reserved
0xC Reserved
0xD 2 byte elements where each element is the index value of a bit,
from a bit vector, which was 1.
0xE 4 byte elements where each element is the index value of a bit,
from a bit vector, which was 1.
0xF Reserved
36.2.1.1.7. Application Data Integrity (ADI)
On platforms which support ADI, the ADI version number may be specified for each separate memory
access type used in the CCB command. ADI checking only occurs when reading data. When writing data,
the specified ADI version number overwrites any existing ADI value in memory.
An ADI version value of 0 or 0xF indicates the ADI checking is disabled for that data access, even if it is
enabled in memory. By setting the appropriate flag in CCB_SUBMIT (Section 36.3.1, “ccb_submit”) it is
also an option to disable ADI checking for all inputs accessed via virtual address for all CCBs submitted
during that hypercall invocation.
The ADI value is only guaranteed to be checked on the first 64 bytes of each data access. Mismatches on
subsequent data chunks may not be detected, so guest software should be careful to use page size checking
to protect against buffer overruns.
36.2.1.1.8. Page size checking
All data accesses used in CCB commands must be bounded within a single memory page. When addresses
are provided using a virtual address, the page size for checking is extracted from the TTE for that virtual
address. When using real addresses, the guest must supply the page size in the same field as the address
value. The page size must be one of the sizes supported by the underlying virtual machine. Using a value
that is not supported may result in the CCB submission being rejected or the generation of a CCB parsing
error in the completion area.
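Guest software can verify the single-page rule before building a CCB. The helper below is a minimal sketch; the name within_single_page is hypothetical, and it assumes page sizes are powers of two.

```c
#include <stdint.h>

/*
 * Returns 1 if the byte range [addr, addr + len) lies entirely within
 * one page of `page_size` bytes, otherwise 0. Empty ranges, page sizes
 * that are not a power of two, and ranges that wrap are rejected.
 */
static int within_single_page(uint64_t addr, uint64_t len, uint64_t page_size)
{
    if (len == 0 || page_size == 0 || (page_size & (page_size - 1)) != 0)
        return 0;

    uint64_t last = addr + len - 1;
    if (last < addr)
        return 0; /* address arithmetic wrapped around */

    /* Both ends must share the same page base address. */
    return (addr & ~(page_size - 1)) == (last & ~(page_size - 1));
}
```

For an 8 KB page, a 64-byte access at 0x3FC0 passes, while a 128-byte access at the same address crosses into the next page and fails.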
36.2.1.2. Extract command
Converts an input vector in one format to an output vector in another format. All input format types are
supported.
The only supported output format is a padded, byte-aligned output stream, using output codes 0x0 - 0x4.
When the specified output element size is larger than the extracted input element size, zeros are padded to
the extracted input element. First, if the decompressed input size is not a whole number of bytes, 0 bits are
padded to the most significant bit side till the next byte boundary. Next, if the output element size is larger
than the byte padded input element, bytes of value 0 are added based on the Padding Direction bit in the
CCB. If the output element size is smaller than the byte-padded input element size, the input element is
truncated by dropping bytes from the least significant byte side until the selected output size is reached.
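The byte-level padding and truncation rules can be modelled in software as follows. This is a sketch, not a description of the hardware: extract_resize is a hypothetical name, and it assumes each element is a big-endian byte string that has already been padded to a whole number of bytes, so that the "left" side is the lowest address and the least significant byte is the last one.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Resize one byte-padded input element to the output element size.
 * pad_left mirrors the Padding Direction bit: 1 adds zero bytes on the
 * left of the element, 0 adds them on the right.
 */
static void extract_resize(const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_len, int pad_left)
{
    memset(out, 0, out_len);

    if (out_len <= in_len) {
        /* Truncate: drop bytes from the least significant side. */
        memcpy(out, in, out_len);
        return;
    }

    /* Pad: place the element at the right or left end of the output. */
    memcpy(pad_left ? out + (out_len - in_len) : out, in, in_len);
}
```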
The return value of the CCB completion area is invalid. The “number of elements processed” field in the
CCB completion area will be valid.
The extract CCB is a 64-byte “short format” CCB.
The extract CCB command format can be specified by the following packed C structure for a big-endian
machine:
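#include <stdint.h>

struct extract_ccb {
        uint32_t header;              /* offset  0: CCB header */
        uint32_t control;             /* offset  4: command control */
        uint64_t completion;          /* offset  8: completion */
        uint64_t primary_input;       /* offset 16: primary input */
        uint64_t data_access_control; /* offset 24: data access control */
        uint64_t secondary_input;     /* offset 32: secondary input */
        uint64_t reserved;            /* offset 40: reserved */
        uint64_t output;              /* offset 48: output */
        uint64_t table;               /* offset 56: symbol table */
};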
The exact field offsets, sizes, and composition are as follows:
Offset Size Field Description
0 4 CCB header (Table 36.1, “CCB Header Format”)
4 4 Command control
Bits Field Description
[31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
Format”)
[27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
Input Element Size”)
[22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
Input Format”)
[18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
“Secondary Input Element Size”)
[13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
[9] Padding Direction selector: A value of 1 causes padding bytes
to be added to the left side of output elements. A value of 0
causes padding bytes to be added to the right side of output
elements.
[8:0] Reserved
8 8 Completion
Bits Field Description
[63:60] ADI version (see Section 36.2.1.1.7, “Application Data
Integrity (ADI)”)
[59] If set to 1, a virtual device interrupt will be generated using
the device interrupt number specified in the lower bits of this
completion word. If 0, the lower bits of this completion word
are ignored.
[58:6] Completion area address bits [58:6]. Address type is
determined by CCB header.
[5:0] Virtual device interrupt number for completion interrupt, if
enabled.
16 8 Primary Input
Bits Field Description
[63:60] ADI version (see Section 36.2.1.1.7, “Application Data
Integrity (ADI)”)
[59:56] If using real address, these bits should be filled in with the
page size code for the page boundary checking the guest wants
the virtual machine to use when accessing this data stream
(checking is only guaranteed to be performed when using API
version 1.1 and later). If using a virtual address, this field will
be used as primary input address bits [59:56].
[55:0] Primary input address bits [55:0]. Address type is determined
by CCB header.
24 8 Data Access Control
Bits Field Description
[63:62] Flow Control
Value Description
0b'00 Disable flow control
0b'01 Enable flow control (only valid with "ORCL,sun4v-dax-fc"
compatible virtual device variants)
0b'10 Reserved
0b'11 Reserved
[61:60] Reserved (API 1.0)
Pipeline target (API 2.0)
Value Description
0b'00 Connect to primary input
0b'01 Connect to secondary input
0b'10 Reserved
0b'11 Reserved
[59:40] Output buffer size given in units of 64 bytes, minus 1. Value of
0 means 64 bytes, value of 1 means 128 bytes, etc. Buffer size is
only enforced if flow control is enabled in Flow Control field.
[39:32] Reserved
[31:30] Output Data Cache Allocation
Value Description
0b'00 Do not allocate cache lines for output data stream.
0b'01 Force cache lines for output data stream to be
allocated in the cache that is local to the submitting
virtual cpu.
0b'10 Allocate cache lines for output data stream, but allow
existing cache lines associated with the data to remain
in their current cache instance. Any memory not
already in cache will be allocated in the cache local
to the submitting virtual cpu.
0b'11 Reserved
[29:26] Reserved
[25:24] Primary Input Length Format
Value Description
0b'00 Number of primary symbols
0b'01 Number of primary bytes
0b'10 Number of primary bits
0b'11 Reserved
[23:0] Primary Input Length
Format Field Value
# of primary symbols Number of input elements to process,
minus 1. Command execution stops
once count is reached.
# of primary bytes Number of input bytes to process,
minus 1. Command execution stops
once count is reached. The count is
done before any decompression or
decoding.
# of primary bits Number of input bits to process,
minus 1. Command execution stops
once count is reached. The count is
done before any decompression or
decoding, and does not include any
bits skipped by the Primary Input
Offset field value of the command
control word.
32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary
Input.
40 8 Reserved
48 8 Output (same fields as Primary Input)
56 8 Symbol Table (if used by Primary Input)
Bits Field Description
[63:60] ADI version (see Section 36.2.1.1.7, “Application Data
Integrity (ADI)”)
[59:56] If using real address, these bits should be filled in with the
page size code for the page boundary checking the guest wants
the virtual machine to use when accessing this data stream
(checking is only guaranteed to be performed when using API
version 1.1 and later). If using a virtual address, this field will
be used as symbol table address bits [59:56].
[55:4] Symbol table address bits [55:4]. Address type is determined
by CCB header.
[3:0] Symbol table version
Value Description
0 Huffman encoding. Must use 64 byte aligned table
address. (Only available when using version 0 CCBs)
1 OZIP encoding. Must use 16 byte aligned table
address. (Only available when using version 1 CCBs)
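The field tables above can be turned into packing helpers for three of the Extract CCB words. The sketch below is illustrative only: the function names are hypothetical, reserved fields and the API 2.0 Pipeline target field are left at zero, and the caller is assumed to pass a 64-byte aligned completion area address and an output buffer size that is a non-zero multiple of 64 bytes.

```c
#include <stdint.h>

/* Command control word (offset 4). Element size is passed unencoded. */
static uint32_t extract_control(unsigned pri_fmt, unsigned pri_elem_size,
                                unsigned pri_offset, unsigned sec_fmt,
                                unsigned sec_offset, unsigned sec_size,
                                unsigned out_fmt, unsigned pad_left)
{
    return ((uint32_t)(pri_fmt & 0xFu) << 28) |
           ((uint32_t)((pri_elem_size - 1) & 0x1Fu) << 23) | /* minus one */
           ((uint32_t)(pri_offset & 0x7u) << 20) |
           ((uint32_t)(sec_fmt & 0x1u) << 19) |
           ((uint32_t)(sec_offset & 0x7u) << 16) |
           ((uint32_t)(sec_size & 0x3u) << 14) |
           ((uint32_t)(out_fmt & 0xFu) << 10) |
           ((uint32_t)(pad_left & 0x1u) << 9);
}

/* Completion word (offset 8). The interrupt number is ignored if disabled. */
static uint64_t ccb_completion_word(unsigned adi, uint64_t addr,
                                    int irq_enable, unsigned irq_num)
{
    const uint64_t addr_mask = ((UINT64_C(1) << 59) - 1) & ~UINT64_C(0x3F);

    return ((uint64_t)(adi & 0xFu) << 60) |
           ((uint64_t)(irq_enable ? 1 : 0) << 59) |
           (addr & addr_mask) |                      /* address bits [58:6] */
           (irq_enable ? (uint64_t)(irq_num & 0x3Fu) : 0);
}

/* Data access control word (offset 24). Sizes are passed unencoded. */
static uint64_t ccb_data_access_control(unsigned flow, uint64_t out_bytes,
                                        unsigned cache_alloc,
                                        unsigned len_fmt, uint32_t length)
{
    return ((uint64_t)(flow & 0x3u) << 62) |
           (((out_bytes / 64 - 1) & UINT64_C(0xFFFFF)) << 40) | /* 64B units - 1 */
           ((uint64_t)(cache_alloc & 0x3u) << 30) |
           ((uint64_t)(len_fmt & 0x3u) << 24) |
           (uint64_t)((length - 1) & 0xFFFFFFu);                /* count - 1 */
}
```

For example, a bit-packed input (format 0x1) of 10-bit elements extracted to 2-byte elements (output format 0x1) with left padding gives a control word of 0x14800600.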
36.2.1.3. Scan commands
The scan commands search a stream of input data elements for values which match the selection criteria.
All the input format types are supported. There are multiple formats for the scan commands, allowing the
scan to search for exact matches to one value, exact matches to either of two values, or any value within
a specified range. The specific type of scan is indicated by the command code in the CCB header. For the
scan range commands, the boundary conditions can be specified as greater-than-or-equal-to a value, less-
than-or-equal-to a value, or both by using two boundary values.
There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
0xD, and 0xE). For the standard scan command using the bit vector output, for each input element there
exists one bit in the vector that is set if the input element matched the scan criteria, or clear if not. The
inverted scan command inverts the polarity of the bits in the output. The most significant bit of the first
byte of the output stream corresponds to the first element in the input stream. The standard index array
output format contains one array entry for each input element that matched the scan criteria. Each array
entry is the index of an input element that matched the scan criteria. An inverted scan command produces
a similar array, but of all the input elements which did NOT match the scan criteria.
The return value of the CCB completion area contains the number of input elements found which match
the scan criteria (or number that did not match for the inverted scans). The “number of elements processed”
field in the CCB completion area will be valid, indicating the number of input elements processed.
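A software reference model can help when validating scan results. The sketch below models a scan value command with bit vector output on already decoded elements; scan_value_model is a hypothetical name, the elements are limited to 32 bits for simplicity, and leaving the unused trailing bits of the last output byte at zero is an assumption of this model rather than a documented behaviour.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare each input element against two exact match values and write
 * one bit per element into `bits` (at least (n + 7) / 8 bytes). The most
 * significant bit of bits[0] corresponds to in[0]. With `inverted` set,
 * the bit is set for elements that do NOT match. Returns the number of
 * bits set, which models the completion area return value.
 */
static size_t scan_value_model(const uint32_t *in, size_t n,
                               uint32_t value_a, uint32_t value_b,
                               int inverted, uint8_t *bits)
{
    size_t hits = 0;

    memset(bits, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; i++) {
        int match = (in[i] == value_a) || (in[i] == value_b);

        if (inverted)
            match = !match;
        if (match) {
            bits[i / 8] |= (uint8_t)(0x80u >> (i % 8));
            hits++;
        }
    }
    return hits;
}
```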
These commands are 128-byte “long format” CCBs.
The scan CCB command format can be specified by the following packed C structure for a big-endian
machine:
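#include <stdint.h>

struct scan_ccb {
        uint32_t header;              /* offset  0: CCB header */
        uint32_t control;             /* offset  4: command control */
        uint64_t completion;          /* offset  8: completion */
        uint64_t primary_input;       /* offset 16: primary input */
        uint64_t data_access_control; /* offset 24: data access control */
        uint64_t secondary_input;     /* offset 32: secondary input */
        uint64_t match_criteria0;     /* offset 40: operand bytes, first and second */
        uint64_t output;              /* offset 48: output */
        uint64_t table;               /* offset 56: symbol table */
        uint64_t match_criteria1;     /* offset 64: next operand bytes */
        uint64_t match_criteria2;     /* offset 72: next operand bytes */
        uint64_t match_criteria3;     /* offset 80: next operand bytes */
        uint64_t reserved[5];         /* offset 88: pads the CCB to 128 bytes */
};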
The exact field offsets, sizes, and composition are as follows:
Offset Size Field Description
0 4 CCB header (Table 36.1, “CCB Header Format”)
4 4 Command control
Bits Field Description
[31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
Format”)
[27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
Input Element Size”)
[22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
Input Format”)
[18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
“Secondary Input Element Size”)
[13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
[9:5] Operand size for first scan criteria value. In a scan value
operation, this is one of two potential exact match values.
In a scan range operation, this is the size of the upper range
boundary. The value of this field is the number of bytes in the
operand, minus 1. Values 0xF-0x1E are reserved. A value of
0x1F indicates this operand is not in use for this scan operation.
[4:0] Operand size for second scan criteria value. In a scan value
operation, this is one of two potential exact match values.
In a scan range operation, this is the size of the lower range
boundary. The value of this field is the number of bytes in the
operand, minus 1. Values 0xF-0x1E are reserved. A value of
0x1F indicates this operand is not in use for this scan operation.
8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary
Input.
40 4 Most significant 4 bytes of first scan criteria operand. If first operand is less
than 4 bytes, the value is left-aligned to the lowest address bytes.
44 4 Most significant 4 bytes of second scan criteria operand. If second operand
is less than 4 bytes, the value is left-aligned to the lowest address bytes.
48 8 Output (same fields as Primary Input)
56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
“Extract command”
64 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 40, if needed by the operand size. If first operand
is less than 8 bytes, the valid bytes are left-aligned to the lowest address.
68 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 44, if needed by the operand size. If second
operand is less than 8 bytes, the valid bytes are left-aligned to the lowest
address.
72 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 64, if needed by the operand size. If first operand
is less than 12 bytes, the valid bytes are left-aligned to the lowest address.
76 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 68, if needed by the operand size. If second
operand is less than 12 bytes, the valid bytes are left-aligned to the lowest
address.
80 4 Next 4 most significant bytes of first scan criteria operand occurring after the
bytes specified at offset 72, if needed by the operand size. If first operand
is less than 16 bytes, the valid bytes are left-aligned to the lowest address.
84 4 Next 4 most significant bytes of second scan criteria operand occurring after
the bytes specified at offset 76, if needed by the operand size. If second
operand is less than 16 bytes, the valid bytes are left-aligned to the lowest
address.
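Because the two scan operands are interleaved in 4-byte pieces, filling them by hand is error prone. The sketch below copies an operand, most significant byte first, into a 128-byte CCB image at the offsets listed above. The names scan_set_operand and scan_operand_size_field are hypothetical, and the CCB is treated as a plain byte array rather than a struct so that the offsets are explicit.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy a scan operand of up to 16 bytes into the CCB image.
 * `which` is 0 for the first operand (offsets 40, 64, 72, 80) and
 * non-zero for the second operand (offsets 44, 68, 76, 84). Bytes are
 * left-aligned, so a short operand fills the lowest addresses first.
 */
static void scan_set_operand(uint8_t ccb[128], int which,
                             const uint8_t *operand, size_t len)
{
    static const size_t first[4]  = { 40, 64, 72, 80 };
    static const size_t second[4] = { 44, 68, 76, 84 };
    const size_t *offset = which ? second : first;

    for (size_t i = 0; i < len && i < 16; i++)
        ccb[offset[i / 4] + i % 4] = operand[i];
}

/*
 * Operand size field for the command control word: the number of
 * bytes minus 1, or 0x1F when the operand is not in use.
 */
static unsigned scan_operand_size_field(size_t len)
{
    return len == 0 ? 0x1Fu : (unsigned)(len - 1);
}
```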
36.2.1.4. Translate commands
The translate commands take an input array of indices, and a table of single bit values indexed by those
indices, and output a bit vector or index array created by reading the table's bit value at each index in
the input array. The output should therefore contain exactly one bit per index in the input data stream,
when outputting as a bit vector. When outputting as an index array, the number of elements depends on the
values read in the bit table, but will always be less than, or equal to, the number of input elements. Only
a restricted subset of the possible input format types are supported. No variable width or Huffman/OZIP
encoded input streams are allowed. The primary input data element size must be 3 bytes or less.
The maximum table index size allowed is 15 bits; however, larger input elements may be used to provide
additional processing of the output values. If 2 or 3 byte values are used, the least significant 15 bits are
used as an index into the bit table. The most significant 9 bits (when using 3-byte input elements) or single
bit (when using 2-byte input elements) are compared against a fixed 9-bit test value provided in the CCB.
If the values match, the value from the bit table is used as the output element value. If the values do not
match, the output data element value is forced to 0.
In the inverted translate operation, the bit value read from the bit table is inverted prior to its use. The
additional processing based on any non-index bits remains unchanged, and still forces the output
element value to 0 on a mismatch. The specific type of translate command is indicated by the command
code in the CCB header.
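The per-element behaviour of the translate commands can be sketched as a software model. Two points are assumptions of this sketch and are not stated in the text: the bit table is indexed most significant bit first within each byte, and for 2-byte elements the single extra bit is compared against the test value as a zero-extended integer. The name translate_bit is hypothetical.

```c
#include <stdint.h>

/*
 * Compute the output bit for one input element.
 *   table      bit table, at least 4096 bytes for a full 15-bit index
 *   elem       input element value (1, 2 or 3 bytes wide)
 *   elem_bytes width of the input element in bytes
 *   test9      9-bit test value from the command control word
 *   inverted   non-zero for the inverted translate command
 */
static int translate_bit(const uint8_t *table, uint32_t elem,
                         unsigned elem_bytes, unsigned test9, int inverted)
{
    uint32_t index = elem & 0x7FFFu; /* least significant 15 bits */
    int bit = (table[index / 8] >> (7 - index % 8)) & 1;

    if (inverted)
        bit = !bit; /* inversion happens before the test value check */

    if (elem_bytes >= 2) {
        /* 9 extra bits for 3-byte elements, 1 extra bit for 2-byte ones. */
        uint32_t extra = (elem >> 15) & (elem_bytes == 3 ? 0x1FFu : 0x1u);

        if (extra != (test9 & 0x1FFu))
            return 0; /* mismatch forces the output element to 0 */
    }
    return bit;
}
```

Note that a mismatch yields 0 for both the standard and the inverted command, as described above.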
There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the
output format were a bit vector.
The return value of the CCB completion area contains the number of bits set in the output bit vector,
or number of elements in the output index array. The “number of elements processed” field in the CCB
completion area will be valid, indicating the number of input elements processed.
These commands are 64-byte “short format” CCBs.
The translate CCB command format can be specified by the following packed C structure for a big-endian
machine:
struct translate_ccb {
uint32_t header;
uint32_t control;
uint64_t completion;
uint64_t primary_input;
uint64_t data_access_control;
uint64_t secondary_input;
uint64_t reserved;
uint64_t output;
uint64_t table;
};
The exact field offsets, sizes, and composition are as follows:
Offset Size Field Description
0 4 CCB header (Table 36.1, “CCB Header Format”)
4 4 Command control
Bits Field Description
[31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
Format”)
[27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
Input Element Size”)
[22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
Input Format”)
[18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
“Secondary Input Element Size”)
[13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
[9] Reserved
[8:0] Test value used for comparison against the most significant bits
in the input values, when using 2 or 3 byte input elements.
8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”,
except Primary Input Length Format may not use the 0x0 value)
32 8 Secondary Input, if used by Primary Input Format. Same fields as Primary
Input.
40 8 Reserved
48 8 Output (same fields as Primary Input)
56 8 Bit Table
Bits Field Description
[63:60] ADI version (see Section 36.2.1.1.7, “Application Data
Integrity (ADI)”)
[59:56] If using real address, these bits should be filled in with the
page size code for the page boundary checking the guest wants
the virtual machine to use when accessing this data stream
(checking is only guaranteed to be performed when using API
version 1.1 and later). If using a virtual address, this field will
be used as bit table address bits [59:56]
[55:4] Bit table address bits [55:4]. Address type is determined by
CCB header. Address must be 64-byte aligned (CCB version
0) or 16-byte aligned (CCB version 1).
[3:0] Bit table version
Value Description
0 4KB table size
1 8KB table size
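The bit positions listed above can be turned into packing helpers along the following lines. This is an illustrative sketch only: the structure and function names are hypothetical, the values are computed in host integer arithmetic (storing them into a big-endian CCB is left to the caller), and the bit table helper covers only the real-address case, where bits [59:56] carry the page size code.

```c
#include <stdint.h>

/* Hypothetical container for the translate command control fields. */
struct translate_control_fields {
    uint8_t  pri_fmt;        /* [31:28] primary input format        */
    uint8_t  pri_elem_size;  /* [27:23] primary input element size  */
    uint8_t  pri_offset;     /* [22:20] primary input start offset  */
    uint8_t  sec_fmt;        /* [19]    secondary input format      */
    uint8_t  sec_offset;     /* [18:16] secondary input start offset */
    uint8_t  sec_elem_size;  /* [15:14] secondary input element size */
    uint8_t  out_fmt;        /* [13:10] output format               */
    uint16_t test_value;     /* [8:0]   test value; bit [9] reserved */
};

static uint32_t translate_control_pack(const struct translate_control_fields *f)
{
    return ((uint32_t)(f->pri_fmt & 0xFu) << 28) |
           ((uint32_t)(f->pri_elem_size & 0x1Fu) << 23) |
           ((uint32_t)(f->pri_offset & 0x7u) << 20) |
           ((uint32_t)(f->sec_fmt & 0x1u) << 19) |
           ((uint32_t)(f->sec_offset & 0x7u) << 16) |
           ((uint32_t)(f->sec_elem_size & 0x3u) << 14) |
           ((uint32_t)(f->out_fmt & 0xFu) << 10) |
           ((uint32_t)(f->test_value & 0x1FFu));
}

/* Bit table word for a real address: the low 4 address bits are
 * replaced by the table version, so the address must be at least
 * 16-byte aligned. */
static uint64_t bit_table_word_ra(uint64_t real_addr, unsigned adi_version,
                                  unsigned page_size_code,
                                  unsigned table_version)
{
    return ((uint64_t)(adi_version & 0xFu) << 60) |
           ((uint64_t)(page_size_code & 0xFu) << 56) |
           (real_addr & UINT64_C(0x00FFFFFFFFFFFFF0)) |
           (uint64_t)(table_version & 0xFu);
}
```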
36.2.1.5. Select command
The select command filters the primary input data stream by using a secondary input bit vector to determine
which input elements to include in the output. For each bit set at a given index N within the bit vector,
the Nth input element is included in the output. If the bit is not set, the element is not included. Only a
restricted subset of the possible input format types is supported. No variable width or run length encoded
input streams are allowed, since the secondary input stream is used for the filtering bit vector.
The only supported output format is a padded, byte-aligned output stream. The stream follows the same
rules and restrictions as the padded output stream described in Section 36.2.1.2, “Extract command”.
The return value of the CCB completion area contains the number of bits set in the input bit vector. The
"number of elements processed" field in the CCB completion area will be valid, indicating the number
of input elements processed.
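As a rough software model of the filtering described above (a sketch for illustration, not the hardware algorithm), the following hypothetical function copies fixed-width elements whose bit is set. It assumes bit N of the vector is stored most significant bit first within each byte, which this section does not specify, and it ignores output padding.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy every element whose bit is set in bitvec from in to out.
 * Returns the number of elements written, which for a bit vector of
 * exactly `count` bits equals the number of bits set.
 */
static size_t select_elements(const uint8_t *in, size_t elem_size,
                              size_t count, const uint8_t *bitvec,
                              uint8_t *out)
{
    size_t selected = 0;

    for (size_t i = 0; i < count; i++) {
        /* Assumption: MSB-first bit order within each byte. */
        if ((bitvec[i >> 3] >> (7u - (i & 7u))) & 1u) {
            memcpy(out + selected * elem_size,
                   in + i * elem_size, elem_size);
            selected++;
        }
    }
    return selected;
}
```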
The select CCB is a 64-byte “short format” CCB.
The select CCB command format can be specified by the following packed C structure for a big-endian
machine:
struct select_ccb {
uint32_t header;
uint32_t control;
uint64_t completion;
uint64_t primary_input;
uint64_t data_access_control;
uint64_t secondary_input;
uint64_t reserved;
uint64_t output;
uint64_t table;
};
The exact field offsets, sizes, and composition are as follows:
Offset Size Field Description
0 4 CCB header (Table 36.1, “CCB Header Format”)
4 4 Command control
Bits Field Description
[31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
Format”)
[27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
Input Element Size”)
[22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
Input Format”)
[18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
Element Offsets”)
[15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
“Secondary Input Element Size”)
[13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
[9] Padding Direction selector: A value of 1 causes padding bytes
to be added to the left side of output elements. A value of 0
causes padding bytes to be added to the right side of output
elements.
[8:0] Reserved
8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
32 8 Secondary Bit Vector Input. Same fields as Primary Input.
40 8 Reserved
48 8 Output (same fields as Primary Input)
56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
“Extract command”
36.2.1.6. No-op and Sync commands
The no-op (no operation) command is a CCB which has no processing effect. The CCB, when processed
by the virtual machine, simply updates the completion area with its execution status. The CCB may have
the serial-conditional flags set in order to restrict when it executes.
The sync command is a variant of the no-op command with restricted execution timing. A sync
command CCB will only execute when all previous commands submitted in the same request have
completed. This is stronger than the conditional flag sequencing, which is only dependent on a single
previous serial CCB. While the relative ordering is guaranteed, virtual machine implementations with
shared hardware resources may cause the sync command to wait for longer than the minimum required
time.
The return value of the CCB completion area is invalid for these CCBs. The “number of elements
processed” field is also invalid for these CCBs.
These commands are 64-byte “short format” CCBs.
The no-op CCB command format can be specified by the following packed C structure for a big-endian
machine:
struct nop_ccb {
uint32_t header;
uint32_t control;
uint64_t completion;
uint64_t reserved[6];
};
The exact field offsets, sizes, and composition are as follows:
Offset Size Field Description
0 4 CCB header (Table 36.1, “CCB Header Format”)
4 4 Command control
Bits Field Description
[31] If set, this CCB functions as a Sync command. If clear, this
CCB functions as a No-op command.
[30:0] Reserved
8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
16 48 Reserved
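A minimal sketch of filling in a no-op or sync CCB is shown below. The helper name and the NOP_CTL_SYNC macro are hypothetical, the header word is passed in opaquely because its format is defined in Table 36.1, and the structure is only byte-for-byte correct on a big-endian machine, as noted above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct nop_ccb {
    uint32_t header;
    uint32_t control;
    uint64_t completion;
    uint64_t reserved[6];
};

/* The no-op/sync CCB is a 64-byte "short format" CCB. */
static_assert(sizeof(struct nop_ccb) == 64, "nop_ccb must be 64 bytes");

/* Hypothetical name for control bit [31]: set selects Sync. */
#define NOP_CTL_SYNC (UINT32_C(1) << 31)

static void nop_ccb_init(struct nop_ccb *ccb, uint32_t header,
                         uint64_t completion, int sync)
{
    memset(ccb, 0, sizeof(*ccb));       /* reserved fields stay zero */
    ccb->header = header;               /* see Table 36.1 */
    ccb->control = sync ? NOP_CTL_SYNC : 0;
    ccb->completion = completion;
}
```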
36.2.2. CCB Completion Area
All CCB commands use a common 128-byte Completion Area format, which can be specified by the
following packed C structure for a big-endian machine:
struct completion_area {
uint8_t status_flag;
uint8_t error_note;
uint8_t rsvd0[2];
uint32_t error_values;
uint32_t output_size;
uint32_t rsvd1;
uint64_t run_time;
uint64_t run_stats;
uint32_t elements;
uint8_t rsvd2[20];
uint64_t return_value;
uint64_t extra_return_value[8];
};
The Completion Area must be a 128-byte aligned memory location. The exact layout can be described
using byte offsets and sizes relative to the memory base:
Offset Size Field Description
0 1 CCB execution status
0x0 Command not yet completed
0x1 Command ran and succeeded
0x2 Command ran and failed (partial results may have been
produced)
0x3 Command ran and was killed (partial execution may
have occurred)
0x4 Command was not run
0x5-0xF Reserved
1 1 Error reason code
0x0 Reserved
0x1 Buffer overflow
0x2 CCB decoding error
0x3 Page overflow
0x4-0x6 Reserved
0x7 Command was killed
0x8 Command execution timeout
0x9 ADI miscompare error
0xA Data format error
0xB-0xD Reserved
0xE Unexpected hardware error (Do not retry)
0xF Unexpected hardware error (Retry is ok)
0x10-0x7F Reserved
0x80 Partial Symbol Warning
0x81-0xFF Reserved
2 2 Reserved
4 4 If a partial symbol warning was generated, this field contains the number
of remaining bits which were not decoded.
8 4 Number of bytes of output produced
12 4 Reserved
16 8 Runtime of command (unspecified time units)
24 8 Reserved
32 4 Number of elements processed
36 20 Reserved
56 8 Return value
64 64 Extended return value
The CCB completion area should be treated as read-only by guest software. The CCB execution status
byte will be cleared by the Hypervisor to reflect the pending execution status when the CCB is submitted
successfully. All other fields are considered invalid upon CCB submission until the CCB execution status
byte becomes non-zero.
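The layout and the "status byte becomes non-zero" rule can be checked and exercised with a short sketch such as the one below. The static assertions restate the offsets from the table above. The bounded polling helper is hypothetical and deliberately simple; it is a plain busy-poll and does not use any platform-specific wait mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct completion_area {
    uint8_t  status_flag;
    uint8_t  error_note;
    uint8_t  rsvd0[2];
    uint32_t error_values;
    uint32_t output_size;
    uint32_t rsvd1;
    uint64_t run_time;
    uint64_t run_stats;
    uint32_t elements;
    uint8_t  rsvd2[20];
    uint64_t return_value;
    uint64_t extra_return_value[8];
};

/* Offsets from the field table above. */
static_assert(sizeof(struct completion_area) == 128, "size");
static_assert(offsetof(struct completion_area, error_values) == 4, "off");
static_assert(offsetof(struct completion_area, output_size) == 8, "off");
static_assert(offsetof(struct completion_area, run_time) == 16, "off");
static_assert(offsetof(struct completion_area, elements) == 32, "off");
static_assert(offsetof(struct completion_area, return_value) == 56, "off");
static_assert(offsetof(struct completion_area, extra_return_value) == 64,
              "off");

/*
 * Poll the status byte at most max_polls times. Returns the status
 * byte, or 0 if the command had not completed by then. Other fields
 * must not be read until this returns non-zero.
 */
static uint8_t completion_poll(const volatile struct completion_area *ca,
                               unsigned long max_polls)
{
    uint8_t status = 0;

    for (unsigned long i = 0; i < max_polls; i++) {
        status = ca->status_flag;
        if (status != 0)
            break;
    }
    return status;
}
```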
CCBs which complete with status 0x2 or 0x3 may produce partial results and/or side effects due to partial
execution of the CCB command. Some valid data may be accessible depending on the fault type; however,
it is recommended that guest software treat the destination buffer as being in an unknown state. If a CCB
completes with a status byte of 0x2, the error reason code byte can be read to determine what corrective
action should be taken.
A buffer overflow indicates that the results of the operation exceeded the size of the output buffer indicated
in the CCB. The operation can be retried by resubmitting the CCB with a larger output buffer.
A CCB decoding error indicates that the CCB contained some invalid field values. It may also be
triggered if the CCB output is directed at a non-existent secondary input and the pipelining hint is followed.
A page overflow error indicates that the operation required accessing a memory location beyond the page
size associated with a given address. No data will have been read or written past the page boundary, but
partial results may have been written to the destination buffer. The CCB can be resubmitted with a larger
page size memory allocation to complete the operation.
In the case of pipelined CCBs, a page overflow error will be triggered if the output from the pipeline source
CCB ends before the input of the pipeline target CCB. Page boundaries are ignored when the pipeline
hint is followed.
Command kill indicates that the CCB execution was halted or prevented by use of the ccb_kill API call.
Command timeout indicates that the CCB execution began, but did not complete within a pre-determined
limit set by the virtual machine. The command may have produced some or no output. The CCB may be
resubmitted with no alterations.
ADI miscompare indicates that the memory buffer version specified in the CCB did not match the value
in memory when accessed by the virtual machine. Guest software should not attempt to resubmit the CCB
without determining the cause of the version mismatch.
A data format error indicates that the input data stream did not follow the specified data input formatting
selected in the CCB.
Some CCBs which encounter hardware errors may be resubmitted without change. Persistent hardware
errors may result in multiple failures until RAS software can identify and isolate the faulty component.
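The corrective actions described in the preceding paragraphs can be summarized as a lookup from error reason code to a suggested next step. The enum and function names below are hypothetical, and the handling of reserved codes and of the Partial Symbol Warning (0x80), which the text above does not assign an action, is a conservative assumption.

```c
#include <stdint.h>

enum ccb_action {
    CCB_ACT_NONE,                 /* nothing to correct              */
    CCB_ACT_RETRY_UNCHANGED,      /* resubmit the same CCB           */
    CCB_ACT_RETRY_LARGER_OUTPUT,  /* resubmit with a bigger buffer   */
    CCB_ACT_RETRY_LARGER_PAGE,    /* resubmit with a bigger page     */
    CCB_ACT_FIX_CCB_OR_DATA,      /* CCB fields or input are invalid */
    CCB_ACT_INVESTIGATE,          /* do not resubmit blindly         */
    CCB_ACT_GIVE_UP               /* do not retry                    */
};

static enum ccb_action ccb_error_action(uint8_t reason)
{
    switch (reason) {
    case 0x1: return CCB_ACT_RETRY_LARGER_OUTPUT; /* buffer overflow   */
    case 0x2: return CCB_ACT_FIX_CCB_OR_DATA;     /* CCB decoding      */
    case 0x3: return CCB_ACT_RETRY_LARGER_PAGE;   /* page overflow     */
    case 0x7: return CCB_ACT_NONE;                /* killed on request */
    case 0x8: return CCB_ACT_RETRY_UNCHANGED;     /* timeout           */
    case 0x9: return CCB_ACT_INVESTIGATE;         /* ADI miscompare    */
    case 0xA: return CCB_ACT_FIX_CCB_OR_DATA;     /* data format       */
    case 0xE: return CCB_ACT_GIVE_UP;             /* hw, do not retry  */
    case 0xF: return CCB_ACT_RETRY_UNCHANGED;     /* hw, retry is ok   */
    default:
        /* Assumption: reserved codes and 0x80 are treated
         * conservatively, since no action is specified for them. */
        return CCB_ACT_GIVE_UP;
    }
}
```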
The output size field indicates the number of bytes of valid output in the destination buffer. This field is
not valid for all possible CCB commands.
The runtime field indicates the execution time of the CCB command once it leaves the internal virtual
machine queue. The time units are fixed, but unspecified, allowing only relative timing comparisons
by guest software. The time units may also vary by hardware platform, and should not be construed to
represent any absolute time value.
Some data query commands process data in units of elements. If applicable to the command, the number of
elements processed is indicated in the listed field. This field is not valid for all possible CCB commands.
The return value and extended return value fields are output locations for commands which do not use
a destination output buffer, or have secondary return results. The field is not valid for all possible CCB
commands.
36.3. Hypervisor API Functions
36.3.1. ccb_submit
trap# FAST_TRAP
function# CCB_SUBMIT
arg0 address
arg1 length
arg2 flags
arg3 reserved
ret0 status
ret1 length
ret2 status data
ret3 reserved
Submit one or more coprocessor control blocks (CCBs) for evaluation and processing by the virtual
machine. The CCBs are passed in a linear array indicated by address. length indicates the size of
the array in bytes.
The address should be aligned to the size indicated by length, rounded up to the nearest power of
two. Virtual machine implementations may reject submissions which do not adhere to that alignment.
length must be a multiple of 64 bytes. If length is zero, the maximum supported array length will be
returned as length in ret1. In all other cases, the length value in ret1 will reflect the number of bytes
successfully consumed from the input CCB array.
Implementation note
Virtual machines should never reject submissions based on the alignment of address if the
entire array is contained within a single memory page of the smallest page size supported by the
virtual machine.
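A guest can check the alignment rule before calling ccb_submit with something like the sketch below. The function names are hypothetical. The check is the strict one: it does not model the single-page exemption from the implementation note, so an array it reports as misaligned may still be accepted by the virtual machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Smallest power of two >= v (returns 1 for v == 0, 0 on overflow). */
static uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    if (v > (UINT64_C(1) << 63))
        return 0;
    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Strict check: length is a multiple of 64 bytes and address is
 * aligned to length rounded up to the next power of two. A zero
 * length (the "query maximum length" case) passes trivially.
 */
static bool ccb_array_well_aligned(uint64_t address, uint64_t length)
{
    if (length % 64 != 0)
        return false;
    return (address & (round_up_pow2(length) - 1)) == 0;
}
```

For example, a 192-byte array (three CCBs) needs 256-byte alignment under this rule.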
A guest may choose to submit addresses used in this API function, including the CCB array address,
as either real or virtual addresses, with the type of each address indicated in flags. Virtual addresses
must be present in either the TLB or an active TSB to be processed. The translation context for virtual
addresses is determined by a combination of CCB contents and the flags argument.
The flags argument is divided into multiple fields defined as follows:
Bits Field Description
[63:16] Reserved
[15] Disable ADI for VA reads (in API 2.0)
Reserved (in API 1.0)
[14] Virtual addresses within CCBs are translated in privileged context
[13:12] Alternate translation context for virtual addresses within CCBs:
0b'00 CCBs requesting alternate context are rejected
0b'01 Reserved
0b'10 CCBs requesting alternate context use secondary context
0b'11 CCBs requesting alternate context use nucleus context
[11:9] Reserved
[8] Queue info flag
[7] All-or-nothing flag
[6] If address is a virtual address, treat its translation context as privileged
[5:4] Address type of address:
0b'00 Real address
0b'01 Virtual address in primary context
0b'10 Virtual address in secondary context
0b'11 Virtual address in nucleus context
[3:2] Reserved
[1:0] CCB command type:
0b'00 Reserved
0b'01 Reserved
0b'10 Query command
0b'11 Reserved
The CCB submission type and address type for the CCB array must be provided in the flags argument.
All other fields are optional values which change the default behavior of the CCB processing.
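The flags layout can be expressed as a set of constants plus a basic sanity check, as sketched here. All macro and function names are hypothetical and chosen for this example only. The check treats bit 15 as usable, which is only true at major version 2 of the API group.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the ccb_submit flags fields. */
#define CCB_F_CMD_QUERY       (UINT64_C(2) << 0)   /* [1:0] = 0b10  */
#define CCB_F_ADDR_REAL       (UINT64_C(0) << 4)   /* [5:4]         */
#define CCB_F_ADDR_VA_PRI     (UINT64_C(1) << 4)
#define CCB_F_ADDR_VA_SEC     (UINT64_C(2) << 4)
#define CCB_F_ADDR_VA_NUC     (UINT64_C(3) << 4)
#define CCB_F_ADDR_PRIV       (UINT64_C(1) << 6)
#define CCB_F_ALL_OR_NOTHING  (UINT64_C(1) << 7)
#define CCB_F_QUEUE_INFO      (UINT64_C(1) << 8)
#define CCB_F_ALT_REJECT      (UINT64_C(0) << 12)  /* [13:12]       */
#define CCB_F_ALT_SECONDARY   (UINT64_C(2) << 12)
#define CCB_F_ALT_NUCLEUS     (UINT64_C(3) << 12)
#define CCB_F_CCB_VA_PRIV     (UINT64_C(1) << 14)
#define CCB_F_NO_ADI_VA_READ  (UINT64_C(1) << 15)  /* API 2.0 only  */

/* Reserved bits: [63:16], [11:9] and [3:2]. */
#define CCB_F_RESERVED_MASK \
    (~UINT64_C(0xFFFF) | (UINT64_C(7) << 9) | (UINT64_C(3) << 2))

static bool ccb_flags_valid(uint64_t flags)
{
    if (flags & CCB_F_RESERVED_MASK)
        return false;
    if ((flags & 3) != 2)              /* only "Query command" is defined */
        return false;
    if (((flags >> 12) & 3) == 1)      /* 0b01 alternate context reserved */
        return false;
    return true;
}
```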
When set to one, the "Disable ADI for VA reads" bit will turn off ADI checking when using a virtual
address to load data. ADI checking will still be done when loading real-addressed memory. This bit is only
available when using major version 2 of the coprocessor API group; at major version 1 it is reserved. For
more information about using ADI and DAX, see Section 36.2.1.1.7, “Application Data Integrity (ADI)”.
By default, all virtual addresses are treated as user addresses. If the virtual address translations are
privileged, they must be marked as such in the appropriate flags field. The virtual addresses used within
the submitted CCBs must all be translated with the same privilege level.
By default, all virtual addresses used within the submitted CCBs are translated using the primary context
active at the time of the submission. The address type field within a CCB allows each address to request
translation in an alternate address context. The address context used when the alternate address context is
requested is selected in the flags argument.
The all-or-nothing flag specifies whether the virtual machine should allow partial submissions of the
input CCB array. When using CCBs with serial-conditional flags, it is strongly recommended to use
the all-or-nothing flag to avoid broken conditional chains. Using long CCB chains on a machine under
high coprocessor load may make this impractical, however, and require submitting without the flag.
When submitting serial-conditional CCBs without the all-or-nothing flag, guest software must manually
implement the serial-conditional behavior at any point where the chain was not submitted in a single API
call, and resubmission of the remaining CCBs should clear any conditional flag that might be set in the
first remaining CCB. Failure to do so will produce indeterminate CCB execution status and ordering.
When the all-or-nothing flag is not specified, callers should check the value of length in ret1 to determine
how many CCBs from the array were successfully submitted. Any remaining CCBs can be resubmitted
without modifications.
The value of length in ret1 is also valid when the API call returns an error, and callers should always
check its value to determine which CCBs in the array were already processed. This will additionally
identify which CCB encountered the processing error, and was not submitted successfully.
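One way a guest might handle partial submissions is sketched below. The hypercall itself is abstracted behind a hypothetical function pointer (ccb_submit_fn), since the real trap interface is platform code, and fake_submit is a stand-in used only to demonstrate the loop. The sketch assumes a status of 0 means EOK, that the queue info flag is not set (so ret1 is a plain byte count), and that no serial-conditional flags are in use. A real caller would also have to keep each resubmitted sub-array within the alignment rule for address.

```c
#include <stdint.h>

/* Hypothetical wrapper type: returns the status (ret0) and stores the
 * number of bytes consumed (ret1) in *consumed. */
typedef int (*ccb_submit_fn)(uint64_t address, uint64_t length,
                             uint64_t flags, uint64_t *consumed);

/*
 * Submit the whole array, resubmitting the unconsumed tail after each
 * partial submission. Returns 0 when everything was submitted, the
 * failing status on error, or -1 if max_calls was exhausted.
 * *submitted is updated even on error, since ret1 is valid then too.
 */
static int submit_all(ccb_submit_fn submit, uint64_t address,
                      uint64_t length, uint64_t flags,
                      unsigned max_calls, uint64_t *submitted)
{
    *submitted = 0;
    for (unsigned i = 0; i < max_calls && *submitted < length; i++) {
        uint64_t consumed = 0;
        int status = submit(address + *submitted, length - *submitted,
                            flags, &consumed);

        *submitted += consumed;
        if (status != 0)
            return status;
    }
    return *submitted == length ? 0 : -1;
}

/* Demonstration stand-in, NOT the hypervisor: accepts at most two
 * CCBs (128 bytes) per call. */
static unsigned fake_calls;

static int fake_submit(uint64_t address, uint64_t length,
                       uint64_t flags, uint64_t *consumed)
{
    (void)address;
    (void)flags;
    fake_calls++;
    *consumed = length < 128 ? length : 128;
    return 0;
}
```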
If the queue info flag is used during submission, and at least one CCB was successfully submitted, the
length value in ret1 will be a multi-field value defined as follows:
Bits Field Description
[63:48] DAX unit instance identifier
[47:32] DAX queue instance identifier
[31:16] Reserved
[15:0] Number of CCB bytes successfully submitted
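Decoding this multi-field value is plain bit extraction. The structure and function names in the following sketch are hypothetical.

```c
#include <stdint.h>

struct ccb_queue_info {
    uint16_t dax_unit;         /* ret1 bits [63:48] */
    uint16_t dax_queue;        /* ret1 bits [47:32] */
    uint16_t bytes_submitted;  /* ret1 bits [15:0]  */
};

/* Only meaningful when the queue info flag was set and at least one
 * CCB was submitted successfully. */
static struct ccb_queue_info ccb_queue_info_decode(uint64_t ret1)
{
    struct ccb_queue_info qi;

    qi.dax_unit = (uint16_t)(ret1 >> 48);
    qi.dax_queue = (uint16_t)(ret1 >> 32);
    qi.bytes_submitted = (uint16_t)ret1;
    return qi;
}
```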
The value of status data depends on the status value. See error status code descriptions for details.
The value is undefined for status values that do not specifically list a value for the status data.
The API has a reserved input and output register which will be used in subsequent minor versions of this
API function. Guest software implementations should treat that register as volatile across the function call
in order to maintain forward compatibility.
36.3.1.1. Errors
EOK One or more CCBs have been accepted and enqueued in the virtual machine
and no errors were encountered during submission. Some submitted
CCBs may not have been enqueued due to internal virtual machine limitations,
and may be resubmitted without changes.
EWOULDBLOCK An internal resource conflict within the virtual machine has prevented it from
being able to complete the CCB submissions sufficiently quickly, requiring
it to abandon processing before it was complete. Some CCBs may have been
successfully enqueued prior to the block, and all remaining CCBs may be
resubmitted without changes.
EBADALIGN CCB array is not on a 64-byte boundary, or the array length is not a multiple
of 64 bytes.
ENORADDR A real address used either for the CCB array, or within one of the submitted
CCBs, is not valid for the guest. Some CCBs may have been enqueued prior
to the error being detected.
ENOMAP A virtual address used either for the CCB array, or within one of the submitted
CCBs, could not be translated by the virtual machine using either the TLB
or TSB contents. The submission may be retried after adding the required
mapping, or by converting the virtual address into a real address. Due to the
shared nature of address translation resources, there is no theoretical limit on
the number of times the translation may fail, and it is recommended all guests
implement some real address based backup. The virtual address which failed
translation is returned as status data in ret2. Some CCBs may have been
enqueued prior to the error being detected.
EINVAL The virtual machine detected an invalid CCB during submission, or invalid
input arguments, such as bad flag values. Note that not all invalid CCB values
will be detected during submission, and some may be reported as errors in the
completion area instead. Some CCBs may have been enqueued prior to the
error being detected. This error may be returned if the CCB version is invalid.
ETOOMANY The request was submitted with the all-or-nothing flag set, and the array size is
greater than the virtual machine can support in a single request. The maximum
supported size for the current virtual machine can be queried by submitting a
request with a zero length array, as described above.
ENOACCESS The guest does not have permission to submit CCBs, or an address used in a
CCB lacks sufficient permissions to perform the required operation (no write
permission on the destination buffer address, for example). A virtual address
which fails permission checking is returned as status data in ret2. Some
CCBs may have been enqueued prior to the error being detected.
EUNAVAILABLE The requested CCB operation could not be performed at this time. The
restricted operation availability may apply only to the first unsuccessfully
submitted CCB, or may apply to a larger scope. The status should not be
interpreted as permanent, and the guest should attempt to submit CCBs in
the future which had previously been unable to be performed. The status
data provides additional information about the scope of the restricted availability
as follows:
Value Description
0 Processing for the exact CCB instance submitted was unavailable,
and it is recommended the guest emulate the operation. The
guest should continue to submit all other CCBs, and assume no
restrictions beyond this exact CCB instance.
1 Processing is unavailable for all CCBs using the requested opcode,
and it is recommended the guest emulate the operation. The
guest should continue to submit all other CCBs that use different
opcodes, but can expect continued rejections of CCBs using the
same opcode in the near future.
2 Processing is unavailable for all CCBs using the requested CCB
version, and it is recommended the guest emulate the operation.
The guest should continue to submit all other CCBs that use
different CCB versions, but can expect continued rejections of
CCBs using the same CCB version in the near future.
3 Processing is unavailable for all CCBs on the submitting vcpu,
and it is recommended the guest emulate the operation or resubmit
the CCB on a different vcpu. The guest should continue to submit
CCBs on all other vcpus but can expect continued rejections of all
CCBs on this vcpu in the near future.
4 Processing is unavailable for all CCBs, and it is recommended
the guest emulate the operation. The guest should expect all CCB
submissions to be similarly rejected in the near future.
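A guest can use the EUNAVAILABLE scope to decide whether a later CCB is worth submitting or should go straight to emulation. The sketch below is one possible reading of the table; the type and function names are hypothetical, and "likely rejected" only reflects the "near future" guidance above, not a permanent state.

```c
#include <stdbool.h>

enum unavail_scope {
    UNAVAIL_THIS_CCB    = 0,
    UNAVAIL_OPCODE      = 1,
    UNAVAIL_CCB_VERSION = 2,
    UNAVAIL_VCPU        = 3,
    UNAVAIL_ALL         = 4
};

/* Hypothetical summary of the properties the scopes refer to. */
struct ccb_ident {
    unsigned opcode;
    unsigned ccb_version;
    unsigned vcpu;
};

/* Would `next` probably be rejected, given that `failed` just got
 * EUNAVAILABLE with the given scope? */
static bool ccb_likely_rejected(enum unavail_scope scope,
                                const struct ccb_ident *failed,
                                const struct ccb_ident *next)
{
    switch (scope) {
    case UNAVAIL_THIS_CCB:    return false;
    case UNAVAIL_OPCODE:      return next->opcode == failed->opcode;
    case UNAVAIL_CCB_VERSION: return next->ccb_version == failed->ccb_version;
    case UNAVAIL_VCPU:        return next->vcpu == failed->vcpu;
    case UNAVAIL_ALL:         return true;
    }
    return false;  /* unknown scope values: assume no restriction */
}
```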
36.3.2. ccb_info
trap# FAST_TRAP
function# CCB_INFO
arg0 address
ret0 status
ret1 CCB state
ret2 position
ret3 dax
ret4 queue
Requests status information on a previously submitted CCB. The previously submitted CCB is identified
by the 64-byte aligned real address of the CCB's completion area.
A CCB can be in one of 4 states:
State Value Description
COMPLETED 0 The CCB has been fetched and executed, and is no longer active in
the virtual machine.
ENQUEUED 1 The requested CCB is currently in a queue awaiting execution.
INPROGRESS 2 The CCB has been fetched and is currently being executed. It may still
be possible to stop the execution using the ccb_kill hypercall.
NOTFOUND 3 The CCB could not be located in the virtual machine, and does not
appear to have been executed. This may occur if the CCB was lost
due to a hardware error, or the CCB may not have been successfully
submitted to the virtual machine in the first place.
Implementation note
Some platforms may not be able to report CCBs that are currently being processed, and therefore
guest software should invoke the ccb_kill hypercall prior to assuming the requested CCB will never
be executed because it was in the NOTFOUND state.
The position return value is only valid when the state is ENQUEUED. The value returned is the number
of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute.
The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit
instance identifier for the DAX unit processing the queue where the requested CCB is located. The value
matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX
queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The
value matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
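The validity rules for the ccb_info return values can be captured in a small decoder, sketched below with hypothetical names. It keeps position, dax and queue only in the ENQUEUED state, and flags NOTFOUND as a state that should be confirmed with ccb_kill, following the implementation note above.

```c
#include <stdbool.h>
#include <stdint.h>

enum ccb_info_state {
    CCB_INFO_COMPLETED  = 0,
    CCB_INFO_ENQUEUED   = 1,
    CCB_INFO_INPROGRESS = 2,
    CCB_INFO_NOTFOUND   = 3
};

struct ccb_info_result {
    enum ccb_info_state state;
    bool queue_fields_valid;  /* true only when ENQUEUED */
    uint64_t position;
    uint64_t dax;
    uint64_t queue;
};

/* ret1..ret4 are the raw return registers of ccb_info. */
static struct ccb_info_result ccb_info_decode(uint64_t ret1, uint64_t ret2,
                                              uint64_t ret3, uint64_t ret4)
{
    struct ccb_info_result r = { (enum ccb_info_state)ret1, false, 0, 0, 0 };

    if (r.state == CCB_INFO_ENQUEUED) {
        r.queue_fields_valid = true;
        r.position = ret2;
        r.dax = ret3;
        r.queue = ret4;
    }
    return r;
}

/* NOTFOUND alone is not conclusive on platforms that cannot report
 * in-progress CCBs, so it should be confirmed with ccb_kill. */
static bool ccb_info_confirm_with_kill(enum ccb_info_state state)
{
    return state == CCB_INFO_NOTFOUND;
}
```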
36.3.2.1. Errors
EOK The request was processed and the CCB state is valid.
EBADALIGN address is not 64-byte aligned.
ENORADDR The real address provided for address is not valid.
EINVAL The CCB completion area contents are not valid.
EWOULDBLOCK Internal resource constraints prevented the CCB state from being queried at this
time. The guest should retry the request.
ENOACCESS The guest does not have permission to access the coprocessor virtual device
functionality.
36.3.3. ccb_kill
trap# FAST_TRAP
function# CCB_KILL
arg0 address
ret0 status
ret1 result
Request to stop execution of a previously submitted CCB. The previously submitted CCB is identified by
the 64-byte aligned real address of the CCB's completion area.
The kill attempt can produce one of several values in the result return value, reflecting the CCB state
and actions taken by the Hypervisor:
Result Value Description
COMPLETED 0 The CCB has been fetched and executed, and is no longer active in
the virtual machine. It could not be killed and no action was taken.
DEQUEUED 1 The requested CCB was still enqueued when the kill request was
submitted, and has been removed from the queue. Since the CCB
never began execution, no memory modifications were produced by
it, and the completion area will never be updated. The same CCB may
be submitted again, if desired, with no modifications required.
KILLED 2 The CCB had been fetched and was being executed when the kill
request was submitted. The CCB execution was stopped, and the CCB
is no longer active in the virtual machine. The CCB completion area
will reflect the killed status, with the subsequent implications that
partial results may have been produced. Partial results may include full
command execution if the command was stopped just prior to writing
to the completion area.
NOTFOUND 3 The CCB could not be located in the virtual machine, and does not
appear to have been executed. This may occur if the CCB was lost
due to a hardware error, or the CCB may not have been successfully
submitted to the virtual machine in the first place. CCBs in this state
are guaranteed to never execute in the future unless resubmitted.
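The consequences of each ccb_kill result can be summarized by two predicates, sketched here with hypothetical names. The table above states explicitly that a DEQUEUED CCB may be resubmitted unmodified and that a KILLED CCB's completion area will show the killed status. Returning false for resubmission in the other cases is a conservative assumption of this sketch, not a rule from the table.

```c
#include <stdbool.h>

enum ccb_kill_result {
    CCB_KILL_COMPLETED = 0,
    CCB_KILL_DEQUEUED  = 1,
    CCB_KILL_KILLED    = 2,
    CCB_KILL_NOTFOUND  = 3
};

/* DEQUEUED: the CCB never ran and modified no memory, so the same
 * CCB may be submitted again as-is. */
static bool ccb_kill_may_resubmit_unchanged(enum ccb_kill_result r)
{
    return r == CCB_KILL_DEQUEUED;
}

/* KILLED: the completion area will reflect the killed status, and
 * partial (or even full) results may have been produced. */
static bool ccb_kill_expect_killed_status(enum ccb_kill_result r)
{
    return r == CCB_KILL_KILLED;
}
```

Note that for DEQUEUED the completion area is never updated, so a caller polling it must stop waiting once ccb_kill reports that result.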
36.3.3.1. Interactions with Pipelined CCBs
If the pipeline target CCB is killed but the pipeline source CCB was skipped, the completion area of the
target CCB may contain status (4,0) "Command was skipped" instead of (3,7) "Command was killed".
If the pipeline source CCB is killed, the pipeline target CCB's completion status may read (1,0) "Success".
This does not mean the target CCB was processed; since the source CCB was killed, there was no
meaningful output on which the target CCB could operate.
36.3.3.2. Errors
EOK The request was processed and the result is valid.
EBADALIGN address is not 64-byte aligned.
ENORADDR The real address provided for address is not valid.
EINVAL The CCB completion area contents are not valid.
EWOULDBLOCK Internal resource constraints prevented the CCB from being killed at this time.
The guest should retry the request.
ENOACCESS The guest does not have permission to access the coprocessor virtual device
functionality.
36.3.4. dax_info
trap# FAST_TRAP
function# DAX_INFO
ret0 status
ret1 Number of enabled DAX units
ret2 Number of disabled DAX units
Returns the number of DAX units that are enabled for the calling guest to submit CCBs. The number of
DAX units that are disabled for the calling guest is also returned. A disabled DAX unit would have been
available for CCB submission to the calling guest had it not been offlined.
36.3.4.1. Errors
EOK The request was processed and the numbers of enabled and disabled DAX units
are valid.
Oracle Data Analytics Accelerator (DAX)
---------------------------------------
DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
(DAX2) processor chips, and has direct access to the CPU's L3 caches
as well as physical memory. It can perform several operations on data
streams with various input and output formats. A driver provides a
transport mechanism and has limited knowledge of the various opcodes
and data formats. A user space library provides high level services
and translates these into low level commands which are then passed
into the driver and subsequently the Hypervisor and the coprocessor.
The library is the recommended way for applications to use the
coprocessor, and the driver interface is not intended for general use.
This document describes the general flow of the driver, its
structures, and its programmatic interface. It also provides example
code sufficient to write user or kernel applications that use DAX
functionality.
The user library is open source and available at:
https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
The Hypervisor interface to the coprocessor is described in detail in
the accompanying document, dax-hv-api.txt, which is a plain text
excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
Specification" version 3.0.20+15, dated 2017-09-25.
High Level Overview
-------------------
A coprocessor request is described by a Command Control Block
(CCB). The CCB contains an opcode and various parameters. The opcode
specifies what operation is to be done, and the parameters specify
options, flags, sizes, and addresses. The CCB (or an array of CCBs)
is passed to the Hypervisor, which handles queueing and scheduling of
requests to the available coprocessor execution units. A status code
returned indicates if the request was submitted successfully or if
there was an error. One of the addresses given in each CCB is a
pointer to a "completion area", which is a 128 byte memory block that
is written by the coprocessor to provide execution status. No
interrupt is generated upon completion; the completion area must be
polled by software to find out when a transaction has finished, but
the M7 and later processors provide a mechanism to pause the virtual
processor until the completion status has been updated by the
coprocessor. This is done using the monitored load and mwait
instructions, which are described in more detail later. The DAX
coprocessor was designed so that after a request is submitted, the
kernel is no longer involved in the processing of it. The polling is
done at the user level, which results in almost zero latency between
completion of a request and resumption of execution of the requesting
thread.
Addressing Memory
-----------------
The kernel does not have access to physical memory in the Sun4v
architecture, as there is an additional level of memory virtualization
present. This intermediate level is called "real" memory, and the
kernel treats this as if it were physical. The Hypervisor handles the
translations between real memory and physical so that each logical
domain (LDOM) can have a partition of physical memory that is isolated
from that of other LDOMs. When the kernel sets up a virtual mapping,
it specifies a virtual address and the real address to which it should
be mapped.
The DAX coprocessor can only operate on physical memory, so before a
request can be fed to the coprocessor, all the addresses in a CCB must
be converted into physical addresses. The kernel cannot do this since
it has no visibility into physical addresses. Instead, a CCB may
contain virtual addresses, real addresses, or a combination of the
two. An "address type" field is available for each address that
may be given in the CCB. In all cases, the Hypervisor will translate
all the addresses to physical before dispatching to hardware. Address
translations are performed using the context of the process initiating
the request.
The Driver API
--------------
An application makes requests to the driver via the write() system
call, and gets results (if any) via read(). The completion areas are
made accessible via mmap(), and are read-only for the application.
The request may either be an immediate command or an array of CCBs to
be submitted to the hardware.
Each open instance of the device is exclusive to the thread that
opened it, and must be used by that thread for all subsequent
operations. The driver open function creates a new context for the
thread and initializes it for use. This context contains pointers and
values used internally by the driver to keep track of submitted
requests. The completion area buffer is also allocated, and this is
large enough to contain the completion areas for many concurrent
requests. When the device is closed, any outstanding transactions are
flushed and the context is cleaned up.
On a DAX1 system (M7), the device will be called "oradax1", while on a
DAX2 system (M8) it will be "oradax2". If an application requires one
or the other, it should simply attempt to open the appropriate
device. Only one of the devices will exist on any given system, so the
name can be used to determine what the platform supports.
The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
all of these, success is indicated by a return value from write()
equal to the number of bytes given in the call. Otherwise -1 is
returned and errno is set.
CCB_DEQUEUE
Tells the driver to clean up resources associated with past
requests. Since no interrupt is generated upon the completion of a
request, the driver must be told when it may reclaim resources. No
further status information is returned, so the user should not
subsequently call read().
CCB_KILL
Kills a CCB during execution. The CCB is guaranteed to not continue
executing once this call returns successfully. On success, read() must
be called to retrieve the result of the action.
CCB_INFO
Retrieves information about a currently executing CCB. Note that some
Hypervisors might return 'notfound' when the CCB is in 'inprogress'
state. To ensure a CCB in the 'notfound' state will never be executed,
CCB_KILL must be invoked on that CCB. Upon success, read() must be
called to retrieve the details of the action.
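The write-then-read sequence for an immediate command can be sketched
as follows. This is a minimal illustration, not code provided by the
driver: dax_kill_ccb() is a hypothetical helper, the structure
definitions are local copies of those in oradax.h, and the sketch
assumes that read() returns exactly sizeof(struct ccb_kill_result)
bytes on success.

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Local mirrors of definitions from arch/sparc/include/uapi/asm/oradax.h;
 * real code should include that header instead. */
#define CCB_KILL 0

struct dax_command {
	uint16_t command;	/* CCB_KILL/INFO/DEQUEUE */
	uint16_t ca_offset;	/* offset into mmapped completion area */
};

struct ccb_kill_result {
	uint16_t action;	/* action taken to kill ccb */
};

/* Hypothetical helper: kill the CCB whose completion area is at ca_offset.
 * Returns 0 and stores the reported action, or -1 on failure. */
static int dax_kill_ccb(int fd, uint16_t ca_offset, uint16_t *action)
{
	struct dax_command cmd = { .command = CCB_KILL, .ca_offset = ca_offset };
	struct ccb_kill_result res;

	assert(action != NULL);

	/* Success is a return value equal to the number of bytes given. */
	if (write(fd, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
		return -1;

	/* CCB_KILL requires a follow-up read() to fetch the result. */
	if (read(fd, &res, sizeof(res)) != (ssize_t)sizeof(res))
		return -1;

	*action = res.action;
	return 0;
}
```

The same pattern would apply to CCB_INFO with struct ccb_info_result,
while CCB_DEQUEUE stops after the write() because no result is
returned.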
Submission of an array of CCBs for execution
A write() whose length is a multiple of the CCB size is treated as a
submit operation. The file offset is treated as the index of the
completion area to use, and may be set via lseek() or using the
pwrite() system call. If -1 is returned then errno is set to indicate
the error. Otherwise, the return value is the length of the array that
was actually accepted by the coprocessor. If the accepted length is
equal to the requested length, then the submission was completely
successful and there is no further status needed; hence, the user
should not subsequently call read(). Partial acceptance of the CCB
array is indicated by a return value less than the requested length,
and read() must be called to retrieve further status information. The
status will reflect the error caused by the first CCB that was not
accepted, and status_data will provide additional data in some cases.
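The three possible outcomes of a submit can be told apart from the
return value alone. The sketch below is illustrative: the enum and the
classify_submit() helper are hypothetical names, and it assumes the
accepted length is reported in bytes, as the requested length is.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64	/* one CCB is 8 64-bit words */

enum submit_outcome {
	SUBMIT_ERROR,		/* -1 returned, consult errno */
	SUBMIT_PARTIAL,		/* some CCBs rejected, call read() for status */
	SUBMIT_COMPLETE		/* everything accepted, do not call read() */
};

/* Classify the return value of write()/pwrite() for a CCB array of
 * 'requested' bytes and report how many whole CCBs were accepted. */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested,
					   size_t *ccbs_accepted)
{
	assert(requested % CCB_SIZE == 0);
	assert(ccbs_accepted != NULL);

	if (ret < 0) {
		*ccbs_accepted = 0;
		return SUBMIT_ERROR;
	}
	*ccbs_accepted = (size_t)ret / CCB_SIZE;
	return (size_t)ret == requested ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```

Only the SUBMIT_PARTIAL case calls for a subsequent read() of a struct
ccb_exec_result.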
MMAP
The mmap() function provides access to the completion area allocated
in the driver. Note that the completion area is not writeable by the
user process, and the mmap call must not specify PROT_WRITE.
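Once the mapping exists, an individual completion area can be located
from its index. The following is a sketch under stated assumptions:
dax_ca_at() is a hypothetical helper, each completion area is taken to
be 128 bytes as described in the overview, and the areas are assumed
to be laid out back to back from the start of the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DAX_MMAP_LEN	(16 * 1024)	/* from oradax.h */
#define DAX_CA_SIZE	128		/* size of one completion area */
#define DAX_CA_COUNT	(DAX_MMAP_LEN / DAX_CA_SIZE)

/* Return a pointer to completion area 'idx' inside the read-only
 * mapping, or NULL if the index is out of range. */
static const volatile uint8_t *dax_ca_at(const volatile uint8_t *mapping,
					 size_t idx)
{
	assert(mapping != NULL);

	if (idx >= DAX_CA_COUNT)
		return NULL;
	return mapping + idx * DAX_CA_SIZE;
}
```

Under the same assumption, idx * DAX_CA_SIZE would also be the byte
offset to pass as ca_offset in an immediate command.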
Completion of a Request
-----------------------
The first byte in each completion area is the command status which is
updated by the coprocessor hardware. Software may take advantage of
new M7/M8 processor capabilities to efficiently poll this status byte.
First, a "monitored load" is achieved via a Load from Alternate Space
(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
"monitored wait" is achieved via the mwait instruction (a write to
%asr28). This instruction is like pause in that it suspends execution
of the virtual processor for the given number of nanoseconds, but in
addition will terminate early when one of several events occurs. If the
block of data containing the monitored location is modified, then the
mwait terminates. This causes software to resume execution immediately
(without a context switch or kernel to user transition) after a
transaction completes. Thus the latency between transaction completion
and resumption of execution may be just a few nanoseconds.
Application Life Cycle of a DAX Submission
------------------------------------------
- open dax device
- call mmap() to get the completion area address
- allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
- submit CCB via write() or pwrite()
- go into a loop executing monitored load + monitored wait and
  terminate when the command status indicates the request is complete
  (CCB_KILL or CCB_INFO may be used any time as necessary)
- perform a CCB_DEQUEUE
- call munmap() for completion area
- close the dax device
Memory Constraints
------------------
The DAX hardware operates only on physical addresses. Therefore, it is
not aware of virtual memory mappings and the discontiguities that may
exist in the physical memory that a virtual buffer maps to. There is
no I/O TLB or any scatter/gather mechanism. All buffers, whether input
or output, must reside in a physically contiguous region of memory.
The Hypervisor translates all addresses within a CCB to physical
before handing off the CCB to DAX. The Hypervisor determines the
virtual page size for each virtual address given, and uses this to
program a size limit for each address. This prevents the coprocessor
from reading or writing beyond the bound of the virtual page, even
though it is accessing physical memory directly. A simpler way of
saying this is that a DAX operation will never "cross" a virtual page
boundary. If an 8k virtual page is used, then the data is strictly
limited to 8k. If a user's buffer is larger than 8k, then a larger
page size must be used, or the transaction size will be truncated to
8k.
Huge pages. A user may allocate huge pages using standard interfaces.
Memory buffers residing on huge pages may be used to achieve much
larger DAX transaction sizes, but the rules must still be followed,
and no transaction will cross a page boundary, even a huge page. A
major caveat is that Linux on SPARC presents 8 MB as one of the huge
page sizes. SPARC does not actually provide an 8 MB hardware page size,
and this size is synthesized by pasting together two 4 MB pages. The
reasons for this are historical, and it creates an issue because only
half of this 8 MB page can actually be used for any given buffer in a
DAX request, and it must be either the first half or the second half;
it cannot be a 4 MB chunk in the middle, since that crosses a
(hardware) page boundary. Note that this entire issue may be hidden by
higher level libraries.
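An application that manages its own buffers can check the page
constraint before building a CCB. The helper below is a hypothetical
sketch: it assumes the caller knows the hardware page size backing the
buffer (for example 8 KB, or 4 MB for each half of an 8 MB huge page)
and that addr + len does not wrap around the address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the buffer [addr, addr + len) lies entirely within one
 * page of page_size bytes. page_size must be a power of two. */
static bool dax_fits_in_page(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t mask;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if (len == 0 || len > page_size)
		return false;

	mask = ~(uintptr_t)(page_size - 1);
	/* First and last byte must share the same page base address. */
	return (addr & mask) == ((addr + len - 1) & mask);
}
```

With a 4 MB hardware page size, a 4 MB buffer passes the check only
when it starts on a 4 MB boundary, which corresponds to the first or
second half of a synthesized 8 MB page.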
CCB Structure
-------------
A CCB is an array of 8 64-bit words. Several of these words provide
command opcodes, parameters, flags, etc., and the rest are addresses
for the completion area, output buffer, and various inputs:
struct ccb {
	u64 control;
	u64 completion;
	u64 input0;
	u64 access;
	u64 input1;
	u64 op_data;
	u64 output;
	u64 table;
};
See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
each of these fields, and see dax-hv-api.txt for a complete description
of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
The first word (control) is examined by the driver for the following:
- CCB version, which must be consistent with hardware version
- Opcode, which must be one of the documented allowable commands
- Address types, which must be set to "virtual" for all the addresses
given by the user, thereby ensuring that the application can
only access memory that it owns
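The address type check can be pictured with the following sketch. It
is not the driver's code: the helper names are hypothetical, the bit
positions are derived from the dax_header bit-field comments in the
driver source (shifted into the upper half of the 64-bit control
word), and an unused address is assumed to carry the type "none" (0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DAX_ADDR_TYPE_NONE	0
#define DAX_ADDR_TYPE_VA	3	/* virtual address */

/* Extract 'width' bits starting at bit 'shift' from the control word. */
static unsigned int ccb_field(uint64_t control, unsigned int shift,
			      unsigned int width)
{
	assert(width > 0 && width < 32 && shift + width <= 64);
	return (unsigned int)((control >> shift) & ((1ULL << width) - 1));
}

static bool addr_type_ok(unsigned int type)
{
	return type == DAX_ADDR_TYPE_NONE || type == DAX_ADDR_TYPE_VA;
}

/* Accept a control word only if every user-supplied address is either
 * absent or typed as a virtual address. */
static bool ccb_addr_types_ok(uint64_t control)
{
	return addr_type_ok(ccb_field(control, 43, 2)) &&	/* table */
	       addr_type_ok(ccb_field(control, 40, 3)) &&	/* output */
	       addr_type_ok(ccb_field(control, 37, 3)) &&	/* secondary */
	       addr_type_ok(ccb_field(control, 34, 3));	/* primary */
}
```

A control word that names a real address (type 2) for any buffer would
be rejected by this check, which is what keeps an application from
reaching memory it does not own.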
Example Code
------------
The DAX is accessible to both user and kernel code. The kernel code
can make hypercalls directly while the user code must use wrappers
provided by the driver. The setup of the CCB is nearly identical for
both; the only difference is in preparation of the completion area. An
example of user code is given now, with kernel code afterwards.
In order to program using the driver API, the file
arch/sparc/include/uapi/asm/oradax.h must be included.
First, the proper device must be opened. For M7 it will be
/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
procedure is to attempt to open both, as only one will succeed:
fd = open("/dev/oradax1", O_RDWR);
if (fd < 0)
	fd = open("/dev/oradax2", O_RDWR);
if (fd < 0) {
	/* No DAX found */
}
Next, the completion area must be mapped:
completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
All input and output buffers must be fully contained in one hardware
page because, as explained above, a DAX operation never crosses a
virtual page boundary. In addition, the output buffer must be
64-byte aligned and its size must be a multiple of 64 bytes because
the coprocessor writes in units of cache lines.
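One way to satisfy the output alignment rule is sketched below, using
the C11 aligned_alloc() function. The helper names are hypothetical.
Note that 64-byte alignment by itself does not guarantee that the
buffer stays within a single hardware page; that must be ensured
separately.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DAX_OUT_ALIGN 64	/* output is written in 64-byte units */

/* Round a byte count up to the next multiple of 64. */
static size_t dax_round_up_64(size_t nbytes)
{
	return (nbytes + DAX_OUT_ALIGN - 1) & ~(size_t)(DAX_OUT_ALIGN - 1);
}

/* Allocate a 64-byte aligned buffer whose size is a multiple of 64 and
 * at least nbytes. The rounded size is stored in *alloc_len. Returns
 * NULL on allocation failure; the caller releases it with free(). */
static void *dax_alloc_output(size_t nbytes, size_t *alloc_len)
{
	assert(nbytes > 0 && alloc_len != NULL);

	*alloc_len = dax_round_up_64(nbytes);
	return aligned_alloc(DAX_OUT_ALIGN, *alloc_len);
}
```

For the bitmap output used in this example, nbytes would be the input
bit count divided by 8 and rounded up.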
This example demonstrates the DAX Scan command, which takes as input a
vector and a match value, and produces a bitmap as the output. For
each input element that matches the value, the corresponding bit is
set in the output.
In this example, the input vector consists of a series of single bits,
and the match value is 0. So each 0 bit in the input will produce a 1
in the output, and vice versa, which produces an output bitmap which
is the input bitmap inverted.
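A software reference for this particular scan is useful when checking
the coprocessor's output. The function below is a hypothetical test
aid, not part of any DAX library, and it assumes bits are numbered
from the most significant bit of each byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference result for "scan for value 0" over a stream of 1-bit
 * elements: output bit i is set exactly when input bit i is 0. Output
 * bits at positions >= nbits are left untouched. */
static void scan_zero_reference(const uint8_t *in, uint8_t *out, size_t nbits)
{
	assert(in != NULL && out != NULL);

	for (size_t i = 0; i < nbits; i++) {
		size_t byte = i / 8;
		uint8_t mask = (uint8_t)(0x80u >> (i % 8));

		if (in[byte] & mask)
			out[byte] &= (uint8_t)~mask;
		else
			out[byte] |= mask;
	}
}
```

Comparing this reference against the DAX output for the first nbits
bits gives a simple end-to-end check of the CCB setup.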
For details of all the parameters and bits used in this CCB, please
refer to section 36.2.1.3 of the DAX Hypervisor API document, which
describes the Scan command in detail.
ccb->control = /* Table 36.1, CCB Header Format */
	  (2L << 48)	/* command = Scan Value */
	| (3L << 40)	/* output address type = primary virtual */
	| (3L << 34)	/* primary input address type = primary virtual */
			/* Section 36.2.1, Query CCB Command Formats */
	| (1 << 28)	/* 36.2.1.1.1 primary input format = fixed width bit packed */
	| (0 << 23)	/* 36.2.1.1.2 primary input element size = 0 (1 bit) */
	| (8 << 10)	/* 36.2.1.1.6 output format = bit vector */
	| (0 << 5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */
	| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */
ccb->completion = 0; /* Completion area address, to be filled in by driver */
ccb->input0 = (unsigned long) input; /* primary input address */
ccb->access = /* Section 36.2.1.2, Data Access Control */
	  (2 << 24)	/* Primary input length format = bits */
	| (nbits - 1);	/* number of bits in primary input stream, minus 1 */
ccb->input1 = 0; /* secondary input address, unused */
ccb->op_data = 0; /* scan criteria (value to be matched) */
ccb->output = (unsigned long) output; /* output address */
ccb->table = 0; /* table address, unused */
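The magic numbers in the control and access words can be given names
to make the setup easier to audit. The sketch below is one possible
arrangement: all macro and function names are hypothetical, the shift
values are the ones used in the example above, and the 24-bit limit on
the input length is taken from the pri_len bit-field in the driver
source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the field positions used in the example. */
#define CCB_OPCODE_SHIFT	48
#define CCB_OUT_ADDR_SHIFT	40
#define CCB_PRI_ADDR_SHIFT	34
#define CCB_PRI_FMT_SHIFT	28
#define CCB_PRI_ELEM_SHIFT	23
#define CCB_OUT_FMT_SHIFT	10
#define CCB_SCAN1_SIZE_SHIFT	5
#define CCB_SCAN2_SIZE_SHIFT	0

#define DAX_OP_SCAN_VALUE	2
#define DAX_ADDR_TYPE_VA	3

/* Control word for a Scan Value over 1-bit elements, bit vector output. */
static uint64_t scan_control_word(void)
{
	return ((uint64_t)DAX_OP_SCAN_VALUE << CCB_OPCODE_SHIFT)
	     | ((uint64_t)DAX_ADDR_TYPE_VA << CCB_OUT_ADDR_SHIFT)
	     | ((uint64_t)DAX_ADDR_TYPE_VA << CCB_PRI_ADDR_SHIFT)
	     | ((uint64_t)1 << CCB_PRI_FMT_SHIFT)	/* fixed width bit packed */
	     | ((uint64_t)0 << CCB_PRI_ELEM_SHIFT)	/* element size 0 (1 bit) */
	     | ((uint64_t)8 << CCB_OUT_FMT_SHIFT)	/* bit vector */
	     | ((uint64_t)0 << CCB_SCAN1_SIZE_SHIFT)	/* 1 byte criteria */
	     | ((uint64_t)31 << CCB_SCAN2_SIZE_SHIFT);	/* second criteria off */
}

/* Access word: length format "bits" and the bit count minus 1. */
static uint64_t scan_access_word(uint32_t nbits)
{
	assert(nbits >= 1 && nbits <= (1u << 24));
	return ((uint64_t)2 << 24) | (uint64_t)(nbits - 1);
}
```

Using uint64_t for every term also avoids relying on the width of int
or long in the shift expressions.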
The CCB submission is a write() or pwrite() system call to the
driver. If the call fails, then a read() must be used to retrieve the
status:
if (pwrite(fd, ccb, 64, 0) != 64) {
	struct ccb_exec_result status;
	read(fd, &status, sizeof(status));
	/* bail out */
}
After a successful submission of the CCB, the completion area may be
polled to determine when the DAX is finished. Detailed information on
the contents of the completion area can be found in section 36.2.2 of
the DAX HV API document.
while (1) {
	/* Monitored Load */
	__asm__ __volatile__("lduba [%1] 0x84, %0\n"
		: "=r" (status)
		: "r" (completion_area));
	if (status) /* 0 indicates command in progress */
		break;
	/* MWAIT */
	__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
}
A completion area status of 1 indicates successful completion of the
CCB and validity of the output bitmap, which may be used immediately.
All other non-zero values indicate error conditions which are
described in section 36.2.2.
if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
	/* completion_area[0] contains the completion status */
	/* completion_area[1] contains an error code, see 36.2.2 */
}
After the completion area has been processed, the driver must be
notified that it can release any resources associated with the
request. This is done via the dequeue operation:
struct dax_command cmd;
cmd.command = CCB_DEQUEUE;
if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
	/* bail out */
}
Finally, normal program cleanup should be done, i.e., unmapping
completion area, closing the dax device, freeing memory etc.
[Kernel example]
The only difference in using the DAX in kernel code is the treatment
of the completion area. Unlike user applications which mmap the
completion area allocated by the driver, kernel code must allocate its
own memory to use for the completion area, and this address and its
type must be given in the CCB:
ccb->control |= /* Table 36.1, CCB Header Format */
	(3L << 32); /* completion area address type = primary virtual */
ccb->completion = (unsigned long) completion_area; /* Completion area address */
The dax submit hypercall is made directly. The flags used in the
ccb_submit call are documented in the DAX HV API in section 36.3.1.
#include <asm/hypervisor.h>
hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
	HV_CCB_QUERY_CMD |
	HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
	HV_CCB_VA_PRIVILEGED,
	0, &bytes_accepted, &status_data);
if (hv_rv != HV_EOK) {
	/* hv_rv is an error code, status_data contains */
	/* potential additional status, see 36.3.1.1 */
}
After the submission, the completion area polling code is identical to
that in user land:
while (1) {
	/* Monitored Load */
	__asm__ __volatile__("lduba [%1] 0x84, %0\n"
		: "=r" (status)
		: "r" (completion_area));
	if (status) /* 0 indicates command in progress */
		break;
	/* MWAIT */
	__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
}
if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
	/* completion_area[0] contains the completion status */
	/* completion_area[1] contains an error code, see 36.2.2 */
}
The output bitmap is ready for consumption immediately after the
completion status indicates success.
@@ -76,6 +76,10 @@
#define HV_ETOOMANY 15 /* Too many items specified */
#define HV_ECHANNEL 16 /* Invalid LDC channel */
#define HV_EBUSY 17 /* Resource busy */
#define HV_EUNAVAILABLE 23 /* Resource or operation not
* currently available, but may
* become available in the future
*/
/* mach_exit()
* TRAP: HV_FAST_TRAP
@@ -941,6 +945,139 @@ unsigned long sun4v_mmu_map_perm_addr(unsigned long vaddr,
*/
#define HV_FAST_MEM_SYNC 0x32
/* Coprocessor services
*
* M7 and later processors provide an on-chip coprocessor which
* accelerates database operations, and is known internally as
* DAX.
*/
/* ccb_submit()
* TRAP: HV_FAST_TRAP
* FUNCTION: HV_CCB_SUBMIT
* ARG0: address of CCB array
* ARG1: size (in bytes) of CCB array being submitted
* ARG2: flags
* ARG3: reserved
* RET0: status (success or error code)
* RET1: size (in bytes) of CCB array that was accepted (might be less
* than arg1)
* RET2: status data
* if status == ENOMAP or ENOACCESS, identifies the VA in question
* if status == EUNAVAILABLE, unavailable code
* RET3: reserved
*
* ERRORS: EOK successful submission (check size)
* EWOULDBLOCK could not finish submissions, try again
* EBADALIGN array not 64B aligned or size not 64B multiple
* ENORADDR invalid RA for array or in CCB
* ENOMAP could not translate address (see status data)
* EINVAL invalid ccb or arguments
* ETOOMANY too many ccbs with all-or-nothing flag
* ENOACCESS guest has no access to submit ccbs or address
* in CCB does not have correct permissions (check
* status data)
* EUNAVAILABLE ccb operation could not be performed at this
* time (check status data)
* Status data codes:
* 0 - exact CCB could not be executed
* 1 - CCB opcode cannot be executed
* 2 - CCB version cannot be executed
* 3 - vcpu cannot execute CCBs
* 4 - no CCBs can be executed
*/
#define HV_CCB_SUBMIT 0x34
#ifndef __ASSEMBLY__
unsigned long sun4v_ccb_submit(unsigned long ccb_buf,
unsigned long len,
unsigned long flags,
unsigned long reserved,
void *submitted_len,
void *status_data);
#endif
/* flags (ARG2) */
#define HV_CCB_QUERY_CMD BIT(1)
#define HV_CCB_ARG0_TYPE_REAL 0UL
#define HV_CCB_ARG0_TYPE_PRIMARY BIT(4)
#define HV_CCB_ARG0_TYPE_SECONDARY BIT(5)
#define HV_CCB_ARG0_TYPE_NUCLEUS GENMASK(5, 4)
#define HV_CCB_ARG0_PRIVILEGED BIT(6)
#define HV_CCB_ALL_OR_NOTHING BIT(7)
#define HV_CCB_QUEUE_INFO BIT(8)
#define HV_CCB_VA_REJECT 0UL
#define HV_CCB_VA_SECONDARY BIT(13)
#define HV_CCB_VA_NUCLEUS GENMASK(13, 12)
#define HV_CCB_VA_PRIVILEGED BIT(14)
#define HV_CCB_VA_READ_ADI_DISABLE BIT(15) /* DAX2 only */
/* ccb_info()
* TRAP: HV_FAST_TRAP
* FUNCTION: HV_CCB_INFO
* ARG0: real address of CCB completion area
* RET0: status (success or error code)
* RET1: info array
* - RET1[0]: CCB state
* - RET1[1]: dax unit
* - RET1[2]: queue number
* - RET1[3]: queue position
*
* ERRORS: EOK operation successful
* EBADALIGN address not 64B aligned
* ENORADDR RA in address not valid
* EINVAL CA not valid
* EWOULDBLOCK info not available for this CCB currently, try
* again
* ENOACCESS guest cannot use dax
*/
#define HV_CCB_INFO 0x35
#ifndef __ASSEMBLY__
unsigned long sun4v_ccb_info(unsigned long ca,
void *info_arr);
#endif
/* info array byte offsets (RET1) */
#define CCB_INFO_OFFSET_CCB_STATE 0
#define CCB_INFO_OFFSET_DAX_UNIT 2
#define CCB_INFO_OFFSET_QUEUE_NUM 4
#define CCB_INFO_OFFSET_QUEUE_POS 6
/* CCB state (RET1[0]) */
#define HV_CCB_STATE_COMPLETED 0
#define HV_CCB_STATE_ENQUEUED 1
#define HV_CCB_STATE_INPROGRESS 2
#define HV_CCB_STATE_NOTFOUND 3
/* ccb_kill()
* TRAP: HV_FAST_TRAP
* FUNCTION: HV_CCB_KILL
* ARG0: real address of CCB completion area
* RET0: status (success or error code)
* RET1: CCB kill status
*
* ERRORS: EOK operation successful
* EBADALIGN address not 64B aligned
* ENORADDR RA in address not valid
* EINVAL CA not valid
* EWOULDBLOCK kill not available for this CCB currently, try
* again
* ENOACCESS guest cannot use dax
*/
#define HV_CCB_KILL 0x36
#ifndef __ASSEMBLY__
unsigned long sun4v_ccb_kill(unsigned long ca,
void *kill_status);
#endif
/* CCB kill status (RET1) */
#define HV_CCB_KILL_COMPLETED 0
#define HV_CCB_KILL_DEQUEUED 1
#define HV_CCB_KILL_KILLED 2
#define HV_CCB_KILL_NOTFOUND 3
/* Time of day services.
*
* The hypervisor maintains the time of day on a per-domain basis.
@@ -3355,6 +3492,7 @@ unsigned long sun4v_m7_set_perfreg(unsigned long reg_num,
#define HV_GRP_SDIO_ERR 0x0109
#define HV_GRP_REBOOT_DATA 0x0110
#define HV_GRP_ATU 0x0111
#define HV_GRP_DAX 0x0113
#define HV_GRP_M7_PERF 0x0114
#define HV_GRP_NIAG_PERF 0x0200
#define HV_GRP_FIRE_PERF 0x0201
......
/*
* Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* Oracle DAX driver API definitions
*/
#ifndef _ORADAX_H
#define _ORADAX_H
#include <linux/types.h>
#define CCB_KILL 0
#define CCB_INFO 1
#define CCB_DEQUEUE 2
struct dax_command {
__u16 command; /* CCB_KILL/INFO/DEQUEUE */
__u16 ca_offset; /* offset into mmapped completion area */
};
struct ccb_kill_result {
__u16 action; /* action taken to kill ccb */
};
struct ccb_info_result {
__u16 state; /* state of enqueued ccb */
__u16 inst_num; /* dax instance number of enqueued ccb */
__u16 q_num; /* queue number of enqueued ccb */
__u16 q_pos; /* ccb position in queue */
};
struct ccb_exec_result {
__u64 status_data; /* additional status data (e.g. bad VA) */
__u32 status; /* one of DAX_SUBMIT_* */
};
union ccb_result {
struct ccb_exec_result exec;
struct ccb_info_result info;
struct ccb_kill_result kill;
};
#define DAX_MMAP_LEN (16 * 1024)
#define DAX_MAX_CCBS 15
#define DAX_CCB_BUF_MAXLEN (DAX_MAX_CCBS * 64)
#define DAX_NAME "oradax"
/* CCB_EXEC status */
#define DAX_SUBMIT_OK 0
#define DAX_SUBMIT_ERR_RETRY 1
#define DAX_SUBMIT_ERR_WOULDBLOCK 2
#define DAX_SUBMIT_ERR_BUSY 3
#define DAX_SUBMIT_ERR_THR_INIT 4
#define DAX_SUBMIT_ERR_ARG_INVAL 5
#define DAX_SUBMIT_ERR_CCB_INVAL 6
#define DAX_SUBMIT_ERR_NO_CA_AVAIL 7
#define DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS 8
#define DAX_SUBMIT_ERR_NOMAP 9
#define DAX_SUBMIT_ERR_NOACCESS 10
#define DAX_SUBMIT_ERR_TOOMANY 11
#define DAX_SUBMIT_ERR_UNAVAIL 12
#define DAX_SUBMIT_ERR_INTERNAL 13
/* CCB_INFO states - must match HV_CCB_STATE_* definitions */
#define DAX_CCB_COMPLETED 0
#define DAX_CCB_ENQUEUED 1
#define DAX_CCB_INPROGRESS 2
#define DAX_CCB_NOTFOUND 3
/* CCB_KILL actions - must match HV_CCB_KILL_* definitions */
#define DAX_KILL_COMPLETED 0
#define DAX_KILL_DEQUEUED 1
#define DAX_KILL_KILLED 2
#define DAX_KILL_NOTFOUND 3
#endif /* _ORADAX_H */
@@ -41,6 +41,7 @@ static struct api_info api_table[] = {
{ .group = HV_GRP_SDIO_ERR, },
{ .group = HV_GRP_REBOOT_DATA, },
{ .group = HV_GRP_ATU, .flags = FLAG_PRE_API },
{ .group = HV_GRP_DAX, },
{ .group = HV_GRP_NIAG_PERF, .flags = FLAG_PRE_API },
{ .group = HV_GRP_FIRE_PERF, },
{ .group = HV_GRP_N2_CPU, },
......
@@ -871,3 +871,60 @@ ENTRY(sun4v_m7_set_perfreg)
retl
nop
ENDPROC(sun4v_m7_set_perfreg)
/* %o0: address of CCB array
* %o1: size (in bytes) of CCB array
* %o2: flags
* %o3: reserved
*
* returns:
* %o0: status
* %o1: size (in bytes) of the CCB array that was accepted
* %o2: status data
* %o3: reserved
*/
ENTRY(sun4v_ccb_submit)
mov %o5, %g1
mov HV_CCB_SUBMIT, %o5
ta HV_FAST_TRAP
stx %o1, [%o4]
retl
stx %o2, [%g1]
ENDPROC(sun4v_ccb_submit)
EXPORT_SYMBOL(sun4v_ccb_submit)
/* %o0: completion area ra for the ccb to get info
*
* returns:
* %o0: status
* %o1: CCB state
* %o2: position
* %o3: dax unit
* %o4: queue
*/
ENTRY(sun4v_ccb_info)
mov %o1, %g1
mov HV_CCB_INFO, %o5
ta HV_FAST_TRAP
sth %o1, [%g1 + CCB_INFO_OFFSET_CCB_STATE]
sth %o2, [%g1 + CCB_INFO_OFFSET_QUEUE_POS]
sth %o3, [%g1 + CCB_INFO_OFFSET_DAX_UNIT]
retl
sth %o4, [%g1 + CCB_INFO_OFFSET_QUEUE_NUM]
ENDPROC(sun4v_ccb_info)
EXPORT_SYMBOL(sun4v_ccb_info)
/* %o0: completion area ra for the ccb to kill
*
* returns:
* %o0: status
* %o1: result of the kill
*/
ENTRY(sun4v_ccb_kill)
mov %o1, %g1
mov HV_CCB_KILL, %o5
ta HV_FAST_TRAP
retl
sth %o1, [%g1]
ENDPROC(sun4v_ccb_kill)
EXPORT_SYMBOL(sun4v_ccb_kill)
@@ -9,9 +9,6 @@
* Copyright (C) 1997,1998 Jakub Jelinek (jj@sunsite.mff.cuni.cz)
*/
#ifdef CONFIG_COMPAT
#include <linux/compat.h> /* for compat_old_sigset_t */
#endif
#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/signal.h>
......
@@ -251,7 +251,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
else
return map_vdso(&vdso_image_32_builtin, &vdso_mapping32);
#else
return map_vdso(&vdso_image_64_builtin, &vdso_mapping64);
#endif
}
......
@@ -70,5 +70,13 @@ config DISPLAY7SEG
another UltraSPARC-IIi-cEngine boardset with a 7-segment display,
you should say N to this option.
config ORACLE_DAX
tristate "Oracle Data Analytics Accelerator"
default m if SPARC64
help
Driver for Oracle Data Analytics Accelerator, which is
a coprocessor that performs database operations in hardware.
It is available on M7 and M8 based systems only.
endmenu
@@ -17,3 +17,4 @@ obj-$(CONFIG_SUN_OPENPROMIO) += openprom.o
obj-$(CONFIG_TADPOLE_TS102_UCTRL) += uctrl.o
obj-$(CONFIG_SUN_JSFLASH) += jsflash.o
obj-$(CONFIG_BBC_I2C) += bbc.o
obj-$(CONFIG_ORACLE_DAX) += oradax.o
/*
* Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* Oracle Data Analytics Accelerator (DAX)
*
* DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
* (DAX2) processor chips, and has direct access to the CPU's L3
* caches as well as physical memory. It can perform several
* operations on data streams with various input and output formats.
* The driver provides a transport mechanism only and has limited
* knowledge of the various opcodes and data formats. A user space
* library provides high level services and translates these into low
* level commands which are then passed into the driver and
* subsequently the hypervisor and the coprocessor. The library is
* the recommended way for applications to use the coprocessor, and
* the driver interface is not intended for general use.
*
* See Documentation/sparc/oradax/oracle_dax.txt for more details.
*/
#include <linux/uaccess.h>
#include <linux/module.h>
#include <linux/delay.h>
#include <linux/cdev.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <asm/hypervisor.h>
#include <asm/mdesc.h>
#include <asm/oradax.h>
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Driver for Oracle Data Analytics Accelerator");
#define DAX_DBG_FLG_BASIC 0x01
#define DAX_DBG_FLG_STAT 0x02
#define DAX_DBG_FLG_INFO 0x04
#define DAX_DBG_FLG_ALL 0xff
#define dax_err(fmt, ...) pr_err("%s: " fmt "\n", __func__, ##__VA_ARGS__)
#define dax_info(fmt, ...) pr_info("%s: " fmt "\n", __func__, ##__VA_ARGS__)
#define dax_dbg(fmt, ...) do { \
if (dax_debug & DAX_DBG_FLG_BASIC)\
dax_info(fmt, ##__VA_ARGS__); \
} while (0)
#define dax_stat_dbg(fmt, ...) do { \
if (dax_debug & DAX_DBG_FLG_STAT) \
dax_info(fmt, ##__VA_ARGS__); \
} while (0)
#define dax_info_dbg(fmt, ...) do { \
if (dax_debug & DAX_DBG_FLG_INFO) \
dax_info(fmt, ##__VA_ARGS__); \
} while (0)
#define DAX1_MINOR 1
#define DAX1_MAJOR 1
#define DAX2_MINOR 0
#define DAX2_MAJOR 2
#define DAX1_STR "ORCL,sun4v-dax"
#define DAX2_STR "ORCL,sun4v-dax2"
#define DAX_CA_ELEMS (DAX_MMAP_LEN / sizeof(struct dax_cca))
#define DAX_CCB_USEC 100
#define DAX_CCB_RETRIES 10000
/* stream types */
enum {
OUT,
PRI,
SEC,
TBL,
NUM_STREAM_TYPES
};
/* completion status */
#define CCA_STAT_NOT_COMPLETED 0
#define CCA_STAT_COMPLETED 1
#define CCA_STAT_FAILED 2
#define CCA_STAT_KILLED 3
#define CCA_STAT_NOT_RUN 4
#define CCA_STAT_PIPE_OUT 5
#define CCA_STAT_PIPE_SRC 6
#define CCA_STAT_PIPE_DST 7
/* completion err */
#define CCA_ERR_SUCCESS 0x0 /* no error */
#define CCA_ERR_OVERFLOW 0x1 /* buffer overflow */
#define CCA_ERR_DECODE 0x2 /* CCB decode error */
#define CCA_ERR_PAGE_OVERFLOW 0x3 /* page overflow */
#define CCA_ERR_KILLED 0x7 /* command was killed */
#define CCA_ERR_TIMEOUT 0x8 /* Timeout */
#define CCA_ERR_ADI 0x9 /* ADI error */
#define CCA_ERR_DATA_FMT 0xA /* data format error */
#define CCA_ERR_OTHER_NO_RETRY 0xE /* Other error, do not retry */
#define CCA_ERR_OTHER_RETRY 0xF /* Other error, retry */
#define CCA_ERR_PARTIAL_SYMBOL 0x80 /* QP partial symbol warning */
/* CCB address types */
#define DAX_ADDR_TYPE_NONE 0
#define DAX_ADDR_TYPE_VA_ALT 1 /* secondary context */
#define DAX_ADDR_TYPE_RA 2 /* real address */
#define DAX_ADDR_TYPE_VA 3 /* virtual address */
/* dax_header_t opcode */
#define DAX_OP_SYNC_NOP 0x0
#define DAX_OP_EXTRACT 0x1
#define DAX_OP_SCAN_VALUE 0x2
#define DAX_OP_SCAN_RANGE 0x3
#define DAX_OP_TRANSLATE 0x4
#define DAX_OP_SELECT 0x5
#define DAX_OP_INVERT 0x10 /* OR with translate, scan opcodes */
struct dax_header {
u32 ccb_version:4; /* 31:28 CCB Version */
/* 27:24 Sync Flags */
u32 pipe:1; /* Pipeline */
u32 longccb:1; /* Longccb. Set for scan with lu2, lu3, lu4. */
u32 cond:1; /* Conditional */
u32 serial:1; /* Serial */
u32 opcode:8; /* 23:16 Opcode */
/* 15:0 Address Type. */
u32 reserved:3; /* 15:13 reserved */
u32 table_addr_type:2; /* 12:11 Huffman Table Address Type */
u32 out_addr_type:3; /* 10:8 Destination Address Type */
u32 sec_addr_type:3; /* 7:5 Secondary Source Address Type */
u32 pri_addr_type:3; /* 4:2 Primary Source Address Type */
u32 cca_addr_type:2; /* 1:0 Completion Address Type */
};
struct dax_control {
u32 pri_fmt:4; /* 31:28 Primary Input Format */
u32 pri_elem_size:5; /* 27:23 Primary Input Element Size(less1) */
u32 pri_offset:3; /* 22:20 Primary Input Starting Offset */
u32 sec_encoding:1; /* 19 Secondary Input Encoding */
/* (must be 0 for Select) */
u32 sec_offset:3; /* 18:16 Secondary Input Starting Offset */
u32 sec_elem_size:2; /* 15:14 Secondary Input Element Size */
/* (must be 0 for Select) */
u32 out_fmt:2; /* 13:12 Output Format */
u32 out_elem_size:2; /* 11:10 Output Element Size */
u32 misc:10; /* 9:0 Opcode specific info */
};
struct dax_data_access {
u64 flow_ctrl:2; /* 63:62 Flow Control Type */
u64 pipe_target:2; /* 61:60 Pipeline Target */
u64 out_buf_size:20; /* 59:40 Output Buffer Size */
/* (cachelines less 1) */
u64 unused1:8; /* 39:32 Reserved, Set to 0 */
u64 out_alloc:5; /* 31:27 Output Allocation */
u64 unused2:1; /* 26 Reserved */
u64 pri_len_fmt:2; /* 25:24 Input Length Format */
u64 pri_len:24; /* 23:0 Input Element/Byte/Bit Count */
/* (less 1) */
};
struct dax_ccb {
struct dax_header hdr; /* CCB Header */
struct dax_control ctrl;/* Control Word */
void *ca; /* Completion Address */
void *pri; /* Primary Input Address */
struct dax_data_access dac; /* Data Access Control */
void *sec; /* Secondary Input Address */
u64 dword5; /* depends on opcode */
void *out; /* Output Address */
void *tbl; /* Table Address or bitmap */
};
struct dax_cca {
u8 status; /* user may mwait on this address */
u8 err; /* user visible error notification */
u8 rsvd[2]; /* reserved */
u32 n_remaining; /* for QP partial symbol warning */
u32 output_sz; /* output in bytes */
u32 rsvd2; /* reserved */
u64 run_cycles; /* run time in OCND2 cycles */
u64 run_stats; /* nothing reported in version 1.0 */
u32 n_processed; /* number input elements */
u32 rsvd3[5]; /* reserved */
u64 retval; /* command return value */
u64 rsvd4[8]; /* reserved */
};
/* per thread CCB context */
struct dax_ctx {
struct dax_ccb *ccb_buf;
u64 ccb_buf_ra; /* cached RA of ccb_buf */
struct dax_cca *ca_buf;
u64 ca_buf_ra; /* cached RA of ca_buf */
struct page *pages[DAX_CA_ELEMS][NUM_STREAM_TYPES];
/* array of locked pages */
struct task_struct *owner; /* thread that owns ctx */
struct task_struct *client; /* requesting thread */
union ccb_result result;
u32 ccb_count;
u32 fail_count;
};
/* driver public entry points */
static int dax_open(struct inode *inode, struct file *file);
static ssize_t dax_read(struct file *filp, char __user *buf,
size_t count, loff_t *ppos);
static ssize_t dax_write(struct file *filp, const char __user *buf,
size_t count, loff_t *ppos);
static int dax_devmap(struct file *f, struct vm_area_struct *vma);
static int dax_close(struct inode *i, struct file *f);
static const struct file_operations dax_fops = {
.owner = THIS_MODULE,
.open = dax_open,
.read = dax_read,
.write = dax_write,
.mmap = dax_devmap,
.release = dax_close,
};
static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
size_t count, loff_t *ppos);
static int dax_ccb_info(u64 ca, struct ccb_info_result *info);
static int dax_ccb_kill(u64 ca, u16 *kill_res);
static struct cdev c_dev;
static struct class *cl;
static dev_t first;
static int max_ccb_version;
static int dax_debug;
module_param(dax_debug, int, 0644);
MODULE_PARM_DESC(dax_debug, "Debug flags");
static int __init dax_attach(void)
{
unsigned long dummy, hv_rv, major, minor, minor_requested, max_ccbs;
struct mdesc_handle *hp = mdesc_grab();
char *prop, *dax_name;
bool found = false;
int len, ret = 0;
u64 pn;
if (hp == NULL) {
dax_err("Unable to grab mdesc");
return -ENODEV;
}
mdesc_for_each_node_by_name(hp, pn, "virtual-device") {
prop = (char *)mdesc_get_property(hp, pn, "name", &len);
if (prop == NULL)
continue;
if (strncmp(prop, "dax", strlen("dax")))
continue;
dax_dbg("Found node 0x%llx = %s", pn, prop);
prop = (char *)mdesc_get_property(hp, pn, "compatible", &len);
if (prop == NULL)
continue;
dax_dbg("Found node 0x%llx = %s", pn, prop);
found = true;
break;
}
if (!found) {
dax_err("No DAX device found");
ret = -ENODEV;
goto done;
}
if (strncmp(prop, DAX2_STR, strlen(DAX2_STR)) == 0) {
dax_name = DAX_NAME "2";
major = DAX2_MAJOR;
minor_requested = DAX2_MINOR;
max_ccb_version = 1;
dax_dbg("MD indicates DAX2 coprocessor");
} else if (strncmp(prop, DAX1_STR, strlen(DAX1_STR)) == 0) {
dax_name = DAX_NAME "1";
major = DAX1_MAJOR;
minor_requested = DAX1_MINOR;
max_ccb_version = 0;
dax_dbg("MD indicates DAX1 coprocessor");
} else {
dax_err("Unknown dax type: %s", prop);
ret = -ENODEV;
goto done;
}
minor = minor_requested;
dax_dbg("Registering DAX HV api with major %ld minor %ld", major,
minor);
if (sun4v_hvapi_register(HV_GRP_DAX, major, &minor)) {
dax_err("hvapi_register failed");
ret = -ENODEV;
goto done;
} else {
dax_dbg("Max minor supported by HV = %ld (major %ld)", minor,
major);
minor = min(minor, minor_requested);
dax_dbg("registered DAX major %ld minor %ld", major, minor);
}
/* submit a zero length ccb array to query coprocessor queue size */
hv_rv = sun4v_ccb_submit(0, 0, HV_CCB_QUERY_CMD, 0, &max_ccbs, &dummy);
if (hv_rv != 0) {
dax_err("get_hwqueue_size failed with status=%ld and max_ccbs=%ld",
hv_rv, max_ccbs);
ret = -ENODEV;
goto done;
}
if (max_ccbs != DAX_MAX_CCBS) {
dax_err("HV reports unsupported max_ccbs=%ld", max_ccbs);
ret = -ENODEV;
goto done;
}
if (alloc_chrdev_region(&first, 0, 1, DAX_NAME) < 0) {
dax_err("alloc_chrdev_region failed");
ret = -ENXIO;
goto done;
}
cl = class_create(THIS_MODULE, DAX_NAME);
if (IS_ERR(cl)) {
dax_err("class_create failed");
ret = PTR_ERR(cl);
goto class_error;
}
if (IS_ERR(device_create(cl, NULL, first, NULL, dax_name))) {
dax_err("device_create failed");
ret = -ENXIO;
goto device_error;
}
cdev_init(&c_dev, &dax_fops);
if (cdev_add(&c_dev, first, 1) < 0) {
dax_err("cdev_add failed");
ret = -ENXIO;
goto cdev_error;
}
pr_info("Attached DAX module\n");
goto done;
cdev_error:
device_destroy(cl, first);
device_error:
class_destroy(cl);
class_error:
unregister_chrdev_region(first, 1);
done:
mdesc_release(hp);
return ret;
}
module_init(dax_attach);
static void __exit dax_detach(void)
{
pr_info("Cleaning up DAX module\n");
cdev_del(&c_dev);
device_destroy(cl, first);
class_destroy(cl);
unregister_chrdev_region(first, 1);
}
module_exit(dax_detach);
/* map completion area */
static int dax_devmap(struct file *f, struct vm_area_struct *vma)
{
struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
size_t len = vma->vm_end - vma->vm_start;
dax_dbg("len=0x%lx, flags=0x%lx", len, vma->vm_flags);
if (ctx->owner != current) {
dax_dbg("devmap called from wrong thread");
return -EINVAL;
}
if (len != DAX_MMAP_LEN) {
dax_dbg("len(%lu) != DAX_MMAP_LEN(%d)", len, DAX_MMAP_LEN);
return -EINVAL;
}
/* completion area is mapped read-only for user */
if (vma->vm_flags & VM_WRITE)
return -EPERM;
vma->vm_flags &= ~VM_MAYWRITE;
if (remap_pfn_range(vma, vma->vm_start, ctx->ca_buf_ra >> PAGE_SHIFT,
len, vma->vm_page_prot))
return -EAGAIN;
dax_dbg("mmapped completion area at uva 0x%lx", vma->vm_start);
return 0;
}
/* Unlock user pages. Called during dequeue or device close */
static void dax_unlock_pages(struct dax_ctx *ctx, int ccb_index, int nelem)
{
int i, j;
for (i = ccb_index; i < ccb_index + nelem; i++) {
for (j = 0; j < NUM_STREAM_TYPES; j++) {
struct page *p = ctx->pages[i][j];
if (p) {
dax_dbg("freeing page %p", p);
if (j == OUT)
set_page_dirty(p);
put_page(p);
ctx->pages[i][j] = NULL;
}
}
}
}
static int dax_lock_page(void *va, struct page **p)
{
int ret;
dax_dbg("uva %p", va);
ret = get_user_pages_fast((unsigned long)va, 1, 1, p);
if (ret == 1) {
dax_dbg("locked page %p, for VA %p", *p, va);
return 0;
}
dax_dbg("get_user_pages failed, va=%p, ret=%d", va, ret);
return -1;
}
static int dax_lock_pages(struct dax_ctx *ctx, int idx,
int nelem, u64 *err_va)
{
int i;
for (i = 0; i < nelem; i++) {
struct dax_ccb *ccbp = &ctx->ccb_buf[i];
/*
* For each address in the CCB whose type is virtual,
* lock the page and change the type to virtual alternate
* context. On error, return the offending address in
* err_va.
*/
if (ccbp->hdr.out_addr_type == DAX_ADDR_TYPE_VA) {
dax_dbg("output");
if (dax_lock_page(ccbp->out,
&ctx->pages[i + idx][OUT]) != 0) {
*err_va = (u64)ccbp->out;
goto error;
}
ccbp->hdr.out_addr_type = DAX_ADDR_TYPE_VA_ALT;
}
if (ccbp->hdr.pri_addr_type == DAX_ADDR_TYPE_VA) {
dax_dbg("input");
if (dax_lock_page(ccbp->pri,
&ctx->pages[i + idx][PRI]) != 0) {
*err_va = (u64)ccbp->pri;
goto error;
}
ccbp->hdr.pri_addr_type = DAX_ADDR_TYPE_VA_ALT;
}
if (ccbp->hdr.sec_addr_type == DAX_ADDR_TYPE_VA) {
dax_dbg("sec input");
if (dax_lock_page(ccbp->sec,
&ctx->pages[i + idx][SEC]) != 0) {
*err_va = (u64)ccbp->sec;
goto error;
}
ccbp->hdr.sec_addr_type = DAX_ADDR_TYPE_VA_ALT;
}
if (ccbp->hdr.table_addr_type == DAX_ADDR_TYPE_VA) {
dax_dbg("tbl");
if (dax_lock_page(ccbp->tbl,
&ctx->pages[i + idx][TBL]) != 0) {
*err_va = (u64)ccbp->tbl;
goto error;
}
ccbp->hdr.table_addr_type = DAX_ADDR_TYPE_VA_ALT;
}
/* skip over 2nd 64 bytes of long CCB */
if (ccbp->hdr.longccb)
i++;
}
return DAX_SUBMIT_OK;
error:
dax_unlock_pages(ctx, idx, nelem);
return DAX_SUBMIT_ERR_NOACCESS;
}
static void dax_ccb_wait(struct dax_ctx *ctx, int idx)
{
int ret, nretries;
u16 kill_res;
dax_dbg("idx=%d", idx);
for (nretries = 0; nretries < DAX_CCB_RETRIES; nretries++) {
if (ctx->ca_buf[idx].status == CCA_STAT_NOT_COMPLETED)
udelay(DAX_CCB_USEC);
else
return;
}
dax_dbg("ctx (%p): CCB[%d] timed out, wait usec=%d, retries=%d. Killing ccb",
(void *)ctx, idx, DAX_CCB_USEC, DAX_CCB_RETRIES);
ret = dax_ccb_kill(ctx->ca_buf_ra + idx * sizeof(struct dax_cca),
&kill_res);
dax_dbg("Kill CCB[%d] %s", idx, ret ? "failed" : "succeeded");
}
static int dax_close(struct inode *ino, struct file *f)
{
struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
int i;
f->private_data = NULL;
for (i = 0; i < DAX_CA_ELEMS; i++) {
if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
dax_dbg("CCB[%d] not completed", i);
dax_ccb_wait(ctx, i);
}
dax_unlock_pages(ctx, i, 1);
}
kfree(ctx->ccb_buf);
kfree(ctx->ca_buf);
dax_stat_dbg("CCBs: %d good, %d bad", ctx->ccb_count, ctx->fail_count);
kfree(ctx);
return 0;
}
static ssize_t dax_read(struct file *f, char __user *buf,
size_t count, loff_t *ppos)
{
struct dax_ctx *ctx = f->private_data;
if (ctx->client != current)
return -EUSERS;
ctx->client = NULL;
if (count != sizeof(union ccb_result))
return -EINVAL;
if (copy_to_user(buf, &ctx->result, sizeof(union ccb_result)))
return -EFAULT;
return count;
}
static ssize_t dax_write(struct file *f, const char __user *buf,
size_t count, loff_t *ppos)
{
struct dax_ctx *ctx = f->private_data;
struct dax_command hdr;
unsigned long ca;
int i, idx, ret;
if (ctx->client != NULL)
return -EINVAL;
if (count == 0 || count > DAX_MAX_CCBS * sizeof(struct dax_ccb))
return -EINVAL;
if (count % sizeof(struct dax_ccb) == 0)
return dax_ccb_exec(ctx, buf, count, ppos); /* CCB EXEC */
if (count != sizeof(struct dax_command))
return -EINVAL;
/* immediate command */
if (ctx->owner != current)
return -EUSERS;
if (copy_from_user(&hdr, buf, sizeof(hdr)))
return -EFAULT;
ca = ctx->ca_buf_ra + hdr.ca_offset;
switch (hdr.command) {
case CCB_KILL:
if (hdr.ca_offset >= DAX_MMAP_LEN) {
dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
hdr.ca_offset, DAX_MMAP_LEN);
return -EINVAL;
}
ret = dax_ccb_kill(ca, &ctx->result.kill.action);
if (ret != 0) {
dax_dbg("dax_ccb_kill failed (ret=%d)", ret);
return ret;
}
dax_info_dbg("killed (ca_offset %d)", hdr.ca_offset);
idx = hdr.ca_offset / sizeof(struct dax_cca);
ctx->ca_buf[idx].status = CCA_STAT_KILLED;
ctx->ca_buf[idx].err = CCA_ERR_KILLED;
ctx->client = current;
return count;
case CCB_INFO:
if (hdr.ca_offset >= DAX_MMAP_LEN) {
dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
hdr.ca_offset, DAX_MMAP_LEN);
return -EINVAL;
}
ret = dax_ccb_info(ca, &ctx->result.info);
if (ret != 0) {
dax_dbg("dax_ccb_info failed (ret=%d)", ret);
return ret;
}
dax_info_dbg("info succeeded on ca_offset %d", hdr.ca_offset);
ctx->client = current;
return count;
case CCB_DEQUEUE:
for (i = 0; i < DAX_CA_ELEMS; i++) {
if (ctx->ca_buf[i].status !=
CCA_STAT_NOT_COMPLETED)
dax_unlock_pages(ctx, i, 1);
}
return count;
default:
return -EINVAL;
}
}
static int dax_open(struct inode *inode, struct file *f)
{
struct dax_ctx *ctx = NULL;
int i;
ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
if (ctx == NULL)
goto done;
ctx->ccb_buf = kcalloc(DAX_MAX_CCBS, sizeof(struct dax_ccb),
GFP_KERNEL);
if (ctx->ccb_buf == NULL)
goto done;
ctx->ccb_buf_ra = virt_to_phys(ctx->ccb_buf);
dax_dbg("ctx->ccb_buf=0x%p, ccb_buf_ra=0x%llx",
(void *)ctx->ccb_buf, ctx->ccb_buf_ra);
/* allocate CCB completion area buffer */
ctx->ca_buf = kzalloc(DAX_MMAP_LEN, GFP_KERNEL);
if (ctx->ca_buf == NULL)
goto alloc_error;
for (i = 0; i < DAX_CA_ELEMS; i++)
ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
ctx->ca_buf_ra = virt_to_phys(ctx->ca_buf);
dax_dbg("ctx=0x%p, ctx->ca_buf=0x%p, ca_buf_ra=0x%llx",
(void *)ctx, (void *)ctx->ca_buf, ctx->ca_buf_ra);
ctx->owner = current;
f->private_data = ctx;
return 0;
alloc_error:
kfree(ctx->ccb_buf);
done:
if (ctx != NULL)
kfree(ctx);
return -ENOMEM;
}
static char *dax_hv_errno(unsigned long hv_ret, int *ret)
{
switch (hv_ret) {
case HV_EBADALIGN:
*ret = -EFAULT;
return "HV_EBADALIGN";
case HV_ENORADDR:
*ret = -EFAULT;
return "HV_ENORADDR";
case HV_EINVAL:
*ret = -EINVAL;
return "HV_EINVAL";
case HV_EWOULDBLOCK:
*ret = -EAGAIN;
return "HV_EWOULDBLOCK";
case HV_ENOACCESS:
*ret = -EPERM;
return "HV_ENOACCESS";
default:
break;
}
*ret = -EIO;
return "UNKNOWN";
}
static int dax_ccb_kill(u64 ca, u16 *kill_res)
{
unsigned long hv_ret;
int count, ret = 0;
char *err_str;
for (count = 0; count < DAX_CCB_RETRIES; count++) {
dax_dbg("attempting kill on ca_ra 0x%llx", ca);
hv_ret = sun4v_ccb_kill(ca, kill_res);
if (hv_ret == HV_EOK) {
ret = 0; /* clear any -EAGAIN left over from an earlier retry */
dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca,
*kill_res);
} else {
err_str = dax_hv_errno(hv_ret, &ret);
dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
}
if (ret != -EAGAIN)
return ret;
dax_info_dbg("ccb_kill count = %d", count);
udelay(DAX_CCB_USEC);
}
return -EAGAIN;
}
static int dax_ccb_info(u64 ca, struct ccb_info_result *info)
{
unsigned long hv_ret;
char *err_str;
int ret = 0;
dax_dbg("attempting info on ca_ra 0x%llx", ca);
hv_ret = sun4v_ccb_info(ca, info);
if (hv_ret == HV_EOK) {
dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca, info->state);
if (info->state == DAX_CCB_ENQUEUED) {
dax_info_dbg("dax_unit %d, queue_num %d, queue_pos %d",
info->inst_num, info->q_num, info->q_pos);
}
} else {
err_str = dax_hv_errno(hv_ret, &ret);
dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
}
return ret;
}
static void dax_prt_ccbs(struct dax_ccb *ccb, int nelem)
{
int i, j;
u64 *ccbp;
dax_dbg("ccb buffer:");
for (i = 0; i < nelem; i++) {
ccbp = (u64 *)&ccb[i];
dax_dbg(" %sccb[%d]", ccb[i].hdr.longccb ? "long " : "", i);
for (j = 0; j < 8; j++)
dax_dbg("\tccb[%d].dwords[%d]=0x%llx",
i, j, *(ccbp + j));
}
}
/*
* Validates user CCB content. Also sets completion address and address types
* for all addresses contained in CCB.
*/
static int dax_preprocess_usr_ccbs(struct dax_ctx *ctx, int idx, int nelem)
{
int i;
/*
* The user is not allowed to specify real address types in
* the CCB header. This must be enforced by the kernel before
* submitting the CCBs to HV. The only values accepted from the
* user for the output, primary, secondary, and table address
* types are VA or NONE.
*/
for (i = 0; i < nelem; i++) {
struct dax_ccb *ccbp = &ctx->ccb_buf[i];
unsigned long ca_offset;
if (ccbp->hdr.ccb_version > max_ccb_version)
return DAX_SUBMIT_ERR_CCB_INVAL;
switch (ccbp->hdr.opcode) {
case DAX_OP_SYNC_NOP:
case DAX_OP_EXTRACT:
case DAX_OP_SCAN_VALUE:
case DAX_OP_SCAN_RANGE:
case DAX_OP_TRANSLATE:
case DAX_OP_SCAN_VALUE | DAX_OP_INVERT:
case DAX_OP_SCAN_RANGE | DAX_OP_INVERT:
case DAX_OP_TRANSLATE | DAX_OP_INVERT:
case DAX_OP_SELECT:
break;
default:
return DAX_SUBMIT_ERR_CCB_INVAL;
}
if (ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_VA &&
ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_NONE) {
dax_dbg("invalid out_addr_type in user CCB[%d]", i);
return DAX_SUBMIT_ERR_CCB_INVAL;
}
if (ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_VA &&
ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_NONE) {
dax_dbg("invalid pri_addr_type in user CCB[%d]", i);
return DAX_SUBMIT_ERR_CCB_INVAL;
}
if (ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_VA &&
ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_NONE) {
dax_dbg("invalid sec_addr_type in user CCB[%d]", i);
return DAX_SUBMIT_ERR_CCB_INVAL;
}
if (ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_VA &&
ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_NONE) {
dax_dbg("invalid table_addr_type in user CCB[%d]", i);
return DAX_SUBMIT_ERR_CCB_INVAL;
}
/* set completion (real) address and address type */
ccbp->hdr.cca_addr_type = DAX_ADDR_TYPE_RA;
ca_offset = (idx + i) * sizeof(struct dax_cca);
ccbp->ca = (void *)ctx->ca_buf_ra + ca_offset;
memset(&ctx->ca_buf[idx + i], 0, sizeof(struct dax_cca));
dax_dbg("ccb[%d]=%p, ca_offset=0x%lx, compl RA=0x%llx",
i, ccbp, ca_offset, ctx->ca_buf_ra + ca_offset);
/* skip over 2nd 64 bytes of long CCB */
if (ccbp->hdr.longccb)
i++;
}
return DAX_SUBMIT_OK;
}
static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
size_t count, loff_t *ppos)
{
unsigned long accepted_len, hv_rv;
int i, idx, nccbs, naccepted;
ctx->client = current;
idx = *ppos;
nccbs = count / sizeof(struct dax_ccb);
if (ctx->owner != current) {
dax_dbg("wrong thread");
ctx->result.exec.status = DAX_SUBMIT_ERR_THR_INIT;
return 0;
}
dax_dbg("args: ccb_buf_len=%ld, idx=%d", count, idx);
/* for given index and length, verify ca_buf range exists */
if (idx < 0 || idx + nccbs > DAX_CA_ELEMS) {
ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
return 0;
}
/*
* Copy CCBs into kernel buffer to prevent modification by the
* user in between validation and submission.
*/
if (copy_from_user(ctx->ccb_buf, buf, count)) {
dax_dbg("copyin of user CCB buffer failed");
ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS;
return 0;
}
/* check to see if ca_buf[idx] .. ca_buf[idx + nccbs - 1] are available */
for (i = idx; i < idx + nccbs; i++) {
if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
dax_dbg("CA range not available, dequeue needed");
ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
return 0;
}
}
dax_unlock_pages(ctx, idx, nccbs);
ctx->result.exec.status = dax_preprocess_usr_ccbs(ctx, idx, nccbs);
if (ctx->result.exec.status != DAX_SUBMIT_OK)
return 0;
ctx->result.exec.status = dax_lock_pages(ctx, idx, nccbs,
&ctx->result.exec.status_data);
if (ctx->result.exec.status != DAX_SUBMIT_OK)
return 0;
if (dax_debug & DAX_DBG_FLG_BASIC)
dax_prt_ccbs(ctx->ccb_buf, nccbs);
hv_rv = sun4v_ccb_submit(ctx->ccb_buf_ra, count,
HV_CCB_QUERY_CMD | HV_CCB_VA_SECONDARY, 0,
&accepted_len, &ctx->result.exec.status_data);
switch (hv_rv) {
case HV_EOK:
/*
* Hcall succeeded with no errors but the accepted
* length may be less than the requested length. The
* only way the driver can resubmit the remainder is
* to wait for completion of the submitted CCBs since
* there is no way to guarantee the ordering semantics
* required by the client applications. Therefore we
* let the user library deal with resubmissions.
*/
ctx->result.exec.status = DAX_SUBMIT_OK;
break;
case HV_EWOULDBLOCK:
/*
* This is a transient HV API error. The user library
* can retry.
*/
dax_dbg("hcall returned HV_EWOULDBLOCK");
ctx->result.exec.status = DAX_SUBMIT_ERR_WOULDBLOCK;
break;
case HV_ENOMAP:
/*
* HV was unable to translate a VA. The VA it could
* not translate is returned in the status_data param.
*/
dax_dbg("hcall returned HV_ENOMAP");
ctx->result.exec.status = DAX_SUBMIT_ERR_NOMAP;
break;
case HV_EINVAL:
/*
* This is the result of an invalid user CCB as HV is
* validating some of the user CCB fields. Pass this
* error back to the user. There is no supporting info
* to isolate the invalid field.
*/
dax_dbg("hcall returned HV_EINVAL");
ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_INVAL;
break;
case HV_ENOACCESS:
/*
* HV found a VA that did not have the appropriate
* permissions (such as the w bit). The VA in question
* is returned in status_data param.
*/
dax_dbg("hcall returned HV_ENOACCESS");
ctx->result.exec.status = DAX_SUBMIT_ERR_NOACCESS;
break;
case HV_EUNAVAILABLE:
/*
* The requested CCB operation could not be performed
* at this time. Return the specific unavailable code
* in the status_data field.
*/
dax_dbg("hcall returned HV_EUNAVAILABLE");
ctx->result.exec.status = DAX_SUBMIT_ERR_UNAVAIL;
break;
default:
ctx->result.exec.status = DAX_SUBMIT_ERR_INTERNAL;
dax_dbg("unknown hcall return value (%ld)", hv_rv);
break;
}
/* unlock pages associated with the unaccepted CCBs */
naccepted = accepted_len / sizeof(struct dax_ccb);
dax_unlock_pages(ctx, idx + naccepted, nccbs - naccepted);
/* mark unaccepted CCBs as completed so their completion areas can be reused */
for (i = idx + naccepted; i < idx + nccbs; i++)
ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
ctx->ccb_count += naccepted;
ctx->fail_count += nccbs - naccepted;
dax_dbg("hcall rv=%ld, accepted_len=%ld, status_data=0x%llx, ret status=%d",
hv_rv, accepted_len, ctx->result.exec.status_data,
ctx->result.exec.status);
if (count == accepted_len)
ctx->client = NULL; /* no read needed to complete protocol */
return accepted_len;
}